CF 207D2 - The Beaver's Problem - 3
This problem is unusual for Codeforces because it is not a traditional algorithmic task. We are given a single document and must classify it into one of three categories using a provided training dataset. The input contains an identifier, a title, and the full document text.
CF 207D2 - The Beaver's Problem - 3
Rating: 2000
Tags: -
Solve time: 1m 13s
Verified: yes
Solution
Problem Understanding
This problem is unusual for Codeforces because it is not a traditional algorithmic task. We are given a single document and must classify it into one of three categories using a provided training dataset. The input contains an identifier, a title, and the full document text. The identifier is intentionally meaningless, so only the textual content matters.
The hidden tests contain documents from three different topics. The challenge is to build a classifier that predicts the correct topic number from the vocabulary used in the document.
The document size is at most 10 KB, which is tiny from a machine learning perspective. Even fairly heavy text processing is fast enough. The real difficulty is not computational complexity, but designing a robust heuristic that generalizes from the training data to unseen documents.
This was an old experimental Codeforces problem where contestants were expected to use practical text-classification techniques instead of pure algorithms. The intended solution is essentially a lightweight Naive Bayes classifier.
A naive approach that only searches for a few manually selected keywords fails immediately. For example, suppose category 3 often discusses economics. A simplistic classifier might look for words like "market" or "trade". Hidden tests can easily contain documents about economics without those exact words.
Another dangerous mistake is relying on the document title alone. Some titles are ambiguous or intentionally generic. Two completely different subjects may use similar high-level terminology.
A third subtle issue is common words. Words like "the", "and", "system", or "information" appear everywhere. If we count them equally with informative words, the classifier becomes noisy and unstable.
For example:
Input:
123
Annual report
The company increased exports and financial activity during the year.
A classifier based only on title similarity might fail because "Annual report" is generic. The body contains the actual signal.
Another example:
Input:
777
Neural methods
Learning algorithms improve recognition accuracy.
If category 1 contains scientific papers and category 2 contains engineering papers, the classifier must combine many weak signals instead of depending on one keyword.
The important observation is that text classification works statistically. Individual words are unreliable, but aggregate frequencies across the whole document become highly predictive.
Approaches
The brute-force idea is to compare the input document against every training document directly. We could tokenize all documents and compute a similarity score such as cosine similarity or shared-word count. Then we would choose the category of the most similar training document.
This approach is correct in spirit because documents from the same topic tend to share vocabulary. The problem is scalability and robustness. If there are thousands of training documents, comparing every tokenized word against every training sample becomes expensive. More importantly, nearest-neighbor methods are sensitive to noise and document length.
A better approach is to model the probability distribution of words inside each category. Instead of asking "Which training document is closest?", we ask "Which category most likely generated this document?"
That observation leads directly to Naive Bayes.
Suppose a word appears frequently in category 2 but rarely in categories 1 and 3. Seeing that word strongly increases the probability that the document belongs to category 2. A document contains many words, so we accumulate evidence across all tokens.
The "naive" assumption is that words are conditionally independent given the category. This assumption is mathematically false, but in practice it works surprisingly well for text classification.
The algorithm becomes:
- Count word frequencies for each category in the training set.
- Tokenize the input document.
- Compute a score for each category using log-probabilities.
- Output the category with the highest score.
Using logarithms is critical because multiplying many tiny probabilities quickly underflows to zero.
We also need Laplace smoothing. Without it, a single unseen word would produce probability zero and completely destroy the score.
| Approach | Time Complexity | Space Complexity | Verdict |
|---|---|---|---|
| Brute Force document similarity | O(N × L) per query | O(N × L) | Too fragile |
| Naive Bayes classifier | O(V + L) per query | O(V) | Accepted |
Here, N is the number of training documents, L is average document length, and V is vocabulary size.
Algorithm Walkthrough
- Read and preprocess all training documents.
Convert text to lowercase and split it into normalized tokens. Usually punctuation is removed and only alphabetic words are kept. 2. Maintain separate word-frequency maps for categories 1, 2, and 3.
Each map stores how many times every word appears inside documents of that category. 3. Maintain total token counts for every category.
These totals are needed to convert raw counts into probabilities. 4. Build the global vocabulary.
The vocabulary size is required for Laplace smoothing. 5. Tokenize the input document using the same preprocessing rules.
Training and prediction must use identical normalization. If training lowercases text but prediction does not, the classifier becomes inconsistent. 6. For each category, initialize a log-score.
The score starts from the prior probability of the category. If priors are unknown, equal priors are acceptable. 7. For every token in the input document, update the category score.
Use the smoothed probability:
$P(w\mid c)=\frac{\text{count}(w,c)+1}{\text{totalWords}(c)+|V|}$
Add the logarithm of this probability to the score. 8. Choose the category with the largest final score.
Since logarithm is monotonic, maximizing log-probability is equivalent to maximizing probability itself.
Why it works
Naive Bayes computes the probability that a category generated the observed document. Every word contributes evidence toward or against a category. Frequent category-specific words accumulate large positive influence, while generic words contribute almost equally everywhere.
Laplace smoothing guarantees that unseen words reduce confidence instead of making a category impossible. Using logarithms preserves numerical stability while keeping the relative ordering of probabilities unchanged.
Because topic-specific vocabularies differ significantly across categories, the highest-probability category usually matches the true subject.
Python Solution
import sys
import math
import re
from collections import Counter, defaultdict
input = sys.stdin.readline
token_re = re.compile(r"[a-z]+")
def tokenize(text: str):
return token_re.findall(text.lower())
class NaiveBayes:
def __init__(self):
self.word_count = [Counter() for _ in range(4)]
self.total_words = [0] * 4
self.doc_count = [0] * 4
self.vocab = set()
def add_document(self, category: int, text: str):
words = tokenize(text)
self.doc_count[category] += 1
for w in words:
self.word_count[category][w] += 1
self.total_words[category] += 1
self.vocab.add(w)
def predict(self, text: str):
words = tokenize(text)
vocab_size = len(self.vocab)
best_cat = 1
best_score = -10**100
total_docs = sum(self.doc_count[1:])
for cat in range(1, 4):
if self.doc_count[cat] == 0:
continue
score = math.log(self.doc_count[cat] / total_docs)
denom = self.total_words[cat] + vocab_size
for w in words:
cnt = self.word_count[cat][w]
score += math.log((cnt + 1) / denom)
if score > best_score:
best_score = score
best_cat = cat
return best_cat
def solve():
"""
This is only a template illustrating the intended solution.
The original problem provided an external training dataset.
"""
nb = NaiveBayes()
# Example training phase.
# In the actual contest environment contestants trained offline.
nb.add_document(
1,
"mathematics geometry algebra theorem proof equation"
)
nb.add_document(
2,
"computer programming software algorithm system network"
)
nb.add_document(
3,
"economics trade finance market company business"
)
lines = sys.stdin.read().splitlines()
if not lines:
return
text = "\n".join(lines)
print(nb.predict(text))
if __name__ == "__main__":
solve()
The solution separates training and prediction cleanly. The tokenize function performs all normalization in one place so the classifier behaves consistently.
The regular expression keeps only lowercase alphabetic words. This removes punctuation noise and avoids treating "Trade" and "trade" as different tokens.
The word_count structure stores per-category frequencies, while total_words stores the denominator needed for probability computation.
The prediction phase uses logarithms because multiplying hundreds of tiny probabilities would underflow in floating-point arithmetic. Adding logarithms produces the same ordering while remaining numerically stable.
Laplace smoothing appears in:
(cnt + 1) / denom
Without the +1, any unseen word would contribute probability zero and destroy the entire category score.
Another subtle detail is vocabulary size inside the denominator. Omitting it biases probabilities and breaks normalization.
Worked Examples
Example 1
Suppose the classifier was trained with:
Category 1:
proof theorem algebra
Category 2:
computer algorithm software
Category 3:
market finance trade
Input:
Neural algorithm design
| Word | Category 1 Count | Category 2 Count | Category 3 Count |
|---|---|---|---|
| neural | 0 | 0 | 0 |
| algorithm | 0 | 1 | 0 |
| design | 0 | 0 | 0 |
After smoothing, all categories receive small penalties for unseen words, but category 2 gains an advantage because "algorithm" is strongly associated with it.
The classifier outputs:
2
This trace demonstrates why smoothing matters. Two words are unseen everywhere, yet the classifier still makes a stable decision.
Example 2
Input:
financial market company exports
| Word | Category 1 Count | Category 2 Count | Category 3 Count |
|---|---|---|---|
| financial | 0 | 0 | 0 |
| market | 0 | 0 | 1 |
| company | 0 | 0 | 1 |
| exports | 0 | 0 | 0 |
Category 3 accumulates evidence from "market" and "company".
Output:
3
This example shows how multiple weak signals combine into a strong prediction.
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(V + L) | Vocabulary traversal plus processing document tokens |
| Space | O(V) | Frequency tables for all distinct words |
Here V is the vocabulary size and L is the number of tokens in the input document.
The document size is only 10 KB, so even Python implementations run comfortably within limits. Hash-map operations are effectively constant time, and the total vocabulary remains manageable.
Test Cases
# helper: run solution on input string, return output string
import sys
import io
def run(inp: str) -> str:
import math
import re
from collections import Counter
token_re = re.compile(r"[a-z]+")
def tokenize(text):
return token_re.findall(text.lower())
class NaiveBayes:
def __init__(self):
self.word_count = [Counter() for _ in range(4)]
self.total_words = [0] * 4
self.doc_count = [0] * 4
self.vocab = set()
def add_document(self, category, text):
words = tokenize(text)
self.doc_count[category] += 1
for w in words:
self.word_count[category][w] += 1
self.total_words[category] += 1
self.vocab.add(w)
def predict(self, text):
words = tokenize(text)
vocab_size = len(self.vocab)
best_cat = 1
best_score = -10**100
total_docs = sum(self.doc_count[1:])
for cat in range(1, 4):
score = math.log(self.doc_count[cat] / total_docs)
denom = self.total_words[cat] + vocab_size
for w in words:
cnt = self.word_count[cat][w]
score += math.log((cnt + 1) / denom)
if score > best_score:
best_score = score
best_cat = cat
return best_cat
nb = NaiveBayes()
nb.add_document(1, "proof theorem algebra geometry")
nb.add_document(2, "computer algorithm software network")
nb.add_document(3, "market finance trade business")
sys.stdin = io.StringIO(inp)
text = sys.stdin.read()
return str(nb.predict(text)) + "\n"
# custom cases
assert run("algorithm software") == "2\n", "programming vocabulary"
assert run("market trade company") == "3\n", "economics vocabulary"
assert run("geometry theorem proof") == "1\n", "mathematics vocabulary"
assert run("unknown mysterious token") in {
"1\n", "2\n", "3\n"
}, "unseen words still produce valid class"
| Test input | Expected output | What it validates |
|---|---|---|
algorithm software |
2 |
Strong category-specific vocabulary |
market trade company |
3 |
Multiple matching finance terms |
geometry theorem proof |
1 |
Mathematical terminology |
unknown mysterious token |
Any valid class | Laplace smoothing prevents failure |
Edge Cases
Consider a document where every word is unseen:
Input:
quantum rainforest telescope
Without smoothing, every category probability becomes zero because none of the words appeared during training. The algorithm avoids this by adding one to every frequency count:
$\frac{0+1}{\text{totalWords}(c)+|V|}$
All categories receive finite scores, and the classifier still returns a valid answer.
Another tricky case is inconsistent capitalization:
Input:
ALGORITHM Software NETWORK
The tokenizer converts everything to lowercase before processing. All three words correctly match category 2 vocabulary.
A third subtle case involves punctuation-heavy text:
Input:
market!!! finance??? trade...
A naive whitespace split would produce tokens like "market!!!" that never match training words. The regex tokenizer strips punctuation and extracts clean alphabetic tokens, preserving the intended signal.