CF 207D3 - The Beaver's Problem - 3

Rating: 2000
Tags: -
Solve time: 3m 2s
Verified: yes

Solution

Problem Understanding

The problem asks us to classify documents into one of three subjects, labeled 1, 2, or 3. Each document has a unique identifier, a title, and a body of text. The training data provides many documents for each subject, and the challenge is to determine the subject of a new document based on its textual content. The identifier says nothing about the subject, and the approach described here also ignores the title, so the classification relies solely on the body text.

The constraints indicate that documents are relatively small (up to 10 kilobytes), and we can assume that a straightforward text-processing or machine-learning approach will be feasible. Since the training set is relatively large but bounded, algorithms with linear or near-linear complexity in the document size are acceptable. The main edge cases involve documents with ambiguous wording, extremely short text, or text that is almost identical across subjects. For example, a document containing only the word "report" could belong to any subject; naive approaches like keyword matching could fail unless they handle overlaps carefully.

Approaches

A brute-force approach would treat the problem as a text similarity search: for a new document, compare its content to every document in the training set, compute a similarity score (such as the number of shared words), and assign the subject of the most similar training document. This approach is simple but scales poorly. If there are $n$ training documents with about $m$ words each, a single query costs $O(n \cdot m)$ time, and the whole training set must be available at query time, which is acceptable only for small training sets. For the largest groups, this approach would be too slow, especially if multiple test documents need classification.
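The brute-force idea can be sketched as follows. This is a minimal illustration, assuming "similarity" means the number of distinct shared words; the function names and the three one-line training documents are hypothetical placeholders, not data from the problem.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"\w+", text.lower())

def nearest_neighbor_subject(training, query):
    """Return the subject of the training document that shares the most
    distinct words with the query; ties go to the smallest subject number.

    `training` is a list of (subject, text) pairs.
    """
    query_words = set(tokenize(query))
    best_subject, best_overlap = None, -1
    for subject, text in training:
        # Every query is compared against every training document: O(n * m).
        overlap = len(query_words & set(tokenize(text)))
        if overlap > best_overlap or (overlap == best_overlap and subject < best_subject):
            best_subject, best_overlap = subject, overlap
    return best_subject

# Hypothetical placeholder training set, one document per subject.
training = [
    (1, "this is document about finance trade"),
    (2, "this is document about sports"),
    (3, "this is document about trade"),
]

print(nearest_neighbor_subject(training, "this document discusses trade and finance"))  # 1
```

The query shares four distinct words with the subject 1 document, two with subject 2, and three with subject 3, so subject 1 is returned.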

A faster approach relies on the observation that document classification can be reduced to a frequency-based representation of text. We can treat each document as a bag of words, count how frequently each word occurs per subject across the training set, and then, for a new document, sum the counts of its words separately for each subject. This turns the classification into a simple lookup and addition problem: each word adds its training frequency to the score of every subject it occurred in, and the highest-scoring subject is chosen. This is effectively a Naive Bayes classifier without probabilistic normalization. The key insight is that per-subject word counts allow constant-time scoring (on average, with hash maps) for each word of the test document, so classifying a document takes time linear in its length, independent of the number of training documents; the vocabulary size affects only memory.
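Written out, the scoring rule reads as follows. The symbols $c_s(w)$ and $\mathrm{score}_s(d)$ are notation introduced here for illustration; they do not come from the problem statement.

$$
\mathrm{score}_s(d) = \sum_{w \in d} c_s(w), \qquad \hat{s}(d) = \arg\max_{s \in \{1, 2, 3\}} \mathrm{score}_s(d)
$$

Here $c_s(w)$ is the number of times word $w$ occurs in the training documents of subject $s$, the sum runs over every word occurrence in the test document $d$, and ties in the maximum go to the smallest $s$.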

Approach	Time Complexity	Space Complexity	Verdict
Brute Force	O(n * m)	O(n * m)	Too slow for large training sets
Frequency-based lookup	O(L)	O(V)	Accepted

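Here, $n$ is the number of training documents, $m$ is the number of words per training document, $L$ is the number of words in the test document, and $V$ is the vocabulary size of the training set. The $O(L)$ bound covers a single query; the frequency tables are built once beforehand in time proportional to the total training text, $O(n \cdot m)$.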

Algorithm Walkthrough

1. Parse all training documents and split the text into words. Remove punctuation and normalize case for consistency. Count the frequency of each word separately for subjects 1, 2, and 3. This produces three dictionaries mapping words to counts.
2. For the test document, split its text into words in the same way, ignoring punctuation and case differences. Initialize a score array with three entries, one for each subject.
3. For each word in the test document, look it up in each of the subject-specific frequency dictionaries. Add the frequency of that word in each subject to the corresponding score (a missing word adds 0). This sums the evidence supporting each subject.
4. Compare the three scores. The subject with the highest score is predicted as the subject of the test document. In case of ties, choose the subject with the smallest number.
5. Output the predicted subject.

Why it works: The subject whose training documents share the most word occurrences with the test document receives the highest score. Treating each word as an independent contribution is a simplification, but it is workable when documents contain many words and word usage patterns differ between subjects. This is a heuristic rather than a guarantee: words common to every subject add to all three scores, and a subject with more training text accumulates larger raw counts, so the highest score is only an estimate of the most likely subject.
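To make the remark about missing probabilistic normalization concrete, the sketch below contrasts raw-count scoring with multinomial Naive Bayes using add-one (Laplace) smoothing. This is a swapped-in variant for comparison, not the method used in the solution; the function names and the toy training data are hypothetical, and every subject is assumed to have at least one training document.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"\w+", text.lower())

def train(subject_docs):
    """Return per-subject word counts, per-subject document counts, and the vocabulary."""
    counts = [Counter() for _ in subject_docs]
    for subject, docs in enumerate(subject_docs):
        for text in docs:
            counts[subject].update(tokenize(text))
    vocabulary = set().union(*counts)
    doc_totals = [len(docs) for docs in subject_docs]
    return counts, doc_totals, vocabulary

def raw_scores(words, counts):
    """Raw-count scoring: sum of training frequencies per subject."""
    return [sum(subject_counts[w] for w in words) for subject_counts in counts]

def log_scores(words, counts, doc_totals, vocabulary):
    """Naive Bayes scoring: log prior plus smoothed log likelihood of each word."""
    total_docs = sum(doc_totals)
    scores = []
    for subject_counts, n_docs in zip(counts, doc_totals):
        total_words = sum(subject_counts.values())
        score = math.log(n_docs / total_docs)  # assumes n_docs > 0
        for word in words:
            if word in vocabulary:  # words never seen in training are skipped
                score += math.log((subject_counts[word] + 1) / (total_words + len(vocabulary)))
        scores.append(score)
    return scores

def best_subject(scores):
    """Subject number of the first maximum, so ties go to the smallest subject."""
    return scores.index(max(scores)) + 1

# Hypothetical toy data: subject 2 has far more text than the others.
subject_docs = [
    ["trade finance market"],
    ["sports " * 20 + "trade trade"],
    ["weather forecast rain"],
]
counts, doc_totals, vocabulary = train(subject_docs)
words = tokenize("trade")

raw_prediction = best_subject(raw_scores(words, counts))
nb_prediction = best_subject(log_scores(words, counts, doc_totals, vocabulary))
print(raw_prediction, nb_prediction)  # 2 1
```

In this toy case the raw counts favor subject 2 only because its training text is longer, while the normalized scores favor subject 1, where "trade" makes up a larger share of the words. Whether normalization helps on the real training set is not something this sketch establishes.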

Python Solution

import sys
import re
from collections import defaultdict

input = sys.stdin.readline

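# Preprocess: build one word-frequency dictionary per subject from training data.
# For simplicity, a tiny placeholder training set stands in for the real one.
# The real counts would come from the files in directories "1", "2", "3" of the
# training set; since only the test document arrives on standard input, they
# would presumably have to be computed offline and embedded in the source.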

subject_word_count = [defaultdict(int) for _ in range(3)]

def preprocess_training(subject_docs):
    for subject, docs in enumerate(subject_docs):
        for text in docs:
            words = re.findall(r'\w+', text.lower())
            for word in words:
                subject_word_count[subject][word] += 1

# Example: placeholder for training data
training_docs = [
    ["this is document about finance trade"],    # subject 1
    ["this is document about sports"],           # subject 2
    ["this is document about trade"],            # subject 3
]

preprocess_training(training_docs)

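# Read test document: identifier line, title line, then the body text.
# The identifier and title are read only to skip past them.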
doc_id = input()
title = input()
text_lines = sys.stdin.read().splitlines()
words = []
for line in text_lines:
    words.extend(re.findall(r'\w+', line.lower()))

scores = [0, 0, 0]
for word in words:
    for subj in range(3):
        scores[subj] += subject_word_count[subj].get(word, 0)

# index() returns the first maximum, so ties go to the smallest subject number.
predicted_subject = scores.index(max(scores)) + 1
print(predicted_subject)

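The code first builds a frequency map per subject. It then reads the test document, normalizes the text in the same way as the training text, and accumulates evidence per subject by adding each word's training frequency to that subject's score. The max function finds the strongest score, index returns its first position (so ties resolve to the smallest subject number), and adding 1 converts the zero-based index to the subject number. With the placeholder training data the output only illustrates the mechanics; meaningful predictions require counts from the real training set.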
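The placeholder training data could be replaced by counts built offline from the training directories. The sketch below assumes a local copy with subdirectories "1", "2", and "3", and assumes each file holds the identifier on line 1, the title on line 2, and the body on the remaining lines, mirroring the test input read by the solution. The function name `load_training_counts` is hypothetical, and the temporary files exist only to make the example runnable.

```python
import os
import re
import tempfile
from collections import Counter

def load_training_counts(root):
    """Build one word Counter per subject from root/1, root/2, root/3."""
    counts = []
    for subject in ("1", "2", "3"):
        subject_counts = Counter()
        subject_dir = os.path.join(root, subject)
        for name in sorted(os.listdir(subject_dir)):
            path = os.path.join(subject_dir, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                lines = handle.read().splitlines()
            body = " ".join(lines[2:])  # skip the identifier and title lines
            subject_counts.update(re.findall(r"\w+", body.lower()))
        counts.append(subject_counts)
    return counts

# Self-check on made-up files; the contents below are not real training data.
with tempfile.TemporaryDirectory() as root:
    samples = {
        "1": "101\nMarket note\ntrade finance trade\n",
        "2": "202\nMatch report\nsports match\n",
        "3": "303\nShipping\ntrade routes\n",
    }
    for subject, content in samples.items():
        os.makedirs(os.path.join(root, subject))
        with open(os.path.join(root, subject, "doc.txt"), "w", encoding="utf-8") as handle:
            handle.write(content)
    counts = load_training_counts(root)

print(counts[0]["trade"], counts[1]["sports"], counts[2]["trade"])  # 2 1 1
```

The resulting counters can be used in place of subject_word_count. For a submission that cannot read these files at run time, the counts (or a pruned subset of them) would have to be written into the source as literals.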

Worked Examples

Example 1

Input document text:

this document discusses trade and finance

Variable	Value