FOA Home | UP: Adaptive Information Retrieval


Classification

The classification task is ... classic:), and a wide range of technologies for accomplishing this task have been developed [REF110] [Carlin96] . Even in the relatively recent context of text-based domains, many techniques have been applied [Lewis94a] . Here we will focus primarily on extending the probabilistic approach of Section §5.5 , but consider some other alternatives in Section §7.5 . Andrew McCallum and colleagues have developed a software suite called RAINBOW which is very useful for experiments into text classification.

We begin by assuming that we have been given a set of classes $=\{c_{1},c_{2}, ... c_{C}\}$, and have been asked to produce a function that CLASSIFIES a new document into one of these classes. McCallum gives a good overview of the Bayesian approach applied to text: This approach assumes that the text data was generated by a parametric model, and uses training data to calculate the Bayes-optimal estimates of the model parameters. Then, equipped with these estimates, it classifies new documents using Bayes' rule to turn the generative model around and calculate the posterior probability that a class would have generated the test document in question. [McCallum98b]

The major features of the training regime are shown in Figure (figure) . During the first phase, a correspondence has been established manually between each example document $\mathbf{d}_{i}$ and some classifiction $c_{i}$ in the TRAINING SET $T$: T \equiv \{<\mathbf{d}_{i},c_{i}>\} (In this figure, the classifications are imagined to be part of a hierarchical classification system, as discussed in detail in Section §7.4.5 .) This data is used somehow (for now it's OK to think of it as magic:) to tune the set of parameters $\Theta$ specifying a particular classifier. The second and dominant phase is then to use this classifier to automatically assign document to classes in an analogous manner to those in manually classified in the training set.

We seek the POSTERIOR PROBABILITY of a particular class, given the evidence provided by a new document, $\Pr(c|d)$. The second step is then, given these probabilities for all classes, to return the one that maximizes this likelihood. Bayes Rule is typically invoked in such situations, to ``invert'' this conditional dependency: \Pr(c|\mathbf{d}) = \frac{\Pr(\mathbf{d}|c) \Pr(c)} {\sum\limits_{c} \Pr(\mathbf{d}|c) \Pr(c)}

The likelihood $can be more easily estimated if we assume classes are represented as MIXTURES of the documents features. We can think of hypothetical documents (that we imagine belonging to the class) as generated by some model, the parameters $\Theta$ of which we hope to discover. Assuming that a document is generated by first picking a particular class and then using its parameters to select features, the liklihood of a document can be computed by considering the prior probability of each class $\Pr(c | \Theta)$ and its distribution $\Pr(\mathbf{d} | c ; \Theta)$: \Pr(\mathbf{d} | \Theta) = \sum_{c} \Pr(c | \Theta) \cdot \Pr(\mathbf{d} | c ; \Theta)

These measures allow us to state precisely how well a learned model captures regularities found in the training set: The model is doing a good job iff it applies the same classifications as observed in the training data. But this criterion also highlights the dependence of any learning method on the training data $T$ used to construct it which cannot be over-emphasized. Independent of the range of hypotheses considered, and of the learning methods used to build the best possible model, our ability to INDUCTIVELY GENERALIZE from the training data to new examples (e.g., un-classified documents) depends entirely on how typical the training data is. Within COMPUTATIONAL LEARNING THEORY this dependence is captured by the assumption that training data and subsequent trials are drawn from the same distribution [REF679] [Kearns94] .

In practice, performance of a classifier on the training set is bound to be an optimistic over-estimate of how well it can generalize to previously unseen data. The most common way to guard against such over-estimates is to artificially HOLD OUT some of the training set as a separate TEST SET . Training proceeds as before on all the training data but not the test set, and then the system is evaluated on the test set which provides a more reasonable estimate as to how the system will fare. More sophisticated CROSS-VALIDATION procedures begin by partitioning the available training data into $k$ subsets. Iteratively, each partition is used as the hold out set while the remaining $\frac{k-1}{k}$ balance of $T$ is used for training. The average performance across all $k$ tests is then used as a more statistically reliable estimate of true performance. The statistical validity derived by cross-validation and related techniques become especially important when the total amount of training data is small; given the given the time and effort required to produce manual classifications this is often the case.

Subsections


Top of Page | UP: Adaptive Information Retrieval | ,FOA Home


FOA © R. K. Belew - 00-09-21