FOA Home
The classification task is ... classic:), and a wide range of
technologies for accomplishing this task have been developed [REF110] [Carlin96] . Even in the relatively
recent context of text-based domains, many techniques have been applied
[Lewis94a] . Here we will focus
primarily on extending the probabilistic approach of Section §5.5 , but consider some other alternatives
in Section §7.5 . Andrew McCallum and
colleagues have developed a software suite called RAINBOW which is
very useful for experiments into text classification.
We begin by
assuming that we have been given a set of classes $=\{c_{1},c_{2}, ...
c_{C}\}$, and have been asked to produce a function that
CLASSIFIES a new document into one of these classes. McCallum
gives a good overview of the Bayesian approach applied to text: This
approach assumes that the text data was generated by a parametric model,
and uses training data to calculate the Bayes-optimal estimates of the
model parameters. Then, equipped with these estimates, it classifies new
documents using Bayes' rule to turn the generative model around
and calculate the posterior probability that a class would have
generated the test document in question. [McCallum98b]
The major features of
the training regime are shown in Figure (figure) . During the
first phase, a correspondence has been established manually
between each example document $\mathbf{d}_{i}$ and some classifiction
$c_{i}$ in the TRAINING SET $T$: T \equiv
\{<\mathbf{d}_{i},c_{i}>\} (In this figure, the classifications are
imagined to be part of a hierarchical classification system, as
discussed in detail in Section §7.4.5
.) This data is used somehow (for now it's OK to think of it as magic:)
to tune the set of parameters $\Theta$ specifying a particular
classifier. The second and dominant phase is then to use this classifier
to automatically assign document to classes in an analogous
manner to those in manually classified in the training set.
We seek the
POSTERIOR PROBABILITY of a particular class, given the evidence
provided by a new document, $\Pr(c|d)$. The second step is then, given
these probabilities for all classes, to return the one that maximizes
this likelihood. Bayes Rule is typically invoked in such situations, to
``invert'' this conditional dependency: \Pr(c|\mathbf{d}) =
\frac{\Pr(\mathbf{d}|c) \Pr(c)} {\sum\limits_{c} \Pr(\mathbf{d}|c)
\Pr(c)}
The likelihood $can be more easily estimated if we assume
classes are represented as MIXTURES of the documents features. We
can think of hypothetical documents (that we imagine belonging to the
class) as generated by some model, the parameters $\Theta$ of which we
hope to discover. Assuming that a document is generated by first picking
a particular class and then using its parameters to select features, the
liklihood of a document can be computed by considering the prior
probability of each class $\Pr(c | \Theta)$ and its distribution
$\Pr(\mathbf{d} | c ; \Theta)$: \Pr(\mathbf{d} | \Theta) = \sum_{c}
\Pr(c | \Theta) \cdot \Pr(\mathbf{d} | c ; \Theta)
These measures allow
us to state precisely how well a learned model captures regularities
found in the training set: The model is doing a good job iff it applies
the same classifications as observed in the training data. But this
criterion also highlights the dependence of any learning method on the
training data $T$ used to construct it which cannot be over-emphasized.
Independent of the range of hypotheses considered, and of the learning
methods used to build the best possible model, our ability to
INDUCTIVELY GENERALIZE from the training data to new examples
(e.g., un-classified documents) depends entirely on how typical the
training data is. Within COMPUTATIONAL LEARNING THEORY this
dependence is captured by the assumption that training data and
subsequent trials are drawn from the same distribution [REF679] [Kearns94] .
In practice, performance of
a classifier on the training set is bound to be an optimistic
over-estimate of how well it can generalize to previously unseen data.
The most common way to guard against such over-estimates is to
artificially HOLD OUT some of the training set as a separate
TEST SET . Training proceeds as before on all the training data
but not the test set, and then the system is evaluated on the test set
which provides a more reasonable estimate as to how the system will
fare. More sophisticated CROSS-VALIDATION procedures begin by
partitioning the available training data into $k$ subsets. Iteratively,
each partition is used as the hold out set while the remaining
$\frac{k-1}{k}$ balance of $T$ is used for training. The average
performance across all $k$ tests is then used as a more statistically
reliable estimate of true performance. The statistical validity derived
by cross-validation and related techniques become especially important
when the total amount of training data is small; given the given the
time and effort required to produce manual classifications this is often
the case.
Top of Page
Classification
Subsections