When you are thinking about how you classify your Email, almost
certainly keywords contained in your Email are some of the
FEATURES you think of first. Recall, however, that the keyword
vocabulary can be very large. Using this feature space, then, individual
documents' representations will be very SPARSE. In terms of the
vector space model of Section §3.4,
many of the vector elements will be zero. To use Littlestone's lovely
expression, "Irrelevant attributes abound" [Littlestone88], and so it should
come as no surprise that his learning techniques are especially
appropriate in FOA learning applications (cf. Section §7.5.3).
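
To make the sparsity point concrete, here is a minimal sketch of a keyword-count vector that stores only its nonzero elements. The toy vocabulary, the sample Email text, and the function names are all hypothetical and chosen for illustration only; a real keyword vocabulary would be far larger and the resulting vectors correspondingly sparser.

```python
import re
from collections import Counter

def sparse_vector(text, vocabulary):
    """Map a document to {keyword index: count}, keeping only nonzero entries."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t in vocabulary)
    return {vocabulary[t]: c for t, c in counts.items()}

def sparsity(vector, vocabulary_size):
    """Fraction of vector elements that are zero."""
    return 1.0 - len(vector) / vocabulary_size

# Hypothetical toy vocabulary: keyword -> vector index.
VOCABULARY = {w: i for i, w in enumerate(
    ["budget", "classifier", "deadline", "grant", "keyword",
     "lunch", "meeting", "proposal", "seminar", "vector"])}

# Hypothetical Email text.
EMAIL = "Reminder: the grant proposal deadline is Friday. Proposal drafts welcome."

vec = sparse_vector(EMAIL, VOCABULARY)
print(vec)                              # only the keywords that occur are stored
print(sparsity(vec, len(VOCABULARY)))   # most elements are zero
```

Storing only the nonzero elements keeps the cost of representing a document proportional to the number of distinct keywords it contains, not to the size of the vocabulary.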
Efforts to control the keyword
vocabulary and make the lexical features as meaningful as possible are
therefore important preconditions for good classification performance.
For example, name-tagging techniques (cf. Section §6.6.1) that reliably identify proper
names can provide valuable classification features. A good
proper name tagger would be especially sophisticated about
capitalization, name order, and abbreviation conventions. When both people's
proper names and institutional names (government agencies, universities,
corporations, etc.) are tagged, the recognition of complex, multi-token phrases
becomes possible.
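
As a rough illustration of how multi-token names might become single classification features, the sketch below uses a deliberately crude capitalization heuristic. It is not the tagging technique of Section §6.6.1; the regular expression, the function names, and the sample sentence are assumptions made for this example, and the heuristic will mis-tag many real cases (for instance, a capitalized sentence-initial word directly followed by a name).

```python
import re

# One capitalized word ("Smith") or a single-letter initial ("J.").
_CAP = r"(?:[A-Z][a-z]+|[A-Z]\.)"
# A run of two or more such tokens, optionally joined by "of".
_NAME = re.compile(rf"{_CAP}(?:\s+(?:of\s+)?{_CAP})+")

def tag_names(text):
    """Return candidate multi-token proper names found in text."""
    return [m.group(0) for m in _NAME.finditer(text)]

def name_features(text):
    """Collapse each candidate name into a single feature token."""
    return ["_".join(name.split()) for name in tag_names(text)]

# Hypothetical sample text.
SAMPLE = ("Forwarded by J. Smith at the University of California, "
          "re: National Science Foundation grant.")

names = tag_names(SAMPLE)
features = name_features(SAMPLE)
print(names)
print(features)
```

Treating each recognized phrase as one token means that a feature such as `University_of_California` is counted once, instead of contributing three separate and less informative keywords.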
In part because of the difficult issues that lexical,
keyword-based representations entail, it is worth thinking briefly
about some of the alternatives. There are also less obvious features we
might use to classify documents. META-DATA associated with the
document, for example information about its date and place of
publication, is one possibility. Geographic place information
associated with a document can also be useful; cf. Section §6.6.1. Finally, recall the
bibliographic citations that many documents contain (cf. Section §6.1). The set of references one document
makes to other ones (representable as links in a graph) can be used as
the basis of classification in much the same way as its keywords.
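
The parallel between citations and keywords can be sketched as follows: each document's outgoing citation links are turned into a set of binary features, and two documents are then compared by the references they share. The graph, the document identifiers, and the choice of Jaccard overlap as the comparison are all assumptions made for this illustration.

```python
def citation_features(doc_id, citation_graph):
    """Turn a document's outgoing citation links into a set of binary features."""
    return {f"cites:{target}" for target in citation_graph.get(doc_id, ())}

def overlap(a, b):
    """Jaccard overlap between two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical citation graph: document -> documents it cites.
CITATION_GRAPH = {
    "doc1": ["refA", "refB", "refC"],
    "doc2": ["refB", "refC", "refD"],
    "doc3": ["refE"],
}

f1 = citation_features("doc1", CITATION_GRAPH)
f2 = citation_features("doc2", CITATION_GRAPH)
f3 = citation_features("doc3", CITATION_GRAPH)

print(overlap(f1, f2))  # doc1 and doc2 share two of four distinct references
print(overlap(f1, f3))  # doc1 and doc3 share none
```

Because the features are just set members, any classifier that accepts sparse keyword vectors could accept these citation features in the same form.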
In summary, while keywords provide the most obvious set of features on
which classifications can be based, they result in very large and
sparse learning problems. Other features are available and may
also be useful. It is important to note, however, that careful Bayesian
reasoning about dependencies among keyword features is a very difficult
problem, as discussed in Section §5.5.7. Any attempt to extend this
inference to include other, heterogeneous types of features must be made
carefully.
Feature selection