A very different approach to the text classification task has been
proposed by Cohen in his RIPPER system [Cohen96a] [Cohen96b]. The space of hypotheses
considered by Ripper is a set of Boolean rules composed over a space of
keywords. A simple example is shown in Figure (FOAref), which
extends the obvious definition of {\tt IRELAND} to include documents
that mention violent {\tt IRA} activities as well. Like decision lists,
Ripper's rule sets are easier for human experts to interpret than a
large system of Bayesian probabilities.
Ripper is an example of a
``covering'' learning algorithm; cf. Figure (figure). This means
that it iteratively forms conjunctions of Boolean predicates which
``cover'' some of the positive instances of a Boolean classification
while excluding all of the negative instances. In the next iteration,
the positive instances which were covered previously are removed from
the training set, and a new conjunctive clause is formed which again
covers some more positive instances while excluding all negative ones.
Ultimately, then, the rule set is a disjunction of these conjunctive
clauses; that is, it will be in DISJUNCTIVE NORMAL FORM (DNF).
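The covering loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not Cohen's implementation: documents are assumed to be plain sets of words, each predicate is assumed to be a test that one word is present, and conditions are chosen with a FOIL-style information gain. The function names (foil_gain, learn_rule, covering) and the toy training documents are hypothetical, and Ripper's pruning and optimization passes are omitted.

```python
import math

def foil_gain(p0, n0, p1, n1):
    # FOIL-style gain from narrowing coverage (p0 pos, n0 neg) to (p1 pos, n1 neg).
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def learn_rule(pos, neg):
    """Greedily grow one conjunction of required words until no negative is covered.

    Returns a frozenset of words, or None if the negatives cannot be excluded.
    """
    rule = set()
    pos_cov, neg_cov = list(pos), list(neg)
    while neg_cov:
        candidates = set().union(*pos_cov) - rule
        best, best_gain = None, None
        for word in sorted(candidates):  # sorted so ties break deterministically
            p1 = sum(word in d for d in pos_cov)
            n1 = sum(word in d for d in neg_cov)
            gain = foil_gain(len(pos_cov), len(neg_cov), p1, n1)
            if best_gain is None or gain > best_gain:
                best, best_gain = word, gain
        if best is None:  # no words left to add
            return None
        rule.add(best)
        pos_cov = [d for d in pos_cov if best in d]
        neg_cov = [d for d in neg_cov if best in d]
    return frozenset(rule)

def covering(pos, neg):
    """Learn a DNF rule set: a list of conjunctions, each a frozenset of words."""
    rules, remaining = [], list(pos)
    while remaining:
        rule = learn_rule(remaining, neg)
        if rule is None:
            break
        rules.append(rule)
        # Remove the positives this clause covers before forming the next clause.
        remaining = [d for d in remaining if not rule <= d]
    return rules

# Toy training data (invented for illustration).
pos = [{"ireland", "election"}, {"ireland", "economy"},
       {"ira", "bomb"}, {"ira", "bomb", "killed"}]
neg = [{"ira", "retirement", "savings"}, {"bomb", "movie"},
       {"election", "economy"}]

rules = covering(pos, neg)
print(rules)
```

On this toy data the loop learns one clause requiring "ireland" and a second requiring both "bomb" and "ira"; read together as a disjunction, the two clauses form a DNF rule set.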
Ripper also includes optimizations to simplify rules by removing
conditions which do not affect performance, and by picking conditions
which provide the most information gain [Quinlan93]. Finally, Cohen adapted
these rule learning techniques to the text domain by adding ``set-valued
attributes.'' These special attributes collapse a document's
representation to be simply the set of words it contains. Ripper's rules
can then include tests for sets of words, rather than having to test the
presence/absence of each word individually.
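As a rough sketch of how a set-valued representation might be used, the Python fragment below reduces a document to the set of words it contains and tests it against a rule set in DNF. The rule set shown is invented for illustration and is not taken from the figure or from Cohen's papers; the names IRELAND_RULES, to_word_set, and matches are likewise hypothetical.

```python
import re

# Hypothetical DNF rule set: each frozenset is one conjunction of required words.
IRELAND_RULES = [frozenset({"ireland"}), frozenset({"ira", "killed"})]

def to_word_set(text):
    """Collapse a document to the set of lowercase words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches(rules, words):
    """True if any conjunction in the rule set is a subset of the document's words."""
    return any(rule <= words for rule in rules)

doc_a = to_word_set("Two killed in IRA attack")
doc_b = to_word_set("Open an IRA retirement account")
print(matches(IRELAND_RULES, doc_a), matches(IRELAND_RULES, doc_b))
```

Each clause is checked with a single subset test on the word set, so there is no need for one Boolean feature per vocabulary word.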