We will be especially concerned with corpora which have benefited from
extensive manual indexing. For example, the articles in the Encyclopedia
Britannica have benefited from man-centuries of effort applied
to organize these textual passages into coherent indices, thesauri, and
taxonomies. This manual attention provides two advantages in the context
of machine learning.
First, the manual classification of documents to
categories can be used as training data in the context of supervised
learning (§7.4). Second, manually
constructed representations provide a kind of upper bound on what we can
hope our automatic learning techniques will build. Ultimately,
however, we can expect that the most successful applications will not
oppose manual, editorial enhancement with automatic induction but
integrate learning into the editorial process. Machine learning can
already do much of the job that has traditionally been done by
human editors; and yet, many aspects of the editorial function will
remain beyond our learning techniques for the foreseeable future.
Harnessing machine learning as part of an EDITOR'S WORKBENCH
promises to leverage this scarce resource most effectively.
Of course,
corpora which have benefited from such careful manual attention are few
and far between. Much more typical is the textual corpus without any
manual indexing whatsoever. The third advantage, then, of those special
corpora which do have attendant editorial enhancements is that if our
learning techniques can generate analogous structures on these special
collections, we can realistically expect the same techniques to generate
useful structure on other collections as well.
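
The first advantage above, using editors' category assignments as training data for supervised learning, can be illustrated with a minimal sketch. The classifier is a simple multinomial naive Bayes model with add-one smoothing, chosen here only for brevity; it is not a method prescribed by the text. The documents, category names, and function names (`train`, `classify`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build a model from (text, category) pairs assigned by human editors."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of documents
    vocab = set()
    for text, category in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest naive Bayes log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Log prior: fraction of training documents filed under this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen (word, category) pairs nonzero.
                count = word_counts[category][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy corpus standing in for a manually indexed collection.
manual_index = [
    ("whales dolphins and seals are marine mammals", "zoology"),
    ("lions and tigers are large cats", "zoology"),
    ("the violin and cello are string instruments", "music"),
    ("a symphony is written for a full orchestra", "music"),
]
model = train(manual_index)
print(classify("dolphins are mammals", model))
print(classify("orchestra string instruments", model))
```

In practice, part of the manually indexed collection would be held out from training, so that the learner's assignments can be compared against the editors' own.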
Training against manual indices