FOA Home
The promise offered by the first chapter is that many real-world
problems can be viewed as instances of the FOA problem. The proof is to
be found in concrete code - a relatively small technology base that will
prove useful in a wide array of applicatons. In this chapter we will
present a suite of software tools that together generate a ``search
engine" for a wide variety of situations. Source code is provided so
that these tools can be easily modified for applications of your own. We
will work through two different examples of IR systems, in order to
demonstrate how slight variants of the same basic code can handle both.
Compared
to the broad generalities of Chapter 1, the technical details of this
chapter will sound a much different tone. Describing a complex algorithm
requires the specification of many, sometimes tedious details. To make
the software executable on machines that are likely to be available to
you, the details are provided for several operating systems. But the
processor speeds, internal memory and harddisk sizes available on
computers is changing dramatically each year, so many of the assumptions
on which these routines are based will require constant re-evaluation.
You
will develop the software tools in three phases. The first phase will
convert an arbitrary pile of textual objects into a well-defined corpus
of documents, each containing a string of terms to be indexed. The
second phase involves building efficient data structures to ``invert"
the $Index$ relation so that, rather than seeing all the words that are
in a particular document, we can find all documents containing
particular keywords. All of these efforts are in anticipation of the
third and final phase, which matches queries against indices in order to
retrieve those that are most similar. These three major phases are
central to building any search engine.
This chapter will be most
concerned with the first two phases that together extract lexical
features. Our goal will be the extraction of a set of features worthy of
subsequent analysis. As in any cognitive science, the specification of
an appropriate LEVEL OF ANALYSIS -- whether it is the resolution
and depth of an image; the sub-phonemes of continuous speech; the speech
acts of language ... -- the specification of this atomic feature set is
the first important step.
This will involve a great deal of work, much of
it unpleasant except to those who enjoy designing efficient algorithms
and data-structures (some of us actually do enjoy this!:). The promise
is that we will, as a consequence of good software design, develop
useful tools that allow us to spend the rest of our time exploring
interesting features of language.
Top of Page
Building useful tools