FOA Home
Having now focused our attention on a particular file, and beginning and
ending locations within that file associated with a particular document,
we can consider this file segment as simply a a STREAM of
characters. Reading each and every character of each and every document,
deciding whether it is part of a meaningful token or not, and deciding
whether these tokens are worth indexing will be the most computationally
intensive aspect of the indexing chore; this is our ``inner loop." For
that reason, we will devote some real attention to making this lexical
analysis as efficient as possible.
Several general criteria will shape
our design. First, since we are assuming that our textual corpus is very
large, we will do our best to avoid duplicating this primary text. That
is, we will attempt to deal with the text in situ, and not make
a second copy for use just by the indexing and retrieval system. Thus,
we will be creating a system of pointers into locations within the
corpora directories and files.
A wide range of alternative designs are
possible even at this early stage, and so we desire as much flexibility
as possible in the specification of the lexical analyzer. A LEXICAL
ANALYZER GENERATOR , such as the
For our two example corpora and many other
situations, the stream of characters, a straight-forward analysis in
terms of simple FINITE STATE MACHINE will suffice. We will depend
on a utility written by Christopher Fox [FoxC92] . This utility simultaneously
achieves two critical goals. First, the lexical analyzer
TOKENIZES the stream of characters into a sequence of word-like
elements. At first blush this seems straightforward - a token is
anything separated by white space, where the standard definition of
white space is used. But what about hyphens? Should the hyphenated
phrase
The presence
of digits among the alphabetic characters presents another problem. Are
numbers to be allowed as tokens? Perhaps we only want to allow
``special'' numbers (e.g., {1776, 1984, 2001, 3.14159}, ...). Perhaps we
want to use rules similar to those for programming language identifiers
and require that a token begin with an alphabetic character which may
then be followed by numbers or letters.
We must also worry about the
CASE of the characters at this earliest lexical analysis stage.
Are we to treat capitalization as significant in distinguishing tokens
one from another, or not? An enormous reduction in vocabulary size is
possible if we FOLD CASE so as to treat upper and lower
characters interchangeably. But of course then we have also precluded
the possibility of many proper name analyses that may be useful for
identifying SINGULAR people, places or events (see Chapters §6 .). In some cases the semantics of the
documents make decisions about case automatic. For example, if the
documents are program source files, the language in question may or may
not treat differences in case as significant.
Top of Page
Intra-document parsing
lex tool in UNIX,
allows the specification of very complicated lexical analyzers for very
elaborate languages. The fundamental representation used by all such
algorithms is a FINITE STATE MACHINE , like that shown in Figure
(figure) . This simple representation breaks the set of
possible characters coming from a text stream into classes (drawn as
circular states), with transitions from one state to the next on the
occurrence of particular characters. By careful construction of the sets
of characters (e.g., WHITE-SPACE characters corresponding to
state $0$ in (FOAref) ), arbitrary text sequences can be
handled very efficiently.
DATA-BASE be treated as two separate tokens or as a
single one? Should a file name, like WINDOWS.EXE be treated
as a single token? Which host, directory and file elements in a full URL
like {\tt www.cs.ucsd.edu/\~{ }rik} are to be kept intact as
individuated tokens? More elaborate elements such as these can quickly
demand the sophistication of a tool like lex.
Subsections