FOA Home
We have described the lexical analyzer in terms of the job it must do
processing each and every document in the corpus because this task
confronts us first. But because our central task will be to match
these documents against subsequent users' queries, it is critical
that the identical lexical analysis be performed on the queries as was
performed on the documents. This creates several implementation
constraints (e.g., that the same code libraries are available to the
indexer and to the query processing interface), but these are minor. If
the query language is designed to support any special operators (e.g.,
Boolean combinators, proximity operators), the query's lexical analyzer
may accept a super-set of the tokens accepted by the document's
analyzer. In any case, it is imperative if queries and documents are to
be matched correctly that the same lexical analysis be applied to both
streams. Using an identical code library is the easiest way to ensure
this.
It may seem nonsensical to worry so much about processing each
character efficiently, when we assume that some other, previous process
has already identified each inter-document break - doesn't such
processing require the same computational effort, and, if so, doesn't
this make our current efficiency worries moot?
Perhaps. A conclusive
answer depends on many architecture and operating system specifics.
There are two reasons that we have made such assumptions. The first is
that the practicalities of delivering the FOA corpora and code currently
make this convenient. But the more serious reason is that the most
theoretically and intellectually interesting questions involve analysis
of operations downstream from the first stages of inter-document parsing
- how to identify tokens, how to count them, etc. If these latter
operations are made especially efficient, it means we can afford to do
more experimentation, more playfully. For a text, that is the primary
concern.
Top of Page
Summary