The first step is to break the corpus -- an arbitrary ``pile of
text'' -- into individually retrievable documents. This demands
that we be specific about the format of the corpus, and that we decide
how it is to be divided into individual documents. For all operating
systems we will consider, this problem can be defined more precisely in
terms of paths, directories, files, and
position within file. For any application in which the corpus
can be described by the path to its root, these tools will translate
directories/files/documents-within-files into a homogeneous corpus. Of
course, there are some situations (e.g., when documents are maintained
within a database) that cannot be captured in these terms, but these
primitives do allow a wide range of corpora to be specified.
Our model will
assume that many documents may be contained within a single file, and
that each document occupies a contiguous region within the file.
\begin{exercise}
Extend the software to allow a document to be composed of multiple,
non-contiguous textual fields.
\end{exercise}
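The contiguous-region model can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function name `document_spans` is hypothetical, and the delimiter (a line beginning with `From `, as in mbox-style mail files) is an assumed convention rather than anything the text prescribes. Each document is reported as a pair of byte offsets into the file.

```python
from typing import List, Tuple


def document_spans(data: bytes, delimiter: bytes = b"\nFrom ") -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets of each document in `data`.

    Assumption: a new document starts at every line beginning with
    "From ", so each document is one contiguous region of the file.
    """
    spans: List[Tuple[int, int]] = []
    start = 0
    while True:
        hit = data.find(delimiter, start)
        if hit == -1:
            break
        end = hit + 1  # the newline stays with the earlier document
        if end > start:
            spans.append((start, end))
        start = end
    if start < len(data):
        spans.append((start, len(data)))  # last document runs to end of file
    return spans


if __name__ == "__main__":
    sample = b"From a\nbody one\nFrom b\nbody two\n"
    for begin, finish in document_spans(sample):
        print(begin, finish, sample[begin:finish])
```

Storing offsets rather than copies of the text keeps the corpus description small, and any document can be recovered later by seeking to its start position in the file.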
Issues concerning structure within a
single document are closely related to assumptions we may or may not be
able to make about the lengths of the documents in question. Our
assumptions about how long a typical document is will recur throughout
this book. It is obvious, for example, that different document browsers
are necessary if we need to browse through an entire book rather than
look at a single paragraph. Less obvious is that the fundamental
weighting algorithms used by our indexing techniques will depend very
sensitively on the number of tokens contained in each document.
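A small sketch illustrates this sensitivity to length. It uses plain relative frequency (occurrences divided by token count) as a stand-in weight; this is an assumption chosen for illustration, not the weighting scheme developed later in the book, and the function name and sample texts are hypothetical.

```python
import re
from collections import Counter


def relative_frequency(text: str, term: str) -> float:
    """Occurrences of `term` divided by the number of tokens in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[term.lower()] / len(tokens)


paragraph = "Retrieval systems rank documents."
chapter = paragraph + " Many other sentences dilute the count of any one word here."

if __name__ == "__main__":
    # The same single occurrence carries less weight in the longer unit.
    print(relative_frequency(paragraph, "retrieval"))
    print(relative_frequency(chapter, "retrieval"))
```

The keyword occurs once in both cases, yet its share of the tokens shrinks as the unit treated as "the document" grows.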
\begin{exercise}
Take a
large \textbf{LaTeX}\ document and run it repeatedly through
\texttt{LaTeX2HTML}, systematically varying the logical unit of document
structure at which individual HTML pages are constructed. Discuss the
impact of these ``arbitrary'' decisions on the weights of the keywords.
\end{exercise}
In this textbook we will focus primarily on two particular
test corpora, AI theses (AIT) and email; these are discussed in more
detail in Section 2.4. Each of these
has a natural notion of the individual document: in the case of the AIT
it is the thesis's abstract, and for email it is the entire message. In
both cases, more refined notions of document (the individual paragraphs
within the abstract or within the email message) are possible.
With these
assumptions, we can define our corpus simply with two files: one
specifying full path information for each file, and a second specifying
where within these files each and every document resides. A large portion
of the task of navigating a directory full of files and visiting each of
them can be accomplished using the \texttt{dirent} utility.
This utility allows recursive descent through all directories from a
specified root, visiting every file contained therein.
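The two-table description of a corpus can be sketched as follows. This is a minimal illustration, not the book's software: it uses Python's `os.walk` for the recursive descent in place of calling the C `dirent` interface directly, the function name `build_corpus_tables` is hypothetical, and for simplicity it assumes each file holds exactly one document spanning the whole file.

```python
import os
from typing import List, Tuple


def build_corpus_tables(root: str) -> Tuple[List[str], List[Tuple[int, int, int]]]:
    """Walk `root` recursively and build two tables.

    files: full path of every file found, indexed by position.
    docs:  (file_id, start_offset, end_offset) for every document.
    Assumption: one document per file, covering the entire file.
    """
    files: List[str] = []
    docs: List[Tuple[int, int, int]] = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            file_id = len(files)
            files.append(path)
            docs.append((file_id, 0, os.path.getsize(path)))
    return files, docs


if __name__ == "__main__":
    paths, documents = build_corpus_tables(".")
    for file_id, start, end in documents[:5]:
        print(paths[file_id], start, end)
```

Writing `files` and `docs` out as two plain-text files would give the pair of corpus-description files mentioned above; a splitter that finds several documents per file would simply append more rows to `docs` for the same `file_id`.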
In many cases, the
files we will be indexing will have a great deal of syntactic structural
information above and beyond the meaningful text itself. For example,
our email will often contain a great deal of mail header information, as
specified in RFC822.
The basic data elements to be parsed from our two
examples, email and AIT, are shown in Figure (figure).
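For the email case, separating header fields from the message text can be sketched with Python's standard `email` package. The sample message, its addresses, and the helper name `split_message` are hypothetical and serve only to illustrate the header/body distinction.

```python
from email import message_from_string
from typing import Dict, Tuple


def split_message(raw: str) -> Tuple[Dict[str, str], str]:
    """Split a single-part message into its header fields and body text."""
    msg = message_from_string(raw)
    headers = dict(msg.items())  # a repeated header keeps only its last value
    body = msg.get_payload()     # a plain string for non-multipart messages
    return headers, body


RAW = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Test corpus\n"
    "\n"
    "This is the meaningful text.\n"
)

if __name__ == "__main__":
    headers, body = split_message(RAW)
    print(headers["Subject"])
    print(body)
```

Only `body` would normally be passed on for indexing, while selected headers such as the sender or subject could be kept as separate fields.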
Inter-document parsing
dirent. The \texttt{dirent}
interface began with a Berkeley Software Distribution (BSD)
specification written by Kirk McKusick in the mid-1980s. It has since
become part of the POSIX standard. Ports to various platforms (Linux,
MS-DOS, MacOS) are available [Gwyn94].
RFC822.
Many word processing
systems and markup formats, for example \TeX, XML and HTML, now produce
documents with a well-defined syntax. If, for example, the documents are
written in HTML, we don't want to index the markup tags as if they were
words. In many of these situations, filters exist that can extract just
the meaningful text from surrounding header or format information; DeTeX
is an example of a useful filter for removing \LaTeX\ and \TeX\ markup.
Use of such utilities spares us the task of parsing this elaborate
structure, but it also means that we must find more elaborate ways of
maintaining the difference between the document's index and the
document's presentation.
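A filter of this kind for HTML can be sketched with the standard library's `html.parser` module. This is a minimal illustration rather than a production filter: the class and function names are hypothetical, and the sketch keeps all character data, including the contents of script or style elements.

```python
from html.parser import HTMLParser
from typing import List


class TextOnly(HTMLParser):
    """Collect character data and ignore the tags around it."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_markup(html: str) -> str:
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    print(strip_markup("<p>Hello <b>world</b></p>"))
```

Because the filter discards the markup, the offsets of words in its output no longer match their offsets in the original file, which is one form of the index-versus-presentation difference noted above.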