FOA Home
In these experiments, and the rest to follow, we will consistently be
using two example corpora.
The first of these we will call the
``Artificial Intelligence Thesis" corpus. This is approximately 5,000
Ph.D. and Masters dissertations and abstracts. Virtually every
dissertation published within the last 30 years has been microfilmed by
University Microfilms, Inc.. These were selected by Suzanne Humphries.}
The corpus is a fairly exhaustive set of theses classified as AI by
University Microfilms, Inc., from the years 1987 to 1997. A histogram of
the theses distribution by year is shown in Figure (figure) .
We
will focus on a handful of characteristics of each thesis: \item
Thesis\# \item Title \item Author \item Year \item University \item
Advisor \item Language \item Abstract \item Degree ree \eit
For now, we
will lump these attributes into two categories: TEXTUAL FIELDS
and STRUCTURED ATTRIBUTES . Structured attributes are ones for
which we are able to reason more formally, using database and artificial
intelligence techniques. For now, we will concentrate on only the
textual fields. The abstract will be the primary textual element
associated with each thesis, while its title (also a textual field) will
be used as its PROXY - a synopsis of the thesis that conveys much
of its meaning in a highly abbreviated form. Proxies will prove very
important surrogates for the documents (for example when users are
presented with hitlists of retrieved documents).
The second corpus we
will study could not be provided on the CD because you must provide it
yourself - it is all of {your} email. We will assume that with disk
storage as cheaply available as it is today, you have at some point
begun to collect email. Typically some of this will be email others have
sent you, but you may have also kept a copy of all of your ``outgoing"
email as well. Many email clients support on-the-fly segregation of
email into separate folders. Minimally, this means that our procedures
for indexing email must be capable of traversing elaborate directory
structures. Later we will also consider the use of this user-generated
structure as a source for learning; cf. Section §7.4 .
The directory in which you have filed
an email message is one feature we may associate with each message;
whether it is an incoming or outgoing message is another. But of course
email also has many structured attributes associated with it, in its
HEADER . These include: \bit \item From: \item To: \item Cc:
\item Subject: \item Date: e: \eit
In general we will put off all
consideration of structured attributes associated with documents until
later chapters. For now simply note the many parallels between our two
example corpora: both have well-defined authors, well-defined
time-stamps, and excellent and obvious candidates for proxy text.
Top of Page
Example corpora