FOA Home | UP: Extracting Lexical Features


Example corpora

In these experiments, and the rest to follow, we will consistently be using two example corpora.

The first of these we will call the ``Artificial Intelligence Thesis" corpus. This is approximately 5,000 Ph.D. and Masters dissertations and abstracts. Virtually every dissertation published within the last 30 years has been microfilmed by University Microfilms, Inc.. These were selected by Suzanne Humphries.} The corpus is a fairly exhaustive set of theses classified as AI by University Microfilms, Inc., from the years 1987 to 1997. A histogram of the theses distribution by year is shown in Figure (figure) .

We will focus on a handful of characteristics of each thesis: \item Thesis\# \item Title \item Author \item Year \item University \item Advisor \item Language \item Abstract \item Degree ree \eit

For now, we will lump these attributes into two categories: TEXTUAL FIELDS and STRUCTURED ATTRIBUTES . Structured attributes are ones for which we are able to reason more formally, using database and artificial intelligence techniques. For now, we will concentrate on only the textual fields. The abstract will be the primary textual element associated with each thesis, while its title (also a textual field) will be used as its PROXY - a synopsis of the thesis that conveys much of its meaning in a highly abbreviated form. Proxies will prove very important surrogates for the documents (for example when users are presented with hitlists of retrieved documents).

The second corpus we will study could not be provided on the CD because you must provide it yourself - it is all of {your} email. We will assume that with disk storage as cheaply available as it is today, you have at some point begun to collect email. Typically some of this will be email others have sent you, but you may have also kept a copy of all of your ``outgoing" email as well. Many email clients support on-the-fly segregation of email into separate folders. Minimally, this means that our procedures for indexing email must be capable of traversing elaborate directory structures. Later we will also consider the use of this user-generated structure as a source for learning; cf. Section §7.4 .

The directory in which you have filed an email message is one feature we may associate with each message; whether it is an incoming or outgoing message is another. But of course email also has many structured attributes associated with it, in its HEADER . These include: \bit \item From: \item To: \item Cc: \item Subject: \item Date: e: \eit

In general we will put off all consideration of structured attributes associated with documents until later chapters. For now simply note the many parallels between our two example corpora: both have well-defined authors, well-defined time-stamps, and excellent and obvious candidates for proxy text.


Top of Page | UP: Extracting Lexical Features | ,FOA Home


FOA © R. K. Belew - 00-09-21