When ``documents'' were first introduced as part of the FOA process, they
were presented as members of the set of potential, pre-defined answers to users'
queries. Here we will ground this abstract view in practical terms that
can be readily applied, for example to the searches that are now so
common on the Web. Our goal will be to balance this practical
description of how search engines work today with the abstract FOA view
that goes beyond current practices to other kinds of search still to
come.
A useful working definition is that a DOCUMENT is a {\em
passage of free text}. It is composed of text, strings of characters
from an alphabet. We'll typically make the (English) assumption that
text uses the Roman alphabet, Arabic numerals and standard punctuation;
complications like font styles (italics, bold), and especially non-Roman
MARKED ALPHABETS that add characters like \term{\"{a}},
\term{\c{C}}, \term{\~{N}}, \term{\ae}, etc., and the iconic characters
of Asian languages, require even more thought.
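As a rough illustration of the extra thought marked characters require, the sketch below shows one common way to fold them onto unmarked Roman letters using Python's standard `unicodedata` module. The function name `fold_marks` is hypothetical, and folding is only one possible policy: it discards distinctions that matter in many languages, and characters such as \term{\ae} have no decomposition and pass through unchanged.

```python
import unicodedata


def fold_marks(text):
    """Remove combining diacritical marks, keeping the base characters."""
    # NFD splits a marked character into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark; everything else is kept as-is.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = fold_marks("ä Ç Ñ æ")
print(folded)  # marked letters lose their marks; the ligature is untouched
```

Whether a search engine should fold marks at all is a design decision that depends on the corpus and its readers.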
By ``free'' text we mean
it is in natural language, the sort native readers and writers use
easily. Good examples of free text might be a newspaper article, a
journal paper, or a dictionary definition. Typically the text will be
grammatically well-formed language, in part because this is {\em written}
language, not oral. People are more careful when constructing written
artifacts that last beyond the moment. Informal texts like email
messages, on the other hand, show that some texts can
retain the spontaneity of oral communication, for better and worse [REF803].
Finally, we will be interested
in PASSAGES of such text, of arbitrary size. The newspaper
example makes us imagine documents of a few thousand words, but journal
articles make us think of samples ten times that large, and email
messages make us think of something only a tenth as long. We can even
think of an entire book as a single document. All such passages satisfy
our basic definition - they might be appropriate answers to a search
about some topic.
The length of the documents will prove to be a
critical issue in FOA search engine design, especially when the corpus
contains documents of widely varying lengths. The reason is, roughly,
that since longer documents are capable of discussing more topics, they
are capable of being about more. Longer documents are more likely
to be associated with more keywords, and hence more likely to be
retrieved (cf. §3.4.2).
One
possible response is to make a simple but very consequential assumption:
All documents have equal \about-ness.
In other words, if we ask for the ({\em a
priori}) probability of any document in the corpus being considered
relevant, we will assume all are equiprobable. This would lead us to
{\em normalize} documents' indices in some way to compensate for
differing lengths. The normalization procedure is a matter of
considerable debate; we will return to consider it in depth later (cf.
§3.4.2).
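To make the length effect concrete, here is a minimal sketch, not the normalization developed later in the text. It assumes whitespace tokenization and the simplest possible normalization, dividing the keyword match count by the document's token count. The function names and the toy documents are illustrative only.

```python
from collections import Counter


def raw_score(text, keywords):
    """Count how many tokens in the text match any query keyword."""
    counts = Counter(text.lower().split())
    return sum(counts[k] for k in keywords)


def normalized_score(text, keywords):
    """Divide the raw match count by document length (in tokens)."""
    n_tokens = len(text.split())
    return raw_score(text, keywords) / n_tokens if n_tokens else 0.0


keywords = {"search", "documents"}
short_doc = "search engines index documents"
# A longer document: the same sentence three times, padded with filler words.
long_doc = (short_doc + " filler filler filler filler ") * 3

print(raw_score(short_doc, keywords), raw_score(long_doc, keywords))
print(normalized_score(short_doc, keywords), normalized_score(long_doc, keywords))
```

On raw counts the longer document wins simply because it contains more words; after dividing by length the ordering reverses. Which behavior is preferable is exactly the point under debate.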
For now, we will take a
different tack towards the issue of document length, as captured by an
alternative pair of assumptions:
The smallest unit of text with
appreciable \about-ness is the paragraph.
All manner of longer documents
are constructed out of basic paragraph atoms.
The first piece of this
argument is that the smallest sample of text that can reasonably be
expected to satisfy a FOA request is a paragraph. The claim is that a
word, even a sentence, does not by itself provide enough {context} for
any question to be answered, or ``found out about.'' But if the
paragraph has been well-constructed, as defined by conventional rules of
composition, it should answer many such questions. And unless the text
comes from James Joyce, Proust, or Jorge Luis Borges, we can expect paragraphs
to occupy about half an average screen page -- nicely viewable chunks.
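If paragraphs are to serve as the atomic retrieval unit, a longer text must first be broken into them. The sketch below assumes plain text in which paragraphs are separated by one or more blank lines; real corpora mark paragraphs in many other ways, and the function name `split_paragraphs` is hypothetical.

```python
import re


def split_paragraphs(text):
    """Split plain text into paragraphs, assuming blank-line separators."""
    # A paragraph break is a newline, optional whitespace, then another newline.
    parts = re.split(r"\n\s*\n", text)
    # Collapse internal line breaks and runs of spaces; skip empty pieces.
    return [" ".join(p.split()) for p in parts if p.strip()]


sample = "First paragraph, line one.\nLine two.\n\n\nSecond paragraph.\n"
for number, paragraph in enumerate(split_paragraphs(sample), start=1):
    print(number, paragraph)
```

Each resulting string could then be indexed as its own document.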
The second assumption, that longer documents are built from paragraph
atoms, alludes to the range of structural relationships by
which the atomic paragraphs can typically be strung together to form
longer passages. First and foremost is simple sequential flow, the order
in which an author expects the paragraphs to be read. The sequential
nature of traditional printed media, from the first papyrus scrolls to
modern books and periodicals, has meant that a sequential ordering over
paragraphs has been dominant. It may even be that the modern human is
especially capable of understanding {\em rhetoric} of this form (cf. §6.2.3).
In any case, a sequential
ordering of paragraphs is just one possible way they might be related.
Other common relationships include:
\item {\em hierarchical} structure composing paragraphs into sub-sections, sections, and chapters;
\item {\em footnotes}, embellishing the primary theme;
\item {\em bibliographic citations} to other, previous publications;
\item references to other sections of the same document; and especially
\item {\em pedagogical prerequisite} relationships ensuring that conceptual foundations are established prior to subsequent discussion.
Of course
each of these relationships has grown up within the tradition of printed
publication. Special typographical conventions (boldface, italics, sub-
and superscripting, margins, rules) have arisen to represent them and
distinguish them from sequential flow.
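One way to picture these relationships is as typed links between paragraph atoms. The following is a minimal sketch under that assumption; the class name `Paragraph`, the relation labels, and the identifiers are all hypothetical and chosen only to mirror the relationships named in the text.

```python
from dataclasses import dataclass, field

# Illustrative labels for sequential flow and the other relationships above.
RELATIONS = {
    "next", "parent", "footnote",
    "citation", "cross_reference", "prerequisite",
}


@dataclass
class Paragraph:
    pid: str
    text: str
    links: list = field(default_factory=list)  # (relation, target id) pairs

    def link(self, relation, target):
        """Record a typed link from this paragraph to another item."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.links.append((relation, target))

    def targets(self, relation):
        """Return the ids this paragraph points to via the given relation."""
        return [t for r, t in self.links if r == relation]


p1 = Paragraph("p1", "Defines a document as a passage of free text.")
p2 = Paragraph("p2", "Discusses why document length matters.")
p1.link("next", "p2")           # sequential flow
p2.link("prerequisite", "p1")   # p1 should be understood before p2
print(p1.targets("next"), p2.targets("prerequisite"))
```

Sequential flow is then just one relation among several.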
But new, electronic media now
available to readers (and becoming available to authors) need not follow
the same strictly linear flow. The new capabilities and problems of
traversing text in nonlinear ways -- HYPERTEXT -- have been
discussed by some visionaries [REF701]
[Nelson87] for decades. This new
technology certainly permits us to make some traversals more easily
(e.g., jumping to a cited reference with the click of a button rather
than via a trip to the library), but this same ease may make it more
difficult for an author to present a cogent argument.
For now we will not
worry about just how arguments can be formed with nonlinear hypermedia.
The two paragraph assumptions simply allow us to
infer the earlier assumption of equal aboutness: if all the documents are
paragraphs, we can expect them to have virtually uniform `aboutness'.
These too are simplifying assumptions, however. In an important sense a
scientific paper's abstract is about the same content as the rest
of the paper, and a newspaper article's first paragraph attempts to
summarize the details of the following story. These issues of a text's
LEVEL OF TREATMENT will be discussed later.