Zipf observed that the frequency of words' occurrence varies
dramatically, and Poisson models explore deviations of these occurrence
patterns from purely random processes. We now make the first important
move towards a theory of why some words occur more frequently
and how such statistics can be exploited when building an index
automatically. Luhn, as far back as 1957, said clearly: "It is hereby
proposed that the frequency of word occurrence in an article furnishes a
useful measurement of word significance." [Luhn57] That is, if a word occurs
more frequently than we would expect it to within a corpus,
then it reflects emphasis on the part of the author
about that topic. But the raw frequency of occurrence in a
document is only one of two critical statistics recommending good
keywords.
Consider a document taken from our AIT corpus, and imagine
using the keyword \term{ARTIFICIAL INTELLIGENCE} with it. By construction, virtually every document in
the AIT is about \term{ARTIFICIAL INTELLIGENCE}! Assigning the
keyword \term{ARTIFICIAL INTELLIGENCE} to any document in AIT would be a
mistake, not because this document isn't about \term{ARTIFICIAL
INTELLIGENCE}, but because this term cannot help us {\em discriminate}
one subset of our corpus as relevant to any query. If we change our
search task to looking not only in our AIT corpus but through a much
larger collection (for example, all computer industry newsletters) then
associating \term{ARTIFICIAL INTELLIGENCE} with those articles in our
AIT subcorpus becomes a good idea. This term helps to distinguish AI
documents from others.
The second critical characteristic of good index terms
now becomes clear: a good index term not only characterizes a document
{\em absolutely}, as a feature of a document in isolation, but also allows
us to discriminate it {\em relative} to other documents in the corpus.
Hence keywords are not strictly properties of any single document, but
reflect a relationship between an individual document and the collection
from which it might be selected.
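
To make this relative character concrete, here is a minimal sketch that measures what fraction of a collection contains a given term. The function name `document_fraction` and the tiny collections are made up for illustration; they are not drawn from the AIT corpus itself.

```python
def document_fraction(term, docs):
    """Fraction of documents (each a set of terms) that contain `term`."""
    if not docs:
        return 0.0
    return sum(1 for d in docs if term in d) / len(docs)


# Toy, made-up collections; each document is reduced to a set of terms.
ai_docs = [
    {"artificial intelligence", "planning"},
    {"artificial intelligence", "vision"},
    {"artificial intelligence", "learning"},
]
other_docs = [
    {"databases", "indexing"},
    {"networks", "routing"},
    {"compilers", "parsing"},
]
larger_collection = ai_docs + other_docs

# Within the AI-only subcorpus the term appears in every document,
# so it cannot separate any subset from the rest.
in_subcorpus = document_fraction("artificial intelligence", ai_docs)

# Within the larger collection the same term splits the documents.
in_larger = document_fraction("artificial intelligence", larger_collection)

print(in_subcorpus, in_larger)
```

The same term, attached to the same documents, goes from useless to useful purely because the surrounding collection changed.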
These two countervailing considerations
suggest that the best keywords will not be the most ubiquitous,
frequently occurring terms, nor those that occur only once or twice, but
rather those occurring a moderate number of times. Using Zipf's rank
ordering of words as a baseline, Luhn hypothesized a modal function of a
word's rank he called RESOLVING POWER centered exactly at the
middle of this rank ordering. If resolving power is defined as a word's
ability to {\em discriminate} content, Luhn assumed that this quantity
is maximal at the middle and then falls off at either very high or very
low frequency extremes, as shown in the accompanying figure. The next
step is then to establish maximal and minimal occurrence
thresholds defining useful, mid-frequency index terms.
Unfortunately, Luhn's view does not provide theoretical grounds for
selecting these bounds, and so we are reduced to the engineering task of
tuning them for optimal performance.
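
As a rough sketch of this tuning task, the thresholds can be treated as two free parameters applied to corpus-wide term counts. The names `mid_frequency_terms`, `min_count` and `max_count` are hypothetical, and the toy documents are invented; in practice the bounds would have to be chosen empirically for each corpus.

```python
from collections import Counter


def mid_frequency_terms(docs, min_count, max_count):
    """Return terms whose total corpus frequency lies in [min_count, max_count].

    `docs` is a list of token lists. Both thresholds are tuning parameters;
    nothing in Luhn's account tells us what values to use.
    """
    counts = Counter(token for doc in docs for token in doc)
    return sorted(t for t, c in counts.items() if min_count <= c <= max_count)


# Toy, made-up documents, already tokenized.
toy_docs = [
    "the search for the index term".split(),
    "the index of the corpus".split(),
    "the corpus and the query".split(),
]

# Drop the ubiquitous "the" (6 occurrences) and the words seen only once.
selected = mid_frequency_terms(toy_docs, min_count=2, max_count=3)
print(selected)
```

Counting raw occurrences is only one possible choice here; counting the number of documents containing each term would fit the same two-threshold scheme.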
We'll begin with the
maximal-frequency threshold, used to exclude words that occur too
frequently. For any particular corpus, it is interesting to contrast
this set of most-common words with the negative dictionary of noise
words, defined in Section 2.3.2.
While there may often be great overlap, the negative dictionary list is
typically a list that has proven itself to be practically useful across
many different corpora, while the most frequent tokens in a particular
corpus may be quite specific to it.
Establishing the other, low-frequency
threshold is less intuitive. Assuming that our index is to be of limited
size, including a certain keyword means we must exclude some other. This
suggests that a word that occurs in exactly one document cannot
regularly help to discriminate that document from others. For
example, imagine a word (suppose it is \term{DERIVATIVE}) that occurs exactly once, in
a single document. If we took out that word \term{DERIVATIVE} and put in
any other word, for example \term{FOOBAR}, in terms of the word
frequency co-occurrence statistics that are the basis of all our
indexing techniques, the relationship between that document and all the
other documents in the collection will remain unchanged! In terms of
overlap between what the word \term{DERIVATIVE} \means, in the FOA sense
of what this and other documents are \about, a single word occurrence
has no meaning!
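
The substitution argument can be checked on a toy example. The sketch below uses the number of shared distinct words as a stand-in for the co-occurrence statistics mentioned above; that overlap measure, the function name `pairwise_overlap`, and the three tiny documents are all assumptions made for illustration.

```python
from itertools import combinations


def pairwise_overlap(docs):
    """Number of distinct words shared by each pair of documents."""
    return {
        (i, j): len(set(docs[i]) & set(docs[j]))
        for i, j in combinations(range(len(docs)), 2)
    }


# Toy, made-up documents; "derivative" occurs exactly once in the collection.
original = [
    ["derivative", "pricing", "model"],
    ["pricing", "model", "risk"],
    ["risk", "model", "market"],
]

# Replace the singleton word with another word that occurs nowhere else.
swapped = [
    ["foobar" if word == "derivative" else word for word in doc]
    for doc in original
]

# The relationship between every pair of documents is unchanged.
print(pairwise_overlap(original) == pairwise_overlap(swapped))
```

Because the singleton word is never shared with any other document, it contributes nothing to any pairwise comparison, whichever word fills that slot.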
The most useful words
will be those that are not used so often as to be roughly common to all
of the documents, and not so rare as to be (nearly) unique to any one
(or small set) of documents. We seek those keywords whose
combinatorial properties, when used in concert with one another
as part of queries, help to compare and contrast topical areas of
interest against one another.
Resolving power