FOA Home
Before surveying all of the ways that evaluation might be
performed, it is worthwhile sketching how it has typically been done in
the past [Cleverdon63] . In the
beginning, computers were slow, had very limited disk space and even
more limited memories; initial test corpora needed to be small, too. One
benefit of these small corpora was that it allowed at least the
possibility of having a set of test queries compared
exhaustively against each and every document in the corpus.
The
source of these test queries, and the assessment of their relevance,
varied in early experiments. For example, in the Cranfield experiments
[Lancaster68] , 1400 documents in
metallurgy were searched according to 221 queries generated by some of
the documents' authors. In Saltan's experiments with the ADI corpus, 82
papers presented at a 1963 American Documentation Institute meeting were
searched against 35 queries and evaluated by students and ``staff
experts" associated with Saltan's lab [Salton68] . Lancaster's construction of
the MEDLARS collection was similar [Lancaster69] .
As computers have
increased in capacity, reasonable evaluation has required much larger
corpora. The Text Retrieval Conference (TREC), begun in 1992 and
continuing to the present, has set a new standard for search engine
evaluation [Harman95] . The TREC
methodology is notable in several respects. First, it avoids exhaustive
assessment of all documents by using the POOLING method, a
proposal for the construction of ``ideal'' test collections that
predates TREC by decades [SparckJones76] . The basic idea is
to use each search engine independently and then ``pool'' their results
to form a set of those documents that have at least this recommendation
of potential relevance. All search engines retrieve ranked lists of $k$
potentially relevant document, and the union of these retrieved
sets are presented to human judges for relevance assessment.
In the case
of TREC, $k=100$ and the human assessors were retired security analysts,
like those that work at the National Security Agency (NSA), watching the
world's communications. Since only documents retrieved by one of the
systems being tested are evaluated there remains the possibility that
relevant documents remain undiscovered and we might worry that our
evaluations will change as new systems retrieve new documents and these
are evaluated. Recent analysis seems to suggest that at least in the
case of the TREC corpus, evaluations in fact are quite stable [Voorhees98] .
An important consequence
of this methodological convenience is that unassessed documents are
assumed to be irrelevant. This creates an unfortunate dependence on
the retrieval methods used to nominate documents, which we can expect to
be most pronounced when the methods are similar to one another. For
example, if the alternative retrieved sets are the result of
manipulating single parameters of the same basic retrieval procedure,
the resulting assesments may have overlap with, and hence be useless for
comparison of, methods producing significantly different retrieval sets.
For the TREC collection, this problem was handled by drawing the top 200
documents from a wide range of 25 methods which had little overlap [Harman95] . Vogt [Vogt98] has explored how similarities and
differences between retrieval methods can be similarly exploited as part
of combined, hybrid retrieval systems (cf. Section §7.4.4 ).
It is also possible to
sample a small subset of a corpus, submit the entire sample to
review by the human expert, and extrapolate from the number of
relevant documents found to an expected number across the entire corpus.
One famous example of this methodology is Blair \& Maron's assessment of
IBM's STAIRS retrieval system [Blair85] of the early 1980's. This
evaluation studied the real-world use of STAIRS by a legal firm as part
of a LITIGATION SUPPORT task: 40000 memos, design documents, etc.
were to be searched with respect to 51 different queries. The lawyers
themselves then agreed to evaluate documents' relevance. As they report:
To find the unretrieved relevant documents we developed sample frames
consisting of subsets of unretrieved database that we believed
to be rich in relevant documents\ldots. Random samples were taken from
these subsets, and thesamples were examined by the lawyers in a blind
evaluation; the lawyers were not aware they were evaluating sample sets
rather than retrieved sets they had personally generated. The total
number of relevant documents that existed in these subsets could then be
estimated. We sampled from subsets of the database rather than the
entire database because, for most queries, the percentage of relevant
documents in the database was less than \%2, making irt almost
impossible t have both manageable sample sizes and a high level of
confidence in the resulting Recall estimates. Of course, no
extrapolation to the entire database could be made from these Recall
calculations. Nonetheless, the estimation of the numbe of relevant
unretrieved documents in the subsets did give us a maximum value for
Recall for each request. [p. 291--293] This is a difficult methodology,
but it allows some of the best estimates of Recall available, still. And
their news was not good: on average retrievals captured only 20% of
Relevant documents!
In short, methodologies for valid search engine
evaluations require much more sophistication and care than generally
appreciated. Careful experimental design [Tague-Sutcliffe92] ,
statistical analysis [Hull93] and
presentation [Keen92] are all
critical.
Top of Page
Traditional evaluation methodologies