In most search engine evaluation, the assumption has been that a single
expert can be trusted to provide reliable relevance assessments. Whether
any one ``omniscient'' individual is capable of providing reliable data
about the appropriate set of documents to be retrieved remains a
foundational issue within IR. For example, a number of papers in a
recent special issue of the Journal of the American Society for
Information Science devoted to relevance advocated a move towards a
more ``user-centered, situational'' view of relevance [Froehlich94].
Our attention to the opinions of individual users suggests the
possibility of combining evidence from multiple human judges. Rather
than having relevance be a Boolean determination made by a single
expert, we will consider ``relevance'' to be a consensual central
tendency of the searching users' opinions. The relevance assessments of
individual users, and the resulting central tendency of relevance, are
suggested by Figure (figure). Two features of this definition are
significant.
First, consensual relevance posits a ``consumer's'' perspective on what
will count as IR system success. A document's relevance to a query is
not going to be determined by an expert in the topical area, but by the
users who are doing the searching. If they find it relevant, it's
relevant, whether or not some domain expert thinks the document
``should'' have been retrieved.
Second, consensual relevance becomes a statistical, aggregate property
of multiple users' reactions rather than a discrete feature elicited
from any one individual. Because relevance is now a statistical
measure, our confidence in the relevance of a document (with respect to
a query) increases as more relevance assessment data are collected.
This reliance on statistical stability creates a strong link between IR
and machine learning (cf. Chapter §7). Allen's investigation into the
idiosyncratic cognitive styles of browsing users [Allen92] and Wilbur's
assessment of the reliability of RelFbk across users [Wilbur98] provide
a more textured view of how multiple relevance assessments can be
compared and combined.
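The statistical reading of relevance described above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the method used in the text: each user's assessment is taken to be a binary vote, votes are treated as independent, and a Wilson score interval (our choice, not the source's) stands in for "confidence". The names `wilson_interval` and `consensual_relevance` are hypothetical.

```python
import math
from collections import defaultdict


def wilson_interval(positives, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the proportion could be anything
    p = positives / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def consensual_relevance(assessments):
    """Aggregate (user, query, doc, is_relevant) votes per (query, doc) pair.

    Assumption: votes are binary and independent across users.
    """
    counts = defaultdict(lambda: [0, 0])  # (query, doc) -> [positives, total]
    for _user, query, doc, is_relevant in assessments:
        tally = counts[(query, doc)]
        tally[0] += 1 if is_relevant else 0
        tally[1] += 1

    result = {}
    for key, (positives, n) in counts.items():
        low, high = wilson_interval(positives, n)
        result[key] = {
            "n": n,
            "estimate": positives / n,  # central tendency of the users' votes
            "low": low,
            "high": high,
        }
    return result


# Same 75% agreement, but from 4 users versus 40 users.
few = consensual_relevance(
    [("u%d" % i, "q1", "d1", i < 3) for i in range(4)]
)[("q1", "d1")]
many = consensual_relevance(
    [("u%d" % i, "q1", "d1", i < 30) for i in range(40)]
)[("q1", "d1")]
print(few["estimate"], few["high"] - few["low"])
print(many["estimate"], many["high"] - many["low"])
```

Both pools yield the same point estimate, but the interval computed from forty users is narrower than the one computed from four, which is the sense in which confidence grows as more assessments are collected.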
It seems, however, that our move from omniscient to consensual
relevance has only made the problem of evaluation that much more
difficult. Test corpora must be large enough to provide robust tests
for retrieval methods, and multiple queries are necessary in order to
evaluate the overall performance of a search engine. Getting even a
single person's opinion about the relevance of a document to a
particular query is hard, and we are now interested in getting many!
However, software like RAVe (cf. Section §4.4) allows an IR
experimenter to collect large numbers of relevance assessments for an
arbitrary document corpus.
Consensual relevance