FOA Home | UP: INDIVIDUALS' assessment: search engine performance


Consensual relevance

In most search engine evaluation, the assumption has been that a single expert can be trusted to provide reliable relevance assessments. Whether any one "omniscient" individual is capable of providing reliable data about the appropriate set of documents to be retrieved remains a foundational issue within IR. For example, a number of papers in a recent special issue of the Journal of the American Society for Information Science devoted to relevance advocated a move towards a more "user-centered, situational" view of relevance [Froehlich94].

Our attention to the opinions of individual users suggests the possibility of combining evidence from multiple human judges. Rather than having relevance be a Boolean determination made by a single expert, we will consider "relevance" to be a consensual, central tendency of the searching users' opinions. The relevance assessments of individual users, and the resulting central tendency of relevance, are suggested by Figure (figure). Two features of this definition are significant. First, consensual relevance posits a "consumer's" perspective on what will count as IR system success. A document's relevance to a query is not going to be determined by an expert in the topical area, but by the users who are doing the searching. If they find it relevant, it's relevant, whether or not some domain expert thinks the document "should" have been retrieved.
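
One way to make the idea of a central tendency concrete is the following minimal sketch. It assumes each user gives a binary judgment (1 for relevant, 0 for not relevant) on a query/document pair and takes the mean of those judgments as the consensual score. The function name, the tuple layout, and the choice of the mean are illustrative assumptions, not a description of any particular system.

```python
from collections import defaultdict

def consensual_relevance(judgments):
    """Average binary judgments per (query, doc) pair.

    `judgments` is an iterable of (user, query, doc, relevant) tuples,
    where `relevant` is 1 or 0. This layout is a hypothetical one chosen
    for illustration.
    """
    votes = defaultdict(list)
    for _user, query, doc, relevant in judgments:
        votes[(query, doc)].append(relevant)
    # The mean of 0/1 votes is the fraction of users who judged the
    # document relevant to the query.
    return {pair: sum(v) / len(v) for pair, v in votes.items()}

# Made-up judgments from three hypothetical users.
judgments = [
    ("u1", "q1", "d1", 1),
    ("u2", "q1", "d1", 1),
    ("u3", "q1", "d1", 0),
    ("u1", "q1", "d2", 0),
    ("u2", "q1", "d2", 0),
]
scores = consensual_relevance(judgments)
```

Here no expert opinion enters the computation: a document's score depends only on what the searching users reported. Other central tendencies (a median over graded judgments, or a majority vote) would fit the same definition.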

Second, consensual relevance becomes a statistical, aggregate property of multiple users' reactions rather than a discrete feature elicited from any one individual. By making relevance a statistical measure, our confidence in the relevance of a document (with respect to a query) increases as more relevance assessment data is collected. This reliance on statistical stability creates a strong link between IR and machine learning (cf. Chapter §7). Allen's investigation into idiosyncratic cognitive styles of browsing users [Allen92], and Wilbur's assessment of the reliability of RelFbk across users [Wilbur98], provide a more textured view of how multiple relevance assessments can be compared and combined.
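
The claim that confidence grows with the amount of assessment data can be illustrated with a standard confidence interval for a proportion. The sketch below uses the Wilson score interval, which is our choice for illustration rather than a method named in the text; the function names and the 95% level (z = 1.96) are likewise assumptions.

```python
import math

def wilson_interval(relevant_votes, total_votes, z=1.96):
    """Wilson score interval for the fraction of users judging a
    document relevant. z = 1.96 corresponds to roughly 95% confidence."""
    if total_votes == 0:
        return (0.0, 1.0)  # no data: the proportion is unconstrained
    p = relevant_votes / total_votes
    denom = 1 + z * z / total_votes
    centre = (p + z * z / (2 * total_votes)) / denom
    half = z * math.sqrt(
        p * (1 - p) / total_votes + z * z / (4 * total_votes ** 2)
    ) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def interval_width(relevant_votes, total_votes):
    """Width of the interval: smaller means more confidence."""
    low, high = wilson_interval(relevant_votes, total_votes)
    return high - low

# Same observed proportion (80% relevant), increasing numbers of judges.
widths = [interval_width(8 * k, 10 * k) for k in (1, 10, 100)]
```

With the observed proportion held fixed, each entry of `widths` is smaller than the one before it: the estimate of consensual relevance tightens as more users contribute assessments.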

It seems, however, that our move from omniscient to consensual relevance has only made the problem of evaluation that much more difficult. Test corpora must be large enough to provide robust tests for retrieval methods, and multiple queries are necessary in order to evaluate the overall performance of a search engine. Getting even a single person's opinion about the relevance of a document to a particular query is hard, and we are now interested in getting many! However, software like RAVe (cf. Section §4.4) allows an IR experimenter to effectively collect large numbers of relevance assessments for an arbitrary document corpus.




FOA © R. K. Belew - 00-09-21