Figure (figure) shows the relationship between relevant (\Rel)
and retrieved (\Ret) sets as a Venn Diagram, against the backdrop of the
universe $U$ of the rest of the documents of the corpus. Obviously our
focus should be on those documents that are in the intersection
of \Rel and \Ret, and making this intersection as large as possible.
Informally, we will be most happy with a \Ret set when it best overlaps
with the \Rel set, and therefore seek evaluation measures which reflect
this. The basic relations between the sizes of these sets can
also be captured in the CONTINGENCY TABLE of Figure
(figure). Its marginal totals include the number of documents not retrieved;
$\mathname{NDoc}$ is the total number of documents, $\mathname{NRel}$ is
the number of relevant documents, and $\mathname{NNRel}$ is the number of
irrelevant documents.
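The four cells of such a contingency table can be counted directly from the sets. The following is only a minimal sketch: the function name `contingency_table`, the dictionary keys, and the toy document ids are hypothetical and chosen for illustration, not taken from the text.

```python
def contingency_table(universe, retrieved, relevant):
    """Count the four cells of the retrieved/relevant contingency table."""
    universe, retrieved, relevant = set(universe), set(retrieved), set(relevant)
    not_retrieved = universe - retrieved
    not_relevant = universe - relevant
    return {
        "ret_rel": len(retrieved & relevant),           # retrieved and relevant
        "ret_nrel": len(retrieved & not_relevant),      # retrieved but irrelevant
        "nret_rel": len(not_retrieved & relevant),      # relevant but missed
        "nret_nrel": len(not_retrieved & not_relevant), # neither
    }


# Toy corpus of 10 document ids (an assumption for illustration only).
universe = range(1, 11)
retrieved = {1, 2, 3, 4}
relevant = {3, 4, 5}
table = contingency_table(universe, retrieved, relevant)
```

Every document falls into exactly one cell, so the four counts always sum to the size of the corpus.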
We know we want the intersection of the \Rel and \Ret sets
to be large, but large relative to what? As mentioned in Chapter 1, if
we are most focused on the \Rel set and use it as our standard of
comparison, we'd like to know what fraction of these we've retrieved.
This ratio is called RECALL: \beq \mathname{Recall}\equiv
{{\left| \mathname{Ret}\cap\mathname{Rel} \right| }
\over{\left|\mathname{Rel}\right|}} \eeq
Anticipating the probabilistic
analysis of Section §5.5, we can think
of Recall as (an estimate of) the conditional probability that
a document will be retrieved, given that it is relevant:
$\mathname{Pr}(\mathname{Ret}|\mathname{Rel})$.
Conversely, if we instead
focus on the \Ret set, we are most interested in what fraction of these
are relevant; this ratio is PRECISION: \beq \mathname{Precision}\equiv
{{\left| \mathname{Ret}\cap\mathname{Rel} \right|}
\over{\left|\mathname{Ret}\right|}} \eeq Similarly, this is the
probability that a document will be relevant, given that it is
retrieved: $\Pr(\mathname{Rel}|\mathname{Ret})$. A closely related but
less common measure is called FALLOUT, where we (perversely!)
focus on the irrelevant documents and the fraction of them retrieved:
\beq \mathname{Fallout}\equiv
{\left|{\mathname{Ret}\cap\overline{\mathname{Rel}}}\right| \over
{\left|\overline{\mathname{Rel}}\right|}} \eeq This is
$\Pr(\mathname{Ret}|\overline{\mathname{Rel}})$. These two measures,
Recall and Precision, have remained the bedrock of search engine
evaluation since they were first introduced by Kent in 1955 [Kent55] [Saracevic75].
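As a concrete illustration of the three ratios, here is a minimal sketch using Python sets. The function names and the toy document ids are assumptions made for this example; Fallout is computed as the fraction of irrelevant documents that were retrieved, matching $\Pr(\mathname{Ret}|\overline{\mathname{Rel}})$.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)


def fallout(retrieved, relevant, universe):
    """Fraction of the irrelevant documents that were retrieved."""
    irrelevant = universe - relevant
    return len(retrieved & irrelevant) / len(irrelevant)


# Toy corpus (hypothetical): 10 documents, 4 retrieved, 3 relevant.
universe = set(range(1, 11))
retrieved = {1, 2, 3, 4}
relevant = {3, 4, 5}

r = recall(retrieved, relevant)             # 2 of 3 relevant documents found
p = precision(retrieved, relevant)          # 2 of 4 retrieved documents relevant
f = fallout(retrieved, relevant, universe)  # 2 of 7 irrelevant documents retrieved
```

Note that the sketch does not guard against empty \Rel or \Ret sets, for which Recall or Precision would be undefined.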
The close relationship
between these three measures can be stated precisely if the generality
$G$ of the query (cf. Sect. §4.3.7)
is known: \beq \mathname{Precision}= {{\mathname{Recall}\cdot
{G}}\over{{\mathname{Recall}\cdot {G}} + {\mathname{Fallout}\cdot
(1-{G})}}} \eeq In practice, search engine performance is most often
reported with just the pair of measures, precision and recall.
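The identity relating Precision, Recall, Fallout and $G$ can be checked numerically. The sketch below assumes that generality is the fraction of the corpus that is relevant to the query, $G = |\mathname{Rel}| / \mathname{NDoc}$; the function name and the toy counts are hypothetical.

```python
def precision_from_recall_fallout(recall, fallout, generality):
    """Precision = Recall*G / (Recall*G + Fallout*(1 - G))."""
    hits = recall * generality                    # Pr(Ret and Rel)
    false_alarms = fallout * (1.0 - generality)   # Pr(Ret and not Rel)
    return hits / (hits + false_alarms)


# Toy counts (assumed for illustration): 10 documents, 3 relevant,
# 4 retrieved, 2 of them both retrieved and relevant.
n_doc, n_rel, n_ret, n_both = 10, 3, 4, 2

recall = n_both / n_rel
fallout = (n_ret - n_both) / (n_doc - n_rel)
generality = n_rel / n_doc

direct_precision = n_both / n_ret
derived_precision = precision_from_recall_fallout(recall, fallout, generality)
```

Both routes give the same Precision, since the formula is just Bayes' rule applied to $\Pr(\mathname{Rel}|\mathname{Ret})$.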
Ideally,
of course, we'd like a system which has both high precision and
high recall: only relevant documents and all of them. But real-world,
practical systems must select documents based on features that are only
statistically useful indicators of relevance; we can never be sure. Under
these conditions, efforts to improve recall generally mean retrieving more
documents, and it is likely that precision will suffer as a consequence. The best
we can hope for is some balance.
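The tension between the two measures can be seen by walking down a ranked result list and retrieving one more document at each step. This is only a sketch under stated assumptions: the ranking, its relevance judgments, and the function name are invented for illustration.

```python
def precision_recall_at_cutoffs(ranked_relevance, n_relevant):
    """Return (cutoff, precision, recall) after each rank position."""
    points = []
    hits = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        hits += int(is_relevant)
        points.append((k, hits / k, hits / n_relevant))
    return points


# Hypothetical ranking: True marks a relevant document; 4 relevant in total.
ranking = [True, True, False, True, False, False, True, False]
points = precision_recall_at_cutoffs(ranking, n_relevant=4)

for k, p, r in points:
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
```

Recall can only grow as the cutoff deepens, while precision tends to drift downward as irrelevant documents are swept in along with the relevant ones.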
Basic measures