FOA Home
Perhaps the simplest model proceeds by imagining binary, independent
features; it is conventionally called (surprise!) the BINARY
INDEPENDENCE MODEL (BIM) [Robertson76] [vanR77] . First, the binary assumption is
that all the features $x_{i}$ are binary. This is not a very restrictive
assumption and is used only to simplify the derivation.
The much bigger
assumption is that the documents' features occur independently of one
another. We have discussed the problems with such an assumption before.
quotes J. H. Williams' expression of the paradox: The assumption of
independence of words in a document is usually made as a matter of
mathematical convenience. Without the assumption, many of the subsequent
mathematical relations could not be expressed. With it, many of the
conclusions should be accepted with extreme caution. [Williams65] . The key advantage it
allows is that the probability of a feature vector ${\bf x}$ becomes
simply the product of the MARGINAL PROBABILITIES of the
individual features: \Pr({\bf x} | \mathname{Rel}) = \prod_{i} \Pr( x_i
| \mathname{Rel}) Very convenient, and very unrealistic. {Cooper has
noted that we don't need quite such a restrictive notion of independence
[Cooper94] . All that is required is
to know that the ratio of ``linked'' ir/relevant feature
probabilities \[ \Pr({x_i} | \mathname{Rel})\over \Pr({x_i} |
\overline{\mathname{Rel}}) \] are independent of one another. While
slightly less restrictive, this assumption seems no more realistic.}
Applying
this decomposition to our odds calculation gives:
\mathname{Odds}(\mathname{Rel} | {\bf x}) =
\mathname{Odds}(\mathname{Rel}) \cdot \prod_i {\Pr({x_i} |
\mathname{Rel})\over \Pr({x_i} | \overline{\mathname{Rel}})}
It will be
convenient to introduce the variables $p_i$ and $q_{i}$ to capture the
probabilities that feature $x_{i}$ is present, given that a document
is/not relevant: p_i & \equiv & \Pr(x_i=1 | \mathname{Rel}) \\ q_i &
\equiv & \Pr(x_i=1 | \overline{\mathname{Rel}}) The complimentary
probabities, concerning documents in which the feature is
absent, can also be defined easily: 1-p_i & = & \Pr(x_i=0 |
\mathname{Rel}) \\ 1-q_i & = & \Pr(x_i=0 | \overline{\mathname{Rel}})
These definitions break the product into two portions, the first having
to do with those features that are present in a particular document and
the second those that are not: \mathname{Odds}(\mathname{Rel} | {\bf x})
= \mathname{Odds}(\mathname{Rel}) \cdot \prod_{x_i=1} {p_i\over
q_i}\cdot \prod_{x_i=0} {{1-p_i}\over{1-q_i}}
Recall that both queries
and documents live within the same vector space defined over the
features $x_{i}$. The two products of Equation
(FOAref) (defined in terms of presence/absence of a feature in
a document) can be further broken into four subcases depending
on whether the features do or do not occur in the query. We next make
another ``background'' assumption concerning all of those features
$x_{i}$ that are not in both the query and document of
current interest, viz., that the probability of these features being
present in relevant and irrelevant documents is equal: $p_i = q_i$. That
is, for those terms we don't care about (because they don't affect this
query/document comparison) we are happy to think that the occurence of
these terms is independent of their ir/relevance.
Consider the sets $D$
and $Q$ shown in Figure (figure) defined in terms of those
features $x_i$ present and absent in the document and query,
respectively. $q_i$ of the presence of a feature in irrelevant
documents, but there is no intended, direct connection between these two
quantities.} Regrouping the two products of Equation (FOAref)
into four products created by the two sets $D$ and $Q$, the
$\frac{p_{i}}{q_{i}}$ terms cancel except in in the intersection of the
query and document (where the feature $x_{i}$ is present in both) and in
$Q \backslash D$ , the set difference of $Q$ take-away $D$:
In the
retrieval situation we will exploit the sparseness that makes it much
more efficient to keep track of where a feature does occur ( $x_{i}=1$
)than all the places it does not ( $x_{i}=0$ ). Since the second product
is defined over all the features of $q$ except those in $d$ , if we are
careful to ``pre-multiply'' each feature in their intersection by a
reciprocal, we can then safely multiply everything in the query by the
same ratio:
The next section will show the utility of separating the last
term, which depends on features of the document in question, from the
first two which do not as part of an online retrieval calculation.
But
first, it is worthwhile considering how we might attempt to estimate
some of the required statistics. Fuhr [Fuhr92] , for example, considers the case
when we have RelFbk from a user who has evaluated each of the top
$N$ documents in an initial retrieval and has found $R$ of these to be
relevant (as well as evaluating all the $N-R$ rest of them and found
them to be irrelevant!). If a particular feature $x_{i}$ is present in
$n$ of the retrieved documents with $r$ of these relevant, then this bit
of RelFbk provides reasonable estimates for $p_{i}$ and $q_{i}$:
\widetilde{p_{i}} &=&\frac{r}{R} \\ \widetilde{q_{i}} &=&\frac{n-r}{N-R}
Top of Page
Binary Independence Model