FOA Home
With length normalization to the side, we can concentrate on the main
calculation of matching, summing the weight products of terms shared by
query $q$ and document $d$:
But this mathematical characterization hides
a number of messy details associated with actually computing it. We will
need to make efficient use of two data structures in particular. The
first is the inverted index (recall Figure (FOAref) .). The
critical feature of this organization is that we can, for each term in
the query, find the set of all doument postings associated with it. In
particular, the
Since the elements
of the sum can be re-ordered arbitrarily, we will consider a PARTIAL
RANKING (a.k.a. filtering, or pruning) algorithm that attempts to
include the dominant elements of the sum but ignore small,
inconsequential ones [Salton89] [Buckley85] [Harman90] .
The fact that both $w_{kq}$
and $w_{kd}$ vary considerably suggests that, by ordering the inclusion
of products into the sum, we may be able to truncate this process when
they become smaller than we care to consider.
We therefore begin by
sorting the terms in decreasing order of query weights $w_{kq}$.
Considering terms in this order means we can expect to accumulate the
match score beginning with its largest terms. Then, the fact that our
list of postings was similarly (and not coincidentally:) ordered by
decreasing frequency means that: \left( \forall j>i\right)
\,w_{kd_{i}}>w_{kd_{j}}
Once these weights diminish below some threshold
$$, we can quit walking down the postings list. (In fact, it may be that
the weight associated with the very first posting is too small and we
can ignore all weights associated with this term.)
The second important
data structure is an ACCUMULATOR queue in which each document's
match score is maintained. Since each query term may add an additional
term to the match score for a particular document, these accumulators
will keep a running total for each document. For moderate corpus sizes,
it may not be unreasonable to allocate an accumulator for each document,
but this can demand too much memory for very large corpora. Define
$\mathname{NAccum}$ to be the number of accumulators we are willing to
allocate. Then one obvious way to restrict this set is to only allocate
an accumulator when a document's score becomes significant, again in
comparison to some threshold $\tau_{insert}$. Since we will be
processing query terms in decreasing $w_{kq}$ order and heuristically
value the space associate with new accumulators more than the slight
extra time to run down posting lists a bit farther, we can assume
$\tau_{\mathname{insert}}>\tau _{\mathname{add}}$.
Picking appropriate
values for these two thresholds is something of a black art, but Persin
[Persin94] reports one especially
careful experiment in which both are made proportional to the most
highly matched document's accumulator $A^{*}$ (i.e., $A^{*}$ is the
maximum match score in any document's accumulator):
\tau_{\mathname{insert}} &=&\eta _{\mathname{insert}}\,\cdot \,A^{*} \\
\tau_{\mathname{add}} &=&\eta _{\mathname{add}}\,\cdot \,A^{*} Persin's
experiments suggest that values of $\eta _{\mathname{insert}
}=0.07,\,\eta _{\mathname{add}}=0.001$ give retrieval effectiveness near
that of full matches (i.e., considering all query-document term
products) while mininizing $\mathname{NAccum}$.
These two thresholds
divide the range of possible query-document term products into three
conditions: & w_{kd}\cdot w_{kq} > \tau_{\mathname{insert}} &
\mathrm{Always\ add;\ create\ new\ accumulator\ }A_{d}\mathrm{\ if\
necessary} \\ \tau_{\mathname{add}} & < w_{kd}\cdot w_{kq} \leq
\tau_{\mathname{insert}} & \mathrm{Add\ only\ if\ accumulator\
}A_{d}\mathrm{\ already\ exists} \\ & w_{kd}\cdot w_{kq} \leq
\tau_{\mathname{add}} & \mathrm{Ignore;\ move\ on\ to\ next\ query\
term}
Since we want to remain flexible with respect to both long and
short queries, we will assume that the query weights $w_{kq}$ are
pre-computed and passed to our ranking procedure.
Using our definition
for $w_{kd}$ and focusing first on the $_{\mathname{insert}}$ threshold:
& w_{kd}\cdot w_{kq} > & \tau _{\mathname{insert}} \\
\Longleftrightarrow & \left( f_{kd} \cdot \mathname{idf}_{k}\,\right)
\cdot w_{kq} > & \tau_{\mathname{insert}} \\ \Longleftrightarrow &
f_{kd}& f_{kd}\ > &
\frac{\tau_{\mathname{insert}}}{\mathname{idf}_{k}w_{kq}} \\
\Longleftrightarrow & f_{kd}\ > &
\frac{\eta_{\mathname{insert}}\,\,A^{*}}{\mathname{idf}_{k}\,w_{kq}}
\end{eqnarray} This finally becomes an operational question we can apply
with respect to each posting's frequency $f_{kd}$. Note that this
threshold must be updated every time we move to a new term of the query.
And of course, the computation of $\tau _{\mathname{add}}$ proceeds
similarly.
All the basic features of a partial ranking algorithm are now
in place, and a pseudo-code sketch is shown in Figure (FOAref)
. It also includes a few minor complications. First, a hashtable is
required to find accumulators associated with a particular
Top of Page
Computing partial match scores
freq statistic maintained with each posting
allows us to compute the weight $w_{kd}$ we need for each. But even with
these efficiencies, the need to consider every document posting for
every query term can create a serious processing demand, and one that
must be satisfied immediately - the users are waiting!
docno. Second, the set of accumulators $A_{d}$ is described
as a queue, but must be slightly trickier than most - it must maintain
them in order so that only the top $\mathname{NAccum}$ are maintained,
and must support length() queries and a pop()
function when it is full. Nondeterministic skip lists [Pugh90] are recommended for this
purpose.[Cutting97] .