FOA Home
We can immediately use this vector space for something useful, as the
source of yet another approach to the question of appropriate keyword
weightings. Recall that in Figure (FOAref) our initial
assumption was that each and every keyword was to be used as a dimension
of the vector space. Now we ask: What would happen if we removed one of
these keywords?
The first step is to extend the measure
\mathname{Sim}(\mathbf{q},\mathbf{d}) \) of document-query similarity to
measure inter-document similarities \(
\mathname{Sim}(\mathbf{d}_{i},\mathbf{d}_{j}) \) as well. Then, for an
arbitrary measure of document-document similarity (e.g., the inner
product measure mentioned above), we consider all pairs of documents,
and then the AVERAGE SIMILARITY across all of them. {Fortunately,
it turns out that there is a more efficient way to compute average
similarity than actually comparing all ${n}\choose{2}$ document pairs.
First, define the CENTROID document to be the average document;
i.e., the resulting of adding all\( \mathname{NDoc} \) vectors and
dividing it by \( \mathname{NDoc} \). Then the average similarity can
also be defined in terms of the distance of each document from this
center. \overline{\mathname{Sim}_k}&\equiv&{1\over
\mathname{NDoc}^2}\sum_{i,j} \mathname{Sim}(d_i, d_j) \\ &=
&\alpha\sum_{i=1}^{\mathname{NDoc}} \mathname{Sim}(d_i, D^*) }
Recall
that our goal is to devise a representation of documents that makes it
easiest for queries to discriminate them. Since each keyword corresponds
to a dimension, {removing} one results in a {\em compression} of the
space into $K-1$ dimensions, and we can expect that the representation
of each of the documents will change at least slightly. More precisely,
removing a dimension along which the documents varied significantly
means that vectors which were far apart in the $K$-dimensional space are
now much closer together.
This observation can be used to ask just how
useful each potential keyword is. If it is discriminating, removing it
will result in a signficant compression of the documents' vectors; if
removing it changes very little, the keyword is less helpful. Using the
average similarity as our measure of just how close together the
documents are, and asking this question for each and every keyword, we
arrive at yet another measure of keyword discrimination:
Top of Page
Keyword discrimination