FOA Home
Because of its wide-spread application, clusetering is one of the most
well-studied problems within statistics and computer science [Jain88] [ Griffiths86] . It can be stated in generic terms, in
terms of an arbitrary similarity measure between items to be clustered.
Especially in iterative applications of clustering, the relative
frequency of various clustering technique matters. The art, therefore,
in applying a clustering method in a particular application depends
greatly on the particular features of items to be clustered, the number
of partitions within which these items are to be clustered. W. Willett
has provided an excellent survey of applications of clusetering applied
to various aspects of the FOA task [Willett88] . More recently, Zamir has
considered clusterings of document ``snippets'' (roughly the same as the
paragraph units proposed in Section §2.3
) rather than on complete documents [Zamir98] .
The most typical clustering
method for application within the FOA context is single link
hierarchical clustering. This is considered an AGGLOMERATIVE
technique, because it begins with each data point considered to be in
its own cluster and then iteratively asks which to clusters should be
combined. clustering algorithms begin with the entire set of data points
considered to be in one cluster, and then attempt to partition this set
[Jain88] . They sometimes are
recommended because if they are begun with exactly $k$ seeds reduce the
complexity of comparison to $O(kN)$.}
The most typical method for
performing this task builds a MINIMUM SPANNING TREE (MST),
iteratively merging the two documents that are closest together, then
including the merged nodes as well. The result is a tree from the
finally merged root to each of the documents as leaves. Alternatively,
the complete set of inter-document similarity measures can be
progressively ``filtered'' by a gradually increasing threshold which is
used to define whether two documents are considered connected by an
edge. We can then ask for certain properties of the emerging graph, for
example stopping when all components become connected. To keep things
simple and focused on robust assumptions \vanR{59}, $O(n^2)$ algorithms
are assumed.
One early motivation for clustering algorithms was efficient
disk access: If documents were pre-clustered appropriately it would be
likely that a ``page'' of disk memory containing one document matching a
query might also contain other elements of the same cluster. That is, an
efficient physical disk allocation algorithm might try to ensure that
clustered documents were written to the same block.
This example suggests
how MST representations of the documents might also support efficient
query/document matching algorithms. In brief, the MST can be
interpretted as a ``decision tree,'' with a query beginning at the root
and comparing itself to a candidate representation (e.g., centroid) of
each cluster.
The Clustering Hypothesis suggests that if a document which
is similar to the query is relevant, then other documents similar to it
are likely to be relevant as well. By clustering neighboring documents'
NEAREST NEIGHBOR documents should be retrieved also. But as
Cutting et al. note, for a partitioning cluster to be useful in this
application requires that $k$ be very near $N$ [Cutting92] . We do not want to
retrieve more than a linear constant other documents when we retrieve
one.
ScatterGather is an intereseting example of how partitional
clustering can be applied in a provocative way [Cutting92] [Hearst92] . Beginning with a clustering
of the entire corpus, initial queries are posed by selecting some of the
clusters. Documents identified as members of these clusters form a new
sub-corpus; clustering is again applied across these, etc. Evaluating
the utility of such a novel browsing pattern is difficult, but the
ability of the ScatterGather interface (as well as a similar effort
applied to the MacOS interface called Piles [Rose93] . The most problematic feature of
the ScatterGather procedure is its sensitivity to the selection of
initial ``buckshot'' random seeds. They don't necessarily see this as a
bug: Indeed, the lack of determinism might be interpreted as a feature,
since the user then had the option of discarding an unrevealing
partition in favor of a fresh re-clustering. [Cutting92]
Top of Page
Clustering algorithms