FOA Home | UP: Clustering


Clustering algorithms

Because of its wide-spread application, clusetering is one of the most well-studied problems within statistics and computer science [Jain88] [ Griffiths86] . It can be stated in generic terms, in terms of an arbitrary similarity measure between items to be clustered. Especially in iterative applications of clustering, the relative frequency of various clustering technique matters. The art, therefore, in applying a clustering method in a particular application depends greatly on the particular features of items to be clustered, the number of partitions within which these items are to be clustered. W. Willett has provided an excellent survey of applications of clusetering applied to various aspects of the FOA task [Willett88] . More recently, Zamir has considered clusterings of document ``snippets'' (roughly the same as the paragraph units proposed in Section §2.3 ) rather than on complete documents [Zamir98] .

The most typical clustering method for application within the FOA context is single link hierarchical clustering. This is considered an AGGLOMERATIVE technique, because it begins with each data point considered to be in its own cluster and then iteratively asks which to clusters should be combined. clustering algorithms begin with the entire set of data points considered to be in one cluster, and then attempt to partition this set [Jain88] . They sometimes are recommended because if they are begun with exactly $k$ seeds reduce the complexity of comparison to $O(kN)$.}

The most typical method for performing this task builds a MINIMUM SPANNING TREE (MST), iteratively merging the two documents that are closest together, then including the merged nodes as well. The result is a tree from the finally merged root to each of the documents as leaves. Alternatively, the complete set of inter-document similarity measures can be progressively ``filtered'' by a gradually increasing threshold which is used to define whether two documents are considered connected by an edge. We can then ask for certain properties of the emerging graph, for example stopping when all components become connected. To keep things simple and focused on robust assumptions \vanR{59}, $O(n^2)$ algorithms are assumed.

One early motivation for clustering algorithms was efficient disk access: If documents were pre-clustered appropriately it would be likely that a ``page'' of disk memory containing one document matching a query might also contain other elements of the same cluster. That is, an efficient physical disk allocation algorithm might try to ensure that clustered documents were written to the same block.

This example suggests how MST representations of the documents might also support efficient query/document matching algorithms. In brief, the MST can be interpretted as a ``decision tree,'' with a query beginning at the root and comparing itself to a candidate representation (e.g., centroid) of each cluster.

The Clustering Hypothesis suggests that if a document which is similar to the query is relevant, then other documents similar to it are likely to be relevant as well. By clustering neighboring documents' NEAREST NEIGHBOR documents should be retrieved also. But as Cutting et al. note, for a partitioning cluster to be useful in this application requires that $k$ be very near $N$ [Cutting92] . We do not want to retrieve more than a linear constant other documents when we retrieve one.

ScatterGather is an intereseting example of how partitional clustering can be applied in a provocative way [Cutting92] [Hearst92] . Beginning with a clustering of the entire corpus, initial queries are posed by selecting some of the clusters. Documents identified as members of these clusters form a new sub-corpus; clustering is again applied across these, etc. Evaluating the utility of such a novel browsing pattern is difficult, but the ability of the ScatterGather interface (as well as a similar effort applied to the MacOS interface called Piles [Rose93] . The most problematic feature of the ScatterGather procedure is its sensitivity to the selection of initial ``buckshot'' random seeds. They don't necessarily see this as a bug: Indeed, the lack of determinism might be interpreted as a feature, since the user then had the option of discarding an unrevealing partition in favor of a fresh re-clustering. [Cutting92]


Top of Page | UP: Clustering | ,FOA Home


FOA © R. K. Belew - 00-09-21