This Unix tar file contains Matlab
source code for the algorithm described in the paper Using the Triangle Inequality to
Accelerate k-Means published in Proceedings of the
Twentieth International Conference on Machine Learning (ICML'03).
Note added on April 14, 2010: Do not initialize k-means with two identical centers. Otherwise, this Matlab code may terminate prematurely with an error message.
The version of the KDD'98 dataset used in the paper is available as cup98b.data.zip. This file has 95413
rows with 56 columns each. For k
= 20, starting with furthest-first initialization as described in the
paper, k-means converges in
260 iterations. The final mean squared distance of each point to
its nearest center is 12880. The Matlab software above performs
approximately 7.26 million distance calculations, instead of about 496
million without optimization. With a 3GHz Pentium IV processor,
it finishes in about 120 seconds,
compared to about 1280 seconds for a k-means
implementation with similar low-level optimizations that does not
eliminate distance calculations.
For other test datasets see here.