Fast k-means code for Matlab

 
Updated June 11, 2004


This Unix tar file contains Matlab source code for the algorithm described in the paper Using the Triangle Inequality to Accelerate k-Means published in Proceedings of the Twentieth International Conference on Machine Learning (ICML'03).

Note added on April 14, 2010:  Do not initialize k-means with two identical centers.  Otherwise, this Matlab code may terminate prematurely with an error message.

The version of the KDD'98 dataset used in the paper is available as cup98b.data.zip.  This file has 95413 rows with 56 columns each.  For k = 20, starting with furthest-first initialization as described in the paper, k-means converges in 260 iterations.  The final mean squared distance of each point to its nearest center is 12880.  The Matlab software above performs approximately 7.26 million distance calculations, instead of about 496 million without optimization.  With a 3GHz Pentium IV processor, it finishes in about 120 seconds, compared to about 1280 seconds for a k-means implementation with similar low-level optimizations that does not eliminate distance calculations.

For other test datasets see here.