Contraction Clustering (RASTER): A Big Data Algorithm for Density-Based Clustering in Constant Memory and Linear Time

G. Ulm, E. Gustavsson, M. Jirstrand. The 3rd International Conference on Machine Learning, Optimization and Big Data (MOD 2017), September 14 - 17, 2017, Volterra, Italy

Abstract

Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering methods are infeasible due to their memory requirements or runtime complexity. Contraction Clustering (RASTER) is a linear-time algorithm for identifying density-based clusters. Its coefficient is negligible as it depends neither on input size nor the number of clusters. Its memory requirements are constant. Consequently, RASTER is suitable for big data applications where the size of the data may be huge. It consists of two steps: (1) a contraction step which projects objects onto tiles and (2) an agglomeration step which groups tiles into clusters. Our algorithm is extremely fast. In single-threaded execution on a contemporary workstation, it clusters ten million points in less than 20 seconds when using a slow interpreted programming language like Python. Furthermore, RASTER is easily parallelizable.




Photo credits: Nic McPhee