JOURNAL ARTICLE

Distributed approximate spectral clustering for large-scale datasets

Abstract

Data-intensive applications are becoming important in many science and engineering fields, because of the high rates at which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process using many of the current machine learning algorithms due to their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem, as they require a kernel matrix that takes O(N^2) time and space to compute and store. The proposed algorithm yields a substantial reduction in the computation and memory overhead required to compute the kernel matrix, and it does not significantly impact the accuracy of the results. In addition, the level of approximation can be controlled to trade off some accuracy of the results against the required computing resources. The algorithm is designed to be independent of the subsequently used kernel-based machine learning algorithm, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.
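
The O(N^2) cost mentioned in the abstract can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the algorithm proposed in the paper: it builds a dense Gaussian kernel matrix, then a hypothetical block-wise variant that computes similarities only inside groups of points and treats cross-group similarities as zero. The function names, the Gaussian kernel, and the toy grid-based grouping rule are all assumptions introduced here; the abstract does not say how the approximation is performed.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def full_kernel_matrix(points, sigma=1.0):
    """Dense N x N kernel matrix: N^2 entries to compute and store."""
    return [[gaussian_kernel(p, q, sigma) for q in points] for p in points]


def blockwise_kernel_matrices(points, group_of, sigma=1.0):
    """Hypothetical approximation: keep only within-group similarities.

    `group_of` maps a point to a group key (an assumption of this sketch);
    cross-group similarities are treated as zero and never computed.
    """
    groups = {}
    for p in points:
        groups.setdefault(group_of(p), []).append(p)
    return {key: full_kernel_matrix(members, sigma)
            for key, members in groups.items()}


def stored_entries(blocks):
    """Total number of kernel entries held across all per-group blocks."""
    return sum(len(block) ** 2 for block in blocks.values())


if __name__ == "__main__":
    rng = random.Random(0)
    points = [(rng.random(), rng.random()) for _ in range(200)]

    # Toy grouping rule (assumption): split the unit square into a 2 x 2 grid.
    def grid_cell(p):
        return (int(p[0] * 2), int(p[1] * 2))

    blocks = blockwise_kernel_matrices(points, grid_cell)
    print("dense entries:    ", len(points) ** 2)
    print("blockwise entries:", stored_entries(blocks))
```

With more groups, fewer entries are computed and stored, but more cross-group similarities are discarded. This mirrors, in a simplified way, the trade-off between accuracy and computing resources that the abstract describes.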

Keywords:
Computer science; Scalability; Cluster analysis; Kernel (algebra); Overhead (engineering); Algorithm; Big data; Machine learning; Artificial intelligence; Data mining; Mathematics; Database

Metrics

Cited By: 19
FWCI (Field Weighted Citation Impact): 2.49
Refs: 81
Citation Normalized Percentile: 0.91

Topics

Face and Expression Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Clustering Algorithms Research
Physical Sciences →  Computer Science →  Artificial Intelligence
Data Stream Mining Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Vector quantization based approximate spectral clustering of large datasets

Kadim Taşdemir

Journal: Pattern Recognition Year: 2012 Vol: 45 (8) Pages: 3034-3044

BOOK-CHAPTER

A Spectral Clustering Method for Large-Scale Geostatistical Datasets

Francky Fouedjio

Lecture Notes in Computer Science Year: 2017 Pages: 248-261

JOURNAL ARTICLE

Approximate spectral clustering density-based similarity for noisy datasets

Mashaan Alshammari, Masahiro Takatsuka

Journal: Pattern Recognition Letters Year: 2019 Vol: 128 Pages: 155-161