JOURNAL ARTICLE

Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics

Victor Olman, Fenglou Mao, Hongwei Wu, Ying Xu

Year: 2009 Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics Vol: 6 (2) Pages: 344-352 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Large bioinformatics data sets make cluster identification computationally expensive, which motivates a parallel algorithm for identifying dense clusters in a noisy background. Our algorithm works on a graph representation of the data set to be analyzed and identifies clusters by finding densely intraconnected subgraphs. We employ a minimum spanning tree (MST) representation of the graph and solve the cluster identification problem using this representation. The computational bottleneck of our algorithm is the construction of an MST of the graph, for which a parallel algorithm is employed. Our high-level strategy for the parallel MST construction is to first partition the graph, then construct MSTs for the partitioned subgraphs and for auxiliary bipartite graphs based on the subgraphs, and finally merge these MSTs to derive an MST of the original graph. The computational results indicate that, when running on 150 CPUs, our algorithm can solve a cluster identification problem on a data set with 1,000,000 data points almost 100 times faster than on a single CPU, indicating that the program can handle very large clustering problems efficiently. We have implemented the clustering algorithm as the software CLUMP.
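
The partition, build, and merge strategy described in the abstract can be illustrated with a minimal serial sketch. This is not the CLUMP implementation: the function names (`kruskal`, `partitioned_mst`), the round-robin partitioning, the use of Euclidean distance on a complete graph, and the choice of Kruskal's algorithm for every step are all assumptions made for illustration. Each per-part and per-pair MST below is independent of the others, which is the property a parallel version would presumably exploit.

```python
import itertools
import math


def kruskal(nodes, edges):
    """Return the MST (or spanning forest) edges using Kruskal's algorithm."""
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def partitioned_mst(points, k):
    """Hypothetical helper: MST of the complete Euclidean graph on `points`,
    built from MSTs of k vertex subsets and of the bipartite graphs between them."""
    n = len(points)

    def dist(u, v):
        return math.dist(points[u], points[v])

    # Step 1: partition the vertices (round-robin here, an arbitrary choice).
    parts = [list(range(i, n, k)) for i in range(k)]
    parts = [p for p in parts if p]

    candidates = []
    # Step 2a: MST of the complete subgraph inside each part.
    for p in parts:
        edges = [(dist(u, v), u, v) for u, v in itertools.combinations(p, 2)]
        candidates += kruskal(p, edges)
    # Step 2b: MST of the complete bipartite graph between each pair of parts.
    for p, q in itertools.combinations(parts, 2):
        edges = [(dist(u, v), u, v) for u in p for v in q]
        candidates += kruskal(p + q, edges)

    # Step 3: merge. An MST of the union of the candidate edges is an MST
    # of the original graph, because every discarded edge was the heaviest
    # edge on some cycle of a subgraph.
    return kruskal(range(n), candidates)
```

The merge step only has to sort the edges kept by the sub-MSTs rather than all edges of the original graph, which is where the savings in this sketch come from.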

Keywords:
Computer science, Cluster analysis, Bipartite graph, Minimum spanning tree, Bottleneck, Algorithm, Graph, Theoretical computer science, Artificial intelligence

Metrics

Cited By: 72
FWCI (Field Weighted Citation Impact): 2.86
Refs: 31
Citation Normalized Percentile: 0.90

Topics

Bioinformatics and Genomic Networks
Life Sciences →  Biochemistry, Genetics and Molecular Biology →  Molecular Biology
Gene expression and cancer classification
Life Sciences →  Biochemistry, Genetics and Molecular Biology →  Molecular Biology
Machine Learning in Bioinformatics
Life Sciences →  Biochemistry, Genetics and Molecular Biology →  Molecular Biology

Related Documents

BOOK-CHAPTER

Parallel k/h-Means Clustering for Large Data Sets

Kilian Stoffel, Abdelkader Belkoniene

Lecture Notes in Computer Science Year: 1999 Pages: 1451-1454

JOURNAL ARTICLE

Efficient parallel spectral clustering algorithm design for large data sets under cloud computing environment

Ran Jin, Chunhai Kou, Ruijuan Liu, Yefeng Li

Journal: Journal of Cloud Computing: Advances, Systems and Applications Year: 2013 Vol: 2 (1) Pages: 18-18

BOOK-CHAPTER

Clustering Large Data Sets

M. Narasimha Murty

Series in Machine Perception and Artificial Intelligence Year: 2002 Pages: 41-63