The goal of subspace clustering is to find hidden clusters that exist in different subspaces of a dataset. In recent years, with the exponential growth of data size and dimensionality, traditional subspace clustering algorithms have become both inefficient and ineffective at extracting knowledge in big data environments, creating an urgent need for efficient parallel, distributed subspace clustering algorithms that can handle large multi-dimensional data at an acceptable computational cost. In this paper, we introduce MR-Mafia, a parallel MAFIA subspace clustering algorithm based on MapReduce. The algorithm takes advantage of MapReduce's data partitioning and task parallelism and achieves a good tradeoff between disk-access cost and communication cost. The experimental results show near-linear speedups and demonstrate the high scalability and promising application prospects of the proposed algorithm.