Clustering is one of the most important unsupervised learning techniques, used to group data for prediction and for detecting anomalies. As the quantity of data grows every day, processing it with limited computational resources has become a difficult job. Such workloads must be treated as a Big Data problem, which requires advanced technology to store and process the data in a seamlessly distributed fashion. Apache Hadoop offers a solution to this problem by providing techniques to run parallel jobs on commodity hardware. In this paper, we discuss an algorithm that runs K-Means on Hadoop while varying the dataset and the cluster centers. We then compare parallel and sequential execution, keeping all other factors the same. The experimental results show that our algorithm can efficiently process large datasets in a Hadoop environment.
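To make the approach concrete, the following is an illustrative sketch (not the paper's actual implementation) of a single MapReduce-style K-Means iteration: the map phase assigns each point to its nearest center, and the reduce phase recomputes each center as the mean of its assigned points. All function names and the sample data here are hypothetical.

```python
# Hypothetical sketch of one MapReduce-style K-Means iteration.
from collections import defaultdict

def assign(point, centers):
    # Map phase: emit the index of the nearest center for this point.
    dists = [sum((p - c) ** 2 for p, c in zip(point, center))
             for center in centers]
    return min(range(len(centers)), key=dists.__getitem__)

def kmeans_iteration(points, centers):
    # Shuffle: group points by their assigned center index.
    groups = defaultdict(list)
    for pt in points:
        groups[assign(pt, centers)].append(pt)
    # Reduce phase: each new center is the component-wise mean
    # of the points assigned to it.
    new_centers = list(centers)
    for idx, grp in groups.items():
        new_centers[idx] = tuple(sum(c) / len(grp) for c in zip(*grp))
    return new_centers

# Toy example: four 2-D points and two initial centers.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.0, 1.0), (9.0, 9.0)]
print(kmeans_iteration(points, centers))  # → [(1.25, 1.5), (8.5, 8.75)]
```

In a real Hadoop job, the assignment step would run in parallel mappers over splits of the dataset, and the per-cluster mean would be computed in reducers; iterating until the centers stabilize requires chaining such jobs.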
Zahid Ansari, Asif Afzal, Tanvir Habib Sardar