Raied Salman, Vojislav Kecman, Qi Li, Robert Strack, Erik Test
k-means has recently been recognized as one of the best algorithms for clustering unlabeled data. Because k-means depends mainly on distance calculations between all data points and the centers, its time cost is high when the dataset is large (for example, more than 500 million points). We propose a two-stage algorithm to reduce the time cost of distance calculation for huge datasets. The first stage is a fast distance calculation that uses only a small portion of the data to produce the best possible locations of the centers. The second stage is a slow distance calculation whose initial centers are taken from the first stage. The terms fast and slow refer to the speed at which the centers move. In the slow stage, the whole dataset can be used to obtain the exact locations of the centers. The time cost of the distance calculation in the fast stage is very low because the chosen training data is small. The time cost of the distance calculation in the slow stage is also reduced because only a small number of iterations is needed. Different initial locations of the clusters were used when testing the proposed algorithm. For large datasets, experiments show that the two-stage clustering method achieves a speed-up of 1 to 9 times.
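The two-stage idea described in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how the small portion of the data is selected, so uniform random sampling and the `sample_fraction` parameter are assumptions, and the function names (`lloyd`, `two_stage_kmeans`) are hypothetical. Both stages use plain Lloyd iterations; the slow stage simply starts from the centers the fast stage found.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centers, max_iter=100, tol=1e-9):
    """Plain Lloyd iterations. Returns (centers, iterations_used)."""
    centers = [tuple(c) for c in centers]
    dim = len(centers[0])
    for iteration in range(1, max_iter + 1):
        sums = [[0.0] * dim for _ in centers]
        counts = [0] * len(centers)
        # Assignment step: add each point to the running sum of its nearest center.
        for p in points:
            j = min(range(len(centers)),
                    key=lambda i: squared_distance(p, centers[i]))
            counts[j] += 1
            for d, x in enumerate(p):
                sums[j][d] += x
        # Update step: an empty cluster keeps its previous center.
        new_centers = [
            tuple(s / counts[j] for s in sums[j]) if counts[j] else centers[j]
            for j in range(len(centers))
        ]
        shift = max(squared_distance(a, b)
                    for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, iteration

def two_stage_kmeans(points, k, sample_fraction=0.1, seed=0):
    """Fast stage on a random sample (assumed), then slow stage on all points."""
    rng = random.Random(seed)
    sample_size = max(k, int(len(points) * sample_fraction))
    sample = rng.sample(points, sample_size)
    initial = rng.sample(sample, k)
    # Fast stage: cheap iterations because the sample is small.
    fast_centers, fast_iters = lloyd(sample, initial)
    # Slow stage: full dataset, started from the fast-stage centers.
    final_centers, slow_iters = lloyd(points, fast_centers)
    return final_centers, fast_iters, slow_iters

# Usage on synthetic data: two well-separated 2-D blobs.
data_rng = random.Random(1)
data = ([(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(500)]
        + [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(500)])
centers, fast_iters, slow_iters = two_stage_kmeans(data, k=2)
print(sorted(centers), fast_iters, slow_iters)
```

Each full-data iteration costs on the order of n times k distance evaluations, so any saving comes from the slow stage needing only a few iterations once the fast stage has placed the centers near their final positions. How large the saving is depends on the data and on how representative the sample is.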
Junwei Han, Kun Song, Feiping Nie, Xuelong Li
Tae-Chang Jee, Hyun-Jin Lee, Yill-Byung Lee
Kun Song, Xiwen Yao, Feiping Nie, Xuelong Li, Mingliang Xu
Liang Xian, Fuheng Qu, Yong Yang, Hua Cai