ZTE Communications ›› 2013, Vol. 11 ›› Issue (2): 45-50.DOI: DOI:10.3969/j.issn.1673-5188.2013.02.007
Qiwei Zhong1, Yunlong Lin1, Junyang Zou1, Kuangyan Zhu1, Qiao Wang1, and Lei Hu2
Qiwei Zhong1, Yunlong Lin1, Junyang Zou1, Kuangyan Zhu1, Qiao Wang1, and Lei Hu2
摘要: Clustering is one of the most widely used techniques for exploratory data analysis. Spectral clustering algorithm, a popular modern clustering algorithm, has been shown to be more effective in detecting clusters than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. With the size of databases soaring, clustering algorithms have scaling computational time and memory use. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and data storage are distributed, which solves the scalability problems for most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. It is shown that the proposed implementation scales well, speeds up the clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.