ZTE Communications ›› 2013, Vol. 11 ›› Issue (2): 45-50. DOI: 10.3969/j.issn.1673-5188.2013.02.007

• Research Paper •

Parallel Spectral Clustering Based on MapReduce

Qiwei Zhong1, Yunlong Lin1, Junyang Zou1, Kuangyan Zhu1, Qiao Wang1, and Lei Hu2   

  1. School of Information Science and Engineering, Southeast University, Nanjing 210096, China;
  2. ZTE Corporation, Nanjing 210012, China
  • Received: 2012-09-19 Online: 2013-06-25 Published: 2013-06-25
  • About the authors: Qiwei Zhong (qwzhong1988@seu.edu.cn) received his BS degree in electronic information engineering from Nanjing University of Science and Technology. He is currently working towards his MS degree in the School of Information Science and Engineering, Southeast University, Nanjing. His research interests include data mining, social network analysis, distributed processing, and data visualization.

    Yunlong Lin (linyl@seu.edu.cn) received his BS degree from the School of Information Science and Engineering, Southeast University. He is currently working towards his MS degree in the Laboratory of Netmedia and Datamining, Southeast University, Nanjing. His research interests include data mining, recommendation systems, and distributed computing.

    Junyang Zou (zoujyjs@seu.edu.cn) received his BS degree in opto-electronic engineering from Nanjing University of Posts and Telecommunications. He is currently working towards his BS degree at the School of Information Science and Engineering, Southeast University, Nanjing. His research interests include data mining and distributed computing.

    Kuangyan Zhu (zhukuangyan@seu.edu.cn) received his BS degree from the School of Information Science and Engineering, Southeast University, Nanjing. He is currently working towards his MS degree at Southeast University. His research interests include recommendation systems and distributed computing.

    Qiao Wang (qiaowang@seu.edu.cn) is a professor and doctoral supervisor at Southeast University, Nanjing. He received his PhD degree in mathematics from Wuhan University. He is now the director of the Signal and Information Processing Laboratory, Southeast University. He was a visiting scientist in the Division of Engineering and Applied Science, Harvard University, from 2003 to 2004. His research interests include spectral analysis, statistical signal processing, image processing, and signal processing in applications such as biology and transportation.

    Lei Hu (hu.lei2@zte.com.cn) received his MS degree from the Laboratory of Intelligent Recognition and Image Processing, Beihang University, in 2008. He is a senior research engineer working on mass data analysis projects at ZTE Corporation. His research interests include data mining, information retrieval, and social network analysis.

Abstract: Clustering is one of the most widely used techniques for exploratory data analysis. Spectral clustering, a popular modern clustering algorithm, has been shown to detect clusters more effectively than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. As databases grow, the computational time and memory use of clustering algorithms grow with them. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and data storage are distributed, which solves the scalability problems of most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. The results show that the proposed implementation scales well, speeds up clustering without sacrificing quality, and processes massive datasets efficiently on clusters of commodity machines.

Key words: spectral clustering, parallel implementation, massive dataset, Hadoop MapReduce, data mining
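
To make the idea of distributing the computation concrete, the following is a minimal single-process sketch, not the authors' implementation. Spectral clustering relies on eigenvectors of a graph Laplacian, and iterative eigensolvers spend most of their time on sparse matrix-vector products, so the sketch expresses one such product in map and reduce steps. It assumes an undirected graph given as an edge list with each edge listed once, and it uses the symmetrically normalized adjacency matrix D^{-1/2} W D^{-1/2}; the abstract does not say which Laplacian variant the paper uses. All function names (`degrees`, `map_edge`, `reduce_row`, `normalized_matvec`) are hypothetical, and the in-memory dictionary only stands in for the shuffle phase of a real Hadoop job.

```python
from collections import defaultdict
from math import sqrt


def degrees(edges):
    """Weighted degree of every vertex; each undirected edge is listed once."""
    d = defaultdict(float)
    for i, j, w in edges:
        d[i] += w
        if i != j:
            d[j] += w
    return dict(d)


def map_edge(edge, x, degree):
    """Map step: one edge (i, j, w) emits partial products for rows i and j."""
    i, j, w = edge
    s = w / sqrt(degree[i] * degree[j])  # entry of D^{-1/2} W D^{-1/2}
    yield i, s * x[j]
    if i != j:
        yield j, s * x[i]


def reduce_row(row, partials):
    """Reduce step: sum all partial products that share a row key."""
    return row, sum(partials)


def normalized_matvec(edges, x, degree):
    """Compute y = D^{-1/2} W D^{-1/2} x with explicit map and reduce steps."""
    groups = defaultdict(list)  # assumption: stands in for the shuffle phase
    for edge in edges:
        for key, value in map_edge(edge, x, degree):
            groups[key].append(value)
    return dict(reduce_row(k, v) for k, v in groups.items())
```

A quick sanity check for such a routine: the vector with entries sqrt(d_i) is an eigenvector of the normalized adjacency matrix with eigenvalue 1, so applying `normalized_matvec` to it should return the same vector up to rounding. In a real deployment the edge list would be partitioned across machines, which is what allows both storage and computation to be distributed.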
