Parallel Spectral Clustering Based on MapReduce

doi:10.3969/j.issn.1673-5188.2013.02.007

ZTE Communications, 2013, Vol. 11, Issue 2: 45-50. DOI: 10.3969/j.issn.1673-5188.2013.02.007

Qiwei Zhong¹, Yunlong Lin¹, Junyang Zou¹, Kuangyan Zhu¹, Qiao Wang¹, and Lei Hu²

1. School of Information Science and Engineering, Southeast University, Nanjing 210096, China;
2. ZTE Corporation, Nanjing 210012, China

Received: 2012-09-19 Published: 2013-06-25 Online: 2013-06-25
About the authors: Qiwei Zhong (qwzhong1988@seu.edu.cn) received his BS degree in electronic information engineering from Nanjing University of Science and Technology. He is currently working towards his MS degree in the School of Information Science and Engineering, Southeast University, Nanjing. His research interests include data mining, social network analysis, distributed processing, and data visualization.

Yunlong Lin (linyl@seu.edu.cn) received his BS degree from the School of Information Science and Engineering, Southeast University. He is currently working towards his MS degree in the Laboratory of Netmedia and Datamining, Southeast University, Nanjing. His research interests include data mining, recommendation systems, and distributed computing.

Junyang Zou (zoujyjs@seu.edu.cn) received his BS degree in opto-electronic engineering from Nanjing University of Posts and Telecommunications. He is currently working towards his BS degree at the School of Information Science and Engineering, Southeast University, Nanjing. His research interests include data mining and distributed computing.

Kuangyan Zhu (zhukuangyan@seu.edu.cn) received his BS degree from the School of Information Science and Engineering, Southeast University, Nanjing. He is currently working towards his MS degree at Southeast University. His research interests include recommendation systems and distributed computing.

Qiao Wang (qiaowang@seu.edu.cn) is a professor and doctoral supervisor at Southeast University, Nanjing. He received his PhD degree in mathematics from Wuhan University. He is now the director of the Signal and Information Processing Laboratory, Southeast University. He was a visiting scientist in the Division of Engineering and Applied Science, Harvard University, from 2003 to 2004. His research interests include spectral analysis, statistical signal processing, image processing, and signal processing in applications such as biology and transportation.

Lei Hu (hu.lei2@zte.com.cn) received his MS degree from the Laboratory of Intelligent Recognition and Image Processing, Beihang University, in 2008. He is a senior research engineer for mass data analysis projects of ZTE Corporation. His research interests include data mining, information retrieval, and social network analysis.

Abstract

Clustering is one of the most widely used techniques for exploratory data analysis. Spectral clustering, a popular modern clustering algorithm, has been shown to detect clusters more effectively than many traditional algorithms. Its applications range from computer vision and information retrieval to social science and biology. As databases grow rapidly, the computation time and memory use of clustering algorithms become scalability bottlenecks. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and the data storage are distributed, which addresses the scalability problems of most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. The results show that the proposed implementation scales well, speeds up clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.

Key words: spectral clustering, parallel implementation, massive dataset, Hadoop MapReduce, data mining

Qiwei Zhong, Yunlong Lin, Junyang Zou, Kuangyan Zhu, Qiao Wang, and Lei Hu. Parallel Spectral Clustering Based on MapReduce[J]. ZTE Communications, 2013, 11(2): 45-50.
