A Hadoop Performance Prediction Model Based on Random Forest

doi:10.3969/j.issn.1673-5188.2013.02.006

ZTE Communications, 2013, Vol. 11, Issue 2: 38-44. DOI: 10.3969/j.issn.1673-5188.2013.02.006

Zhendong Bei¹, Zhibin Yu¹, Huiling Zhang¹, Chengzhong Xu¹,², Shengzhong Feng¹, Zhenjiang Dong³, and Hengsheng Zhang³

1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;
2. Wayne State University, Detroit, Michigan 48202, USA;
3. Cloud Computing and IT Institute of ZTE Corporation, Nanjing 210012, China

Received: 2013-05-20 Online: 2013-06-25 Published: 2013-06-25
About the authors: Zhendong Bei is a PhD student in computer applications at Shenzhen Institutes of Advanced Technology, China. He received his BS degree from the National University of Defense Technology, China, in 2006, and his MS degree from Central South University, China, in 2009. His research interests include cloud computing, data mining, machine learning, and image processing.

Zhibin Yu (zb.yu@siat.ac.cn) received his PhD degree in computer science from Huazhong University of Science and Technology (HUST) in 2008. He spent one year as a visiting scholar in the Laboratory of Computer Architecture, University of Texas at Austin. He is currently an associate professor at the Shenzhen Institutes of Advanced Technology. His research interests include microarchitecture simulation, computer architecture, workload characterization and generation, performance evaluation, multicore architecture, and virtualization technologies. In 2005, he won first prize in the HUST Young Lecturers Teaching Contest, and in 2003, he won second prize in the teaching quality assessment of HUST. He is a member of IEEE and ACM.

Huiling Zhang received her MS degree in signal and information processing from Southwest University, China, in 2011. She joined the Center for High-Performance Computing at Shenzhen Institutes of Advanced Technology and now works as a research assistant there. Her current research interests include high-performance computing, and machine learning and its applications in bioinformatics.

Chengzhong Xu (czxu@wayne.edu) received his BS and MSc degrees in computer science and engineering from Nanjing University in 1986 and 1989, respectively. He received his PhD degree from the University of Hong Kong in 1993. His research interests include computer architecture, distributed systems, virtualization, and cloud computing. Dr. Xu is a professor of electrical and computer engineering at Wayne State University. He is also the director of the Cloud and Internet Computing Laboratory at Wayne State University. He is an IEEE senior member and ACM member.

Shengzhong Feng is a professor and deputy director of the Institute of Advanced Computing and Digital Engineering, Shenzhen Institutes of Advanced Technology (SIAT). His research interests are parallel algorithms, grid computing, and bioinformatics. In particular, he is focused on developing novel, effective methods of modeling digital cities and applications. Before coming to SIAT, he worked in the Institute of Computing Technology, Chinese Academy of Sciences, and participated in research on the Dawning supercomputer. He graduated from the University of Science and Technology of China in 1991 and received his PhD from Beijing Institute of Technology in 1997.

Zhenjiang Dong (dong.zhenjiang@zte.com.cn) is the vice president of the Communication Services R&D Institute for Cloud Computing and IT Operation, ZTE Corporation. He received his MS degree from Harbin Institute of Technology in 1996. His research interests include cloud computing, multimedia networking, and mobile networking.

Hengsheng Zhang (zhang.hengsheng@zte.com.cn) received his bachelor’s degree from Anhui University, China. He joined ZTE in 2005 and is a pre-research engineer and senior architect. His research interests include value-added services and cloud computing.

Abstract: MapReduce is a programming model for processing large data sets, and Hadoop is the most popular open-source implementation of MapReduce. To achieve high performance, up to 190 Hadoop configuration parameters must be manually tuned. This is not only time-consuming but also error-prone. In this paper, we propose a new performance model based on random forest, a recently developed machine-learning algorithm. The model, called RFMS, predicts the performance of a Hadoop system from the system’s configuration parameters. RFMS is built from 2000 distinct fine-grained performance observations with different Hadoop configurations. We test RFMS against the measured performance of representative workloads from the Hadoop Micro-benchmark suite. The results show that the prediction accuracy of RFMS is 95% on average and reaches up to 99%. This new, highly accurate prediction model can be used to automatically optimize the performance of Hadoop systems.

Key words: big data, cloud computing, MapReduce, Hadoop, random forest, micro-benchmark

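The idea summarized in the abstract, learning a mapping from configuration parameters to measured performance with a random forest, can be illustrated with a minimal sketch. This is not the RFMS implementation, whose details are not given on this page. The three parameters (loosely named after Hadoop's `io.sort.mb`, `mapred.reduce.tasks`, and `io.sort.factor`), the `synthetic_runtime` formula, the sample sizes, and the relative-error metric are all assumptions chosen for illustration; the "measurements" are generated, not taken from a cluster.

```python
import random
import statistics


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(rows, targets, n_try, rng, depth=0, max_depth=8, min_leaf=2):
    """Grow a regression tree; each split examines n_try randomly chosen features."""
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return statistics.fmean(targets)  # leaf: mean target of the node
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):
        for thr in sorted({r[f] for r in rows})[1:]:
            left = [i for i, r in enumerate(rows) if r[f] < thr]
            right = [i for i, r in enumerate(rows) if r[f] >= thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([targets[i] for i in left]) + sse([targets[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, f, thr, left, right)
    if best is None:
        return statistics.fmean(targets)
    _, f, thr, left, right = best
    children = [
        build_tree([rows[i] for i in idx], [targets[i] for i in idx],
                   n_try, rng, depth + 1, max_depth, min_leaf)
        for idx in (left, right)
    ]
    return (f, thr, children[0], children[1])


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] < thr else right
    return node


def fit_forest(rows, targets, n_trees=20, n_try=2, seed=0):
    """Train each tree on a bootstrap sample (bagging) of the observations."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], n_try, rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return statistics.fmean([predict_tree(tree, row) for tree in forest])


def sample_config(rng):
    """Hypothetical configuration: (io_sort_mb, reduce_tasks, sort_factor)."""
    return (rng.choice(range(50, 401, 50)), rng.randint(1, 16),
            rng.choice(range(10, 101, 10)))


def synthetic_runtime(io_sort_mb, reduce_tasks, sort_factor, rng):
    """Made-up stand-in for a measured job time in seconds (NOT real Hadoop data)."""
    clean = (200 + 400 / reduce_tasks + 2 * reduce_tasks
             + 0.001 * (io_sort_mb - 200) ** 2 + 1000 / sort_factor)
    return clean * rng.gauss(1.0, 0.02)  # 2% multiplicative noise


data_rng = random.Random(42)
train_x = [sample_config(data_rng) for _ in range(200)]
train_y = [synthetic_runtime(*c, data_rng) for c in train_x]
test_x = [sample_config(data_rng) for _ in range(50)]
test_y = [synthetic_runtime(*c, data_rng) for c in test_x]

forest = fit_forest(train_x, train_y)
mean_y = statistics.fmean(train_y)
# Mean relative error on held-out configurations (an assumed accuracy metric).
forest_error = statistics.fmean(
    [abs(predict_forest(forest, x) - y) / y for x, y in zip(test_x, test_y)])
baseline_error = statistics.fmean([abs(mean_y - y) / y for y in test_y])
print(f"forest: {forest_error:.1%}  mean-only baseline: {baseline_error:.1%}")
```

Once such a model is trained, a configuration search could query `predict_forest` for many candidate settings instead of running each one on a cluster, which is the kind of automatic optimization the abstract points to.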

Zhendong Bei, Zhibin Yu, Huiling Zhang, Chengzhong Xu, Shengzhong Feng, Zhenjiang Dong, and Hengsheng Zhang. A Hadoop Performance Prediction Model Based on Random Forest[J]. ZTE Communications, 2013, 11(2): 38-44.
