ZTE Communications ›› 2013, Vol. 11 ›› Issue (2): 38-44. DOI: 10.3969/j.issn.1673-5188.2013.02.006

• Special Topic •

A Hadoop Performance Prediction Model Based on Random Forest

Zhendong Bei1, Zhibin Yu1, Huiling Zhang1, Chengzhong Xu1,2, Shengzhong Feng1, Zhenjiang Dong3, and Hengsheng Zhang3

    1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;
    2. Wayne State University, Detroit, Michigan 48202, USA;
    3. Cloud Computing and IT Institute of ZTE Corporation, Nanjing 210012, China
  • Received: 2013-05-20 Online: 2013-06-25 Published: 2013-06-25
  • About the authors: Zhendong Bei is a PhD student in computer applications at Shenzhen Institutes of Advanced Technology, China. He received his BS degree from the National University of Defense Technology, China, in 2006, and his MS degree from Central South University, China, in 2009. His research interests include cloud computing, data mining, machine learning, and image processing.

    Zhibin Yu (zb.yu@siat.ac.cn) received his PhD degree in computer science from Huazhong University of Science and Technology in 2008. He spent one year as a visiting scholar in the Laboratory of Computer Architecture, University of Texas at Austin. He is currently an associate professor at the Shenzhen Institutes of Advanced Technology. His research interests include microarchitecture simulation, computer architecture, workload characterization and generation, performance evaluation, multicore architecture, and virtualization technologies. In 2005, he won first prize in the HUST Young Lecturers Teaching Contest, and in 2003, he won second prize in the teaching quality assessment of HUST. He is a member of IEEE and ACM.

    Huiling Zhang received her MS degree in signal and information processing from Southwest University, China, in 2011. She joined the Center for High-Performance Computing at Shenzhen Institutes of Advanced Technology and now works as a research assistant there. Her current research interests include high-performance computing, and machine learning and its applications in bioinformatics.

    Chengzhong Xu (czxu@wayne.edu) received his BS degree and MSc degree in computer science and engineering from Nanjing University in 1986 and 1989, respectively. He received his PhD degree from the University of Hong Kong in 1993. His research interests include computer architecture, distributed systems, virtualization, and cloud computing. Dr. Xu is a professor of electrical and computer engineering at Wayne State University. He is also the director of the Cloud and Internet Computer Laboratory at Wayne State University. He is an IEEE senior member and ACM member.

    Shengzhong Feng is a professor and deputy director of the Institute of Advanced Computing and Digital Engineering, Shenzhen Institutes of Advanced Technology. His research interests are parallel algorithms, grid computing, and bioinformatics. In particular, he is focused on developing novel, effective methods of modeling digital cities and applications. Before coming to SIAT, he worked in the Institute of Computing Technology, Chinese Academy of Sciences, and participated in research on the Dawning supercomputer. He graduated from the University of Science and Technology of China in 1991 and received his PhD from Beijing Institute of Technology in 1997.

    Zhenjiang Dong (dong.zhenjiang@zte.com.cn) is the vice president of the Communication Services R&D Institute for Cloud Computing and IT Operation, ZTE Corporation. He received his MS degree from Harbin Institute of Technology in 1996. His research interests include cloud computing, multimedia networking, and mobile networking.

    Hengsheng Zhang (zhang.hengsheng@zte.com.cn) received his bachelor’s degree from Anhui University, China. He joined ZTE in 2005 and is a pre-research engineer and senior architect. His research interests include value-added services and cloud computing.


Abstract: MapReduce is a programming model for processing large data sets, and Hadoop is the most popular open-source implementation of MapReduce. To achieve high performance, up to 190 Hadoop configuration parameters must be manually tuned. This is not only time-consuming but also error-prone. In this paper, we propose a new performance model based on random forest, a recently developed machine-learning algorithm. The model, called RFMS, predicts the performance of a Hadoop system from the system’s configuration parameters. RFMS is built from 2000 distinct fine-grained performance observations taken under different Hadoop configurations. We test RFMS against the measured performance of representative workloads from the Hadoop Micro-benchmark suite. The results show that the prediction accuracy of RFMS is 95% on average and reaches up to 99%. This new, highly accurate prediction model can be used to automatically optimize the performance of Hadoop systems.

Key words: big data, cloud computing, MapReduce, Hadoop, random forest, micro-benchmark
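
The idea summarized in the abstract, a random forest that maps configuration parameters to a predicted job time, can be illustrated with a minimal sketch. This is not the RFMS implementation: the three parameters (`sort_mb`, `reducers`, `sort_factor`), the `synthetic_runtime` formula, and the accuracy definition (1 minus mean relative error) are all assumptions made up for illustration, and the data is synthetic rather than measured on a Hadoop cluster.

```python
import random
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, n_try, rng, depth=0, max_depth=6, min_leaf=3):
    """Grow a regression tree; each split tests n_try randomly chosen features."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return fmean(y)  # leaf: mean target of the samples that reached it
    best = None
    for f in rng.sample(range(len(X[0])), n_try):
        for t in sorted({row[f] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return fmean(y)
    _, f, t = best
    lo = [i for i, row in enumerate(X) if row[f] <= t]
    hi = [i for i, row in enumerate(X) if row[f] > t]
    grow = lambda idx: build_tree([X[i] for i in idx], [y[i] for i in idx],
                                  n_try, rng, depth + 1, max_depth, min_leaf)
    return (f, t, grow(lo), grow(hi))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=20, n_try=2, seed=0):
    """Train each tree on a bootstrap sample of the observations."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], n_try, rng))
    return forest


def predict(forest, row):
    """Forest prediction: the average of the individual tree predictions."""
    return fmean([predict_tree(tree, row) for tree in forest])


def synthetic_runtime(sort_mb, reducers, sort_factor, rng):
    """Made-up job time in seconds; NOT measured Hadoop data."""
    return 200 + 400 / reducers + 8000 / sort_mb + 0.3 * sort_factor + rng.gauss(0, 5)


def make_dataset(n, seed):
    """Draw n hypothetical configurations and their synthetic job times."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        cfg = (rng.randrange(50, 401, 50), rng.randint(1, 16), rng.randrange(10, 101, 10))
        X.append(cfg)
        y.append(synthetic_runtime(*cfg, rng))
    return X, y


def accuracy(forest, X, y):
    """1 - mean relative error; an assumed reading of 'prediction accuracy'."""
    return 1 - fmean([abs(predict(forest, row) - yi) / yi for row, yi in zip(X, y)])


X_train, y_train = make_dataset(200, seed=1)
X_test, y_test = make_dataset(50, seed=2)
forest = fit_forest(X_train, y_train)
print(f"accuracy on held-out configurations: {accuracy(forest, X_test, y_test):.3f}")
```

Once trained, such a model can score candidate configurations without running the job, for example `predict(forest, (100, 4, 50))`, which is what makes it usable inside an automatic tuning loop.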
