ZTE Communications ›› 2013, Vol. 11 ›› Issue (4): 18-26. doi: 10.3939/j.issn.1673-5188.2013.04.003

• Special Topic •

MapReduce in the Cloud: Data-Location-Aware VM Scheduling

Tung Nguyen and Weisong Shi   

  1. Department of Computer Science, Wayne State University, Detroit, MI 48202, USA
  • Received: 2013-04-22 Online: 2013-12-25 Published: 2013-12-25
  • About author: Tung Nguyen (tnguyen@i-a-i.com) is a research scientist at Intelligent Automation Inc. He plays a key role in many projects on data-intensive distributed processing, cloud computing, and mass-data analysis using topological features. Dr. Nguyen received his PhD degree from Wayne State University in 2012. He received his BS and MS degrees in computer science and engineering from Ho Chi Minh City University of Technology, Vietnam, in 2001 and 2006. His research interests include green computing, cloud computing, data-intensive computing, and application of cloud computing to life sciences. He has published several papers on computer science and bioinformatics and has been published in the proceedings of OSDI and in NPC, SUSCOM, and BMC Frontiers Genetics journals. He has also been a peer reviewer at many conferences, including Euro-Par and CollaborateCom. His homepage is http://www.cs.wayne.edu/tung/

    Weisong Shi (weisong@wayne.edu) is an associate professor of computer science at Wayne State University. He received his BS degree in computer engineering from Xidian University in 1995. He received his PhD degree in computer engineering from the Chinese Academy of Sciences in 2000. His research interests include computer systems, mobile computing, and cloud computing. Dr. Shi has published 120 peer-reviewed journal and conference papers and has an H-index of 24. He has been the program chair and technical program committee member of numerous international conferences, including WWW and ICDCS. In 2002, he received the NSF CAREER award for outstanding PhD dissertation (China). In 2009, he received the Career Development Chair Award of Wayne State University. He has also won the Best Paper Award at ICWE’04, IPDPS’05, HPCChina’12, and IISWC’12.

Abstract: We have witnessed the fast-growing deployment of Hadoop, an open-source implementation of the MapReduce programming model, for data-intensive computing in the cloud. However, Hadoop was not originally designed to run transient jobs in which users need to move data back and forth between storage and computing facilities. As a result, Hadoop is inefficient and wastes resources when operating in the cloud. This paper discusses the inefficiency of MapReduce in the cloud. We study the causes of this inefficiency and propose a solution. The inefficiency arises mainly during data movement: transferring large data sets to computing nodes is very time-consuming and also violates the rationale of Hadoop, which is to move computation to the data. To address this issue, we developed a distributed cache system and a virtual machine scheduler. We show that our prototype can significantly improve performance when running different applications.

Key words: cloud, MapReduce, VM scheduling, data location, Hadoop