ZTE Communications ›› 2013, Vol. 11 ›› Issue (2): 30-37. doi: 10.3969/j.issn.1673-5188.2013.02.005

• Special Topic •

SPBD: Streamlining Big-Data Processing in Cloud Environments

Tung Nguyen, Jingwen Zhang, and Weisong Shi   

  1. Department of Computer Science, Wayne State University, Detroit, MI 48202, USA
  • Received:2013-04-30 Online:2013-06-25 Published:2013-06-25
  • About author: Tung Nguyen (tnguyen@i-a-i.com) is a research scientist at Intelligent Automation Inc. He plays a key role in many projects on data-intensive distributed processing, cloud computing, and massive data analysis using topological features. Dr. Nguyen received his PhD degree from Wayne State University in 2012. He received his BS and MS degrees in computer science and engineering from Ho Chi Minh City University of Technology, Vietnam, in 2001 and 2006. His research interests include green computing, cloud computing, data-intensive computing, and the application of cloud computing to life sciences. He has published several papers on computer science and bioinformatics, with work appearing in the proceedings of OSDI and in the NPC, SUSCOM, and BMC Frontiers Genetics journals. He has also been a peer reviewer for many conferences, including Euro-Par and CollaborateCom. His homepage is http://www.cs.wayne.edu/tung/.

    Jingwen Zhang (jingwen.zhang@wayne.edu) received her BS degree in computer science from Xidian University, China. She is currently a PhD student in the Department of Computer Science, Wayne State University. Her research interests include cloud computing and big-data analysis.

    Weisong Shi (weisong@wayne.edu) is an associate professor of computer science at Wayne State University. He received his BS degree in computer engineering from Xidian University in 1995. He received his PhD degree in computer engineering from the Chinese Academy of Sciences in 2000. His research interests include computer systems, mobile computing, and cloud computing. Dr. Shi has published 120 peer-reviewed journal and conference papers and has an H-index of 24. He has been the program chair and technical program committee member of numerous international conferences, including WWW and ICDCS. In 2002, he received the NSF CAREER award for outstanding PhD dissertation (China). In 2009, he received the Career Development Chair Award of Wayne State University. He has also won "Best Paper Award" at ICWE’04, IPDPS’05, HPCChina’12, and IISWC’12.

Abstract: Many applications, such as those in genomics, are designed for one machine. This is not problematic if the input data set is small and can fit into the memory of a single powerful machine. However, the application and its algorithms are limited by the capacity and performance of that machine because the application cannot run in parallel, so a single machine cannot handle very large data sets. In recent research, cloud computing and MapReduce have been used together to store and process big data. There are three main steps in handling data in the cloud: 1) the user uploads the data, 2) the data is processed, and 3) the results are returned. When the data reaches a certain scale, transmission time becomes the dominant factor; however, most research to date has focused only on reducing processing time. It is also generally assumed that the data is already stored in the cloud. This assumption does not hold because many organizations now store their data locally. In this paper, we propose SPBD (pronounced "speed") to minimize overall user wait time. We formulate the overall processing time as an optimization problem and derive the optimal solution. When evaluated on our private cloud platform, SPBD reduces user wait time by up to 34% for a traditional WordCount application and up to 31% for a metagenomic application.

Key words: big data, genomics, NGS, MapReduce, cloud
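
The three-step view in the abstract (upload, process, return results) can be illustrated with a small wait-time model. The sketch below is not SPBD's actual formulation, which this page does not give. It is a generic two-stage pipeline model under assumed conditions: the input is split into equal chunks, upload and processing of different chunks overlap perfectly, and results are returned only at the end. The function names and parameters are hypothetical.

```python
def sequential_wait(upload_s: float, process_s: float, download_s: float) -> float:
    """User wait time when the three steps run strictly one after another."""
    return upload_s + process_s + download_s


def pipelined_wait(upload_s: float, process_s: float, download_s: float,
                   n_chunks: int) -> float:
    """User wait time when upload and processing overlap chunk by chunk.

    Assumption (not from the paper): n_chunks equal-sized chunks, no
    per-chunk overhead, and results returned only after the last chunk
    has been processed.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    u = upload_s / n_chunks    # upload time per chunk
    p = process_s / n_chunks   # processing time per chunk
    # Classic two-stage pipeline: the first chunk passes through both
    # stages, then each remaining chunk costs the slower stage's time.
    overlapped = u + p + (n_chunks - 1) * max(u, p)
    return overlapped + download_s


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from the paper.
    print(sequential_wait(100, 60, 10))
    print(pipelined_wait(100, 60, 10, n_chunks=10))
```

In this toy model the saving grows with the number of chunks but is bounded by the faster of the two stages, which is one reason transmission time matters once it dominates processing time. A real system would also incur per-chunk overhead, so more chunks are not always better.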