ZTE Communications

The whole issue of ZTE Communications June 2013, Vol. 11 No. 2

2013, 11(2): 0.

Big Data: Where Dreams Take Flight

Chengzhong Xu and Zhibin Yu

2013, 11(2): 1-2.

From academia to industry, big data has become a buzzword in information technology. The US Federal Government is paying much attention to the big-data revolution. In 2012, fourteen US government departments allocated funds to 87 big-data projects [1]. Europe has the second largest amount of data [2], and most universities and research institutes have already established big-data research programs. In Asia, especially in China, central and local governments have been setting aside funds for their own big-data programs. The big-data related 973 Projects in China are good examples of this. Industry players have been following in the footsteps of big-data pioneers such as Google, Facebook, Twitter, and Baidu, and more and more companies are rushing into the big-data business. Companies have been analyzing the purchasing behavior of huge numbers of customers and have been devising more attractive plans and policies. Big data is already an important part of the $64 billion database and data analytics market [3]. Indeed, big data will open up commercial opportunities comparable in scale to those created by enterprise software of the late 1980s, the internet of the 1990s, and the social media explosion today.
However, what is big data? It has been defined in many different ways. We prefer to define big data as data sets that are too big for current information technologies to capture, transmit, store, process, or visualize. Although this definition is simple, it encompasses computational complexity theory, computer architecture, operating systems, programming models, database technologies, algorithms, and applications. People from different fields have dramatically different understandings of big data, which is why there is so much excitement and conjecture surrounding it.
In this special issue, we present papers that discuss big-data technology from different perspectives. These are not only high-level surveys but also reports on initial results from big-data projects. Communication infrastructure is one of the most important aspects of big data. Yi Zhu and Zhengkun Mi from Nanjing University of Posts and Telecommunications discuss content-centric networking, which is seen as a promising approach to big-data distribution. They propose a networking architecture for processing big data, and this architecture is fundamentally different from TCP/IP. Shengmei Luo et al. from the Cloud Computing & IT Institute of ZTE Corporation present a survey of big-data analytics. They analyze challenges related to storage, data-mining algorithms, and programming models for big data. They also predict opportunities in the big-data era. Although there are many potential business opportunities in big data, security is of the utmost importance for users and cannot be overlooked. Ruixuan Li et al. from Huazhong University of Science and Technology provide an overview of data security and privacy preservation for cloud storage. They carefully investigate confidentiality, data integrity, and data availability. They also propose a feasible solution to current security problems. Shigang Chen et al. from the University of Florida delve more deeply into data integrity. They propose a novel authenticated data structure called Cloud Merkle B+ tree (CMBT) that supports dynamic operations such as insertion, deletion, and modification. CMBT lowers worst-case overhead from O(n) to O(log n).
Moving to big data applications, algorithms oriented towards a single machine are not necessarily efficient in big-data platforms because many machines need to run concurrently for the same task. Weisong Shi et al. from Wayne State University design a mechanism called SPBD that reduces the response time of big-data systems. This mechanism is very feasible in practice. Zhendong Bei et al. report their experiences with big-data applications that use MapReduce/Hadoop. They confirm that manually tuning up to 190 Hadoop configuration parameters is extremely time consuming, if at all possible. They then propose an automatic performance prediction scheme based on random forest to determine the best configuration parameter combinations. Their experimental results show that their scheme can predict the performance of Hadoop systems very accurately.
Challenges and opportunities exist together in the big-data era. We believe most of these challenges will be overcome and opportunities will be realized. Big data is a field where dreams will take flight.

Content Centric Networking: A New Approach to Big Data Distribution

Yi Zhu and Zhengkun Mi

2013, 11(2): 3-10. doi: 10.3969/j.issn.1673-5188.2013.02.001

In this paper, we explore network architecture and key technologies for content-centric networking (CCN), an emerging networking technology in the big-data era. We describe the structure and operation mechanism of a CCN node. Then we discuss mobility management, routing strategy, and caching policy in CCN. For better network performance, we propose a probability cache replacement policy that is based on content popularity. We also propose and evaluate a probability cache with evicted copy-up decision policy.
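The idea of a popularity-aware, probabilistic replacement decision can be illustrated with a minimal sketch. This is not the policy proposed in the paper, whose details are not given in the abstract. The class name `PopularityCache`, the request-count popularity estimate, and the admission rule (admit a new item with probability proportional to its popularity relative to the least popular cached item) are all assumptions made for illustration.

```python
import random
from collections import Counter


class PopularityCache:
    """Toy content store: hypothetical popularity-weighted probabilistic replacement."""

    def __init__(self, capacity, rng=None):
        # capacity must be at least 1; rng can be injected for reproducible runs
        self.capacity = capacity
        self.rng = rng or random.Random()
        self.store = set()          # names of cached content objects
        self.requests = Counter()   # request count per name, used as popularity

    def request(self, name):
        """Record a request for `name`; return True on a cache hit."""
        self.requests[name] += 1
        if name in self.store:
            return True
        if len(self.store) < self.capacity:
            self.store.add(name)
            return False
        # Cache is full: the eviction candidate is the least popular cached item.
        victim = min(self.store, key=lambda n: self.requests[n])
        # Assumed rule: admit with probability given by relative popularity.
        p_admit = self.requests[name] / (self.requests[name] + self.requests[victim])
        if self.rng.random() < p_admit:
            self.store.remove(victim)
            self.store.add(name)
        return False
```

With this rule, a rarely requested object seldom displaces a popular one, while an object that keeps being requested is eventually admitted. Other popularity estimates (for example, a decayed count) could be substituted without changing the structure.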

Big-Data Analytics: Challenges, Key Technologies and Prospects

Shengmei Luo, Zhikun Wang, and Zhiping Wang

2013, 11(2): 11-17. doi: 10.3969/j.issn.1673-5188.2013.02.002

With the rapid development of the internet, internet of things, mobile internet, and cloud computing, the amount of data in circulation has grown rapidly. More social information has contributed to the growth of big data, and data has become a core asset. Big data is challenging in terms of effective storage, efficient computation and analysis, and deep data mining. In this paper, we discuss the significance of big data and examine key technologies and problems in big-data analytics. We also consider the future prospects of big-data analytics.

Data Security and Privacy in Cloud Storage

Xinhua Dong, Ruixuan Li, Wanwan Zhou, Dongjie Liao, and Shuoyi Zhao

2013, 11(2): 18-23. doi: 10.3969/j.issn.1673-5188.2013.02.003

In this paper, we survey data security and privacy problems created by cloud storage applications and propose a cloud storage security architecture. We discuss state-of-the-art techniques for ensuring the privacy and security of data stored in the cloud. We discuss policies for access control and data integrity, availability, and privacy. We also discuss several key solutions proposed in current literature and point out future research directions.

An Efficient Dynamic Proof of Retrievability Scheme

Zhen Mo, Yian Zhou, and Shigang Chen

2013, 11(2): 24-29. doi: 10.3969/j.issn.1673-5188.2013.02.004

Data security is a significant issue in cloud storage systems. After outsourcing data to cloud servers, clients lose physical control over the data. To guarantee clients that their data is intact on the server side, some mechanism is needed for clients to periodically check the integrity of their data. Proof of retrievability (PoR) is designed to ensure data integrity. However, most prior PoR schemes focus on static data, and existing dynamic PoR is inefficient. In this paper, we propose a new version of dynamic PoR that is based on a B+ tree and a Merkle hash tree. We propose a novel authenticated data structure, called Cloud Merkle B+ tree (CMBT). By combining CMBT with the BLS signature, dynamic operations such as insertion, deletion, and modification are supported. Compared with existing PoR schemes, our scheme improves worst-case overhead from O(n) to O(log n).
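The logarithmic cost comes from the tree structure: checking one block requires only the hashes along its path to the root. A minimal sketch of a plain binary Merkle hash tree is given here to show that property. It is not the CMBT structure and omits the B+ tree and the BLS signature. The function names `build_levels`, `prove`, and `verify`, the choice of SHA-256, and the rule of pairing an odd node with itself are assumptions made for illustration.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(blocks):
    """Return all tree levels, leaf hashes first and the root level last."""
    level = [_h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # odd count: pair the last node with itself
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs from the leaf up to the root."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1                # neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(block, proof, root):
    """Recompute the root from one block and its sibling path."""
    node = _h(block)
    for sibling, is_left in proof:
        node = _h(sibling + node) if is_left else _h(node + sibling)
    return node == root
```

For n blocks the proof holds about log2(n) hashes, so a client that keeps only the root can check any single block without downloading the others.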

SPBD: Streamlining Big-Data Processing in Cloud Environments

Tung Nguyen, Jingwen Zhang, and Weisong Shi

2013, 11(2): 30-37. doi: 10.3969/j.issn.1673-5188.2013.02.005

Many applications, such as those in genomics, are designed for one machine. This is not problematic if the input data set is small and can fit into the memory of a single powerful machine. However, the application and its algorithms are limited by the capacity and performance of the machine (the application cannot run in parallel). A single machine cannot handle very large data sets. In recent research, cloud computing and MapReduce have been used together to store and process big data. There are three main steps in handling data in the cloud: 1) the user uploads the data, 2) the data is processed, and 3) results are returned. When the size of the data reaches a certain scale, transmission time becomes the dominant factor; however, most research to date has focused only on reducing the processing time. Also, it is generally assumed that the data is already stored in the cloud. This assumption does not hold because many organizations now store their data locally. In this paper, we propose SPBD (pronounced "speed") to minimize overall user wait time. We abstract overall processing time as an optimization problem and derive the optimal solution. When evaluated on our private cloud platform, SPBD is shown to reduce user wait time by up to 34% for a traditional WordCount application and up to 31% for a metagenomic application.

A Hadoop Performance Prediction Model Based on Random Forest

Zhendong Bei, Zhibin Yu, Huiling Zhang, Chengzhong Xu, Shenzhong Feng, Zhenjiang Dong, and Hengsheng Zhang

2013, 11(2): 38-44. doi: 10.3969/j.issn.1673-5188.2013.02.006

MapReduce is a programming model for processing large data sets, and Hadoop is the most popular open-source implementation of MapReduce. To achieve high performance, up to 190 Hadoop configuration parameters must be manually tuned. This is not only time-consuming but also error-prone. In this paper, we propose a new performance model based on random forest, a recently developed machine-learning algorithm. The model, called RFMS, is used to predict the performance of a Hadoop system according to the system's configuration parameters. RFMS is created from 2000 distinct fine-grained performance observations with different Hadoop configurations. We test RFMS against the measured performance of representative workloads from the Hadoop Micro-benchmark suite. The results show that the prediction accuracy of RFMS reaches 95% on average and up to 99%. This new, highly accurate prediction model can be used to automatically optimize the performance of Hadoop systems.

Parallel Spectral Clustering Based on MapReduce

Qiwei Zhong, Yunlong Lin, Junyang Zou, Kuangyan Zhu, Qiao Wang, and Lei Hu

2013, 11(2): 45-50. doi: 10.3969/j.issn.1673-5188.2013.02.007

Clustering is one of the most widely used techniques for exploratory data analysis. Spectral clustering, a popular modern clustering algorithm, has been shown to be more effective in detecting clusters than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. As the size of databases soars, clustering algorithms face scalability problems in both computational time and memory use. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and data storage are distributed, which solves the scalability problems of most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. It is shown that the proposed implementation scales well, speeds up the clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.

Spam Filtering: Online Naive Bayes Based on TONE

Guanglu Sun, Hongyue Sun, Yingcai Ma, and Yuewu Shen

2013, 11(2): 51-54. doi: 10.3969/j.issn.1673-5188.2013.02.008

The naive Bayes (NB) model has been successfully used to tackle spam, and is very accurate. However, there is still room for improvement. We use a train on or near error (TONE) method in online NB to enhance the performance of NB and reduce the number of training emails. We conducted an experiment to determine the performance of the improved algorithm by plotting (1-ROCA)% curves. The results show that the proposed method improves the performance of original NB.
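The TONE idea can be shown with a minimal sketch: the online filter updates its counts only when a message is misclassified or when its score falls close to the decision boundary, which is how the number of training emails is reduced. This is not the authors' implementation. The class name `OnlineNB`, whitespace tokenization, Laplace smoothing, the 0.5 decision threshold, and the margin of 0.25 are all assumptions made for illustration.

```python
import math
from collections import Counter


class OnlineNB:
    """Toy online naive Bayes spam filter with train-on-or-near-error updates."""

    def __init__(self, margin=0.25):
        self.margin = margin                              # assumed "near error" band
        self.tokens = {True: Counter(), False: Counter()}  # word counts per class
        self.totals = {True: 0, False: 0}                  # total words per class
        self.docs = {True: 0, False: 0}                    # trained messages per class
        self.trained = 0

    def spam_probability(self, words):
        vocab = len(set(self.tokens[True]) | set(self.tokens[False])) + 1
        score = math.log((self.docs[True] + 1) / (self.docs[False] + 1))
        for w in words:
            p_spam = (self.tokens[True][w] + 1) / (self.totals[True] + vocab)
            p_ham = (self.tokens[False][w] + 1) / (self.totals[False] + vocab)
            score += math.log(p_spam / p_ham)
        score = max(-50.0, min(50.0, score))               # keep exp() in range
        return 1.0 / (1.0 + math.exp(-score))

    def _train(self, words, is_spam):
        self.tokens[is_spam].update(words)
        self.totals[is_spam] += len(words)
        self.docs[is_spam] += 1
        self.trained += 1

    def observe(self, text, is_spam):
        """Classify one message, then train only on or near an error."""
        words = text.split()
        p = self.spam_probability(words)
        predicted = p >= 0.5
        if predicted != is_spam or abs(p - 0.5) < self.margin:
            self._train(words, is_spam)
        return predicted
```

Messages that the filter already classifies correctly and with a comfortable margin are skipped, so on a repetitive stream the model stops training after the first few examples.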

A System for Detecting Refueling Behavior along Freight Trajectories and Recommending Refueling Alternatives

Ye Li, Fan Zhang, Bo Gan, and Chengzhong Xu

2013, 11(2): 55-62. doi: 10.3969/j.issn.1673-5188.2013.02.009

Smart refueling can reduce costs and lower the possibility of an emergency. Refueling intelligence can only be obtained by mining historical refueling behaviors from big data; however, without devices, such as fuel tank cursors, and cooperation from drivers, these behaviors are hard to detect. Thus, detecting refueling behaviors in big data derived from easily obtained trajectories is one of the most efficient ways to gather evidence for research on refueling behaviors. In this paper, we describe a complete procedure for detecting refueling behavior in big data derived from freight trajectories. This procedure involves the integration of spatial data mining and machine-learning techniques. The key part of the methodology is a pattern detector that extends the naive Bayes classifier. By drawing on the spatial and temporal characteristics of freight trajectories, refueling behaviors can be identified with high accuracy. Further, we present a refueling prediction and recommendation system to show how our refueling detector can be used practically in big data. Our experiments on real trajectories show that our refueling detector is accurate, and the system performs well.
