ZTE Communications ›› 2017, Vol. 15 ›› Issue (S2): 52-57. doi: 10.3969/j.issn.1673-5188.2017.S2.009

• Research Paper •

Random Forest Based Very Fast Decision Tree Algorithm for Data Stream

DONG Zhenjiang¹, LUO Shengmei¹, WEN Tao¹, ZHANG Fayang², LI Lingjuan²

  1. ZTE Corporation, Nanjing 210012, China
  2. School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Received: 2016-12-02 Online: 2017-12-25 Published: 2020-04-16
  • About author:
    DONG Zhenjiang (dong.zhenjiang@zte.com.cn) received his M.S. degree in telecommunication and electronics from Harbin Institute of Technology in 1996. He is the deputy head of the Service Institute of ZTE Corporation. His research interests include cloud computing and the mobile Internet.
    LUO Shengmei (luo.shengmei@zte.com.cn) received his M.S. degree in telecommunication and electronics from Harbin Institute of Technology in 1996. He is now a chief architect at ZTE Corporation. His research interests include big data, cloud computing, and network storage.
    WEN Tao (wen.tao1@zte.com.cn) received his M.S. degree in computer science from Nanjing University of Aeronautics and Astronautics, China. He is a senior pre-research engineer at the Cloud Computing & IT Institute of ZTE Corporation. His research interests include cloud computing and big data technologies.
    ZHANG Fayang (13041105@njupt.edu.cn) is pursuing his master's degree at the School of Computer, Nanjing University of Posts and Telecommunications, China. His research interests include cloud computing, data mining, and big data technologies.
    LI Lingjuan (lilj@njupt.edu.cn) is a full professor at the School of Computer, Nanjing University of Posts and Telecommunications, China. She received her Ph.D. degree in computer application technology from Soochow University, China. Her research interests include cloud computing, data mining, information security, and big data technologies.
  • Supported by:
    This work was supported by ZTE Industry-Academia-Research Cooperation Funds and the National Natural Science Foundation of China under Grants 61302158 and 61571238.

Abstract:

The Very Fast Decision Tree (VFDT) algorithm is a classification algorithm for data streams. When processing large amounts of data, VFDT requires less time than traditional decision tree algorithms. However, when fewer training samples are available, the labels assigned to VFDT leaf nodes contain more errors, and the classification ability of a single VFDT decision tree is limited. The Random Forest algorithm is an ensemble classifier with high prediction accuracy and good noise tolerance. It consists of multiple decision trees and can compensate for the shortcomings of a single decision tree. In this paper, to improve classification accuracy on data streams, the Random Forest algorithm is integrated into the tree-building process of the VFDT algorithm, and a new Random Forest based Very Fast Decision Tree algorithm, named RFVFDT, is designed. The RFVFDT algorithm adopts the decision tree building criterion of a Random Forest classifier and improves the Random Forest algorithm with a sliding window to cope with the unboundedness of data streams and to avoid processing delay and data loss. Experimental results on the classification of KDD CUP data sets show that the classification accuracy of the RFVFDT algorithm is higher than that of VFDT. The fewer the samples, the more obvious the advantage. RFVFDT is also fast when running in multi-thread mode.

Key words: data stream, data classification, Random Forest algorithm, VFDT algorithm
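
For readers unfamiliar with the building blocks named in the abstract, the following Python sketch may help. It is an illustration only and not the authors' RFVFDT implementation. It shows the Hoeffding bound that the standard VFDT algorithm uses to decide when a leaf has seen enough samples to split, majority voting as used by Random Forest classifiers, and a fixed-size sliding window over a stream. All function names, the window size, and the numeric values are assumptions chosen for the example.

```python
import math
from collections import Counter, deque


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon after n observations of a variable
    whose values span value_range, with confidence 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Standard VFDT split test: split a leaf once the gap between the
    two best attribute gains exceeds the Hoeffding bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


def majority_vote(predictions):
    """Random Forest style aggregation: the most frequent label wins."""
    return Counter(predictions).most_common(1)[0][0]


# Sliding window (hypothetical size 3) that keeps only the most recent
# samples of an unbounded stream; older samples are dropped automatically.
window = deque(maxlen=3)
for sample in range(5):
    window.append(sample)

# With few samples the bound is loose, so the same gain gap is not yet
# enough to justify a split; with more samples it is.
few = should_split(0.30, 0.10, value_range=1.0, delta=1e-7, n=100)
many = should_split(0.30, 0.10, value_range=1.0, delta=1e-7, n=1000)

# Three hypothetical trees vote on one sample.
label = majority_vote(["normal", "attack", "attack"])
```

In this sketch the bound shrinks as `n` grows, which is one way to see why leaf decisions made from few samples are less reliable and why combining several trees by voting can help.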