ZTE Communications ›› 2012, Vol. 10 ›› Issue (4): 45-53.

• Research Paper

Parallel Web Mining System Based on Cloud Platform

Shengmei Luo1, Qing He2, Lixia Liu1, Xiang Ao2,3, Ning Li2,3, Fuzhen Zhuang2

    1. Pre-Research Department of ZTE, Nanjing 210012, China;
    2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    3. Graduate University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2012-07-27 Online: 2012-12-25 Published: 2012-12-25
  • About author: Shengmei Luo is chief architect at ZTE Corporation and a professor at Nanjing University of Posts and Telecommunications. He has been awarded prizes for scientific and technological progress, holds many patents, and has published papers in a number of core communication journals. He is a member of the China Cloud Computing Committee. He graduated from Harbin Institute of Technology in 1996 and has been involved in telecommunication network and services development and planning for many years.

    Qing He is a professor in the Institute of Computing Technology, Chinese Academy of Sciences. He is also a professor at the Graduate University of the Chinese Academy of Sciences. He received his BS degree in mathematics from Hebei Normal University, Shijiazhuang, in 1985. He received his MS degree in mathematics from Zhengzhou University in 1987. He received his PhD degree in fuzzy mathematics and AI from Beijing Normal University in 2000. From 1987 to 1997, he worked at Hebei University of Science and Technology. He is currently a doctoral supervisor at the Institute of Computing Technology, Chinese Academy of Sciences. His interests include data mining, machine learning, classification, and fuzzy clustering.

    Lixia Liu is a senior engineer in the Pre-Research Department of ZTE. She received her MS degree from Ocean University of China in 2008. Her research interests include natural language processing, text mining, data mining, machine learning, mathematical statistics, and cloud computing.

    Xiang Ao is a PhD candidate in the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include machine learning, data mining, and cloud computing.

    Ning Li is a PhD candidate in the Institute of Computing Technology, Chinese Academy of Sciences. Her research interests include machine learning, data mining, and cloud computing.

    Fuzhen Zhuang is an assistant professor in the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include machine learning, data mining, distributed classification and clustering, and natural language processing. He has published several papers in prestigious refereed journals and conference proceedings, such as IEEE Transactions on Knowledge and Data Engineering, Neurocomputing, ACM CIKM, SIAM SDM, and IEEE ICDM.
  • Supported by:
    This work is supported by the National Natural Science Foundation of China (No. 61175052, 60975039, 61203297, 60933004, 61035003) and the National High-tech R&D Program of China (863 Program) (No. 2012AA011003).

Abstract: Traditional machine-learning algorithms struggle to handle the exceedingly large amount of data generated by the internet. In real-world applications, there is an urgent need for machine-learning algorithms that can handle large-scale, high-dimensional text data. Cloud computing involves the delivery of computing and storage as a service to a heterogeneous community of recipients, and it has recently aroused much interest in industry and academia. Most previous work on cloud platforms focuses only on parallel algorithms for structured data. In this paper, we focus on the parallel implementation of web-mining algorithms and develop a parallel web-mining system that includes a parallel web crawler; parallel text extract, transform, and load (ETL) and modeling; and parallel text mining and application subsystems. The complete system enables a variety of real-world web-mining applications on massive data.

Key words: web mining, large scale, high volume, high dimension, cloud computing
