ZTE Communications, 2025, Vol. 23, Issue 4: 110-119. DOI: 10.12142/ZTECOM.202504012

• Research Papers •

A Root Cause Analysis Framework for Microservice Systems with Multimodal Data

LI Yingke1, HAN Jing2, SUN Yongqian1, SHI Binpeng1, GONG Zican2

  1. Nankai University, Tianjin 300071, China
  2. ZTE Corporation, Shenzhen 518057, China
  • Received: 2024-01-26 Online: 2025-12-25 Published: 2025-12-22
  • About author: LI Yingke received her BS degree in software engineering from the School of Information Engineering, Minzu University of China in 2018. She is currently pursuing her master’s degree at the College of Software, Nankai University, China. Her research interests include anomaly detection and failure diagnosis.
    HAN Jing (han.jing28@zte.com.cn) received her master’s degree from Nanjing University of Aeronautics and Astronautics, China. She has been with ZTE Corporation since 2000. From 2000 to 2016, she worked on key 3G/4G technologies; since 2016, she has been a technical director responsible for the intelligent operation of cloud platforms and wireless networks. Her research interests include machine learning, data mining, and signal processing.
    SUN Yongqian received his BS degree in statistics from Northwestern Polytechnical University, China in 2012, and his PhD degree in computer science from Tsinghua University, China in 2018. He is currently an assistant professor with the College of Software, Nankai University, China. His research focuses on anomaly detection, root cause analysis, and failure diagnosis in service management.
    SHI Binpeng received his BE degree in software engineering from the College of Software, Nankai University, China in 2023, where he is currently pursuing his master’s degree. His research interests include anomaly detection and failure diagnosis.
    GONG Zican received his master’s degree in professional computing and artificial intelligence from the Australian National University in 2019. He has been a machine learning engineer at ZTE Corporation since 2020. His research interests include machine learning, professional computing, and system architecture.
  • Supported by:
    ZTE Industry-University-Institute Cooperation Funds (HC-CN-20221123003)

Abstract:

In recent years, microservice architecture has gained increasing popularity. However, due to the complex and dynamically changing nature of microservice systems, failure detection has become more challenging. Traditional root cause analysis methods mostly rely on a single modality of data, which is insufficient to cover all failure information. Existing multimodal methods require collecting high-quality labeled samples and often struggle to classify unknown failure categories. To address these challenges, this paper proposes a root cause analysis framework based on a masked graph autoencoder (GAE). The framework has three main stages: feature extraction, GAE-based feature dimensionality reduction, and online clustering combined with expert input. Evaluated on two public datasets against two baseline methods, it shows significant advantages even when only 16% of the samples are labeled.

Key words: root cause analysis, multimodal data, self-supervised learning, online clustering
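
The last stage named in the abstract, online clustering combined with expert input, can be illustrated with a minimal sketch. The abstract does not specify the clustering rule, so everything below is an assumption made for illustration: the distance threshold, the running-mean centroid update, the `OnlineClusterer` class, the `ask_expert` callback, and the failure labels "cpu" and "network" are all hypothetical and are not taken from the paper. The sketch only shows the general idea that a low-dimensional failure embedding close to a known cluster inherits that cluster's label, while an embedding far from every cluster opens a new one and triggers a single expert query.

```python
import math
from typing import Callable, List, Sequence, Tuple


class OnlineClusterer:
    """Threshold-based online clustering with expert labels (illustrative only)."""

    def __init__(self, threshold: float,
                 ask_expert: Callable[[Sequence[float]], str]):
        self.threshold = threshold    # hypothetical distance cut-off
        self.ask_expert = ask_expert  # hypothetical stand-in for a human expert
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []
        self.labels: List[str] = []

    def assign(self, z: Sequence[float]) -> Tuple[int, str]:
        """Assign one failure embedding z to a cluster; return (index, label)."""
        if self.centroids:
            dists = [math.dist(z, c) for c in self.centroids]
            k = min(range(len(dists)), key=dists.__getitem__)
            if dists[k] <= self.threshold:
                # Known failure type: update the running mean, reuse the label.
                self.counts[k] += 1
                n = self.counts[k]
                self.centroids[k] = [c + (x - c) / n
                                     for c, x in zip(self.centroids[k], z)]
                return k, self.labels[k]
        # Unknown failure type: open a new cluster and query the expert once.
        self.centroids.append(list(z))
        self.counts.append(1)
        self.labels.append(self.ask_expert(z))
        return len(self.centroids) - 1, self.labels[-1]


# Toy usage with made-up 2-D embeddings and a scripted "expert".
queries = []


def expert(z: Sequence[float]) -> str:
    queries.append(tuple(z))
    return "cpu" if z[0] < 1.0 else "network"


oc = OnlineClusterer(threshold=0.5, ask_expert=expert)
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
results = [oc.assign(z) for z in stream]
print(results)
print("expert queries:", len(queries))
```

In this toy stream the expert is consulted only when a new cluster is opened, which is one way a method of this kind could keep the number of labeled samples small. The actual rule used in the paper may differ.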