ZTE Communications, 2025, Vol. 23, Issue (4): 110-119. DOI: 10.12142/ZTECOM.202504012
Research Papers
LI Yingke1, HAN Jing2, SUN Yongqian1, SHI Binpeng1, GONG Zican2
Received: 2024-01-26
Online: 2025-12-25
Published: 2025-12-22
About author: LI Yingke received her BS degree in software engineering from the School of Information Engineering, Minzu University of China in 2018. She is currently pursuing her master’s degree at the College of Software, Nankai University, China. Her research interests include anomaly detection and failure diagnosis.
Cite this article: LI Yingke, HAN Jing, SUN Yongqian, SHI Binpeng, GONG Zican. A Root Cause Analysis Framework for Microservice Systems with Multimodal Data [J]. ZTE Communications, 2025, 23(4): 110-119.
URL: https://zte.magtechjournal.com/EN/10.12142/ZTECOM.202504012
| Dataset | D1 Training Set | D1 Testing Set | D2 Training Set | D2 Testing Set |
|---|---|---|---|---|
| Number of failure cases | 139 | 35 | 127 | 32 |
| Number of system microservices | 10 | 10 | 9 | 9 |
| Number of microservice instances | 20 | 20 | 18 | 18 |
| Number of involved failure categories | 5 | 5 | 5 | 5 |
Table 1 Datasets’ details
| Approach | D1 Precision | D1 Recall | D1 F1-Score | D1 Labeled Ratio | D2 Precision | D2 Recall | D2 F1-Score | D2 Labeled Ratio |
|---|---|---|---|---|---|---|---|---|
| DiagFusion | 0.955 | 0.239 | 0.382 | 0.161 | 0.879 | 0.424 | 0.572 | 0.163 |
| CloudRCA | 0.160 | 0.154 | 0.191 | 0.161 | 0.382 | 0.367 | 0.361 | 0.163 |
| Our method | 0.833 | 0.879 | 0.851 | 0.161 | 0.839 | 0.824 | 0.824 | 0.163 |
Table 2 Comparison of proposed framework with two baselines
| Approach (22 AIOps Dataset) | Precision | Recall | F1-Score |
|---|---|---|---|
| OM w/o trace data | 0.733 | 0.799 | 0.754 |
| OM w/o log data | 0.710 | 0.747 | 0.713 |
| OM w/o metric data | 0.256 | 0.506 | 0.340 |
| OM w/o GAE | 0.738 | 0.678 | 0.625 |
| OM | 0.833 | 0.879 | 0.851 |
Table 3 Verification of main part of proposed framework without trace data, metric data, log data, or GAE
[1] YI X X, ZHANG N H, LIU Y C, et al. Key technologies for intelligent computing power network [J]. ZTE technology journal, 2025, 31(2): 31–38. DOI: 10.12142/ZTETJ.202502005
[2] WU H Q. Reflections on AI-empowered network reconstruction [J]. ZTE technology journal, 2025, 31(1): 1–3. DOI: 10.12142/ZTETJ.202501001
[3] ZHANG S L, XIA S B, FAN W Z, et al. Failure diagnosis in microservice systems: a comprehensive survey and analysis [J]. ACM transactions on software engineering and methodology, 2025, (Just Accepted). DOI: 10.1145/3715005
[4] JIN M X, LV A R, ZHU Y P, et al. An anomaly detection algorithm for microservice architecture based on robust principal component analysis [J]. IEEE access, 2020, 8: 226397–226408
[5] YU G B, CHEN P F, CHEN H Y, et al. MicroRank: end-to-end latency issue localization with extended spectrum analysis in microservice environments [C]//Proceedings of the Web Conference 2021. ACM, 2021: 3087–3098. DOI: 10.1145/3442381.3449905
[6] LIN Q W, ZHANG H Y, LOU J G, et al. Log clustering based problem identification for online service systems [C]//The 38th International Conference on Software Engineering Companion (ICSE-C). IEEE, 2016: 102–111
[7] YUAN Y, SHI W C, LIANG B, et al. An approach to cloud execution failure diagnosis based on exception logs in OpenStack [C]//Proceedings of IEEE 12th International Conference on Cloud Computing (CLOUD). IEEE, 2019: 124–131. DOI: 10.1109/cloud.2019.00031
[8] ZHANG X, XU Y, QIN S, et al. Onion: identifying incident-indicating logs for cloud systems [C]//The 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2021: 1253–1263. DOI: 10.1145/3468264.3473919
[9] DU M, LI F F, ZHENG G N, et al. DeepLog: anomaly detection and diagnosis from system logs through deep learning [C]//Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017: 1285–1298. DOI: 10.1145/3133956.3134015
[10] MA M, LIN W L, PAN D S, et al. Self-adaptive root cause diagnosis for large-scale microservice architecture [J]. IEEE transactions on services computing, 2022, 15(3): 1399–1410. DOI: 10.1109/TSC.2020.2993251
[11] LIN J J, CHEN P F, ZHENG Z B. Microscope: pinpoint performance issues with causal graphs in micro-service environments [M]//Service-oriented computing. Cham: Springer International Publishing, 2018: 3–20. DOI: 10.1007/978-3-030-03596-9_1
[12] MA M, XU J M, WANG Y, et al. AutoMAP: diagnose your microservice-based web applications automatically [C]//Proceedings of The Web Conference 2020. ACM, 2020: 246–258. DOI: 10.1145/3366423.3380111
[13] ZHANG S, JIN P, LIN Z, et al. Robust root cause analysis of microservice system through multimodal data [J]. IEEE transactions on services computing, 2023: 1–14
[14] SUN Y Q, SHI B P, MAO M Y, et al. ART: a unified unsupervised framework for incident management in microservice systems [C]//Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. ACM, 2024: 1183–1194. DOI: 10.1145/3691620.3695495
[15] SUN Y Q, LIN Z H, SHI B P, et al. Interpretable failure localization for microservice systems based on graph autoencoder [J]. ACM transactions on software engineering and methodology, 2025, 34(2): 52. DOI: 10.1145/3695999
[16] LEE C, YANG T Y, CHEN Z B, et al. Eadro: an end-to-end troubleshooting framework for microservices on multi-source data [C]//The 45th International Conference on Software Engineering (ICSE). IEEE, 2023: 1750–1762. DOI: 10.1109/ICSE48619.2023.00150
[17] ZHANG Y Y, GUAN Z X, QIAN H J, et al. CloudRCA: a root cause analysis framework for cloud computing platforms [C]//The 30th ACM International Conference on Information & Knowledge Management. ACM, 2021: 4373–4382. DOI: 10.1145/3459637.3481903
[18] WANG C, PAN S R, LONG G D, et al. MGAE: marginalized graph autoencoder for graph clustering [C]//Conference on Information and Knowledge Management. ACM, 2017: 889–898
[19] HOU C J, JIA T, WU Y F, et al. Diagnosing performance issues in microservices with heterogeneous data source [C]//Proceedings of IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking. IEEE, 2021: 493–500. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00074
[20] HOU Z Y, LIU X, CEN Y K, et al. GraphMAE: self-supervised masked graph autoencoders [C]//The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 2022: 594–604. DOI: 10.1145/3534678.3539321
[21] SIGELMAN B H, BARROSO L A, BURROWS M, et al. Dapper, a large-scale distributed systems tracing infrastructure [EB/OL]. [2024-01-26]
[22] Opentracing. Opentracing [EB/OL]. [2024-01-26].
[23] SIFFER A, FOUQUE P A, TERMIER A, et al. Anomaly detection in streams with extreme value theory [C]//The 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017: 1067–1075. DOI: 10.1145/3097983.3098144
[24] KIPF T N, WELLING M. Variational graph auto-encoders [EB/OL]. (2016-11-21) [2024-01-26].
[25] HE P J, ZHU J M, ZHENG Z B, et al. Drain: an online log parsing approach with fixed depth tree [C]//International Conference on Web Services (ICWS). IEEE, 2017: 33–40. DOI: 10.1109/ICWS.2017.13
[26] PARK J, LEE M, CHANG H J, et al. Symmetric graph convolutional autoencoder for unsupervised graph representation learning. (2018-09-27) [2024-01-26].