ZTE Communications, 2022, Vol. 20, Issue (3): 77-84. DOI: 10.12142/ZTECOM.202203010
LYU Xiaomeng1, CHEN Hao1, WU Zhenyu1, HAN Junhua2, GUO Huifeng2
Received: 2021-10-31
Online: 2022-09-13
Published: 2022-09-14
LYU Xiaomeng, CHEN Hao, WU Zhenyu, HAN Junhua, GUO Huifeng. Alarm-Based Root Cause Analysis Based on Weighted Fault Propagation Topology for Distributed Information Network[J]. ZTE Communications, 2022, 20(3): 77-84.
Timestamp | System | Host | Alarm content | Is_root |
---|---|---|---|---|
2019/6/14 1:14 | SYS_5 | Host_14 | I/O wait load exceeds 10% for 15 minutes | 0 |
2019/6/14 1:14 | SYS_4 | Host_9 | The log generates ERROR information | 0 |
2019/6/14 1:14 | SYS_9 | Host_92 | On CPU Steal Time lasts 5 minutes over 10% | 0 |
2019/6/14 1:14 | SYS_9 | Host_75 | Free swap space is less than 50% | 0 |
2019/6/14 1:14 | SYS_5 | Host_60 | The communication on port 80 is abnormal | 1 |
2019/6/14 1:14 | SYS_5 | Host_76 | The upper I/O wait load is greater than 50% | 0 |
2019/6/14 1:14 | SYS_4 | Host_23 | Ping packet loss rate is 100%, and the server breaks down | 0 |
2019/6/14 1:14 | SYS_9 | Host_75 | The Slot00 status of the hard disk is failed | 0 |
2019/6/14 1:14 | SYS_9 | Host_60 | Number of FullGC: 32 (greater than threshold: 10) | 0 |
2019/6/14 1:14 | SYS_5 | Host_97 | Average heap memory usage: 94.61% (greater than threshold: 90%) | 0 |
2019/6/14 1:14 | SYS_5 | Host_32 | Average FullGC time: 2 118 ms (greater than threshold: 1 000 ms) | 0 |
2019/6/14 1:14 | SYS_4 | Host_3 | Nic traffic unknown | 0 |
Table 1 Examples of alarms generated during a fault in Dataset A
Timestamp | NE | Duration | System Type | Code | Severity | Alarm Type | Root |
---|---|---|---|---|---|---|---|
2020/2/27 10:01 | 4 167 | 1 000 | 4 198 | 964 | 1 | 0 | 0 |
2020/2/27 10:01 | 4 715 | 12 000 | 4 590 | 18 956 | 2 | 3 | 0 |
2020/2/27 10:01 | 4 167 | 15 000 | 4 197 | 43 | 4 | 0 | 1 |
2020/2/27 10:01 | 4 167 | 11 000 | 4 590 | 18 956 | 4 | 3 | 0 |
2020/2/27 10:01 | 4 166 | 5 000 | 4 590 | 18 956 | 3 | 3 | 0 |
2020/2/27 10:01 | 4 595 | 10 000 | 4 198 | 964 | 1 | 0 | 0 |
2020/2/27 10:01 | 5 496 | 6 000 | 4 590 | 18 956 | 3 | 1 | 0 |
2020/2/27 10:01 | 5 497 | 5 000 | 4 197 | 43 | 4 | 4 | 0 |
Table 2 Examples of alarms generated during a fault in Dataset B
Feature | Meaning |
---|---|
information entropy | Average IDF of the system node |
max_number/min | Maximum number of alarms per minute |
node_num | Number of nodes in the same system |
alert_count | Total number of alarms |
alert_type_num | Number of alarm types |
start_time | Relative start time |
time_duration | Time span (minutes) |
serious_num | Number of serious type alarms |
Table 3 Features used for feature importance analysis in Dataset A
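As a rough illustration of how some of the Table 3 features could be derived from alarm records shaped like Table 1, the following Python sketch groups alarms by system and computes the count-based and time-based features. The function and variable names are hypothetical, `node_num` is assumed to mean the number of distinct hosts reporting alarms in a system, and `start_time` is assumed to be measured in minutes from the earliest alarm in the batch. The IDF-based feature and `serious_num` are left out because this page does not define the IDF computation and the Table 1 sample carries no severity field.

```python
from collections import Counter, defaultdict
from datetime import datetime

# A few rows copied from Table 1: (timestamp, system, host, alarm content).
ALARMS = [
    ("2019/6/14 1:14", "SYS_5", "Host_14", "I/O wait load exceeds 10% for 15 minutes"),
    ("2019/6/14 1:14", "SYS_4", "Host_9", "The log generates ERROR information"),
    ("2019/6/14 1:14", "SYS_9", "Host_75", "Free swap space is less than 50%"),
    ("2019/6/14 1:14", "SYS_5", "Host_60", "The communication on port 80 is abnormal"),
    ("2019/6/14 1:14", "SYS_9", "Host_75", "The Slot00 status of the hard disk is failed"),
]


def parse_ts(text):
    """Parse the minute-resolution timestamp format used in Table 1."""
    return datetime.strptime(text, "%Y/%m/%d %H:%M")


def system_features(alarms):
    """Compute per-system features (assumed definitions, see lead-in)."""
    by_system = defaultdict(list)
    for ts, system, host, content in alarms:
        by_system[system].append((parse_ts(ts), host, content))

    # Assumption: relative start time is measured from the earliest alarm.
    batch_start = min(parse_ts(row[0]) for row in alarms)

    features = {}
    for system, rows in by_system.items():
        times = [t for t, _, _ in rows]
        per_minute = Counter(times)  # timestamps already have minute resolution
        features[system] = {
            "alert_count": len(rows),
            "alert_type_num": len({content for _, _, content in rows}),
            "node_num": len({host for _, host, _ in rows}),
            "max_number_per_min": max(per_minute.values()),
            "start_time": (min(times) - batch_start).total_seconds() / 60,
            "time_duration": (max(times) - min(times)).total_seconds() / 60,
        }
    return features


FEATURES = system_features(ALARMS)
```

In this sample every alarm shares one timestamp, so `start_time` and `time_duration` are zero for all systems; real alarm batches spanning several minutes would give non-trivial values.
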
Metrics | Dataset A | | | | Dataset B | | | |
---|---|---|---|---|---|---|---|---|
WFPT-RCA | 0.90 | 0.92 | 0.96 | 0.927 | 0.89 | 0.95 | 1.00 | 0.947 |
WFPT-RCA (no ASG) | 0.64 | 0.70 | 0.84 | 0.727 | 0.53 | 0.63 | 0.74 | 0.633 |
WFPT-RCA (no feature analysis) | 0.28 | 0.54 | 0.90 | 0.573 | — | — | — | — |
MicroRCA | 0.84 | 0.92 | 0.94 | 0.900 | 0.79 | 0.84 | 0.89 | 0.840 |
Microscope | 0.82 | 0.88 | 0.90 | 0.867 | 0.74 | 0.79 | 0.84 | 0.790 |
Association rules | 0.36 | 0.56 | 0.78 | 0.567 | 0.47 | 0.58 | 0.63 | 0.560 |
Table 4 Performance in Datasets A and B
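The per-metric column labels of Table 4 did not survive on this page, but within each dataset group the fourth value appears to be the arithmetic mean of the three values before it. The short Python check below, using figures copied from the table, is only a sanity check of that reading; the names `TABLE_4`, `fourth_is_mean`, and `check` are hypothetical, and the tolerance simply allows for rounding to three decimals.

```python
# Rows copied from Table 4: per dataset, three scores followed by a fourth value.
TABLE_4 = {
    "WFPT-RCA": [(0.90, 0.92, 0.96, 0.927), (0.89, 0.95, 1.00, 0.947)],
    "WFPT-RCA (no ASG)": [(0.64, 0.70, 0.84, 0.727), (0.53, 0.63, 0.74, 0.633)],
    # No Dataset B figures are reported for this variant.
    "WFPT-RCA (no feature analysis)": [(0.28, 0.54, 0.90, 0.573)],
    "MicroRCA": [(0.84, 0.92, 0.94, 0.900), (0.79, 0.84, 0.89, 0.840)],
    "Microscope": [(0.82, 0.88, 0.90, 0.867), (0.74, 0.79, 0.84, 0.790)],
    "Association rules": [(0.36, 0.56, 0.78, 0.567), (0.47, 0.58, 0.63, 0.560)],
}


def fourth_is_mean(row, tol=0.0006):
    """Return True if the last value matches the mean of the others within tol."""
    *scores, reported = row
    return abs(sum(scores) / len(scores) - reported) <= tol


def check(table):
    """Map each method to whether all of its rows satisfy the mean relation."""
    return {method: all(fourth_is_mean(r) for r in rows) for method, rows in table.items()}


RESULT = check(TABLE_4)
```

Every row passes this check, which supports reading the fourth column in each group as an average of the other three.
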