Related work on cloud (including microservice) incidents and failures.
- How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service. Supriyo Ghosh, et al. SoCC'22 paper
- Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis. Shutian Luo, et al. SoCC'21 paper
- What Can We Learn from Four Years of Data Center Hardware Failures? Guosai Wang, et al. DSN'17 paper
- A Survey on Failure Analysis and Fault Injection in AI Systems. Guangba Yu, et al. paper
- Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. Haryadi S. Gunawi, et al. SoCC'16 paper
- What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. Haryadi S. Gunawi, et al. SoCC'14 paper
- Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications. Vaastav Anand, et al. SOSP'23 paper
- (DeathStarBench) An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. Yu Gan, et al. ASPLOS'19 paper
- (TrainTicket) Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study. Xiang Zhou, et al. IEEE Transactions on Software Engineering'18 paper
- µSuite: A Benchmark Suite for Microservices. Akshitha Sriraman, et al. IISWC'18 paper
- Graph-Based Trace Analysis for Microservice Architecture Understanding and Problem Diagnosis. Xiaofeng Guo, et al. ESEC/FSE'20 paper
- CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. Xiao Yu, et al. ASPLOS'16 paper
- Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems. Jonathan Mace, et al. SOSP'15 paper
- The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. Michael Chow, et al. OSDI'14 paper
- Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Benjamin H. Sigelman, et al. Technical report'10 paper
- X-Trace: A Pervasive Network Tracing Framework. Rodrigo Fonseca, et al. NSDI'07 paper
- (AnoFusion) Robust Multimodal Failure Detection for Microservice Systems. Chenyu Zhao, et al. KDD'23 paper
- (MSTGAD) Twin Graph-based Anomaly Detection via Attentive Multi-Modal Learning for Microservice System. Jun Huang, et al. ASE'23 paper
- DeepTraLog: Trace-log combined microservice anomaly detection through graph-based deep learning. Chenxi Zhang, et al. ICSE'22 paper
- Fighting the Fog of War: Automated Incident Detection for Cloud Systems. Liqun Li, et al. ATC'21 paper
- FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices. Haoran Qiu, et al. OSDI'20 paper
- Unsupervised Detection of Microservice Trace Anomalies through Service-Level Deep Bayesian Networks. Ping Liu, et al. ISSRE'20 paper
- Continuous Incident Triage for Large-Scale Online Service Systems. Junjie Chen, et al. ASE'19 paper
- An Empirical Investigation of Incident Triage for Online Service Systems. Junjie Chen, et al. ICSE-SEIP'19 paper
- Fault Diagnosis for Test Alarms in Microservices through Multi-source Data. Shenglin Zhang, et al. FSE(Industry)'24 paper
- ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems. Guangba Yu, et al. FSE'24 paper
- BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection. Luan Pham, et al. FSE'24 paper
- A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization. Sungmin Kang, et al. FSE'24 paper
- Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph. Zhenhe Yao, et al. FSE'24 paper
- Towards Better Graph Neural Network-Based Fault Localization through Enhanced Code Representation. Md Nakhla Rafi, et al. FSE'24 paper
- Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Shenglin Zhang, et al. FSE'24 paper
- Exploring LLM-based Agents for Root Cause Analysis. Devjeet Roy, et al. arXiv'24 paper
- RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. Zefan Wang, et al. arXiv'23 paper
- PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis. Shizhuo Dylan Zhang, et al. FSE(Industry)'24 paper
- Assess and summarize: Improve outage understanding with large language models. Pengxiang Jin, et al. arXiv'23 paper
- (RCACopilot) Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. Yinfang Chen, et al. EuroSys'24 paper
- GAMMA: Graph Neural Network-Based Multi-Bottleneck Localization for Microservices Applications. Gagan Somashekar, et al. WWW'24 paper
- Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data. Guangba Yu, et al. FSE'23 paper
- Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data. Cheryl Lee, et al. ICSE'23 paper
- (DiagFusion) Robust Failure Diagnosis of Microservice System through Multimodal Data. Shenglin Zhang, et al. arXiv'23 paper
- Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. Toufique Ahmed, et al. ICSE'23 paper
- Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps. Amrita Saha, et al. arXiv'22 paper
- Scalable Statistical Root Cause Analysis on App Telemetry. Vijayaraghavan Murali, et al. ICSE-SEIP'21 paper
- Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices. Yu Gan, et al. ASPLOS'21 paper
- Groot: An event-graph-based approach for root cause analysis in industrial settings. Hanzhang Wang, et al. ASE'21 paper
- Practical Root Cause Localization for Microservice Systems via Trace Analysis. Zeyan Li, et al. IWQoS'21 paper
- CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms. Yingying Zhang, et al. CIKM'21 paper
- MicroHECL: high-efficient root cause localization in large-scale microservice systems. Dewei Liu, et al. ICSE-SEIP'21 paper
- Predicting Node Failures in an Ultra-large-scale Cloud Computing Platform: an AIOps Solution. Yangguang Li, et al. ACM Transactions on Software Engineering and Methodology'20 paper
- Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. Yu Gan, et al. ASPLOS'19 paper
- Latent error prediction and fault localization for microservice applications by learning from system trace logs. Xiang Zhou, et al. ESEC/FSE'19 paper
- Automated known problem diagnosis with event traces. Chun Yuan, et al. EuroSys'06 paper
- Delta Debugging Microservice Systems. Xiang Zhou, et al. ASE'18 paper
- How to Mitigate the Incident? An Effective Troubleshooting Guide Recommendation Technique for Online Service Systems. Jiajun Jiang, et al. FSE'20 paper
- AutoTSG: Learning and Synthesis for Incident Troubleshooting. Manish Shetty, et al. arXiv'22 paper
- MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications. Hongyang Chen, et al. TDSC'24 paper
- Chronos: Finding Timeout Bugs in Practical Distributed Systems by Deep-Priority Fuzzing with Transient Delay. Yuanliang Chen, et al. S&P'24 paper code
- Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management. Jiawei Tyler Gu, et al. SOSP'23 paper code
- Phoenix: Detect and Locate Resilience Issues in Blockchain via Context-Sensitive Chaos. Fuchen Ma, et al. CCS'23 paper
- Coverage Guided Fault Injection for Cloud Systems. Yu Gao, et al. ICSE'23 paper code
- Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker. Yinfang Chen, et al. NSDI'23 paper code
- Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems. Lilia Tang, et al. EuroSys'23 paper code
- Automatic Reliability Testing for Cluster Management Controllers. Xudong Sun, et al. OSDI'22 paper code
- IBIR: Bug Report driven Fault Injection. Ahmed Khanfir, et al. FSE'22 paper code
- SlowCoach: Mutating Code to Simulate Performance Bugs. Yiqun Chen, et al. ISSRE'22 paper
- Understanding a Program’s Resiliency Through Error Propagation. Zhimin Li, et al. PPoPP'21 paper
- CoFI: Consistency-Guided Fault Injection for Cloud Systems. Haicheng Chen, et al. ASE'20 paper code
- How Far Have We Come in Detecting Anomalies in Distributed Systems? An Empirical Study with a Statement-level Fault Injection Method. Yong Yang, et al. ISSRE'20 paper
- ProFIPy: Programmable Software Fault Injection as-a-Service. Roberto Natella, et al. DSN'20 paper
- Fitness-guided Resilience Testing of Microservice-based Applications. Zhenyue Long, et al. ICWS'20 paper
- Co-evolving Tracing and Fault Injection with Box of Pain. Daniel Bittman, et al. HotCloud'19 paper
- Automating Failure Testing Research at Internet Scale. Peter Alvaro, et al. SoCC'16 paper