Automatic Root Cause Analysis via Large Language Models for Cloud Incidents

Cloud computing serves as an indispensable infrastructure for numerous applications and services upon which people rely daily.
Root cause analysis (RCA) is pivotal in promptly and effectively addressing cloud incidents.
Traditional approaches to cloud incident RCA typically involve the manual collection and analysis of various types of data,
such as logs [18, 19, 27, 31, 56], metrics [14, 38, 49], traces [53, 61], and incident tickets [20, 42].
Specifically, the engineering team documents the frequent troubleshooting steps in the form of troubleshooting guides
(TSGs) to facilitate the handling of future incidents. 
At the heart of RCA lies the fundamental challenge of efficiently collecting and interpreting comprehensive, incident-specific
data within a limited time frame. On-call engineers (OCEs) must quickly discern the relevance of various data types to the
incident at hand and interpret them correctly.
Data-driven and artificial intelligence (AI) techniques have been leveraged to automate incident management [9, 10].
The recent advent and success of large language models (LLMs) in performing complex tasks [21, 28, 46] suggests a
promising avenue for enhancing RCA.
Recently, Ahmed et al. [1] proposed fine-tuning an LLM on domain-specific datasets to generate the root causes of an
incident using only the title and summary information available at the time of incident creation.
In this paper, we introduce RCACopilot, a novel on-call system presenting an automatic end-to-end approach to cloud
incident RCA.
The diagnostic information collection component of RCACopilot has been in use at Microsoft for over four years. In recent
developments, a root cause prediction component was prototyped and, following a successful preliminary phase, has been
actively deployed by an incident management team at Microsoft for a period spanning several months.

2 Background and Motivation

In this section, we first introduce the concept and importance of incident root cause analysis. 
In the realm of cloud services, an incident refers to any event that disrupts normal service operations or causes degradation in
the quality of services.
RCA in cloud services is a multi-faceted process:
Given the complexity and dynamic nature of cloud systems, along with the immense volume of data involved, conducting
RCA is a challenging task that requires substantial expertise and time.

2.2 The Opportunities and Challenges of Multi-Source Data in Incident Management

Managing incidents in the complex ecosystem of cloud services necessitates a comprehensive understanding of system states.
2.2.1 Opportunities of Multi-Source Data. Different data sources provide different perspectives on the system state. 
2.2.2 Challenges of Multi-Source Data. Despite its potential, effectively leveraging multi-source data in incident
management is challenging.
2.2.3 Limitations of TSGs. Traditional TSGs represent an early attempt to leverage multi-source data for incident
management.
Manual data integration: TSGs typically require OCEs to gather data from different sources manually.
Outdated information: TSGs, as static documents, often struggle to stay up-to-date with the evolving system changes and
new insights about incident root causes.
Insufficient details and coverage: TSGs often contain only high-level instructions that lack detail and specific guidance,
which forces OCEs into additional research and prolongs incident mitigation.
the propagation of requests across services. 

2.3 The Promise of Large Language Models for Incident Management
The rapid advancements in natural language processing and machine learning have led to the development of powerful LLMs,
which are reported to be effective at various downstream tasks with zero-shot and few-shot learning [5, 11, 28].

2.4 Our Motivation
The motivation for our work is rooted in the challenges faced when using manual TSGs to diagnose incidents and identify the
underlying root causes.
Different from previous work [42], which employs AI techniques to generate automated workflow from existing TSGs, our
goal is to enable experienced OCEs to construct an automated pipeline for incident diagnosis.
We envision a future in which root cause analysis is predominantly automated, requiring minimal manual verification only
when necessary. 

3 Insights from Incidents
We conducted a comprehensive study of one year of incidents from an email service at Microsoft, employing rigorous
qualitative analysis methods.
Insight 1: determining the root cause based on a single data source can be challenging. 
When a mailbox server sends mail to external email recipients, it uses specific front-door servers (proxies). However, each
front-door server has a limited number of available SMTP outbound proxy connections.
Insight 2: incidents stemming from similar or identical root causes often recur within a short period.
Insight 3: incidents with new root causes occur frequently and pose a greater challenge to analyze.

4 RCACopilot
RCACopilot has two stages: the diagnostic information collection stage and the root cause prediction stage as shown in
Figure 4.
Diagnostic information collection stage: This is the initial stage, where the incident is parsed and matched to the
pre-defined incident handler.
Root cause prediction stage: Once the diagnostic information is collected, RCACopilot transitions into the root cause
prediction stage.

4.1 Diagnostic Information Collection Stage
Driven by Insight 1 in Section 3, RCACopilot aims to collect multi-source data for RCA.
The RCACopilot incident handler is a workflow that consists of a series of actions.

4.1.1 Incident handler. The decision-making process that OCEs employ when handling an incident resembles a decision
tree’s control flow.
RCACopilot’s incident handlers are first constructed manually and can then be updated and modified dynamically by OCEs,
allowing them to stay abreast of the most recent system changes and newly discovered root causes.
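
As a minimal sketch of the decision-tree view described above, a handler can be modeled as a tree of action nodes in which each node's output selects the next node. The names ActionNode, execute_handler, and KNOWN_ALERTS, as well as the branch labels and the alert and exception values, are hypothetical and chosen only for illustration; this is not RCACopilot's actual implementation. The node names echo those mentioned for Figure 5.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical set of alert messages already known to the team.
KNOWN_ALERTS = {"hypothetical-known-alert"}


@dataclass
class ActionNode:
    """One node of a handler: runs an action and returns a branch label."""
    name: str
    run: Callable[[dict], str]
    branches: Dict[str, "ActionNode"] = field(default_factory=dict)


def execute_handler(root: ActionNode, incident: dict) -> List[Tuple[str, str]]:
    """Walk the tree from the root and record each (node name, outcome) visited."""
    path = []
    node: Optional[ActionNode] = root
    while node is not None:
        outcome = node.run(incident)
        path.append((node.name, outcome))
        node = node.branches.get(outcome)  # no matching branch ends the walk
    return path


# Leaves: suggested mitigations (illustrative wiring only).
restart = ActionNode("Restart service", lambda inc: "done")
engage = ActionNode("Engage other teams", lambda inc: "done")

# The outcome of this node is the exception type, which picks the next node.
top_error = ActionNode(
    "Get top error msg",
    lambda inc: inc.get("top_exception", "Unknown"),
    branches={"TimeoutException": restart},
)

known_issue = ActionNode(
    "Known issue?",
    lambda inc: "yes" if inc["alert"] in KNOWN_ALERTS else "no",
    branches={"yes": engage, "no": top_error},
)

path = execute_handler(
    known_issue, {"alert": "new-alert", "top_exception": "TimeoutException"}
)
```

Because branches are plain dictionary entries, a new branch can be attached to an existing node without touching the traversal logic, which is one simple way to keep a handler editable as new root causes are discovered.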

4.1.2 Handler action. RCACopilot leverages the synergy of multi-source data.
Scope switching action: This action facilitates precision in RCA by allowing adjustments to the data collection scope
based on the specific needs of each incident. 
The implementation of this action ensures that we efficiently navigate the information spectrum. When the investigation
requires a more targeted approach, this action can narrow the data collection scope. 
Query action: This action queries data from different sources and outputs the query result as a key-value pair table.
For instance, in Figure 5, the “Known issue?” action node queries the database to see whether the current incident is a
known one or not based on its alert messages. 
The query action can also output an enum value that decides which action node to execute next. For example, after the
“Get top error msg” node retrieves the top error message from the exception stack traces, the next action node to be run
depends on the exception type.
Mitigation action: This action refers to the strategic steps suggested to alleviate an incident, such as “restart service” or
“engage other teams”, as depicted in Figure 5.
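
The three action types above can be sketched as small functions that operate on a shared context. This is an illustrative sketch under stated assumptions: the Scope levels, the Context fields, the in-memory ERROR_LOG source, and the target names are all hypothetical, since the text does not specify them. Only the action names and the example mitigations come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Scope(Enum):
    # Hypothetical scope levels; the text does not enumerate them.
    CLUSTER = "cluster"
    MACHINE = "machine"


@dataclass
class Context:
    """State shared by the actions of one handler run."""
    scope: Scope
    target: str
    collected: Dict[str, Dict[str, str]] = field(default_factory=dict)
    suggestions: List[str] = field(default_factory=list)


def scope_switch(ctx: Context, scope: Scope, target: str) -> None:
    """Scope switching action: narrow or widen where later queries look."""
    ctx.scope, ctx.target = scope, target


def query(ctx: Context, name: str, source: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Query action: fetch a key-value table for the current target and store it."""
    table = dict(source.get(ctx.target, {}))
    ctx.collected[name] = table
    return table


def mitigate(ctx: Context, step: str) -> None:
    """Mitigation action: record a suggested step (nothing is executed here)."""
    ctx.suggestions.append(step)


# Toy in-memory data source keyed by target name (hypothetical values).
ERROR_LOG = {"machine-07": {"top_error": "TimeoutException", "count": "42"}}

ctx = Context(scope=Scope.CLUSTER, target="cluster-a")
scope_switch(ctx, Scope.MACHINE, "machine-07")       # narrow to one machine
table = query(ctx, "Get top error msg", ERROR_LOG)   # key-value result
if table.get("top_error") == "TimeoutException":     # result picks the next step
    mitigate(ctx, "restart service")
else:
    mitigate(ctx, "engage other teams")
```

Keeping every query result in the context means the collected tables are available later as diagnostic information, regardless of which branch the handler took.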

4.1.3 Multi-source diagnostic information. RCACopilot’s diagnostic information collection stage serves as a valuable tool for ....
In the case of new incidents, RCACopilot can perform a range of common checks, such as evaluating the provisioning status
or analyzing thread stacks. 

4.2 LLMs for Incident Explanation
Upon thorough investigation, each incident within our service is manually assigned a root cause category by our seasoned
OCEs. 
Recently, LLMs have demonstrated remarkable capabilities in understanding the context of downstream tasks and generating
relevant information from demonstrations, making them a possible choice for incident RCA.

4.2.1 Embedding model. Our observation is that the semantics of incidents can be revealed from the context in which the
diagnostic information is described. 
We employ FastText as our embedding model, which is efficient, insensitive to text input length, and generates dense
vectors, making it easy to calculate the Euclidean distance between incident vectors.
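
To illustrate the two properties relied on here (a fixed-size dense vector regardless of input length, and a Euclidean distance between such vectors), the sketch below uses a toy stand-in embedding. It is not FastText: toy_embed simply hashes character trigrams into a small vector and averages the counts. The function names, the dimensionality, and the sample texts are assumptions made for illustration.

```python
import math
import zlib
from typing import List, Sequence

DIM = 16  # toy dimensionality; real embedding models use far more


def toy_embed(text: str, dim: int = DIM) -> List[float]:
    """Stand-in embedding (NOT FastText): hashed character trigrams, averaged."""
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    for gram in grams:
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    # Averaging keeps the output size and scale independent of text length.
    return [v / len(grams) for v in vec]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


short_vec = toy_embed("proxy connections exhausted")
long_vec = toy_embed("outbound proxy connections exhausted on front-door server " * 20)
distance = euclidean(short_vec, long_vec)
```

A trained embedding model would replace toy_embed in practice; the distance computation stays the same because both produce fixed-length dense vectors.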

4.2.2 Nearest neighbor search. Incidents are heterogeneous, making it impractical to combine all past incidents’
information for sampling due to the prompt length limitations, even after summarization.
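
Since not all past incidents fit in a prompt, one option is to keep only the few historical incidents closest to the current one. The sketch below ranks past incidents by Euclidean distance and keeps at most one per root cause category so that the selected examples stay diverse. The one-per-category rule, the function name nearest_incidents, and the incident IDs, category names, and vectors are assumptions for illustration, not details taken from the text.

```python
import math
from typing import List, Sequence, Tuple

# (incident id, root cause category, embedding vector)
Incident = Tuple[str, str, Sequence[float]]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_incidents(
    query_vec: Sequence[float], history: List[Incident], k: int
) -> List[Tuple[str, str]]:
    """Return up to k closest past incidents, at most one per category."""
    ranked = sorted(history, key=lambda inc: euclidean(query_vec, inc[2]))
    picked: List[Tuple[str, str]] = []
    seen_categories = set()
    for incident_id, category, _vec in ranked:
        if category in seen_categories:
            continue  # a closer incident of this category was already picked
        seen_categories.add(category)
        picked.append((incident_id, category))
        if len(picked) == k:
            break
    return picked


# Hypothetical history with two-dimensional vectors for readability.
history = [
    ("INC-1", "category-a", [0.0, 0.0]),
    ("INC-2", "category-a", [0.1, 0.0]),
    ("INC-3", "category-b", [1.0, 1.0]),
    ("INC-4", "category-c", [5.0, 5.0]),
]
neighbors = nearest_incidents([0.0, 0.1], history, k=2)
```

A full sort is adequate for a small history; a larger incident store would likely call for an approximate nearest neighbor index instead.
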
4.2.3 Diagnostic information summary. LLMs have shown potential for automatic summarization [40]. Nonetheless,
the length of the diagnostic information collected from RCACopilot handlers is often too extensive. 

4.2.4 Prediction prompt construction. Chain-of-thought (CoT) prompting is a gradient-free technique that guides LLMs to
produce intermediate reasoning steps leading to the final answer.
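
As a rough sketch of how such a prompt might be assembled, the function below lists a few past incidents with their root cause categories as demonstrations, appends the new incident, and asks the model to reason step by step before answering. The prompt wording, the option letters, the 'NEW' escape option for unseen root causes, and the sample summaries are all assumptions for illustration; the text does not give the actual prompt. No LLM is called here.

```python
from typing import List, Tuple


def build_prediction_prompt(current_summary: str, demos: List[Tuple[str, str]]) -> str:
    """Build a CoT-style prompt from (summary, category) demonstrations."""
    lines = [
        "You are assisting with root cause analysis of a cloud incident.",
        "Past incidents and their root cause categories:",
    ]
    for index, (summary, category) in enumerate(demos):
        letter = chr(ord("A") + index)  # label options A, B, C, ...
        lines.append(f"{letter}. {summary} -> category: {category}")
    lines += [
        f"New incident: {current_summary}",
        "Think step by step: compare the new incident with each past incident,",
        "explain your reasoning, then answer with one option letter,",
        "or 'NEW' if none of the listed categories applies.",
    ]
    return "\n".join(lines)


# Hypothetical demonstrations and incident summary.
demos = [
    ("Outbound proxy connections exhausted on a front-door server.", "category-a"),
    ("Provisioning status check failed for a mailbox server.", "category-b"),
]
prompt = build_prediction_prompt(
    "Mail delivery to external recipients is delayed.", demos
)
```

Asking for the reasoning before the final option is what makes this a chain-of-thought prompt; the explicit 'NEW' option is one way to leave room for incidents whose root cause has not been seen before.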

4.3 Implementation
We have developed and deployed RCACopilot using a combined total of 58,286 lines of code, consisting of 56,129 lines of C#
and 2,157 lines of Python.
To facilitate the building of the RCACopilot incident handler, we have implemented RCACopilot’s handler construction as
a web application as shown in Figure 10.

5 Evaluation
...