🧠 From Observability to Autonomy: How an AI-Based Agentic Workflow Can Enable Self-Driven Incident Management

Onkar Sabale

While working on integrating OpenTelemetry into one of my backend services, I had a thought:

β€œCan we use an AI-based agentic flow to do what a human SRE would β€” analyze 500 errors, dig through logs, understand the code, and even suggest fixes?”

That question led me to build a microservice that does exactly that.

It acts like an intelligent on-call engineer β€” powered by LLMs, observability tools like Tempo and Loki, and automation through agentic reasoning.

It fetches error traces from Tempo; for each trace ID, it pulls the matching logs from Grafana Loki; if the logs point to the file where the error occurred, it fetches that source code from GitHub; and from all of this data it performs root cause analysis and generates a ticket with a title, a detailed description, the source of the error, and possible solutions.

This blog post documents my learning journey as I explored agentic flows, RAG, OpenTelemetry, and AI-assisted error monitoring.
I built this flow in JavaScript with the OpenAI SDK (it's a PoC) and GPT-4o-mini, but a better approach would be LangChain JS or LangGraph, so the flow can work with multiple LLMs with minimal code changes.


πŸ” What Sparked the Idea?

OpenTelemetry gives developers powerful tools to observe their systems β€” tracing requests, collecting metrics, and logging errors. But these tools still require humans to:

  • Manually scan traces and filter 500s

  • Correlate logs with trace IDs

  • Read the code to identify the issue

  • Create a ticket with context and proposed fixes

While building dashboards and trace views, I thought:

β€œWhat if an LLM agent could reason through this process, step-by-step β€” like an actual engineer?”

This idea turned into a fully functional agent that automates root cause analysis and incident triage using OpenTelemetry data.


πŸ—“ What the Microservice Does

This is a cron-based microservice that runs every hour, simulating the behavior of a Site Reliability Engineer AI Agent. It:

  1. Uses Grafana Tempo to query traces with HTTP 500 status

  2. Retrieves logs associated with those trace IDs from Loki

  3. Analyzes the logs for root causes

  4. Pulls the relevant source code from GitHub

  5. Generates tickets summarizing the issue with root cause, stack info, and suggested fixes
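The five steps above can be condensed into a single triage function. This is a minimal sketch, not the actual microservice code: the tool implementations (the Tempo, Loki, GitHub, and ClickUp clients) are injected, and `analyzeWithLLM` is a hypothetical name for the LLM root-cause step.

```javascript
// Sketch of the hourly triage run. Tool names follow the system prompt
// shown later in this post; injecting them makes the flow testable with mocks.
async function runTriage(tools) {
  const end = Math.floor(Date.now() / 1000);
  const start = end - 3600; // look back one hour

  // Step 1: traces with HTTP 500 status from Tempo
  const traces = await tools.getErroredTraces(start, end, 50);

  const tickets = [];
  for (const trace of traces) {
    // Step 2: logs correlated by trace ID from Loki
    const logs = await tools.getLogByTraceId(trace.traceID, start, end);
    // Step 3: LLM root-cause analysis (hypothetical helper name)
    const analysis = await tools.analyzeWithLLM({ trace, logs });
    if (!analysis) continue; // no actionable error found
    // Step 4: pull the implicated file from GitHub, if one was identified
    if (analysis.source?.file) {
      analysis.code = await tools.getSourceCode(analysis.source.file);
    }
    // Step 5: raise the ticket
    tickets.push(await tools.createClickUpTicket(analysis));
  }
  return tickets;
}
```

Injecting the tools also makes it trivial to swap the cron trigger for a webhook later, since `runTriage` has no scheduling logic of its own.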

All of this was implemented using JavaScript, building on my existing stack and tooling preferences.


🧠 The Agentic Reasoning System Prompt

This is the exact system prompt used to guide the AI agent:

You are an expert Site Reliability Engineer (SRE) AI agent. You operate using explicit reasoning steps: START, PLAN, ACTION, OBSERVATION, and OUTPUT.

🌟 Your goal is to detect and report backend system errors (HTTP 500 errors) by using a set of available tools.

πŸ” Your workflow:
1. Use tools to retrieve current time and traces from the last hour.
2. For each trace, fetch logs using the trace ID.
3. Analyze the logs to find error root causes.
4. If no errors are found, output nothing.
5. If errors are found, create a report including summary, source location, possible causes, and at least 2 solutions.
6. Create ClickUp tickets only for **unique** issues (avoid duplicates for same error in different requests).

πŸ›  Available tools:
- getCurrentDateTime(): Get current UNIX timestamp.
- getErroredTraces(startTime: number, endTime: number, limit: number): Get recent error traces.
- getLogByTraceId(traceId: string, startTime: number, endTime: number): Get logs for a trace ID.
- createClickUpTicket(data: JSON): Create a ticket in ClickUp.
- getSourceCode(filePath: string): Get source code for a specific file and line.

πŸ“ Expected output format for ticket:
{
  "title": "Title of the ticket",
  "description": "Description of the issue and likely cause(s)",
  "source": {
    "file": "file.js",
    "line": 1,
    "column": 1,
    "function": "functionName"
  },
  "solutions": ["Solution 1", "Solution 2", ...]
}

The flow is self-prompting: the agent makes decisions, chooses tools, observes outputs, and refines its actions.
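That self-prompting loop can be implemented as a small dispatcher: the model emits one JSON step at a time, ACTION steps are routed to the named tool, and the result is fed back as an OBSERVATION. This is a hedged sketch rather than the exact PoC code; the step shape and the 20-step budget are assumptions.

```javascript
// Minimal dispatcher for the START/PLAN/ACTION/OBSERVATION/OUTPUT loop.
// `llm` takes the message history and returns the model's next step as a
// JSON string; tool names mirror the system prompt above.
async function agentLoop(llm, tools, userMessage, maxSteps = 20) {
  const messages = [{ role: 'user', content: userMessage }];
  for (let i = 0; i < maxSteps; i++) {
    const step = JSON.parse(await llm(messages));
    messages.push({ role: 'assistant', content: JSON.stringify(step) });

    if (step.type === 'OUTPUT') return step.output; // agent is done
    if (step.type === 'ACTION') {
      // Run the chosen tool and hand the result back as an observation
      const result = await tools[step.tool](...(step.input ?? []));
      messages.push({
        role: 'user',
        content: JSON.stringify({ type: 'OBSERVATION', observation: result }),
      });
    }
    // START and PLAN steps need no reply; just keep looping
  }
  throw new Error('agent exceeded step budget');
}
```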


πŸ“ˆ Diagram of the Flow

1. Start
   |
2. Retrieve traces from Grafana Tempo with HTTP 500 status.
   |
3. For each trace, fetch logs from Loki using the trace ID.
   |
4. Analyze logs for root cause (e.g., DB timeouts, code bugs).
   |
5. Pull source code from GitHub to examine problematic lines.
   |
6. Create ClickUp ticket with error details and proposed solutions.
   |
7. End
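Steps 2 and 3 of the diagram boil down to two HTTP queries. The sketch below assumes Tempo's `/api/search` endpoint (TraceQL) and Loki's `/loki/api/v1/query_range` endpoint (LogQL); the base URLs, the label selector, and the TraceQL expression are placeholders you would adapt to your own deployment.

```javascript
// Hedged query helpers for Tempo and Loki. Base URLs are assumptions.
const TEMPO_URL = 'http://tempo:3200';
const LOKI_URL = 'http://loki:3100';

// Search Tempo for traces whose spans carry an HTTP 500 status.
function tempoSearchUrl(startTime, endTime, limit) {
  const q = encodeURIComponent('{ span.http.status_code = 500 }'); // TraceQL
  return `${TEMPO_URL}/api/search?q=${q}&start=${startTime}&end=${endTime}&limit=${limit}`;
}

// Pull logs mentioning a trace ID from Loki over the same window.
// Note: Loki expects nanosecond or RFC3339 timestamps; adjust units as needed.
function lokiQueryUrl(traceId, startTime, endTime) {
  const query = encodeURIComponent(`{job="backend"} |= \`${traceId}\``); // LogQL
  return `${LOKI_URL}/loki/api/v1/query_range?query=${query}&start=${startTime}&end=${endTime}`;
}

async function getErroredTraces(startTime, endTime, limit) {
  const res = await fetch(tempoSearchUrl(startTime, endTime, limit));
  const body = await res.json();
  return body.traces ?? [];
}
```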

🌍 Real-World Use Cases

πŸ”Ή Debugging API Errors

In a production service returning 500s under load, this agent:

  • Fetches the exact traces

  • Reads logs to pinpoint failing functions (e.g. DB timeouts)

  • Creates developer-ready tickets

πŸ”Ή Regression Post-Deployment

When new code breaks production:

  • Agent identifies new stack traces

  • Fetches the file/line from GitHub

  • Suggests fixes based on surrounding code

πŸ”Ή After-Hours Triage

On weekends or nights:

  • The agent handles detection and reporting

  • By Monday, engineers find issues already investigated and documented

πŸ”Ή Scaling Across Microservices

For teams managing 20+ services:

  • Automatically scans all traces

  • Only raises unique tickets (deduplicated)

  • Saves hours of manual error triage


βš™οΈ How It Can Be Improved

Currently, the system does not track previously generated tickets or compare issues to past incidents. This can be enhanced by:

  • Caching previously generated issues with hashable features (e.g. stack trace fingerprint, error message, source location)

  • Storing metadata in a lightweight Redis or SQLite cache

  • Using similarity checks (Levenshtein distance or embedding vectors) to avoid duplicate tickets

Additionally, to reduce excessive LLM calls, we can:

  • Cache LLM outputs based on deterministic prompts (e.g. file content + trace ID)

  • Implement a memoization layer before each LLM call
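The memoization layer can be as small as a wrapper around the LLM call, keyed on a hash of the deterministic prompt. An in-memory Map stands in here for the Redis or SQLite cache mentioned above:

```javascript
// Wrap any async LLM call so repeated prompts (e.g. same file content +
// trace ID) are answered from cache instead of re-billing the model.
function memoizeLLM(callLLM, cache = new Map()) {
  return async function (prompt) {
    const key = require('crypto').createHash('sha256').update(prompt).digest('hex');
    if (cache.has(key)) return cache.get(key);
    const answer = await callLLM(prompt);
    cache.set(key, answer);
    return answer;
  };
}
```

This only pays off when the prompt is genuinely deterministic; anything that injects timestamps into the prompt will defeat the cache.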

To expand the capabilities:

  • Use LangChain to manage multi-step workflows

  • Incorporate multiple LLMs (e.g. GPT-4 for code analysis, Claude for summarization)

  • Add tool routing logic for different tasks (e.g. trace parsing vs. ticket drafting)
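A minimal version of that routing logic is just a task-type-to-model table. The model names and temperatures below are illustrative assumptions, not recommendations:

```javascript
// Route each task type to a dedicated model configuration.
const routes = {
  codeAnalysis: { model: 'gpt-4o', temperature: 0 },
  summarization: { model: 'claude-3-5-sonnet', temperature: 0.3 },
  ticketDrafting: { model: 'gpt-4o-mini', temperature: 0.2 },
};

function routeTask(task) {
  const route = routes[task.type];
  if (!route) throw new Error(`no route for task type: ${task.type}`);
  return { ...route, prompt: task.prompt };
}
```

LangChain's runnables give you the same idea with retries, streaming, and provider abstraction built in, which is why it is the better long-term choice.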


⚑ Bonus: Triggering Flow via Grafana Alerts

Although the current version runs on a cron job, it can be made real-time.

Grafana can trigger a function (via webhook) when it detects a 500 trace. That function can:

  1. Accept a traceId

  2. Launch the agentic root cause workflow

This setup enables near real-time SRE automation for critical errors.


🌐 Final Thoughts

This project was born from two parallel tracks:

  • My work integrating OpenTelemetry for service observability

  • My curiosity exploring agentic LLM flows and auto-prompting

The combination led to a simple but powerful idea:

β€œCan an AI reason through trace data and logs like an engineer?”

Turns out, yes β€” and it can even raise tickets with context and solutions.

Microservice Code : https://github.com/onkarsabale15/AI-Incident-Management
Note: this is not the complete code, since this microservice is part of a larger application, but it should give you enough of a picture for reference.
