What is AIOps?

Ozgur Kara

Welcome to AIOps, Artificial Intelligence for IT Operations.

  • Collect
    Ingests everything: logs, metrics, events. It doesn’t just monitor, it also extracts context.

  • Learn
    Learns what is “normal” or what is “abnormal” with ML.

  • Guess
    AI sees the pattern and warns in advance, for example:
    “This CPU spike is causing this service to crash after 30 minutes every time.”

  • Automate
    Doesn’t just create alerts from logs, it takes action:
    Restart, scale, rollback. Whatever is needed, and no human needed.

DevOps was cute. GitOps was clean. Then infra started scaling like hell, sensors started screaming and logs became novels.

Now, it’s AIOps time!

No more manual dashboards at 4 AM, no more “wait, which pod is on fire?” With AIOps we’re not just monitoring; we’re teaching the system to think, learn and act. Autonomously!

So last night I rolled out our AIOps module for mission-critical container workloads at Goosey Inc. and I thought it might just run silently overnight. Turns out it kept me awake, and not from errors but from excitement.

At Goosey, we are building next-gen observability for data centers, powered by real-time sensor analytics and AI. From anomaly detection to early warning systems, Goosey Inc. is turning noise into insight.

Stay tuned, something nuclear is coming :)

Technology Stack:

  • Machine Learning

  • Log & Metric correlation engines

  • Natural Language Processing

  • Automation tools

Core Stack (The AIOps Skeleton):

  • Data Ingestion
    Prometheus → metrics
    Loki
    Fluentd
    OpenTelemetry

  • Data Store
    Elasticsearch
    InfluxDB
    NATS.IO

  • Correlation
    Grafana Machine Learning
    Yelp’s ElastAlert
    OpenObserve

AI & ML Layer:
Facebook’s Kats
LangChain
NVIDIA Morpheus

Automation Layer:
Terraform or Pulumi
k8s Operators
n8n

Observability:
Grafana
Kibana
PagerDuty

Scenario: 04:00 AM, CPU spike and memory leak!

The AI scans historical data and observability records, sees what caused crashes before, finds the root cause and restarts the container, pod or application.

And writes note on Dashboard:
“incident auto-resolved. go back to sleep. 🙂”

So you wake up like:

“oh. we didn’t even crash.”

The PoC Steps:

First, we will collect logs from Docker containers: we will set up a simple containerized application (nginx or another service) and capture its logs.

Collecting:

Docker logs will be directed to Elasticsearch or another log management tool, and we will collect them with tools like Fluentd.

Learn:

We will run the AI model on these logs and detect anomalies, for example abnormal increases in CPU usage or faulty requests.

Guess:

We will predict potential problems in the system using the AI model and then trigger an automatic action (restart a container or application, or send an alert).

Example Step-by-Step Setup:

  • Run the example Docker container
    We will start an nginx container, because we need it to produce logs.

  • Collect Logs:
    We can get Docker logs directly with Rust instead of Filebeat; it’s possible to fetch logs via the Docker API.

  • Process Logs:
    We can use libraries like serde and regex to parse and process logs in Rust.

Okay, we will write a simple algorithm for anomaly detection, for example calculating the probability of a “problem” as error messages increase in the log (sketched after the Step Four example below).

  • Intervention:
    We can trigger an action after anomaly detection, for example restarting a Docker container.

  • For Rust:
    serde: process JSON data
    regex: analyze logs
    reqwest: communicate with the Docker API
    tokio: for asynchronous operations

First Step: Running the Docker Container
To start, let’s run an nginx container with the following command:

```
docker run --name nginx -d -p 8080:80 nginx
```

Second Step: Create a Rust Project:

```
cargo new aiops
cd aiops
```
Add the required libraries:
Let’s add the following dependencies to the Cargo.toml file:

```
[dependencies]
serde = "1.0"
serde_json = "1.0"
regex = "1.5"
reqwest = { version = "0.11", features = ["json"] }
tokio = { version = "1", features = ["full"] }
```

Step Three: Fetching Docker Logs
We can use the Docker API to fetch Docker logs. For example, a simple Rust program could be:

```
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    // Get logs from the Docker API. This assumes the Docker daemon is
    // exposed over TCP (tcp://localhost:2375), which is not enabled by default.
    let logs_url = "http://localhost:2375/containers/nginx/logs?stdout=true";

    // The logs endpoint returns a raw log stream, not JSON, so read it as text.
    let logs = client.get(logs_url)
        .send()
        .await?
        .text()
        .await?;

    println!("{logs}");
    Ok(())
}
```

Step Four: Process Logs and Detect Anomalies
In Rust, we can use regex to analyze the logs, for example to check whether a line contains an error message, and detect anomalies:

```
use regex::Regex;

// Returns true if the log line contains an error-like keyword.
fn detect_anomaly(log: &str) -> bool {
    let re = Regex::new(r"error|fail|down").unwrap();
    re.is_match(log)
}

fn main() {
    let log = "2025-04-24 04:01:10 error: something went wrong bla bla";
    if detect_anomaly(log) {
        println!("anomaly detected: {}", log);
    } else {
        println!("no anomaly detected");
    }
}
```
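
The keyword check above is the simplest possible detector. The earlier idea of calculating the probability of a “problem” as error messages increase can be sketched as an error ratio over a recent window of log lines; the sample window and the 20% threshold below are made up for illustration:

```
use regex::Regex;

// Returns the fraction of lines that look like errors (0.0 to 1.0).
fn error_ratio(lines: &[&str]) -> f64 {
    if lines.is_empty() {
        return 0.0;
    }
    let re = Regex::new(r"(?i)error|fail|down").unwrap();
    let errors = lines.iter().copied().filter(|line| re.is_match(line)).count();
    errors as f64 / lines.len() as f64
}

fn main() {
    // A small window of recent log lines (in the PoC these would come from the Docker API).
    let window = [
        "GET /index.html 200",
        "GET /health 200",
        "error: upstream timed out",
        "error: upstream timed out",
    ];

    let ratio = error_ratio(&window);
    // Flag an anomaly when more than 20% of recent lines look like errors (arbitrary threshold).
    if ratio > 0.2 {
        println!("anomaly: error ratio {:.0}% exceeds threshold", ratio * 100.0);
    } else {
        println!("error ratio {:.0}%, looks normal", ratio * 100.0);
    }
}
```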

Step Five: Automate Proactive Intervention
When an anomaly is detected, we can take action, for example restart a Docker container. To do this, we can send a request to the Docker API:

```
use reqwest::Client;

// Restarts the nginx container via the Docker API (daemon exposed on tcp://localhost:2375).
async fn restart_container() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let restart_url = "http://localhost:2375/containers/nginx/restart";

    client.post(restart_url)
        .send()
        .await?;

    println!("Anomaly detected, container restarted!");
    Ok(())
}
```
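
Tying the PoC together, here is a minimal sketch (under the same assumption that the Docker daemon is exposed over TCP at localhost:2375, which is not enabled by default) of a loop that periodically fetches the nginx container’s recent logs, runs the keyword check from Step Four, and calls the restart endpoint from Step Five:

```
use regex::Regex;
use reqwest::Client;
use std::time::Duration;

// Keyword-based check, same idea as Step Four.
fn detect_anomaly(log: &str) -> bool {
    let re = Regex::new(r"error|fail|down").unwrap();
    re.is_match(log)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let logs_url = "http://localhost:2375/containers/nginx/logs?stdout=true&stderr=true&tail=100";
    let restart_url = "http://localhost:2375/containers/nginx/restart";

    loop {
        // Fetch the last 100 log lines as raw text (the stream may contain
        // Docker framing bytes, which the keyword check simply ignores).
        let logs = client.get(logs_url).send().await?.text().await?;

        if detect_anomaly(&logs) {
            println!("anomaly detected, restarting container...");
            client.post(restart_url).send().await?;
        } else {
            println!("no anomaly detected");
        }

        // Poll every 30 seconds.
        tokio::time::sleep(Duration::from_secs(30)).await;
    }
}
```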

Observability:

  • Prometheus

So we will collect metrics from the Docker application (CPU, memory and so on), and we will pull and store these metrics using Prometheus.

  • Grafana

By integrating Grafana with Prometheus, we will visualize all metrics for the application or container.

```
version: "3"

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
```

A simple configuration file for Prometheus (prometheus.yml):

```
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx:9113']
```
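
Note that nginx does not expose Prometheus metrics on its own: the nginx:9113 target assumes something like nginx-prometheus-exporter running alongside the nginx container (9113 is that exporter’s default port).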

So log in to the Grafana admin interface and add Prometheus as a data source: go to the Data Sources section, add Prometheus, then in the Dashboards tab you can select a ready-made nginx (or other) dashboard.

Anomaly Detection with Rust:
You can pull Prometheus metrics with Rust; remember that you can query Prometheus over HTTP using libraries like reqwest and tokio.

Action: After getting the metrics, if CPU usage is very high or there is a memory error, we can flag this as an anomaly and take action. When an anomaly is detected, we can restart the container via the Docker API.
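
Here is a minimal sketch of that loop, assuming Prometheus is reachable at localhost:9090 as in the compose file above; the PromQL query and the 0.8 threshold are placeholders for illustration, and on an anomaly we could call the restart_container() function from Step Five:

```
use reqwest::Client;
use serde_json::Value;

// Placeholder query and threshold; adjust to the metrics your exporter actually exposes.
const QUERY: &str = "rate(container_cpu_usage_seconds_total[5m])";
const CPU_THRESHOLD: f64 = 0.8;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    // Prometheus instant-query endpoint (HTTP API v1).
    let resp: Value = client
        .get("http://localhost:9090/api/v1/query")
        .query(&[("query", QUERY)])
        .send()
        .await?
        .json()
        .await?;

    // Each result's "value" is [timestamp, "<value as a string>"].
    if let Some(results) = resp["data"]["result"].as_array() {
        for r in results {
            let value: f64 = r["value"][1].as_str().unwrap_or("0").parse().unwrap_or(0.0);

            if value > CPU_THRESHOLD {
                println!("CPU anomaly: {} = {}", QUERY, value);
                // Here we would call restart_container() from Step Five.
            }
        }
    }

    Ok(())
}
```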

Also, you can get anomaly detection and alerts on the dashboard with Prometheus Alertmanager or Grafana Alerting, and this will let you relax at 04:00 AM.

Finally, I would be happy to sum up AIOps (Artificial Intelligence for IT Operations): AIOps refers to the use of Artificial Intelligence and Machine Learning technologies in IT operations. Its main purpose is to strengthen IT infrastructure monitoring, problem detection, correlation of events and automatic solution generation with AI.

The basic components of AIOps are as follows:

  • Data Collection: collects and combines data from different systems and sources.

  • Machine Learning: used for anomaly detection, correlation of events and problem prediction.

  • Automation: increases operational efficiency by automating repetitive tasks.

Main benefits of AIOps:

Faster problem detection and resolution
Reducing downtime
Automation of manual processes
Proactive problem management

Error detection time: 5 min → 30 sec.

Reduction: 80%.

Dashboard: stylish charts in Grafana to view anomaly trends.
