AIOps in Cloud Infrastructure and Incident Management

Consider a large e-commerce enterprise that runs its operations entirely in the cloud, utilizing AWS for hosting. Its digital platform relies on a complex ecosystem made up of hundreds of virtual machines, microservices, databases, and web servers.

As the business expanded, so did the complexity of managing its IT infrastructure, particularly when it came to monitoring system performance, ensuring uptime, and maintaining operational health. Initially, the company relied on basic cloud-native monitoring tools. However, as traffic and infrastructure scaled, the growing influx of logs, alerts, and performance metrics began to overwhelm the IT operations team, making it increasingly difficult to identify and resolve incidents quickly.

Key Challenges:

  • Alert fatigue: The team faced hundreds of alerts each day, making it difficult to distinguish between high-priority issues and minor ones. This often delayed critical incident resolution.

  • Manual troubleshooting: Root cause analysis required combing through vast log files manually, a slow, tedious process prone to human error.

  • Limited scalability: As infrastructure grew, manual monitoring became unsustainable, and the system lacked the ability to react to incidents automatically without human oversight.

AIOps Adoption:

To address these challenges, the company adopted an AIOps solution aimed at modernizing IT operations. The new platform enabled automated incident handling, proactive issue detection, and intelligent response orchestration, helping the organization scale efficiently and respond to issues in real time.

AIOps Implementation:

Step 1: Setting Up Monitoring with Prometheus

Install Prometheus

wget https://github.com/prometheus/prometheus/releases/download/v2.27.1/prometheus-2.27.1.linux-amd64.tar.gz
tar -xvzf prometheus-2.27.1.linux-amd64.tar.gz
cd prometheus-2.27.1.linux-amd64/
./prometheus

Then install Node Exporter (to collect system metrics):

wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
tar -xvzf node_exporter-1.1.2.linux-amd64.tar.gz
cd node_exporter-1.1.2.linux-amd64/
./node_exporter

Next, configure Prometheus to scrape metrics from Node Exporter:

Edit the prometheus.yml file to include the Node Exporter target:

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

And start Prometheus:

./prometheus --config.file=prometheus.yml

You can now access Prometheus at http://localhost:9090 to verify that it is collecting metrics.
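
If you prefer to verify from code rather than the web UI, a quick query against the Prometheus HTTP API works too. The snippet below is a minimal sketch (not part of the original setup) that asks Prometheus for the built-in up metric and prints the health of each scrape target, assuming the default ports used above.

import requests

# Ask Prometheus which scrape targets are up (1 = healthy, 0 = down)
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": "up"})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    job = result["metric"].get("job", "unknown")
    instance = result["metric"].get("instance", "unknown")
    value = result["value"][1]
    print(f"{job} ({instance}): up={value}")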

Step 2: Collecting System Data (CPU Usage)

Querying Prometheus API for CPU Usage

Use Python to query the Prometheus API and retrieve CPU usage data (for example, using the node_cpu_seconds_total metric), fetching the data for the last 30 minutes.

import requests
import pandas as pd
from datetime import datetime, timedelta

# Define the Prometheus URL and the query
prom_url = "http://localhost:9090/api/v1/query_range"
query = 'rate(node_cpu_seconds_total{mode="user"}[1m])'

# Define the start and end times
end_time = datetime.now()
start_time = end_time - timedelta(minutes=30)

# Make the request to Prometheus API
response = requests.get(prom_url, params={
    'query': query,
    'start': start_time.timestamp(),
    'end': end_time.timestamp(),
    'step': 60
})

# Extract the first time series from the response; Prometheus returns sample values as strings
data = response.json()['data']['result'][0]['values']
timestamps = [item[0] for item in data]
cpu_usage = [float(item[1]) for item in data]  # Convert the string values to floats for analysis

# Create a DataFrame for easier processing
df = pd.DataFrame({
    'timestamp': pd.to_datetime(timestamps, unit='s'),
    'cpu_usage': cpu_usage
})

print(df.head())

Step 3: Anomaly Detection with Machine Learning

Train an Anomaly Detection Model:

First, install Scikit-learn:

pip install scikit-learn matplotlib

Then you’ll need to train the model using the CPU usage data we collected:


from sklearn.ensemble import IsolationForest
import numpy as np
import matplotlib.pyplot as plt

# Prepare the data for anomaly detection (CPU usage data)
cpu_usage_data = df['cpu_usage'].values.reshape(-1, 1)

# Train the Isolation Forest model (anomaly detection)
model = IsolationForest(contamination=0.05)  # 5% expected anomalies
model.fit(cpu_usage_data)

# Predict anomalies (1 = normal, -1 = anomaly)
predictions = model.predict(cpu_usage_data)

# Add predictions to the DataFrame
df['anomaly'] = predictions

# Visualize the anomalies
plt.figure(figsize=(10, 6))
plt.plot(df['timestamp'], df['cpu_usage'], label='CPU Usage')
plt.scatter(df['timestamp'][df['anomaly'] == -1], df['cpu_usage'][df['anomaly'] == -1], color='red', label='Anomaly')
plt.title("CPU Usage with Anomalies")
plt.xlabel("Time")
plt.ylabel("CPU Usage (%)")
plt.legend()
plt.show()
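
To feed these results into the automation in the next step, one option (a sketch, not part of the original pipeline) is to publish the latest CPU reading and anomaly flag as custom CloudWatch metrics with boto3. The AIOpsDemo namespace and metric names below are arbitrary examples.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish the most recent CPU reading and anomaly flag as custom CloudWatch metrics.
# The CPU value is the Prometheus rate, i.e. the fraction of a core spent in user mode (0-1).
latest = df.iloc[-1]
cloudwatch.put_metric_data(
    Namespace='AIOpsDemo',  # example namespace
    MetricData=[
        {
            'MetricName': 'CPUUsage',
            'Value': float(latest['cpu_usage']),
            'Unit': 'None'
        },
        {
            'MetricName': 'AnomalyDetected',
            'Value': 1.0 if latest['anomaly'] == -1 else 0.0,
            'Unit': 'Count'
        }
    ]
)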

Step 4: Automating Incident Response with AWS Lambda

AWS Lambda for Automated Scaling

First, create an AWS Lambda function that scales up an EC2 instance when CPU usage exceeds 80%.

import boto3

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    instance_id = 'i-1234567890'  # Replace with your EC2 instance ID

    # If CPU usage exceeds the threshold, scale up the EC2 instance
    # (note: an instance must be stopped before its type can be changed)
    if event['cpu_usage'] > 0.8:  # 80% CPU usage
        ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={'Value': 't2.large'})
        return {
            'statusCode': 200,
            'body': f'Instance {instance_id} scaled up due to high CPU usage.'
        }

    return {
        'statusCode': 200,
        'body': f'Instance {instance_id} within normal CPU range; no action taken.'
    }

Then you’ll need to trigger the Lambda function. Set up AWS CloudWatch Alarms to monitor the output from the anomaly detection and trigger the Lambda function when CPU usage exceeds the threshold.
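
One way to wire this up is sketched below: create an alarm on the custom metric published at the end of Step 3 and send its notification to an SNS topic that the Lambda function subscribes to. The SNS topic ARN is a placeholder to replace with your own, and the threshold is expressed as a fraction to match the Lambda's check.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when the custom CPU metric stays above 80% for two consecutive 60-second periods.
# The SNS topic ARN is a placeholder; subscribe the Lambda function to that topic.
cloudwatch.put_metric_alarm(
    AlarmName='HighCPUUsage',
    Namespace='AIOpsDemo',      # Must match the namespace used when publishing the metric
    MetricName='CPUUsage',
    Statistic='Average',
    Period=60,
    EvaluationPeriods=2,
    Threshold=0.8,              # 80% expressed as a fraction, matching the Lambda's check
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:aiops-alerts']
)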

Step 5: Proactive Resource Scaling with Predictive Analytics

Finally, using predictive analytics, AIOps can forecast future resource usage and proactively scale resources before problems arise.

Predictive Scaling:

We’ll use a linear regression model to predict future CPU usage and trigger scaling events proactively.

Start by training a predictive model:

from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd

# Historical data (CPU usage trends)
data = pd.DataFrame({
    'timestamp': pd.date_range(start="2023-01-01", periods=100, freq='H'),
    'cpu_usage': np.random.normal(50, 10, 100)  # Simulated data
})

X = np.array(range(len(data))).reshape(-1, 1)  # Time steps
y = data['cpu_usage']

model = LinearRegression()
model.fit(X, y)

# Predict CPU usage for the next 10 hours (time steps beyond the historical data)
future_steps = np.arange(len(data), len(data) + 10).reshape(-1, 1)
future_prediction = model.predict(future_steps)
print("Predicted CPU usage for the next 10 hours:", future_prediction)

If the predicted CPU usage exceeds a threshold, AIOps can trigger auto-scaling using AWS Lambda or Kubernetes.
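
For illustration, here is a minimal sketch of that trigger: if the forecast crosses a threshold, invoke the scaling Lambda directly with boto3. The function name is a placeholder, and the payload mirrors the event['cpu_usage'] field the handler in Step 4 expects.

import json
import boto3

lambda_client = boto3.client('lambda')

# If any forecasted value crosses the 80% threshold, trigger the scaling Lambda.
# 'aiops-scale-up' is a placeholder function name; the handler in Step 4 expects
# cpu_usage as a fraction, so the percentage forecast is divided by 100.
threshold = 80.0  # percent, matching the simulated data above
if future_prediction.max() > threshold:
    lambda_client.invoke(
        FunctionName='aiops-scale-up',
        InvocationType='Event',  # asynchronous invocation
        Payload=json.dumps({'cpu_usage': float(future_prediction.max()) / 100})
    )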

Results:

  • Reduced incident resolution time: The average time to resolve incidents dropped from hours to minutes because AIOps helped the team identify issues faster.

  • Reduced false positives: By using anomaly detection, the system significantly reduced the number of false alerts.

  • Increased automation: With automated responses in place, the system dynamically adjusted resources in real time, reducing the need for manual intervention.

  • Proactive issue management: Predictive analytics enabled the team to address potential problems before they became critical, preventing performance degradation.

Conclusion

AIOps transforms IT operations, enabling companies to build more efficient, responsive, and resilient systems. By automating routine tasks, identifying issues before they escalate, and providing real-time insight, AIOps is reshaping the role of IT teams.

AIOps is a powerful tool for improving system performance, reducing downtime, and streamlining IT processes. You can start small and gradually add more functionality; over time, you will see how AIOps opens your IT environment to fresh ideas and makes it more efficient.
