Observability: Monitoring

Metrics:

Metrics are numerical data points that represent the health and performance 
of your system.

Real World Metrics Example:
In a hospital, the metrics are a patient's vital signs (blood pressure, heart 
rate, etc.). A nurse measures and records them at specific intervals, and 
healthcare professionals, such as doctors, use these readings to make sure the 
patient receives the care their current condition requires.

DevOps Metrics Example:
In DevOps monitoring, metrics are crucial for tracking the performance, 
stability, and overall health of applications, infrastructure, and services. 

Examples of Metrics:
Infrastructure: CPU utilization, disk I/O, network latency.
Application: Response times, error rates, request rates.
Pipeline: Build success rates, time to recover from a failed deployment.
Security: Number of detected vulnerabilities, failed login attempts.
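
To make these categories concrete, a metric can be pictured as a named number with a timestamp and a few labels saying where it came from. The sketch below is only an illustration: the class name `MetricSample`, the metric names, the labels, the date, and every value except the 65% CPU reading are made up for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetricSample:
    """One numerical data point (hypothetical structure, for illustration)."""
    name: str            # what is measured, e.g. "cpu_utilization_percent"
    value: float         # the measurement itself
    timestamp: datetime  # when the measurement was taken
    labels: dict = field(default_factory=dict)  # where it came from


# Arbitrary timestamp, used only so the samples have a time attached.
taken_at = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)

samples = [
    MetricSample("cpu_utilization_percent", 65.0, taken_at,
                 {"category": "infrastructure", "host": "node-1"}),
    MetricSample("http_error_rate_percent", 0.4, taken_at,
                 {"category": "application"}),
    MetricSample("failed_login_attempts", 3.0, taken_at,
                 {"category": "security"}),
]

for sample in samples:
    print(f"{sample.timestamp:%H:%M} {sample.name}={sample.value} {sample.labels}")
```

Each sample is one reading; a series of such readings over time is what monitoring tools store and graph.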

Monitoring:

In DevOps, Monitoring involves tracking key metrics, displaying them on 
dashboards, and setting up alerts to notify teams when something goes wrong.

Monitoring Metrics in HIMS (Hospital Information Management System) and Patient Vitals - Simplified Overview with Examples

  1. Data Collection (Metrics Gathering):

    • Healthcare (Patient Monitoring): Nurses collect patient vitals like heart rate and blood pressure at regular intervals.

      • Example: The nurse measures the patient’s heart rate every 15 minutes. At 10:00 AM, the heart rate is 76 beats per minute (bpm). At 10:15 AM, the heart rate is 81 bpm.
    • DevOps (HIMS Monitoring): We collect system metrics like CPU usage, memory, and response time using tools like Prometheus or Datadog.

      • Example: The monitoring tool samples the HIMS server's CPU usage every minute. At 10:00 AM, CPU usage is 65%; by 10:15 AM, it has risen to 80%.
  2. Monitoring Intervals:

    • Healthcare: Vitals are recorded every 15 or 30 minutes to monitor the patient’s condition.

      • Example: The nurse records the patient’s blood pressure every 30 minutes. At 10:00 AM, it’s 120/80, and at 10:30 AM, it’s 125/85.
    • DevOps: Metrics are collected every minute or more frequently, depending on the criticality of the HIMS system.

      • Example: The response time of the HIMS application is collected every minute. At 10:00 AM, it is 300 ms; by 10:15 AM, it has risen to 450 ms.
  3. Thresholds & Alerts:

    • Healthcare: If a patient’s vitals (like heart rate) go beyond safe limits, an alert is triggered for immediate action by the doctor.

      • Example: If the patient’s heart rate exceeds 120 bpm, an alert is sent to the doctor to intervene. The doctor assesses the patient and may adjust medications.
    • DevOps: If a system metric exceeds its threshold (for example, CPU usage above 85%), an alert is sent to the DevOps engineer, prompting them to take action.

      • Example: If the CPU usage of the HIMS server exceeds 85%, an alert is triggered, notifying the DevOps team. The engineer investigates and scales up the server.
  4. Action on Alerts:

    • Healthcare: Doctors or nurses intervene, adjust medications, or perform treatments based on the alert.

      • Example: The nurse notices that the patient’s blood pressure is consistently high, so the doctor adjusts the medication to lower it.
    • DevOps: The engineer may scale up the infrastructure, fix bugs, or restart services to resolve issues in the HIMS system.

      • Example: After receiving a high CPU usage alert, the DevOps engineer adds more server resources to handle the extra load, bringing CPU usage down.
  5. Raw Data Complexity:

    • Healthcare: Large volumes of raw vitals data are hard to interpret manually.

      • Example: A long list of blood pressure readings (e.g., 120/80, 125/85, 130/90) can be hard to analyze manually without seeing trends or visualizing them.
    • DevOps: Long lists of raw metrics (CPU, memory, etc.) are difficult to analyze by looking at numbers alone.

      • Example: CPU readings over time (e.g., 65%, 80%, 90%) are difficult to interpret without visual tools.
  6. Dashboards:

    • Healthcare: Dashboards visualize patient vitals over time, making it easier for doctors to understand trends and spot issues.

      • Example: A dashboard shows the patient’s heart rate over time, displaying it as a line graph. The doctor can instantly see if the heart rate is steadily increasing and intervene before it becomes dangerous.
    • DevOps: Dashboards (using Grafana or Kibana) display system metrics visually (graphs, charts), helping engineers quickly identify problems and track system performance.

      • Example: A Grafana dashboard shows a line graph of CPU usage over time. The DevOps engineer can see a clear trend that CPU usage spikes every morning around 9:00 AM, indicating the need to scale resources during that time.
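
The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not how Prometheus or Datadog work internally: the names `Reading`, `find_breaches`, and `summarize` are hypothetical, and the 10:30 timestamp for the 90% reading is an assumption added so that each value has a time.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    time: str     # clock time as text, e.g. "10:00"
    value: float  # measured value, e.g. CPU usage in percent


def find_breaches(readings, threshold):
    """Return the readings whose value is above the threshold (step 3)."""
    return [r for r in readings if r.value > threshold]


def summarize(readings):
    """Condense raw readings into figures a dashboard panel might show (steps 5-6)."""
    values = [r.value for r in readings]
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        # True only if every reading is higher than the one before it.
        "rising": all(later > earlier for earlier, later in zip(values, values[1:])),
    }


# Steps 1-2: readings collected at intervals (values taken from the examples above).
cpu = [Reading("10:00", 65.0), Reading("10:15", 80.0), Reading("10:30", 90.0)]
CPU_THRESHOLD = 85.0

# Steps 3-4: raise an alert for each breach so that someone can act on it.
for breach in find_breaches(cpu, CPU_THRESHOLD):
    print(f"ALERT: CPU {breach.value}% at {breach.time} exceeds {CPU_THRESHOLD}%")

print(summarize(cpu))
```

The same two functions work for patient vitals: heart-rate readings of 76 and 81 bpm checked against a 120 bpm limit produce no alert.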

In Short with Examples:

  • Metrics are like vitals in healthcare, constantly monitored to ensure the HIMS system or patient remains healthy.

  • Alerts are triggered when something is wrong, similar to when a patient's vitals go out of range, or the system's CPU usage exceeds safe limits.

  • Dashboards are used to visualize complex data, making it easier for DevOps engineers to keep track of system performance, and for doctors to monitor patient health.

Example (HIMS): Every morning at 9:00 AM, the CPU usage of the HIMS system spikes above 85%. The dashboard shows this pattern, and an alert is triggered. The DevOps engineer scales the system to prevent downtime.

Example (Patient): A patient’s blood pressure starts rising throughout the day. The dashboard shows this trend, and the doctor is alerted to adjust the treatment before the patient’s condition worsens.
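
A recurring pattern such as the 9:00 AM spike in the HIMS example can also be detected from the raw readings rather than by eye. The sketch below assumes readings are available as (timestamp, value) pairs; the function name `recurring_spike_hours`, the dates, and all readings are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime


def recurring_spike_hours(samples, threshold, min_days):
    """Return the hours of the day in which the value exceeded the
    threshold on at least `min_days` distinct days."""
    days_by_hour = defaultdict(set)
    for timestamp, value in samples:
        if value > threshold:
            days_by_hour[timestamp.hour].add(timestamp.date())
    return sorted(hour for hour, days in days_by_hour.items()
                  if len(days) >= min_days)


# Made-up CPU readings for three days: (timestamp, CPU usage in percent).
samples = []
for day in (1, 2, 3):
    samples.append((datetime(2024, 1, day, 8, 0), 60.0))
    samples.append((datetime(2024, 1, day, 9, 0), 88.0))   # morning spike
    samples.append((datetime(2024, 1, day, 14, 0), 70.0))
samples.append((datetime(2024, 1, 2, 14, 30), 91.0))       # one-off spike

print(recurring_spike_hours(samples, threshold=85.0, min_days=3))
```

Requiring the spike on several distinct days separates a daily pattern (worth scheduled scaling) from a single incident (worth investigating).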


The text-based diagram below shows how metrics are collected in the real-world (hospital) example and which roles are involved.

                       +---------------------+
                       |    Patient          |
                       |   (Health Metrics)  |
                       +---------------------+
                                 |
                                 | (Vital Signs: BP, HB, etc.)
                                 V
+-------------------+    +---------------------+
|                   |    |                     |
|    Nurse          |--->|  Nurse Records      |
| (Collects Vitals) |    | (Heartbeat, BP)     |
|                   |    | at Regular Intervals|
+-------------------+    +---------------------+
                                 |
                                 | (e.g., 10:00 AM -> HB = 76)
                                 | (e.g., 10:15 AM -> HB = 81)
                                 V
                        +----------------------+
                        |   Reports or Notes   |
                        | (Shared with Doctor) |
                        +----------------------+
                                 |
                                 | (Data shared for decision making)
                                 V
                      +--------------------------------+
                      |   Doctor                       |
                      | (Reviews Data, Makes Decisions |
                      |  Based on Trends and Alerts)   |
                      +--------------------------------+
                                 |
                                 | (Adjust treatment, respond to trends)
                                 V
                  +---------------------------------------+
                  |  Real-Time Decision Making            |
                  |  (Adjustments to treatment based      |
                  |   on vital trends or emergency alerts)|
                  +---------------------------------------+

In Summary:

  • Nurse: Collects and records the metrics.

  • Doctor: Reviews the metrics and makes decisions.

  • Metrics: Provide insights into the patient's health (e.g., BP and heartbeat).

  • Interval Monitoring: Allows close tracking and timely responses to health changes.

Next, the text-based diagram below shows how metrics are collected in the DevOps example, the HIMS application running on AWS EKS, and what each metric is used for.

                    +-------------------------------+
                    |  HIMS Application on AWS EKS  |
                    +-------------------------------+
                                      |
              ---------------------------------------------------
              |                     |                        |
      +----------------+      +----------------+      +------------------+
      |  CPU of Nodes   |      | Memory of Nodes|     |    Pod Status    |
      +----------------+      +----------------+      +------------------+
              |                     |                        |
      - Tracks how much       - Monitors memory        - Checks if pods 
        CPU power each          usage on nodes           are running or 
        node is using         - Ensures no memory        stuck (CrashLoop, Pending)
      - Action: Add more        overuse                 - Action: Investigate 
        resources if CPU      - Action: Add more         and fix pod issues 
        usage > 85%            memory if needed          if they fail
      - Example: Node         - Example: Memory         - Example: Pod in CrashLoop 
        consistently hitting    usage reaches 85%,       may indicate a bug 
        90% CPU                need to scale up

              |                     |                        |
              -------------------------------------------------
                                      |
                            +---------------------+
                            | Deployment Status   |
                            +---------------------+
                                      |
                            - Tracks whether the 
                              app is deployed successfully
                            - Action: Rollback or 
                              debug if deployment fails
                            - Example: Failed deployment 
                              means outdated replicas or system issues

                                      |
         ----------------------------------------------------------------
         |                          |                                  |
+-------------------+     +--------------------+           +-------------------------------+
| Replica Status    |     | HTTP Requests      |           |   Patients Data Metrics        |
+-------------------+     +--------------------+           +-------------------------------+
        |                      |                                      |
- Tracks how many        - Monitors the number                  - Tracks number of Outpatient 
  replicas of your         of incoming requests                   (OP) and Inpatient (IP)
  app are running        - Helps manage traffic load              patients registered
- Action: Ensure         - Action: Scale up if traffic           - Identifies the departments
  enough replicas          spikes during peak times                with the most patient load
  are running            - Example: Spike in HTTP                - Action: Analyze which 
- Example: Deployment      requests during hospital                departments (Cardiology,
  requires 3 replicas,     hours requires more resources            Neurology, etc.) have higher load
  but only 1 is running                                            - Categorizes patient load based 
                                                                    on payment type (Scheme, Cash, Insurance)
                                                                  - Example: Cardiology has the most OP 
                                                                    patients, Insurance has the highest load 

                                      |
                            +---------------------+
                            |  Health Monitoring  |
                            +---------------------+
                                      |
                            - Combines all the metrics to 
                              give an overview of the system's health
                            - Ensures smooth operation and quick response 
                              to issues

                                      |
                            +---------------------+
                            |  Decision Making    |
                            +---------------------+
                                      |
                            - Based on these metrics, take actions such as:
                              - Scaling resources
                              - Fixing pod issues
                              - Adjusting deployments
                              - Optimizing patient registration workflows

                                      |
                            +---------------------+
                            | System Availability |
                            +---------------------+
                                      |
                            +---------------------+
                            | Smooth Operations   |
                            +---------------------+

Explanation of the Simplified Diagram:

  • CPU of Nodes:

    • Tracks how much processing power (CPU) each node is using.

    • If CPU usage is too high (e.g., >85%), the system may slow down or crash. Add more resources if needed to ensure the HIMS application remains responsive for hospital staff.

  • Memory of Nodes:

    • Monitors how much memory (RAM) is being used by each node.

    • If memory usage is high (e.g., >85%), scaling up or optimizing resource allocation might be necessary to prevent performance issues in critical hospital operations.

  • Pod Status:

    • Checks if your application’s pods are running properly or if any are stuck in CrashLoopBackOff or Pending.

    • If a pod is crashing, it could indicate a bug or misconfiguration that needs to be fixed to ensure continued functionality of the HIMS application.

  • Deployment Status:

    • Ensures that the HIMS application is deployed successfully without issues.

    • If deployments fail, rolling back or fixing issues is necessary before new features or updates are available for hospital staff to use.

  • Replica Status:

    • Ensures the desired number of replicas (copies of your application) are running to handle the load.

    • If fewer replicas are running than required, this can reduce system performance or cause outages, impacting hospital operations.

  • HTTP Requests:

    • Tracks how many HTTP requests your HIMS application is handling to monitor traffic patterns.

    • During peak hospital hours, traffic spikes may require scaling up resources to maintain performance and prevent slowdowns in critical hospital systems like patient registration.

  • Patients Data Metrics:

    • Tracks the number of Outpatient (OP) and Inpatient (IP) patients registered in the hospital system.

    • Identifies which departments (e.g., Cardiology, Neurology) are handling the most patient load, helping hospital management allocate resources effectively.

    • Categorizes patients by payment type (e.g., Scheme, Cash, Insurance) to help the hospital assess and manage administrative workloads.

    • Example: If Cardiology is handling the most OP patients, additional resources may be required to meet the department's demand.

  • Health Monitoring:

    • Brings together all metrics (CPU, memory, pods, HTTP requests, patient data) to give you a comprehensive view of the system’s health.

    • Ensures the HIMS application is running smoothly to support hospital operations without interruptions.

  • Decision Making:

    • Based on the data collected, informed decisions can be made to scale resources, fix bugs, adjust deployments, or allocate additional staff to busy departments.

    • Example: If patients data metrics show an increase in insurance patients, the hospital may need to allocate more staff for insurance processing.

  • System Availability:

    • Monitoring all these metrics ensures the HIMS system is always available and responsive to hospital staff and patients, preventing downtime during critical hospital operations.

  • Smooth Operations:

    • Keeping track of all these metrics ensures smooth hospital operations, allowing hospital staff and patients to access the system without interruptions, ensuring efficiency in patient registration, department workflows, and overall hospital management.
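
The Patients Data Metrics item above comes down to counting registrations along different dimensions. A minimal sketch, assuming each registration is available as a (visit type, department, payment type) record; the records below are invented sample data, not real hospital figures.

```python
from collections import Counter

# Hypothetical registration records: (visit_type, department, payment_type).
registrations = [
    ("OP", "Cardiology", "Insurance"),
    ("OP", "Cardiology", "Cash"),
    ("OP", "Neurology", "Insurance"),
    ("IP", "Cardiology", "Insurance"),
    ("IP", "Neurology", "Scheme"),
]

# Count the same records along three dimensions.
by_visit_type = Counter(visit for visit, _, _ in registrations)
op_by_department = Counter(dept for visit, dept, _ in registrations if visit == "OP")
by_payment_type = Counter(payment for _, _, payment in registrations)

busiest_op_department, _ = op_by_department.most_common(1)[0]
top_payment_type, _ = by_payment_type.most_common(1)[0]

print("Registrations by visit type:", dict(by_visit_type))
print("Busiest OP department:", busiest_op_department)
print("Most common payment type:", top_payment_type)
```

With this sample data, Cardiology has the most OP patients and Insurance is the most common payment type, matching the example in the diagram.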

By continuously tracking CPU, memory, pods, replicas, HTTP traffic, and user activity, you can ensure that your HIMS application remains available and performs well. Monitoring these metrics allows you to detect problems early, scale resources when needed, and make sure the application is always ready for hospital staff and patients.
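
As a rough sketch of how these checks combine into decisions, the function below turns one snapshot of the metrics into a list of suggested actions. The `ClusterSnapshot` class, the node and pod names, and the single shared 85% limit are assumptions made for illustration; a real setup would read these values from Kubernetes and a monitoring tool instead of hard-coding them.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Hypothetical point-in-time view of the metrics described above."""
    node_cpu_percent: dict     # node name -> CPU usage in percent
    node_memory_percent: dict  # node name -> memory usage in percent
    pod_status: dict           # pod name -> status as `kubectl get pods` shows it
    desired_replicas: int
    running_replicas: int


def recommend_actions(snapshot, limit=85.0):
    """Return human-readable suggestions for every check that fails."""
    actions = []
    for node, cpu in snapshot.node_cpu_percent.items():
        if cpu > limit:
            actions.append(f"{node}: CPU at {cpu}%, consider adding resources")
    for node, memory in snapshot.node_memory_percent.items():
        if memory > limit:
            actions.append(f"{node}: memory at {memory}%, consider scaling up")
    for pod, status in snapshot.pod_status.items():
        if status in ("CrashLoopBackOff", "Pending"):
            actions.append(f"{pod}: status {status}, investigate the pod")
    if snapshot.running_replicas < snapshot.desired_replicas:
        actions.append(
            f"replicas: {snapshot.running_replicas}/{snapshot.desired_replicas} "
            "running, check the deployment"
        )
    return actions


snapshot = ClusterSnapshot(
    node_cpu_percent={"node-1": 90.0, "node-2": 55.0},
    node_memory_percent={"node-1": 70.0, "node-2": 60.0},
    pod_status={"hims-api-1": "Running", "hims-api-2": "CrashLoopBackOff"},
    desired_replicas=3,
    running_replicas=1,
)

for action in recommend_actions(snapshot):
    print(action)
```

For this snapshot the function reports three items: high CPU on node-1, the crashing pod, and the missing replicas.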

Prometheus:

Prometheus is an open-source monitoring and alerting tool widely used in DevOps to track system and application metrics. It collects, stores, and analyzes time-series data and is especially suited for cloud-native, containerized environments like Kubernetes.

Prometheus in DevOps Monitoring:

  • Data Collection: Prometheus scrapes metrics from various endpoints (servers, containers, applications) using HTTP. Metrics are in the form of time-series data (e.g., CPU usage over time).

  • Storage: It stores the collected metrics in its time-series database, enabling efficient querying.

  • Alerting: Prometheus integrates with the Alertmanager to send alerts when certain conditions are met, such as high CPU usage or service downtime.

  • Visualization: Prometheus can be integrated with tools like Grafana to create dashboards for visualizing metrics and trends.
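
When Prometheus scrapes an endpoint, what it reads is plain text in the Prometheus exposition format: a `# HELP` line, a `# TYPE` line, and one line per labelled value. The sketch below renders that text by hand to show its shape. The metric name `hims_cpu_usage_percent` and the `node` label are hypothetical, and label values are not escaped here; in practice a Prometheus client library would normally produce this output.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge metric in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Label values are
    written as-is, so this sketch does not handle quotes or backslashes.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        if label_text:
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_gauge(
    "hims_cpu_usage_percent",
    "CPU usage of a HIMS node in percent",
    [({"node": "node-1"}, 65.0), ({"node": "node-2"}, 80.0)],
)
print(text)
```

Each scrape records the current values with a timestamp, which is how a sequence of scrapes becomes the time-series data described above.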

Architecture:

[Figure: Prometheus architecture and some of its ecosystem components]

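The Prometheus ecosystem consists of multiple components, many of which are optional:

  • The main Prometheus server, which scrapes and stores time-series data.

  • Client libraries for instrumenting application code.

  • A push gateway for supporting short-lived jobs.

  • Special-purpose exporters for services like HAProxy, StatsD, and Graphite.

  • An Alertmanager to handle alerts.

  • Various support tools.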

Most Prometheus components are written in Go, making them easy to build and deploy as static binaries.

Workflow:

  1. Prometheus scrapes metrics from exporters or applications.

  2. Metrics are stored in the time-series database.

  3. Queries are run using PromQL to generate insights or trigger alerts based on predefined thresholds.

  4. Alerts are managed by Alertmanager and routed to notify DevOps teams.
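
Step 3 can be illustrated from the outside with the Prometheus HTTP API, which runs a PromQL expression through the `/api/v1/query` endpoint and returns JSON. The sketch below builds such a URL and reads a response. Several things are assumptions: the server address `http://localhost:9090`, the PromQL expression (which relies on the `node_cpu_seconds_total` metric from node_exporter), and the response body, which is hand-written in the documented shape rather than fetched from a real server.

```python
import json
from urllib.parse import urlencode

# Assumed PromQL: busy CPU percentage per instance, derived from the idle
# CPU time that node_exporter reports.
CPU_QUERY = (
    '100 - (avg by (instance) '
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
)


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{query_string}"


def instances_over_threshold(response_body, threshold):
    """Return {instance: value} for every series above the threshold."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    over = {}
    for series in payload["data"]["result"]:
        _timestamp, raw_value = series["value"]  # the value arrives as a string
        value = float(raw_value)
        if value > threshold:
            over[series["metric"].get("instance", "unknown")] = value
    return over


# Hand-written stand-in for a server reply (instant vector result).
sample_response = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"instance": "node-1:9100"}, "value": [1700000000, "90.5"]},
        {"metric": {"instance": "node-2:9100"}, "value": [1700000000, "42.0"]},
    ]},
})

print(build_query_url("http://localhost:9090", CPU_QUERY))
print(instances_over_threshold(sample_response, 85.0))
```

In a running setup, thresholds like this are usually written as alerting rules that Prometheus evaluates itself and hands to Alertmanager; the code above only shows what such a rule checks.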

Grafana:

Grafana is an open-source platform for monitoring, visualizing, and analyzing data in real time. It connects to various data sources like Prometheus, Elasticsearch, and MySQL to display metrics on customizable dashboards.

Grafana helps DevOps teams monitor system performance, detect issues, and set alerts when specific thresholds (like CPU usage) are exceeded. Its real-time visualizations and interactive dashboards make it easy to track infrastructure and application health, enabling proactive monitoring and quick troubleshooting.

Why Use Grafana:

  • Unified View: Grafana allows you to combine data from different sources into a single view, providing a holistic view of your system’s health.

  • Customizable: Its flexibility makes it easy to build dashboards tailored to specific needs.

  • Extensible: Plugins and integrations expand Grafana’s capabilities.

  • Collaborative: Team-based access control and sharing features make it ideal for collaborative work.

  • Proactive Monitoring: With real-time visualizations and alerting, Grafana helps you catch and resolve issues before they impact end users.

Key Features:

  • Custom Dashboards for real-time data visualization.

  • Supports Multiple Data Sources (e.g., Prometheus, InfluxDB, MySQL).

  • Alerts based on predefined thresholds.

  • Extensible with plugins for enhanced functionality.

Use Cases:

  • Infrastructure Monitoring (e.g., CPU, memory, disk usage).

  • Application Performance Monitoring (APM).

  • Business Metrics Visualization.

Credits: Mr. Abhishek Veeramalla (Observability Series)

Resources: Prometheus Official Site

"Learning never exhausts the mind." — Leonardo da Vinci

Thank you, Happy Learning!

Written by

Subbu Tech Tutorials