Observability Driven Auto-Scaling

Patrick Kearns
4 min read

For many DevOps teams, horizontal auto-scaling in Kubernetes is still tied almost exclusively to CPU and memory thresholds. While these metrics are easy to implement and often good enough for basic workloads, they are blunt instruments that fail to reflect real application performance or user experience. The future of scaling is observability driven: dynamic scaling decisions based on service level indicators (SLIs) and business relevant metrics, not just system resource usage.

Traditional scaling triggers are reactive in nature. A spike in CPU may not indicate degraded user experience, and conversely, a service could be struggling under a high load of slow database queries while CPU usage remains comfortably low. By shifting to an observability driven model, teams can base scaling on the metrics that truly matter, such as request latency, queue depth, or even domain specific KPIs like transactions per second or error rate.

Metrics

The starting point is recognising that scaling should be guided by your service objectives. If your SLO defines that 95% of API calls must return within 250ms, latency becomes the primary candidate for scaling triggers. This requires metrics instrumentation at the application layer, not just infrastructure.
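As a sketch of that instrumentation-first approach, assuming the service exposes a standard Prometheus latency histogram named http_request_duration_seconds (the checkout example below uses exactly this, and the job label is illustrative), a recording rule can precompute the p95 latency SLI that the SLO refers to:

groups:
- name: checkout-api-sli
  rules:
  # 95th percentile request latency over a 5 minute window,
  # derived from the service's latency histogram.
  - record: job:http_request_duration_seconds:p95
    expr: |
      histogram_quantile(0.95,
        sum by (le, job) (
          rate(http_request_duration_seconds_bucket{job="checkout-api"}[5m])
        )
      )

Whether you then scale on that percentile or, as in the example below, on an average latency exposed per pod is a policy choice; the important part is that the signal comes from the SLO rather than from node level resource usage.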

For example, an e-commerce checkout service could export metrics such as http_request_duration_seconds_bucket and http_requests_in_flight using the Prometheus client libraries. The Prometheus Adapter can then expose a per-pod latency metric derived from that histogram to Kubernetes’ Horizontal Pod Autoscaler (HPA), letting you create scaling rules like:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Pods
    pods:
      metric:
        # Per-pod average request latency in seconds, derived from the
        # http_request_duration_seconds histogram by a Prometheus Adapter rule
        name: http_request_duration_seconds_avg
      target:
        type: AverageValue
        averageValue: "200m"   # 200 milli-units of a metric reported in seconds, i.e. 200ms

In this scenario, the deployment scales not when CPU reaches 80%, but when the average request latency breaches the 200ms threshold. This ensures scaling decisions are tied directly to the end user experience.
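The custom metric in that rule is not something the Prometheus Adapter exposes by default; it has to be defined in the adapter’s rule configuration. A minimal sketch, assuming the histogram naming above and an adapter deployed with a custom rules file (the exact layout varies by adapter version and install method), derives the per-pod average from the histogram’s _sum and _count series:

rules:
- seriesQuery: 'http_request_duration_seconds_count{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_count$"
    as: "${1}_avg"          # exposes http_request_duration_seconds_avg
  metricsQuery: |
    sum by (<<.GroupBy>>) (rate(http_request_duration_seconds_sum{<<.LabelMatchers>>}[2m]))
      /
    sum by (<<.GroupBy>>) (rate(<<.Series>>{<<.LabelMatchers>>}[2m]))

Treat this as a starting point rather than a drop-in config; the point is that the adapter, not the application, is responsible for turning raw histograms into a metric the HPA can target.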

Integrating KEDA for Event-Driven Scaling

Some workloads don’t follow a steady request/response pattern and are instead driven by asynchronous queues or event streams. For these, KEDA (Kubernetes Event-Driven Autoscaling) provides a more flexible mechanism.

Consider a video processing pipeline reading jobs from an Azure Service Bus queue. You can scale the number of workers based on queue depth or message age, ensuring timely processing during peak loads without over provisioning during quiet periods:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: video-processor
spec:
  scaleTargetRef:
    name: video-processor          # the Deployment running the worker pods
  triggers:
  - type: azure-servicebus
    metadata:
      queueName: videos-to-process
      messageCount: "50"           # target number of queued messages per replica
      connectionFromEnv: SERVICEBUS_CONNECTION_STRING

Here, scaling is driven by a business relevant signal (jobs waiting in the queue) rather than raw infrastructure load.
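KEDA also lets you bound this behaviour on the ScaledObject itself, which is how you avoid over provisioning in quiet periods while still absorbing spikes. A sketch with illustrative values that allows scale to zero, caps the worker fleet, and waits for a quiet queue before dropping the last replica:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: video-processor
spec:
  scaleTargetRef:
    name: video-processor
  minReplicaCount: 0     # scale to zero when the queue is empty
  maxReplicaCount: 40    # hard ceiling during peak ingestion
  pollingInterval: 15    # seconds between queue-depth checks
  cooldownPeriod: 300    # wait 5 minutes of inactivity before scaling to zero
  triggers:
  - type: azure-servicebus
    metadata:
      queueName: videos-to-process
      messageCount: "50"
      connectionFromEnv: SERVICEBUS_CONNECTION_STRING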

Closing the Feedback Loop with Observability

Scaling logic is only as good as the feedback it receives. With an observability stack in place, typically Prometheus for metrics, OpenTelemetry for distributed tracing, and Grafana for visualisation, you can monitor not just when scaling occurs, but also how it impacts performance.

For example, you might observe that latency drops after scaling up, but that throughput stops improving beyond a certain replica count because of database connection limits. These insights can lead to architectural changes, such as introducing connection pooling, read replicas, or caching layers, which improve performance without additional scaling. Tracing can also uncover hidden bottlenecks that scaling alone cannot solve: if the bottleneck is downstream in a payment gateway or external API, scaling your pods won’t help; you need to optimise those dependencies or introduce throttling.
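One way to make that feedback loop explicit is an alert that fires when the autoscaler is already at its ceiling yet the latency SLO is still being missed, a strong hint that more replicas are not the answer. A sketch, assuming kube-state-metrics is installed (its HPA metric names vary slightly between versions) and reusing the hypothetical p95 recording rule from earlier:

groups:
- name: scaling-feedback
  rules:
  - alert: ScaledOutButStillSlow
    expr: |
      job:http_request_duration_seconds:p95{job="checkout-api"} > 0.25
      and on()
      (
        kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="checkout-api"}
        >= kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="checkout-api"}
      )
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "checkout-api is at max replicas but still breaching its 250ms latency SLO"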

Avoiding Metric Flapping

One common pitfall in observability driven scaling is metric flapping: rapid up and down scaling caused by transient metric spikes. To counter this, you can introduce stabilisation windows, smoothing functions, or hysteresis in your scaling configuration. For example, the HPA’s behavior section supports a stabilizationWindowSeconds setting, which makes the controller act on the most conservative recommendation seen over a recent window rather than reacting to every fluctuation. Alternatively, you can preprocess metrics in Prometheus with rate functions, histograms, or moving averages before exposing them to the HPA.
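As a sketch of what that looks like on the checkout-api HPA from earlier (the window lengths and policy values are illustrative), scale up stays aggressive while scale down waits for five minutes of consistently lower recommendations and removes at most one pod per minute:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 15
  # metrics: as defined in the earlier latency-based example
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react immediately when latency rises
    scaleDown:
      stabilizationWindowSeconds: 300   # act on the highest recommendation from the last 5 minutes
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60               # remove at most one pod per minute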

An ideal production setup might combine multiple scaling inputs:

  • CPU/memory for baseline resource health.

  • Application latency for user experience protection.

  • Queue length for asynchronous workloads.

  • Custom KPIs for business critical services.

By layering these metrics, you create a resilient, intelligent scaling strategy that responds to real demand patterns rather than static thresholds.
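Concretely, the first two of those layers can live side by side in a single HPA; the controller computes a desired replica count for each metric and scales to the highest, so whichever signal is under the most pressure wins. A sketch combining baseline CPU with the custom latency metric from earlier (queue-driven workloads would typically stay with KEDA, as above):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70                    # baseline resource health
  - type: Pods
    pods:
      metric:
        name: http_request_duration_seconds_avg  # user experience protection
      target:
        type: AverageValue
        averageValue: "200m"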
