In modern cloud-native environments like Kubernetes, observability is critical — and Prometheus is the go-to tool for metrics monitoring. However, as your infrastructure grows across clusters and regions, Prometheus alone may not be enough.

That’s where Thanos comes in.

What is Prometheus?

Prometheus is an open-source monitoring system designed for reliability and simplicity. It uses a pull-based model to scrape metrics from targets (like apps, nodes, or containers), stores them in a time-series database (TSDB), and supports querying via PromQL.
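
As a small illustration of the query side, the sketch below builds the URL for an instant PromQL query against the Prometheus HTTP API (/api/v1/query). The base URL and the example expression are assumptions chosen for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    # /api/v1/query takes the PromQL expression in the `query` parameter.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical local Prometheus. `up` is the metric Prometheus records for
# every scrape target (1 = last scrape succeeded, 0 = it failed).
url = instant_query_url("http://localhost:9090", 'up{job="node"}')
print(url)
```

Because the Thanos Querier exposes a Prometheus-compatible query API, the same kind of URL can be pointed at a Querier instead of a single Prometheus.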

Prometheus Limitations

Despite its power, Prometheus has a few limitations:

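No built-in replication or deduplication: high availability means running identical replicas, with nothing to merge their data
Metrics are stored on local disk, with a default retention of 15 days
Native federation only copies selected series into another server, so there is no true global query view across clusters
Scaling beyond a single server requires manual sharding and setup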

Thanos

Thanos is an open-source project that extends Prometheus to provide:

Global query view across clusters
Long-term storage using object stores like S3 or GCS
High availability with deduplication
A single query interface for all Prometheus instances

Thanos Architecture

[Figure: Thanos architecture diagram. Source: "Prometheus HA with Thanos. Introduction" by Ramu Nakerikanti, Medium]

1. Prometheus

Prometheus is the core metrics collection engine. It scrapes metrics from Kubernetes nodes, applications, and exporters at regular intervals.

2. Thanos Sidecar

Each Prometheus instance runs alongside a Thanos Sidecar container. The sidecar serves two critical roles:

Exposes Prometheus data via gRPC so it can be queried remotely.
Uploads TSDB blocks (time-series data) to Object Storage (e.g., Amazon S3, GCS, Azure Blob).
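
To make the two roles concrete, here is a minimal sketch that assembles a sidecar command line. The flags shown (--tsdb.path, --prometheus.url, --objstore.config-file, --grpc-address) are Thanos sidecar flags; the paths, the helper name, and the addresses are assumptions for illustration and would differ per deployment.

```python
def sidecar_args(tsdb_path: str, prometheus_url: str,
                 objstore_config_file: str, grpc_address: str) -> list[str]:
    """Assemble the argv for a Thanos sidecar next to one Prometheus."""
    return [
        "thanos", "sidecar",
        f"--tsdb.path={tsdb_path}",                        # where Prometheus writes its blocks
        f"--prometheus.url={prometheus_url}",              # the Prometheus this sidecar fronts
        f"--objstore.config-file={objstore_config_file}",  # bucket settings for block uploads
        f"--grpc-address={grpc_address}",                  # endpoint the Querier connects to
    ]


# Hypothetical values for a sidecar sharing a pod with Prometheus.
args = sidecar_args(
    tsdb_path="/prometheus",
    prometheus_url="http://localhost:9090",
    objstore_config_file="/etc/thanos/objstore.yml",
    grpc_address="0.0.0.0:10901",
)
print(" ".join(args))
```

The gRPC address serves the first role (remote querying), and the object store config file enables the second (block upload).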

3. Object Storage (e.g., S3)

Long-term storage for Prometheus metrics. Sidecars upload blocks to it, the Store Gateway reads blocks from it, and the Compactor compacts and downsamples the blocks it holds.

4. Thanos Store

The Store Gateway acts as a gateway to historical data in object storage (S3 in this setup). It lets the Querier read older metrics, even after Prometheus instances have deleted them locally.

5. Thanos Compactor

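This component compacts blocks in object storage, creates downsampled copies (5-minute and 1-hour resolution) so that long-range queries run faster, and enforces retention. Note that downsampling adds blocks rather than shrinking storage; the savings come from the retention policy. Deduplication of HA replicas happens mainly at query time in the Querier; offline deduplication in the Compactor is optional.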
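
The idea behind downsampling can be sketched in a few lines: raw samples are grouped into fixed windows, and each window keeps a handful of aggregates so that averages and extremes can still be computed later. This is a simplified illustration, not the Thanos on-disk format; the Aggregate type and the function name are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Aggregate:
    window_start_ms: int
    count: int
    total: float
    minimum: float
    maximum: float


def downsample(samples, window_ms=300_000):
    """Collapse (timestamp_ms, value) samples into one Aggregate per window."""
    buckets = {}
    for ts, value in samples:
        start = ts - ts % window_ms  # align the sample to its window start
        agg = buckets.get(start)
        if agg is None:
            buckets[start] = Aggregate(start, 1, value, value, value)
        else:
            agg.count += 1
            agg.total += value
            agg.minimum = min(agg.minimum, value)
            agg.maximum = max(agg.maximum, value)
    return [buckets[k] for k in sorted(buckets)]


# One sample every 60 s for 10 minutes gives two 5-minute windows.
raw = [(i * 60_000, float(i)) for i in range(10)]
result = downsample(raw)
for agg in result:
    print(agg)
```

A dashboard spanning months can then read one aggregate per window instead of every raw sample.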

6. Thanos Querier

A central query layer that connects to:

Sidecars (for real-time data)
Store Gateway (for historical data)

It provides a unified view of metrics across all clusters and Prometheus instances, and deduplicates series from HA replicas at query time.
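
Query-time deduplication can be pictured with the simplified sketch below. It assumes HA replicas emit series that differ only in a label named replica (the actual label name is configured on the Querier via --query.replica-label), drops that label, and merges samples by timestamp. The algorithm Thanos really uses is more involved; this only shows the idea.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that are identical except for the replica label.

    Each series is (labels: dict, samples: list of (timestamp_ms, value)).
    """
    merged = {}
    for labels, samples in series_list:
        # Series identity is the label set without the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        by_ts = merged.setdefault(key, {})
        for ts, value in samples:
            by_ts.setdefault(ts, value)  # first replica to report a timestamp wins
    return [(dict(key), sorted(by_ts.items())) for key, by_ts in merged.items()]


# Two hypothetical replicas scraping the same target; each missed one scrape.
replica_a = ({"__name__": "up", "cluster": "eu", "replica": "a"}, [(1000, 1.0), (2000, 1.0)])
replica_b = ({"__name__": "up", "cluster": "eu", "replica": "b"}, [(2000, 1.0), (3000, 1.0)])
deduped = deduplicate([replica_a, replica_b])
print(deduped)
```

The merged series has no gap even though each replica alone is missing a sample, which is the reason for running replicas in the first place.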

7. Thanos Ruler

The Ruler evaluates recording and alerting rules, as Prometheus does, but runs them against the Querier, so rules can span historical and real-time data from every cluster. It is useful when you want global alerting.

8. Grafana

Grafana dashboards are configured to query the Thanos Querier using PromQL, enabling users to visualize both live and historical metrics from multiple Prometheus instances.

9. Alertmanager

Alertmanager manages alerts, including deduplication, grouping, and routing. It can receive alerts from both Prometheus and Thanos Ruler.

Data Flow Summary

Prometheus scrapes metrics and stores them locally.
Sidecar uploads the blocks to S3 and exposes data via gRPC.
Thanos Querier pulls real-time data from Sidecars and historical data from Store.
Grafana sends queries to Querier for dashboard visualization.
Compactor compacts and downsamples the blocks in S3 and applies retention.
Ruler evaluates alerts and recording rules.
Alertmanager routes alerts to teams.
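
The Querier step in this flow can be pictured as choosing which backends hold data for the requested time range. In the sketch below, each backend advertises the time span it covers and the query is sent only to those that overlap it. The Backend type, the names, and the retention figures are assumptions for illustration, and the real fan-out logic is richer than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    min_time: int  # oldest sample held, seconds since epoch
    max_time: int  # newest sample held, seconds since epoch


def backends_for_range(backends, start, end):
    """Return the names of backends whose data overlaps [start, end]."""
    return [b.name for b in backends if b.min_time <= end and b.max_time >= start]


DAY = 86_400
now = 200 * DAY  # arbitrary "current time" for the example
backends = [
    Backend("sidecar", now - 2 * DAY, now),                # recent data still in Prometheus
    Backend("store-gateway", now - 180 * DAY, now - DAY),  # blocks already in object storage
]

last_hour = backends_for_range(backends, now - 3600, now)
last_week = backends_for_range(backends, now - 7 * DAY, now)
month_ago = backends_for_range(backends, now - 30 * DAY, now - 29 * DAY)
print(last_hour, last_week, month_ago)
```

A dashboard showing the last hour touches only the sidecar, while a week-long panel pulls from both and relies on the Querier to stitch the results together.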

Real-World Use Case

In one of our projects, we had Prometheus running in multiple EKS clusters. We used Thanos to:

Store metrics in S3 for 6 months
Query all clusters from a single Grafana dashboard
Ensure no data loss even if a Prometheus pod restarted
Alert on CPU, memory, and pod failures across clusters

Why Use Thanos?

Horizontal scalability for Prometheus
Global view across multiple clusters
Long-term storage via object stores
Downsampling and block compaction
HA monitoring setup with deduplication

Summary

Thanos doesn’t replace Prometheus — it enhances it. If you’re dealing with multiple clusters, need centralised monitoring, or want to retain metrics for months, Thanos is a solid choice.

Whether you're prepping for an interview or building real systems, Thanos + Prometheus is a powerful stack every DevOps engineer should know.

Understanding Thanos + Prometheus for Scalable Monitoring