OTel Setups: Bare-Metal & Kubernetes Guide, Part 1

This post offers an opinionated set of collector configurations for gathering comprehensive metrics in two fundamental environment types: bare-metal hosts and Kubernetes. Fortunately, the OTel community has made this fairly straightforward. In Part 1, we will explore simple host monitoring.

Host Metrics and Attributes

The hostmetricsreceiver is a mainstay of the OTel collector contrib repository and provides system health metrics. It is made up of scrapers (cpu, memory, disk, and so on), each of which can be enabled selectively.

Collection Interval

Each receiver instance scrapes at a configurable interval:

receivers:
  hostmetrics:
    collection_interval: 10s

Depending on data retention windows and the granularity needed, expect to set this somewhere between 5s and 30s (in increments of 5s), although any valid duration can be used.
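If different scrapers warrant different granularity, one option is to run more than one named instance of the receiver. The following is a minimal sketch under that assumption; the instance name hostmetrics/disk and both intervals are illustrative choices, not recommendations.

```yaml
receivers:
  # Fast-moving signals, scraped frequently.
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      memory:

  # Slower-moving signals, scraped less often.
  # "hostmetrics/disk" is an illustrative instance name.
  hostmetrics/disk:
    collection_interval: 30s
    scrapers:
      disk:
      filesystem:
```

Both instances would then need to be listed under the receivers of a metrics pipeline for their data to flow anywhere.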

CPU and Memory

CPU and memory metrics can be enabled per scraper; however, it is important to know which ones are on by default. Let's walk through a config:

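receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
        metrics:
          # Optional metric, off unless enabled explicitly.
          system.cpu.utilization:
            enabled: true
      memory:
        metrics:
          # Optional metric, off unless enabled explicitly.
          system.memory.utilization:
            enabled: true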

In the above config, just enabling the cpu scraper populates system.cpu.time as a cumulative sum. Enabling system.cpu.utilization is recommended because it reports utilization as a gauge (a ratio between 0 and 1 that most backends can display as a percentage), which operators may be more familiar with.

It's important to note that states such as idle, wait, and user are values of the state attribute, not unique metric names. So instead of querying system.cpu.utilization.user, one must filter system.cpu.utilization by an attribute (or tag, depending on the backend).
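To make the attribute model concrete, here is a hedged sketch that uses the filter processor from the contrib distribution to drop the idle data points of system.cpu.utilization. The instance name filter/cpu_idle is hypothetical, and whether dropping idle is desirable at all depends on the backend and dashboards in use.

```yaml
processors:
  # "filter/cpu_idle" is a hypothetical instance name.
  filter/cpu_idle:
    error_mode: ignore
    metrics:
      datapoint:
        # Data points matching this OTTL condition are dropped.
        # The CPU state lives in the "state" attribute, not in the metric name.
        - 'metric.name == "system.cpu.utilization" and attributes["state"] == "idle"'
```

The processor only takes effect once it is added to the processors list of a metrics pipeline.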

Similarly, system.memory.utilization offers gauge-like metrics that may be easier to work with. States such as used and free are again exposed as attribute values rather than as unique metric names.
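If only the utilization gauges are wanted, the metrics that are on by default can be switched off in the same place. This is a sketch, assuming the cumulative system.cpu.time and the byte-based system.memory.usage are not needed downstream; many setups will want to keep them.

```yaml
receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
        metrics:
          system.cpu.time:
            enabled: false   # on by default, switched off here
          system.cpu.utilization:
            enabled: true    # optional, switched on here
      memory:
        metrics:
          system.memory.usage:
            enabled: false   # on by default, switched off here
          system.memory.utilization:
            enabled: true    # optional, switched on here
```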

Disk and Filesystem

Disk performance and overall filesystem usage can be added as shown in the config below:

receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      # An empty entry enables the scraper with its default metrics.
      disk:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true

Important metrics, especially system.disk.io, are populated by the disk scraper, with attributes that break reads and writes down by device. The filesystem scraper adds filesystem usage (space used vs. free), with free and used again exposed as attribute values, and system.filesystem.utilization provides a gauge-like view of the same information.
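On many hosts these scrapers also report loop devices and pseudo-filesystems, which can add noise. The sketch below shows one way to narrow the scope using the scrapers' exclude options; the device pattern and filesystem types listed are example values and should be adjusted to the host in question.

```yaml
receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      disk:
        exclude:
          devices: ['^loop[0-9]+$']   # example pattern for loop devices
          match_type: regexp
      filesystem:
        exclude_fs_types:
          fs_types: [tmpfs, devtmpfs, overlay]   # example pseudo-filesystems
          match_type: strict
        metrics:
          system.filesystem.utilization:
            enabled: true
```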

Network

To get network I/O, connections, and errors/dropped packets:

receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      # An empty entry enables the scraper with its default metrics.
      network:

These metrics can be quite valuable, as they can pinpoint network issues by device, protocol, and direction (receive or transmit).
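Because these metrics are reported per device, it can help to leave out interfaces that are never of interest. A minimal sketch, assuming the loopback interface is named lo (as on most Linux hosts) and is not worth tracking:

```yaml
receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      network:
        exclude:
          interfaces: [lo]      # assumed loopback interface name
          match_type: strict
```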

Resource Attributes

Decorating metrics with resource attributes is crucial for querying them by information such as "host", "region", "provider", etc.:

processors:
  resourcedetection:
    detectors: [gcp, ecs, ec2, azure, system]
    override: true

Keep in mind that order matters: when several detectors set the same attribute, the first detector to insert it wins.
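None of the receivers or processors above do anything until they are wired into a pipeline. The following is a minimal end-to-end sketch for a single bare-metal host, under a few assumptions: only the system detector is needed, the hostname should come from the OS, and the debug exporter stands in as a placeholder for whatever backend exporter is actually used.

```yaml
receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      memory:
      disk:
      filesystem:
      network:

processors:
  resourcedetection:
    detectors: [system]          # bare metal: no cloud detectors assumed
    system:
      hostname_sources: ["os"]   # assumption: use the OS-reported hostname
    override: true

exporters:
  # Placeholder: prints telemetry to the collector's own output.
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [resourcedetection]
      exporters: [debug]
```

Swapping debug for a real exporter and adding cloud detectors ahead of system would adapt this sketch to a hosted environment.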

Part 2

In Part 2, we will explore relevant metrics and metadata decoration within a Kubernetes environment.
