Agent/Non-Agent-Based Monitoring & Distributed Tracing

Farbod Ahmadian

Introduction

When monitoring applications and infrastructure, businesses usually choose between agent-based and non-agent-based monitoring solutions. Some tools, such as Datadog and Splunk, support both approaches. In this article, we’ll use Datadog as an example of agent-based monitoring and Splunk as an example of non-agent-based monitoring to show the differences in practice. Each approach has its strengths and challenges, depending on your needs, environment, and monitoring goals.

💡
Disclaimer: Both Datadog and Splunk support agent-based and non-agent-based monitoring approaches. The examples in this article use Datadog primarily to illustrate agent-based monitoring and Splunk to illustrate non-agent-based monitoring. This is for clarity in comparing the concepts, not to suggest that one tool is limited to only one method.

Agent-Based Monitoring

Agent-based monitoring means installing a small program (an agent) on each server or host that you want to monitor.

How it works

  • An agent runs continuously on the host machine.

  • It collects metrics like CPU, memory, disk usage, application logs, and more.

  • Data is sent to a central server (such as Datadog's cloud) for analysis.
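
For example, a Python application can push a custom metric to the locally running Datadog Agent over DogStatsD (the Agent listens on UDP port 8125 by default):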

from datadog import initialize, statsd

# Initialize (if sending directly to local agent, defaults are ok)
options = {"statsd_host":"localhost", "statsd_port":8125}
initialize(**options)

# Send a gauge metric (e.g., app.queue.size=5)
statsd.gauge("app.queue.size", 5)

Pros

  • Rich Data Collection: Agents can collect detailed performance metrics and logs directly from the machine or container.

  • Real-Time Monitoring: Since agents run locally, they can send data almost instantly.

  • Custom Checks: You can configure agents to run extra checks, such as custom scripts (see the sketch after this list).

  • Easy Auto-Discovery: Agents can sometimes auto-detect services and start monitoring them automatically.
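
As a rough sketch of what such a custom check can look like, the Datadog Agent can load a small Python class (placed in its checks.d directory) and call its check() method on every collection interval. The class name, metric, and tag below are made-up placeholders:

from datadog_checks.base import AgentCheck

# A hypothetical custom check: the Agent calls check() on each collection interval.
class QueueCheck(AgentCheck):
    def check(self, instance):
        # Report a made-up queue depth as a gauge metric.
        self.gauge("app.queue.size", 5, tags=["env:dev"])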

Cons

  • Deploy and Maintain: Every machine or container needs an agent installed and kept up to date.

  • Resource Usage: Agents use a bit of the host’s CPU and memory, though this is usually small.

  • Compatibility: Some environments (like highly restricted or legacy systems) may not allow agent installation.

Non-Agent Based Monitoring

Non-agent-based monitoring collects data without installing anything on the host. A central system pulls in the data, often by receiving logs or metrics through APIs, syslog, or other protocols.

How it works

  • Systems send their log files, events, or performance data to a central collector (like Splunk).

  • No agents run on the host; configuration often happens on log shippers or using built-in system protocols.
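
For example, an application (or a log shipper acting on its behalf) can post data straight to Splunk's HTTP Event Collector (HEC), which listens on port 8088 by default: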

import requests

# Replace with your Splunk HEC endpoint and token.
splunk_url = "https://splunk-server:8088/services/collector"
headers = {
    "Authorization": "Splunk YOUR_HEC_TOKEN"
}
data = {
    "event": "metric",  # event type
    "fields": {
        "metric_name:app.queue.size": 5
    }
}

# Send the data to Splunk. verify=False skips TLS certificate checks and
# should only be used for local testing.
requests.post(splunk_url, headers=headers, json=data, verify=False)

Pros

  • No Agent Management: Nothing to install or update on the monitored systems.

  • Good for Legacy/Restricted Systems: Useful where you cannot install extra software.

  • Centralized Control: All settings and updates occur on the collector’s side.

Cons

  • Limited Data: Sometimes, only basic metrics or logs are available unless the system supports rich exports.

  • Slower or Batch Updates: Data may arrive in batches, so insights may lag behind real-time.

  • Harder Customization: Custom health checks or metrics are harder to set up.

Distributed Tracing

What is distributed tracing?

Distributed tracing helps developers follow the journey of a user or API request as it moves through the different services in a system. Each step the request takes is called a “span.” All spans belonging to one request share a trace ID, and each span gets its own span ID (a unique identifier).

Where is it used?

  • Microservices: When applications have many small services talking to each other.

  • Serverless and Cloud-Native Apps: Where requests touch multiple services.

  • Debugging Performance Issues: To find bottlenecks in big, complex systems.

How is it used?

  • When a request starts, a trace ID and a span ID are created.

  • As the request travels, new spans and IDs are made for each new service or step.

  • All the spans are grouped together under the single trace ID.

  • This lets you see the entire path of a request (the trace), how long each step takes, and where failures happen.
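
To make the span/trace relationship concrete, here is a minimal, vendor-neutral sketch using the OpenTelemetry SDK. The span names are made up; the console exporter simply prints each finished span, including its trace ID and span ID:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to the console so the trace and span IDs are visible.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# One trace ("handle_request") with two child spans; all three share the same trace ID.
with tracer.start_as_current_span("handle_request"):
    with tracer.start_as_current_span("query_database"):
        pass
    with tracer.start_as_current_span("call_payment_service"):
        pass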

Span ID

  • Every operation in the trace has its own span ID.

  • The span ID helps to track and organize all steps within a single trace.

  • By analyzing span IDs, you can see how long each segment took and how services relate to each other.
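
When a request crosses a service boundary, the trace ID and the current span ID are commonly forwarded in an HTTP header so the next service can continue the same trace. Here is a minimal sketch of the W3C Trace Context "traceparent" header format (the IDs are freshly generated placeholders):

import secrets

# Per the W3C Trace Context spec, a trace ID is 16 random bytes and a span ID is 8 bytes, hex-encoded.
trace_id = secrets.token_hex(16)
span_id = secrets.token_hex(8)

# Downstream services read this header, reuse the trace ID, and create new child spans.
traceparent = f"00-{trace_id}-{span_id}-01"
print(traceparent)  # e.g. 00-<32 hex chars>-<16 hex chars>-01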

When to use each monitoring solution

For distributed tracing specifically, the two tools compare as follows:

  • Visualization: Datadog has easy-to-use dashboards to view full traces and the performance of each span.

  • Friendly for Developers: Many programming languages and frameworks are supported with minimal manual work.

  • Built-In Agent Support: Datadog's Agent natively collects distributed traces together with logs and metrics, which makes setup simpler and the resulting data richer.

  • Real-Time Tracing: The agent sends tracing data in real time, which helps with quick debugging.

  • Automatic Context Linking: Traces, logs, and metrics are tied together, making it easier to investigate issues.

  • More Control: If you want more control via the OpenTelemetry abstraction, Splunk can be a better option, although the OpenTelemetry libraries are still maturing.

By contrast, while Splunk can handle traces through external plugins or integrations, it often requires more manual setup and may not offer a real-time or tightly integrated tracing experience out of the box.

Whether agent-based or non-agent-based monitoring is a better fit ultimately depends on your environment, security requirements, and operational trade-offs.

A simple Python example of how distributed tracing can be done in Datadog:

from ddtrace import tracer, patch_all

# Automatically instrument supported libraries (requests, Flask, etc.).
patch_all()

# Wrap the function in a span; the local Datadog Agent collects and forwards it.
@tracer.wrap()
def say_hello():
    print("Hello from Datadog tracing!")

if __name__ == "__main__":
    say_hello()

The same example with Splunk, using the OpenTelemetry SDK:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans to Splunk's ingest endpoint; replace the token with your own.
otlp_exporter = OTLPSpanExporter(
    endpoint="https://ingest.us1.signalfx.com/v2/trace",
    headers={"X-SF-TOKEN": "your-splunk-access-token"},
)

# Register a tracer provider that batches spans and sends them through the exporter.
trace.set_tracer_provider(TracerProvider())
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("splunk-example-span"):
    print("Hello from Splunk tracing!")

# Flush any buffered spans before the script exits.
trace.get_tracer_provider().force_flush()

Conclusion

  • Both Datadog and Splunk offer agent-based and agentless options. The choice is less about the tool and more about your monitoring strategy: agent-based for richer, real-time detail, and non-agent for environments where agents aren’t possible.

  • For distributed tracing, Datadog offers strong out-of-the-box support, while Splunk emphasizes flexibility and standards like OpenTelemetry.
