Agent/Non-Agent-Based Monitoring & Distributed Tracing

Introduction
When monitoring applications and infrastructure, businesses usually choose between agent-based and non-agent-based solutions. Tools such as Datadog and Splunk support both approaches. In this article, we’ll use Datadog as the example of agent-based monitoring and Splunk as the example of non-agent-based monitoring to show the differences in practice. Each approach has its strengths and challenges, depending on your needs, environment, and monitoring goals.
Agent-Based Monitoring
Agent-based monitoring means installing a small program (an agent) on each server or host that you want to monitor.
How it works
An agent runs continuously on the host machine.
It collects metrics like CPU, memory, disk usage, application logs, and more.
Data is sent to a central server (like Datadog's cloud) for analysis.
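For example, the Datadog Python client can push a custom metric to the local agent over DogStatsD: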
from datadog import initialize, statsd

# Point the client at the local agent's DogStatsD endpoint
# (these are the defaults, so the options can be omitted)
options = {"statsd_host": "localhost", "statsd_port": 8125}
initialize(**options)

# Send a gauge metric (e.g., app.queue.size=5)
statsd.gauge("app.queue.size", 5)
Pros
Rich Data Collection: Agents can collect detailed performance metrics and logs directly from the machine or container.
Real-Time Monitoring: Since agents run locally, they can send data almost instantly.
Custom Checks: You can configure agents to run extra checks, such as custom scripts (see the sketch after this list).
Easy Auto-Discovery: Agents can often detect new services and start monitoring them without manual setup.
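As a rough sketch of the custom-checks point, a Datadog custom agent check is a small Python class. The class name, metric name, and tag below are made up for illustration; the file would live in the agent's checks.d/ directory with a matching conf.d configuration:

from datadog_checks.base import AgentCheck

class QueueDepthCheck(AgentCheck):
    def check(self, instance):
        # A real check would measure something here; a constant stands in for brevity
        depth = 5
        self.gauge("app.queue.depth", depth, tags=["env:demo"])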
Cons
Deployment and Maintenance: Every machine or container needs an agent installed and kept up to date.
Resource Usage: Agents use a bit of the host’s CPU and memory, though this is usually small.
Compatibility: Some environments (like highly restricted or legacy systems) may not allow agent installation.
Non-Agent-Based Monitoring
Non-agent-based monitoring collects data without installing anything on the host. A central system pulls in data, often by receiving logs or metrics through APIs, syslog, or other protocols.
How it works
Systems send their log files, events, or performance data to a central collector (like Splunk).
No agents run on the host; configuration often happens on log shippers or using built-in system protocols.
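For example, a script can push a metric straight to Splunk's HTTP Event Collector (HEC) with a plain HTTP POST, with nothing installed on the monitored host: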
import requests

splunk_url = "https://splunk-server:8088/services/collector"
headers = {
    "Authorization": "Splunk YOUR_HEC_TOKEN"
}
data = {
    "event": "metric",  # event type
    "fields": {
        "metric_name:app.queue.size": 5
    }
}

# Send the data to Splunk (verify=False skips TLS verification; acceptable
# for a quick demo, but use a valid certificate in production)
requests.post(splunk_url, headers=headers, json=data, verify=False)
Pros
No Agent Management: Nothing to install or update on the monitored systems.
Good for Legacy/Restricted Systems: Useful where you cannot install extra software.
Centralized Control: All settings and updates occur on the collector’s side.
Cons
Limited Data: Sometimes, only basic metrics or logs are available unless the system supports rich exports.
Slower or Batch Updates: Data may arrive in batches, so insights may lag behind real-time.
Harder Customization: Custom health checks or metrics are harder to set up.
Distributed Tracing
What is distributed tracing?
Distributed tracing helps developers follow the journey of a user or API request as it moves through the different services in a system. Each step the request takes is called a “span,” and each span gets a unique identifier called a span ID.
Where is it used?
Microservices: When applications have many small services talking to each other.
Serverless and Cloud-Native Apps: Where requests touch multiple services.
Debugging Performance Issues: To find bottlenecks in big, complex systems.
How is it used?
When a request starts, a trace ID and a span ID are created.
As the request travels, a new span with its own span ID is created for each service or step, and the trace context is passed along with the request (a sketch of this follows the list).
All the spans are grouped together under the single trace ID.
This lets you see the entire path—the trace—of a request, how long each step takes, and where failures happen.
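A minimal sketch of how that context can be passed between services, using OpenTelemetry's propagation API (the span name and setup are illustrative):

from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("call-downstream"):
    # Copy the active trace context into outgoing HTTP headers; the
    # receiving service extracts it and continues the same trace
    headers = {}
    inject(headers)
    print(headers)  # contains a W3C 'traceparent' entry with trace ID and span ID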
Span ID
Every operation in the trace has its own span ID.
The span ID helps to track and organize all steps within a single trace.
By analyzing span IDs, you can see how long each segment took and how services relate to each other, as the sketch below shows.
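To make the relationship concrete, here is a minimal sketch with the OpenTelemetry SDK (span names invented): two nested spans share one trace ID while each keeps its own span ID.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("checkout") as parent:
    with tracer.start_as_current_span("charge-card") as child:
        # Both spans belong to the same trace...
        print(f"trace id:       {parent.get_span_context().trace_id:032x}")
        # ...but each operation has its own span ID
        print(f"parent span id: {parent.get_span_context().span_id:016x}")
        print(f"child span id:  {child.get_span_context().span_id:016x}")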
When to use each monitoring solution
For distributed tracing in particular, Datadog has several advantages:
Visualization: Datadog has easy-to-use dashboards to view full traces and the performance of each span.
Friendly for Developers: Many programming languages and frameworks are supported with minimal manual work.
Built-In Agent Support: Datadog’s agent natively collects distributed traces alongside logs and metrics, which makes setup simpler and the resulting data richer.
Real-Time Tracing: The agent sends tracing data in real time, which helps with quick debugging.
Automatic Context Linking: Traces, logs, and metrics are tied together, making it easier to investigate issues.
More Control: If you want more control over the OpenTelemetry abstraction, Splunk can be the better option, although the OpenTelemetry libraries are still maturing.
By contrast, while Splunk can handle traces through external plugins or integrations, it often requires more manual setup and may not offer real-time or tightly integrated tracing experiences out-of-the-box.
A simple Python example of how distributed tracing can be done in Datadog:
from ddtrace import tracer, patch_all

# Automatically instrument supported libraries
patch_all()

@tracer.wrap()
def say_hello():
    print("Hello from Datadog tracing!")

if __name__ == "__main__":
    say_hello()
The same example with Splunk, using OpenTelemetry:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans to Splunk's ingest endpoint (adjust the realm in the URL
# and supply your own access token)
otlp_exporter = OTLPSpanExporter(
    endpoint="https://ingest.us1.signalfx.com/v2/trace",
    headers={"X-SF-TOKEN": "your-splunk-access-token"},
)

trace.set_tracer_provider(TracerProvider())
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("splunk-example-span"):
    print("Hello from Splunk tracing!")
Conclusion
Both Datadog and Splunk offer agent-based and agentless options. The choice is less about the tool and more about your monitoring strategy: agent-based for richer, real-time detail, and non-agent for environments where agents aren’t possible.
For distributed tracing, Datadog offers strong out-of-the-box support, while Splunk emphasizes flexibility and standards like OpenTelemetry.