Zero-Config Remote Worker Mesh


I've been fascinated by distributed systems lately, especially after dealing with the pain of setting up worker clusters that require extensive configuration files, service registries, and manual IP management. What if we could build something simpler? What if worker nodes could just... find each other?
That's exactly what I set out to build: a zero-configuration worker mesh where nodes automatically discover peers, distribute jobs, and sync state using nothing but UDP broadcasts and Protocol Buffers.
The Problem with Traditional Distributed Systems
Most distributed task systems I've worked with follow this pattern:
- Central scheduler (like a Kubernetes master)
- Service registry (like etcd)
- Static configuration files with hardcoded IPs
- Complex networking setup
This works, but it's heavy. For smaller deployments or edge computing scenarios, you often want something that "just works" without the operational overhead.
What is a Worker Mesh?
Think of it like this: instead of having a boss (central scheduler) telling workers what to do, imagine if the workers could talk directly to each other. Each worker node can:
- Broadcast "hey, I'm here and available" to the local network
- Listen for work requests from any other node
- Execute tasks and report results back
- Keep track of who's online and who's busy
No master, no configuration files, no service discovery headaches.
Why Protocol Buffers?
When I started this project, my first instinct was to use JSON for node communication. But then I remembered the performance issues I'd faced before with chatty microservices.
JSON problems:
- Verbose (lots of overhead for simple messages)
- No schema enforcement
- Parsing is expensive
- No versioning story
Protocol Buffers solve all of this:
- Binary format is compact and fast
- Schema-driven with strong typing
- Built-in versioning and backward compatibility
- Cross-language support (Go, Python, Java, etc.)
Here's a simple example. A JSON heartbeat might look like:
{
  "node_id": "worker-abc123",
  "address": "192.168.1.100:8080",
  "status": "IDLE",
  "last_heartbeat": "2025-07-21T10:30:00Z",
  "metadata": {"hostname": "worker01", "pid": "1234"}
}
The equivalent Protobuf message is ~60% smaller and deserializes much faster. For a mesh broadcasting heartbeats every 5 seconds, this adds up.
System Architecture: The Big Picture
The beauty is in the simplicity. Each node is identical and self-contained:
- Discovery: UDP broadcasts for peer finding
- Job Engine: Execute shell commands with timeouts
- Database: SQLite for local state persistence
- API: REST interface for external control
- Mesh Protocol: Protobuf messages for efficiency
The Protobuf Schema: Defining Our Language
The heart of the mesh communication is three message types:
// Every 5 seconds, nodes broadcast this
message NodeInfo {
  string node_id = 1;                            // "node-abc123"
  string address = 2;                            // "192.168.1.100:8080"
  NodeStatus status = 3;                         // IDLE, BUSY, DRAINING, FAILED
  google.protobuf.Timestamp last_heartbeat = 4;
  map<string, string> metadata = 5;
  string version = 6;                            // For compatibility checks
}

// Work to be distributed
message Job {
  string job_id = 1;             // UUID
  string target_node_id = 2;     // Which node should run this
  string command = 3;            // Shell command to execute
  map<string, string> env = 4;   // Environment variables
  int32 timeout_seconds = 5;     // Kill job after this
  JobStatus status = 6;          // PENDING, RUNNING, COMPLETED
  string created_by_node = 7;    // Source node
}

// Results after job execution
message JobResult {
  string job_id = 1;
  string executor_node_id = 2;
  int32 exit_code = 3;    // 0 = success
  string stdout = 4;      // Command output
  string stderr = 5;      // Error output
  google.protobuf.Timestamp started_at = 6;
  google.protobuf.Timestamp finished_at = 7;
}
These messages get wrapped in a MeshMessage envelope that identifies the message type and sender.
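The envelope itself isn't shown above. A minimal version might look like the following sketch — the field names and numbers here are my guesses, not the project's actual schema:

```protobuf
// Hypothetical envelope -- actual field names/numbers may differ
message MeshMessage {
  string sender_node_id = 1;
  google.protobuf.Timestamp sent_at = 2;
  oneof payload {
    NodeInfo node_info = 3;
    Job job = 4;
    JobResult job_result = 5;
  }
}
```

A `oneof` payload keeps the envelope compact: the receiver switches on which field is set instead of parsing a separate type tag.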
Zero-Config Discovery: How Nodes Find Each Other
This is where the magic happens. Instead of requiring configuration files or service registries, nodes use UDP broadcast discovery.
Here's the discovery algorithm:
1. Node Startup:
   - Generate unique node ID (crypto/rand)
   - Bind UDP socket to :8080
   - Start heartbeat timer (5 second interval)
   - Start cleanup timer (remove stale peers)
2. Heartbeat Broadcast:
   - Create NodeInfo with current status
   - Serialize to Protobuf bytes
   - UDP broadcast to 255.255.255.255:8080
   - All nodes on LAN receive message
3. Message Reception:
   - Receive UDP packet
   - Deserialize Protobuf
   - Validate sender != self
   - Update peer table in memory + database
   - Log peer discovery
4. Peer Cleanup:
   - Every 10 seconds, check peer timestamps
   - Remove peers silent for >30 seconds
   - Handle reconnections gracefully
Why UDP? It's perfect for this use case:
- Connectionless: No TCP overhead or connection state
- Broadcast-friendly: Single message reaches all local nodes
- Simple: No complex error handling needed
- Fast: Immediate delivery without handshakes
The trade-off is no delivery guarantee, but that's fine for heartbeats since the next one arrives in 5 seconds.
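To make the UDP path concrete, here's a self-contained round-trip sketch. It sends to a loopback listener instead of 255.255.255.255 so it runs anywhere, and it carries a raw string where the real mesh would marshal a Protobuf NodeInfo:

```go
package main

import (
	"fmt"
	"net"
)

// roundTrip sends msg over UDP to a loopback listener and returns what was
// received -- a stand-in for the mesh's broadcast/receive path.
func roundTrip(msg string) (string, error) {
	// Listener stands in for a peer's discovery socket (real code binds :8080)
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return "", err
	}
	defer conn.Close()

	// "Broadcast" a heartbeat -- here just a unicast to the listener
	sender, err := net.DialUDP("udp", nil, conn.LocalAddr().(*net.UDPAddr))
	if err != nil {
		return "", err
	}
	defer sender.Close()
	if _, err := sender.Write([]byte(msg)); err != nil {
		return "", err
	}

	// Receive one datagram (real code would unmarshal Protobuf here)
	buf := make([]byte, 1500)
	n, _, err := conn.ReadFromUDP(buf)
	if err != nil {
		return "", err
	}
	return string(buf[:n]), nil
}

func main() {
	got, err := roundTrip("heartbeat:node-abc123")
	if err != nil {
		panic(err)
	}
	fmt.Println("received:", got)
}
```

One real-world note: sending to the broadcast address requires the socket's broadcast option to be permitted by the OS, which is another reason the loopback version is used for this sketch.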
Job Lifecycle: From API Call to Shell Execution
Let's trace a job through the entire system:
Phase 1: Job Submission
A client submits a job via REST API:
curl -X POST http://node1:3000/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "command": "python3 process_data.py --input /tmp/data.csv",
    "env": {"PYTHONPATH": "/app/libs", "DEBUG": "1"},
    "timeout_seconds": 300
  }'
The handler converts this to a Protobuf Job message and generates a UUID.
Phase 2: Local Queuing
The job gets stored in SQLite and added to an in-memory queue (Go channel). This provides durability and allows the API to return immediately.
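The in-memory side of that queue can be as simple as a buffered channel. A sketch — the `Job` type, buffer size, and `submit` name are placeholders, and the SQLite write is elided:

```go
package main

import "fmt"

// Job is a placeholder for the Protobuf Job message.
type Job struct {
	JobID   string
	Command string
}

// Node holds the in-memory queue; the real node persists to SQLite first.
type Node struct {
	jobQueue chan *Job
}

// submit enqueues without blocking the API handler: if the buffer has room
// the job is accepted immediately, otherwise the caller can reject it.
func (n *Node) submit(j *Job) bool {
	select {
	case n.jobQueue <- j:
		return true
	default:
		return false // queue full -- the API could return 503 here
	}
}

func main() {
	n := &Node{jobQueue: make(chan *Job, 64)}
	n.submit(&Job{JobID: "550e8400", Command: "uptime"})
	fmt.Println("queued:", len(n.jobQueue))
}
```

The non-blocking `select` is what lets the API return immediately: the handler never waits on a busy worker, it just hands the job to the channel.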
Phase 3: Execution Engine
A worker goroutine processes jobs from the queue:
Here's a simplified version of the execution logic:
// Simplified execution logic
func (n *Node) executeJob(ctx context.Context, job *pb.Job) {
	// Mark the node busy for the duration of the job
	n.setStatus(pb.NodeStatus_BUSY)
	defer n.setStatus(pb.NodeStatus_IDLE)

	// Create timeout context so runaway jobs get killed
	jobCtx, cancel := context.WithTimeout(ctx,
		time.Duration(job.TimeoutSeconds)*time.Second)
	defer cancel()

	// Parse command and execute
	parts := strings.Fields(job.Command)
	cmd := exec.CommandContext(jobCtx, parts[0], parts[1:]...)

	// Set environment variables
	if job.Env != nil {
		cmd.Env = append(os.Environ(), envMapToSlice(job.Env)...)
	}

	// Capture stdout and stderr separately, plus timing
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	startTime := time.Now()
	err := cmd.Run()
	finishTime := time.Now()

	// Create result
	result := &pb.JobResult{
		JobId:          job.JobId,
		ExecutorNodeId: n.ID,
		ExitCode:       getExitCode(err),
		Stdout:         stdout.String(),
		Stderr:         stderr.String(),
		StartedAt:      timestamppb.New(startTime),
		FinishedAt:     timestamppb.New(finishTime),
	}

	// Store result in database
	n.storeJobResult(job, result)
}
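The snippet leans on two helpers, `envMapToSlice` and `getExitCode`, without defining them. Here's one plausible implementation of each — my sketch, not necessarily the project's:

```go
package main

import (
	"fmt"
	"os/exec"
)

// envMapToSlice converts {"K": "V"} into ["K=V", ...], the format
// exec.Cmd.Env expects.
func envMapToSlice(env map[string]string) []string {
	out := make([]string, 0, len(env))
	for k, v := range env {
		out = append(out, k+"="+v)
	}
	return out
}

// getExitCode maps the error from cmd.Run to a numeric exit code:
// nil means 0, an *exec.ExitError carries the process's real code,
// and anything else (e.g. command not found) becomes -1.
func getExitCode(err error) int32 {
	if err == nil {
		return 0
	}
	if exitErr, ok := err.(*exec.ExitError); ok {
		return int32(exitErr.ExitCode())
	}
	return -1
}

func main() {
	fmt.Println(envMapToSlice(map[string]string{"DEBUG": "1"}))
	fmt.Println(getExitCode(nil))
}
```

The -1 fallback matters: a failure to even start the process (bad path, missing binary) has no exit code, and folding it into 0 would silently report success.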
Phase 4: Result Storage
Job results get stored in SQLite with full execution details - exit codes, stdout/stderr, timing information, and any error messages.
State Management with Ent ORM
Rather than writing raw SQL, I used Ent for type-safe database operations. The schemas are straightforward:
Worker Node Schema:
type WorkerNode struct {
	NodeID        string            // Unique identifier
	Address       string            // IP:port for communication
	Status        string            // Current operational state
	LastHeartbeat time.Time         // When we last heard from this node
	Metadata      map[string]string // Hostname, PID, etc.
	Version       string            // Protocol compatibility
}
Job Schema:
type Job struct {
	JobID        string    // UUID for tracking
	TargetNodeID string    // Which node should execute
	Command      string    // Shell command to run
	Status       string    // Execution state
	ExitCode     int32     // Command result (0 = success)
	Stdout       string    // Command output
	Stderr       string    // Error output
	CreatedAt    time.Time // When job was submitted
	StartedAt    time.Time // When execution began
	FinishedAt   time.Time // When execution completed
}
Ent generates all the CRUD operations, migrations, and even provides a query builder. Much cleaner than hand-written SQL.
REST API: External Control Interface
The Echo REST API provides external access to the mesh:
GET /api/v1/health # Node health check
GET /api/v1/status # Current node status + peer count
GET /api/v1/peers # List all known peer nodes
GET /api/v1/jobs?limit=50 # Job execution history
POST /api/v1/jobs # Submit new job for execution
Example API responses:
Node Status:
{
  "node_id": "node-abc123",
  "address": "192.168.1.100:8080",
  "status": "IDLE",
  "peer_count": 2,
  "metadata": {
    "hostname": "worker01",
    "pid": "1234"
  }
}
Known Peers:
{
  "peers": [
    {
      "node_id": "node-def456",
      "address": "192.168.1.101:8080",
      "status": "BUSY",
      "last_heartbeat": "2025-07-21T10:30:05Z"
    }
  ],
  "count": 1
}
Job History:
{
  "jobs": [
    {
      "job_id": "550e8400-e29b-41d4-a716-446655440000",
      "command": "echo 'Hello World'",
      "status": "JOB_COMPLETED",
      "exit_code": 0,
      "stdout": "Hello World\n",
      "created_at": "2025-07-21T10:30:10Z",
      "finished_at": "2025-07-21T10:30:11Z"
    }
  ]
}
Testing the Mesh in Action
The best way to see this working is to spin up multiple nodes and watch them discover each other.
Terminal 1 - Start first node:
make run-node1 # API on :3001, UDP discovery on :8080
Terminal 2 - Start second node:
make run-node2 # API on :3002, UDP discovery on :8080
Terminal 3 - Start third node:
make run-node3 # API on :3003, UDP discovery on :8080
Within seconds, you'll see logs like:
Updated peer: node-def456 at 192.168.1.101:8080 (status: IDLE)
Updated peer: node-ghi789 at 192.168.1.102:8080 (status: IDLE)
Terminal 4 - Test the system:
# Check peer discovery worked
curl http://localhost:3001/api/v1/peers
# Submit a job
curl -X POST http://localhost:3001/api/v1/jobs \
-H "Content-Type: application/json" \
-d '{"command": "uptime"}'
# Check job results
curl http://localhost:3001/api/v1/jobs?limit=5
What's satisfying is watching the mesh adapt to changes. Kill a node and others remove it from their peer tables after 30 seconds. Restart it and it rejoins automatically.
Performance and Resilience
A few things I learned while building this:
UDP Reliability: Lost heartbeat packets are actually fine. Since nodes broadcast every 5 seconds, temporary packet loss doesn't matter. The mesh heals itself quickly.
Resource Usage: SQLite handles hundreds of jobs without issues. The UDP overhead is minimal - serialized heartbeat messages are only ~100 bytes on the wire.
Graceful Degradation: If nodes can't communicate, they continue working independently. When network connectivity returns, they resync automatically.
Job Timeouts: Using Go's context cancellation, jobs that exceed their timeout get killed cleanly without orphaned processes.
Concurrent Safety: All peer table updates use mutexes. Job queues use channels for thread-safe communication between goroutines.
What's Missing (Future Work)
This is just phase 1. Here's what I'm considering next:
Cross-Node Job Distribution: Currently jobs run locally. Next phase will let you submit jobs to specific remote nodes.
Job Result Broadcasting: When nodes complete jobs, broadcast results to the entire mesh for better visibility.
Load Balancing: Distribute jobs automatically based on node capacity and current workload.
Security: Add TLS encryption and node-to-node authentication for production use.
Web UI: A simple dashboard showing mesh topology, job history, and real-time status.
gRPC Integration: While UDP works great for discovery, gRPC might be better for larger job payloads.
Key Takeaways
Protocol Buffers are Worth It: The performance and versioning benefits are significant. Generated Go code is clean and type-safe.
UDP for Discovery Works: Don't overcomplicate service discovery. UDP broadcast is simple and effective for local networks.
SQLite is Underrated: For single-node storage, SQLite handles concurrent reads/writes beautifully. No need for external databases.
Go's Concurrency Shines: Goroutines and channels make concurrent programming straightforward. The job queue pattern is elegant.
Zero-Config is Achievable: With some thought, you can eliminate most configuration requirements. Nodes really can "just work."
Wrapping Up
Building this worker mesh taught me that distributed systems don't have to be complex. Sometimes the simplest approach - nodes talking directly to each other - works better than elaborate orchestration frameworks.
The combination of Go's concurrency, Protocol Buffers' efficiency, and UDP's simplicity creates a surprisingly powerful foundation for distributed computing.
Whether you're distributing CI/CD jobs, processing data across edge nodes, or running distributed tests, the patterns here provide a solid starting point.
The complete code is available on GitHub if you want to experiment with your own mesh. I'd love to hear about any interesting use cases or improvements you come up with.
Next time you're tempted to reach for a heavy orchestration framework, consider: do your nodes really need a central coordinator, or could they just talk to each other?
⚠️ This is not a perfect approach. I’m still actively learning, experimenting, and refining this system. If you’re a distributed systems engineer, protocol geek, or just curious — I’d love to hear your feedback, questions, or ideas. Let’s build and learn together.
Written by Krish Srivastava