Zero-Config Remote Worker Mesh


I've been fascinated by distributed systems lately, especially after dealing with the pain of setting up worker clusters that require extensive configuration files, service registries, and manual IP management. What if we could build something simpler? What if worker nodes could just... find each other?
That's exactly what I set out to build: a zero-configuration worker mesh where nodes automatically discover peers, distribute jobs, and sync state using nothing but UDP broadcasts and Protocol Buffers.
The Problem with Traditional Distributed Systems
Most distributed task systems I've worked with follow this pattern:
- Central scheduler (like a Kubernetes master)
- Service registry (like etcd)
- Static configuration files with hardcoded IPs
- Complex networking setup
This works, but it's heavy. For smaller deployments or edge computing scenarios, you often want something that "just works" without the operational overhead.
What is a Worker Mesh?
Think of it like this: instead of having a boss (central scheduler) telling workers what to do, imagine if the workers could talk directly to each other. Each worker node can:
- Broadcast "hey, I'm here and available" to the local network
- Listen for work requests from any other node
- Execute tasks and report results back
- Keep track of who's online and who's busy
No master, no configuration files, no service discovery headaches.
Why Protocol Buffers?
When I started this project, my first instinct was to use JSON for node communication. But then I remembered the performance issues I'd faced before with chatty microservices.
JSON problems:
- Verbose (lots of overhead for simple messages)
- No schema enforcement
- Parsing is expensive
- No versioning story
Protocol Buffers solve all of this:
- Binary format is compact and fast
- Schema-driven with strong typing
- Built-in versioning and backward compatibility
- Cross-language support (Go, Python, Java, etc.)
Here's a simple example. A JSON heartbeat might look like:
{
  "node_id": "worker-abc123",
  "address": "192.168.1.100:8080",
  "status": "IDLE",
  "last_heartbeat": "2025-07-21T10:30:00Z",
  "metadata": {"hostname": "worker01", "pid": "1234"}
}
The equivalent Protobuf message is ~60% smaller and deserializes much faster. For a mesh broadcasting heartbeats every 5 seconds, this adds up.
System Architecture: The Big Picture
The beauty is in the simplicity. Each node is identical and self-contained:
- Discovery: UDP broadcasts for peer finding
- Job Engine: Execute shell commands with timeouts
- Database: SQLite for local state persistence
- API: REST interface for external control
- Mesh Protocol: Protobuf messages for efficiency
The Protobuf Schema: Defining Our Language
The heart of the mesh communication is three message types:
// Every 5 seconds, nodes broadcast this
message NodeInfo {
  string node_id = 1;                            // "node-abc123"
  string address = 2;                            // "192.168.1.100:8080"
  NodeStatus status = 3;                         // IDLE, BUSY, DRAINING, FAILED
  google.protobuf.Timestamp last_heartbeat = 4;
  map<string, string> metadata = 5;
  string version = 6;                            // For compatibility checks
}

// Work to be distributed
message Job {
  string job_id = 1;             // UUID
  string target_node_id = 2;     // Which node should run this
  string command = 3;            // Shell command to execute
  map<string, string> env = 4;   // Environment variables
  int32 timeout_seconds = 5;     // Kill job after this
  JobStatus status = 6;          // PENDING, RUNNING, COMPLETED
  string created_by_node = 7;    // Source node
}

// Results after job execution
message JobResult {
  string job_id = 1;
  string executor_node_id = 2;
  int32 exit_code = 3;    // 0 = success
  string stdout = 4;      // Command output
  string stderr = 5;      // Error output
  google.protobuf.Timestamp started_at = 6;
  google.protobuf.Timestamp finished_at = 7;
}
These messages get wrapped in a MeshMessage envelope that identifies the message type and sender.
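The envelope itself isn't shown above. A minimal version might look like the following sketch — the field names and numbers here are my guesses, not the project's actual schema:

```protobuf
// Hypothetical envelope -- actual field names/numbers may differ
message MeshMessage {
  string sender_node_id = 1;
  google.protobuf.Timestamp sent_at = 2;
  oneof payload {
    NodeInfo node_info = 3;
    Job job = 4;
    JobResult job_result = 5;
  }
}
```

A `oneof` payload keeps the envelope compact: the receiver switches on which field is set instead of parsing a separate type tag.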
Zero-Config Discovery: How Nodes Find Each Other
This is where the magic happens. Instead of requiring configuration files or service registries, nodes use UDP broadcast discovery.
Here's the discovery algorithm:
1. Node Startup:
   - Generate unique node ID (crypto/rand)
   - Bind UDP socket to :8080
   - Start heartbeat timer (5 second interval)
   - Start cleanup timer (remove stale peers)
2. Heartbeat Broadcast:
   - Create NodeInfo with current status
   - Serialize to Protobuf bytes
   - UDP broadcast to 255.255.255.255:8080
   - All nodes on LAN receive message
3. Message Reception:
   - Receive UDP packet
   - Deserialize Protobuf
   - Validate sender != self
   - Update peer table in memory + database
   - Log peer discovery
4. Peer Cleanup:
   - Every 10 seconds, check peer timestamps
   - Remove peers silent for >30 seconds
   - Handle reconnections gracefully
Why UDP? It's perfect for this use case:
- Connectionless: No TCP overhead or connection state
- Broadcast-friendly: Single message reaches all local nodes
- Simple: No complex error handling needed
- Fast: Immediate delivery without handshakes
The trade-off is no delivery guarantee, but that's fine for heartbeats since the next one arrives in 5 seconds.
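To make the UDP path concrete, here's a self-contained round-trip sketch. It sends to a loopback listener instead of 255.255.255.255 so it runs anywhere, and it carries a raw string where the real mesh would marshal a Protobuf NodeInfo:

```go
package main

import (
	"fmt"
	"net"
)

// roundTrip sends msg over UDP to a loopback listener and returns what was
// received -- a stand-in for the mesh's broadcast/receive path.
func roundTrip(msg string) (string, error) {
	// Listener stands in for a peer's discovery socket (real code binds :8080)
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return "", err
	}
	defer conn.Close()

	// "Broadcast" a heartbeat -- here just a unicast to the listener
	sender, err := net.DialUDP("udp", nil, conn.LocalAddr().(*net.UDPAddr))
	if err != nil {
		return "", err
	}
	defer sender.Close()
	if _, err := sender.Write([]byte(msg)); err != nil {
		return "", err
	}

	// Receive one datagram (real code would unmarshal Protobuf here)
	buf := make([]byte, 1500)
	n, _, err := conn.ReadFromUDP(buf)
	if err != nil {
		return "", err
	}
	return string(buf[:n]), nil
}

func main() {
	got, err := roundTrip("heartbeat:node-abc123")
	if err != nil {
		panic(err)
	}
	fmt.Println("received:", got)
}
```

One real-world note: sending to the broadcast address requires the socket's broadcast option to be permitted by the OS, which is another reason the loopback version is used for this sketch.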
Job Lifecycle: From API Call to Shell Execution
Let's trace a job through the entire system:
Phase 1: Job Submission
A client submits a job via REST API:
curl -X POST http://node1:3000/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "command": "python3 process_data.py --input /tmp/data.csv",
    "env": {"PYTHONPATH": "/app/libs", "DEBUG": "1"},
    "timeout_seconds": 300
  }'
The handler converts this to a Protobuf Job message and generates a UUID.
Phase 2: Local Queuing
The job gets stored in SQLite and added to an in-memory queue (Go channel). This provides durability and allows the API to return immediately.
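The in-memory side of that queue can be as simple as a buffered channel. A sketch — the `Job` type, buffer size, and `submit` name are placeholders, and the SQLite write is elided:

```go
package main

import "fmt"

// Job is a placeholder for the Protobuf Job message.
type Job struct {
	JobID   string
	Command string
}

// Node holds the in-memory queue; the real node persists to SQLite first.
type Node struct {
	jobQueue chan *Job
}

// submit enqueues without blocking the API handler: if the buffer has room
// the job is accepted immediately, otherwise the caller can reject it.
func (n *Node) submit(j *Job) bool {
	select {
	case n.jobQueue <- j:
		return true
	default:
		return false // queue full -- the API could return 503 here
	}
}

func main() {
	n := &Node{jobQueue: make(chan *Job, 64)}
	n.submit(&Job{JobID: "550e8400", Command: "uptime"})
	fmt.Println("queued:", len(n.jobQueue))
}
```

The non-blocking `select` is what lets the API return immediately: the handler never waits on a busy worker, it just hands the job to the channel.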
Phase 3: Execution Engine
A worker goroutine processes jobs from the queue:
Here's a simplified version of the execution logic:
// Simplified execution logic
func (n *Node) executeJob(ctx context.Context, job *pb.Job) {
	// Mark the node busy for the duration of the job
	n.setStatus(pb.NodeStatus_BUSY)
	defer n.setStatus(pb.NodeStatus_IDLE)

	// Create timeout context so runaway jobs get killed
	jobCtx, cancel := context.WithTimeout(ctx,
		time.Duration(job.TimeoutSeconds)*time.Second)
	defer cancel()

	// Parse command and execute
	parts := strings.Fields(job.Command)
	cmd := exec.CommandContext(jobCtx, parts[0], parts[1:]...)

	// Set environment variables
	if job.Env != nil {
		cmd.Env = append(os.Environ(), envMapToSlice(job.Env)...)
	}

	// Capture stdout and stderr separately, plus timing
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	startTime := time.Now()
	err := cmd.Run()
	finishTime := time.Now()

	// Create result
	result := &pb.JobResult{
		JobId:          job.JobId,
		ExecutorNodeId: n.ID,
		ExitCode:       getExitCode(err),
		Stdout:         stdout.String(),
		Stderr:         stderr.String(),
		StartedAt:      timestamppb.New(startTime),
		FinishedAt:     timestamppb.New(finishTime),
	}

	// Store result in database
	n.storeJobResult(job, result)
}
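The snippet leans on two helpers, `envMapToSlice` and `getExitCode`, without defining them. Here's one plausible implementation of each — my sketch, not necessarily the project's:

```go
package main

import (
	"fmt"
	"os/exec"
)

// envMapToSlice converts {"K": "V"} into ["K=V", ...], the format
// exec.Cmd.Env expects.
func envMapToSlice(env map[string]string) []string {
	out := make([]string, 0, len(env))
	for k, v := range env {
		out = append(out, k+"="+v)
	}
	return out
}

// getExitCode maps the error from cmd.Run to a numeric exit code:
// nil means 0, an *exec.ExitError carries the process's real code,
// and anything else (e.g. command not found) becomes -1.
func getExitCode(err error) int32 {
	if err == nil {
		return 0
	}
	if exitErr, ok := err.(*exec.ExitError); ok {
		return int32(exitErr.ExitCode())
	}
	return -1
}

func main() {
	fmt.Println(envMapToSlice(map[string]string{"DEBUG": "1"}))
	fmt.Println(getExitCode(nil))
}
```

The -1 fallback matters: a failure to even start the process (bad path, missing binary) has no exit code, and folding it into 0 would silently report success.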
Phase 4: Result Storage
Job results get stored in SQLite with full execution details - exit codes, stdout/stderr, timing information, and any error messages.
State Management with Ent ORM
Rather than writing raw SQL, I used Ent for type-safe database operations. The schemas are straightforward:
Worker Node Schema:
type WorkerNode struct {
	NodeID        string            // Unique identifier
	Address       string            // IP:port for communication
	Status        string            // Current operational state
	LastHeartbeat time.Time         // When we last heard from this node
	Metadata      map[string]string // Hostname, PID, etc.
	Version       string            // Protocol compatibility
}
Job Schema:
type Job struct {
	JobID        string    // UUID for tracking
	TargetNodeID string    // Which node should execute
	Command      string    // Shell command to run
	Status       string    // Execution state
	ExitCode     int32     // Command result (0 = success)
	Stdout       string    // Command output
	Stderr       string    // Error output
	CreatedAt    time.Time // When job was submitted
	StartedAt    time.Time // When execution began
	FinishedAt   time.Time // When execution completed
}
Ent generates all the CRUD operations, migrations, and even provides a query builder. Much cleaner than hand-written SQL.
REST API: External Control Interface
The Echo REST API provides external access to the mesh:
GET /api/v1/health # Node health check
GET /api/v1/status # Current node status + peer count
GET /api/v1/peers # List all known peer nodes
GET /api/v1/jobs?limit=50 # Job execution history
POST /api/v1/jobs # Submit new job for execution
Example API responses:
Node Status:
{
  "node_id": "node-abc123",
  "address": "192.168.1.100:8080",
  "status": "IDLE",
  "peer_count": 2,
  "metadata": {
    "hostname": "worker01",
    "pid": "1234"
  }
}
Known Peers:
{
  "peers": [
    {
      "node_id": "node-def456",
      "address": "192.168.1.101:8080",
      "status": "BUSY",
      "last_heartbeat": "2025-07-21T10:30:05Z"
    }
  ],
  "count": 1
}
Job History:
{
  "jobs": [
    {
      "job_id": "550e8400-e29b-41d4-a716-446655440000",
      "command": "echo 'Hello World'",
      "status": "JOB_COMPLETED",
      "exit_code": 0,
      "stdout": "Hello World\n",
      "created_at": "2025-07-21T10:30:10Z",
      "finished_at": "2025-07-21T10:30:11Z"
    }
  ]
}
Testing the Mesh in Action
The best way to see this working is to spin up multiple nodes and watch them discover each other.
Terminal 1 - Start first node:
make run-node1 # API on :3001, UDP discovery on :8080
Terminal 2 - Start second node:
make run-node2 # API on :3002, UDP discovery on :8080
Terminal 3 - Start third node:
make run-node3 # API on :3003, UDP discovery on :8080
Within seconds, you'll see logs like:
Updated peer: node-def456 at 192.168.1.101:8080 (status: IDLE)
Updated peer: node-ghi789 at 192.168.1.102:8080 (status: IDLE)
Terminal 4 - Test the system:
# Check peer discovery worked
curl http://localhost:3001/api/v1/peers
# Submit a job
curl -X POST http://localhost:3001/api/v1/jobs \
-H "Content-Type: application/json" \
-d '{"command": "uptime"}'
# Check job results
curl http://localhost:3001/api/v1/jobs?limit=5
What's satisfying is watching the mesh adapt to changes. Kill a node and others remove it from their peer tables after 30 seconds. Restart it and it rejoins automatically.
Performance and Resilience
A few things I learned while building this:
UDP Reliability: Lost heartbeat packets are actually fine. Since nodes broadcast every 5 seconds, temporary packet loss doesn't matter. The mesh heals itself quickly.
Resource Usage: SQLite handles hundreds of jobs without issues. The UDP overhead is minimal - serialized heartbeat messages are only ~100 bytes on the wire.
Graceful Degradation: If nodes can't communicate, they continue working independently. When network connectivity returns, they resync automatically.
Job Timeouts: Using Go's context cancellation, jobs that exceed their timeout get killed cleanly without orphaned processes.
Concurrent Safety: All peer table updates use mutexes. Job queues use channels for thread-safe communication between goroutines.
What's Missing (Future Work)
This is just phase 1. Here's what I'm considering next:
Cross-Node Job Distribution: Currently jobs run locally. Next phase will let you submit jobs to specific remote nodes.
Job Result Broadcasting: When nodes complete jobs, broadcast results to the entire mesh for better visibility.
Load Balancing: Distribute jobs automatically based on node capacity and current workload.
Security: Add TLS encryption and node-to-node authentication for production use.
Web UI: A simple dashboard showing mesh topology, job history, and real-time status.
gRPC Integration: While UDP works great for discovery, gRPC might be better for larger job payloads.
Key Takeaways
Protocol Buffers are Worth It: The performance and versioning benefits are significant. Generated Go code is clean and type-safe.
UDP for Discovery Works: Don't overcomplicate service discovery. UDP broadcast is simple and effective for local networks.
SQLite is Underrated: For single-node storage, SQLite handles concurrent reads/writes beautifully. No need for external databases.
Go's Concurrency Shines: Goroutines and channels make concurrent programming straightforward. The job queue pattern is elegant.
Zero-Config is Achievable: With some thought, you can eliminate most configuration requirements. Nodes really can "just work."
Wrapping Up
Building this worker mesh taught me that distributed systems don't have to be complex. Sometimes the simplest approach - nodes talking directly to each other - works better than elaborate orchestration frameworks.
The combination of Go's concurrency, Protocol Buffers' efficiency, and UDP's simplicity creates a surprisingly powerful foundation for distributed computing.
Whether you're distributing CI/CD jobs, processing data across edge nodes, or running distributed tests, the patterns here provide a solid starting point.
The complete code is available on GitHub if you want to experiment with your own mesh. I'd love to hear about any interesting use cases or improvements you come up with.
Next time you're tempted to reach for a heavy orchestration framework, consider: do your nodes really need a central coordinator, or could they just talk to each other?
⚠️ This is not a perfect approach. I’m still actively learning, experimenting, and refining this system. If you’re a distributed systems engineer, protocol geek, or just curious — I’d love to hear your feedback, questions, or ideas. Let’s build and learn together.
Written by Krish Srivastava