Some engineering problems look easy, but when coding agents like Sourcegraph's Amp get involved, they reveal surprising complexity. One such example? Gracefully shutting down an MCP server.

🧠 The Setup

I’ve been building a GitHub-integrated agent infrastructure, powered by an MCP (Model Context Protocol) server. The server handles tasks like:

Fetching build/test logs,
Checking the build status,
Responding to GitHub PR comments.

One of the main challenges was that I wanted to support multiple repositories with one server. Amp and I tried out various ideas for how to do that. We ended up creating a subprocess worker for each repository.

Each worker:

Runs in its own process,
Listens on its own port,
Is managed by a master process.

Shutdown Design

A major difficulty was the shutdown process. I thought it'd be relatively simple:

Master receives SIGTERM.
Master iterates through workers and triggers their shutdown.
Each worker:
- Notifies the MCP client to prepare for shutdown.
- Waits for the connection to cleanly close (can take up to 30 seconds!)
- Returns.
Once all workers successfully returned, exit.
If any worker doesn't return within the timeout, just kill the worker.
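
The master's side of this design can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: it assumes the workers are plain `subprocess.Popen` objects that react to SIGTERM, and the names `shutdown_workers` and `install_sigterm_handler` are made up for this example. The worker-side part (notifying the MCP client and waiting for the connection to close) is left out.

```python
import signal
import subprocess
import sys
import time


def shutdown_workers(workers, timeout=30.0):
    """Ask every worker to stop, then kill any that outlive the shared deadline.

    `workers` is a list of subprocess.Popen objects (an assumption of this sketch).
    Returns the PIDs of the workers that had to be killed.
    """
    # Signal all workers first so their shutdowns run in parallel.
    for proc in workers:
        if proc.poll() is None:
            proc.terminate()  # sends SIGTERM on POSIX

    deadline = time.monotonic() + timeout
    killed = []
    for proc in workers:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL: the worker missed the deadline
            proc.wait()  # reap it so no zombie is left behind
            killed.append(proc.pid)
    return killed


def install_sigterm_handler(workers, timeout=30.0):
    """Run the shutdown sequence when the master itself receives SIGTERM."""
    def handle_sigterm(signum, frame):
        killed = shutdown_workers(workers, timeout)
        sys.exit(1 if killed else 0)

    signal.signal(signal.SIGTERM, handle_sigterm)
```

All workers share one deadline here, so the total shutdown time stays bounded by the timeout instead of growing with the number of workers.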

Sounds clean, right? In practice… not so much.

🐛 The Subtleties (a.k.a. Landmines)

Even with a defined design, the shutdown process unraveled due to subtle system-level behaviors:

Port detection mismatch:
The startup and shutdown logic used different methods to check if a port was free. This led to workers incorrectly assuming ports were available, while the master failed to bind because the ports were still in use.
Lingering ports:
Even after a socket is closed, the OS may hold the port in TIME_WAIT for ~30 seconds. If you restart too soon, binding to the port can fail, sometimes silently and only sporadically.
Worker subprocesses lack environment context:
The GitHub token (needed for API calls) wasn’t inherited by the subprocess, leading to puzzling authentication failures.
launchctl kills too fast:
On macOS, launchctl sends a SIGKILL 5 seconds after SIGTERM by default. This clashes with the graceful 30-second shutdown design unless you explicitly configure a longer timeout.
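
To illustrate the first two points, here is one way a single, shared port check could look. It is a sketch under assumptions rather than the project's actual code: it assumes workers listen on `127.0.0.1` and that the real server sets `SO_REUSEADDR` before binding (so the probe does the same). The helper names `is_port_free` and `wait_until_port_free` are hypothetical.

```python
import socket
import time


def is_port_free(port, host="127.0.0.1"):
    """Return True if a socket can be bound to (host, port) right now.

    The probe binds the same way the server is assumed to bind, so startup
    and shutdown code get the same answer from the same function.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        # With SO_REUSEADDR, old connections in TIME_WAIT do not block the
        # bind, but another socket still listening on the port does.
        probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def wait_until_port_free(port, timeout=30.0, interval=0.5, host="127.0.0.1"):
    """Poll is_port_free until it succeeds or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if is_port_free(port, host):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The important part is less the exact socket options than the fact that both startup and shutdown call the same function.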

🧪 What the Agent Did (and Didn't) Do

I gave Amp the task of making this shutdown process robust. The results were mixed.

What Amp did:

Found https://pypi.org/project/mcp/, which helped with message formatting and similar details.
Tried tweaking timeouts (blindly).
Piled on more control logic (without fully understanding the interactions).
Ignored spec-level shutdown features (e.g. the MCP shutdown message handshake - which is kind of ironic as Amp literally didn't understand itself here!)

What I had to point out:

Add logs to understand what's actually happening instead of guessing. Repeatedly.
Respect the MCP protocol, including graceful shutdown signals - I pointed Amp to this website that detailed the protocol.
Unify port availability checks, so shutdown and startup use consistent logic.
Understand system behavior, like launchctl sending a SIGKILL (and that it’s configurable via ExitTimeOut).
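
For the last point, a job definition with a longer ExitTimeOut could be generated along these lines. This is only a sketch: `Label`, `ProgramArguments` and `ExitTimeOut` are standard launchd.plist keys, but the label, the program arguments and the 45-second value are placeholders chosen for illustration, and `build_launchd_plist` is a hypothetical helper.

```python
import plistlib


def build_launchd_plist(label, program_args, exit_timeout=45):
    """Return the XML bytes of a launchd job definition.

    `exit_timeout` is the number of seconds launchd waits between SIGTERM
    and SIGKILL when stopping the job. It should be longer than the
    30-second graceful shutdown window.
    """
    job = {
        "Label": label,
        "ProgramArguments": list(program_args),
        "ExitTimeOut": exit_timeout,
    }
    return plistlib.dumps(job)  # XML format by default


if __name__ == "__main__":
    # Placeholder label and command, not the real service definition.
    xml_bytes = build_launchd_plist(
        "com.example.github-agent",
        ["/usr/bin/python3", "server.py"],
    )
    print(xml_bytes.decode("utf-8"))
```

The printed XML would go into the job's .plist file. Leaving a margin above the worker timeout gives the master time to kill stragglers and exit on its own.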

🎓 Lessons Learned

I’m still exploring which classes of problems coding agents can handle well. This one seemed like a great fit: bounded complexity, OS-level behaviors, well-defined shutdown sequences.

And yet it was clearly too much for Amp.

Coding agents can assist, but only within a tightly scoped, observable problem. Once things involve concurrency, system-level timing, and nuanced behavior (like TIME_WAIT), you still need a human to recognize the patterns and know where to look.

As always, if you want to check out the current implementation: https://github.com/mstriebeck/github-agent

When Coding Agents Struggle: Lessons from Shutting Down the MCP Server