When Coding Agents Struggle: Lessons from Shutting Down the MCP Server


Some engineering problems look easy—but when coding agents like Sourcegraph's AMP get involved, they reveal surprising complexity. One such example? Gracefully shutting down an MCP server.
🧠 The Setup
I’ve been building a GitHub-integrated agent infrastructure, powered by an MCP (Model Context Protocol) server. The server handles tasks like:
Fetching build/test logs,
Checking the build status,
Responding to GitHub PR comments.
One of the main challenges was that I wanted to support multiple repositories with one server. Amp and I tried out various ideas for how to do that. We ended up creating a subprocess worker for each repository.
Each worker:
Runs in its own process,
Listens on its own port,
Is managed by a master process.
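A minimal sketch of that layout, assuming the workers are started with `subprocess.Popen`. The `--repo` and `--port` flags, the `Worker` class, and the `GITHUB_TOKEN` variable name are hypothetical and not taken from the actual implementation. The environment is passed explicitly so that a missing token fails loudly at startup rather than surfacing later as an authentication error.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Worker:
    repo: str
    port: int
    process: subprocess.Popen


def build_worker_env(base_env, required=("GITHUB_TOKEN",)):
    """Copy the parent environment and fail early if a required variable is missing."""
    env = dict(base_env)
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return env


def spawn_worker(repo, port, command, env):
    """Start one worker process for one repository on its own port."""
    # The --repo/--port flags are illustrative; a real worker defines its own CLI.
    args = [*command, "--repo", repo, "--port", str(port)]
    process = subprocess.Popen(args, env=env)
    return Worker(repo=repo, port=port, process=process)
```

The master would call `spawn_worker` once per configured repository and keep the returned `Worker` objects around for later shutdown.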
Shutdown Design
A major difficulty was the shutdown process. I thought it'd be relatively simple:
Master receives SIGTERM.
Master iterates through the workers and triggers their shutdown:
Each worker
Notifies the MCP client to prepare for shutdown.
Waits for the connection to cleanly close (can take up to 30 seconds!)
Returns.
Once all workers have returned successfully, the master exits.
If any worker doesn't return within the timeout, the master kills that worker.
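The master's side of the steps above can be sketched roughly as follows. This assumes POSIX signals and workers held as `subprocess.Popen` objects. The function name and the `"clean"`/`"killed"` labels are illustrative, and the worker-side logic (notifying the MCP client and waiting for the connection to close) is not shown.

```python
import subprocess
import time


def shutdown_workers(workers, timeout=30.0):
    """Ask every worker to stop, wait up to `timeout` seconds overall, kill stragglers.

    `workers` maps a name to a subprocess.Popen object.
    Returns a dict mapping each name to "clean" or "killed".
    """
    # Signal all workers first so they shut down in parallel, not one by one.
    for proc in workers.values():
        if proc.poll() is None:
            proc.terminate()  # sends SIGTERM on POSIX

    # One shared deadline, so the total wait never exceeds `timeout`.
    deadline = time.monotonic() + timeout
    results = {}
    for name, proc in workers.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
            results[name] = "clean"
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL: the worker missed the deadline
            proc.wait()  # reap the process so it does not linger as a zombie
            results[name] = "killed"
    return results
```

Using a single deadline for all workers matters here: waiting up to 30 seconds per worker in sequence could make the total shutdown time grow with the number of repositories.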
Sounds clean, right? In practice… not so much.
🐛 The Subtleties (a.k.a. Landmines)
Even with a defined design, the shutdown process unraveled due to subtle system-level behaviors:
Port detection mismatch:
The startup and shutdown logic used different methods to check if a port was free. This led to workers incorrectly assuming ports were available, while the master failed to bind because the ports were still in use.
Lingering ports:
Even after a socket is closed, the OS may hold the port in TIME_WAIT for ~30 seconds. If you restart too soon, it fails silently or sporadically.
Worker subprocesses lack environment context:
The GitHub token (needed for API calls) wasn’t inherited by the subprocess, leading to puzzling authentication failures.
launchctl kills too fast:
On macOS, launchctl sends a SIGKILL 5 seconds after SIGTERM by default. This clashed with the graceful 30-second shutdown design unless you explicitly configure a longer timeout.
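One way to avoid the port detection mismatch is to route every check through a single helper that tests a port the same way the server will later claim it: by binding. The sketch below assumes TCP on localhost and a POSIX system. The function names are hypothetical, and it only helps if the real server also sets `SO_REUSEADDR` before binding.

```python
import socket
import time


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        # Set the same option the server is assumed to use, so leftover
        # TIME_WAIT entries do not make the probe disagree with the real bind.
        probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def wait_for_port(port, timeout=35.0, interval=0.5, host="127.0.0.1"):
    """Poll until the port is free or `timeout` seconds pass; return the final state."""
    deadline = time.monotonic() + timeout
    while True:
        if port_is_free(port, host):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Both startup and shutdown would then call `port_is_free` (or `wait_for_port` before a restart), so the two code paths can no longer disagree about whether a port is available.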
🧪 What the Agent Did (and Didn't) Do
I gave AMP the task to make this shutdown process robust. The results were mixed.
What AMP did:
Found the mcp package on PyPI (https://pypi.org/project/mcp/), which helped with message formatting and such.
Tried tweaking timeouts (blindly).
Piled on more control logic (without fully understanding the interactions).
Ignored spec-level shutdown features (e.g. the MCP shutdown message handshake - which is kind of ironic as Amp literally didn't understand itself here!)
What I had to point out:
Add logs to understand what's actually happening instead of guessing. Repeatedly.
Respect the MCP protocol, including graceful shutdown signals. I pointed Amp to this website that detailed the protocol.
Unify port availability checks, so shutdown and startup use consistent logic.
Understand system behavior, like launchctl sending a SIGKILL (and that it’s configurable via ExitTimeOut).
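For the last point, `ExitTimeOut` is a key in the launchd job definition (the `.plist` file). As a rough sketch, such a file could be generated with Python's standard `plistlib`. The label, the program arguments, and the 45-second value below are placeholders, not the project's actual configuration.

```python
import plistlib


def build_launchd_plist(label, program_arguments, exit_timeout=45):
    """Return launchd job XML (bytes) that allows `exit_timeout` seconds
    between SIGTERM and SIGKILL when the job is stopped."""
    job = {
        "Label": label,
        "ProgramArguments": list(program_arguments),
        # Should be longer than the worker shutdown timeout (30 s in this design).
        "ExitTimeOut": exit_timeout,
    }
    return plistlib.dumps(job)  # XML format by default


if __name__ == "__main__":
    # Placeholder label and command, for illustration only.
    xml = build_launchd_plist(
        "com.example.github-agent",
        ["/usr/bin/python3", "/path/to/master.py"],
    )
    print(xml.decode("utf-8"))
```

Picking a value somewhat above the 30-second worker timeout leaves the master enough room to kill stragglers and exit on its own before launchd steps in.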
🎓 Lessons Learned
I’m still exploring which classes of problems coding agents can handle well. This one seemed like a great fit: bounded complexity, OS-level behaviors, well-defined shutdown sequences.
And yet—it was clearly too much for AMP.
Coding agents can assist, but only within a tightly scoped, observable problem. Once things involve concurrency, system-level timing, and nuanced behavior (like TIME_WAIT), you still need a human to recognize the patterns and know where to look.
As always, if you want to check out the current implementation: https://github.com/mstriebeck/github-agent
Written by Mark Striebeck
