GSoC Week 3 & 4: Debugging Adventures and Breakthrough Moments


A quick note before we dive in: I know it's been a little while! My last update was about three weeks ago. I had planned to post this sooner, but I've been a bit caught up with family gatherings. It's the festival season here where I live, so things have been wonderfully busy. Thanks for your patience!
If the first two weeks were about laying the foundation for Lima's driver plugin system, weeks 3 and 4 were all about getting into the nitty-gritty of network programming and solving some truly head-scratching problems. Let me take you through this journey where I went from feeling completely stuck to having those "eureka!" moments that make all the debugging worth it.
The Plugin Cleanup Victory
The week started on a positive note. I had successfully implemented both reuse of an existing plugin and cleanup mechanisms. Now when limactl finishes running, it properly cleans up after itself - no more lingering processes or resources.
The Great gRPC Mystery
Then came the challenge that would dominate most of these two weeks: the GuestAgentConn() problem. Picture this - everything seems to be working fine, drivers are loading, the plugin system is functioning, but then I encounter this error:
WARN[0054] [hostagent] guest agent events closed unexpectedly
error="rpc error: code = Unavailable desc = error reading from server: stream receive error: rpc error: code = Canceled desc = context canceled"
The most frustrating part? The ga.sock file would appear in the Lima instance directory, and then it would disappear.
➜ ls /Users/ansumansahoo/.lima/default
ga.sock ha.pid qemu.pid # ... other files
➜ ls /Users/ansumansahoo/.lima/default
ha.pid qemu.pid # ... other files (no ga.sock!)
Learning Network Programming the Hard Way
This is where I had to dive deep into network programming concepts. I spent considerable time learning from two fantastic resources that I'd recommend to anyone working with networks in Go:
Matt Layher's GopherCon talk on "Building a net.Conn Type From the Ground Up" - This was pure gold for understanding how Go's networking internals work, especially when dealing with custom connection types and the runtime network poller.
Beej's Guide to Network Programming - While it's written for C, the fundamental concepts of sockets, TCP/UDP, and network programming patterns are universal. It really helped me understand what was happening under the hood.
The Mentor's Wisdom
When I was completely stuck with the gRPC hijack approach over stdio, my mentor Akihiro Suda stepped in with some crucial insights. He suggested exploring alternatives:
"grpchijack-over-stdio could be a design mistake. Aside from the implementation complexity, it is unlikely to be efficient for port forwarding packets."
He pointed me towards a much simpler approach: just let the external drivers listen on ga.sock directly.
The Breakthrough: Network Proxying
This led me to implement a basic network proxy for vsock connections. The idea was elegant: instead of trying to hijack the gRPC connection, create a proxy that bridges the gap between different connection types.
I learned about Lima's existing bicopy utility, which handles bidirectional copying between network connections. But wait - I was making a fundamental mistake! As Akihiro pointed out: "You have to listen, not dial." I was trying to dial the Unix socket when I should have been listening on it.
// Only non-Unix connections (e.g. vsock) need a proxy socket.
if connType != "unix" {
	proxySocketPath := filepath.Join(s.driver.Info().InstanceDir, filenames.GuestAgentSock)
	// Listen on ga.sock (do not dial it) so the host agent can connect to us.
	listener, err := net.Listen("unix", proxySocketPath)
	if err != nil {
		logrus.Errorf("Failed to create proxy socket: %v", err)
		return nil, err
	}
	go func() {
		defer listener.Close()
		defer conn.Close()
		// Accept a single client and bridge it to the guest agent connection.
		proxyConn, err := listener.Accept()
		if err != nil {
			logrus.Errorf("Failed to accept proxy connection: %v", err)
			return
		}
		bicopy.Bicopy(conn, proxyConn, nil)
	}()
}
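To make the "listen, not dial" point concrete outside of Lima, here is a minimal, self-contained sketch. It is not Lima's code: `pipeBoth` is a hypothetical stand-in for the bicopy utility built on `io.Copy`, the guest agent side is faked with an in-memory `net.Pipe` echo, and the socket lives in a temporary directory rather than the instance directory.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"path/filepath"
)

// pipeBoth copies a->b and b->a, then closes both connections as soon as
// either direction finishes so the other copy is unblocked.
// Hypothetical stand-in for Lima's bicopy utility.
func pipeBoth(a, b net.Conn) {
	done := make(chan struct{}, 2)
	go func() { io.Copy(a, b); done <- struct{}{} }()
	go func() { io.Copy(b, a); done <- struct{}{} }()
	<-done
	a.Close()
	b.Close()
	<-done
}

// serveProxy LISTENS on socketPath (it does not dial it) and bridges the
// first client that connects to the backend connection.
func serveProxy(socketPath string, backend net.Conn) error {
	ln, err := net.Listen("unix", socketPath)
	if err != nil {
		return err
	}
	go func() {
		defer ln.Close()
		client, err := ln.Accept()
		if err != nil {
			backend.Close()
			return
		}
		pipeBoth(client, backend)
	}()
	return nil
}

// roundTrip sends msg through the proxy to an in-memory echo backend
// and returns whatever comes back.
func roundTrip(msg string) (string, error) {
	dir, err := os.MkdirTemp("", "ga")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	// near/far stand in for the non-Unix (e.g. vsock) guest agent connection.
	near, far := net.Pipe()
	go io.Copy(far, far) // fake guest agent: echo every byte back

	sock := filepath.Join(dir, "ga.sock")
	if err := serveProxy(sock, near); err != nil {
		return "", err
	}

	// The consumer (the host agent in the post) is the side that dials.
	client, err := net.Dial("unix", sock)
	if err != nil {
		return "", err
	}
	defer client.Close()
	if _, err := client.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(client, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := roundTrip("ping")
	fmt.Println(got, err)
}
```

The key design point is that the listener exists before anyone dials: `serveProxy` only returns after `net.Listen` has succeeded, so the socket file is already in place when the client connects.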
Transport Layer Evolution
During this journey, I also experimented with switching the gRPC transport layer from stdio pipes to Unix sockets. While this didn't solve the core problem, it taught me valuable lessons about different transport mechanisms and their trade-offs:
stdio pipes: Simple but can be fragile for complex bidirectional communication
Unix sockets: More robust, but require careful lifecycle management
Architecture Improvements
Beyond solving the guest agent connection issue, I also worked on improving the overall plugin architecture:
Driver Discovery Enhancement
Fixed driver discovery in ~/.local/libexec/lima/drivers/ to properly scan for external drivers. Now Lima can find and load drivers from the standard system locations.
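As a rough illustration of what scanning a driver directory can look like, here is a small sketch. It makes assumptions that are not from this post: that external drivers are executables whose file names start with a `lima-driver-` prefix, and that a Unix-style executable bit is a good enough filter. The function names are hypothetical, and the demo uses a temporary directory instead of the real path.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// driverPrefix is an assumed naming convention used only for this sketch.
const driverPrefix = "lima-driver-"

// discoverDrivers maps driver name -> binary path for every executable
// regular entry in dir whose name starts with driverPrefix.
// A missing directory is treated as "no external drivers".
func discoverDrivers(dir string) (map[string]string, error) {
	found := make(map[string]string)
	entries, err := os.ReadDir(dir)
	if err != nil {
		if os.IsNotExist(err) {
			return found, nil
		}
		return nil, err
	}
	for _, e := range entries {
		name := e.Name()
		if e.IsDir() || !strings.HasPrefix(name, driverPrefix) {
			continue
		}
		info, err := e.Info()
		if err != nil || info.Mode()&0o111 == 0 {
			continue // unreadable or not executable
		}
		found[strings.TrimPrefix(name, driverPrefix)] = filepath.Join(dir, name)
	}
	return found, nil
}

// demoDiscover builds a throwaway directory with one valid driver, one
// non-executable file, and one unrelated file, then returns the sorted
// names that discoverDrivers reports.
func demoDiscover() ([]string, error) {
	dir, err := os.MkdirTemp("", "drivers")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	files := map[string]os.FileMode{
		"lima-driver-foo": 0o755, // picked up
		"lima-driver-bar": 0o644, // skipped: not executable
		"README":          0o755, // skipped: wrong prefix
	}
	for name, mode := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("#!/bin/sh\n"), mode); err != nil {
			return nil, err
		}
	}

	found, err := discoverDrivers(dir)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(found))
	for n := range found {
		names = append(names, n)
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	names, err := demoDiscover()
	fmt.Println(names, err)
}
```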
Priority System
Implemented a priority system where internal drivers take precedence over external ones when both exist. This ensures backwards compatibility while allowing for external extensions.
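The precedence rule can be sketched in a few lines. This is only an illustration of the idea, not the actual Lima registry: `resolveDriver`, the two maps, and the paths are hypothetical.

```go
package main

import "fmt"

// resolveDriver reports where the named driver should come from.
// Internal (built-in) drivers win whenever both kinds exist.
func resolveDriver(name string, internal map[string]bool, external map[string]string) (string, bool) {
	if internal[name] {
		return "internal", true
	}
	if path, ok := external[name]; ok {
		return "external:" + path, true
	}
	return "", false
}

func main() {
	// Hypothetical registries, for illustration only.
	internal := map[string]bool{"qemu": true, "vz": true}
	external := map[string]string{
		"qemu": "/path/to/lima-driver-qemu", // shadowed by the internal driver
		"foo":  "/path/to/lima-driver-foo",
	}
	for _, name := range []string{"qemu", "foo", "missing"} {
		source, ok := resolveDriver(name, internal, external)
		fmt.Println(name, source, ok)
	}
}
```

Checking the internal set first means an external binary that happens to share a built-in driver's name cannot silently replace it, which is what keeps existing setups behaving as before.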
Driver-Level Proxying
Moved the proxying connection logic directly into the driver level rather than trying to handle it generically. This makes the implementation cleaner and allows each driver to handle its specific connection requirements.
Platform Challenges
I also started testing on Windows, which brought its own set of challenges. WSL2 integration has some quirks, and I encountered YAML parsing issues that I'm still working through. Cross-platform development always adds complexity, but it's essential for Lima's broad adoption.
Lessons Learned
These two weeks taught me several valuable lessons:
Sometimes, step back and simplify: When the complex solution isn't working, often a simpler approach is better.
Network programming is an art: Understanding the fundamentals of sockets, connection lifecycle, and data flow is crucial.
Mentorship is invaluable: Having someone with experience guide you through architectural decisions saves enormous amounts of time.
Don't be afraid to completely change approach: I spent days trying to make gRPC hijack work before switching to the proxy approach.
Looking Ahead
As I wrapped up these two weeks, the core guest agent connection problem was finally solved, but there's still exciting work ahead:
Performance benchmarking: Ensuring the plugin system doesn't introduce performance regressions
Makefile integration: Making it easy to build drivers as internal or external
Windows support: Getting the full system working cross-platform
Image downloader refactoring: Moving image downloading logic for better modularity
Cherry-picking commits: Tidying up commits from a large PR to make smaller ones, preparing for midterm evaluations.
The plugin system is really taking shape now. What started as an abstract idea is becoming a concrete, working implementation that will enable Lima to support a much broader ecosystem of VM drivers.
The Joy of Problem Solving
Looking back on these weeks, what strikes me most is how the debugging process, while frustrating in the moment, led to a much deeper understanding of Lima's architecture and network programming in general. The moment when the proxy approach finally worked and I saw clean bidirectional communication between the host agent and guest agent through the external driver - that was pure magic. These are the moments that make all the late-night debugging sessions worth it.
Written by Anshuman Sahoo
