Battle of the AI Coding Assistants: A Real-World Flutter App Showdown


In the rapidly evolving landscape of AI-powered development tools, choosing the right coding assistant can significantly impact your productivity. I recently put three popular options through their paces: Google's Gemini CLI, Warp, and Anthropic's Claude Code. The challenge? Building a Flutter to-do application with specific requirements. Here's what I discovered.
Why Terminal Agents, Not IDE Assistants?
Before diving into the comparison, it's worth explaining why I focused exclusively on terminal-based agents rather than IDE assistants. In my experience, IDE assistants create a fundamental workflow problem: they operate in the same space where developers need to work, often preventing productive coding while the AI generates its suggestions. You're essentially forced to stop and wait, breaking your flow.
Terminal agents, on the other hand, work independently in their own space while you continue developing. This separation offers something even more valuable—they can provide the benefits of mob programming, where the AI acts as a collaborative partner working alongside you rather than interrupting your process. You can review, integrate, and iterate on their output at your own pace, maintaining productivity throughout.
The Setup
To ensure a fair comparison, I gave each tool the same initial prompt:
"Create a Flutter todo application in a folder called todo_app/apps/flutter_todo. It should use Riverpod for state management"
This straightforward request would test each assistant's ability to understand requirements, generate functional code, and handle a common development scenario.
Initial Results: Speed vs. Features
The Authentication Roadblock
Right out of the gate, Gemini CLI hit a snag. Authentication proved completely broken—I couldn't get it working with either my personal or work account. This immediate failure meant Gemini was out of the race before it even began, leaving just Warp and Claude Code to compete. Because Gemini CLI had only just been announced, I'll give Google some grace here, but authentication is a recurring pain point I have with their AI products.
Claude Code: The Speed Demon
Claude Code impressed with its raw performance, completing the initial task in half the time Warp took: 10 minutes versus 20. This is possibly due to Claude Code using Opus as its model compared to Warp using Sonnet. The speed advantage came despite Claude occasionally pausing to request permissions—a security feature that, while slightly interrupting the flow, didn't significantly impact overall performance.
Unlike Warp, Claude did generate tests for the application, though they were brittle: they relied on finding widgets by text content rather than more reliable handles like widget keys or types, a pattern that breaks the moment the UI copy changes.
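To illustrate the difference, here's a minimal sketch of my own (not Claude's actual output, and the widget names are hypothetical):

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

void main() {
  testWidgets('finds the add button without relying on its label',
      (tester) async {
    await tester.pumpWidget(
      MaterialApp(
        home: Scaffold(
          body: ElevatedButton(
            key: const Key('addTodoButton'), // stable identity for tests
            onPressed: () {},
            child: const Text('Add Todo'),
          ),
        ),
      ),
    );

    // Brittle: breaks if the copy changes to "New Todo".
    expect(find.text('Add Todo'), findsOneWidget);

    // Sturdier: keys and types survive wording tweaks.
    expect(find.byKey(const Key('addTodoButton')), findsOneWidget);
    expect(find.byType(ElevatedButton), findsOneWidget);
  });
}
```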
Warp: The Feature-Rich Contender
While Warp took longer to complete the task, it compensated with a more comprehensive implementation. The Warp-generated application included:
In-depth documentation with example usage
Offline storage capabilities
Sorting functionality
List summaries
Enhanced todo metadata
A more polished, aesthetically pleasing UI
However, Warp's ambitious approach came with critical flaws: the generated app threw a runtime exception when attempting to add todos, caused by a failure to properly initialize the storage provider. And despite the comprehensive feature set, Warp failed to produce any meaningful tests—a glaring omission for what otherwise appeared to be production-oriented code.
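I didn't trace Warp's exact bug, but a common way this crash appears in Riverpod apps is a storage provider that needs async initialization and is never overridden with a real instance. A minimal sketch of the correct wiring, assuming shared_preferences:

```dart
import 'package:flutter/material.dart';
import 'package:flutter_riverpod/flutter_riverpod.dart';
import 'package:shared_preferences/shared_preferences.dart';

// Throws unless overridden with a real, initialized instance in main().
final sharedPreferencesProvider = Provider<SharedPreferences>(
  (ref) => throw UnimplementedError('Override in main()'),
);

Future<void> main() async {
  WidgetsFlutterBinding.ensureInitialized();
  final prefs = await SharedPreferences.getInstance(); // async setup

  runApp(
    ProviderScope(
      overrides: [
        // Skipping this override is the classic mistake: every read of
        // the provider then throws at runtime, e.g. when adding a todo.
        sharedPreferencesProvider.overrideWithValue(prefs),
      ],
      child: const TodoApp(),
    ),
  );
}

class TodoApp extends StatelessWidget {
  const TodoApp({super.key});

  @override
  Widget build(BuildContext context) =>
      const MaterialApp(home: Scaffold(body: Center(child: Text('Todos'))));
}
```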
Common Limitations
Neither assistant produced responsive or adaptive layouts beyond Flutter's default capabilities—a missed opportunity for creating a truly production-ready application.
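Even a basic width-based switch would have gone a long way. Here's a sketch of the kind of adaptivity I was hoping for (the widget names are mine):

```dart
import 'package:flutter/material.dart';

// Switches between a stacked and a master/detail layout by width.
class AdaptiveTodoScaffold extends StatelessWidget {
  const AdaptiveTodoScaffold(
      {super.key, required this.list, required this.detail});

  final Widget list;
  final Widget detail;

  @override
  Widget build(BuildContext context) {
    return LayoutBuilder(
      builder: (context, constraints) {
        if (constraints.maxWidth >= 600) {
          // Tablet/desktop: list and detail side by side.
          return Row(
            children: [
              SizedBox(width: 320, child: list),
              const VerticalDivider(width: 1),
              Expanded(child: detail),
            ],
          );
        }
        // Phone: list only; the detail view is pushed as a route.
        return list;
      },
    );
  }
}
```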
More concerning was that both Claude and Warp chose to implement state management using the legacy StateNotifier pattern rather than the newer Notifier and AsyncNotifier providers. These modern providers have been the recommended approach in Riverpod for over two years, offering better type safety and a more intuitive API. This suggests both AI assistants may be working with outdated training data or examples, potentially leading developers toward deprecated patterns.
Round Two: Testing Adaptability
To further test each assistant's capabilities, I challenged them with modification requests. This would reveal how well they could understand and refactor existing code.
The Prompts
For Warp, I kept it simple:
"Rewrite this example so that it does not use code generation for the Riverpod providers, only for serialization"
For Claude, I added significant complexity:
"Rewrite this example so that it does not use code generation for the Riverpod providers, only for serialization. Also add filtering by created date, due date, priority, and title. Add any fields missing from the current implementation. Also add the ability to clear completed tasks. Finally, add offline persistence using shared_preferences."
The Results
Claude Code again demonstrated superior speed, completing the more complex task faster than Warp handled the simpler request. However, Claude failed to infer that it should update the tests it had generated earlier to match the refactored code.
Warp took a more holistic approach to the refactoring, updating documentation to reflect the changes. Meaningful tests, however, were still notably absent from its output.
Key Takeaways
Speed vs. Completeness
This comparison revealed a classic trade-off in AI assistants: speed versus completeness. Claude Code excels at rapid code generation and handles complex requirements efficiently, but it may need more explicit instructions to cover the full scope of a development task.
The Importance of Working Code
Warp's runtime exception highlights a critical issue: feature-rich code is worthless if it doesn't work. While its comprehensive approach is admirable, the failure to generate immediately functional code is a significant drawback.
Outdated Patterns Signal a Deeper Issue
Both assistants' use of legacy StateNotifier instead of modern Riverpod patterns reveals a concerning limitation: AI coding assistants may generate code based on outdated best practices. This emphasizes the importance of staying current with framework developments and not blindly accepting AI-generated patterns.
The Human Touch Still Matters
Both assistants required human oversight—whether to catch runtime errors, ensure responsive design, or handle peripheral tasks like test updates. Think of AI coding assistants like enthusiastic junior developers: they can produce a lot of code quickly and often come up with creative solutions, but they need careful code reviews, guidance on best practices, and someone to ensure they're considering the full context of the project. Just as you wouldn't deploy a junior developer's code without review, AI-generated code requires the same scrutiny and mentorship. This reinforces that AI coding assistants are best viewed as productivity enhancers rather than complete replacements for developer expertise.
Recommendations
Choose Claude Code if:
Speed is your priority
You're comfortable providing explicit, detailed instructions
You want basic test coverage (but are prepared to refactor for robustness)
You prefer to handle documentation separately
Choose Warp if:
You value comprehensive feature implementations
Documentation is important to your workflow
You're willing to debug generated code and write your own tests
Avoid Gemini CLI until:
Authentication issues are resolved
There's evidence of stability improvements
When to Re-evaluate Your Choice
The AI coding assistant landscape is evolving at a breakneck pace, reminiscent of the JavaScript ecosystem's notorious framework churn. Just as you wouldn't switch from React to the latest framework every week, resist the temptation to constantly chase the newest AI assistant.
Instead, adopt a pragmatic approach: find a tool that works for your workflow, commit to it for a few months, and genuinely learn its strengths and limitations. This sustained usage will give you insights that quick comparisons can't provide—you'll discover workarounds for its weaknesses and develop patterns that maximize its strengths.
Plan to re-evaluate your choice only when:
Major version updates are released
Your development needs have significantly changed
You hit consistent friction points that impact productivity
Industry consensus shifts dramatically (as it did when Claude and Gemini caught up to early leaders)
This measured approach ensures you're neither missing out on significant improvements nor wasting time on constant tool-switching that disrupts productivity.
Conclusion
The current generation of AI coding assistants shows tremendous promise but clear limitations. Claude Code's speed advantage makes it ideal for rapid prototyping and iterative development, while Warp's thoroughness suits projects where comprehensive documentation and feature coverage are priorities, provided you're prepared to debug the output.
What's particularly impressive is the rapid evolution we've witnessed in this space. Just a year ago, Claude and Gemini were trailing behind established players like GitHub Copilot and ChatGPT. Their ability to not only catch up but also take leadership positions in certain aspects is a testament to the engineering excellence at Anthropic and Google. This rapid improvement suggests we can expect even more dramatic enhancements in the coming months.
As these tools continue to evolve, we can expect improvements in both speed and accuracy. For now, the key to maximizing their value lies in understanding their strengths and weaknesses and choosing the right tool for your specific needs. Most importantly, maintaining realistic expectations and treating these assistants as powerful aids rather than autonomous developers will lead to the best outcomes.
The future of AI-assisted development is bright, but we're not quite at the point where we can hand over the keys entirely. Choose your assistant wisely, and always verify the code they generate.