Trust, but verify - how testing and reviewing code keeps the AI honest

Alexandru Ion

Just because the AI says it works and the code looks clean doesn’t mean you can skip double-checking. You’re still the conductor of a software orchestra.

You’ve embraced structured Prompt-Driven Development, and you’re feeding your AI well-defined documentation, one layer at a time. The code it’s generating is cleaner, more scoped, and actually resembles what you asked for.

Clean code doesn’t automatically mean correct code. The AI can still miss edge cases your docs didn't explicitly cover, and it might skip over critical error handling because you weren't specific enough. Or it might subtly misinterpret the nuanced logic you thought was crystal clear in feature-logic.md.

And here’s what "vibe coders" seem to forget: you are still the engineer. The AI doesn’t own the results; they are still yours. AI makes your job easier; it doesn’t replace you as an engineer.

When I was building that Jira sync system, before I figured out PDD, everything looked right on the surface after a few "successful" prompts. But there were bugs hidden everywhere that almost went completely unnoticed. The unit tests I had asked the AI to generate all passed, and the UI seemed to work great.

But the deeper I dug into why the tests passed while the app kept failing, the more I realised that I should trust the AI's output only after I’ve put it through rigorous tests.

Human review is still your best tool

Even if you didn't type a single line of the generated code, you have to review it. Put your engineering hat on and nit-pick the code as if it was written by a junior dev who’s brilliant but still learning the basics. Here’s your checklist:

  1. Did it actually follow the documentation? Pull up the relevant @feature-backend.md or @overview.md and compare. Line by line if necessary. Did it use the naming conventions? Did it implement the logic flows as specified?

  2. Does it handle errors like a grown-up? What happens if the input data is incorrect? What if the database connection flakes out mid-operation? If your docs specified error handling, is it there? If they didn't, why not, and what should it be? Did it cover all the cases you described, or did it miss any? (See the sketch after this checklist for what I mean by "grown-up".)

  3. Are names, structures, and patterns consistent? Does this new code look like it belongs with the rest of your PDD-generated codebase, following the patterns in your overview.md, or does it break the pattern?

  4. Is it readable? Seriously, even as a non-developer, if you have to read a function 6 times to understand its logic, chances are the AI overcomplicated it or missed a simpler path. Remember that it all started from an idea that was turned into documentation you did understand, so the code should be just as easy to follow based on your existing knowledge. Complexity is a bug waiting to happen.
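
To make point 2 concrete, here’s a rough sketch of what "handling errors like a grown-up" tends to look like. The `createCard` service, `db` module, and custom error classes below are hypothetical placeholders for illustration, not code from the Jira sync project:

```typescript
// Hypothetical createCard service used only to illustrate the review checklist.
// The db module, ValidationError, and DatabaseError are illustrative placeholders.
import { db } from "./db";

export class ValidationError extends Error {}
export class DatabaseError extends Error {}

export interface CardInput {
  title: string;
  boardId: string;
}

export async function createCard(input: CardInput) {
  // Validate up front instead of assuming the caller got it right.
  if (!input.title?.trim()) {
    throw new ValidationError("Card title is required");
  }
  if (!input.boardId) {
    throw new ValidationError("Card must belong to a board");
  }

  try {
    return await db.insert("cards", input);
  } catch (err) {
    // Wrap low-level failures so callers can tell "bad input" apart from
    // "the database flaked out mid-operation".
    throw new DatabaseError(`Failed to persist card: ${(err as Error).message}`);
  }
}
```

If the AI’s output silently swallows the database error or trusts the input blindly, that’s exactly the kind of gap this checklist should surface.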

If you’re even remotely unsure about a piece of generated code, make the AI explain it to you:

Explain this generated function `featureFunction` from `@feature-service.ts` line by line. What assumptions is it making about the input `featureData` and the state of the `featureCache`? What are the potential failure points?

If the explanation doesn’t perfectly match your idea or the specifications in your documentation, you’ve found a problem, and you should address it, either by refining the code yourself or by re-prompting with even more specific instructions.

Automated testing can be a double-edged sword

PDD isn’t just about generating code fast, but about building complex and reliable systems. This sometimes means baking testing into every single feature layer. Don't postpone tests or treat them as a nice-to-have. Build them with the feature, often prompting for them right after the feature logic is generated.

Here’s how I handle it, making the AI do the heavy lifting for test creation as well:

  1. Unit tests: Example prompt (after generating cardService.createCard):
Now, generate comprehensive Jest unit tests for the `cardService.createCard()` function as defined in `@cards-backend.md` and implemented in `services/cardService.ts`.
Cover these scenarios:
1. Successful card creation with valid data.
2. Attempted creation with missing required fields (e.g., no title).
3. Attempted creation where a dependency (e.g., database write) fails.
Mock any external dependencies.

Unit tests are the quickest way to confirm functionality after a feature is developed. They can, however, be misleading, as the AI may resort to mocked data from start to end. Pay close attention to them throughout the development process, or you can end up with tests against mocks that will pass forever while the code repository drifts away from the original logic.
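
For comparison, the output of that prompt should look roughly like this minimal sketch, written against the hypothetical `createCard` service from the checklist section, with the `db` dependency mocked (the paths and shapes are assumptions, not the article's actual project layout):

```typescript
// Sketch of AI-generated unit tests for the hypothetical createCard service.
import { createCard, ValidationError } from "./cardService";
import { db } from "./db";

// Auto-mock the external dependency so no real database is touched.
jest.mock("./db");

describe("cardService.createCard", () => {
  const validInput = { title: "Sync Jira ticket", boardId: "board-1" };

  it("creates a card with valid data", async () => {
    (db.insert as jest.Mock).mockResolvedValue({ id: "card-1", ...validInput });

    const card = await createCard(validInput);

    expect(card.id).toBe("card-1");
    expect(db.insert).toHaveBeenCalledWith("cards", validInput);
  });

  it("rejects input with missing required fields", async () => {
    await expect(createCard({ boardId: "board-1" } as any)).rejects.toThrow(ValidationError);
  });

  it("surfaces database failures instead of swallowing them", async () => {
    (db.insert as jest.Mock).mockRejectedValue(new Error("connection lost"));

    await expect(createCard(validInput)).rejects.toThrow("connection lost");
  });
});
```

Notice how every external dependency is mocked: that’s exactly why these tests can keep passing while the real integration quietly breaks.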

  2. Integration tests: Example prompt (after backend endpoints are done):
Write integration tests for the `POST /api/cards` endpoint using `supertest`.
Use mock data based on the schema in `@cards-model.md`.
Test for:
1. Successful creation (201 Created response).
2. Invalid input (400 Bad Request response with error messages).
3. Authorization failures (401/403 Unauthorized/Forbidden if applicable, based on @cards-permissions.md).

Although the integration tests run against mocks as well, they are far less likely to become unreliable. If your endpoints have been well defined since the planning and documentation phase, you set these tests up once and then largely forget about them.
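
Here’s roughly the shape of integration tests that prompt should produce. The Express `app` import and the test auth header are assumptions for illustration; your project's setup will differ:

```typescript
// Sketch of supertest integration tests for POST /api/cards.
// The app import and auth setup are hypothetical placeholders.
import request from "supertest";
import { app } from "../app";

describe("POST /api/cards", () => {
  const validCard = { title: "Sync Jira ticket", boardId: "board-1" };

  it("returns 201 Created with the new card for valid input", async () => {
    const res = await request(app)
      .post("/api/cards")
      .set("Authorization", "Bearer test-token") // assumed test credentials
      .send(validCard);

    expect(res.status).toBe(201);
    expect(res.body.title).toBe(validCard.title);
  });

  it("returns 400 Bad Request with error messages for invalid input", async () => {
    const res = await request(app)
      .post("/api/cards")
      .set("Authorization", "Bearer test-token")
      .send({ boardId: "board-1" }); // missing required title

    expect(res.status).toBe(400);
    expect(res.body.errors).toBeDefined();
  });

  it("returns 401 Unauthorized when no token is provided", async () => {
    const res = await request(app).post("/api/cards").send(validCard);

    expect(res.status).toBe(401);
  });
});
```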

  3. End-to-end tests: Example prompt (after a full user flow is implemented):
Generate a Playwright test script for the complete "Create New Card" user flow.
This flow is described in `@cards.md#user-flow-create-card`.
The test should:
1. Log in as a test user (credentials from @test-users.md).
2. Navigate to the card creation page.
3. Fill out the form with valid data.
4. Submit the form.
5. Verify the new card appears in the card list and displays the correct information.
6. Clean up by deleting the created card.

In my experience so far, the most reliable way to test an AI-assisted application is through Playwright tests. These end-to-end tests run against the application's UI, which is exactly what you and your end users will be using.

If your reviews miss something the AI modified without your permission, these tests will catch it. If the AI changed the logic of your functionality, your UI, or even the tests themselves, the test results will reflect those changes immediately.
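
For completeness, here’s a hedged sketch of the Playwright script that flow prompt should produce. The routes, selectors, and test user below are placeholders, not values from the real project's docs:

```typescript
// Sketch of a Playwright end-to-end test for the "Create New Card" flow.
// URLs, selectors, and credentials are illustrative placeholders;
// relative paths assume baseURL is set in playwright.config.
import { test, expect } from "@playwright/test";

const TEST_USER = { email: "test@example.com", password: "test-password" };

test("create a new card end to end", async ({ page }) => {
  // 1. Log in as the test user.
  await page.goto("/login");
  await page.fill('input[name="email"]', TEST_USER.email);
  await page.fill('input[name="password"]', TEST_USER.password);
  await page.click('button[type="submit"]');

  // 2. Navigate to the card creation page.
  await page.goto("/cards/new");

  // 3 & 4. Fill out the form with valid data and submit it.
  await page.fill('input[name="title"]', "Playwright smoke card");
  await page.click('button:has-text("Create")');

  // 5. Verify the new card appears in the card list.
  await page.goto("/cards");
  const card = page.locator("li", { hasText: "Playwright smoke card" });
  await expect(card).toBeVisible();

  // 6. Clean up by deleting the created card.
  await card.locator('button[aria-label="Delete"]').click();
  await expect(card).not.toBeVisible();
});
```

Because this script drives the real UI, it fails the moment the flow stops matching what your documentation describes, regardless of what the unit tests claim.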

These tests aren’t bonus points; they are your safety net. They ensure that every scoped piece you build with AI not only works today but keeps working as your system grows and changes.

Testing and reviewing shouldn't be seen as bottlenecks

This review and testing phase isn't a formality you rush through, or postpone as a last step after everything has been developed. It’s where bugs are caught before they see the light of day, where subtle logical flaws are exposed, and where crucial assumptions are clarified and documented.

AI can give you incredible speed, but that speed is useless if you’re just racing towards an unreliable "complete" product. Human oversight, rigorous review, and comprehensive automated testing should not be skipped or postponed. Testing at the right time will save you from a lot of complete development restarts.
