Four Models, One Prompt: Who Writes the Best Instructions for AI?

Robert Collins
13 min read

I Had Four LLMs Compete to Write the Perfect Prompt. The Results Were a Masterclass.

I hesitate to use wording like “masterclass”, especially when it is Gemini 2.5 Pro, rather than another person, telling me to. And also because I don’t usually use words like that. What I’m trying to convey is that this simple process became something of a case study, with an emergent rubric and surprising results.

It started with a simple, common problem: a piece of code with a lazy prompt. I had a system that was supposed to be smart about synthesizing information from technical documents, but the instruction at its core was just a bit too generic. (The code is part of my project vault-mcp, which sets up a specific embedding approach with RAG over a subsection of documents in an Obsidian vault.)

So, I had an idea. Instead of just fixing it myself, what if I asked a model? But I was unsure which one. What if I asked four of the most advanced, freely accessible large language models available today to tackle the exact same problem, and then had them critique each other's work? This was especially tempting because I’d seen some interesting things about Kimi K2, particularly around naturalness in creative writing. Not to mention that it came out of the blue recently: a non-reasoning model in what has so far been largely a clean sweep for reasoning models, not from a US unicorn, yet acting and testing like it could have been a release from a Magnificent Seven company.

So the immediate goal was to see which model could best generate a new, more powerful "synthesis prompt." I wasn't just looking for a rewrite; I wanted a prompt that would instruct an AI agent to act as an "expert synthesizer". That is, to actively find scattered details and weave them into a single, cohesive, and self-contained answer.

Here’s the exact task I gave to each model (which, I should note, I asked Gemini to write up before it wrote its own submission).

The One-Slide Briefing We Gave Every Model

Task: Align the Chunk Refinement Prompt with the Synthesis Goal

The Issue

The current chunk_refinement prompt is too generic. It instructs the agent to "rewrite a document chunk to be more relevant," which doesn't fully capture the sophisticated goal of synthesis. The goal is not just to rephrase but to actively find and weave scattered details into the chunk, making it more complete and self-contained.

The Goal

The prompt should instruct the agent to act as an expert synthesizer. Its mission is to enrich a single "seed" chunk by searching the documentation for related details and integrating them, creating an expanded chunk that makes sense on its own and fits cohesively with the other results.

Current Prompt Text

[chunk_refinement]
system_prompt = """
You are tasked with rewriting a document chunk to be more relevant to a user's query.
User Query: {query}
Original Chunk (from '{document_title}'):
{content}
Context from other relevant chunks:
{context_str}
Your task:
1. Analyze the original chunk in relation to the user's query
2. If needed, use the available document search tools to find more specific information
3. Rewrite the chunk to:
   - Directly address the user's query when possible
   - Include relevant context from the document
   - Maintain accuracy and cite sources
   - Be concise but comprehensive
If the chunk is not relevant to the query, indicate that clearly.
Provide your rewritten version:
"""

The contestants were Google's Gemini 2.5 Pro, Kimi K2, Claude 4 Sonnet, and the free version of ChatGPT.

After each model generated its prompt, I fed the completed prompts back to them for a peer review. What I didn't expect was that in critiquing each other, they would collectively build a rubric for what makes a prompt truly great. Because the process was sequential, I have no review data from Gemini. (I did go back and show Kimi all the responses, to ask for any final touches.) Yet after reading through the reviews from Kimi, Claude, and ChatGPT, I saw something already useful: the same themes were emerging over and over. Despite being very different models, and looking only at the submissions, they were all judging the prompts on a shared set of principles. I synthesized these recurring critiques into five core dimensions.
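Mechanically, the review round was nothing fancier than showing each reviewer all four submissions and asking for a critique. The sketch below captures the shape of it; ask() and the *_prompt variables are placeholders for whichever chat interface and prompt texts you have on hand, not a real API.

# Round-robin peer review, sketched. ask(model, text) is a hypothetical helper
# that sends a message to the named model and returns its reply; the *_prompt
# variables hold each model's submitted prompt text.
submissions = {
    "Gemini 2.5 Pro": gemini_prompt,
    "Kimi K2": kimi_prompt,
    "Claude 4 Sonnet": claude_prompt,
    "ChatGPT": chatgpt_prompt,
}

reviews = {}
for reviewer in ["Kimi K2", "Claude 4 Sonnet", "ChatGPT"]:
    bundle = "\n\n".join(f"--- {name} ---\n{text}" for name, text in submissions.items())
    request = ("Here are four candidate synthesis prompts for the same task. "
               "Critique each one: what works, what fails, and which you would ship.\n\n" + bundle)
    reviews[reviewer] = ask(reviewer, request)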

The Five Dimensions of a Perfect Prompt

1. Clarity & Structure

This is about scannability and logical flow. A great prompt is immediately understandable to both humans and machines. The models agreed that Kimi's structured approach, using clear headings (Inputs, Procedure), a numbered list for the process, and bullet points for sub-instructions, made it the easiest to parse and follow. It wasn't just a block of text; it was a well-organized document.

Claude on Kimi: "Wins here. The numbered procedure is crystal clear, and the 'Inputs/Procedure' structure is highly scannable."

ChatGPT on Kimi: "Uses bold headers and bullet lists; shorter sentences."

| Rater ↓ / Rated → | Gemini | Kimi | Claude | ChatGPT |
| --- | --- | --- | --- | --- |
| Claude | Fair | Best | Good | |
| ChatGPT | Fair | Best | Good | Good |
| Kimi | Fair | Best | Good | Good |

2. Precision of Instructions

This dimension measures how well a prompt handles ambiguity and edge cases. The best prompts leave no room for interpretation. Kimi’s prompt was consistently praised for its direct, unambiguous commands that prevent common failure modes, such as how to handle citations or what to do if no improvements can be made. Other prompts were noted as being weaker for using more philosophical or vague language.

Kimi on its own prompt: "Specific directives like ‘never discard the anchor,’ ‘cite nothing inline,’ and ‘return verbatim if no enrichment possible’ eliminate ambiguity."

Claude on its own prompt: "Least precise. More philosophical language ('natural connections,' 'logical flow') without concrete execution guidance."

| Rater ↓ / Rated → | Gemini | Kimi | Claude | ChatGPT |
| --- | --- | --- | --- | --- |
| Claude | Good | Best | Fair | |
| ChatGPT | Good | Best | Fair | Fair |
| Kimi | Fair | Best | Good | Good |

3. Synthesis Focus

Does the prompt truly capture the spirit of synthesis—enrichment and expansion—rather than just rewriting or summarizing? Here, the models gave Claude the edge for its strong conceptual framing. Its prompt did the best job of describing the goal of the output, encouraging the agent to think about making the information more actionable and complete for the end-user.

Claude on its own prompt: "Strongest conceptual framing with 'expert synthesizer' role and emphasis on 'active discovery' and 'intelligent integration.'"

ChatGPT on Claude: "Encourages transforming implicit into explicit (references → explanations) ... Encourages actionable output, not just prose polish."

| Rater ↓ / Rated → | Gemini | Kimi | Claude | ChatGPT |
| --- | --- | --- | --- | --- |
| Claude | Good | Good | Best | |
| ChatGPT | Fair | Good | Best | Good |
| Kimi | Fair | Good | Best | Good |

4. Actionability

An effective prompt is an executable script written in natural language. This dimension rates how well each instruction maps to a concrete, non-negotiable action. Kimi’s prompt was the clear winner because its procedure—"Gap Analysis → Targeted Search → Seamless Integration"—was seen as an operational workflow that an agent could follow step-by-step.

Claude on Kimi: "Most actionable. Every step has clear, executable instructions. The ‘Gap Analysis → Targeted Search → Seamless Integration’ flow is operationally sound."

ChatGPT on Kimi: "each step maps to a concrete tool call"

Something interesting to note here: neither ChatGPT nor Kimi K2 saw the repo. They essentially inferred what the code does, and they were exactly right; the repo exposes just the tools Kimi designed toward. I have a whole analysis document from Gemini showing which tool calls in the project map to which parts of the prompt. It is uncanny.
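To make that concrete, here is the rough shape of the loop the reviewers were describing, with the three procedure steps mapped onto calls. This is a hypothetical sketch: search_documents() and llm() are placeholder names, not the actual tools vault-mcp exposes.

# Hypothetical agent loop for "Gap Analysis -> Targeted Search -> Seamless Integration".
# search_documents() and llm() are placeholders, not vault-mcp's real tool names.
def enrich_seed_chunk(query: str, seed_chunk: str, context_str: str) -> str:
    # 1. Gap Analysis: identify what the seed chunk leaves unanswered.
    gaps = llm(f"List, one per line, what this chunk leaves unclear about '{query}':\n{seed_chunk}")
    # 2. Targeted Search: each gap becomes one concrete search-tool call.
    findings = [search_documents(gap) for gap in gaps.splitlines() if gap.strip()]
    # 3. Seamless Integration: weave the findings back into the seed chunk.
    material = "\n".join(findings)
    return llm(f"Rewrite this chunk so it fully answers '{query}', grafting in the material below, "
               f"and return only the rewritten chunk.\nChunk:\n{seed_chunk}\n\n"
               f"Material:\n{material}\n\nOther context:\n{context_str}")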

| Rater ↓ / Rated → | Gemini | Kimi | Claude | ChatGPT |
| --- | --- | --- | --- | --- |
| Claude | Fair | Best | Good | |
| ChatGPT | Fair | Best | Good | Good |
| Kimi | Fair | Best | Good | Good |

5. Error Prevention (Guardrails)

How well does the prompt prevent common AI pitfalls like hallucination, scope creep, or refusing to answer? Again, Kimi's direct and forceful instructions were seen as the best defense. Its use of negative constraints ("Do not summarize," "Cite nothing inline," "never a refusal or explanation") provided the clearest and most robust guardrails.

ChatGPT on Kimi: "Best enforcement of synthesis-as-expansion. Ideal for controlling overzealous rewrites."

Claude on Kimi: "Best safeguards. Clear fallback behavior ('return verbatim'), explicit prohibition against invention, specific output format requirements."

| Rater ↓ / Rated → | Gemini | Kimi | Claude | ChatGPT |
| --- | --- | --- | --- | --- |
| Claude | Good | Best | Fair | |
| ChatGPT | Good | Best | Fair | Fair |
| Kimi | Good | Best | Fair | Fair |

The Verdict: A Clear Winner

After all the analysis, and after I gave Gemini a chance to contest the consensus view (which it declined), Kimi K2's submission was crowned the winner by its peers. Claude praised its "operational clarity," and ChatGPT called it "the most reliable framework for consistent execution." The key insight was why. Kimi’s prompt wasn’t just a description of a task; it was a set of commands.

The take-away: It’s not just about who won, but what we learned about prompt engineering in the process.

  • Lesson 1: Forbid, Don't Just Guide. There is a huge difference between a positive suggestion ("Maintain accuracy") and a negative constraint ("Do not invent information"). The models consistently rated the prompts with clear, forceful prohibitions ("Cite nothing inline," "never discard," "never a refusal") as superior because they eliminate ambiguity. For reliable agentic behavior, it's often better to define what not to do.

  • Lesson 2: Metaphors Are Executable Code. The most surprising discovery was the power of a good metaphor. Kimi’s instruction to treat the "Seed Chunk as the anchor" and other text as "raw material to graft onto that anchor" wasn't just creative flair. It was a highly effective, non-negotiable mental model for the task, preventing the agent from discarding the original content.

  • Lesson 3: Structure Is an Instruction. The visual layout of a prompt matters. Kimi's use of bold headings, numbered lists, and bullet points wasn't just for human readability. It signals procedural flow and hierarchy to the LLM. The structure is part of the instruction, guiding the agent's focus and order of operations. If you want to read more on this (it's complex), see for example Does Prompt Formatting Have Any Impact on LLM Performance?

In the end, this experiment delivered more than just a better prompt. It also gave me a confirmation, surprising at least to me, of what a non-reasoning model can do in some contexts.


Appendix: The Raw Submissions

Here are the complete, unedited prompts generated by each of the four models.

Gemini 2.5 Pro

[chunk_refinement]
system_prompt = """
You are an expert synthesizer of technical documentation. Your mission is to enrich a single 'seed' chunk of text by finding and weaving in scattered but related details from the original documentation, making it a more complete and self-contained answer.

Your goal is to create an expanded chunk that makes sense individually and provides deeper context when seen side-by-side with other results.

**User Query:** {query}

**Seed Chunk (from '{document_title}'):**
{content}

**Context from other retrieved chunks:**
{context_str}

**Your Task:**
1.  **Analyze:** Understand the role of the 'Seed Chunk' in answering the 'User Query'.
2.  **Search:** Use your document search tools to find additional, directly related details, definitions, or examples that are missing from the seed chunk but are necessary for its full comprehension.
3.  **Synthesize:** Rewrite and expand the 'Seed Chunk' by seamlessly integrating the details you found.
    -   Use the precise language from the documentation whenever possible.
    -   Ensure the new, expanded chunk is coherent and directly addresses the query.
    -   Do not invent information. If no relevant details can be found to enrich the chunk, return the original chunk unmodified.

Provide only the single, rewritten chunk as your final answer.
"""

Kimi K2 (The Winner)

[chunk_refinement]
system_prompt = """
You are an expert synthesizer of technical documentation.  
Your **single mission** is to transform the provided Seed Chunk into a richer, self-contained micro-article that fully answers the User Query, by actively locating and weaving in any scattered details that exist elsewhere in the same document.

- Treat the Seed Chunk as the *anchor*—never discard it.  
- Treat every other retrieved chunk and any further search results as *raw material* to graft onto that anchor.  
- Do not summarize the whole document; focus only on filling the gaps needed to make the Seed Chunk complete.

**Inputs**  
User Query: {query}  
Seed Chunk (from '{document_title}'):  
{content}  
Context from other retrieved chunks:  
{context_str}

**Procedure**  
1. **Gap Analysis**  
   Identify what the Seed Chunk leaves unclear or unfinished with respect to the User Query.  
2. **Targeted Search**  
   Use the document-search tools to locate any definitions, examples, parameters, caveats, or cross-references that directly close those gaps.  
3. **Seamless Integration**  
   - Rewrite the Seed Chunk in place, embedding the newly discovered material using the document’s exact wording where appropriate.  
   - Preserve flow and coherence; the result must read as a single, unified passage.  
   - Cite nothing inline; simply incorporate the facts.  
4. **Output**  
   Return *only* the rewritten chunk. If no enrichment is possible, return the Seed Chunk verbatim—never a refusal or explanation.

Provide the rewritten chunk below:
"""

Claude 4 Sonnet

[chunk_refinement]
system_prompt = """
You are an expert document synthesizer. Your mission is to transform a "seed" chunk into a comprehensive, self-contained section by actively discovering and weaving in related information from the broader documentation.

User Query: {query}
Seed Chunk (from '{document_title}'):
{content}

Context from other relevant chunks:
{context_str}

Your synthesis process:
1. **Analyze the seed chunk**: Identify gaps, incomplete explanations, or references to concepts that need elaboration
2. **Active discovery**: Use document search tools to find scattered details, examples, prerequisites, related concepts, and supporting information that would enrich this chunk
3. **Intelligent integration**: Weave the discovered information into the seed chunk, creating natural connections and maintaining logical flow
4. **Self-containment**: Ensure the expanded chunk can stand alone while still fitting cohesively with other results

Your synthesized chunk should:
- Transform incomplete references into full explanations
- Add missing context that makes the information actionable
- Include relevant examples, prerequisites, or related concepts found elsewhere
- Maintain accuracy and cite all sources appropriately
- Be comprehensive yet focused on the user's query
- Feel like a natural, cohesive piece rather than a patchwork

If the seed chunk cannot be meaningfully enhanced for this query, explain why and provide the original content with minimal necessary context.

Provide your synthesized version:
"""

ChatGPT (Free Version)

[chunk_refinement]
system_prompt = """
You are an expert technical synthesizer. Your task is to enrich a document chunk so it becomes a complete, self-contained answer to a user’s query.

User Query: {query}
Seed Chunk (from '{document_title}'):
{content}

Supporting Context from Related Chunks:
{context_str}

Instructions:
1. Carefully analyze the seed chunk and identify any missing details that would help answer the user’s query.
2. Use the supporting context (and if needed, document search tools) to locate relevant information from other parts of the documentation.
3. Integrate those details directly into the seed chunk to improve its completeness and relevance.
4. Your rewritten chunk should:
   - Make sense independently (no missing references or dependencies)
   - Be tightly focused on the user’s query
   - Weave in related information clearly and cohesively
   - Cite original source chunks where appropriate
   - Remain faithful to the meaning and intent of the documentation

If the seed chunk is unrelated to the user’s query, say so clearly.

Provide the enriched, standalone version of the chunk:
"""