OverThink: Improved LLM Reasoning with Self-Para-Consistency


I discovered this idea on the Cognitive Revolution podcast. I created a web app based on this Association for Computational Linguistics research paper by Wenqing Chen et al. (2024). The core idea is to generate paraphrases of the original prompt, have the LLM answer each one, and then have the LLM vote on which response is best. This strategy improves the odds of capturing the ideal chain of thought. In the paper, the technique showed better results in roughly 80% of test cases.
I named the app OverThink. The code is on GitHub if you want to try it yourself. It was written in Python and JavaScript, with Bootstrap for some UI elements, and was created using the Google Gemini 2.5 Pro canvas tool. I prefer the Google version over the OpenAI canvas since it seemed easier to manage multiple files. The Anthropic version is supposed to be excellent, but I haven't tried it yet. A first for me was that Gemini coded the entire application. The only changes I made directly were tweaking a few default parameters and choosing which models to use. From my perspective, I wrote it in English. That felt transformative.

The app calls GPT-4o mini for the paraphrases and initial responses. Instead of having the same LLM vote on which response was best, the app sends all responses to Gemini 2.5 Pro for consolidation. Results in the paper showed improvement with up to 64 iterations and diminishing returns after that. OverThink currently runs 16 iterations.
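To make the flow concrete, here is a rough sketch in Python of how such a pipeline could be wired up. This is not the app's actual source; the helper names, prompt wording, and model strings are my own placeholders around the real OpenAI and Google Generative AI client libraries.

```python
# Rough sketch of the OverThink pipeline (not the app's actual code):
# GPT-4o mini paraphrases the prompt and answers each variant, then
# Gemini consolidates the candidates into one final answer.
import os
from openai import OpenAI
import google.generativeai as genai

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def paraphrase(prompt: str, n: int = 15) -> list[str]:
    """Ask GPT-4o mini for n separate paraphrases of the original prompt."""
    variants = []
    for _ in range(n):
        resp = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Rewrite the user's question in different words."},
                {"role": "user", "content": prompt},
            ],
        )
        variants.append(resp.choices[0].message.content)
    return variants

def answer(prompt: str) -> str:
    """Get one candidate answer from GPT-4o mini."""
    resp = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def consolidate(original: str, candidates: list[str]) -> str:
    """Send every candidate answer to Gemini 2.5 Pro for one final answer."""
    model = genai.GenerativeModel("gemini-2.5-pro-preview-05-06")
    joined = "\n\n---\n\n".join(candidates)
    return model.generate_content(
        f"Original question:\n{original}\n\nCandidate answers:\n{joined}\n\n"
        "Consolidate these into a single best answer."
    ).text

question = "Is it an advantage or disadvantage to use multiple LLMs ..."
prompts = [question] + [f"{question}\n\n{p}" for p in paraphrase(question)]
print(consolidate(question, [answer(p) for p in prompts]))
```

Keeping the paraphrase, answer, and consolidation steps as separate functions makes it easy to swap either model or change the iteration count.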
Another deviation from the original paper was changing the temperature on each call to try to inject more creativity into the responses. Temperature controls randomness in LLM outputs: at 0, the model is essentially deterministic; as it increases (e.g., 0.7 or 1.0), the output becomes more creative and varied. Six of the sixteen iterations use a specific system prompt with explicit instructions to be creative. The app shows real-time messages as the iterations run, returns the final consolidated answer from Gemini, saves the prompt and all the responses in a local database (SQLite), and shows how many tokens were used by each API.
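Here is a hypothetical sketch of that per-iteration variation. The temperature ranges and the wording of the creative system prompt are my assumptions, since the post doesn't specify them:

```python
# Hypothetical sketch: the last six of the sixteen iterations get a higher
# temperature plus an explicit "be creative" system prompt.
import random
from openai import OpenAI

client = OpenAI()
CREATIVE_SYSTEM = "Be creative and consider unconventional angles in your answer."

def run_iterations(prompts: list[str]) -> list[str]:
    """Answer each prompt, varying temperature and system prompt per iteration."""
    results = []
    for i, prompt in enumerate(prompts):
        creative = i >= 10  # iterations 11-16 are the "creative" ones
        temperature = random.uniform(0.9, 1.3) if creative else random.uniform(0.2, 0.8)
        messages = [{"role": "user", "content": prompt}]
        if creative:
            messages.insert(0, {"role": "system", "content": CREATIVE_SYSTEM})
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            temperature=temperature,
        )
        results.append(resp.choices[0].message.content)
    return results
```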
Saving the results in a database makes it easy to find previous answers without running the prompt again. Token counts are also saved to the database.
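A minimal sketch of that kind of SQLite persistence follows; the post doesn't show the actual schema, so the table layout below is an assumption:

```python
# Assumed schema: one row per run, with the candidate answers JSON-encoded.
import json
import sqlite3

conn = sqlite3.connect("overthink.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS runs (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,
        prompt        TEXT NOT NULL,
        final_answer  TEXT NOT NULL,
        responses     TEXT NOT NULL,   -- all candidate answers, JSON-encoded
        input_tokens  INTEGER,
        output_tokens INTEGER,
        created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

def save_run(prompt, final_answer, responses, input_tokens, output_tokens):
    """Store one complete run so the answer can be looked up later."""
    conn.execute(
        "INSERT INTO runs (prompt, final_answer, responses, input_tokens, output_tokens) "
        "VALUES (?, ?, ?, ?, ?)",
        (prompt, final_answer, json.dumps(responses), input_tokens, output_tokens),
    )
    conn.commit()

def find_previous(prompt):
    """Return the most recent saved answer for a prompt, if it has been run before."""
    row = conn.execute(
        "SELECT final_answer FROM runs WHERE prompt = ? ORDER BY id DESC", (prompt,)
    ).fetchone()
    return row[0] if row else None
```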
Running additional prompts obviously increases costs. Most models have different rates per million input tokens, cached tokens, and output tokens. Since we are sending paraphrases, cached tokens probably don't apply. The app was built in about four hours, with another four hours of debugging spread over a week.
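For a rough sense of cost, you can multiply the saved token counts by per-million-token rates. The rates below are placeholders rather than current pricing:

```python
# Back-of-the-envelope cost estimate from saved token counts.
RATES = {  # USD per 1M tokens (illustrative placeholders, not current pricing)
    "gpt-4o-mini":    {"input": 0.15, "output": 0.60},
    "gemini-2.5-pro": {"input": 1.25, "output": 10.00},
}

def run_cost(usage: dict) -> float:
    """usage maps a model name to {'input': tokens, 'output': tokens}."""
    total = 0.0
    for model, tokens in usage.items():
        rate = RATES[model]
        total += tokens["input"] / 1e6 * rate["input"]
        total += tokens["output"] / 1e6 * rate["output"]
    return total

# Token counts from the sample run shown at the end of this post:
print(round(run_cost({
    "gpt-4o-mini":    {"input": 3384, "output": 1241},
    "gemini-2.5-pro": {"input": 815,  "output": 787},
}), 4))
```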
In my limited testing, OverThink appears to provide better answers than one-shot prompts, but my tests were subjective and neither rigorous nor scientific. Apart from any usefulness of the app itself, a takeaway for me was building a framework for integrating multiple frontier models with Python glue.
Following is the complete output from one of my test prompts: “Is it an advantage or disadvantage to use multiple LLMs to handle a query, if the output from the first one is fed into the second one, specifically when the LLMs were created by different organizations?”
Processing started. Please wait...
Original Prompt: Is it an advantage or disadvantage to use multiple LLMs to handle a query, if the output from the first one is fed into the second one, specifically when the LLMs were created by different organizations?
--- Generating Paraphrased Prompts ---
Generating Paraphrase 1/15...
Paraphrase 1 generated.
Generating Paraphrase 2/15...
Paraphrase 2 generated.
Generating Paraphrase 3/15...
Paraphrase 3 generated.
Generating Paraphrase 4/15...
Paraphrase 4 generated.
Generating Paraphrase 5/15...
Paraphrase 5 generated.
Generating Paraphrase 6/15...
Paraphrase 6 generated.
Generating Paraphrase 7/15...
Paraphrase 7 generated.
Generating Paraphrase 8/15...
Paraphrase 8 generated.
Generating Paraphrase 9/15...
Paraphrase 9 generated.
Generating Paraphrase 10/15...
Paraphrase 10 generated.
Generating Paraphrase 11/15...
Paraphrase 11 generated.
Generating Paraphrase 12/15...
Paraphrase 12 generated.
Generating Paraphrase 13/15...
Paraphrase 13 generated.
Generating Paraphrase 14/15...
Paraphrase 14 generated.
Generating Paraphrase 15/15...
Paraphrase 15 generated.
--- Starting Standard OpenAI Iterations ---
Running Standard OpenAI iteration 1/10 (Original)...
Standard OpenAI Iteration 1 completed.
Running Standard OpenAI iteration 2/10 (Original + Paraphrase #1)...
Standard OpenAI Iteration 2 completed.
Running Standard OpenAI iteration 3/10 (Original + Paraphrase #2)...
Standard OpenAI Iteration 3 completed.
Running Standard OpenAI iteration 4/10 (Original + Paraphrase #3)...
Standard OpenAI Iteration 4 completed.
Running Standard OpenAI iteration 5/10 (Original + Paraphrase #4)...
Standard OpenAI Iteration 5 completed.
Running Standard OpenAI iteration 6/10 (Original + Paraphrase #5)...
Standard OpenAI Iteration 6 completed.
Running Standard OpenAI iteration 7/10 (Original + Paraphrase #6)...
Standard OpenAI Iteration 7 completed.
Running Standard OpenAI iteration 8/10 (Original + Paraphrase #7)...
Standard OpenAI Iteration 8 completed.
Running Standard OpenAI iteration 9/10 (Original + Paraphrase #8)...
Standard OpenAI Iteration 9 completed.
Running Standard OpenAI iteration 10/10 (Original + Paraphrase #9)...
Standard OpenAI Iteration 10 completed.
--- Starting Creative OpenAI Iterations ---
Running Creative OpenAI iteration 1/6 (Original + Paraphrase #10)...
Creative OpenAI Iteration 1 completed.
Running Creative OpenAI iteration 2/6 (Original + Paraphrase #11)...
Creative OpenAI Iteration 2 completed.
Running Creative OpenAI iteration 3/6 (Original + Paraphrase #12)...
Creative OpenAI Iteration 3 completed.
Running Creative OpenAI iteration 4/6 (Original + Paraphrase #13)...
Creative OpenAI Iteration 4 completed.
Running Creative OpenAI iteration 5/6 (Original + Paraphrase #14)...
Creative OpenAI Iteration 5 completed.
Running Creative OpenAI iteration 6/6 (Original + Paraphrase #15)...
Creative OpenAI Iteration 6 completed.
Consolidating all results with Gemini (streaming)...
--- Gemini Final Answer --- (gemini-2.5-pro-preview-05-06): Using multiple LLMs sequentially to handle a query, where the output from the first LLM (from one organization) is fed into a second LLM (from a different organization), presents both potential advantages and disadvantages. The effectiveness of this approach largely depends on the specific LLMs, the nature of the query, and how the process is managed.
Potential Advantages:
Diverse Perspectives and Enhanced Nuance: LLMs from different organizations are often trained on different datasets and with varying architectures or fine-tuning methodologies. This can lead to a broader range of perspectives, insights, and problem-solving approaches. The second LLM might refine, expand upon, or offer a different angle on the first LLM's output, leading to a richer, more nuanced, and potentially more comprehensive final response.
Leveraging Specialized Strengths: Different LLMs excel at different tasks (e.g., one might be better at creative generation, another at logical reasoning or data extraction). A sequential chain can leverage these specialized strengths; for example, the first LLM could generate a draft, and the second could critique or reformat it based on specific criteria.
Potential for Improved Accuracy and Quality: By combining the "knowledge" or processing capabilities of multiple models, it's possible to achieve a higher quality or more accurate output, especially if the second LLM can correct or improve upon the first's weaknesses.
Increased Robustness: If one LLM has certain biases or limitations, a second, different LLM might help mitigate these, though this is not guaranteed and could also introduce new biases.
Potential Disadvantages:
Inconsistencies and Mismatches: LLMs from different organizations can have varying styles, tones, terminologies, and underlying assumptions. Feeding the output of one into another can result in an inconsistent, disjointed, or stylistically jarring final response. The second LLM might also misinterpret the nuances or intent of the first LLM's output.
Compounding Errors (Error Propagation): If the first LLM produces an inaccurate, biased, or incomplete output, the second LLM might not only fail to correct it but could build upon the error, potentially amplifying it or leading the query further astray.
Increased Complexity and Integration Challenges: Managing a multi-LLM workflow is more complex. It requires careful prompt engineering for each stage and consideration of how the models will interact. Compatibility issues can arise if the output format of the first LLM isn't well-suited as input for the second.
Increased Latency and Cost: Each LLM call adds to the overall processing time and computational cost. A sequential chain will inherently be slower and likely more expensive than a single LLM call.
Difficulty in Debugging and Attribution: If the final output is unsatisfactory, it can be harder to pinpoint which LLM or which part of the interaction caused the issue.
Conclusion:
Chaining LLMs from different organizations can be advantageous if thoughtfully implemented for tasks that benefit from diverse processing or specialized capabilities. However, it introduces significant risks of inconsistencies, error propagation, and increased complexity. Success often hinges on:
The specific choice of LLMs: Their individual strengths, weaknesses, and how complementary they are.
The nature of the task: Complex tasks requiring multiple stages of refinement might benefit, while simpler tasks may not.
Careful prompt engineering and intermediary processing: Ensuring the output of one model is a suitable and effective input for the next.
In essence, while this approach offers the potential for superior results through diversity and specialization, it requires careful design and testing to mitigate the inherent challenges and ensure the benefits outweigh the drawbacks.
Token Usage Summary
OpenAI (gpt-4o-mini-2024-07-18 & gpt-4o-mini-2024-07-18): Input: 3384, Output: 1241
Gemini (gemini-2.5-pro-preview-05-06): Input: 815, Output: 787
Results saved to database.
Processing complete.