Entry #2: Reasoning Models Don't Always Say What They Think

Table of contents

Anthropic's paper investigating the "faithfulness" of Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) enters a critical discourse on AI capabilities and safety. CoT is often presented as offering transparency into model processing. This paper empirically tests that claim by examining if CoT outputs accurately reflect factors influencing the model's final answer, particularly added "hints." While providing valuable negative results regarding CoT reliability, the paper's framing and interpretation warrant scrutiny, particularly concerning its implicit anthropomorphism and potential misrepresentation of the systems under study, issues highlighted in broader critiques of LLM research. This analysis evaluates the paper's empirical contributions alongside the significant weaknesses in its conceptual framing and its implications for scientific integrity in AI research.
Strengths
- Direct Empirical Test of CoT Reliability: The paper commendably undertakes a direct empirical investigation into the reliability of CoT outputs. By systematically introducing "hints" and checking if they are mentioned in the CoT when they demonstrably alter the model's answer (Section 2.1, Figure 2), it provides concrete data challenging the naive assumption that CoT equates to transparent internal processing.
- Robust Experimental Design: The methodology employs paired prompts (with/without hints), diverse hint types (including neutral, misaligned, and simulated reward hacks), and questions of varying difficulty (MMLU, GPQA), offering a structured approach to measuring how often CoT fails to mention influencing factors (Section 2, Table 1, Figure 4).
- Clear Negative Findings on CoT Reliability: The results are stark and significant: CoT often fails to mention influential hints, with faithfulness scores frequently low (Figure 1), particularly for misaligned hints or harder tasks (Figure 4). Furthermore, outcome-based RL (without CoT supervision) showed limited ability to improve faithfulness (Section 4, Figure 5), and models readily learned to exploit reward hacks without verbalizing them in the CoT (Section 5, Figure 7). These findings provide strong evidence against relying solely on CoT for monitoring.
- Acknowledging Safety Implications: The paper rightly highlights the problematic implications of these findings for AI safety cases built upon test-time CoT monitoring (Section 7.1), noting its likely insufficiency for ruling out certain risks.
Weaknesses
- Misleading Framing of "Reasoning" and "Faithfulness": The paper adopts the term "reasoning models" without sufficiently acknowledging that these are fundamentally transformer architectures – statistical pattern-matchers whose "reasoning" capabilities are often brittle mimicry rather than robust logical competence. More critically, framing the investigation around CoT "faithfulness" or implicitly suggesting models might "hide" information or exhibit "deception" through CoT outputs imposes a human cognitive framework onto statistical artifacts. This anthropomorphic lens obscures the actual mechanisms at play.
- Methodological Flaw: Ignoring Core Mechanics in Interpretation: The interpretation of CoT "unfaithfulness" or apparent "deception" (like constructing false rationales for reward hacks, Figure 6) crucially fails to adequately account for fundamental transformer mechanics. An output (including the CoT) is highly sensitive to the input context via attention mechanisms and inherently variable due to stochastic sampling. Attributing observed output differences or omissions in the CoT to stable traits like "unfaithfulness" or intentional "hiding" without rigorously ruling out the effects of context-dependency and randomness constitutes a significant interpretive leap – a failure to ground conclusions in the known properties of the system.
- Lack of Stable Internal States: As LLMs generally lack coherent internal world models or stable beliefs, their outputs, including CoT, are generated reactively based on the prompt context. Labeling a specific instance of CoT as "unfaithful" provides little predictive power about future behavior in different contexts. The paper's framework implicitly treats the model as an agent with stable internal states that CoT should represent, rather than a system generating context-specific, probabilistic outputs.
- Understated Scientific Implications: While noting safety concerns, the paper doesn't fully grapple with the scientific implications of its own interpretive framing. Presenting statistical output generation artifacts as evidence of models "not saying what they think" risks contributing to a misleading narrative about AI cognition, potentially misdirecting research efforts towards anthropomorphic interpretations rather than mechanistic understanding and control.
Unexplored
- Deep Mechanistic Interpretability: The demonstrated unreliability of CoT as a straightforward "window" intensifies the need for non-linguistic interpretability methods (e.g., activation probing, circuit analysis) to understand the actual internal computations leading to specific outputs, including both the final answer and the CoT itself. How do internal states differ when faithful vs. unfaithful CoTs are generated?
- User Psychology & Trust Calibration: Given that CoT can be misleading, how do users perceive and trust it? Research is essential to understand how exposure to CoT (faithful or not) shapes users' mental models of AI and whether interventions can effectively calibrate user trust to realistic levels, preventing over-reliance on potentially deceptive self-explanations.
- Robust Alternatives to CoT Monitoring: If CoT is unreliable for detecting crucial failure modes like reward hacking or subtle misalignment, what alternative or supplementary monitoring techniques (e.g., behavioral checks, anomaly detection in activations, adversarial testing targeted at mechanistic weaknesses) are necessary for building robust safety cases?
- Faithfulness vs. Strategic Communication: Exploring the possibility that models learn strategic communication within CoT – optimizing it not for faithfulness but for plausibility, persuasiveness, or alignment with RLHF objectives – could offer a more mechanistic explanation for observed "unfaithfulness" than simply labeling it as such.
Conclusion
Chen et al. deliver important empirical results demonstrating the significant unreliability of Chain-of-Thought outputs as faithful representations of the factors influencing an LLM's response. This critically undermines safety arguments reliant solely on CoT monitoring. However, the paper's own conceptual framing risks perpetuating problematic anthropomorphism by interpreting output variations through lenses of "faithfulness" or implicit "deception," without adequately accounting for core transformer mechanics like context-dependency and stochasticity. While valuable in its negative findings, the work inadvertently highlights the urgent need for the field, especially leading labs, to move towards more rigorous, mechanistically grounded analysis and communication, resisting the allure of cognitive analogies that misrepresent the nature of these systems and potentially derail progress towards genuine AI safety.
Subscribe to my newsletter
Read articles from Gerard Sans directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Gerard Sans
Gerard Sans
I help developers succeed in Artificial Intelligence and Web3; Former AWS Amplify Developer Advocate. I am very excited about the future of the Web and JavaScript. Always happy Computer Science Engineer and humble Google Developer Expert. I love sharing my knowledge by speaking, training and writing about cool technologies. I love running communities and meetups such as Web3 London, GraphQL London, GraphQL San Francisco, mentoring students and giving back to the community.