Entry #4: LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Gerard Sans

This paper by Orgad et al. delves into the internal workings of Large Language Models (LLMs) to understand how errors, often termed "hallucinations," are represented. Moving beyond purely external, behavioral analyses, the authors use probing techniques on internal activations to detect errors, classify error types, and investigate discrepancies between what models seem to "encode" internally and what they generate externally. The work aims to provide a deeper, model-centric understanding of LLM errors to guide mitigation strategies. This analysis evaluates the paper's empirical findings and methodological contributions, particularly questioning its interpretation of internal states as representing "truthfulness" versus reflecting training data statistics, and examining the influence of anthropomorphic assumptions.

Strengths

  • Focus on Internal Representations: The paper makes a valuable contribution by shifting the focus of error analysis from external observation to the LLM's internal activations, seeking intrinsic signals related to erroneous outputs.

  • Improved Error Detection via Probing: It demonstrates empirically that probing classifiers trained on internal activations, particularly those associated with the specific tokens representing the core answer ("exact answer tokens"), can outperform error detection methods that rely solely on output logits or less localized activations (Section 3, Table 1). The finding that relevant signals concentrate in these specific tokens is a notable practical insight; a minimal probing sketch follows this list.

  • Challenging Universal Truthfulness Encoding: The research provides strong evidence against a simple, universal truthfulness mechanism within LLMs. The finding that probing classifiers generalize poorly across diverse tasks (Section 4, Figure 3) suggests that any internal correlates of correctness are likely multifaceted and "skill-specific," not a single unified signal.

  • Predicting Error Types: The demonstration that internal states can be used to predict types of errors (e.g., consistent vs. occasional errors, Section 5, Table 2) offers a potentially valuable tool for developing more targeted error mitigation strategies (a resampling sketch of this taxonomy also follows this list).

  • Highlighting Internal/External Discrepancy: The paper identifies cases where models seem to internally encode information favoring the correct answer, yet still generate incorrect outputs (Section 6, Figure 5). This points to complex dynamics within the generation process worthy of further study.
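
To make the probing setup concrete, here is a minimal sketch of the kind of "exact answer token" probe the paper describes, written against a generic Hugging Face causal LM. The model name, the layer index, the last-token stand-in for locating the answer, and the toy examples are all assumptions for illustration, not the authors' exact pipeline; training on one task and scoring on another mirrors the cross-task generalization check discussed above.

```python
# Minimal sketch of an "exact answer token" probe, assuming a generic Hugging Face
# causal LM. Model name, layer index, and the toy data below are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # assumption: any causal LM with hidden states
LAYER = 16                                    # assumption: a single middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto", output_hidden_states=True
)
model.eval()

@torch.no_grad()
def answer_token_activation(prompt: str, answer: str) -> torch.Tensor:
    """Hidden state at the final answer token (a crude stand-in for the paper's
    exact-answer-token localization)."""
    ids = tok(prompt + " " + answer, return_tensors="pt").to(model.device)
    hidden = model(**ids).hidden_states[LAYER][0]   # (seq_len, d_model)
    return hidden[-1].float().cpu()

def build_dataset(examples):
    """examples: (prompt, generated_answer, is_correct) triples from prior generations."""
    X = torch.stack([answer_token_activation(p, a) for p, a, _ in examples]).numpy()
    y = [int(c) for _, _, c in examples]
    return X, y

# Hypothetical toy data; in practice these triples come from labeling the model's own outputs.
trivia = [("Q: What is the capital of France? A:", "Paris", True),
          ("Q: Who wrote Hamlet? A:", "Charles Dickens", False)]
math = [("Q: What is 12 * 9? A:", "108", True),
        ("Q: What is 7 + 15? A:", "23", False)]

X_tr, y_tr = build_dataset(trivia)
X_te, y_te = build_dataset(math)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("in-task accuracy:", probe.score(X_tr, y_tr))
print("cross-task accuracy:", probe.score(X_te, y_te))  # the paper finds this drops sharply
```

The design choice the paper argues for is where to read the activation: at the answer tokens themselves, rather than relying on output logits or less localized positions.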

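The error-type analysis in Section 5 similarly lends itself to a small sketch. The bucketing below approximates the consistency idea by resampling answers at non-zero temperature and scoring them with naive exact match; the bucket names and thresholds are illustrative assumptions, not the paper's taxonomy, and the paper goes further by predicting these types from internal activations.

```python
# Rough sketch of a resampling-based error taxonomy, in the spirit of Section 5.1.
# `generate` is any sampling-based answer function; matching is naive exact match.
# Bucket names and the 0.8 threshold are illustrative assumptions.
from collections import Counter
from typing import Callable

def classify_error_type(question: str, gold: str,
                        generate: Callable[[str], str], k: int = 30) -> str:
    samples = [generate(question).strip().lower() for _ in range(k)]
    gold_norm = gold.strip().lower()
    correct_frac = sum(s == gold_norm for s in samples) / k
    top_answer, _ = Counter(samples).most_common(1)[0]

    if correct_frac == 0.0:
        return "consistently incorrect"            # the correct answer is never sampled
    if correct_frac >= 0.8:
        return "mostly correct, occasional error"
    if top_answer == gold_norm:
        return "correct answer dominant"
    return "wrong answer dominant, correct answer occasionally sampled"
```
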
Weaknesses

  • Fundamental Misinterpretation of "Truthfulness" Encoding: The paper's central interpretive framework – that internal activations "encode truthfulness" or that models "know" when they are wrong – is highly contestable and likely flawed. The probing classifiers are more plausibly detecting statistical correlations reflecting training data distribution and coverage. An activation pattern strongly associated with "correct" outputs likely signifies that the generated answer closely matches patterns dominant in the training data for similar contexts, not that the model possesses an intrinsic representation of objective truth. This framing conflates statistical likelihood with epistemic certainty.

  • Pervasive Anthropomorphism: The paper suffers significantly from anthropomorphic framing, both in language and methodology:

    • Terminology: Phrases such as "LLMs know more than they show," claims that internal states "encode the correct answer," and the implication that models choose to generate incorrect answers despite internal signals all attribute cognitive states (knowledge, intention, contradiction) to the model.

    • Methodology: Using probes to access "truthfulness features" treats activations as direct readouts of internal beliefs, akin to mind-reading, rather than complex features arising from statistical processing. Taxonomizing errors based on generation consistency (Section 5.1) can subtly imply agent-like stability or confidence levels.

    • Assumptions: The very notion of a unified internal "truthfulness" signal, even a multifaceted one, risks imposing a human cognitive structure onto a non-human system. The investigation of internal/external "discrepancy" presupposes the internal state represents a "belief" that the external action contradicts.

  • Insufficient Accounting for Autoregressive Generation: The analysis sometimes underemphasizes that LLMs generate text token by token, with each step conditioned on the preceding ones. An "internal encoding" at one point might be overridden or ignored later in the sequence due to sampling probabilities or attention shifts, making the notion of a stable internal state "contradicting" the final output overly simplistic.

  • Conflating Fine-tuned Behavior with Pre-trained Representation: Much of the analysis uses instruction-tuned models (Mistral-Instruct, Llama-Instruct). Fine-tuning significantly alters model behavior and likely internal representations. Attributing findings solely to intrinsic LLM properties without carefully disentangling pre-training vs. fine-tuning effects can be misleading.

Unexplored

  • Connecting Internal States to Training Data: Rigorous analysis is needed to directly link the activation patterns identified by probes to statistical properties of the training data (e.g., frequency, context). This would directly test the "data coverage vs. truth" hypothesis.

  • Mechanistic Explanation of Activation Patterns: Probing identifies correlations but doesn't explain how these specific activation patterns arise from the model's computations or what specific circuit-level features they represent. Deeper mechanistic interpretability work is required.

  • Causal Intervention vs. Correlation: While probes can predict errors, can manipulating these internal activations reliably correct them? Establishing causality, rather than mere correlation, is crucial for leveraging these findings for mitigation; a minimal steering sketch follows this list.

  • Understanding the Generation Process Dynamics: More research is needed on why the generation process sometimes diverges from internal activation patterns that correlate with correctness. What role do attention mechanisms, sampling strategies, and later-layer computations play in this discrepancy?
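
As a concrete example of what such a causal test could look like, the sketch below adds a probe's weight direction to a middle layer's residual stream during generation (activation steering). The layer index, the scale, the hook placement, and the assumption of a Llama/Mistral-style module layout (model.model.layers) are illustrative choices, not the paper's method; the open question is whether steering along a probed "correctness" direction actually changes error rates.

```python
# Minimal activation-steering sketch for a causal test, assuming a Llama/Mistral-style
# Hugging Face model (decoder layers at model.model.layers) and a probe weight vector
# such as torch.tensor(probe.coef_[0]). Layer index and scale are illustrative.
import torch

def steer_with_direction(model, tok, prompt: str, w: torch.Tensor,
                         layer: int = 16, alpha: float = 8.0) -> str:
    direction = (w / w.norm()).to(model.device, dtype=model.dtype)

    def add_direction(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction        # nudge activations along the probed direction
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = model.model.layers[layer].register_forward_hook(add_direction)
    try:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=32, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
```

If error rates move when the direction is added or ablated, the signal is at least partly causal; if they do not, the probe is more plausibly reading a correlate of training-data statistics that the generation process does not use.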

Conclusion

Orgad et al. provide valuable empirical results, particularly the finding that error-related signals in LLM activations are localized and that probing "exact answer tokens" improves detection. Their demonstration that these signals lack universal cross-task generalization is also a significant contribution, challenging simplistic views of internal truth representation. However, the paper's central interpretation – framing these signals as intrinsic "truthfulness" encoding – appears fundamentally flawed, likely mistaking statistical echoes of training data for epistemic states. This interpretive weakness stems from pervasive anthropomorphic assumptions about LLM cognition ("knowing," "encoding," "contradicting"). While the probing techniques may offer practical utility for error correlation, the paper's narrative about why these techniques work risks reinforcing misleading, human-like conceptions of LLMs. Future work should focus on mechanistically grounding these findings in the statistics of the training data and the dynamics of the generation process, adopting a framework that resists attributing cognitive properties to these complex pattern-matching systems.
