The Mirage of AI Interpretability: Why We're Asking the Wrong Questions About AI Understanding

Gerard Sans

In the rapidly evolving landscape of artificial intelligence, interpretability has become something of a holy grail. Researchers at leading AI labs such as OpenAI (Bills et al., 2023; Wu et al., 2023) and Anthropic (Bricken et al., 2023; Conmy et al., 2024) are actively pursuing methods like sparse autoencoders (Bricken et al., 2023) and neuron-level analysis (Bills et al., 2023; exemplified by the CRATE paper, Bai et al., 2024) in an attempt to peek inside the "black box" of large language models. The underlying belief is that if we could just understand how these systems work, we would unlock profound insight and gain greater control. But what if we are fundamentally misguided, asking the wrong questions and pursuing a mirage of understanding built on flawed, anthropocentric assumptions? This critique, informed by recent research (including Vafa et al., 2024, on the lack of a unified worldview, and the autoregressive, emergent-meaning perspective articulated by Sans, 2024, 2025), argues that we are.

The Anthropomorphic Trap

Our first critical error is the relentless anthropomorphization of AI systems. We persistently try to map machine learning models onto human cognitive processes, as if neural networks were just complicated versions of our own brains. This is a seductive but dangerous fallacy.

Take, for instance, the common assumption, explicit in the CRATE paper (Bai et al., 2024) and implicit in much of OpenAI's and Anthropic's interpretability work (Bills et al., 2023; Bricken et al., 2023), that a single neuron in a large language model corresponds to a human-understandable "concept." This is pure projection. These models don't learn concepts the way humans do. Their representations are high-dimensional, distributed statistical patterns shaped by training data—not neat, discrete semantic units.

When researchers claim they've found a neuron that "represents" something like "happiness" or "political ideology," they're engaging in a form of oversimplification, bordering on scientific magical thinking. It's equivalent to trying to understand an ocean by examining a single water molecule.
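The "single water molecule" point can be made concrete with a toy numpy sketch. Everything here is illustrative, not taken from any real model: the random orthogonal rotation stands in for whatever mixed basis a trained network actually settles into. The point it demonstrates is that a "concept" can be a direction spread across hundreds of coordinates, with no single neuron carrying more than a sliver of it, even though the full direction reads it out cleanly:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 512                      # width of a hypothetical hidden layer
# A "concept" starts as a clean, axis-aligned feature...
concept = np.zeros(d)
concept[0] = 1.0

# ...but the network is free to represent it in any rotated basis.
# A random orthogonal matrix stands in for the learned mixing.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
direction = Q @ concept      # the concept as the model might store it

# No single neuron carries the concept: the largest single
# coordinate holds only a sliver of the direction's total weight.
max_share = np.max(direction**2) / np.sum(direction**2)
print(f"largest single-neuron share of the concept: {max_share:.3f}")

# Yet a probe along the full distributed direction reads it out fine.
activations = rng.normal(size=(1000, d)) + rng.normal(size=(1000, 1)) * direction
readout = activations @ direction
print(f"probe variance along the direction: {readout.var():.2f}")
```

Inspecting any one coordinate of `direction` tells you almost nothing; only the whole vector does, which is the sense in which neuron-level readings of "happiness" are examining the molecule instead of the ocean.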

The Illusion of Human-Understandability

The field of AI interpretability is largely predicated on a fundamentally flawed premise: that a model becomes more valuable or trustworthy if humans can "understand" its inner workings. But what does understanding really mean when we're dealing with computational systems that operate on levels of abstraction far beyond human cognition?

Large language models generate meaning through complex, cumulative computational processes. The significance of any particular token or activation emerges from the entire network's dynamic transformations—not from any single, localizable point of interpretation. This emergent property, as highlighted by Sans (2024), directly contradicts the reductionist approach of seeking meaning in isolated components.

This isn't to say that all forms of understanding are unattainable or irrelevant. We can strive to understand the behavioral characteristics of these models—how they respond to different inputs, how they generalize, where they fail. We can also pursue mechanistic understanding, aiming to trace the flow of information and identify the computational roles of different components, as attempted by techniques like those in the CRATE paper (Bai et al., 2024). But this kind of understanding is fundamentally different from the anthropomorphic notion of "discovering the model's thoughts" or finding a one-to-one correspondence between neurons and human concepts.

Steering: The Myth of Intentional Control

Another critical misconception is the idea of "steering" AI models. Researchers, including those exploring techniques like those presented in the CRATE paper (Bai et al., 2024) and work on sparse autoencoders (Bricken et al., 2023), often talk about manipulating neuron activations as if they were conducting a neural orchestra, carefully directing the model's "thoughts." This is a profound misunderstanding.

What we're actually doing is nudging statistical probability landscapes. As Sans (2025) illustrates with the example of "context contamination" and "jailbreaks," small changes in the input can drastically alter the model's trajectory through its latent space. It's less like controlling a conscious agent and more like subtly tilting a complex marble run. The marble might take a slightly different path, but its trajectory is determined by the underlying structure (the learned statistical patterns), not by some imagined intentionality or a precise, controllable mapping between individual neurons and model behavior.
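A tiny numpy sketch makes the "tilting" picture concrete. Everything in it is a stand-in (the hidden state, the unembedding matrix, and the steering vector are random, not extracted from any real model): adding a steering direction to a hidden state shifts the entire next-token distribution by degrees, rather than switching a "thought" on or off:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)

d, vocab = 64, 8
hidden = rng.normal(size=d)           # stand-in hidden state
W_out = rng.normal(size=(vocab, d))   # stand-in unembedding matrix

# A hypothetical "steering vector", e.g. a direction an SAE surfaced.
steer = rng.normal(size=d)
steer /= np.linalg.norm(steer)

base = softmax(W_out @ hidden)        # unsteered next-token distribution
tvs = []
for alpha in (0.0, 0.5, 2.0, 8.0):
    p = softmax(W_out @ (hidden + alpha * steer))
    # Total variation distance from the unsteered distribution:
    tv = 0.5 * np.abs(p - base).sum()
    tvs.append(tv)
    print(f"alpha={alpha:4.1f}  top token={p.argmax()}  TV shift={tv:.3f}")
```

The knob (`alpha`) tilts the probability landscape continuously; nothing in the intervention guarantees which token "wins," only that the marble's path has been nudged.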

The Reductionist Fallacy

Our current interpretability research suffers from a severe reductionist bias. We're obsessed with analyzing individual neurons or layer-by-layer transformations, as exemplified by the focus on neuron-level interpretability in the CRATE paper (Bai et al., 2024) and much of OpenAI and Anthropic's work (Bills et al., 2023; Bricken et al., 2023), completely ignoring the distributed and dynamically emergent nature of these systems.

Imagine trying to understand a symphony by examining individual instrument strings or a single musician's breath. You'd miss the entire point. Similarly, fixating on isolated neurons misses the holistic computational poetry of large language models. As the "droplet problem" analogy in Sans (2024) powerfully illustrates, meaning emerges from the interaction of many components, not from any single component in isolation.

Beyond Metrics: The Limits of Quantification

The field is drowning in automated metrics that promise to measure interpretability, such as the simulation-based scoring methods used by Bills et al. (2023) and adopted in the CRATE paper (Bai et al., 2024). These metrics provide a quantitative measure, but they typically rely on correlations between neuron activations and human-written explanations, implicitly assuming a direct correspondence that likely doesn't exist. They are statistical mirages—correlations and patterns that give a false sense of understanding. Just because we can quantify something doesn't mean we truly comprehend it.
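The core of simulation-based scoring can be sketched in a few lines. This is a hedged toy, not the actual Bills et al. pipeline: an explanation is used to predict a neuron's activations token by token, and the correlation with the true activations becomes the "interpretability score." The sketch also shows how a merely co-occurring surface cue can earn a respectable score without capturing any mechanism:

```python
import numpy as np

rng = np.random.default_rng(2)

# True activations of one neuron over a stretch of tokens.
true_act = rng.normal(size=200)

# An explanation-driven "simulator" predicts per-token activations.
# Here it merely tracks a correlated surface cue, not the mechanism.
simulated = 0.6 * true_act + 0.8 * rng.normal(size=200)

def explanation_score(sim, actual):
    """Correlation-based score in the spirit of simulation scoring:
    1.0 would mean the explanation perfectly predicts the neuron."""
    return np.corrcoef(sim, actual)[0, 1]

score = explanation_score(simulated, true_act)
print(f"simulation score: {score:.2f}")
# A score like this can look respectable while the "explanation"
# captures only a co-occurring cue, not what the neuron computes.
```

Nothing in the score distinguishes a genuine causal account from a lucky correlate, which is exactly the mirage the metric invites.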

Higher "interpretability scores" on these metrics don't necessarily translate to genuine insight. They might just mean we've gotten better at creating elaborate illusions of understanding, or that the metrics are capturing superficial correlations rather than fundamental mechanisms. Instead, we should focus on metrics that are grounded in the model's own training data and internal representations, assessing internal consistency and behavioral predictability rather than adherence to human-defined concepts.

The Training Data Deception

Furthermore, we must recognize the inherent limitations of even evaluating against the training data. The training data itself is a product of human curation, reflecting our biases, inconsistencies, and limited understanding of the world. When an LLM aligns with its training data, it's not necessarily reflecting "truth" or "reality"; it's reflecting the statistical patterns of a carefully selected, and often flawed, dataset. In essence, we are mapping from one imperfect, human-influenced representation (the training data) to another (the LLM's internal representations), and then judging the latter based on its alignment with the former. This circularity further undermines the notion of objective interpretability.

A Call for a New Approach

We need a radical reimagining of how we approach AI systems. Instead of trying to force machine learning models into human-like frameworks, we must learn to appreciate them as fundamentally different forms of computational intelligence.

This means:

  • Abandoning the quest for human-like concept representations in individual neurons or layers.

  • Recognizing the distributed, dynamic, and autoregressive nature of model representations.

  • Developing analytical frameworks that respect the unique computational characteristics of these systems, such as methods for analyzing the geometry of the latent space (as suggested by Sans (2025)), the dynamics of the autoregressive process, or the emergent properties of distributed representations. We should also explore techniques that go beyond simple sparsity and consider alternative forms of regularization or architectural constraints that might promote more structured, yet still non-anthropomorphic, internal representations.

  • Accepting that "understanding" might look very different from our anthropocentric expectations. We should focus on behavioral predictability, internal consistency, and the ability to explain how the model generates its outputs, rather than searching for a mythical "mind" within the machine.

  • Addressing counterarguments: Some might argue that even if neuron-level interpretations are imperfect, they can still provide useful insights for debugging, steering, or improving model safety. While localized interpretations might offer limited utility in specific contexts, they are ultimately based on a flawed understanding of how these models work. Relying on them risks creating a false sense of control and could lead to unintended consequences, as illustrated by the ease of "jailbreaking" LLMs (Sans, 2025).
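As one hedged illustration of what a behavior-level metric might look like, here is a paraphrase-consistency check that judges a model purely by the stability of its outputs, never by its internals. The `toy_model` and prompts are hypothetical stand-ins; a real evaluation would swap in an actual model call:

```python
def consistency_score(model, prompt_variants):
    """Fraction of paraphrase pairs on which the model gives the
    same answer: a behavior-level check that never looks inside."""
    answers = [model(p) for p in prompt_variants]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical stand-in model: answers by keyword, so any phrasing
# that drops the keyword flips the answer.
def toy_model(prompt):
    return "Paris" if "capital" in prompt.lower() else "unsure"

variants = [
    "What is the capital of France?",
    "Name France's capital city.",
    "Which city serves as the seat of the French government?",
]
print(consistency_score(toy_model, variants))  # 1 of 3 pairs agree: 0.333...
```

A metric like this is grounded entirely in observable behavior, so it sidesteps the question of whether any neuron "means" anything at all.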

Conclusion

The current state of AI interpretability research is more philosophy than science—a complex dance of projection, reduction, and misplaced anthropomorphism. We're not uncovering the "mind" of these models; we're revealing more about our own cognitive limitations and desperate need to see ourselves reflected in our technological creations.

True progress will come not from forcing AI into human-shaped boxes, but from developing the intellectual humility to meet these systems on their own terms, acknowledging their fundamentally statistical and emergent nature, and focusing on understanding their behavior within the context of their training data and computational processes.

Note: This critique draws on research from Bills et al. (2023), Bricken et al. (2023), Bai et al. (2024), Vafa et al. (2024), and the analytical perspectives of Sans (2024, 2025).
