Large Language Models Are Not Language Models but Stochastic Isomorphic Engines

Gerard Sans

In recent years, the explosion of large language models (LLMs) like GPT has sparked a global conversation about machine "understanding" of language. Often described as "language models," these systems are credited with performing complex language tasks, producing coherent text, and even participating in seemingly insightful dialogue. Yet, this label of "language model" is misleading at best and dangerously inaccurate at worst. At their core, these models operate through a series of token transformations based on mathematical isomorphisms, which have little to do with human semantics. Referring to them as "stochastic isomorphic engines" may be far more accurate.

The Isomorphism Principle

One of the fundamental concepts driving the functionality of LLMs is isomorphism, a mathematical principle by which relationships can be encoded in a structure-preserving manner that is entirely independent of language or meaning. The patterns an LLM captures therefore sit beneath language syntax and grammar rather than deriving from them. What the models achieve is more akin to finding structural relationships among tokens than to extracting meaning from them. What an LLM produces, in other words, is a reflection of encoded patterns rather than a thoughtful representation of the language we use.
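A minimal sketch makes this concrete (the tiny vocabulary, corpus, and integer relabelling below are invented purely for illustration): relabel every word with an arbitrary integer and the relational structure survives untouched, even though the new symbols mean nothing at all.

```python
# Toy illustration: relations among tokens survive an arbitrary relabelling.
# The vocabulary and mapping are made up; no real tokenizer works exactly this way.

vocabulary = ["cat", "sat", "on", "the", "mat"]

# "Relationships" here are simply which token follows which in a tiny corpus.
corpus = ["the", "cat", "sat", "on", "the", "mat"]
word_pairs = set(zip(corpus, corpus[1:]))

# An arbitrary, meaning-free relabelling of the vocabulary onto integers.
relabel = {word: idx for idx, word in enumerate(vocabulary, start=100)}
id_pairs = {(relabel[a], relabel[b]) for a, b in word_pairs}

# The structure (which symbol follows which) is preserved exactly:
# recovering the word pairs from the ID pairs gives back the original set.
inverse = {idx: word for word, idx in relabel.items()}
recovered = {(inverse[a], inverse[b]) for a, b in id_pairs}
assert recovered == word_pairs  # same structure, different (meaningless) symbols
```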

Tokens Are Structural, Not Semantic

Another key issue is the structural nature of the tokens used by these models. Tokens, the smallest units LLMs process, overlap in their characteristics for structural rather than semantic reasons. This token-profile overlap explains why models can produce convincing sentences without actually understanding them. An LLM doesn't "know" what a sentence means in any human sense; instead, it has learned that certain tokens commonly follow others based on the enormous corpus it was trained on. The relationships among tokens are grounded in patterns that do not inherently correspond to human-defined semantics.
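A toy bigram table, far cruder than any real transformer, is enough to illustrate the point (the corpus below is invented): the "prediction" of the next token is nothing more than a frequency lookup.

```python
from collections import Counter, defaultdict

# Toy corpus; a real LLM is trained on vastly more text, but the principle
# of "which token tends to follow which" is the same.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each token follows each other token (a bigram table).
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def most_likely_next(token):
    """Return the token that most frequently followed `token` in the corpus."""
    return following[token].most_common(1)[0][0]

print(most_likely_next("the"))  # "cat" -- chosen by frequency, not by meaning
```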

Multimodality and Beyond-Language Data

Contrary to popular belief, LLMs are not purely language-based entities. Their training often includes a diverse array of multimodal data: images, structural formats, symbols, and codes. Even within language-specific training, LLMs rely on correlations and long-range token co-occurrences, which extend far beyond what humans consider meaningful language constructs. This multimodal foundation undermines the notion that LLMs specialize in language. They handle a multitude of patterns that can be tokenized, without concern for whether those patterns correspond to human language, visual information, or symbolic logic.
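One way to see this indifference, sketched below with a simple byte-level encoding rather than a production subword tokenizer: English prose, Python code, and a molecular SMILES string all arrive at the model as nothing more than integer sequences.

```python
# Byte-level "tokenization" of three very different kinds of data.
# Real LLM tokenizers use learned subword vocabularies (e.g. BPE), but the
# point stands: the model receives integer sequences, not "language".

samples = {
    "english":  "The cat sat on the mat.",
    "python":   "def f(x):\n    return x * 2",
    "molecule": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # SMILES string for caffeine
}

for name, text in samples.items():
    token_ids = list(text.encode("utf-8"))
    print(f"{name:8s} -> {token_ids[:12]} ... ({len(token_ids)} tokens)")
```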

Hierarchical Structures Without Language

LLMs identify hierarchies and patterns using structural cues—such as separators or delimiters—that signal the beginning and end of sequences. This hierarchy-building capacity mirrors how models in fields like image processing build representations from pixels without "seeing" as humans do. Just as a Vision Transformer doesn't understand a "cat" but discerns low- to high-level structures, an LLM can construct sentence-like structures without understanding them. The same structural isomorphism governs how tokens are interpreted, underscoring that language is incidental to the model's architecture.
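A small sketch, using invented delimiter tokens, shows how far purely structural cues can go: nesting can be recovered from delimiters alone with a stack, without assigning any meaning to what sits between them.

```python
# Recover a nesting hierarchy from delimiter tokens alone, using a stack.
# Nothing here knows (or needs to know) what the non-delimiter tokens mean.

def nest(tokens, open_tok="<s>", close_tok="</s>"):
    """Group a flat token stream into nested lists based only on delimiters."""
    root, stack = [], []
    current = root
    for tok in tokens:
        if tok == open_tok:
            child = []
            current.append(child)
            stack.append(current)
            current = child
        elif tok == close_tok:
            current = stack.pop()
        else:
            current.append(tok)
    return root

stream = ["<s>", "A", "<s>", "B", "C", "</s>", "D", "</s>"]
print(nest(stream))  # [['A', ['B', 'C'], 'D']]
```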

Contamination and Language Neutrality

The presence of diverse data types in training complicates the issue of language representation. While some might expect LLMs to improve with more varied data, the opposite often holds true. When models are trained on data that strays from language-specific contexts, their effectiveness in language tasks can actually diminish. Contaminating the dataset with non-language-specific information dilutes the model's ability to focus on language structure alone, pulling it further from the ideal of a true language model.

Semantics Are a Human Projection

Our tendency to ascribe semantics to LLMs is a projection of human intent onto a system that recognizes only patterns. The model does not grasp meaning; it is indifferent to what any given word or sentence signifies. Instead, it responds by processing token profiles and selecting outputs based on probabilistic patterns. This projection problem illustrates a critical flaw in how LLMs are positioned within the field of Natural Language Processing (NLP). These systems are grounded in token relationships rather than any understanding of language.

The Misnomer of "Language Model"

Given these limitations, it is time to reconsider the term "language model." These models are not intrinsically tied to language, nor are they structured to develop proficiency in language alone. They are stochastic engines, mechanisms that mirror the structure of data fed into them through token isomorphisms and pattern recognition. They could just as easily be trained on code, mathematical notation, or molecular structures, adapting their responses based on the probability distributions of tokens rather than any sense of meaning.
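The stochastic part is equally plain when spelled out (the candidate tokens and probabilities below are invented for illustration): the final step of generation is simply sampling from a distribution over tokens, and swapping words for code tokens or molecule fragments changes nothing in the procedure.

```python
import random

# Invented next-token distribution for illustration; a real model computes
# these probabilities with a neural network, but the final step is the same:
# sample a token from a distribution, indifferent to what the token denotes.

candidates    = ["mat", "sofa", "return", "C(=O)"]  # words, code, or molecule fragments
probabilities = [0.55, 0.25, 0.15, 0.05]

def sample_next_token(tokens, probs, temperature=1.0):
    """Sample one token; temperature reshapes the distribution before sampling."""
    weights = [p ** (1.0 / temperature) for p in probs]
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(candidates, probabilities))                   # usually "mat"
print(sample_next_token(candidates, probabilities, temperature=2.0))  # flatter, more varied
```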

A Call for Terminological Precision

The current state of NLP research is marked by a significant contradiction: we treat LLMs as language-processing systems while acknowledging their reliance on data-driven, pattern-based token manipulation. The lack of terminological precision in NLP has muddied the landscape, blurring the lines between what these models are capable of and what they actually do. To move forward, the field must recognize that these engines do not understand language but rather process token-based profiles that mimic language's structural appearance.

Conclusion: Stochastic Isomorphic Engines

As LLMs continue to advance, it becomes increasingly important to define them accurately. They are, at their core, stochastic isomorphic engines—systems that identify and replicate token patterns across various data types without inherent language orientation. Only by shedding the inaccurate label of "language models" and acknowledging these systems as isomorphic engines can we begin to set realistic expectations for their capabilities and limitations. With this perspective, we can build a more precise and productive understanding of the remarkable but ultimately non-linguistic power of these models.

Written by

Gerard Sans

I help developers succeed in Artificial Intelligence and Web3; Former AWS Amplify Developer Advocate. I am very excited about the future of the Web and JavaScript. Always happy Computer Science Engineer and humble Google Developer Expert. I love sharing my knowledge by speaking, training and writing about cool technologies. I love running communities and meetups such as Web3 London, GraphQL London, GraphQL San Francisco, mentoring students and giving back to the community.