Transformers and the Mirage of Intelligence

Alex Kagan

How LLMs are still stuck in the same architecture that started it all

These days, AI is an inescapable topic of conversation, whether you are scrolling on the toilet, sipping your morning coffee, or sitting through yet another boardroom pitch. Everyone has an opinion: your coach, your CEO, or the uncle who still thinks Siri is a person.

At the center of all this noise sits one concept: large language models. And behind every one of them, powering everything from autocomplete to simulated reasoning, is the same core architecture we have been using since 2017: the Transformer.

Nearly every LLM in production today, whether it is GPT-4, Claude, Gemini, or something with a clever codename, is built on the same foundation. It is the same self-attention mechanism, the same autoregressive token-by-token prediction pattern. Sure, the outer layers have evolved: we have added memory context tricks, alignment layers, and increasingly elaborate prompting strategies. But under the hood, very little fundamental has changed.

That has not stopped people from projecting intelligence, even consciousness, onto these models. What we actually have are statistical pattern-completion engines. They are incredibly powerful and increasingly useful, but they are still just that and nothing more.

This post is a reality check, a way to separate real breakthroughs from polished extensions, and a reminder that the only leap in LLM architecture still came from a single paper almost a decade ago.

And just to be clear, this is about language models. The story in vision, robotics, and audio is different, and deserves its own post.


Transformers Are Still Running the Show

The original Transformer architecture was a breakthrough in performance and engineering elegance. Self-attention replaced recurrence, enabling parallel training and much longer-range sequence modeling, which opened the door to scaling in a way that older architectures simply could not support.
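To make that concrete, here is a stripped-down, single-head version of the self-attention step in plain NumPy. The shapes and names are illustrative, a sketch of the idea rather than any particular model's implementation:

```python
# Minimal single-head scaled dot-product self-attention (illustrative only).
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ v                               # weighted mix of value vectors

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 16, 8, 4
x = rng.standard_normal((seq_len, d_model))
out = self_attention(x,
                     rng.standard_normal((d_model, d_head)),
                     rng.standard_normal((d_model, d_head)),
                     rng.standard_normal((d_model, d_head)))
print(out.shape)  # (4, 8): one contextualized vector per token
```

Because every token looks at every other token in one matrix multiplication, the whole sequence can be processed in parallel, which is exactly what recurrence could not do.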

That elegant foundation quickly became the scaffolding for everything that followed:

  • BERT introduced masked token prediction and bidirectional attention, and quickly turned NLP benchmarks into target practice.

  • GPT-2 and GPT-3 demonstrated that raw next-word prediction, when trained on a massive scale, could unlock surprising generalization.

  • T5 reframed everything as a text-in, text-out task, allowing translation, summarization, and classification to all fall under a unified pretraining strategy.

  • ChatGPT introduced alignment tuning, memory scaffolding, and a friendly interface, transforming LLMs from a technical toy into a mass-market product.

Every one of these sits on the same base: the same attention blocks, the same stack of tokens, the same inductive bias. Even now, as companies push toward trillion-parameter models, the underlying architecture is still a Transformer. We are driving faster, more capable models, but we have not changed what is under the hood.


Tweaks That Look Like Breakthroughs

There has been plenty of progress, but most of it has come in the form of compute optimizations, clever training tricks and architectural extensions that stretch the original Transformer rather than evolve it.

Here are four of the most important innovations shaping today’s LLMs.

Mixture of Experts (MoE) and Sparse Routing

What it is: Instead of using the full feed-forward layer for every token, MoE models activate only a small subset of “experts” per input during feed-forward processing.
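As a rough sketch of the routing idea (names like moe_forward and gate_w are illustrative, and real MoE layers add load balancing and batched dispatch), the gate scores all experts for a token and only the top few feed-forward blocks actually run:

```python
# Toy top-k expert routing: only k of n expert FFNs are evaluated per token.
import numpy as np

def moe_forward(token, experts, gate_w, top_k=2):
    """token: (d,); experts: list of (w, b) feed-forward params; gate_w: (d, n_experts)."""
    logits = token @ gate_w                      # router produces one score per expert
    chosen = np.argsort(logits)[-top_k:]         # keep only the top-k experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                         # renormalize the chosen gate weights
    out = np.zeros_like(token)
    for g, i in zip(gates, chosen):
        w, b = experts[i]
        out += g * np.maximum(token @ w + b, 0)  # only these few experts ever compute
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 16
experts = [(rng.standard_normal((d, d)) * 0.1, np.zeros(d)) for _ in range(n_experts)]
y = moe_forward(rng.standard_normal(d), experts, rng.standard_normal((d, n_experts)))
print(y.shape)  # (8,) -- same output shape, but only 2 of 16 experts were touched
```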

Why it matters: This allows a model to have far greater parameter capacity without exploding inference costs. A trillion-parameter model might only use ten percent of those weights on a single forward pass. There is also evidence of reduced hallucination with this approach.

Verdict: A smart and practical way to scale further, but not a new architecture.


Long-Context Models and Rotary Positional Embeddings

What it is: Transformers were originally designed for sequences only a few thousand tokens long. Rotary positional embeddings (RoPE) offer a more scalable way to encode token position, enabling context windows of 100,000 tokens or more.

Why it matters: Longer context allows LLMs to handle entire documents, hold memory over long conversations, and tackle tasks that involve reasoning across time and structure, such as legal analysis or large codebases.

Bonus detail: RoPE rotates pairs of embedding dimensions in the complex plane by position-dependent angles, which preserves relative position information more smoothly than absolute encodings. Very clever.
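Here is a toy version of that rotation, assuming the commonly used 10000^(-2i/d) frequency schedule; real libraries differ in how they interleave the dimension pairs, so treat this as a sketch of the mechanism, not a reference implementation:

```python
# Rotary position embeddings: each pair of dimensions is rotated by an angle
# proportional to the token's position (illustrative frequency schedule).
import numpy as np

def rope(x, positions):
    """x: (seq_len, d) with even d; returns position-rotated vectors."""
    d = x.shape[-1]
    freqs = 10000.0 ** (-np.arange(0, d, 2) / d)     # one frequency per dimension pair
    angles = positions[:, None] * freqs[None, :]     # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # split into (even, odd) pairs
    return np.concatenate([x1 * cos - x2 * sin,      # 2D rotation of each pair
                           x1 * sin + x2 * cos], axis=-1)

# Only the *relative* angle between two positions survives a dot product, so
# attention scores between rotated queries and keys depend on token distance.
q = rope(np.ones((6, 8)), np.arange(6))
print(q.shape)  # (6, 8)
```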

Verdict: A clever solution to a structural limitation.


Synthetic Data and Chain-of-Thought Prompting

What it is: We are running low on high-quality human-generated data, so we have started generating training data using other models. Chain-of-thought prompting is one popular strategy, encouraging models to “think” through problems step by step rather than jumping to the answer.
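As a toy illustration of the prompting difference (the questions and the classic "Let's think step by step" phrasing are just examples, and the actual model call is omitted), compare a direct prompt with a chain-of-thought prompt:

```python
# Direct prompt vs. chain-of-thought prompt (no model call; strings only).
direct_prompt = "Q: A train leaves at 9:40 and arrives at 11:05. How long is the trip?\nA:"

cot_prompt = (
    "Q: A train leaves at 9:40 and arrives at 11:05. How long is the trip?\n"
    "A: Let's think step by step. From 9:40 to 10:40 is 60 minutes, "
    "and from 10:40 to 11:05 is 25 more, so the trip is 85 minutes.\n\n"
    "Q: A store sells pens in packs of 12. How many packs are needed for 150 pens?\n"
    "A: Let's think step by step."
)
# Sending the second question with the worked example nudges the model to emit
# intermediate steps (150 / 12 = 12.5, so 13 packs) before its final answer, and
# those generated traces are exactly the kind of text reused as synthetic data.
```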

Why it matters: Synthetic data, especially when scaffolded with stepwise reasoning, allows us to scale training further without relying on new human annotations. It also helps shape the illusion of reasoning in frontier models.

The problem: When models learn from other models, you risk a feedback loop. The output becomes a copy of a copy. The quality drifts, the diversity shrinks, and the errors get amplified.

Verdict: Highly effective in the short term, but fragile over time. We are starting to copy our own homework.


Natively Multimodal Foundation Models

What it is: Foundation models are now being trained across multiple modalities. GPT-4, Claude 3, and Gemini can process text, images, and sometimes audio or video within the same neural core.

Why it matters: In theory, this brings us closer to grounded reasoning, where a model can make sense of the world across senses, not just symbols.

In practice: Most multimodal models are still fundamentally text-first, or text-labeled. Vision or audio inputs often get tokenized and fed into the same Transformer pipeline, which works, but feels more like a patch than a deep integration.
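A rough sketch of that "tokenize everything" pattern: the image is cut into patches, each patch is projected into the same embedding space the text tokens live in, and the combined sequence flows through one Transformer. The sizes and names below are assumptions for illustration, not any specific model's pipeline:

```python
# Turning an image into a sequence of tokens for a text-style Transformer.
import numpy as np

def patchify(image, patch=16):
    """image: (H, W, 3) -> (num_patches, patch*patch*3) flattened patches."""
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
d_model = 64
proj = rng.standard_normal((16 * 16 * 3, d_model)) * 0.02   # patch-to-embedding projection
image_tokens = patchify(rng.random((224, 224, 3))) @ proj    # 196 image "tokens"
text_tokens = rng.standard_normal((12, d_model))             # stand-in for embedded text
sequence = np.concatenate([text_tokens, image_tokens])       # one stream, one Transformer
print(sequence.shape)  # (208, 64)
```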

Verdict: A meaningful step forward, but still anchored in the same architecture.


The Real Action Is Happening Around the Model

While the core model has remained mostly unchanged, the tooling and infrastructure surrounding it have exploded in complexity and usefulness.

  • Retrieval-augmented generation (RAG) gives models an external source of memory and factual grounding (a minimal sketch follows this list).

  • Vector databases let us organize and retrieve embeddings with sub-second latency.

  • Fine-tuning, distillation, quantization, and LoRA have made it possible to tailor and deploy models more efficiently.

  • Agentic frameworks like AutoGPT and OpenDevin have enabled models to perform sequences of actions, call APIs, and interact with tools.

  • Open-weight models and instruction-tuned datasets have created a fast-moving, open-source ecosystem with real competitive pressure.
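Here is a minimal sketch of the RAG pattern mentioned above: embed the documents, retrieve the closest ones for a query, and prepend them to the prompt. The embed() function is a placeholder so the example runs; a real system would call an embedding model and a vector database instead:

```python
# Minimal RAG flow: embed -> retrieve by cosine similarity -> build grounded prompt.
import numpy as np

def embed(texts):
    """Placeholder embedder; swap in a real embedding model in practice."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.standard_normal((len(texts), 384))

def retrieve(query_vec, doc_vecs, docs, k=2):
    sims = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    return [docs[i] for i in np.argsort(sims)[-k:][::-1]]    # top-k most similar documents

docs = ["Invoice policy: net 30 days.", "Refunds require a receipt.", "Office closes at 6pm."]
doc_vecs = embed(docs)
query = "When do invoices have to be paid?"
context = retrieve(embed([query])[0], doc_vecs, docs)
prompt = "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
print(prompt)  # the retrieved text gives the model grounding its weights do not carry
```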

This is where the frontier feels alive, not in the model internals, but in the stack we are building on top of them to make LLMs usable, steerable, and safe enough for production.


So, What Are We Actually Building?

We are clearly circling something powerful. These models can write, translate, debug, summarize, explain, simulate and increasingly automate. The utility leap since 2019 is undeniable.

But we are also running into the edges of what a next-token predictor can do.

  • There is no persistent memory, only the sliding window of the prompt.

  • There is no grounded world model, only patterns learned from the co-occurrence of text.

  • There is no true reasoning, only text that resembles reasoning when prompted the right way.

The AGI/ASI conversation feels premature. What we have are language machines that behave impressively within context, not systems that truly understand or adapt. We are still building all of this on top of one architectural breakthrough from eight years ago. Everything since has been extensions, tricks and scaffolding.

The hard truth is, we are still not sure what exact problem we are solving. We have built this powerful generative engine, and now we are looking around, trying to retrofit it into workflows, apps, businesses, and institutions.

We are in the “tool in search of a killer use case” phase. It is fascinating and chaotic and very real. But it is not yet intelligence.


Where It Goes From Here

A few things could actually break the loop:

  • A true architectural replacement for the Transformer, a.k.a. the next breakthrough

  • Models that reason over time, not just over tokens

  • Forward learning, where models can learn during inference rather than just during training via backpropagation

  • Memory systems that go beyond the prompt, possibly persistent or dynamically constructed

  • Grounding in perceptual and physical reality

  • Better internal representations for abstraction and composition

There are early signs pointing in this direction. Google DeepMind calls this the “Era of Experience,” a shift from static training to models that adapt continuously. If that bears out, it could fundamentally change how models learn and reason.

Do not get me wrong: the amount of money flowing into AI right now might only be rivaled by the Space Race of the 1960s, when the world's combined compute power barely added up to a modern iPhone and we still went to the moon (and yes, I do believe we did).

We are clearly standing at the edge of something huge, but we are nowhere close to general intelligence. What we have is a single architectural idea that we have stretched in every direction imaginable, hoping something magical emerges.


Written by

Alex Kagan

Principal Engineer and systems architect with 25+ years in tech, now writing at the intersection of AI, architecture, and innovation. I explore the implications of machine intelligence, interpret trends, and unpack what matters beneath the hype. Clear thinking on artificial thinking.