How AI Learned to See, Remember, and Converse


In our last post, we introduced the “attention mechanism” as the breakthrough that fixed the “bottleneck problem” in AI translation. We learned that by allowing models like the Transformer to focus on the most relevant parts of a source text, they could translate long, complex sentences with incredible accuracy.
But that was just the beginning. The power of “selective focus” goes far beyond translation. In this post, we’ll explore how this same core idea allows AI to “see” images, how it works under the hood, and how it paved the way for the conversational AI we interact with every day.
Attention Beyond Text: Teaching an AI to “See”
The magic of attention isn’t limited to words. The same principle can be used to “translate” an image into a sentence.
Imagine an AI tasked with describing a photo of our boy and his dog in the park. Without attention, the AI would have to look at the entire image at once and cram its meaning into a single, fixed summary before writing a caption. It might say, “A boy and a dog in a park.”
With attention, the model can focus on different parts of the image as it writes each word.
When it writes the word “boy,” it focuses its attention on the pixels that make up the boy.
When it’s ready to write “frisbee,” its attention shifts to the pixels making up the frisbee.
To describe the “sunny day,” it can look at the clear blue sky and the bright light in the image.
This allows for far richer, more detailed captions that feel more human.
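To make that idea concrete, here is a tiny Python sketch (NumPy, made-up numbers) of attention over an image. The assumption is that a vision network has already turned the photo into a small grid of "region" vectors; the caption writer then decides how much of each region to look at for the word it's about to write.

```python
import numpy as np

# A toy sketch of visual attention (all numbers are made up).
# Assume a vision network has already split the photo into a 3x3 grid
# and produced one feature vector per region: 9 regions, 4 numbers each.
np.random.seed(0)
region_features = np.random.randn(9, 4)

# The caption writer's current state as it prepares to write the word "boy".
decoder_state = np.random.randn(4)

# 1. Score every region for relevance to the word being written.
scores = region_features @ decoder_state            # shape (9,)

# 2. Turn the scores into attention weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()     # shape (9,)

# 3. Blend the regions into one context vector, dominated by the
#    high-weight regions (ideally, the ones containing the boy).
context = weights @ region_features                 # shape (4,)

print("attention over the 9 regions:", np.round(weights, 2))
print("context vector for 'boy':", np.round(context, 2))
```

The weights show where the model is "looking" for this particular word; for the next word, it recomputes them from scratch.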
Under the Hood: Attention in Action
So how does the AI decide where to “focus”? It’s an elegant three-step process of Scoring, Weighting, and Blending: the model first scores every source word for how relevant it is to what it’s about to say, then turns those scores into percentages, and finally blends the source words according to those percentages into a single “context vector.”
To make this concrete, let’s trace the translation of “my husband is very wise and wealthy” into French.
Imagine the model has already produced “mon…” and needs to generate the next word. During the Scoring step, it looks at the source text and realizes “husband” is the most relevant word. So, during Weighting and Blending, it creates a context vector that is heavily focused on the meaning of “husband.” This allows it to confidently predict the next word: “mari.”
Now, the magic happens. For the next word, the process repeats. Later in the sentence, when the model has generated “…très sage et…” (“…very wise and…”), it re-scores the original sentence. This time, the word “wealthy” gets the highest score. The new context vector is now all about “wealthy,” leading to the prediction “riche.”
This ability to dynamically shift its focus for every single word is what gives attention its power.
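Here is a small sketch of those three steps in plain NumPy. The word vectors and decoder states below are hand-picked toy values, not anything a real translation model would produce, but they show how the same Scoring, Weighting, and Blending routine can land on “husband” at one step and “wealthy” at another.

```python
import numpy as np

def attend(query, keys, values):
    """One attention step: Scoring, Weighting, Blending."""
    scores = keys @ query                               # Scoring: how relevant is each source word?
    weights = np.exp(scores) / np.exp(scores).sum()     # Weighting: softmax into percentages
    context = weights @ values                          # Blending: weighted average = context vector
    return weights, context

source_words = ["my", "husband", "is", "very", "wise", "and", "wealthy"]

# Hand-picked toy vectors: one distinct direction per source word.
# A real model learns these embeddings; this is just for illustration.
E = np.eye(len(source_words))
keys = values = E

# Hypothetical decoder states. In a trained model they come from the decoder
# network; here they are crafted so each one "asks about" the right word.
state_before_mari = E[source_words.index("husband")]
state_before_riche = E[source_words.index("wealthy")]

for label, state in [("about to write 'mari'", state_before_mari),
                     ("about to write 'riche'", state_before_riche)]:
    weights, _ = attend(state, keys, values)
    focus = source_words[int(weights.argmax())]
    print(f"{label}: focus on '{focus}'  weights =", np.round(weights, 2))
```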
The Road to Transformers: From RNNs to Parallel Power
Before Transformers, the most common models for handling sequences were Recurrent Neural Networks (RNNs). Think of an RNN like a person reading a book one word at a time. It has a form of memory (a “hidden state”) that allows it to remember what it just read and use that context to understand the current word. The problem? It only knows about the past. It has no idea what words are coming next.
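If you're curious what that “hidden state” looks like in practice, here is a minimal sketch of a simple RNN reading a sentence one word at a time. The sizes and weights are arbitrary toy values, purely for illustration.

```python
import numpy as np

# A minimal sketch of a simple RNN reading a sentence one word at a time.
np.random.seed(1)
hidden_size, embed_size = 3, 4
W_xh = np.random.randn(hidden_size, embed_size) * 0.1   # current word -> memory
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # old memory   -> new memory

def rnn_step(prev_hidden, word_vector):
    # The new hidden state mixes the word just read with everything read so far.
    return np.tanh(W_xh @ word_vector + W_hh @ prev_hidden)

hidden = np.zeros(hidden_size)                                # empty memory before reading
sentence = [np.random.randn(embed_size) for _ in range(5)]    # toy vectors for 5 words

for word_vector in sentence:        # strictly one word at a time, left to right
    hidden = rnn_step(hidden, word_vector)
    # At this point the model knows only the words it has read so far.

print("final hidden state (the RNN's memory of the sentence):", np.round(hidden, 2))
```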
To improve this, developers created Bidirectional RNNs (BiRNNs). A BiRNN is simply two RNNs working together: one reads the sentence forward (left-to-right), and the other reads it backward (right-to-left).
This is a huge advantage. Consider the sentence: “The player took a bow…” Without knowing what comes next, “bow” is ambiguous. Is it a weapon, or a gesture of respect? The backward-reading RNN sees the end of the sentence (e.g., “…before the cheering crowd”) and immediately knows the context. By blending the knowledge of the past and the future, a BiRNN has a much deeper understanding of the sentence.
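A rough sketch of that blending of past and future: run one toy RNN forward, another backward, and glue the two states for each word together. Again, the vectors and weights are random placeholders; the point is the structure, not the numbers.

```python
import numpy as np

# A toy BiRNN: one RNN reads forward, another reads backward, and each word's
# final representation is the two hidden states glued together.
np.random.seed(2)
hidden_size, embed_size = 3, 4

def make_weights():
    return (np.random.randn(hidden_size, embed_size) * 0.1,
            np.random.randn(hidden_size, hidden_size) * 0.1)

def run_rnn(words, W_xh, W_hh):
    """Run a simple RNN over the word vectors, keeping the state at each position."""
    h, states = np.zeros(hidden_size), []
    for w in words:
        h = np.tanh(W_xh @ w + W_hh @ h)
        states.append(h)
    return states

# Toy vectors standing in for: "The player took a bow before the cheering crowd"
sentence = [np.random.randn(embed_size) for _ in range(9)]

forward_states = run_rnn(sentence, *make_weights())                # knows the past
backward_states = run_rnn(sentence[::-1], *make_weights())[::-1]   # knows the future

# The representation of "bow" (word 5) now blends both directions,
# so the "cheering crowd" context is available when interpreting it.
bow = np.concatenate([forward_states[4], backward_states[4]])
print("bidirectional representation of 'bow':", np.round(bow, 2))
```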
But even BiRNNs had two big problems:
They were slow: Processing word-by-word is like a bucket brigade — you can’t pass the tenth bucket until you’ve passed the first nine. This makes training on huge datasets very inefficient.
They could still forget: In a very long paragraph, the connection between the first and last words could become weak as information was passed step-by-step.
This is where the Transformer changed everything. Instead of processing word-by-word, its self-attention mechanism allows it to look at all words at the same time. This parallel processing is not only dramatically faster, but it also means the connection between any two words is always direct and strong.
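Here is what “all words at the same time” looks like in code. This is a stripped-down, single-head sketch of self-attention with random toy weights (a real Transformer adds multiple heads, masking, and learned parameters), but the key point survives: the whole sentence is handled with a few matrix multiplications instead of a word-by-word loop.

```python
import numpy as np

# A stripped-down, single-head self-attention sketch (random toy weights).
np.random.seed(3)
num_words, d = 6, 4
X = np.random.randn(num_words, d)        # one vector per word, all available at once

W_q = np.random.randn(d, d) * 0.5        # toy projection matrices
W_k = np.random.randn(d, d) * 0.5
W_v = np.random.randn(d, d) * 0.5
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every word scores every other word in a single matrix multiplication.
# No bucket brigade: each pair of words is directly connected.
scores = Q @ K.T / np.sqrt(d)                                          # (6, 6)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax per row
output = weights @ V                                                   # new vector per word

print("attention matrix shape:", weights.shape)            # each row sums to 1
print("how word 1 attends to all 6 words:", np.round(weights[0], 2))
```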
This breakthrough solved the limitations of RNNs and made it possible to build the massive, powerful large language models that power today’s conversational AI bots, from website helpers to the voice assistant on your phone.