Transformers: How AI Solves “Paris is to France as Rome is to ?”

Table of contents
- Flow Steps:
- 1. Tokenization - Breaking Text Into Understandable Units
- 2. Vector Embedding Lookup – Turning Tokens Into Vectors
- 3. Positional Encoding – Adding Sequence Awareness
- 4. Self Attention - Contextual Understanding
- 5. Semantic Vector Space - Understanding Meaning Through Distance
- 6. Vector Arithmetic – Finding the Relationship
- 7. Cosine Similarity – Finding the Nearest Word
- Flow Diagram

Ever wondered how an AI confidently answers an analogy like
Paris : France => Rome : ? and returns Italy?
In this blog, we’ll break down how transformer-based AI models interpret and solve such relationships.
Flow Steps:
1. Tokenization - Breaking Text Into Understandable Units
What happens:
The input text is broken down into tokens, usually words or sub-words.
Each token is assigned a unique token ID from a predefined vocabulary.
Example:
"Paris" → token ID: 1256
"France" → token ID: 3452
"Rome" → token ID: 1789
2. Vector Embedding Lookup – Turning Tokens Into Vectors
Once we have token IDs, each is mapped to a vector via an embedding matrix (a learned lookup table). These vectors represent each word’s semantic meaning based on how it's used in language.
Example:
V_paris = [0.12, -0.83, 0.45, ..., 0.31]
V_france = [0.22, -0.71, 0.32, ..., 0.28]
V_rome = [0.09, -0.85, 0.48, ..., 0.29]
Each vector might have 300 or 768 dimensions, depending on the model.
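A rough sketch of the lookup, using a randomly initialized NumPy matrix in place of the learned embedding table (the sizes and token IDs are illustrative):

```python
import numpy as np

# The embedding matrix is a learned lookup table: row i is the vector for token ID i.
# Here it is random just to show the mechanics; in a trained model each
# 300- or 768-dimensional row encodes the word's semantic meaning.
vocab_size, embed_dim = 5000, 768
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = [1256, 3452, 1789]          # "Paris", "France", "Rome"
vectors = embedding_matrix[token_ids]   # shape (3, 768), one row per token
print(vectors.shape)                    # (3, 768)
```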
3. Positional Encoding – Adding Sequence Awareness
Transformers process all tokens in parallel, so they don’t inherently know the order of the words. Facts like “Paris is the first word” or “Rome comes last” are not obvious without extra help.
What Positional Encoding Does:
It injects position information into each token embedding, using either sine/cosine functions (as in the original Transformer paper) or learned positional vectors.
This is added to the embedding vector:
Final_input = Embedding + PositionalEncoding
Now the model knows both:
What the token is
Where it appears
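Here is a minimal NumPy sketch of the sinusoidal variant from the original Transformer paper; the sequence length, embedding size, and embeddings themselves are just placeholders.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model // 2)
    angles = positions / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Final_input = Embedding + PositionalEncoding
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 768))             # 8 tokens, 768-dim embeddings
final_input = embeddings + positional_encoding(8, 768)
```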
4. Self Attention - Contextual Understanding
Now each enriched token vector is passed through layers of self-attention and feed-forward networks in the transformer.
This produces contextual embeddings, meaning the model understands:
How each word relates to others
The full meaning of the sequence
So "Rome"
now carries contextual understanding like: “It’s part of a capital-country relationship.”
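A simplified single-head sketch of scaled dot-product self-attention, with random toy weights; real transformers stack many layers, use multiple heads, and add feed-forward networks on top.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Project each token into query, key and value vectors, then let every
    # token attend to every other token (scaled dot-product attention).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (tokens, tokens) attention scores
    return softmax(scores) @ v                # contextual embeddings

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))                          # 8 tokens, 64-dim inputs
w_q, w_k, w_v = (rng.normal(size=(64, 64)) for _ in range(3))
contextual = self_attention(x, w_q, w_k, w_v)         # shape (8, 64)
```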
5. Semantic Vector Space - Understanding Meaning Through Distance
The transformer outputs vectors that exist in a semantic space, where similar concepts are close together.
For example:
Capital cities like "Paris", "Rome", and "Berlin" have embeddings near each other.
Countries like "France", "Italy", and "Germany" cluster similarly.
So the model can reason about relationships using spatial distance and direction.
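To make this concrete, here is a toy 2-D picture with made-up coordinates; real embedding spaces have hundreds of dimensions, but the clustering idea is the same.

```python
import numpy as np

# Made-up 2-D coordinates purely for illustration: capitals sit near each
# other, countries sit near each other.
words = {
    "Paris":   np.array([1.0, 3.0]),
    "Rome":    np.array([1.2, 2.8]),
    "Berlin":  np.array([0.9, 3.1]),
    "France":  np.array([4.0, 1.0]),
    "Italy":   np.array([4.2, 0.8]),
    "Germany": np.array([3.9, 1.1]),
}

def distance(a, b):
    return np.linalg.norm(words[a] - words[b])

print(distance("Paris", "Rome"))    # small: capital vs. capital
print(distance("Paris", "France"))  # larger: capital vs. country
```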
6. Vector Arithmetic – Finding the Relationship
This is where the analogy logic kicks in.
We perform the operation:
// V_result - V_rome = V_france - V_paris
V_result = V_france - V_paris + V_rome
What does this mean?
V_france - V_paris computes the direction from a capital to its country. Adding V_rome applies that same relationship to Rome, pointing toward Rome’s country.
So the model is searching for the word vector that represents:
"The country of Rome, in the same way that France is the country of Paris."
7. Cosine Similarity – Finding the Nearest Word
Once we get V_result, we compare it with all other word vectors in the vocabulary using cosine similarity.
The vector closest to V_result (by angle/direction) is chosen as the output. In this case, that vector corresponds to the word "Italy".
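A small sketch of that nearest-neighbour search, again with toy vectors and a tiny made-up vocabulary:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Tiny made-up vocabulary of 2-D vectors, just to show the search.
vocabulary = {
    "Italy":   np.array([4.2, 0.8]),
    "Spain":   np.array([4.1, 0.6]),
    "Germany": np.array([3.9, 1.1]),
    "Berlin":  np.array([0.9, 3.1]),
}

v_result = np.array([4.2, 0.8])   # from the vector arithmetic in step 6

# Pick the word whose vector points in the most similar direction.
best = max(vocabulary, key=lambda word: cosine_similarity(v_result, vocabulary[word]))
print(best)   # Italy
```

A real model runs this comparison against every vector in its vocabulary and returns the best match, which is how "Italy" comes out on top.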
Flow Diagram
(Detailed flow diagram image)
Thank you for visiting
Written by Ram Bhardwaj, sharing what I learn every day.