Transformers: How AI Solves “Paris is to France as Rome is to ?”

Table of contents
- Flow Steps:
- 1. Tokenization - Breaking Text Into Understandable Units
- 2. Vector Embedding Lookup – Turning Tokens Into Vectors
- 3. Positional Encoding – Adding Sequence Awareness
- 4. Self Attention - Contextual Understanding
- 5. Semantic Vector Space - Understanding Meaning Through Distance
- 6. Vector Arithmetic – Finding the Relationship
- 7. Cosine Similarity – Finding the Nearest Word
- Flow Diagram

Ever wondered how an AI confidently answers an analogy like
Paris : France => Rome : ? and returns Italy?
In this blog, we’ll break down how transformer-based AI models interpret and solve such relationships.
Flow Steps:
1. Tokenization - Breaking Text Into Understandable Units
What happens:
The input text is broken down into tokens, usually words or sub-words.
Each token is assigned a unique token ID from a predefined vocabulary.
Example:
"Paris" → token ID: 1256
"France" → token ID: 3452
"Rome" → token ID: 1789
2. Vector Embedding Lookup – Turning Tokens Into Vectors
Once we have token IDs, each is mapped to a vector via an embedding matrix (a learned lookup table). These vectors represent each word’s semantic meaning based on how it's used in language.
Example:
V_paris = [0.12, -0.83, 0.45, ..., 0.31]
V_france = [0.22, -0.71, 0.32, ..., 0.28]
V_rome = [0.09, -0.85, 0.48, ..., 0.29]
Each vector might have 300 or 768 dimensions, depending on the model.
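A rough sketch of the lookup, using a randomly initialized NumPy matrix in place of the learned embedding table (the sizes and token IDs are illustrative):

```python
import numpy as np

# The embedding matrix is a learned lookup table: row i is the vector for token ID i.
# Here it is random just to show the mechanics; in a trained model each
# 300- or 768-dimensional row encodes the word's semantic meaning.
vocab_size, embed_dim = 5000, 768
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = [1256, 3452, 1789]          # "Paris", "France", "Rome"
vectors = embedding_matrix[token_ids]   # shape (3, 768), one row per token
print(vectors.shape)                    # (3, 768)
```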
3. Positional Encoding – Adding Sequence Awareness
Transformers process all tokens in parallel, so they don’t inherently know the order of the words. Facts like “Paris is the first word” or “Rome comes last” are not obvious without extra help.
What Positional Encoding Does:
It injects position information into each token embedding, using either sine/cosine functions (as in the original Transformer paper) or learned positional vectors.
This is added to the embedding vector:
Final_input = Embedding + PositionalEncoding
Now the model knows both:
What the token is
Where it appears
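Here is a minimal NumPy sketch of the sinusoidal variant from the original Transformer paper; the sequence length, embedding size, and embeddings themselves are just placeholders.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model // 2)
    angles = positions / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Final_input = Embedding + PositionalEncoding
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 768))             # 8 tokens, 768-dim embeddings
final_input = embeddings + positional_encoding(8, 768)
```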
4. Self Attention - Contextual Understanding
Now each enriched token vector is passed through layers of self-attention and feed-forward networks in the transformer.
This produces contextual embeddings, meaning the model understands:
How each word relates to others
The full meaning of the sequence
So "Rome"
now carries contextual understanding like: “It’s part of a capital-country relationship.”
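A simplified single-head sketch of scaled dot-product self-attention, with random toy weights; real transformers stack many layers, use multiple heads, and add feed-forward networks on top.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Project each token into query, key and value vectors, then let every
    # token attend to every other token (scaled dot-product attention).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (tokens, tokens) attention scores
    return softmax(scores) @ v                # contextual embeddings

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))                          # 8 tokens, 64-dim inputs
w_q, w_k, w_v = (rng.normal(size=(64, 64)) for _ in range(3))
contextual = self_attention(x, w_q, w_k, w_v)         # shape (8, 64)
```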
5. Semantic Vector Space - Understanding Meaning Through Distance
The transformer outputs vectors that exist in a semantic space, where similar concepts are close together.
For example:
Capital cities like "Paris", "Rome", and "Berlin" have embeddings near each other.
Countries like "France", "Italy", and "Germany" cluster similarly.
So the model can reason about relationships using spatial distance and direction.
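To make this concrete, here is a toy 2-D picture with made-up coordinates; real embedding spaces have hundreds of dimensions, but the clustering idea is the same.

```python
import numpy as np

# Made-up 2-D coordinates purely for illustration: capitals sit near each
# other, countries sit near each other.
words = {
    "Paris":   np.array([1.0, 3.0]),
    "Rome":    np.array([1.2, 2.8]),
    "Berlin":  np.array([0.9, 3.1]),
    "France":  np.array([4.0, 1.0]),
    "Italy":   np.array([4.2, 0.8]),
    "Germany": np.array([3.9, 1.1]),
}

def distance(a, b):
    return np.linalg.norm(words[a] - words[b])

print(distance("Paris", "Rome"))    # small: capital vs. capital
print(distance("Paris", "France"))  # larger: capital vs. country
```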
6. Vector Arithmetic – Finding the Relationship
This is where the analogy logic kicks in.
We perform the operation:
// V_result - V_rome = V_france - V_paris
V_result = V_france - V_paris + V_rome
What does this mean?
V_france - V_paris computes the direction from a capital to its country. Adding V_rome applies that same relationship to Rome, pointing toward Rome’s country.
So the model is searching for the word vector that represents:
"The country of Rome, in the same way that France is the country of Paris."
7. Cosine Similarity – Finding the Nearest Word
Once we get V_result, we compare it with all other word vectors in the vocabulary using cosine similarity.
The vector closest to V_result (by angle/direction) is chosen as the output. In this case, that vector corresponds to the word "Italy".
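A small sketch of that nearest-neighbour search, again with toy vectors and a tiny made-up vocabulary:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Tiny made-up vocabulary of 2-D vectors, just to show the search.
vocabulary = {
    "Italy":   np.array([4.2, 0.8]),
    "Spain":   np.array([4.1, 0.6]),
    "Germany": np.array([3.9, 1.1]),
    "Berlin":  np.array([0.9, 3.1]),
}

v_result = np.array([4.2, 0.8])   # from the vector arithmetic in step 6

# Pick the word whose vector points in the most similar direction.
best = max(vocabulary, key=lambda word: cosine_similarity(v_result, vocabulary[word]))
print(best)   # Italy
```

A real model runs this comparison against every vector in its vocabulary and returns the best match, which is how "Italy" comes out on top.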
Flow Diagram
(Detailed flow diagram image)
Thank you for visiting
Written by Ram Bhardwaj, sharing what I learn every day.