Decoding AI Jargons in GenAI with Philosophy

Chinmay
8 min read

Preface: Philosophy, Newton, and the Age of AI

The structure of this blog comes from something I have always liked: Newton explained physics not just with formulas, but with beautiful real-life thoughts and analogies. Just as physics shaped the 17th century, AI is shaping our time. I attempt to connect those basic ideas: not just explain how the Transformer architecture works, but explore how machines learning to think might help us understand how we should think too. [thanks to all the philosophers I am going to quote]

PS - Morally influenced by The Good Place

1. Transformer: The Philosopher of Language

“To know the self is to know all things.” — Laozi

Transformers were introduced in this 2017 paper ("Attention Is All You Need") as a tool for converting one sequence of symbols into another. For the sake of this analogy, we can say the Transformer is a neural network that listens (tokenization), reflects (embeddings), relates (positional encoding, self-attention), and responds (softmax), much like a philosopher himself.

I know, I know 🫤 some people will say it's just predicting the next word (looking at you Piyush Sir 👀) and ask why I am trying to make it so profound and complicated. But to me, prediction is the surface. Meaning is the depth.

When a Transformer predicts a word, it’s not just looking at nearby words — it’s considering:

  • The grammar structure

  • The semantic intent

  • The context

  • The emotional tone (is this ironic? serious? poetic?)

Let’s say

"I stepped onto the India’s got Latent stage not for laughter, but because ______ — now I stand beneath the lights, wondering if the price of expression is too high to pay."

A simple task, right? Predict the next word.

But depending on the training, temperature, and attention, the Transformer might say:

  • "...because screaming this truth into a microphone felt cheaper than therapy." (Therapeutic)

  • "...because of a promise I made to someone, or maybe just to myself." (Obligated)

  • "...because I believed, perhaps foolishly, that this specific audience needed to hear this specific message, unfiltered." (Idealistic)

Each word represents an interpretation — a possible meaning.

Nietzsche: “There are no facts, only interpretations.”

So while the Transformer is predicting, it is also choosing among multiple truths — just like a philosopher does. Calling LLMs "just next word guessers" is like calling a musician a "next note predictor." (See, Skynet, I was good to your ancestors.)

I wanted to dive into the math behind Transformers (language of the universe and what not) a little bit, but I don't think this is the right place or the right mood for that.


2. Tokenization & Vocab Size: Breaking It All Down

"Transcending Tokens, One Symbol at a Time" - Me 😏

Before a model can understand a sequence of symbols (language), it has to break it down into pieces: words or parts of words, called tokens.

“Divide each difficulty into as many parts as is feasible and necessary to resolve it.”
― René Descartes, Discourse on Method

Tokenization is breaking reality apart in order to understand it. We take raw text ("Hello world!") and chop it into tokens (e.g., ['Hello', 'world', '!'] or maybe ['He', 'llo', ' world', '!'], depending on the method).

A great source on the different methods: The Art of Tokenization: Breaking Down Text for AI

What I found interesting was that I had been hallucinating (like LLMs do) about the relationship between tokenization and encoding. I know everyone reading this already knows it, but for my future revisits:
Tokenization ≠ Encoding

2.1 Tokenization: Naming Reality

“Language is the house of Being.” — Heidegger

Imagine the sentence: "The wind whispers wisdom."

→ Tokenization (e.g., using Byte Pair Encoding):

tokens = ["The", "wind", "wh", "is", "pers", "wis", "dom", "."]

They’re just strings, not numbers yet. The model still can't use them.
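
Here's a minimal sketch of that step, assuming the Hugging Face transformers library and the GPT-2 tokenizer (my choice purely for illustration; the post doesn't tie itself to a specific tokenizer):

# pip install transformers
from transformers import AutoTokenizer

# Load a pretrained BPE tokenizer (GPT-2 is just an example choice)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokens = tokenizer.tokenize("The wind whispers wisdom.")
print(tokens)
# A list of sub-word strings; the exact splits depend on the learned vocabulary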

2.2 Encoding: Giving Form to Thought

“The essence of mathematics is its freedom.” — Cantor

Encoding maps each token to its integer ID in the model's vocabulary:

tokenizer.encode("The wind whispers wisdom.")
# Output: [12, 154, 487, 29, 4003, 3901, 903, 3]

These numbers don't yet carry meaning, just identity.
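
Continuing the same GPT-2 tokenizer sketch from above (still just an illustrative choice), encoding and decoding form a round trip between strings and IDs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("The wind whispers wisdom.")  # list of integer IDs
text = tokenizer.decode(ids)                         # back to a string
print(ids)
print(text)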

2.3 Vocab Size

Each token has a unique number (ID).

The vocab size is simply how many unique tokens our model recognizes.

Example:
Claude: ~200,000+ tokens (in its latest versions)
A larger vocabulary = finer-grained understanding, but a bigger embedding table and more memory.
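
If you want to check this number for a concrete tokenizer, here's a tiny sketch using the GPT-2 tokenizer from the earlier examples (GPT-2's vocabulary has roughly 50,000 tokens):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size)  # the number of unique token IDs the model recognizes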


3. Embeddings & Vectors: Giving Meaning to Tokens

“The meaning of a word is its use in the language.” — Ludwig Wittgenstein

Words alone are just IDs as we have just seen. Embeddings transform them into rich, learnable vectors.

Let’s say you have a vocabulary of 50,000 tokens.
Each token needs to be represented by a vector — say of 768 dimensions.

# PyTorch-style Embedding layer
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=50000, embedding_dim=768)

Now:

token_id = 42
vector = embedding(torch.tensor([42]))

That gives you something like:

[0.12, -0.45, 0.09, ..., 0.003]  # Length 768

This is the embedding vector — a learned representation of the word.

3.1 Why should words with similar meanings have similar vectors?

I did not understand this line when I first read it, and now it seems so obvious.

Similar meanings → similar positions → similar behavior during prediction.

These vectors were learned by the model during training. It saw these words appear in similar contexts, and adjusted the vectors so that similar-use words point in the same direction.

“You shall know a word by the company it keeps.” — J.R. Firth
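
As a toy illustration of "similar vectors" (the numbers below are made up, not taken from a real model), cosine similarity is the usual way to measure how closely two embedding directions agree:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king  = np.array([0.80, 0.65, 0.10])   # made-up 3-dimensional "embeddings"
queen = np.array([0.78, 0.70, 0.12])
apple = np.array([0.10, -0.40, 0.90])

print(cosine_similarity(king, queen))  # high: similar contexts, similar vectors
print(cosine_similarity(king, apple))  # low: different contexts, different vectors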

3.2 Why does the embedding vector above have so many dimensions (embedding_dim=768)?

More dimensions = more room to store nuance (but also more memory use).
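
Roughly speaking (assuming standard 32-bit floats, which is just the common default), the embedding table alone costs vocab_size × embedding_dim parameters:

vocab_size = 50_000
embedding_dim = 768
bytes_per_float = 4  # assuming 32-bit floats

params = vocab_size * embedding_dim               # 38,400,000 learned numbers
size_mb = params * bytes_per_float / (1024 ** 2)  # ≈ 146 MB just for this table
print(params, round(size_mb, 1))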


4. Quick Recap

Stage          What It Means
Tokenization   Split text into parts
Encoding       Assign numbers to tokens
Embedding      Turn IDs into meaning (vectors)

5. Positional Encoding: Remembering Word Order

"No man ever steps in the same river twice, for it's not the same river and he's not the same man."

Transformers don’t have memory of order by default. They see the entire sentence at once — as a set. We need to add information about the position of each word.

Positional Encoding is a vector added to the embedding of each token — it encodes the position of the word in the sentence.

So the final vector the model sees is:

final_embedding = token_embedding + positional_encoding
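
Here's a minimal sketch of that addition using learned positional embeddings (one simple option; the original Transformer's sinusoidal version is noted below under Future Coverage):

import torch
import torch.nn as nn

vocab_size, max_len, dim = 50_000, 512, 768
token_emb = nn.Embedding(vocab_size, dim)  # what each token means
pos_emb = nn.Embedding(max_len, dim)       # what each position means

token_ids = torch.tensor([[12, 154, 487, 29]])             # (batch=1, seq_len=4)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # [[0, 1, 2, 3]]

final_embedding = token_emb(token_ids) + pos_emb(positions)
print(final_embedding.shape)  # torch.Size([1, 4, 768])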

“Meaning arises not only from the word, but the place of the word.”

The Words Are Beads. Positional Encoding Is the Thread.

  • Tokens: beads of meaning.

  • Without a thread: just a pile — no order, no meaning.

  • The thread (positional encoding) keeps them in sequence, structure, and flow.

Future Coverage

Sinusoidal Positional Encoding (as in original Transformer)


6. Self-Attention: Who’s Listening to Whom?

“Knowing others is wisdom. Knowing yourself is enlightenment.” – Laozi

Self-attention allows the model to weigh the relevance of all other words in the input sequence when encoding a particular word. (As they say - attention is all you need)

In a Transformer, each word looks at every other word — including itself — and asks:

“Which of you are most relevant to my meaning right now?”

It's not just remembering — it's reflecting.

6.1 Self-Attention = Inner Awareness

In RNNs, attention is like listening to someone else.
In Transformers, self-attention is introspection.

Each word sees the entire sentence, and updates its meaning based on context.

Say we have this sentence:

“The cat sat on the mat.”

When processing “sat”, the model asks:

  • "Who helps me best understand what 'sat' means here?"

  • Likely answers: “cat”, “mat”

  • Less useful: “The”, “on”

Self-attention assigns more weight to “cat” and “mat”, and less to the rest.
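
Here's a minimal, self-contained sketch of scaled dot-product self-attention (random vectors stand in for real embeddings, single head only):

import numpy as np

def softmax(x):
    e_x = np.exp(x - x.max(axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

seq_len, dim = 6, 8                  # six tokens, one per word of "The cat sat on the mat"
np.random.seed(0)
x = np.random.randn(seq_len, dim)    # stand-in embeddings

W_q = np.random.randn(dim, dim)
W_k = np.random.randn(dim, dim)
W_v = np.random.randn(dim, dim)
Q, K, V = x @ W_q, x @ W_k, x @ W_v  # queries, keys, values

scores = Q @ K.T / np.sqrt(dim)      # how relevant is each word to every other word
weights = softmax(scores)            # each row sums to 1
output = weights @ V                 # context-aware representation of each word

print(weights.shape, output.shape)   # (6, 6) (6, 8)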


7. Softmax: Making Decisions with Probabilities

“The will chooses freely among competing desires.” — Søren Kierkegaard

Self-attention calculates raw scores. SoftMax is a function that converts these scores (or the model's final output scores before picking a word) into probabilities that all add up to 1. It helps the model make a weighted decision.

Given raw scores [3.1, 1.2, -0.9], we want to normalize them into probabilities.

The softmax function:

import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e_x / e_x.sum(axis=0)

scores = np.array([3.1, 1.2, -0.9])
probabilities = softmax(scores)

print(probabilities)
# Output (approx): [0.856, 0.128, 0.016]

This:

  • Turns any list of numbers into values between 0 and 1

  • Ensures they sum to 1

  • Magnifies differences (higher scores become even more confident)


8. Multi-Head Attention

Multi-Head Attention allows the Transformer to observe different patterns of meaning at the same time.

“The mind must divide itself to see the full truth.” — Kant

Multi-Head Attention = multiple self-attention mechanisms run in parallel.

If Self-Attention is a philosopher in thought,
then Multi-Head Attention is a council of philosophers, each offering a unique insight.

Let’s take this sentence:

"The rabbit who the dog chased escaped into the garden."

This sentence is complex — we want to understand:

  • Who chased whom?

  • What happened?

  • Where did the rabbit go?

Multi-head attention might look like this:

Head #   Focuses On
1        Grammatical structure
2        Subject-object relationships
3        Temporal flow
4        Location of events

Each head computes self-attention (just like regular attention), but independently.
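
A minimal sketch with PyTorch's built-in layer (8 heads and a 768-dimensional model are just example numbers):

import torch
import torch.nn as nn

dim, num_heads, seq_len = 768, 8, 10
mha = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)

x = torch.randn(1, seq_len, dim)     # (batch, seq_len, dim) stand-in embeddings
output, attn_weights = mha(x, x, x)  # self-attention: query = key = value = x

print(output.shape)        # torch.Size([1, 10, 768])
print(attn_weights.shape)  # torch.Size([1, 10, 10]), averaged over the 8 heads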


9. Temperature

Free will vs. determinism: with a low temperature, you predict what must be; with a high temperature, you explore what could be.

Softmax is the mind making a choice.
Temperature controls whether it chooses calmly (low temp) or creatively (high temp).

When the model generates text, the temperature parameter adjusts the randomness of the output. Low temperature makes the output more predictable and focused. High temperature makes the output riskier and more creative.

9.1 Temperature: 0.2 (very low, logical, repetitive)

"The cat sat on the mat."


9.2 Temperature: 1.0 (balanced creativity and coherence)

"The cat sat on the windowsill, watching birds with quiet curiosity."


9.3 Temperature: 1.8 (very high, random, poetic)

"The cat sat on the universe, dreaming in polka-dotted socks."

Think of temperature as a creativity dial:

  • Turn it down → safer results

  • Turn it up → more surprising, but riskier.
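
Mechanically, temperature just divides the raw scores (logits) before softmax. A minimal sketch (the logits below are made up for illustration):

import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

logits = np.array([3.1, 1.2, -0.9])   # raw scores for three candidate next words

for temperature in [0.2, 1.0, 1.8]:
    probs = softmax(logits / temperature)
    print(temperature, probs.round(3))

# Low temperature sharpens the distribution (the top word dominates);
# high temperature flattens it, so riskier words get a real chance.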
