In AI, no tokens = no magic.

Scene:
It’s my first week mentoring a new team member in the IT department. He’s fresh out of college, excited, and slightly overwhelmed.
New Joiner:
“Shuvam bhaiya, I just joined the NLP team. Everyone’s talking about tokenization. I get it… But also don’t.”
Me:
“Think of tokenization like compiling code — before the compiler can run your Java or C program, it first breaks it into smaller chunks it understands.”
New Joiner:
“So… it’s like splitting sentences into words?”
Me:
“Yes, but not just words. Depending on the tokenizer, we can split into:
Words → 'I love mangoes' → ['I', 'love', 'mangoes']
Subwords → 'mangoes' → ['mango', 'es']
Characters → 'cat' → ['c', 'a', 't']
In programming terms, tokens are the basic units of processing for language models, just like keywords, identifiers, and operators are the basic units of a programming language.”
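The word-level and character-level splits above can be sketched in a few lines of Python (illustrative only; real tokenizers handle punctuation, casing, and much more):

```python
# Toy examples of two tokenization granularities.
# Real tokenizers are far more sophisticated than this.

def word_tokenize(text):
    """Word-level: split on whitespace."""
    return text.split()

def char_tokenize(text):
    """Character-level: split into individual characters."""
    return list(text)

print(word_tokenize("I love mangoes"))  # ['I', 'love', 'mangoes']
print(char_tokenize("cat"))             # ['c', 'a', 't']
```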
Me:
“Say we feed the sentence:
I enjoy coding in JavaScript.
A simple tokenizer might output: ['I', 'enjoy', 'coding', 'in', 'JavaScript', '.']
But a Byte Pair Encoding (BPE) tokenizer (used in many modern AI models) might split 'coding' into ['cod', 'ing'] — this helps it handle rare or unknown words without failing.”
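Here is a toy greedy longest-match subword tokenizer that captures the idea. (This is a hand-picked vocabulary and a WordPiece-style greedy match, not real BPE training, which learns merges from corpus statistics.)

```python
# Toy subword tokenizer: greedily match the longest vocabulary piece.
# The vocabulary is hand-picked for this example, not learned.
VOCAB = {"cod", "ing", "mango", "es", "I", "enjoy", "in"}

def subword_tokenize(word, vocab=VOCAB):
    tokens, i = [], 0
    while i < len(word):
        # Try the longest substring starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece matched: emit an unknown-token marker.
            tokens.append("<unk>")
            i += 1
    return tokens

print(subword_tokenize("coding"))   # ['cod', 'ing']
print(subword_tokenize("mangoes"))  # ['mango', 'es']
```

Because unknown words are broken into known pieces (or a fallback marker), the tokenizer never fails outright on a word it has not seen before.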
New Joiner:
“So tokens are not fixed as ‘words’ — they depend on the tokenizer design?”
Me:
“Exactly. Different models choose different granularities.”
Me:
“Once we split text into tokens, each token gets mapped to a unique integer ID.
Example:
I → 101
enjoy → 204
coding → 305
The model doesn’t understand text directly — it works with these numeric IDs.”
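In code, this mapping is just a vocabulary lookup. (The IDs below are made up for illustration; real models assign their own.)

```python
# Map tokens to integer IDs with a small hand-made vocabulary.
# These IDs are arbitrary -- real tokenizers ship their own mapping.
vocab = {"I": 101, "enjoy": 204, "coding": 305}

tokens = ["I", "enjoy", "coding"]
ids = [vocab[t] for t in tokens]
print(ids)  # [101, 204, 305]
```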
New Joiner:
“Ohhh… so that’s why we call it preprocessing.”
Me:
“Correct. Tokenization is often the very first step in the NLP pipeline, right before feeding data into an embedding layer or a neural network.
Without tokenization, a model would see: 'IenjoycodinginJavaScript.' — one giant unreadable string.
With tokenization, it sees structured input it can process step by step.”
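The whole pipeline so far — raw text → tokens → integer IDs — fits in a short sketch. (The tokenizer and ID assignment here are simplified stand-ins, not any model's actual scheme.)

```python
# Minimal end-to-end sketch: raw text -> tokens -> integer IDs.
# Both steps are simplified for illustration.

def tokenize(text):
    # Separate trailing punctuation, then split on whitespace.
    return text.replace(".", " .").split()

vocab = {}
def to_ids(tokens):
    # Assign each previously unseen token the next free ID.
    return [vocab.setdefault(t, len(vocab)) for t in tokens]

tokens = tokenize("I enjoy coding in JavaScript.")
ids = to_ids(tokens)
print(tokens)  # ['I', 'enjoy', 'coding', 'in', 'JavaScript', '.']
print(ids)     # [0, 1, 2, 3, 4, 5]
```

These IDs are what actually reach the embedding layer; the model never sees the raw string.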
New Joiner:
“So in simple terms: Tokenization = breaking text into processable chunks → converting to IDs → feeding to model?”
Me:
“Exactly. It’s like reading a book — you don’t swallow the whole page at once, you read word by word.”
New Joiner:
“Got it. Without tokenization, NLP is like trying to debug a program without indentation.”
Me:
“Bingo. And in AI, no tokens = no magic.”
Written by

Shuvam Sengupta
Passionate backend developer with expertise in Node.js, microservices architecture, Kafka, and Docker. Specializing in building scalable solutions for fleet management, EV charging infrastructure, and trip optimization and route planning. Experienced in developing real-time tracking systems, analytics platforms, and intelligent automation for supply chain and logistics. Skilled in leveraging advanced technologies like LangChain and OpenAI language models to drive innovation and efficiency