In AI, no tokens = no magic.

Scene:
It’s my first week mentoring a new team member in the IT department. He’s fresh out of college, excited, and slightly overwhelmed.
New Joiner:
“Shuvam bhaiya, I just joined the NLP team. Everyone’s talking about tokenization. I get it… But also don’t.”
Me:
“Think of tokenization like compiling code — before the compiler can run your Java or C program, it first breaks it into smaller chunks it understands.”
New Joiner:
“So… it’s like splitting sentences into words?”
Me:
“Yes, but not just words. Depending on the tokenizer, we can split into:
Words → 'I love mangoes' → ['I', 'love', 'mangoes']
Subwords → 'mangoes' → ['mango', 'es']
Characters → 'cat' → ['c', 'a', 't']
In programming terms, tokens are the basic units of processing for language models, just like keywords, identifiers, and operators are the basic units of a programming language.”
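The word-level and character-level splits above can be sketched in a few lines of Python (illustrative only; real tokenizers handle punctuation, casing, and much more):

```python
# Toy examples of two tokenization granularities.
# Real tokenizers are far more sophisticated than this.

def word_tokenize(text):
    """Word-level: split on whitespace."""
    return text.split()

def char_tokenize(text):
    """Character-level: split into individual characters."""
    return list(text)

print(word_tokenize("I love mangoes"))  # ['I', 'love', 'mangoes']
print(char_tokenize("cat"))             # ['c', 'a', 't']
```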
Me:
“Say we feed the sentence:
I enjoy coding in JavaScript.
A simple tokenizer might output: ['I', 'enjoy', 'coding', 'in', 'JavaScript', '.']
But a Byte Pair Encoding (BPE) tokenizer (used in many modern AI models) might split 'coding' into ['cod', 'ing'] — this helps it handle rare or unknown words without failing.”
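Here is a toy greedy longest-match subword tokenizer that captures the idea. (This is a hand-picked vocabulary and a WordPiece-style greedy match, not real BPE training, which learns merges from corpus statistics.)

```python
# Toy subword tokenizer: greedily match the longest vocabulary piece.
# The vocabulary is hand-picked for this example, not learned.
VOCAB = {"cod", "ing", "mango", "es", "I", "enjoy", "in"}

def subword_tokenize(word, vocab=VOCAB):
    tokens, i = [], 0
    while i < len(word):
        # Try the longest substring starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece matched: emit an unknown-token marker.
            tokens.append("<unk>")
            i += 1
    return tokens

print(subword_tokenize("coding"))   # ['cod', 'ing']
print(subword_tokenize("mangoes"))  # ['mango', 'es']
```

Because unknown words are broken into known pieces (or a fallback marker), the tokenizer never fails outright on a word it has not seen before.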
New Joiner:
“So tokens are not fixed as ‘words’ — they depend on the tokenizer design?”
Me:
“Exactly. Different models choose different granularities.”
Me:
“Once we split text into tokens, each token gets mapped to a unique integer ID.
Example:
I → 101
enjoy → 204
coding → 305
The model doesn’t understand text directly — it works with these numeric IDs.”
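In code, this mapping is just a vocabulary lookup. (The IDs below are made up for illustration; real models assign their own.)

```python
# Map tokens to integer IDs with a small hand-made vocabulary.
# These IDs are arbitrary -- real tokenizers ship their own mapping.
vocab = {"I": 101, "enjoy": 204, "coding": 305}

tokens = ["I", "enjoy", "coding"]
ids = [vocab[t] for t in tokens]
print(ids)  # [101, 204, 305]
```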
New Joiner:
“Ohhh… so that’s why we call it preprocessing.”
Me:
“Correct. Tokenization is often the very first step in the NLP pipeline, right before feeding data into an embedding layer or a neural network.
Without tokenization, a model would see: 'IenjoycodinginJavaScript.' — one giant unreadable string.
With tokenization, it sees structured input it can process step by step.”
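The whole pipeline so far — raw text → tokens → integer IDs — fits in a short sketch. (The tokenizer and ID assignment here are simplified stand-ins, not any model's actual scheme.)

```python
# Minimal end-to-end sketch: raw text -> tokens -> integer IDs.
# Both steps are simplified for illustration.

def tokenize(text):
    # Separate trailing punctuation, then split on whitespace.
    return text.replace(".", " .").split()

vocab = {}
def to_ids(tokens):
    # Assign each previously unseen token the next free ID.
    return [vocab.setdefault(t, len(vocab)) for t in tokens]

tokens = tokenize("I enjoy coding in JavaScript.")
ids = to_ids(tokens)
print(tokens)  # ['I', 'enjoy', 'coding', 'in', 'JavaScript', '.']
print(ids)     # [0, 1, 2, 3, 4, 5]
```

These IDs are what actually reach the embedding layer; the model never sees the raw string.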
New Joiner:
“So in simple terms: Tokenization = breaking text into processable chunks → converting to IDs → feeding to model?”
Me:
“Exactly. It’s like reading a book — you don’t swallow the whole page at once, you read word by word.”
New Joiner:
“Got it. Without tokenization, NLP is like trying to debug a program without indentation.”
Me:
“Bingo. And in AI, no tokens = no magic.”
Written by

Shuvam Sengupta
Passionate backend developer with expertise in Node.js, microservices architecture, Kafka, and Docker. Specializing in building scalable solutions for fleet management, EV charging infrastructure, and trip optimization and route planning. Experienced in developing real-time tracking systems, analytics platforms, and intelligent automation for supply chain and logistics. Skilled in leveraging advanced technologies like LangChain and OpenAI language models to drive innovation and efficiency