Tokenization

Think of tokenization like chopping up a sentence into Lego blocks, so the AI can build meaning block-by-block.
Tokenization is the process of breaking text into smaller pieces, called tokens, so a language model (like ChatGPT or Gemini) can understand and work with it.
What is a Token?
A token can be:
- a word (like "hello")
- part of a word (like "un-" in "unhappy")
- a punctuation mark (like "!" or "?")
- even a space!
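To make the idea concrete, here is a toy tokenizer that splits text into exactly those kinds of pieces: words, punctuation marks, and spaces. This is only an illustration of the concept, not the algorithm any real model uses.

```python
import re

def simple_tokenize(text):
    # Split text into words, punctuation marks, and individual spaces.
    # Each matched piece is one "token" in this toy scheme.
    return re.findall(r"\w+|[^\w\s]|\s", text)

print(simple_tokenize("Hello, unhappy world!"))
# ['Hello', ',', ' ', 'unhappy', ' ', 'world', '!']
```

Notice that even the comma, the exclamation mark, and the spaces each become their own token.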
Why Tokenization Matters
GPT doesn't read words; it reads tokens as numbers.
Each token is converted to a number, then processed by the neural network to produce a result (i.e., predict the next token).
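This token-to-number step can be sketched as a simple lookup table. The vocabulary and IDs below are invented for illustration; real models use vocabularies with tens of thousands of entries learned from data.

```python
# Toy vocabulary: each token maps to one fixed integer ID (made-up values).
vocab = {"Hey": 17, "there": 612, "!": 5}
inverse_vocab = {i: t for t, i in vocab.items()}

def encode(tokens):
    # Look up each token's fixed integer ID.
    return [vocab[t] for t in tokens]

def decode(ids):
    # Map IDs back to their tokens.
    return [inverse_vocab[i] for i in ids]

print(encode(["Hey", "there", "!"]))  # [17, 612, 5]
print(decode([17, 612, 5]))           # ['Hey', 'there', '!']
```

The neural network only ever sees the lists of numbers, never the raw text.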
How a String Is Tokenized
Tokenizer → play with the site to get a better understanding of tokens.
As you can see, "Hey" is represented by 175196, and that number will always be fixed for "Hey".
Similarly, every other string gets its own corresponding number.
Different models use different algorithms to convert strings to numbers. For example, the model shown here splits the text into short groups of characters, and some strings are converted directly into a single token.
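One popular family of such algorithms is byte-pair encoding (BPE), which repeatedly merges the most frequent pair of adjacent tokens into a new, longer token. The sketch below shows a single merge step on the word "banana"; it is a simplified illustration of the idea, not the exact procedure any particular model uses.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count every adjacent pair; max() returns the first pair
    # with the highest count.
    counts = Counter(zip(tokens, tokens[1:]))
    return max(counts.items(), key=lambda kv: kv[1])[0]

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with one merged token.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("banana")                 # ['b', 'a', 'n', 'a', 'n', 'a']
pair = most_frequent_pair(tokens)       # ('a', 'n')
print(merge_pair(tokens, pair))         # ['b', 'an', 'an', 'a']
```

Repeating this merge step many times over a large corpus is how a BPE tokenizer learns which character groups deserve their own token.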
What Happens After Tokenization?
The generated tokens are passed to the Transformer.
The Transformer takes the input, processes it, and generates an output: the next predicted token.
That output token is fed back into the input, and the process repeats until the Transformer produces <EOS> (End of String), meaning the output is completely generated.
Similarly, a <BOS> (Begin of String) token is placed first to start the prediction.
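The feedback loop described above can be sketched in a few lines. Here `predict_next` is a stand-in for the Transformer with hard-coded continuations; a real model would compute a probability distribution over its whole vocabulary at each step.

```python
BOS, EOS = "<BOS>", "<EOS>"

def predict_next(context):
    # Stand-in for the Transformer: returns a scripted next token
    # for each context, purely for illustration.
    script = {
        (BOS,): "Hello",
        (BOS, "Hello"): "world",
        (BOS, "Hello", "world"): EOS,
    }
    return script[tuple(context)]

def generate():
    tokens = [BOS]                  # generation starts from <BOS>
    while True:
        nxt = predict_next(tokens)  # "Transformer" predicts the next token
        if nxt == EOS:              # stop once <EOS> is produced
            break
        tokens.append(nxt)          # feed the output back into the input
    return tokens[1:]               # drop <BOS> from the final text

print(generate())  # ['Hello', 'world']
```

Each loop iteration is one forward pass of the model: the full context goes in, one token comes out, and that token is appended for the next pass.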
Summary
Breaking text into smaller pieces, with or without meaning, is called tokenization.
GPT doesn't generate text all at once — it predicts one token at a time.
It calculates the probability of the next token based on all the tokens before it.
Written by Naveen Kumar