Tokenization


Think of tokenization like chopping a sentence into Lego blocks so the AI can build meaning block by block.

Tokenization is the process of breaking text into smaller pieces, called tokens, so a language model (like ChatGPT or Gemini) can understand and work with it.

What is a Token?

A token can be:

  • a word (like hello)

  • part of a word (like un- in unhappy)

  • a punctuation mark (like !, ?)

  • even a space! (the short sketch after this list shows these pieces in practice)
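
Here's a quick way to see these pieces for yourself. This is a minimal sketch using the open-source tiktoken library (one real tokenizer among many); the exact pieces you get depend on which tokenizer you use:

```python
# pip install tiktoken
import tiktoken

# Load one of the tokenizers used by recent OpenAI models.
enc = tiktoken.get_encoding("o200k_base")

text = "Hello, I am unhappy!"
token_ids = enc.encode(text)

# Decode each token ID individually to see the text piece it represents.
pieces = [enc.decode([tid]) for tid in token_ids]
print(pieces)
# Depending on the tokenizer, you'll see whole words, word parts like "un",
# punctuation such as "," and "!", and pieces that start with a space.
```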

Why Tokenization Matters

  • GPT doesn’t read words; it reads tokens as numbers.

  • Each token is converted to a number, then processed by the neural network to produce a result (i.e., predict the next token), as the sketch below shows.
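
Here is a minimal sketch of that text-to-numbers step, again assuming the tiktoken library; the exact IDs depend on the tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# Text goes in, a list of integer token IDs comes out;
# these IDs are what the neural network actually processes.
ids = enc.encode("GPT reads tokens as numbers.")
print(ids)  # a list of integers; the exact values depend on the tokenizer

# The mapping is reversible: decoding the IDs recovers the original text.
print(enc.decode(ids))  # "GPT reads tokens as numbers."
```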

How a String is Tokenized

Tokenizer → play with the site to get a better understanding of tokens.

  • As you can see, “Hey” is represented by 175196, and that number will always be the same for “Hey” within a given tokenizer.

  • Similarly, all the other numbers are generated from their corresponding pieces of the string.

  • Different models use different algorithms to convert strings to numbers. For example, the tokenizer shown here splits a long run of letters into small groups of characters, while common strings are converted directly into a single token; the sketch below compares two tokenizers on the same string.
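
To see that different tokenizers really do produce different numbers for the same string, here is a small sketch comparing two tiktoken encodings (the specific encoding names are just examples):

```python
import tiktoken

text = "Hey"

# Two different tokenizers assign different IDs to the same string.
for name in ["cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(name, enc.encode(text))

# The numbers differ because each tokenizer was built with its own
# vocabulary and merge rules, but each mapping is fixed: "Hey" always
# gets the same ID(s) within a given tokenizer.
```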

What Happens After Tokenization?

  • The generated tokens are passed to a Transformer.

  • The Transformer takes the input, processes it, and generates an output: the next predicted token.

  • The token generated by the Transformer is fed back into the input. This process continues until the Transformer emits <EOS> (End of String), meaning the output is completely generated.

  • Similarly, <BOS> (Beginning of String) comes first, marking the start of prediction. The loop sketch below puts these steps together.
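
Putting the loop together, here is a minimal sketch; predict_next_token is a hypothetical stand-in for a real Transformer, not an actual model call:

```python
# A minimal sketch of autoregressive generation. `predict_next_token`
# is a hypothetical stand-in for a real Transformer model.
BOS, EOS = "<BOS>", "<EOS>"

def predict_next_token(tokens):
    # A toy "model": emits a canned reply one token at a time,
    # then signals it is done with <EOS>.
    reply = ["Hello", ",", " world", "!", EOS]
    return reply[len(tokens) - 1]  # -1 skips the <BOS> token

tokens = [BOS]                 # <BOS> starts the sequence
while True:
    next_token = predict_next_token(tokens)
    if next_token == EOS:      # model says the output is complete
        break
    tokens.append(next_token)  # feed the output back as input

print("".join(tokens[1:]))     # "Hello, world!"
```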

Summary

  • Breaking text into smaller pieces, with or without standalone meaning, is called tokenization.

  • GPT doesn't generate text all at once — it predicts one token at a time.

  • It calculates the probability of the next token based on all the tokens before it, as the toy example below illustrates.
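
As a rough illustration of that last point, here is a toy example with made-up scores (logits) for three candidate next tokens; real models compute this over a vocabulary of tens of thousands of tokens:

```python
import math

# Made-up scores (logits) a model might assign to a few candidate
# next tokens, given the context "The sky is".
logits = {" blue": 4.0, " clear": 2.5, " falling": 0.5}

# Softmax turns scores into probabilities that sum to 1.
total = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / total for tok, v in logits.items()}

for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok!r}: {p:.2f}")
# The model then picks (or samples) the next token from this distribution.
```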
