Explaining tokenization to a 1st year university student.

If we break the word tokenization into two parts, token + ization, where a token is a small chunk or piece of something and -ization means converting into something, then putting the two together, tokenization means converting something into tokens, i.e., small chunks or pieces.

Now let's discuss why we need tokenization and where it's used. In the world of growing AI models like ChatGPT, Gemini, and Perplexity, tokenization is essential: these models predict text one token at a time, so everything they read and write has to be converted into tokens first. Here's how these models use tokenization, in a few steps:

  1. They break a full sentence into tokens or small pieces. This can be done in different ways, like by each word or even parts of a word.

    • For example, “Hey, I love ice cream.”

    • When converted into tokens, it becomes [“Hey”, “,”, “I”, “love”, “ice”, “cream”].

  2. Now the models will convert these tokens into numbers.

    • The tokens ["Hey", ",", "I", "love", "Ice", "cream"] will be converted to numbers like ["111", "12", "44", "5698", "986", "25896"]..
  3. Now, using these converted numbers, the models will predict the next word or, more accurately, the next token.

This is how tokenization actually works.
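
To make these steps concrete, here is a minimal Python sketch of the same process. The vocabulary and the ID numbers in it are made up for this example; real models use vocabularies with tens of thousands of tokens learned from data.

```python
import re

# Made-up vocabulary mapping each token to a number (ID)
vocab = {"Hey": 111, ",": 12, "I": 44, "love": 5698, "ice": 986, "cream": 25896}

def tokenize(sentence):
    # Step 1: break the sentence into word and punctuation tokens
    return re.findall(r"\w+|[^\w\s]", sentence)

def encode(tokens):
    # Step 2: convert each token into its number using the vocabulary
    return [vocab[token] for token in tokens]

tokens = tokenize("Hey, I love ice cream")
print(tokens)          # ['Hey', ',', 'I', 'love', 'ice', 'cream']
print(encode(tokens))  # [111, 12, 44, 5698, 986, 25896]
```

From here, a model looks at the sequence of numbers and predicts which number (token) is most likely to come next.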

Earlier, we learned that tokenization involves breaking a full sentence into smaller tokens. But now the question is: how do we decide how to divide the sentence? Should it be by each character, each word, or something else? The answer is that each AI model uses a different technique for tokenization. Here are a few examples:

a) Word Tokenization

  • Splitting text by spaces into whole words.

  • Problem: It doesn't work well for unknown words or different forms of words.

  • "playing" and "played" are treated as completely different tokens.

b) Subword Tokenization

  • Break words into smaller chunks that can be combined.

  • "playing" might be split into "play" + "ing".

  • This helps the model understand rare words and new combinations.
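
Here is a toy sketch of subword tokenization using a greedy longest-match over a small, made-up set of subwords; real models learn these pieces from data automatically (for example, with Byte Pair Encoding):

```python
# Made-up subword vocabulary for illustration
subwords = {"play", "ing", "ed", "love", "ly"}

def subword_tokenize(word):
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, shrinking until something matches;
        # fall back to a single character if nothing else fits
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in subwords or end - start == 1:
                pieces.append(piece)
                start = end
                break
    return pieces

print(subword_tokenize("playing"))  # ['play', 'ing']
print(subword_tokenize("played"))   # ['play', 'ed']
```

Because both words share the piece "play", the model can see that they are related.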

c) Character Tokenization

  • Each character (letter, punctuation, space) is treated as a token.

  • "cat" becomes "c", "a", "t".

With this, I'd like to conclude that tokenization is simply the technique that lets AI models turn text into tokens and numbers so they can predict the next word, or more accurately, the next token.
