"Tokenization: Turning Text into Secret Pieces 🧩"

What is Tokenization?

Tokenization means breaking text into smaller pieces called tokens.
A token can be a whole word, part of a word, a punctuation mark, or a single character. Think of tokens as the puzzle pieces that text is built from.

Imagine This:

You have a sentence:
"Hello, how are you?"

When you tokenize it, you split it into parts, like:
["Hello", ",", "how", "are", "you", "?"]

Each of these parts is a token.
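To make the split above concrete, here is a minimal sketch of a toy splitter. The function name simpleTokenize and its regular expression are assumptions made for illustration. Real tokenizers usually rely on learned subword rules rather than a pattern like this one.

```javascript
// Toy splitter (hypothetical helper, not part of any library):
// keeps runs of word characters and single punctuation marks, drops whitespace.
function simpleTokenize(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(/\w+|[^\w\s]/g) ?? [];
}

const pieces = simpleTokenize("Hello, how are you?");
console.log(pieces); // ["Hello", ",", "how", "are", "you", "?"]
```

Because this version throws the spaces away, it is only good for looking at the pieces. It cannot rebuild the exact original sentence.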

Why tokenize?

  • Computers don't work with sentences directly.

  • They work with numbers.

  • Tokenization turns your text into a list of numbers (one ID per token) that the computer can work with.
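The step from pieces to numbers can be sketched with a tiny made-up vocabulary. Everything here is hypothetical: buildVocab simply hands out the next free number to each new piece, whereas a real tokenizer ships with a fixed vocabulary that was learned ahead of time.

```javascript
// Hypothetical toy vocabulary: each distinct piece gets the next free number.
function buildVocab(pieces) {
  const vocab = new Map();
  for (const piece of pieces) {
    if (!vocab.has(piece)) vocab.set(piece, vocab.size);
  }
  return vocab;
}

const words = ["the", "cat", "saw", "the", "dog"];
const vocab = buildVocab(words);

// Look up the number for every piece, in order.
const ids = words.map((w) => vocab.get(w));
console.log(ids); // [0, 1, 2, 0, 3]
```

Notice that "the" appears twice and gets the same number both times. The same piece always maps to the same ID.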

How your code tokenizes:

  1. You type some text in the textarea (for example: "Hello world!").

  2. When you click "Tokenize & Decode", this happens inside your code:

      const tokenized = enc.encode(inputText);

      The enc.encode() function splits your text into tokens and returns them as an array of numbers.

  3. Each number stands for one small piece of your text.

  4. For example, the tokens might look like [15496, 995]. In the GPT-2 vocabulary these two numbers correspond to "Hello" and " world" (the leading space is part of the second token). The exact numbers depend on the encoding your tokenizer uses.

  5. You then see these tokens displayed in your app.

  6. Then the app uses:

      const decodedText = enc.decode(tokenized);

      This turns the tokens back into the original text.

  7. It's like putting the puzzle pieces back together.
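The encode and decode round trip described in the steps above can be tried without any library. The sketch below uses a hypothetical stand-in called toyEnc that treats every character as its own token (its Unicode code point). It is not the enc object from your app, and the numbers it produces will differ from those of a real tokenizer, but the shape of the two calls is the same.

```javascript
// Hypothetical stand-in for `enc`: one token per Unicode code point.
const toyEnc = {
  // Text -> array of numbers.
  encode: (text) => Array.from(text, (ch) => ch.codePointAt(0)),
  // Array of numbers -> text.
  decode: (tokens) => String.fromCodePoint(...tokens),
};

const inputText = "Hello world!";
const tokenized = toyEnc.encode(inputText);
const decodedText = toyEnc.decode(tokenized);

console.log(tokenized.length);          // 12 (one token per character)
console.log(decodedText === inputText); // true
```

A character-level scheme like this needs many more tokens for the same text than a subword tokenizer would, which is one reason real tokenizers group characters into larger pieces.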

Summary:

  • Tokenization breaks text into tokens (small pieces).

  • Each token is represented by a number that the computer can work with.

  • Your app shows these tokens and then decodes them so you can see the original text again.

Written by

Shubham singh boura