The Complete Guide to Tokenization in Generative AI

Sandesh ak
5 min read

1. What is Tokenization?

In Generative AI, tokenization is the process of breaking down raw text (or any sequence like code, speech, or even images) into small, meaningful units called tokens.

Think of tokens as the alphabet of the AI’s language. Just as we use letters to form words and sentences, AI uses tokens to represent and understand any input.


Why not use whole words as tokens?

Because language is complex:

  • Some words are very rare and might never have been seen during training.

  • Words can be combined in new ways.

  • Many words share roots: "connect", "connection", "connected". Instead of storing all separately, store reusable parts.


A Real Analogy

Imagine you’re teaching a robot to cook:

  • If you teach it every single recipe, it will need a huge memory.

  • If you teach it basic cooking steps (chop, boil, fry), it can combine them to make any recipe.

Tokens are like basic cooking steps for language.


Example

Text: "ChatGPT is amazing!"
Possible Tokens: ["Chat", "G", "PT", " is", " amazing", "!"]

Here:

  • "Chat" is one token

  • "G" is another token

  • Spaces are preserved in " is"

  • "amazing" is split into " amazing" (with space) because spacing matters.


2. Why Do We Need Tokenization?

Tokenization is crucial in GenAI for four main reasons:

  1. AI models understand numbers, not text
    Each token is mapped to an integer ID.
    "Chat"1234, "G"987.

  2. Smaller vocabulary = More efficient learning
    Instead of remembering every possible word, the AI remembers chunks and combines them.

  3. Better handling of new words
    "Blockchain" may not be in the vocabulary, but "Block" + "chain" will be.

  4. Performance and Cost
    APIs like OpenAI charge per token. Smaller prompts = lower cost & faster output.
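Here is a minimal sketch of that text-to-ID mapping in JavaScript. The vocabulary and IDs below are invented for illustration; real tokenizers ship with vocabularies of tens of thousands of entries.

// Toy vocabulary: token string → integer ID (IDs invented for illustration)
const vocab = new Map([
  ["Chat", 1234],
  ["G", 987],
  ["PT", 2898],
  [" is", 318],
  [" amazing", 8056],
  ["!", 0],
]);

// Look up each token's ID; a real tokenizer also handles unknown tokens
const tokens = ["Chat", "G", "PT", " is", " amazing", "!"];
const ids = tokens.map((t) => vocab.get(t));

console.log(ids); // [1234, 987, 2898, 318, 8056, 0]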


3. Types of Tokenization

Generative AI models don’t all tokenize in the same way. Here are the main approaches:


A. Word-Level Tokenization

  • Breaks on spaces/punctuation.

  • Example:
    "I love AI"["I", "love", "AI"]

  • Pros: Simple to understand.

  • Cons: Handles new or rare words poorly. A word like "unfathomable" that never appeared in training maps to a single unknown token.


B. Character-Level Tokenization

  • Every single character is a token.

  • Example:
    "AI"["A", "I"]

  • Pros: Works with any text.

  • Cons: Sentences become very long (slow to process).
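For reference, a character-level tokenizer is a one-liner in JavaScript. Spreading the string iterates by Unicode code point, so an emoji stays one token instead of being split into surrogate halves.

function charTokenizer(text) {
  // Spreading iterates by code point, so "🚀" stays one token
  return [...text];
}

console.log(charTokenizer("AI"));    // ["A", "I"]
console.log(charTokenizer("AI 🚀")); // ["A", "I", " ", "🚀"]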


C. Subword-Level Tokenization (The industry standard for GPT, LLaMA, etc.)

  • Splits words into frequent chunks found in training.

  • Common algorithms:

    • BPE (Byte Pair Encoding)

    • WordPiece

    • SentencePiece

  • Example:
    "ChatGPT"["Chat", "G", "PT"]

  • Pros:

    • Works for any text

    • Handles new words

    • Vocabulary stays small


4. Real-World Applications of Tokenization

Tokenization isn’t just theory — it’s used in nearly every AI system you interact with daily:

  1. Chatbots – Converting your message into tokens so the AI can process it.

  2. Search Engines – Splitting your query into searchable terms.

  3. Machine Translation – Breaking sentences into tokens for accurate translation.

  4. Code Generation – Tokenizing source code for programming assistants like Copilot.

  5. Speech-to-Text – Tokenizing recognized speech before AI processes it.


5. Common Challenges in Tokenization

Even with advanced algorithms, tokenization has real challenges:

  1. Languages without spaces (Chinese, Japanese) → Token boundaries are harder to detect.

  2. Emojis and special symbols → Some tokenizers can break them incorrectly.

  3. Token Limits → GPT models have maximum token capacities (e.g., GPT-4 Turbo: 128k tokens); see the guard sketch after this list.

  4. High Cost → Large inputs = higher API bills.

  5. Loss of Meaning → Incorrect splits can change interpretation.
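One practical guard for challenge 3 is to count tokens before sending a request. Below is a minimal sketch using the @dqbd/tiktoken library from section 7B; the 128,000 limit mirrors the GPT-4 Turbo figure above, but check your model's documentation for the actual value.

import { encoding_for_model } from "@dqbd/tiktoken";

// Returns true if the prompt fits within the given context window.
// The limit is a parameter because it varies per model.
function fitsContext(text, modelLimit) {
  const encoder = encoding_for_model("gpt-4");
  const count = encoder.encode(text).length;
  encoder.free();
  return count <= modelLimit;
}

console.log(fitsContext("ChatGPT is amazing!", 128000)); // true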


6. Simple Example: Tokenization in Action (Dry Run)

Let’s tokenize:

"ChatGPT is amazing!"

Step 1 – Vocabulary Lookup

The tokenizer has a pre-built vocabulary:

"Chat"     → 101
"G"        → 102
"PT"       → 103
" is"      → 104
" amazing" → 105
"!"        → 106

Step 2 – Tokenize

Tokens: ["Chat", "G", "PT", " is", " amazing", "!"]

Step 3 – Convert to IDs

Token IDs: [101, 102, 103, 104, 105, 106]

Step 4 – Model Processes IDs

The AI never sees the raw text — it works only with these integers.


Step 5 – Decode Output

Model’s output token IDs → Converted back to human text.
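The whole dry run fits in a few lines of JavaScript. This toy encoder uses greedy longest-match against the vocabulary above (the IDs are illustrative, not real GPT IDs); production tokenizers apply learned merge rules instead, but the encode/decode round trip is the same idea.

const vocab = { "Chat": 101, "G": 102, "PT": 103, " is": 104, " amazing": 105, "!": 106 };
const idToToken = Object.fromEntries(
  Object.entries(vocab).map(([tok, id]) => [id, tok])
);

// Greedy longest-match: at each position, take the longest vocabulary entry that fits
function encode(text) {
  const ids = [];
  let i = 0;
  while (i < text.length) {
    let match = null;
    for (const tok of Object.keys(vocab)) {
      if (text.startsWith(tok, i) && (!match || tok.length > match.length)) {
        match = tok;
      }
    }
    if (!match) throw new Error(`No token matches at position ${i}`);
    ids.push(vocab[match]);
    i += match.length;
  }
  return ids;
}

function decode(ids) {
  return ids.map((id) => idToToken[id]).join("");
}

const ids = encode("ChatGPT is amazing!");
console.log(ids);         // [101, 102, 103, 104, 105, 106]
console.log(decode(ids)); // "ChatGPT is amazing!"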


7. JavaScript Code Examples


A. Simple Space Tokenizer (for learning only)

function simpleTokenizer(text) {
  // Split on runs of whitespace; trim first so leading spaces don't create an empty token
  return text.trim().split(/\s+/);
}

console.log(simpleTokenizer("ChatGPT is amazing!"));
// ["ChatGPT", "is", "amazing!"]

B. Using tiktoken for Real GPT Tokenization

// npm install @dqbd/tiktoken
import { encoding_for_model } from "@dqbd/tiktoken";

const text = "ChatGPT is amazing!";
const encoder = encoding_for_model("gpt-3.5-turbo");

const tokens = encoder.encode(text); // Uint32Array of token IDs
console.log("Tokens:", tokens);
console.log("Token Count:", tokens.length);
// decode() returns UTF-8 bytes, so convert them back into a string
console.log("Decoded:", new TextDecoder().decode(encoder.decode(tokens)));

encoder.free();

C. Token Counting for Cost

// Price per 1,000 tokens varies by model and by input vs. output,
// so check your provider's current pricing page
function estimateCost(tokenCount, pricePerThousand) {
  return (tokenCount / 1000) * pricePerThousand;
}

console.log("Estimated Cost:", estimateCost(100, 0.002), "USD");
// Example: 100 tokens at $0.002/1k → 0.0002 USD

8. Advanced Concepts in Tokenization

1. Byte Pair Encoding (BPE)

  • Starts from single characters.

  • Merges the most frequent pairs.

  • Gradually builds larger chunks like "ing", "tion" (a merge-learning sketch follows below).
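
Here is a minimal sketch of BPE merge learning on a toy four-word corpus. Real tokenizers like GPT's operate on bytes over enormous corpora, but the core loop is the same: count adjacent pairs, merge the most frequent one, repeat.

// Each word starts as a list of characters
let corpus = ["low", "lower", "lowest", "newest"].map((w) => w.split(""));

// Count how often each adjacent symbol pair occurs across the corpus
function pairCounts(words) {
  const counts = new Map();
  for (const word of words) {
    for (let i = 0; i < word.length - 1; i++) {
      const pair = word[i] + "\u0000" + word[i + 1]; // separator avoids ambiguity
      counts.set(pair, (counts.get(pair) || 0) + 1);
    }
  }
  return counts;
}

// Merge every occurrence of the given pair into a single symbol
function mergePair(words, [a, b]) {
  return words.map((word) => {
    const out = [];
    for (let i = 0; i < word.length; i++) {
      if (word[i] === a && word[i + 1] === b) {
        out.push(a + b);
        i++; // skip the second half of the merged pair
      } else {
        out.push(word[i]);
      }
    }
    return out;
  });
}

// Learn 3 merges: each round merges the currently most frequent pair
for (let step = 0; step < 3; step++) {
  const counts = pairCounts(corpus);
  const best = [...counts.entries()].sort((x, y) => y[1] - x[1])[0][0];
  corpus = mergePair(corpus, best.split("\u0000"));
  console.log(`Merge ${step + 1}:`, best.replace("\u0000", "+"), JSON.stringify(corpus));
}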


2. SentencePiece

  • Treats input as one raw stream of text (no pre-splitting on whitespace); spaces are encoded as ordinary symbols.

  • Can handle multiple languages without special rules.


3. Detokenization

  • Reverse process — turning token IDs back into readable text.

4. Token Streaming

  • Model sends tokens as they’re generated (good for live chats).
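
Streaming is easy to picture with an async generator. The sketch below only simulates a model emitting tokens with a delay; with a real API you would iterate over its streaming response instead (process.stdout.write assumes Node.js).

// Simulated model: yields one token at a time with a short delay
async function* fakeModelStream(tokens) {
  for (const token of tokens) {
    await new Promise((resolve) => setTimeout(resolve, 100));
    yield token;
  }
}

async function main() {
  // Render each token as it arrives instead of waiting for the full reply
  for await (const token of fakeModelStream(["Chat", "G", "PT", " is", " amazing", "!"])) {
    process.stdout.write(token);
  }
}

main();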

5. Special Tokens

  • <BOS> – Beginning of sequence

  • <EOS> – End of sequence

  • <PAD> – Padding
    Used internally by models for structure.
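
A small sketch of how these special tokens frame a batch. The IDs below are invented; every model defines its own special tokens and IDs.

const BOS = 1, EOS = 2, PAD = 0; // invented IDs for illustration

// Wrap each sequence in <BOS>/<EOS>, then pad the batch to a uniform length
function padBatch(sequences) {
  const wrapped = sequences.map((ids) => [BOS, ...ids, EOS]);
  const maxLen = Math.max(...wrapped.map((s) => s.length));
  return wrapped.map((s) => [...s, ...Array(maxLen - s.length).fill(PAD)]);
}

console.log(padBatch([[101, 102], [101, 102, 103, 104]]));
// [[1, 101, 102, 2, 0, 0], [1, 101, 102, 103, 104, 2]]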


9. Conclusion

Tokenization is the foundation of Generative AI — without it, models wouldn’t understand or produce meaningful output.

  • For efficiency, use subword tokenization.

  • For cost control, measure token counts before sending prompts.

  • For prompt engineering, remember: fewer tokens = faster & cheaper responses.

Understanding tokenization helps you design better AI applications, optimize costs, and avoid model truncation issues. It’s not just a preprocessing step — it’s part of how AI thinks.
