Building a Tokenizer in Python

Making My Own Tokenizer from Scratch in Python
I wanted to understand how tokenization works, so I tried building a simple version myself.
No libraries. Just plain Python.
It's a character-level tokenizer: it maps every character (spaces included) to a made-up token.
Step 1: Create a Custom Vocabulary
Each character has a code I made up:
```python
vocab = {
    'a': 'ht45',
    'b': 'rt345',
    'c': 'zz12',
    'd': 'lk78',
    'e': 'uv99',
    'f': 'wx21',
    'g': 'mn03',
    'h': 'aa22',
    'i': 'bb33',
    'j': 'cc44',
    'k': 'dd55',
    'l': 'ee66',
    'm': 'ff77',
    'n': 'gg88',
    'o': 'hh99',
    'p': 'ii11',
    'q': 'jj22',
    'r': 'kk33',
    's': 'll44',
    't': 'mm55',
    'u': 'nn66',
    'v': 'oo77',
    'w': 'pp88',
    'x': 'qq99',
    'y': 'rr00',
    'z': 'ss11',
    ' ': 'spc99'
}
```
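One thing worth checking (not something I covered originally): decoding can only be lossless if no two characters share a code. A quick sanity check, shown here with a trimmed-down copy of the vocab for illustration:

```python
# Trimmed-down copy of the vocab above, for illustration.
vocab = {'a': 'ht45', 'b': 'rt345', 'c': 'zz12', ' ': 'spc99'}

# Collapsing the codes into a set drops duplicates,
# so equal sizes mean every code is unique.
assert len(set(vocab.values())) == len(vocab)
print("all codes unique")
```

Run it against the full dict and it passes too, which is what makes the decode step in Step 3 work.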
Step 2: Encoding Text
This function turns text into tokens:
```python
def encode(text):
    result = []
    for c in text:
        if 'A' <= c <= 'Z':
            c = chr(ord(c) + 32)  # convert to lowercase
        if c in vocab:
            result.append(vocab[c])
        else:
            result.append('<unk>')
    return result
```
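To see the uppercase handling and the `<unk>` fallback in action, here's a small self-contained sketch. It uses a trimmed-down vocab (the codes match the table above) so it runs on its own:

```python
# Trimmed-down vocab for illustration; codes match the full table above.
vocab = {'h': 'aa22', 'i': 'bb33', ' ': 'spc99'}

def encode(text):
    result = []
    for c in text:
        if 'A' <= c <= 'Z':
            c = chr(ord(c) + 32)  # shift uppercase into the lowercase range
        if c in vocab:
            result.append(vocab[c])
        else:
            result.append('<unk>')  # anything outside the vocab becomes <unk>
    return result

print(encode("Hi!"))  # → ['aa22', 'bb33', '<unk>']
```

The `'H'` gets lowercased and encoded, while `'!'` isn't in the vocab, so it comes out as `<unk>`.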
Step 3: Decoding Back
This function maps the tokens back into text:
```python
def decode(tokens):
    rev = {}
    for key in vocab:
        rev[vocab[key]] = key  # invert the vocab: code -> character
    out = ''
    for t in tokens:
        if t in rev:
            out += rev[t]
        else:
            out += '?'
    return out
```
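The reverse-map loop can also be written as a one-line dict comprehension — same result, just more compact. A sketch with a trimmed-down vocab:

```python
# Trimmed-down vocab for illustration.
vocab = {'h': 'aa22', 'i': 'bb33', ' ': 'spc99'}

# Same inversion as the loop above, as a dict comprehension: code -> character.
rev = {code: ch for ch, code in vocab.items()}

print(rev['aa22'])  # → h
```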
Example
```python
msg = "hello ashwin hegde"
enc = encode(msg)
print("Encoded:", enc)
dec = decode(enc)
print("Decoded:", dec)
```
Output:
```
Encoded: ['aa22', 'uv99', 'ee66', 'ee66', 'hh99', 'spc99', 'ht45', 'll44', 'aa22', 'pp88', 'bb33', 'gg88', 'spc99', 'aa22', 'uv99', 'mn03', 'lk78', 'uv99']
Decoded: hello ashwin hegde
```
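One detail the example hides: since `encode()` lowercases its input, `decode(encode(s))` gives back the lowercased text, not necessarily the original string. A self-contained sketch (trimmed vocab, same logic as the functions above):

```python
# Trimmed vocab for illustration; encode/decode follow the same logic as above.
vocab = {'h': 'aa22', 'i': 'bb33', ' ': 'spc99'}

def encode(text):
    out = []
    for c in text:
        if 'A' <= c <= 'Z':
            c = chr(ord(c) + 32)  # lowercase before lookup
        out.append(vocab.get(c, '<unk>'))
    return out

def decode(tokens):
    rev = {code: ch for ch, code in vocab.items()}
    return ''.join(rev.get(t, '?') for t in tokens)

print(decode(encode("Hi hi")))  # → 'hi hi' (case is lost)
```

So the roundtrip is exact only for text that was already lowercase.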
Summary
This was a small project I did to learn how tokenizers work at the basic level. It helped me understand the process better.
That’s all for now — thanks for reading!
Written by Ashwin Hegde