Building a Tokenizer in Python

Making My Own Tokenizer from Scratch in Python
I wanted to understand how tokenization works, so I tried building a simple version myself.
No libraries. Just plain Python.
It's a character-level tokenizer: it maps every character (spaces included) to a made-up token.
Step 1: Create a Custom Vocabulary
Each character has a code I made up:
```python
vocab = {
    'a': 'ht45',
    'b': 'rt345',
    'c': 'zz12',
    'd': 'lk78',
    'e': 'uv99',
    'f': 'wx21',
    'g': 'mn03',
    'h': 'aa22',
    'i': 'bb33',
    'j': 'cc44',
    'k': 'dd55',
    'l': 'ee66',
    'm': 'ff77',
    'n': 'gg88',
    'o': 'hh99',
    'p': 'ii11',
    'q': 'jj22',
    'r': 'kk33',
    's': 'll44',
    't': 'mm55',
    'u': 'nn66',
    'v': 'oo77',
    'w': 'pp88',
    'x': 'qq99',
    'y': 'rr00',
    'z': 'ss11',
    ' ': 'spc99'
}
```
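One thing worth checking (not something I covered originally): decoding can only be lossless if no two characters share a code. A quick sanity check, shown here with a trimmed-down copy of the vocab for illustration:

```python
# Trimmed-down copy of the vocab above, for illustration.
vocab = {'a': 'ht45', 'b': 'rt345', 'c': 'zz12', ' ': 'spc99'}

# Collapsing the codes into a set drops duplicates,
# so equal sizes mean every code is unique.
assert len(set(vocab.values())) == len(vocab)
print("all codes unique")
```

Run it against the full dict and it passes too, which is what makes the decode step in Step 3 work.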
Step 2: Encoding Text
This function turns text into tokens:
```python
def encode(text):
    result = []
    for c in text:
        if 'A' <= c <= 'Z':
            c = chr(ord(c) + 32)  # convert to lowercase
        if c in vocab:
            result.append(vocab[c])
        else:
            result.append('<unk>')
    return result
```
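To see the uppercase handling and the `<unk>` fallback in action, here's a small self-contained sketch. It uses a trimmed-down vocab (the codes match the table above) so it runs on its own:

```python
# Trimmed-down vocab for illustration; codes match the full table above.
vocab = {'h': 'aa22', 'i': 'bb33', ' ': 'spc99'}

def encode(text):
    result = []
    for c in text:
        if 'A' <= c <= 'Z':
            c = chr(ord(c) + 32)  # shift uppercase into the lowercase range
        if c in vocab:
            result.append(vocab[c])
        else:
            result.append('<unk>')  # anything outside the vocab becomes <unk>
    return result

print(encode("Hi!"))  # → ['aa22', 'bb33', '<unk>']
```

The `'H'` gets lowercased and encoded, while `'!'` isn't in the vocab, so it comes out as `<unk>`.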
Step 3: Decoding Back
This function maps the tokens back into text:
```python
def decode(tokens):
    rev = {}
    for key in vocab:
        rev[vocab[key]] = key  # invert the vocab: code -> character
    out = ''
    for t in tokens:
        if t in rev:
            out += rev[t]
        else:
            out += '?'
    return out
```
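The reverse-map loop can also be written as a one-line dict comprehension — same result, just more compact. A sketch with a trimmed-down vocab:

```python
# Trimmed-down vocab for illustration.
vocab = {'h': 'aa22', 'i': 'bb33', ' ': 'spc99'}

# Same inversion as the loop above, as a dict comprehension: code -> character.
rev = {code: ch for ch, code in vocab.items()}

print(rev['aa22'])  # → h
```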
Example
```python
msg = "hello ashwin hegde"
enc = encode(msg)
print("Encoded:", enc)
dec = decode(enc)
print("Decoded:", dec)
```
Output:
```
Encoded: ['aa22', 'uv99', 'ee66', 'ee66', 'hh99', 'spc99', 'ht45', 'll44', 'aa22', 'pp88', 'bb33', 'gg88', 'spc99', 'aa22', 'uv99', 'mn03', 'lk78', 'uv99']
Decoded: hello ashwin hegde
```
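One detail the example hides: since `encode()` lowercases its input, `decode(encode(s))` gives back the lowercased text, not necessarily the original string. A self-contained sketch (trimmed vocab, same logic as the functions above):

```python
# Trimmed vocab for illustration; encode/decode follow the same logic as above.
vocab = {'h': 'aa22', 'i': 'bb33', ' ': 'spc99'}

def encode(text):
    out = []
    for c in text:
        if 'A' <= c <= 'Z':
            c = chr(ord(c) + 32)  # lowercase before lookup
        out.append(vocab.get(c, '<unk>'))
    return out

def decode(tokens):
    rev = {code: ch for ch, code in vocab.items()}
    return ''.join(rev.get(t, '?') for t in tokens)

print(decode(encode("Hi hi")))  # → 'hi hi' (case is lost)
```

So the roundtrip is exact only for text that was already lowercase.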
Summary
This was a small project I did to learn how tokenizers work at the basic level. It helped me understand the process better.
That’s all for now — thanks for reading!
Written by Ashwin Hegde