Counting tokens at scale using tiktoken

Dhaval Singh
2 min read

Tiktoken is one of the most popular tokenizers out there, and there's a really nice and simple cookbook that shows how to use it.

Recently I was optimizing our token counting function, which we use to chunk data before sending it to embedding models (needs a precise count), cut off older text (an approximate count is fine), and so on. One problem I ran into: for large strings, around 10MB, memory can spike a lot. With the 4o tokenizer, peak memory can go up to ~80MB and CPU wall time up to ~1.3s! This can become a real problem when your services run at scale.
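If you want to reproduce this yourself, here is a minimal sketch (the exact numbers will vary with your machine and tiktoken version, and tracemalloc only sees Python-level allocations, which is where most of the cost is anyway):

import time
import tracemalloc

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "some text " * 1_000_000  # roughly 10MB of input

tracemalloc.start()
start = time.perf_counter()
tokens = enc.encode(text)  # returns a plain Python list of ints
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"tokens={len(tokens)}, peak={peak / 1e6:.0f}MB, time={elapsed:.2f}s")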

There are two major ways to deal with this.

1. Use approximations wherever possible

For most GPT models, dividing the length of the text by 4 is good enough. For Claude the divisor is about 3.5, and for Gemini it is again roughly 4.

text = "Large string ....."
token_count = len(text) // 4  # ~4 characters per token for GPT models

For safety, you can keep some buffer depending on the kind of text you're dealing with (emojis, end tokens, etc.), but for most cases this will work fine.
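As a rough sketch of what that could look like (the helper name and the 10% buffer are just illustrative choices, not anything from tiktoken itself):

def approx_token_count(text: str, chars_per_token: float = 4.0, buffer: float = 0.10) -> int:
    # Cheap estimate: characters / chars-per-token, padded with a safety margin
    return int(len(text) / chars_per_token * (1 + buffer))

# ~4 chars/token for GPT models, ~3.5 for Claude
approx_token_count("Large string .....", chars_per_token=4.0)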

2. Use encode_to_numpy in tiktoken

Sadly, encode_to_numpy is not talked about enough. Not surprising, since it isn't mentioned in the docs. But as the docstring in the source says for encode_to_numpy:

"""Encodes a string into tokens, returning a numpy array.

Avoids the overhead of copying the token buffer into a Python list.”””

So if all you want to do is find the number of tokens, using encode_to_numpy like this

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "Large string ....."
token_count = enc.encode_to_numpy(text).shape[0]

is fast and efficient. It barely takes any extra memory or CPU, since the tokens stay in a contiguous numpy buffer instead of being copied into millions of Python int objects.
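And because the result is a regular numpy array, you can still slice it for the cut-off use case. Here is a rough sketch, assuming your tiktoken version has encode_to_numpy; the 8,000-token budget is just an example, and I convert the slice back to a list before decoding:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
old_text = "Large string ....."  # e.g. an older part of a conversation

tokens = enc.encode_to_numpy(old_text)
budget = 8_000  # example token budget
if tokens.shape[0] > budget:
    # keep only the most recent `budget` tokens, then decode back to text
    old_text = enc.decode(tokens[-budget:].tolist())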
