What I learned about PyPI from maintaining an Open-Source Package


Last month, I published a package on PyPI - TezzCrawler - a simple CLI tool that converts any website into LLM-ready drafts for building RAG capabilities on top of it. What spiraled next was hundreds of hours of analysis into how PyPI works and what happens in the background when you publish a package on PyPI.
The issue that sent me down a rabbit hole
A few weeks ago, I got curious about how the package was doing. Had anyone other than me even downloaded it, or was it just another piece of junk in the discarded pile of hundreds of thousands of PyPI packages? So I copied a simple Python script from StackOverflow to get the total download stats, and got this:
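The StackOverflow snippet isn't reproduced here, but a minimal version might look like the sketch below. It assumes the public pypistats.org JSON API (the `overall` endpoint and its `category`/`downloads` response fields are my reading of that API, not code from the original post):

```python
import json
from urllib.request import urlopen

def total_downloads(records):
    """Sum the 'downloads' field across a list of pypistats API records."""
    return sum(r["downloads"] for r in records)

def fetch_overall(package):
    """Fetch daily download records for a package from pypistats.org."""
    url = f"https://pypistats.org/api/packages/{package}/overall"
    with urlopen(url) as resp:
        return json.load(resp)["data"]

# Usage (hits the network, so shown as a comment):
# data = fetch_overall("tezzcrawler")
# without = [r for r in data if r["category"] == "without_mirrors"]
# print(total_downloads(without))
```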
Well, to be honest, that is a few more than I expected. When I ran the script, I was expecting a number like 10 or 15. Never in my life had I thought it would be more than 3,400!
As you'd imagine, I was elated to see a number like this. I didn't give it much thought until a few days later. TezzCrawler works fine and gets the job done, but it is neither the only package that solves this problem nor the fastest crawler. Most importantly, I hadn't mentioned to anyone that I had published it; nobody other than me even knew it existed for the past month. So how did a tool with zero marketing get this many downloads? I'd sure want to know, if only to replicate the results in other projects.
And then began a deep dive that eventually left me about $10 poorer.
PyPI Stats
There are a number of analytics tools where you can check your package's statistics. Officially, four are promoted, but only two of them provide download numbers as part of their statistics (and both gave different numbers, and neither matched the numbers from the API 😭).
ClickHouse, an open-source data warehousing tool, maintains a dashboard called ClickPy for analysing any Python package's statistics. Searching for my package there gave me a cumulative download count of 2,400 - a staggering difference of about 1,000 from what the API reported!
Oh cool, the package is highly popular in the US, Canada, China, and Russia.
Anyway, this sparked another debate in my head: why are the numbers so different?
To get to the bottom of it once and for all, I went to PyPI Stats, hoping I'd find all my answers there (since they are the ones that provide the Python API I had used in the first place). But I left with more questions than I started with.
PyPI Stats doesn't provide a cumulative download count, only daily, weekly, and monthly counts. But look closely: the last-month count per PyPI Stats is 861, while the ClickPy dashboard screenshot from before shows a monthly download count of 2,100. (WTF is that difference?)
This also introduced a new term - Mirrors. I knew what mirrors are; I just didn't expect PyPI to be using them (either I was stupid then or I am stupid now for not realizing this earlier 🥲).
Here's a key detail: anybody can boot up their own PyPI server. Such a server is what's called a Mirror. Depending on how many of the packages from the main PyPI server are reflected onto it, mirrors fall into three categories - Private, Partial, or Public. Public Mirrors are exact 1:1 replicas of PyPI, while Private Mirrors are hosted internally by companies for the packages they allow their teams to use in development.
There's a tool called Bandersnatch that lets you replicate PyPI packages to your own Mirror. (This information will come in handy in a minute.)
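For a flavor of how that works, here is a minimal `bandersnatch.conf` sketch based on my reading of the Bandersnatch docs - the exact key and plugin names vary between versions, and the directory and package values are placeholders:

```ini
[mirror]
; where the mirrored packages land on disk
directory = /srv/pypi
; the upstream server to replicate from
master = https://pypi.org
timeout = 10
workers = 3

; mirror only selected packages instead of all of PyPI (a Partial Mirror)
[plugins]
enabled =
    allowlist_project

[allowlist]
packages =
    tezzcrawler
```

Running something like `bandersnatch -c bandersnatch.conf mirror` then syncs the allowed packages down from PyPI - and, as we'll see, that sync itself shows up somewhere in the stats.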
So, the download count on PyPI Stats excludes Mirrors: 861 downloads without Mirrors and 2,100 with Mirrors. That means roughly 1,300 of last month's "downloads" were mirrors syncing the package, and 861 actual downloads happened. I had the actual download count I was after and could have called it a day.
But 861 downloads is still a lot more than the 10-15 I was expecting! Did I miss something in the analysis?
The analysis where I lost my money
Okay, so far I've learned three things: anybody can create a Mirror, and a sync between a Mirror and the main PyPI server can trigger a "download" in the PyPI stats; downloads from a Mirror don't reflect in the PyPI stats at all; and ClickPy is a waste of a dashboard.
But the main quest - "How many downloads did my package actually get?" - was still unfulfilled. At this stage, I found the holy grail: PyPI migrates all of its logs to a BigQuery dataset.
PyPI's BigQuery dataset consists of three tables - package metadata, download events, and download request metadata. Running even a really simple query like the one below was a little out of my budget (it was going to process 15 TB of data), and I didn't want to burn everything on one query.
SELECT COUNT(*)
FROM `bigquery-public-data.pypi.file_downloads`
WHERE file.project = 'tezzcrawler'
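You can find out how much a query will scan before paying for it: BigQuery supports dry runs, which report the bytes a query would process without executing it. A sketch using the `google-cloud-bigquery` client - the pricing figure of roughly $6.25 per TiB of on-demand scanning is my assumption, so check current pricing for your region:

```python
def estimated_cost_usd(bytes_processed, usd_per_tib=6.25):
    """Convert a dry-run byte count into an approximate on-demand query cost."""
    return bytes_processed / 2**40 * usd_per_tib

# Usage (needs GCP credentials, so shown as a comment):
# from google.cloud import bigquery
# client = bigquery.Client()
# job = client.query(
#     "SELECT COUNT(*) FROM `bigquery-public-data.pypi.file_downloads` "
#     "WHERE file.project = 'tezzcrawler'",
#     job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
# )
# print(f"~${estimated_cost_usd(job.total_bytes_processed):.2f}")
```

At ~15 TiB scanned, that works out to roughly $90 for the naive query - which is why narrowing the scan matters.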
So I had to get a little creative.
SELECT
  COUNT(*) AS download_frequency,
  DATE_TRUNC(DATE(timestamp), MONTH) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE project = 'tezzcrawler'
GROUP BY month
ORDER BY month DESC
Running the query, I get… 2,408 downloads. Exactly what ClickPy reports. But wait - I can also segregate by how the package was downloaded.
SELECT
  details.installer.name AS installer,
  COUNT(*) AS download_frequency
FROM `bigquery-public-data.pypi.file_downloads`
WHERE project = 'tezzcrawler'
GROUP BY installer
ORDER BY download_frequency DESC
To save on costs while writing this article, I added a time filter to all the queries before taking screenshots of the results. The query above was executed over a 30-day period.
Here’s what I found on executing the above query.
This is more in line with what I was actually expecting. The actual downloads number 39, all via the pip install command. The rest is other activity, with Bandersnatch being a Mirror syncing the package. My guess is that the remainder is bots monitoring PyPI packages directly on the main server instead of going through a Mirror. Unfortunately, the user agent is not tracked in the download request - only the TLS protocol is - and that doesn't tell much about the remaining downloads.
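That filtering logic can be made explicit. A small sketch that splits installer counts from the query result into real installs versus mirror traffic - the set of mirror installer names and the sample counts here are my own assumptions, not values from the dataset:

```python
# Installers that indicate mirroring tools rather than human installs
# (assumed names; extend for your own data).
MIRROR_INSTALLERS = {"bandersnatch", "devpi"}

def real_installs(installer_counts):
    """Sum counts for human-driven installers (pip, poetry, etc.),
    skipping known mirroring tools and rows with no installer name."""
    return sum(
        count
        for installer, count in installer_counts.items()
        if installer is not None and installer.lower() not in MIRROR_INSTALLERS
    )

# e.g. real_installs({"pip": 39, "bandersnatch": 2300, None: 69}) -> 39
```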
So, there we have it: 39 downloads in the last month for TezzCrawler. The actual count, the honest number - not what those dashboards were feeding my ego earlier. Now I can have some peace, but more importantly, I've learned a valuable lesson about data accuracy and the nuances of open-source package distribution.
For fellow developers and open-source enthusiasts, take this away: don't be misled by vanity metrics. Dig deeper, question the numbers, and understand the ecosystem your project operates within. It might not always be as glamorous as thousands of downloads, but the true measure of your project's impact lies in its genuine usage and the value it brings to its users. With this newfound understanding, I'm excited to focus on what really matters: improving TezzCrawler for those 39 users and, hopefully, many more to come.
Every Thursday at 2:30 PM IST, I'll send one article straight to your mailbox. The next few articles cover things I learned while developing my own RAG framework - things I would definitely have missed had I stuck to pre-built frameworks, probably spending days debugging instead. If this resonates with you, sign up for the free newsletter.
Written by Japkeerat Singh
Hi, I am Japkeerat. I have been working as a Machine Learning Engineer since January 2020, straight out of college. During this period, I've worked on extremely challenging projects - security vulnerability detection using Graph Neural Networks, user segmentation to improve notification click-through rates, and MLOps infrastructure development for startups, to name a few. I keep my articles precise, with a maximum of 4 minutes of reading time. I'm currently actively writing 2 series - one for beginners in Machine Learning and another on more advanced concepts. The newsletter, if you subscribe, will send 1 article every Thursday on the advanced concepts.