Reading 20k excel files per minute with Node

Recently I was working on an adhoc activity which involved writing a script that would extract some cell data from around 75k excel files, assume these files are readily available on your disk. Because these are user provided files, some of these are corrupt and can cause a range of issues when parsed. In an attempt to optimize the data extraction script, I ended up at a point where the final script ran 30X faster and handled errors gracefully. Let’s look at the approach.
Excel files are like bombs
Parsing excel files is like running user provided code on your system, when these files are read using excel libs, these can cause issues like heap overflow, blocking of main thread, unforeseen exceptions, etc. The most popular excel libraries in the JS ecosystem are SheetJs and ExcelJs, both of them face issues like blocking of main thread & heap overflow.
How do we read them then ?
Because these files can block the main thread (infinite loop), cause heap overflow and throw unknown exception during parsing, we need to run them in a container type environment, we can use NodeJs worker threads like a container. NodeJs allows us to specify the memory we want to allocate to the worker beyond which the worker would terminate. If the worker goes in infinite loop we can add a timer in the main thread which would terminate the worker if it doesn’t respond after a few seconds.
Approach 1: Spawn worker thread when reading file
We can limit the worker thread’s memory by specifying it in resourceLimits
property (see this) when creating the worker. If the worker goes in infinite loop, we need to specify a timeout after which we will terminate the worker. In NodeJs, a worker thread gets automatically terminated once the script inside it has completed running.
You can find the code for this approach here. When run on 20k excel files (with sizes ranging from 200 to 300KB), with 40 corrupt files, it completes in 40 minutes.
Approach 2: Re-use the same thread
In the previous approach we were spawning a new thread for each excel file, because the worker exited after running the script, to make the worker run indefinitely we can add a really long interval timer in the worker. Now we want to push another file into the worker as soon as it has completed processing the last file or thrown an exception.
How do we do it ?
We can use the EventEmitter
to inform other part of code that has to push a new file to worker for processing. With this we can have a completely even driven system which would process files, listen to events to make decisions of when to push file to worker, when to terminate worker, etc.
You can find the code for this approach here. When run on 20k excel files (with sizes ranging from 200 to 300KB), with 40 corrupt files, it completes in 5 minutes 🔥 . Wow, this is much better than our previous script’s performance.
Approach 3: Use all cores !
We are still not using all the resources, we can process files parallely in multiple threads, for an octacore system, we can spawn 7 parallel threads and leave the last core for the main thread.
This also adds complexity because now we want to manage a pool of 7 workers, manage their timeouts and send messages to correct worker which becomes available. Writing this kind of code using even driven methodology is really easy. Whenever a worker exists or sends any kind of message, we use it’s threadId to take an action.
You can find the code for this approach here. When run on 20k excel files (with sizes ranging from 200 to 300KB), with 40 corrupt files, it completes in around 1 minute 💀 .
The scripts were run on a Macbook Air M3 and took a maximum RAM of 600MB.
Conclusion
The approach is very similar to how message queues work in a distributed system, the queue consumers pull message from queue as soon as they have completed processing previous message, if a consumer/worker goes down, it can be restarted. Multiple consumers can pull from the messages queue without any delay. This is a good example of how good Nodes IO is, if used well, it can be used to complete IO tasks from hours to minutes.
The source code for the examples can be found in this repository, give it a star if you learned something interesting.
Subscribe to my newsletter
Read articles from Sudheer Tripathi directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Sudheer Tripathi
Sudheer Tripathi
I am a fullstack engineer and spend most of my time building stuff with NodeJs & PHP. My daily work involves, brainstorming, discovery, running peer reviews, writing feature code, unit tests, etc and helping the team where needed.