5-Minute Coffee Tip #2: Use Command Line for Efficient CSV File Splitting


When I want to use recent data from SimFin, or when I have to do data cleanup at work, the CSV files are sometimes just too big (300MB+, several million records), and the only practical way to work with that much data is the command line.
No spreadsheet program can handle that many records - if you don’t believe me, go ahead and watch how spectacularly it will crash.
So I typically have to split it into smaller files.
This is how I did it.
Unzip it into a directory
I unzipped the file us-shareprices-daily.csv and saved that huge sucker in the same directory where I wanted the split files to live. The smaller files will be created from the command line, since there's no way to do this manually (and why would you 🤨).
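For what it's worth, the unzip itself is one command. A minimal sketch, assuming the SimFin download is a zip archive named us-shareprices-daily.zip (the archive name is an assumption; adjust it to whatever you actually downloaded):
# Extract the CSV into the working directory for the split files
unzip us-shareprices-daily.zip -d process_csv
cd process_csv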
I decided to split the files into 700,000 lines per file.
Here’s the command to do that:
split -l 700000 -d us-shareprices-daily.csv us_spdaily_
Then I was left with the following 8 files from the split:
process_csv % ls
us-shareprices-daily.csv
us_spdaily_00 us_spdaily_04
us_spdaily_01 us_spdaily_05
us_spdaily_02 us_spdaily_06
us_spdaily_03 us_spdaily_07
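(A side note: depending on your version of macOS, the stock BSD split may not support the -d flag for numeric suffixes; the GNU version, available as gsplit from Homebrew's coreutils, does.)
A quick sanity check I'd suggest after splitting is comparing line counts. When given multiple files, wc -l prints a grand total, and that total should match the original file's count:
# The chunks' "total" line should equal the original's line count
wc -l us-shareprices-daily.csv
wc -l us_spdaily_*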
Add CSV file suffix
After splitting, I looped through the new files and added the .csv suffix to each of them.
for i in us_spdaily_*; do mv "$i" "$i.csv"; done
process_csv % ls
us-shareprices-daily.csv
us_spdaily_00.csv us_spdaily_04.csv
us_spdaily_01.csv us_spdaily_05.csv
us_spdaily_02.csv us_spdaily_06.csv
us_spdaily_03.csv us_spdaily_07.csv
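Bulk renames are hard to undo, so if you're ever unsure, you can dry-run a loop like that first by echoing each command instead of executing it:
# Dry run: print the mv commands without running them
for i in us_spdaily_*; do echo mv "$i" "$i.csv"; done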
Now the files are all under 50MB.
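If you want to confirm that, du prints a human-readable size for each chunk:
# Human-readable size of each split file
du -h us_spdaily_*.csv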
I deleted the original large file from the directory, since I didn't need it anymore.
process_csv % ls
us_spdaily_00.csv us_spdaily_04.csv
us_spdaily_01.csv us_spdaily_05.csv
us_spdaily_02.csv us_spdaily_06.csv
us_spdaily_03.csv us_spdaily_07.csv
Add the header row to the other files
After the split, only the first file, us_spdaily_00.csv, contained the header row, since it inherited the first line of the original CSV. So I copied that header line and prepended it to each of the other, smaller files so that they all contain the header too, because I didn't want to go into each and every one of them and do it manually.
# Prepend the header from the first chunk to every other chunk
header=$(head -n 1 us_spdaily_00.csv)
for i in us_spdaily_*.csv; do
  [ "$i" = "us_spdaily_00.csv" ] && continue
  printf '%s\n%s\n' "$header" "$(cat "$i")" > "$i"
done
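To verify it worked, peek at the first line of every chunk; head accepts multiple files and labels each one:
# Every file should now start with the same header line
head -n 1 us_spdaily_*.csv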
As for saving large files: I found out that GitHub also complains about anything larger than 50MB (and outright rejects files over 100MB), so I installed Git Large File Storage (LFS) using Homebrew, just in case, so I can push big files to my repository without GitHub freaking out.
Still, being able to split a file with a few commands is great for doing it quickly, since plenty of upload forms and APIs balk at files over 25MB anyway.
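For reference, the LFS setup is only a few commands. A sketch assuming Homebrew and an existing git repository (the tracking pattern is just an illustration):
# One-time setup: install Git LFS and hook it into git
brew install git-lfs
git lfs install
# Track the split CSVs so pushes don't trip GitHub's size limits
git lfs track "us_spdaily_*.csv"
git add .gitattributes us_spdaily_*.csv
git commit -m "Track split share-price CSVs with Git LFS"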
A self-congratulatory sip of coffee never tasted so good. 😌☕️