5-Minute Coffee Tip #2: Use Command Line for Efficient CSV File Splitting

Shani Rivers
2 min read

When I want to use recent data from SimFin, or when I have to do data cleanup at work, the CSV files are sometimes too big to open directly (300+ MB, several million records). The only practical way to work with that much data is on the command line.

No spreadsheet program can handle that many records (Excel, for instance, tops out at 1,048,576 rows). If you don't believe me, go ahead and watch how spectacularly it crashes.

So I typically have to split them into smaller files.

This is how I did it.

Unzip it into a directory

I unzipped the file us-shareprices-daily.csv and saved that huge sucker in the directory where I also wanted my split files to live. The smaller files get created on the command line, since there's no realistic way to do it manually (and why would you 🤨).
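
For completeness, that step is just unzip; the zip file name here is my assumption based on the CSV name, and -d tells unzip which directory to extract into:

unzip us-shareprices-daily.zip -d process_csv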

I decided to split the file into chunks of 700,000 lines each.

Here’s the command to do that:

split -l 700000 -d us-shareprices-daily.csv us_spdaily_
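
A quick breakdown of the arguments (one caveat: -d for numeric suffixes is standard in GNU split, but the BSD split on older versions of macOS may not support it, in which case GNU coreutils' gsplit works the same way):

# -l 700000   : cap each output file at 700,000 lines
# -d          : number the outputs 00, 01, ... instead of aa, ab, ...
# us_spdaily_ : prefix for the generated file names
split -l 700000 -d us-shareprices-daily.csv us_spdaily_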

Then I was left with the following 8 files from the split:

process_csv % ls
us-shareprices-daily.csv
us_spdaily_00    us_spdaily_04
us_spdaily_01    us_spdaily_05
us_spdaily_02    us_spdaily_06
us_spdaily_03    us_spdaily_07
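
Each chunk holds at most 700,000 lines, with the last one getting whatever is left over; wc -l is a quick way to confirm:

wc -l us_spdaily_*    # prints a line count per chunk, plus a total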

Add CSV file suffix

After splitting, I looped through the new files and added a .csv suffix to each of them.

for i in us_spdaily_*; do mv "$i" "$i.csv"; done

process_csv % ls
us-shareprices-daily.csv
us_spdaily_00.csv    us_spdaily_04.csv
us_spdaily_01.csv    us_spdaily_05.csv
us_spdaily_02.csv    us_spdaily_06.csv
us_spdaily_03.csv    us_spdaily_07.csv

Now the files are all under 50MB.
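
If you want to double-check, ls -lh prints the sizes in human-readable units:

ls -lh us_spdaily_*.csv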

I deleted the original large file from the directory, as I didn't need it anymore.
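
That's just a plain rm:

rm us-shareprices-daily.csv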

process_csv % ls
us_spdaily_00.csv    us_spdaily_04.csv
us_spdaily_01.csv    us_spdaily_05.csv
us_spdaily_02.csv    us_spdaily_06.csv
us_spdaily_03.csv    us_spdaily_07.csv

After the split, only the first file, us_spdaily_00.csv, contains the header row, since the split simply took the first 700,000 lines of the original file.

I didn't want to open each of the remaining files and paste the header in manually, so I looped over them and prepended the header line from us_spdaily_00.csv to each one:

for i in $(find . -type f -name "us_spdaily_*.csv" -not -name "us_spdaily_00.csv"); do
    echo -e "$(head -1 us_spdaily_00.csv)\n$(cat "$i")" > "$i"
done
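
One thing to note: this reads each whole file into a command substitution before writing it back, which worked fine for files this size. If that makes you nervous, a temp-file version of the same idea avoids holding a file in memory (this variant is a sketch of an alternative, not what I actually ran):

for i in us_spdaily_*.csv; do
    [ "$i" = "us_spdaily_00.csv" ] && continue   # first chunk already has the header
    { head -1 us_spdaily_00.csv; cat "$i"; } > "$i.tmp" && mv "$i.tmp" "$i"
done

Either way, a quick head -1 on any chunk confirms the header is in place.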

To store the split files, I found out that GitHub also doesn't like anything larger than 50 MB (it warns at 50 MB and blocks files over 100 MB), so I installed Git Large File Storage (Git LFS) using Homebrew. That way I can push them to my repository without GitHub freaking out, just in case.
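
For reference, the basic Git LFS setup is only a few commands (run inside the repo):

brew install git-lfs      # install Git LFS via Homebrew
git lfs install           # set up the LFS hooks for your account
git lfs track "*.csv"     # tell LFS to manage CSV files (writes .gitattributes)
git add .gitattributes    # make sure the tracking rule gets committed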

But being able to do all of this with a few commands is great when you need it done quickly, since many repositories and upload APIs balk at files larger than 25 MB.

A self-congratulatory sip of coffee never tasted so good. 😌☕️
