Efficient Big Data Processing in Python: A Practical Guide Using Dask and Pandas

How I processed millions of Medicare records on a regular laptop without rage-quitting.

Ever stared at a 2GB CSV file like it was a boss-level enemy in a video game?

Same.

Working with large healthcare datasets can be like trying to sip from a firehose.

This post is for the brave souls who tried pandas.read_csv("your_soul.csv") and ended up with their laptops wheezing like an asthmatic goat. Here's how I handled a dataset that was thicc: not just in file size, but in sheer memory consumption.

In my case, I needed to filter, enrich, and analyze Medicare Provider and Service data — each CSV file several gigabytes in size.

That’s where tools like Dask, Pandas, and some smart processing logic came to the rescue.

🔧 Tools I Used (And Why They’re Awesome)

🐍 Python

Python is the natural choice for data processing. Clean syntax, rich ecosystem, and great community support.

🐼 Pandas

Ideal for working with moderate-sized tabular data once it has been filtered or aggregated. I used it to (quick sketch after the list):

  • Handle final cleaning

  • Merge smaller DataFrames

  • Export results to Excel
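
Here's roughly what that end-of-pipeline Pandas work looks like. A minimal sketch; the file and column names are made up for illustration:

import pandas as pd

# Hypothetical example: final cleanup on an already-filtered, laptop-sized file
df = pd.read_csv("filtered_providers.csv")

# Final cleaning: drop exact duplicates and tidy up a text column
df = df.drop_duplicates()
df["Provider Name"] = df["Provider Name"].str.strip().str.title()

# Merge a small lookup table, then hand the result to Excel
lookup = pd.read_csv("specialty_lookup.csv")
df = df.merge(lookup, on="Specialty Code", how="left")
df.to_excel("report.xlsx", index=False)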

🥷 Dask

The real MVP. Dask is like Pandas, but parallelized. It handles large datasets out-of-core (i.e., it doesn't load the whole dataset into RAM). I used Dask to (quick sketch after the list):

  • Load and filter multi-GB CSVs

  • Apply column-based operations

  • Work with lazy computation to avoid memory overload
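
Putting those together, a typical Dask pipeline looks something like this. A minimal sketch, with placeholder column names rather than the real Medicare schema:

import dask.dataframe as dd

# Nothing is read into RAM yet; Dask just records the plan
df = dd.read_csv("huge_claims.csv", assume_missing=True)

# Lazy filter + lazy aggregation (still no work done)
tx = df[df["State"] == "TX"]
avg_charge = tx.groupby("Provider Type")["Avg Charge"].mean()

# .compute() runs the whole plan, chunk by chunk
result = avg_charge.compute()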

CSV + Excel

I processed and exported the data to CSV first (much faster to write), then converted it to Excel so non-technical stakeholders could actually open it.

🧠 The Setup

First things first, if you’re dealing with huge CSVs: don’t use Pandas for everything. Pandas is great. It’s like a Swiss Army knife. But you don’t build a skyscraper with a Swiss Army knife.

Use Dask. It’s like Pandas went to the gym.

import dask.dataframe as dd

df = dd.read_csv("massive_file.csv", assume_missing=True)

The assume_missing=True tells Dask, “Hey, treat everything like it might be missing. Because life is unpredictable.”
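
More precisely, assume_missing=True makes Dask treat columns it would otherwise guess as integers as floats, so a NaN showing up halfway through the file doesn't crash the load. If you already know your schema, being explicit is even safer. The dtypes below are just an example:

# Explicit dtypes beat guessing; NPI is an identifier, not a number
df = dd.read_csv(
    "massive_file.csv",
    dtype={"NPI": "object", "State": "object", "Avg Charge": "float64"},
)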

Step 1: Filtering Without Dying

Say you only care about stuff from Texas (because Texas is big, like this dataset):

tx_df = df[df["State"] == "TX"]

Dask doesn't actually do this yet; it just plans to. Lazy evaluation, baby. The filtering happens only when you call .compute().

tx_df_computed = tx_df.compute()

Now you’ve got only the Texas-sized chunk of your data — both literally and metaphorically.
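
A handy trick while you're still exploring: .head() on a Dask DataFrame only reads from the first chunk, so you can sanity-check the filter without paying for the whole file:

# Cheap peek: only the first partition gets processed
print(tx_df.head())

# Row count forces a full pass, but it still streams chunk by chunk
print(len(tx_df))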

Step 2: Merging Like a Pro

Time to bring in more data. Dask can merge, but it’s not always smooth, so I did this:

import pandas as pd

other_df = pd.read_csv("smaller_but_still_annoying.csv")
merged_df = tx_df_computed.merge(other_df, on="NPI", how="left")

Big joins are easier when one of the DataFrames is small-ish. This is the ol’ “divide and conquer” move.
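
If the filtered side is still too big to .compute(), Dask can also merge a lazy DataFrame directly with a small Pandas one; the small table is joined against each chunk in parallel. A sketch of that alternative:

# Alternative: keep the big side lazy and merge the small Pandas table into each chunk
merged_lazy = tx_df.merge(other_df, on="NPI", how="left")
merged_df = merged_lazy.compute()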

Step 3: Flags, Tags, and Minor Wizardry

Now I needed to tag rows where a specific condition was met — let’s say some magic codes like ‘15271’ to ‘15278’ exist:

magic_codes = {str(i) for i in range(15271, 15279)}  # '15271' through '15278'

def has_magic_code(codes):
    # Cells read from a CSV are strings (or NaN); assuming comma-separated codes here
    if not isinstance(codes, str):
        return 0
    return int(any(code.strip() in magic_codes for code in codes.split(",")))

merged_df["flag"] = merged_df["Codes"].apply(has_magic_code)

This works like a charm — assuming you enjoy functional programming and a mild amount of pain.
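
If each row holds exactly one code (a plain string, no delimiters), you can skip apply() entirely; isin() is vectorized and much faster:

# Vectorized version for a one-code-per-row column
merged_df["flag"] = merged_df["Codes"].astype(str).isin(magic_codes).astype(int)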

Step 4: Save It Without Exploding RAM

Don’t save directly to Excel — that’s a trap. Save to CSV first:

merged_df.to_csv("temp_output.csv", index=False)

Then use Pandas to turn it into Excel:

df = pd.read_csv("temp_output.csv")
df.to_excel("final_output.xlsx", index=False)

Why this detour? Because to_excel() hates big data more than I hate surprise deadlines.
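
Part of the reason: an .xlsx sheet tops out at 1,048,576 rows, and the default writer builds the whole workbook in memory. If your final result is still huge, one workaround is splitting it across sheets. A rough sketch, with CHUNK as an arbitrary size I picked:

# Split a big DataFrame across sheets to stay under Excel's row limit
CHUNK = 1_000_000
with pd.ExcelWriter("final_output.xlsx") as writer:
    for i in range(0, len(df), CHUNK):
        df.iloc[i:i + CHUNK].to_excel(
            writer, sheet_name=f"part_{i // CHUNK + 1}", index=False
        )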

Final Touch: Timing the Beast

If you want to see how long your code took — so you can feel productive even when it’s still running:

import time
start = time.time()
# your code
print(f"Done in {(time.time() - start) / 60:.2f} minutes.")

TL;DR

  • Use Dask for big CSVs.

  • Use .compute() when you’re ready to process.

  • Merge smartly — use Pandas when one file is small.

  • Tag things with apply() like a boss.

  • Avoid Excel until the very end.

  • Track time. Brag accordingly.

If this helped you tame the CSV kraken, let me know. If it crashed your PC — also let me know, so we can share memes about it.
