Pandas vs. PySpark: When Bigger Isn’t Always Better

Soheil Sheybani

Introduction

In the data engineering world, bigger usually means better. More power, more scalability, and more buzzwords! PySpark, for instance, is typically our trusty steed when tackling massive datasets on distributed platforms such as Databricks. But what if, in our race to scale everything to infinity, we've overlooked simpler, faster solutions?

Here's a story where simplicity won big—and PySpark learned a lesson in humility.


The Great Excel Report Battle

I was tasked with creating an Excel report based on a moderately sized dataset—around 100,000 rows. My initial weapon of choice? PySpark DataFrames running in Databricks. After all, PySpark thrives on handling large datasets, and our environment was already set up. Easy choice, right?

Well, not exactly.

The report involved filtering data by certificate types, applying various transformations and sorting criteria, and finally writing each filtered set to separate Excel sheets (132 sheets in total). Here's a glimpse of the PySpark code I confidently started with:

from pyspark.sql import functions as F
import pandas as pd

def write_dataframe_to_excel(writer, df, certificate):
    # Rank rows by status severity, then by the parsed expiry date
    df = df.withColumn(
        "Status Priority",
        F.when(F.col("Status") == "Non-Comply", 0)
        .when(F.col("Status") == "Warning", 1)
        .when(F.col("Status") == "In-Future", 2)
        .otherwise(3)
    ).withColumn(
        "Sorted Expiry Date",
        F.coalesce(
            F.to_date(F.col("Certificate Expiry Date"), "dd/MM/yyyy"),
            F.to_date(F.lit("01/01/1900"), "dd/MM/yyyy")
        )
    ).orderBy("Status Priority", "Sorted Expiry Date")
    # Drop the helper and internal columns before writing the sheet
    df = df.drop("certificate_code", "certificate_name", "Compliance Status", "Status Priority", "Sorted Expiry Date")
    # toPandas() collects the rows to the driver; this runs a full Spark job for every sheet
    df.toPandas().to_excel(writer, index=False, sheet_name=certificate, startrow=2)

with pd.ExcelWriter(excel_path, engine="openpyxl") as writer:
    for certificate in main_certificate_codes:
        df = compliance_df.filter(F.col("certificate_code") == certificate)
        if certificate in ["BOSIET", "HUET"]:
            df = add_caebs_columns(df)
        df = df.filter(F.col("Compliance Status").isin(["Non-Comply", "Warning", "In-Future"]))
        write_dataframe_to_excel(writer, df, certificate)

Initially, it looked perfect. But execution took nearly 30 minutes! How was PySpark, the champion of big data, struggling with a modest Excel report?
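
Before blaming the tool, it helps to see where the time actually goes. Here's a rough diagnostic sketch, reusing the variables from the snippet above (so treat it as illustrative), that times each sheet. Every toPandas() call inside the loop launches its own Spark job, so 132 sheets means 132 jobs and 132 JVM-to-Python transfers.

import time

with pd.ExcelWriter(excel_path, engine="openpyxl") as writer:
    for certificate in main_certificate_codes:
        start = time.perf_counter()
        df = compliance_df.filter(F.col("certificate_code") == certificate)
        if certificate in ["BOSIET", "HUET"]:
            df = add_caebs_columns(df)
        df = df.filter(F.col("Compliance Status").isin(["Non-Comply", "Warning", "In-Future"]))
        write_dataframe_to_excel(writer, df, certificate)  # toPandas() runs a full Spark job here
        print(f"{certificate}: {time.perf_counter() - start:.1f}s")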

Enter Pandas: The David to PySpark’s Goliath

I started suspecting that I might be over-engineering my solution. After all, was distributed computing even necessary for a dataset of just 100,000 rows?

So, I decided to rewrite the logic using Pandas DataFrames, bypassing PySpark’s overhead entirely. The results were nothing short of astonishing:

def write_pd_dataframe_to_excel(writer, df, certificate):
    # Rank rows by status severity, then by the parsed expiry date
    df["Status Priority"] = df["Status"].map({
        "Non-Comply": 0,
        "Warning": 1,
        "In-Future": 2
    }).fillna(3)
    df["Sorted Expiry Date"] = pd.to_datetime(
        df["Certificate Expiry Date"],
        format="%d/%m/%Y",
        errors="coerce"
    ).fillna(pd.to_datetime("1900-01-01"))
    df = df.sort_values(by=["Status Priority", "Sorted Expiry Date"])
    # Drop the helper and internal columns before writing the sheet
    df = df.drop(columns=[
        "certificate_code", "certificate_name", "Compliance Status", "Status Priority", "Sorted Expiry Date"
    ])
    df.to_excel(writer, index=False, sheet_name=certificate, startrow=2)

# Do the Spark work once: add the CAEBS columns, filter, and pull the result
# to the driver as a single Pandas DataFrame
compliance_df = add_caebs_columns(compliance_df)
compliance_df = compliance_df.filter(
    F.col("Compliance Status").isin(["Non-Comply", "Warning", "In-Future"])
)
pd_df = compliance_df.toPandas()
pd_df["PIN"] = pd.to_numeric(pd_df["PIN"], errors="coerce")

with pd.ExcelWriter(excel_path, engine="openpyxl") as writer:
    for certificate in main_certificate_codes:
        # Copy the slice so per-sheet edits don't touch pd_df (and don't raise SettingWithCopyWarning)
        df = pd_df[pd_df["certificate_code"] == certificate].copy()
        if certificate not in ["BOSIET", "HUET"]:
            df = df.drop(columns=["CAEBS Issue Date", "CAEBS Expiry Date"])
        write_pd_dataframe_to_excel(writer, df, certificate)

This Pandas solution reduced execution time from 30 minutes to just 2 minutes! No distributed cluster, no overhead, just straightforward Python and Pandas magic.


Why Pandas Crushed PySpark in This Case

Why did Pandas outperform PySpark by such a massive margin? The key lies in the nature of the data and the overhead associated with distributed computing.

  • Overhead of Distributed Processing: PySpark is designed for distributed computing across multiple nodes, involving tasks like data serialization/deserialization, task distribution, and network communication. For smaller datasets like ours, this overhead becomes a significant burden, and in the original code it was paid 132 times, once per toPandas() call inside the loop (see the sketch after this list).

  • Absence of JVM Overhead: PySpark's Python API drives a JVM backend (via Py4J), so every toPandas() call has to serialize data across the JVM-to-Python boundary. Pandas, built directly on NumPy inside the Python process, avoids that round trip altogether.

  • In-Memory Operations: Pandas operates entirely within memory on a single node, making computations significantly faster.

  • Data Locality: With Pandas, all of the data lives in a single process's memory, so there is no shuffling between executors and no network hop, unlike PySpark's distributed execution.
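
To make the first point concrete, here is a minimal sketch (assuming a local pyspark installation; the toy column and row count are made up) that compares ten tiny Spark jobs against the same ten passes done in memory with Pandas. Even on a trivial DataFrame, each Spark action pays for scheduling and JVM-to-Python serialization.

import time
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("overhead-demo").getOrCreate()

sdf = spark.range(100_000).withColumn("status", (F.col("id") % 3).cast("string"))
pdf = sdf.toPandas()  # pull to the driver once, reuse below

start = time.perf_counter()
for _ in range(10):  # ten tiny Spark jobs, each with scheduling overhead
    sdf.filter(F.col("status") == "1").count()
print(f"Spark, 10 small jobs: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
for _ in range(10):  # the same filter-and-count, entirely in local memory
    (pdf["status"] == "1").sum()
print(f"Pandas, 10 passes:    {time.perf_counter() - start:.2f}s")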


Has Pandas Been Abused In Data Engineering?

Well, no! Although Pandas won this round by a wide margin, it isn't always the optimal solution. Pandas has some breaking points that might make you reach for other players like Polars or PySpark (a quick way to check the memory side is sketched after this list):

  • Memory Constraints: Pandas loads all data into RAM. If your dataset size exceeds available memory, Pandas can break or severely slow down performance.
  • Complex Joins on Large Data: Performing complex joins or operations on large datasets might exceed Pandas' capabilities, causing performance degradation.
  • Parallel and Distributed Processing: Most Pandas operations are single-threaded, so if parallel or distributed computation is necessary (like ETL on massive datasets), Pandas doesn't scale well.
  • Large-scale I/O Operations: Pandas struggles with massive read/write operations, particularly from distributed sources or formats like Parquet, Avro, or ORC.
  • Real-time Data Processing: Pandas is not optimized for streaming or real-time processing, where PySpark excels.
  • Columnar Optimization: Unlike Polars or PySpark, Pandas has no lazy execution or query optimizer, so long chains of column-wise transformations run eagerly, one at a time, materializing intermediate results. If you run many such operations on a large dataset, Pandas is not a good choice.
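
On the memory point, here's a minimal sketch for checking whether a dataset comfortably fits in RAM before committing to Pandas. The file name is a placeholder, and it assumes psutil is installed for the free-RAM check.

import pandas as pd
import psutil  # assumed installed, used only for the free-RAM check

df = pd.read_csv("compliance_export.csv")      # placeholder file name
df_bytes = df.memory_usage(deep=True).sum()    # real footprint, including string columns
free_bytes = psutil.virtual_memory().available

print(f"DataFrame: {df_bytes / 1e9:.2f} GB, free RAM: {free_bytes / 1e9:.2f} GB")
if df_bytes > free_bytes / 2:                  # rough rule of thumb
    print("Consider chunked reads, Polars, or PySpark instead.")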

Lessons Learned: Choose Your Tools Wisely

This isn’t a "Pandas always beats PySpark on small datasets" story; it’s about picking the right tool for the job. Before reaching for one, weigh these factors holistically:

  • Data Volume: Does your dataset comfortably fit in RAM?
  • Computation Type: Are you performing complex transformations or simple operations?
  • Latency Requirements: Is real-time or near-real-time performance critical?
  • Infrastructure Available: Do you have access to distributed computing resources?
  • Complexity of Joins and Aggregations: Are complex operations frequent?

Use Pandas for quick, ad-hoc tasks on small-to-medium datasets. Opt for Polars for medium-large datasets on a single powerful machine. Choose PySpark for distributed computing, large datasets, or real-time analytics.
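
To get a feel for how similar the three APIs are for everyday work, here's a minimal sketch (assuming polars and pyspark are installed locally; the tiny dataset and column names are made up) expressing the same filter-and-sort in each:

import pandas as pd
import polars as pl
from pyspark.sql import SparkSession, functions as F

data = {"status": ["Warning", "Non-Comply", "Comply"],
        "expiry": ["2025-02-01", "2025-01-15", "2025-03-30"]}

# Pandas: eager and single-threaded, everything in local RAM
pd_out = pd.DataFrame(data).query("status != 'Comply'").sort_values("expiry")

# Polars: multi-threaded, with an optional lazy mode (.lazy() ... .collect()) that optimizes the query
pl_out = pl.DataFrame(data).filter(pl.col("status") != "Comply").sort("expiry")

# PySpark: distributed execution; pays job-scheduling overhead even on tiny data
spark = SparkSession.builder.master("local[*]").getOrCreate()
sp_out = (spark.createDataFrame(pd.DataFrame(data))
               .filter(F.col("status") != "Comply")
               .orderBy("expiry"))
sp_out.show()

The day-to-day syntax is close enough that the real decision is rarely about the API; it's about matching the engine to the data size, latency needs, and infrastructure listed above.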

Below is a quick comparison table:

Feature                 | Pandas       | PySpark          | Polars
Memory Efficiency       | Moderate     | High             | Very High
Parallelism             | No           | Distributed      | Multi-threaded
Dataset Size Handling   | Small-Medium | Large-Very Large | Medium-Large
I/O Performance         | Moderate     | High             | High
Real-Time Processing    | Poor         | Excellent        | Moderate
Ease of Use             | Excellent    | Moderate         | Good
Columnar Optimization   | Limited      | Good             | Excellent

So, next time you're faced with a data engineering task, pause before automatically reaching for the biggest tool in your toolkit. Sometimes, the simplest approach is not just sufficient—it's superior!
