PySpark: Read Large CSV Files Efficiently

Scenario

You have a large CSV file (100GB+ of data) with millions of records. Loading the file without optimization causes memory issues and slow performance.

Solution: Use Partitioning & Parquet for Faster Processing

Step 1: Read the Large CSV in PySpark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("OptimizeCSV").getOrCreate()

# Read CSV file with optimal options
df = spark.read.option("header", "true") \
               .option("inferSchema", "true") \
               .csv("path_location/large_data.csv")

# Preview the first few rows
df.show(5)

🔹 Why?

  • inferSchema=True automatically detects column types, but it does so by making an extra pass over the file; for a CSV this large, supplying an explicit schema avoids that pass (see the sketch after this list).

  • CSV is a row-oriented text format that must be parsed in full on every read, so we'll convert it to Parquet.
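
A minimal sketch of that explicit-schema alternative (the column names and types below are hypothetical placeholders; substitute your real columns):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Hypothetical schema for illustration only; replace with the actual columns.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("year", IntegerType(), True),
    StructField("category", StringType(), True),
    StructField("amount", DoubleType(), True),
])

# Supplying the schema up front skips the extra type-inference pass over the CSV.
df = spark.read.option("header", "true") \
               .schema(schema) \
               .csv("path_location/large_data.csv")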

Step 2: Convert CSV to Parquet & Partition Data

# Write as Parquet, partitioned by the "year" column
df.write.mode("overwrite") \
      .partitionBy("year") \
      .parquet("path_location/optimized_data/")

Or, equivalently:

df.write.mode("overwrite") \
      .format("parquet") \
      .partitionBy("year") \
      .save("path_location/optimized_data/")

  • Parquet is a compressed, columnar format, so scans typically run an order of magnitude faster than on raw CSV.
  • Partitioning by "year" lets Spark prune partitions, so queries that filter on year read only the matching directories (see the read-back sketch below).
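
To see the partition pruning in action, a quick read-back sketch (assuming the data contains a year value such as 2024):

# Reading the Parquet output back and filtering on the partition column lets
# Spark prune directories instead of scanning the entire dataset.
parquet_df = spark.read.parquet("path_location/optimized_data/")
parquet_df.filter(parquet_df.year == 2024).show(5)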

The large dataset is now stored efficiently and can be processed much faster! 🚀

