PySpark: Read Large CSV Files Efficiently

Scenario
You have a large CSV file (100GB+ of data) with millions of records. Loading the file without optimization causes memory issues and slow performance.
Solution: Use Partitioning & Parquet for Faster Processing
Step 1: Read the Large CSV in PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("OptimizeCSV").getOrCreate()
# Read CSV file with optimal options
df = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .csv("path_location/large_data.csv")
df.show(5)
Why?
inferSchema=True automatically detects column types, at the cost of an extra scan over the file. CSV is a row-based text format that is slow to query, so we'll convert it to Parquet.
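On a 100GB file that extra inference scan is expensive. If you already know the column layout, supplying an explicit schema skips it entirely. A minimal sketch, assuming hypothetical column names (id, customer, amount, event_date):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType

# Hypothetical schema -- replace the fields with your actual CSV columns
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("customer", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("event_date", DateType(), True),
])

# No inferSchema: Spark reads the file once, using the declared schema
df = spark.read.option("header", "true") \
    .schema(schema) \
    .csv("path_location/large_data.csv")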
Step 2: Convert CSV to Parquet & Partition Data
df.write.mode("overwrite") \
    .partitionBy("year") \
    .parquet("path_location/optimized_data/")
Or
df.write.mode("overwrite") \
    .format("parquet") \
    .partitionBy("year") \
    .save("path_location/optimized_data/")
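Note that partitionBy("year") assumes the DataFrame already contains a year column. If the raw CSV only carries a full date, you can derive the partition column first; a hedged sketch assuming a hypothetical event_date column:

from pyspark.sql import functions as F

# Derive a "year" partition column from a date column (event_date is an assumed name)
df_with_year = df.withColumn("year", F.year(F.col("event_date")))

df_with_year.write.mode("overwrite") \
    .partitionBy("year") \
    .parquet("path_location/optimized_data/")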
- Parquet is a compressed, columnar format, so analytical reads are typically much faster than on raw CSV (the often-quoted ~10x).
- Partitioning by "year" lets Spark prune whole partition directories, so queries that filter on year scan far less data (see the read-back example below).
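To see partition pruning in action, read the Parquet output back with a filter on the partition column; Spark only touches the matching year= directories instead of the whole dataset. A small sketch (2024 is just an example value):

# Only the year=2024 partition directory is scanned; the others are skipped
df_2024 = spark.read.parquet("path_location/optimized_data/") \
    .filter("year = 2024")
df_2024.show(5)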
The large dataset is now stored efficiently and queries against it run much faster!