Modes of handling corrupt data

Today we work with substantial volumes of data, and not every record is guaranteed to be free from corruption. For text-based sources such as CSV and JSON, PySpark provides three read modes for handling corrupted records.
Let's delve into it!
Permissive mode -
In this approach, PySpark sets the fields of corrupted records to null while reading. This is suitable for scenarios where a few corrupted records will not hinder your ability to gain insights.
spark.read.option("mode", "permissive").csv("testData.csv")
Drop Malformed Mode -
This mode suits situations with stringent data-quality requirements and zero tolerance for corruption. PySpark silently drops the rows containing malformed records during the reading process.
spark.read.option("mode", "dropMalformed").json("testData.json")
FailFast Mode -
When we cannot afford to process any bad data silently, this mode throws an exception and aborts the read as soon as a corrupted record is encountered, so problems surface immediately.
spark.read.option("mode", "FAILFAST").parquet("testData.parquet")
By default, PySpark is configured in Permissive mode, but we have the flexibility to select the appropriate mode based on our specific requirements.
Happy handling!