Build your first ETL pipeline using PySpark

Overview
In this session, we'll walk through building a simple ETL (Extract, Transform, Load) pipeline using PySpark.
Steps Involved
Set up the Spark environment
Read data from CSV
Cleanse and transform the CSV data
Write the final results to the target location
Set Up the Spark Environment
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SampleETL").getOrCreate()
Import libraries: the pyspark.sql module is imported to use the SparkSession class, which is the entry point for any Spark application.
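The builder also accepts extra configuration. A minimal sketch for running the same session on a local machine (the master setting and config value here are illustrative assumptions; on a managed platform such as Databricks, a session is already provided for you):

from pyspark.sql import SparkSession

# Illustrative local setup; master and shuffle setting are assumptions for local dev
spark = (
    SparkSession.builder
    .appName("SampleETL")
    .master("local[*]")  # run on all local cores
    .config("spark.sql.shuffle.partitions", "8")  # fewer shuffle partitions for small data
    .getOrCreate()
)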
Read Data from CSV
input_path="stroage_path_to_the_location" df = spark.read.option("header","true").csv(input_path) display(df)
Cleanse and Transform Data
Drop Null Values
df_cleansed = df.dropna(subset=["Column 1", "Column 2"])
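By default, dropna removes a row if any of the listed columns is null. Depending on the data, you may prefer a different policy; two hedged variants (the UNKNOWN fill value is an illustrative assumption):

# Drop a row only when BOTH columns are null, instead of the default "any"
df_cleansed = df.dropna(how="all", subset=["Column 1", "Column 2"])

# Or keep the rows and replace nulls with a placeholder value instead
df_filled = df.fillna({"Column 1": "UNKNOWN", "Column 2": "UNKNOWN"})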
Filter Rows
from pyspark.sql.functions import col, lit

df_filtered = df_cleansed.where(col("Column 3") == lit("TEXT"))
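The same filter can also be written as a SQL expression string; note that column names containing spaces must be wrapped in backticks:

# Equivalent row filter using a SQL expression string
df_filtered = df_cleansed.where("`Column 3` = 'TEXT'")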
Add New Columns
from pyspark.sql.functions import current_timestamp

df_addition = df_filtered.withColumn("CreateDt", current_timestamp())
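Real pipelines usually carry more than one audit column. A short sketch; the LoadDate and SourceSystem columns and their values are illustrative assumptions, not part of the original pipeline:

from pyspark.sql.functions import current_timestamp, current_date, lit

# Chained withColumn calls; the extra column names and values are assumptions
df_addition = (
    df_filtered
    .withColumn("CreateDt", current_timestamp())
    .withColumn("LoadDate", current_date())
    .withColumn("SourceSystem", lit("sample_csv"))
)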
Write Data to Target Location
target_path_location = "/mnt/silver/storage1/container_location"
df_addition.write.mode("overwrite").format("parquet").save(target_path_location)
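For larger datasets, it is common to partition the output so downstream reads can skip irrelevant files. A hedged sketch, assuming a date column such as the LoadDate from the sketch above exists in the DataFrame:

# Hypothetical partitioned write; the partition column must exist in the DataFrame
(
    df_addition.write
    .mode("overwrite")
    .partitionBy("LoadDate")
    .format("parquet")
    .save(target_path_location)
)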