Build your first ETL pipeline using PySpark

Overview

In this session, we’ll walk through building a simple ETL (Extract, Transform, Load) pipeline using PySpark.

Steps Involved

  1. Setup Spark Environment

  2. Read Data from CSV

  3. Cleanse and Transform the CSV data

  4. Write final results to target location

Setup Spark Environment

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession: the entry point for any Spark application
spark = SparkSession.builder.appName("SampleETL").getOrCreate()

Import Libraries: SparkSession is imported from pyspark.sql; it is the entry point for any Spark application and is used here to create DataFrames and run the pipeline.
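In a managed notebook environment such as Databricks, a spark session is usually pre-created for you. If you are running this locally instead, a minimal sketch might look like the following; the local[*] master and the shuffle-partition setting are assumptions for small-scale testing, not part of the original example.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("SampleETL")
        .master("local[*]")  # assumption: run locally using all available cores
        .config("spark.sql.shuffle.partitions", "8")  # assumption: small test dataset
        .getOrCreate()
    )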

Read Data from CSV

input_path = "storage_path_to_the_location"  # placeholder: replace with your source path

# Read the CSV file, treating the first row as column headers
df = spark.read.option("header", "true").csv(input_path)

display(df)  # display() is a Databricks notebook helper; use df.show() elsewhere
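By default, every column read from a CSV is typed as a string. If you want typed columns up front, you can supply an explicit schema instead; the column names and types below are assumptions for illustration and should be adjusted to match your file.

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Assumed schema for illustration; column names and types are placeholders
    schema = StructType([
        StructField("Column 1", StringType(), True),
        StructField("Column 2", DoubleType(), True),  # assumption: a numeric column
        StructField("Column 3", StringType(), True),
    ])

    df = spark.read.option("header", "true").schema(schema).csv(input_path)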
    

Cleanse and Transform Data

  • Drop Null Values

      # Drop rows that have nulls in either key column (dropna takes subset=)
      df_cleansed = df.dropna(subset=["Column 1", "Column 2"])
    
  • Filter Rows

      from pyspark.sql.functions import col

      # Keep only the rows where Column 3 equals the literal value 'TEXT'
      df_filtered = df_cleansed.where(col("Column 3") == "TEXT")
    
  • Add New Columns

      from pyspark.sql.functions import current_timestamp

      # Stamp each row with the time it was processed
      df_addition = df_filtered.withColumn("CreateDt", current_timestamp())
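The cleanse-and-transform steps above can also be chained into a single expression, which is idiomatic in PySpark; this is the same logic restated under the same assumed column names, not additional processing.

    from pyspark.sql.functions import col, current_timestamp

    df_addition = (
        df.dropna(subset=["Column 1", "Column 2"])      # drop incomplete rows
          .where(col("Column 3") == "TEXT")             # keep only matching rows
          .withColumn("CreateDt", current_timestamp())  # stamp with load time
    )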
    
Write Data to Target Location

target_path_location = "/mnt/silver/storage1/container_location"

# Overwrite any existing output at the target and save the results as Parquet
df_addition.write.mode("overwrite").format("parquet").save(target_path_location)
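Once the write completes, you can read the Parquet output back to verify the load. The date partitioning sketched below is an optional assumption for faster downstream reads, not part of the original pipeline.

    from pyspark.sql.functions import to_date

    # Optional (assumption): partition the output by load date so readers can prune files
    (df_addition
        .withColumn("LoadDate", to_date("CreateDt"))
        .write.mode("overwrite")
        .partitionBy("LoadDate")
        .parquet(target_path_location))

    # Read the output back to confirm the load succeeded
    df_check = spark.read.parquet(target_path_location)
    df_check.show(5)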
    
0
Subscribe to my newsletter

Read articles from Venkatesh Marella directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Venkatesh Marella
Venkatesh Marella

πŸ“Œ About Me: I am a Data Solution Engineer with 12+ years of experience in Big Data, Cloud (Azure & AWS), and AI-driven data solutions. Passionate about building scalable ETL pipelines, optimizing Spark jobs, and leveraging AI for data automation. I have worked across industries like finance, gaming, automotive, and healthcare, helping businesses make data-driven decisions efficiently. πŸ“Œ What I Write About: PySpark & Big Data Processing πŸ—οΈ Optimizing ETL & Data Pipelines ⚑ Cloud Engineering (Azure & AWS) ☁️ Streaming & Real-Time Data (Kafka, Spark Streaming) πŸ“‘ AI & Machine Learning in Data Engineering πŸ€– πŸ“Œ Why Follow Me? I share real-world data engineering challenges and hands-on solutions to help fellow engineers overcome bottlenecks and optimize data workflows. Let’s build robust, scalable, and cost-efficient data systems together! Follow for updates on cutting-edge data engineering topics!