Build your first ETL pipeline using PySpark

Overview

In this session, we’ll walk through building a simple ETL (Extract, Transform, Load) pipeline using PySpark.

Steps Involved

  1. Setup Spark Environment

  2. Read Data from CSV

  3. Cleanse and Transform the CSV data

  4. Write final results to target location

Setup Spark Environment

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession: the entry point for any Spark application
spark = SparkSession.builder.appName("SampleETL").getOrCreate()

Import Libraries: SparkSession is imported from pyspark.sql; it is the entry point for any Spark application and is used here to create DataFrames and run the pipeline.
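In a managed notebook environment such as Databricks, a spark session is usually pre-created for you. If you are running this locally instead, a minimal sketch might look like the following; the local[*] master and the shuffle-partition setting are assumptions for small-scale testing, not part of the original example.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("SampleETL")
        .master("local[*]")  # assumption: run locally using all available cores
        .config("spark.sql.shuffle.partitions", "8")  # assumption: small test dataset
        .getOrCreate()
    )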

Read Data from CSV

input_path = "storage_path_to_the_location"  # placeholder: replace with your source path

# Read the CSV file, treating the first row as column headers
df = spark.read.option("header", "true").csv(input_path)

display(df)  # display() is a Databricks notebook helper; use df.show() elsewhere
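By default, every column read from a CSV is typed as a string. If you want typed columns up front, you can supply an explicit schema instead; the column names and types below are assumptions for illustration and should be adjusted to match your file.

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Assumed schema for illustration; column names and types are placeholders
    schema = StructType([
        StructField("Column 1", StringType(), True),
        StructField("Column 2", DoubleType(), True),  # assumption: a numeric column
        StructField("Column 3", StringType(), True),
    ])

    df = spark.read.option("header", "true").schema(schema).csv(input_path)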
    

Cleanse and Transform Data

  • Drop Null Values

      # Drop rows that have nulls in either key column (dropna takes subset=)
      df_cleansed = df.dropna(subset=["Column 1", "Column 2"])
    
  • Filter Rows

      from pyspark.sql.functions import col

      # Keep only the rows where Column 3 equals the literal value 'TEXT'
      df_filtered = df_cleansed.where(col("Column 3") == "TEXT")
    
  • Add New Columns

      from pyspark.sql.functions import current_timestamp

      # Stamp each row with the time it was processed
      df_addition = df_filtered.withColumn("CreateDt", current_timestamp())
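The cleanse-and-transform steps above can also be chained into a single expression, which is idiomatic in PySpark; this is the same logic restated under the same assumed column names, not additional processing.

    from pyspark.sql.functions import col, current_timestamp

    df_addition = (
        df.dropna(subset=["Column 1", "Column 2"])      # drop incomplete rows
          .where(col("Column 3") == "TEXT")             # keep only matching rows
          .withColumn("CreateDt", current_timestamp())  # stamp with load time
    )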
    
Write Data to Target Location

target_path_location = "/mnt/silver/storage1/container_location"

# Overwrite any existing output at the target and save the results as Parquet
df_addition.write.mode("overwrite").format("parquet").save(target_path_location)
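Once the write completes, you can read the Parquet output back to verify the load. The date partitioning sketched below is an optional assumption for faster downstream reads, not part of the original pipeline.

    from pyspark.sql.functions import to_date

    # Optional (assumption): partition the output by load date so readers can prune files
    (df_addition
        .withColumn("LoadDate", to_date("CreateDt"))
        .write.mode("overwrite")
        .partitionBy("LoadDate")
        .parquet(target_path_location))

    # Read the output back to confirm the load succeeded
    df_check = spark.read.parquet(target_path_location)
    df_check.show(5)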
    
0
Subscribe to my newsletter

Read articles from Venkatesh Marella directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Venkatesh Marella
Venkatesh Marella

πŸ“Œ About Me: I am a Data Solution Engineer with 12+ years of experience in Big Data, Cloud (Azure & AWS), and AI-driven data solutions. Passionate about building scalable ETL pipelines, optimizing Spark jobs, and leveraging AI for data automation. I have worked across industries like finance, gaming, automotive, and healthcare, helping businesses make data-driven decisions efficiently. πŸ“Œ What I Write About: PySpark & Big Data Processing πŸ—οΈ Optimizing ETL & Data Pipelines ⚑ Cloud Engineering (Azure & AWS) ☁️ Streaming & Real-Time Data (Kafka, Spark Streaming) πŸ“‘ AI & Machine Learning in Data Engineering πŸ€– πŸ“Œ Why Follow Me? I share real-world data engineering challenges and hands-on solutions to help fellow engineers overcome bottlenecks and optimize data workflows. Let’s build robust, scalable, and cost-efficient data systems together! Follow for updates on cutting-edge data engineering topics!