
PySpark Architecture: A Detailed Explanation

PySpark is the Python API for Apache Spark, an open-source distributed computing framework designed for large-scale data processing. Understanding PySpark's architecture helps grasp how it handles distributed computations effectively.


1. Components of PySpark Architecture

The architecture consists of several key components:

a. Driver Program

  • The Driver Program is the central node that:

    • Initiates the Spark application.

    • Contains the main code and controls the SparkContext.

    • Converts the user code (in Python) into tasks that can be executed on the cluster.
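For intuition, here is a minimal sketch (names are illustrative) of where code runs: the script itself is the driver program, while the function passed to the transformation is shipped to the executors, and only the collected result comes back to the driver.

from pyspark.sql import SparkSession

# This script runs on the driver: it creates the SparkSession and plans the work.
spark = SparkSession.builder.appName("DriverExample").master("local[*]").getOrCreate()

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

# The lambda below is serialized and executed on the executors;
# collect() brings the results back to the driver.
result = rdd.map(lambda x: x * 10).collect()
print(result)  # [10, 20, 30, 40]

spark.stop()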

b. SparkSession

  • The SparkSession acts as the entry point for programming in PySpark.

  • It encapsulates all functionalities for creating and managing the lifecycle of Spark applications.
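As a quick illustration (the app name is arbitrary), the SparkSession also exposes the underlying SparkContext used by the RDD API and controls the application's lifecycle:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SessionExample").getOrCreate()

print(spark.version)        # version of Spark backing this session
print(spark.sparkContext)   # the underlying SparkContext (RDD entry point)

spark.stop()                # ends the application's lifecycle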

c. Cluster Manager

  • The Cluster Manager allocates resources across applications. Supported managers include:

    • Standalone Manager (Spark's built-in cluster manager).

    • YARN (Yet Another Resource Negotiator, used with Hadoop).

    • Mesos (deprecated in recent Spark releases).

    • Kubernetes.
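The cluster manager is chosen through the master URL passed to the application. A minimal sketch (host names and ports below are placeholders):

from pyspark.sql import SparkSession

# Common master URL formats:
#   local[*]                  -> run locally using all CPU cores
#   spark://host:7077         -> Spark standalone cluster manager
#   yarn                      -> Hadoop YARN (settings come from the Hadoop configuration)
#   mesos://host:5050         -> Apache Mesos
#   k8s://https://host:6443   -> Kubernetes API server

# Connect to a hypothetical standalone cluster:
spark = SparkSession.builder \
    .appName("ClusterManagerExample") \
    .master("spark://master-host:7077") \
    .getOrCreate()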

d. Executors

  • Executors are worker processes launched on cluster nodes.

  • Each executor:

    • Runs tasks assigned by the driver.

    • Stores data in memory or on disk for reuse (caching).

    • Reports task status back to the driver.
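Executor resources are usually set through configuration when the application starts; a minimal sketch (the memory, core, and instance values are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ExecutorConfigExample") \
    .config("spark.executor.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.instances", "2") \
    .getOrCreate()

# spark.executor.instances applies when running on a cluster manager such as
# YARN or Kubernetes; in local mode everything runs inside a single JVM.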

e. Task

  • Tasks are units of work executed on the executors.

  • Spark splits each job into tasks based on the partitions of the RDD, DataFrame, or Dataset being processed: one task per partition, per stage.
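For example, the number of tasks per stage follows the number of partitions (this assumes a SparkSession named spark, created as in section 2):

rdd = spark.sparkContext.parallelize(range(100), numSlices=8)
print(rdd.getNumPartitions())   # 8 partitions -> 8 tasks for a stage over this RDD

# Repartitioning changes the partition count, and therefore the task count.
repartitioned = rdd.repartition(4)
print(repartitioned.getNumPartitions())  # 4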

f. Distributed Storage

  • Spark uses distributed file systems like HDFS, Amazon S3, or Azure Blob Storage to store large datasets.

  • Data is divided into partitions across the cluster nodes.
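Reading from distributed storage looks the same as reading local files; only the path scheme changes. A sketch with hypothetical paths (reading s3a:// paths also requires the corresponding Hadoop connector on the classpath):

# The URI scheme tells Spark which storage system to use.
df_hdfs = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)
df_s3 = spark.read.parquet("s3a://my-bucket/events/")

# The data is split into partitions that are processed across the executors.
print(df_s3.rdd.getNumPartitions())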


2. PySpark Execution Process

The PySpark application execution follows these steps:

a. Build a SparkSession

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Example") \
    .master("local[*]") \
    .getOrCreate()

  • The appName identifies the application.

  • master specifies the cluster to connect to (local[*] runs Spark locally, using all available CPU cores).

b. Load Data

df = spark.read.csv("data.csv", header=True, inferSchema=True)

  • Loads the dataset from distributed storage (e.g., HDFS, S3) or the local file system; header=True reads the first row as column names, and inferSchema=True infers the column types.

c. Transformations and Actions

  • Transformations: Lazily evaluated operations that define a new dataset from an existing one (e.g., filter, map).

  • Actions: Trigger computation and return results (e.g., show, count).

Example:

# Transformation
filtered_df = df.filter(df["age"] > 30)

# Action
filtered_df.show()

d. Job Execution Flow

  1. Job: Each action submitted by the driver triggers a job.

  2. Stage: A job is divided into stages at shuffle boundaries, which are introduced by wide transformations such as groupBy or join.

  3. Tasks: Each stage is divided into tasks, one per data partition, which run in parallel on the executors.
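One way to see this structure is to inspect the physical plan of a query that contains a wide transformation; the Exchange operator in the plan marks the shuffle, and therefore a stage boundary (df is the DataFrame loaded above):

# A groupBy is a wide transformation: it forces a shuffle between stages.
grouped = df.groupBy("age").count()

# explain() prints the physical plan; look for the Exchange operator.
grouped.explain()

# The action submits a job, which Spark breaks into stages and tasks.
grouped.show()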


3. PySpark Cluster Modes

a. Local Mode

  • Executes on a single machine.

  • Ideal for development and testing.

spark = SparkSession.builder.master("local[*]").getOrCreate()

b. Standalone Cluster Mode

  • Spark's native cluster manager.

  • Requires setting up Spark on multiple nodes.

c. YARN Mode

  • Used with Hadoop to distribute tasks across a cluster.

d. Kubernetes Mode

  • Runs Spark applications in Kubernetes containers.

4. Resilient Distributed Dataset (RDD)

The core abstraction in Spark is the RDD:

  • Immutable: Once created, it cannot be modified.

  • Distributed: Split into partitions across the cluster.

  • Fault-tolerant: Automatically recovers from failures.

Example:

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x * 2).collect()  # returns [2, 4, 6, 8]

5. DataFrames and Datasets

  • DataFrame: Distributed collection of data organized into named columns.

  • Dataset: A strongly typed API available in Scala and Java; PySpark works with DataFrames only, since Python is dynamically typed.

Example:

df = spark.read.json("data.json")
df.filter(df["salary"] > 50000).groupBy("department").count().show()

6. PySpark Use Case Example

Task: Find the top 5 most expensive products from a dataset.

  1. Data Setup:

data = [("Laptop", 1000), ("Phone", 800), ("Tablet", 500), ("Monitor", 200), ("Keyboard", 100)]
columns = ["Product", "Price"]

df = spark.createDataFrame(data, columns)

  2. Transformations:

sorted_df = df.orderBy(df["Price"].desc())

  3. Action:

sorted_df.show(5)

Output:

+--------+-----+
| Product|Price|
+--------+-----+
|  Laptop| 1000|
|   Phone|  800|
|  Tablet|  500|
| Monitor|  200|
|Keyboard|  100|
+--------+-----+

7. Fault Tolerance

If a node or executor fails, Spark recomputes the lost partitions from their lineage (the recorded chain of transformations that produced each RDD) instead of relying on data replication.
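The lineage that makes this recomputation possible can be inspected directly:

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)

# toDebugString() returns the lineage (the chain of parent RDDs) that Spark
# would replay to rebuild a lost partition; in PySpark it is a byte string.
print(doubled.toDebugString().decode("utf-8"))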


8. Caching and Persistence

  • Cache data for faster computation using cache() or persist().

Example:

df.cache()
df.count()  # First action triggers computation
df.show()   # Uses cached data
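
persist() lets you pick a storage level explicitly, for example spilling to disk when the data does not fit in memory:

from pyspark import StorageLevel

# Keep the DataFrame in memory and spill to disk if it does not fit.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()        # the first action materializes the persisted data

df.unpersist()    # release the cached data when it is no longer needed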

9. PySpark on Cloud

  • PySpark integrates with cloud services like AWS EMR, Databricks, and Google Dataproc.

Conclusion

PySpark's architecture is designed for scalability, fault tolerance, and ease of use for big data processing. With its distributed execution model and rich APIs, PySpark is a powerful tool for data engineers and analysts.
