
Explain PySpark architecture in detail from scratch, with examples.
PySpark Architecture: A Detailed Explanation
PySpark is the Python API for Apache Spark, an open-source distributed computing framework designed for large-scale data processing. Understanding PySpark's architecture helps grasp how it handles distributed computations effectively.
1. Components of PySpark Architecture
The architecture consists of several key components:
a. Driver Program
The Driver Program is the central node that:
Initiates the Spark application.
Contains the main code and controls the SparkContext.
Converts the user code (in Python) into tasks that can be executed on the cluster.
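For illustration, here is a minimal sketch of the driver's role (names and values are placeholders): the Python process running this script is the driver, and an action like collect() pulls results back into it.
from pyspark.sql import SparkSession

# The Python process running this script is the driver program.
spark = SparkSession.builder.appName("driver-demo").getOrCreate()

# 4 partitions -> the driver will schedule 4 tasks for this stage.
rdd = spark.sparkContext.parallelize(range(10), numSlices=4)

# The driver turns this chain into tasks, ships them to executors,
# and collects the results back into its own memory.
print(rdd.map(lambda x: x * x).collect())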
b. SparkSession
The SparkSession acts as the entry point for programming in PySpark. It encapsulates the functionality for creating and managing the lifecycle of a Spark application.
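As a rough sketch (the application name and config value below are just examples), the SparkSession is obtained through its builder, gives access to the underlying SparkContext, and is closed with stop() when the application ends:
from pyspark.sql import SparkSession

# getOrCreate() returns the running session if one already exists.
spark = (SparkSession.builder
         .appName("session-demo")                      # example application name
         .config("spark.sql.shuffle.partitions", "8")  # example configuration setting
         .getOrCreate())

print(spark.version)      # Spark version backing this session
sc = spark.sparkContext   # the lower-level SparkContext is still accessible
spark.stop()              # ends the application's lifecycle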
c. Cluster Manager
The Cluster Manager allocates resources across applications. Supported managers include:
Standalone Manager (default Spark cluster).
YARN (Yet Another Resource Negotiator, used with Hadoop).
Mesos.
Kubernetes.
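The cluster manager is selected through the master URL passed to the builder. The host names and ports below are placeholders; this is only a sketch of the URL formats:
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("cluster-manager-demo")

# The master URL selects the cluster manager (hosts/ports are placeholders):
# builder.master("local[*]")                         # no cluster manager, run locally
# builder.master("spark://master-host:7077")         # Spark standalone manager
# builder.master("yarn")                             # Hadoop YARN
# builder.master("k8s://https://k8s-api-host:443")   # Kubernetes

spark = builder.master("local[*]").getOrCreate()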
d. Executors
Executors are worker processes launched on cluster nodes.
Each executor:
Runs tasks assigned by the driver.
Stores data in memory or disk for future use (caching).
Reports task status back to the driver.
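Executor resources are requested through configuration. The values below are illustrative only and have no effect in local mode, where the driver and executor run in the same process:
from pyspark.sql import SparkSession

# Illustrative executor sizing; only meaningful on a real cluster.
spark = (SparkSession.builder
         .appName("executor-demo")
         .config("spark.executor.instances", "4")   # number of executors
         .config("spark.executor.cores", "2")       # parallel tasks per executor
         .config("spark.executor.memory", "4g")     # heap memory per executor
         .getOrCreate())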
e. Task
Tasks are units of work executed on the executors.
Spark divides jobs into smaller tasks based on transformations and actions applied on RDDs, DataFrames, or Datasets.
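The number of tasks in a stage equals the number of partitions being processed. A small sketch, assuming a SparkSession named spark already exists:
# Assumes an existing SparkSession called `spark`.
rdd = spark.sparkContext.parallelize(range(100), numSlices=8)

print(rdd.getNumPartitions())            # -> 8, so this stage runs 8 tasks
print(rdd.map(lambda x: x + 1).count())  # action that triggers those tasks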
f. Distributed Storage
Spark uses distributed file systems like HDFS, Amazon S3, or Azure Blob Storage to store large datasets.
Data is divided into partitions across the cluster nodes.
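The storage system is chosen by the URI scheme of the path being read. The paths and bucket names below are placeholders, and reading from S3 assumes the S3A connector and credentials are configured:
# Placeholder paths; the URI scheme selects the storage system.
events = spark.read.parquet("hdfs:///data/events/")               # HDFS
sales = spark.read.csv("s3a://my-bucket/sales.csv", header=True)  # Amazon S3 via S3A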
2. PySpark Execution Process
The PySpark application execution follows these steps:
a. Build a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark Example") \
    .master("local[*]") \
    .getOrCreate()
The appName identifies the application. The master setting specifies where the cluster is located (local[*] means use all available CPU cores on the local machine).
b. Load Data
df = spark.read.csv("data.csv", header=True, inferSchema=True)
- Load a dataset from the distributed file system or local storage.
c. Transformations and Actions
Transformations: Lazily evaluated operations that define a new dataset from an existing one (e.g., filter, map).
Actions: Operations that trigger computation and return results (e.g., show, count).
Example:
# Transformation
filtered_df = df.filter(df["age"] > 30)
# Action
filtered_df.show()
d. Job Execution Flow
Job: A high-level query or action triggers a job.
Stage: Each job is divided into stages based on transformations.
Tasks: Stages are divided into tasks based on data partitions.
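A shuffle-producing transformation such as groupBy introduces a stage boundary. A sketch, assuming an existing SparkSession named spark; explain() prints the physical plan, and the Spark UI shows the resulting jobs, stages, and tasks:
df = spark.range(1_000_000)
counts = df.groupBy((df["id"] % 10).alias("bucket")).count()

counts.explain()  # the Exchange operator in the plan marks the shuffle (stage boundary)
counts.show()     # action: triggers execution; the shuffle splits the work into stages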
3. PySpark Cluster Modes
a. Local Mode
Executes on a single machine.
Ideal for development and testing.
spark = SparkSession.builder.master("local[*]").getOrCreate()
b. Standalone Cluster Mode
Spark's native cluster manager.
Requires setting up Spark on multiple nodes.
c. YARN Mode
- Used with Hadoop to distribute tasks across a cluster.
d. Kubernetes Mode
- Runs Spark applications in Kubernetes containers.
4. Resilient Distributed Dataset (RDD)
The core abstraction in Spark is the RDD:
Immutable: Once created, it cannot be modified.
Distributed: Split into partitions across the cluster.
Fault-tolerant: Automatically recovers from failures.
Example:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x * 2).collect()
5. DataFrames and Datasets
DataFrame: Distributed collection of data organized into named columns.
Dataset: Strongly typed API available in Scala and Java; PySpark does not expose Datasets and uses DataFrames instead.
Example:
df = spark.read.json("data.json")
df.filter(df["salary"] > 50000).groupBy("department").count().show()
6. PySpark Use Case Example
Task: Find the top 5 most expensive products from a dataset.
- Data Setup:
data = [("Laptop", 1000), ("Phone", 800), ("Tablet", 500), ("Monitor", 200), ("Keyboard", 100)]
columns = ["Product", "Price"]
df = spark.createDataFrame(data, columns)
- Transformations:
sorted_df = df.orderBy(df["Price"].desc())
- Action:
sorted_df.show(5)
Output:
+--------+-----+
| Product|Price|
+--------+-----+
|  Laptop| 1000|
|   Phone|  800|
|  Tablet|  500|
| Monitor|  200|
|Keyboard|  100|
+--------+-----+
7. Fault Tolerance
Spark can recover lost partitions using lineage information stored in RDDs.
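The lineage of an RDD, i.e. the recipe Spark uses to recompute lost partitions, can be inspected directly. A small sketch, assuming an existing SparkSession named spark:
rdd = spark.sparkContext.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 5)

# toDebugString() shows the chain of transformations Spark would replay
# to rebuild a lost partition.
print(rdd.toDebugString().decode("utf-8"))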
8. Caching and Persistence
- Cache data for faster computation using cache() or persist().
Example:
df.cache()
df.count() # First action triggers computation
df.show() # Uses cached data
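persist() also accepts an explicit storage level; cache() is simply persist() with the default level (memory and disk for DataFrames). A sketch continuing with the df cached above:
from pyspark import StorageLevel

df.unpersist()                            # drop the level set by cache() above
df.persist(StorageLevel.MEMORY_AND_DISK)  # choose the storage level explicitly
df.count()                                # materializes the cached data
df.unpersist()                            # free the blocks when no longer needed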
9. PySpark on Cloud
- PySpark integrates with cloud services like AWS EMR, Databricks, and Google Dataproc.
Conclusion
PySpark's architecture is designed for scalability, fault tolerance, and ease of use for big data processing. With its distributed execution model and rich APIs, PySpark is a powerful tool for data engineers and analysts.