Spark Simplified: Architecture for Data Enthusiasts


Apache Spark has become a cornerstone of big data processing, offering speed, scalability, and flexibility. Whether you're just starting out or looking to deepen your understanding, this guide will walk you through the core architecture of Spark in a clear and approachable way.
What is Apache Spark?
Apache Spark is a distributed computing system designed for fast and flexible large-scale data processing. It supports Python, Java, R, and Scala, and provides high-level APIs for working with structured and unstructured data.
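To make that concrete, here is a minimal PySpark sketch of the high-level DataFrame API. The file name people.json and its columns are illustrative, not from a real dataset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hello-spark").getOrCreate()

# Structured data: read a JSON file into a DataFrame and query it.
people = spark.read.json("people.json")            # path is illustrative
adults = people.filter(people.age >= 18).select("name", "age")
adults.show()

spark.stop()
```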
Spark Architecture
Core Components
Driver Program [Brain]
Runs your main program and controls the whole application.
Creates SparkContext.
Converts user code into a DAG of stages, then into tasks.
Sends tasks to executors.
Tracks and aggregates results.
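A quick sketch of what the driver does the moment your script starts: it creates the SparkSession (and the underlying SparkContext) that coordinates everything else. The application name here is illustrative.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-app")        # illustrative name
    .getOrCreate()
)

sc = spark.sparkContext       # the SparkContext the driver created
print(sc.applicationId)       # the driver tracks this application's ID
```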
Cluster Manager [Machines]
Allocates resources to your Spark applications.
Types → Standalone, YARN, Mesos, K8s.
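The cluster manager is chosen by the master URL you pass when building the session (or via spark-submit). A sketch below, with illustrative host names; local[*] is handy for experimenting on your own machine.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    # .master("spark://master-host:7077")    # Standalone cluster manager
    # .master("yarn")                        # YARN (Hadoop)
    # .master("k8s://https://k8s-api:6443")  # Kubernetes
    .master("local[*]")                      # local mode for experimenting
    .getOrCreate()
)
```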
Executors [Transformations and Actions]
Run on worker nodes.
Each has its own JVM.
Store cached data.
Return results back to the driver.
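A sketch of requesting executor resources and caching data in their memory. The resource numbers are illustrative, and spark.executor.instances only takes effect on cluster managers such as YARN or Kubernetes.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-demo")
    .config("spark.executor.instances", "2")  # how many executors to request
    .config("spark.executor.cores", "2")      # cores per executor
    .config("spark.executor.memory", "2g")    # JVM heap per executor
    .getOrCreate()
)

df = spark.range(1_000_000)
df.cache()          # partitions get cached in executor memory
print(df.count())   # first action materializes (and caches) the data
```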
Internal Concepts
DAGs(Directed Acyclic Graphs) [Plan of Action]
Spark builds a DAG of transformations instead of executing line by line.
Execution is only triggered by actions (e.g., .collect() or .show()).
Helps with lazy evaluation and optimization.
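Here is a small sketch of lazy evaluation in action: the transformations only build up the DAG, and nothing runs until the action at the end.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.range(100)                                  # transformation: nothing runs yet
evens = df.filter(F.col("id") % 2 == 0)                # transformation: DAG grows
doubled = evens.withColumn("twice", F.col("id") * 2)   # still nothing runs

doubled.show(5)                                        # action: the DAG executes now
```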
Catalyst Optimizer
Optimizes logical and physical query plans.
Applies rules to reorder filters, push down predicates, and optimize joins.
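You can watch Catalyst at work with explain(), which prints the logical and physical plans, including pushed-down filters. The Parquet path and column names below are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.read.parquet("events.parquet")   # path is illustrative
result = df.filter(F.col("country") == "IN").select("user_id", "country")

result.explain(True)   # parsed, analyzed, optimized and physical plans
```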
Tungsten Engine [Execution Engine]
Uses off-heap memory, code generation, and binary data processing for speed.
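If you are curious about the generated code, a sketch like the one below (Spark 3+) prints the Java source that Tungsten's whole-stage code generation produces for a query.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tungsten-demo").getOrCreate()

df = spark.range(1_000).filter(F.col("id") % 2 == 0)
df.explain(mode="codegen")   # generated code that Tungsten compiles and runs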
PySpark Job Execution: step-by-step
Driver runs your script and builds a DAG.
The plan is optimized by the Catalyst optimizer.
The Tungsten engine compiles the physical plan into JVM bytecode.
Spark creates a stage for a sequence of narrow transformations bounded by wide transformations.
Tasks (per partition) are scheduled and sent to executors.
Executors process the data and return the results back to the driver.
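A small end-to-end sketch tying these steps together (the CSV path and column names are illustrative): the narrow transformations stay within one stage, the wide groupBy forces a shuffle and a new stage, and the action triggers the whole plan. While it runs, the Spark UI (http://localhost:4040 by default) shows the stages and tasks described above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("job-execution-demo").getOrCreate()

orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

totals = (
    orders
    .filter(F.col("status") == "COMPLETE")                  # narrow: stage 1
    .withColumn("amount", F.col("amount").cast("double"))   # narrow: stage 1
    .groupBy("customer_id")                                  # wide: shuffle boundary
    .agg(F.sum("amount").alias("total_spent"))               # stage 2
)

totals.show(10)   # action: the driver schedules tasks, executors run them
```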
Wrapping Up
Apache Spark might seem intimidating at first, but once you break it down into its core parts—drivers, executors, DAGs, and optimizers—it starts to make a lot more sense. Hopefully, this guide gave you a clearer picture of how Spark works behind the scenes.
If you're just getting started, don’t worry about mastering everything at once. Try running a few simple jobs, explore the Spark UI, and see how the pieces fit together. The more you experiment, the more intuitive it becomes.
Thanks for reading! If you found this helpful or have questions, feel free to drop a comment or reach out. I’d love to hear how you’re using Spark or what you'd like to learn next.