Apache Spark - Tutorial 2
Spark Submit
From now on we are going to use spark-submit frequently, so let's learn the syntax for spark-submit first.
Once the Spark application build is completed, we execute that application via the spark-submit command.
spark-submit \
--master <master-url> \
--deploy-mode <cluster or client> \
--conf <key>=<value> \
--driver-memory <n>g \
--executor-memory <n>g \
--executor-cores <n> \
--num-executors <n> \
--jars <comma-separated jar file names> \
examplePySpark.py <arg1> <arg2>
Master
Here we specify the Spark master URL (for example, yarn, local[*], or spark://host:port).
Deploy Mode:
There are two deploy modes available:
Client Mode: When we set the deploy mode to client, the Driver Program is created on our client machine. This is useful while you are developing a Spark application, but it should not be used once your application goes into production. Since the executors constantly communicate with the Driver Program during execution, running the driver outside the cluster is not advisable for production.
Cluster Mode: In cluster mode, the Driver Program is created on one of the worker nodes, so communication between the driver and the executors is faster. This mode is used in most production deployments.
The default deploy mode is client.
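For illustration, assuming a YARN cluster (the master URL is an assumption, not part of the original example), the two modes would be submitted roughly like this:
spark-submit --master yarn --deploy-mode client examplePySpark.py
spark-submit --master yarn --deploy-mode cluster examplePySpark.py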
Conf
Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes.
Driver Memory
Here you can specify how much memory the driver needs for your application, for example 2g.
Executor Memory
This is used to configure the memory for each executor, for example 4g.
Executor Cores
The number of CPU cores to allocate to each executor.
Number of Executors
Can be used to configure the required number of executors for the application.
Jars
A comma-separated list of jar files (for example, extra dependencies your application needs) to be placed on the driver and executor classpaths.
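Putting the options together, a full submission might look like the following sketch; the master, memory sizes, jar names, and arguments are illustrative assumptions rather than values from a real cluster:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.sql.shuffle.partitions=200 \
--driver-memory 2g \
--executor-memory 4g \
--executor-cores 2 \
--num-executors 5 \
--jars mysql-connector-java.jar,custom-udfs.jar \
examplePySpark.py 2023-01-01 India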
Now let's dive into the most important core concept of Spark.
RDD - Resilient Distributed DataSet
RDDs are the main logical data units in Spark
They are distributed collections of objects, stored in memory or on the disks of different nodes of a cluster.
Features of RDD
Resilient: RDDs track data lineage information (a sketch is shown below) to recover from a failure state automatically. This is also known as fault tolerance.
Distributed: Data residing in an RDD can be stored on multiple nodes.
Lazy Evaluation: Data does not get loaded into memory until an action is called.
Immutability: Data stored in an RDD is read-only; you cannot change it. Any transformation produces a new RDD instead.
In-Memory Computation: An RDD keeps intermediate data in RAM, so your application gets faster access to it.
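To make these features concrete, here is a minimal sketch, assuming a SparkSession named spark already exists and using made-up numbers:
# sc is the SparkContext behind the existing SparkSession
sc = spark.sparkContext
numbers = sc.parallelize([1, 2, 3, 4, 5])   # an immutable, distributed RDD
doubled = numbers.map(lambda x: x * 2)      # lazy: nothing is computed yet
print(doubled.toDebugString())              # lineage information used for recovery
print(doubled.collect())                    # action: the computation actually runs now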
Transformations and Actions
Transformations
A transformation is a function that produces a new RDD from an existing RDD; it takes an RDD as input and produces an RDD as output.
A transformation is not executed immediately; instead, Spark builds a DAG from the referenced code and only runs it when an action is called.
There are two types of transformations available: narrow and wide transformations.
Narrow Transformations
In a narrow transformation, all the records required to compute a single output partition are available in a single parent partition, so no shuffle happens here.
The below-listed functions use narrow transformations:
map(), filter(), union()
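A minimal sketch of these narrow transformations, assuming an existing SparkContext sc and illustrative data:
evens = sc.parallelize(range(10)).filter(lambda x: x % 2 == 0)   # narrow: filter
squares = evens.map(lambda x: x * x)                             # narrow: map
combined = squares.union(sc.parallelize([100, 200]))             # narrow: union
print(combined.collect())   # no shuffle happened in any step above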
Wide Transformation
In a wide transformation, the records required to compute a single output partition can come from multiple parent partitions, so a data shuffle happens here.
The below-listed functions use wide transformations:
groupByKey(), reduceByKey(), join(), repartition(), coalesce()
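A minimal sketch of a wide transformation, again assuming an existing SparkContext sc and made-up key-value pairs:
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
totals = pairs.reduceByKey(lambda x, y: x + y)   # wide: records are shuffled by key
print(totals.collect())                          # expected: [('a', 4), ('b', 6)] (order may vary)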
Actions
An action is the way of sending data from the executors back to the Driver after the computations defined by transformations.
When an action is called, a new RDD is not created as with transformations; instead, the actual data is returned.
An action sets Spark's lazy evaluation in motion, and the result is returned to the driver's memory or written to external storage.
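For illustration, a few common actions on a small RDD (assuming sc as before; the commented outputs are what you would expect, not captured from a real run):
rdd = sc.parallelize([3, 1, 4, 1, 5])
print(rdd.count())     # action: 5
print(rdd.take(2))     # action: [3, 1]
print(rdd.collect())   # action: [3, 1, 4, 1, 5]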
Demo PySpark application
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# Assumes emp.csv and dept.csv have header rows (columns such as country, dept_id, salary)
empRdd = spark.read.csv('emp.csv', header=True, inferSchema=True)
deptRdd = spark.read.csv('dept.csv', header=True, inferSchema=True)
filterEmpRdd = empRdd.filter("country = 'India'")
indiaDeptSalaryRdd = filterEmpRdd.groupBy('dept_id').agg(F.sum(F.col('salary')).alias('total_salary'))
deptLevelSalary = indiaDeptSalaryRdd.join(deptRdd, 'dept_id', 'inner')
deptLevelSalary.show()
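This script could be launched with the spark-submit syntax covered earlier; a minimal local run might look like this (the file name demo_app.py is an assumption):
spark-submit --master local[*] demo_app.py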