Getting Started with PySpark
Apache Spark is a powerful distributed computing framework commonly used for big data processing, ETL (Extract, Transform, Load), and building machine learning pipelines. It supports several programming languages, including Scala, Java, and Python, making it a versatile choice for data processing tasks. In this tutorial, we'll focus on installing Apache Spark on a macOS machine and running Spark jobs using PySpark with Jupyter Notebook.
Installing Apache Spark
Before diving into Spark, we need to ensure that we have all the necessary components installed.
1. Install Java
Java is a prerequisite for running Apache Spark. You can use the Homebrew package manager to install a JDK (Eclipse Temurin here; any Spark-supported JDK works):
brew install --cask temurin
To verify the Java installation, run:
java -version
2. Install Command Line Tools
Ensure that you have Xcode Command Line Tools installed by running:
xcode-select --install
3. Install Scala
Scala is another essential component for Spark. Install it using Homebrew:
brew install scala
To verify the Scala installation, run:
scala -version
4. Install Apache Spark Package
Now, let's install Apache Spark itself. With Homebrew, this is a breeze:
brew install apache-spark
This command installs Apache Spark along with its dependencies, including PySpark.
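If Homebrew has linked the Spark scripts into your PATH (it typically does), you can confirm the installation right away by checking the version:
spark-submit --version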
5. Setting Environment Variables
To make Spark easily accessible from the command line, you'll want to add some environment variables. First, find out where Apache Spark is installed on your system:
brew info apache-spark
Assuming you're using the Zsh shell, add the following lines to your ~/.zshrc file (adjust the version and prefix to match the output of brew info apache-spark):
export SPARK_HOME=/opt/homebrew/Cellar/apache-spark/3.5.0/libexec
export PATH=$PATH:$SPARK_HOME/sbin:$SPARK_HOME/bin
export PYTHONPATH=/opt/homebrew/Cellar/apache-spark/3.5.0/libexec/python
Save the file and run:
source ~/.zshrc
If you're using Bash, add the same lines to your ~/.bash_profile file instead. After saving it, run:
source ~/.bash_profile
$SPARK_HOME holds the path to the Spark home directory; $SPARK_HOME/bin and $SPARK_HOME/sbin, which contain the launch scripts, are appended to the $PATH environment variable so the Spark commands can be found.
With these environment variables set, you can now access Spark commands from the terminal.
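If you'd rather not hard-code the version in SPARK_HOME, one option (assuming a Homebrew install) is to derive it from brew itself, so the variable keeps pointing at the right place after an upgrade:
export SPARK_HOME="$(brew --prefix apache-spark)/libexec"
export PATH=$PATH:$SPARK_HOME/sbin:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python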
To start Spark's Master UI, run:
cd $SPARK_HOME/sbin
./start-all.sh
This will launch the Spark Master UI at http://localhost:8080.
On macOS, the start script connects to your machine over SSH (localhost:22). To allow this, enable Remote Login under System Settings > Sharing.
To stop the Master UI, use:
./stop-all.sh
Running Spark Jobs
1. Spark Submit
Now that you have Apache Spark installed, let's run a simple Spark job using spark-submit. First, create a Python script; let's call it test.py:
from pyspark import SparkContext

# Create a SparkContext that runs locally with a single thread
sc = SparkContext("local", "PySpark Test")

print("Hello from Spark")
print("Spark Context >> ", sc)
To execute this script using spark-submit, use the following command:
spark-submit test.py
Make sure you're in the same directory as the test.py file, or provide the full path to the script.
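If you want the job to do a little more than print, here is a minimal sketch that builds on the same test.py and sums a small distributed dataset (local[*] uses all available cores, whereas plain local runs a single thread):

from pyspark import SparkContext

# Run locally, using as many worker threads as there are logical cores
sc = SparkContext("local[*]", "PySpark Test")

# Distribute a small range across partitions and aggregate it
numbers = sc.parallelize(range(1, 101))
print("Sum of 1..100 >> ", numbers.sum())

# Shut the context down cleanly when the job is done
sc.stop()

Submit it the same way with spark-submit test.py.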
2. PySpark with Jupyter Notebook
PySpark can also be integrated with Jupyter Notebook for interactive data analysis and exploration.
Install Required Packages
Before you can use PySpark in Jupyter Notebook, install the necessary packages:
pip install findspark notebook
Start Jupyter Notebook
Launch Jupyter Notebook by running:
jupyter-notebook
Initialize PySpark in Jupyter
In a Jupyter Notebook cell, import and initialize findspark:
import findspark
findspark.init()
This step helps locate the path to your Apache Spark installation and sets it up for your Jupyter session.
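If Jupyter was not started from a shell where SPARK_HOME is set, findspark may not locate Spark on its own. In that case you can pass the installation path explicitly (shown here with the Homebrew path from earlier; adjust it to your install):

import findspark

# Point findspark directly at the Spark installation
findspark.init("/opt/homebrew/Cellar/apache-spark/3.5.0/libexec")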
Now, you can import PySpark and create a Spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print("Spark Session >> ", spark)
You've successfully set up PySpark in Jupyter Notebook! You can now use PySpark to process data interactively. Note: always call findspark.init() before importing pyspark. For example, you can run a simple SQL query:
spark_df = spark.sql("SELECT 'Hello from Spark' AS test_message")
spark_df.show()
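For a slightly fuller example, you can build a DataFrame from an ordinary Python list and run a simple transformation on it (the names and ages here are just made-up sample data):

# Illustrative sample data
data = [("alice", 34), ("bob", 45), ("carol", 29)]
df = spark.createDataFrame(data, ["name", "age"])

# Keep only rows with age greater than 30 and print them
df.filter(df.age > 30).show()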
Extras
To simplify running PySpark within Jupyter Notebook, you can configure it to start automatically with Jupyter. To do this, add the following environment variables to your ~/.zshrc or ~/.bash_profile:
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8889'
Now, when you run the pyspark command in your terminal, it will start a Jupyter Notebook session with PySpark preconfigured.
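In a notebook launched this way, the pyspark shell should already have created a SparkSession for you and exposed it as spark, so a first cell can use it directly (if it isn't defined, fall back to the findspark approach above):

# 'spark' is pre-created by the pyspark shell in this mode
spark.range(5).show()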
With these steps, you've set up Apache Spark and PySpark on your macOS machine and are ready to start working with distributed data processing and analysis. You can explore Spark's vast capabilities for big data processing and machine learning right from your local environment. Happy Sparking!