Remote Spark Submission using Apache Livy


What is Apache Livy?
Apache Livy is an open-source REST service for managing long-running Apache Spark jobs. It allows you to submit, manage, and interact with Spark jobs over a RESTful interface, making it easier to integrate Spark with other applications and services.
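The REST API also supports interactive sessions, which is a direct way to see that interface in action. The sketch below is a minimal example, assuming Livy is reachable at the placeholder <livy-server> on the default port 8998: it creates a PySpark session, runs one statement, and tears the session down.
import requests
import time
livy = "http://<livy-server>:8998"  # placeholder host for your Livy server
headers = {"Content-Type": "application/json"}
# Create an interactive PySpark session
session = requests.post(f"{livy}/sessions", json={"kind": "pyspark"}, headers=headers).json()
session_url = f"{livy}/sessions/{session['id']}"
# Wait for the session to become idle before submitting code
while requests.get(session_url, headers=headers).json()["state"] != "idle":
    time.sleep(5)
# Statements run asynchronously; poll the statement URL until its state is "available"
stmt = requests.post(f"{session_url}/statements", json={"code": "print(1 + 1)"}, headers=headers).json()
print(requests.get(f"{session_url}/statements/{stmt['id']}", headers=headers).json())
# Clean up the session when finished
requests.delete(session_url, headers=headers)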
Setting Up Apache Livy
Install Apache Livy:
Download the Livy release from the official Apache Livy repository.
Extract the archive and place it in a directory on your cluster.
tar -zxvf livy-<version>.tar.gz
Configure Livy:
- Navigate to the Livy configuration directory.
cd livy-<version>/conf
- Copy the template configuration files.
cp livy.conf.template livy.conf
cp livy-env.sh.template livy-env.sh
- Edit livy.conf to configure Livy. At a minimum, set the server port and host and point Livy at your Spark master:
livy.server.port = 8998
livy.server.host = <hostname>
livy.spark.master = yarn
livy.spark.deployMode = cluster
- Edit livy-env.sh to set environment variables as needed. For example:
export SPARK_HOME=/path/to/spark
export HADOOP_CONF_DIR=/path/to/hadoop/etc/hadoop
Start Livy Server:
- Start the Livy server by running the following command:
./bin/livy-server start
- Check the logs to confirm Livy started correctly (the log file name includes the user who started the server):
tail -f logs/livy-<user>-server.out
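Besides the logs, you can verify the server is reachable over REST. A minimal check, assuming the host and default port from livy.conf above, is to request the batches endpoint and expect an HTTP 200 with an (initially empty) list:
import requests
# A 200 response confirms the Livy REST endpoint is up
response = requests.get("http://<livy-server>:8998/batches")
print(response.status_code, response.json())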
Submitting a PySpark Job via Livy
Prepare Your PySpark Script:
- Place your PySpark script in a location accessible by the cluster (e.g., HDFS or S3).
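For example, assuming the hdfs command-line client is available on the machine holding the script, a small helper like the one below copies it into HDFS before submission (the HDFS target path is a placeholder):
import subprocess
# Copy the local script into HDFS so the cluster can read it at submission time
subprocess.run(
    ["hdfs", "dfs", "-put", "-f", "example.py", "hdfs:///path/to/example.py"],
    check=True,
)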
Submit Job via REST API:
- From your client machine, you can submit a PySpark job using HTTP requests.
Example PySpark Script (example.py):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
# Sample PySpark code
data = [("James", "Smith"), ("Anna", "Rose"), ("Robert", "Williams")]
columns = ["First Name", "Last Name"]
df = spark.createDataFrame(data, columns)
df.show()
spark.stop()
Submitting the Job:
import requests
import json
# Define the Livy URL
livy_url = 'http://<livy-server>:8998/batches'
# Define the payload for the POST request. For a PySpark script only "file"
# is required; "className" applies to Java/Scala jars and is omitted here.
# "conf" is optional and shown as an example of passing extra Spark settings.
payload = {
    "file": "hdfs:///path/to/example.py",
    "args": [],
    "conf": {
        "spark.jars.packages": "com.databricks:spark-avro_2.11:3.0.1"
    }
}
# Define the headers
headers = {'Content-Type': 'application/json'}
# Send the POST request
response = requests.post(livy_url, data=json.dumps(payload), headers=headers)
# Print the response
print(response.json())
Managing and Monitoring Jobs
Check Job Status:
- You can check the status of your submitted job by sending a GET request to the Livy server.
batch_id = response.json()['id']
status_url = f"http://<livy-server>:8998/batches/{batch_id}"
response = requests.get(status_url)
print(response.json())
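A batch moves through states such as starting and running before reaching a terminal state. The following sketch, reusing the batch_id and placeholder host from above, polls until the batch finishes:
import time
import requests
status_url = f"http://<livy-server>:8998/batches/{batch_id}"
terminal_states = {"success", "dead", "killed", "error"}
while True:
    state = requests.get(status_url).json()["state"]
    print(f"Batch {batch_id}: {state}")
    if state in terminal_states:
        break
    time.sleep(10)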
List All Jobs:
- List all running and completed jobs by sending a GET request.
list_url = 'http://<livy-server>:8998/batches'
response = requests.get(list_url)
print(response.json())
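The listing response is a JSON object whose sessions field contains one entry per batch. A short loop (a sketch based on that layout) prints each batch's id and state:
for batch in response.json().get("sessions", []):
    print(batch["id"], batch["state"])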
Kill a Job:
- You can kill a running job by sending a DELETE request.
delete_url = f"http://<livy-server>:8998/batches/{batch_id}"
response = requests.delete(delete_url)
print(response.json())
Benefits of Using Apache Livy
- Ease of Integration: Livy provides a REST interface that can be easily integrated with other applications, services, and programming languages.
- Remote Job Submission: Allows you to submit Spark jobs from any machine without needing direct access to the Spark cluster.
- Job Management: Provides features to manage, monitor, and control Spark jobs.
- Multi-language Support: Supports multiple languages, including Python, Scala, and Java.
Conclusion
Using Apache Livy allows you to submit and manage Spark jobs remotely through a RESTful interface, making it a robust solution when you need to call a PySpark application from a client machine without having the code on the client. This setup not only simplifies interaction with the Spark cluster but also improves the flexibility and scalability of job submissions.