Hello Spark on Minikube
Minikube is a beginner-friendly tool that lets you run a Kubernetes cluster on your local machine, making it easy to start learning and experimenting with Kubernetes without needing a complex setup. It creates a single-node cluster inside a virtual machine, simulating a real Kubernetes environment. This allows you to develop, test, and explore Kubernetes features locally before moving to a production environment. With Minikube, you can quickly spin up and manage your own Kubernetes cluster, making it an essential tool for anyone new to container orchestration.
In this article, I will demonstrate how to run a custom Spark application on Minikube. We will create a Docker image by extending the standard Spark image from Bitnami, add our example Spark application to it, and run the whole setup on Minikube.
Prerequisites
To follow along with this example, you will need the following tools installed on your machine:
Docker
Minikube
Spark
In my setup, I am using Docker both for building application images and as the VM driver for Minikube. The other VM drivers that Minikube supports are Hyperkit, VirtualBox and Podman. Installation of Docker and Minikube is pretty straightforward too. If they are not already installed on your system, you can refer to the official installation guides for these tools.
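Once they are installed, a quick sanity check is to print each tool's version and confirm it is on your PATH (kubectl is used throughout this article as well; your version numbers will differ):

docker --version
minikube version
kubectl version --client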
We also need to set up Spark on our machine so we can use the spark-submit utility for submitting jobs to Minikube. Installing Spark on Linux is also very simple. Follow the steps below to set it up on your machine:
Download Spark from the official website - I am using Spark 3.5.1 for this demo. Use the commands below to download and set up Spark on your system:
# cd to your working directory (I set up mine in the below directory)
cd /Users/krrohit/Learning/Kubernetes

# download Spark 3.5.1
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

# untar the package
tar -xzf spark-3.5.1-bin-hadoop3.tgz

# create symbolic link to the extracted directory
ln -s spark-3.5.1-bin-hadoop3 spark

# export SPARK_HOME
export SPARK_HOME="`pwd`/spark"
If you follow the steps correctly, your folder structure would look like this:
krrohit@Kumars-Mac-mini Kubernetes % ls -lrt
-rw-r--r--@  1 krrohit  staff  400446614 Jul 26 07:57 spark-3.5.1-bin-hadoop3.tgz
drwxr-xr-x@ 18 krrohit  staff        576 Jul 26 07:58 spark-3.5.1-bin-hadoop3
lrwxr-xr-x@  1 krrohit  staff         23 Aug 10 18:28 spark -> spark-3.5.1-bin-hadoop3
krrohit@Kumars-Mac-mini Kubernetes % export SPARK_HOME="`pwd`/spark"
krrohit@Kumars-Mac-mini Kubernetes % echo $SPARK_HOME
/Users/krrohit/Learning/Kubernetes/spark
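As an optional sanity check, you can ask Spark to print its version (this assumes a compatible Java installation is available on your machine):

${SPARK_HOME}/bin/spark-submit --version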
That's it! This is the bare-minimum setup that we need for this demo.
Action Time
Now that our prerequisites are satisfied, let's quickly jump to the real stuff! We have to start our Minikube cluster, create a Docker image for our Spark application, and submit the job. Let's do it step by step:
Start minikube cluster
minikube start --vm-driver=docker --cpus=5 --memory=6000
It should produce output like this:
krrohit@Kumars-Mac-mini Kubernetes % minikube start --vm-driver=docker --cpus=5 --memory=6000
minikube v1.33.0 on Darwin 14.5 (arm64)
minikube 1.33.1 is available! Download it: https://github.com/kubernetes/minikube/releases/tag/v1.33.1
To disable this notice, run: 'minikube config set WantUpdateNotification false'
Using the docker driver based on existing profile
Starting "minikube" primary control-plane node in "minikube" cluster
Pulling base image v0.0.43 ...
Restarting existing docker container for "minikube" ...
Preparing Kubernetes v1.30.0 on Docker 26.0.1 ...
Verifying Kubernetes components...
  - Using image gcr.io/k8s-minikube/storage-provisioner:v5
Enabled addons: default-storageclass, storage-provisioner
Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
You can also verify the same by running:
kubectl cluster-info
Expected output would be like this:
krrohit@Kumars-Mac-mini Kubernetes % kubectl cluster-info
Kubernetes control plane is running at https://127.0.0.1:49679
CoreDNS is running at https://127.0.0.1:49679/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
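If you only need the API server URL (we will use it shortly to build the Spark master URI), one way to print it, assuming your current kubeconfig context points at Minikube, is:

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'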
Do take note of the port number that you see in your Minikube setup (49679 in my case). We will need this port number when specifying the Spark master URI, and it changes every time you re-create the Minikube cluster.

Create a folder for your Spark application. You can later use this folder to manage your Spark application as you add more things to your project.
# create folder for managing spark application
mkdir spark_etl
cd spark_etl
Create a dummy Spark application file named spark_etl.py that creates a dummy dataframe and prints it to the console, as shown:

# spark_etl.py
from pyspark.sql import SparkSession


def etl():
    # Initialize a Spark session
    spark = SparkSession.builder \
        .appName("Dummy ETL Process") \
        .getOrCreate()

    # Create a dummy DataFrame
    data = [("Alice", 30), ("Bob", 28), ("Cathy", 25)]
    columns = ["Name", "Age"]
    df = spark.createDataFrame(data, columns)

    # Print the DataFrame
    df.show()

    # Stop the Spark session
    spark.stop()


def main():
    etl()


if __name__ == "__main__":
    main()
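Optionally, before containerizing the script, you can sanity-check it with a local-mode run using the Spark installation from the prerequisites (this runs on your machine, not on Kubernetes):

# run locally, using all available cores
${SPARK_HOME}/bin/spark-submit --master "local[*]" spark_etl.py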
Create a Dockerfile to ship our dummy application as a Docker image:

FROM bitnami/spark:3.5.0

WORKDIR /opt

COPY spark_etl.py /opt/
Create the Docker image named customspark:3.5.0 using the Dockerfile created above:

docker build -t customspark:3.5.0 .
Please note:

- You should be in the spark_etl folder (or whichever folder you created for your Spark application) before running the docker build command.
- If you are on a Mac with Apple Silicon (ARM-based architecture), you can tweak the docker build command like this:

docker buildx build --platform=linux/arm64 -t customspark:3.5.0 .
The output for the above command should look like this:
krrohit@Kumars-Mac-mini Kubernetes % docker buildx build --platform=linux/arm64 -t customspark:3.5.0 ./spark_etl
[+] Building 0.0s (8/8) FINISHED                              docker:desktop-linux
 => [internal] load build definition from Dockerfile                           0.0s
 => => transferring dockerfile: 179B                                           0.0s
 => [internal] load .dockerignore                                              0.0s
 => => transferring context: 2B                                                0.0s
 => [internal] load metadata for docker.io/bitnami/spark:3.5.0                 0.0s
 => [1/3] FROM docker.io/bitnami/spark:3.5.0                                   0.0s
 => [internal] load build context                                              0.0s
 => => transferring context: 71B                                               0.0s
 => CACHED [2/3] WORKDIR /opt                                                  0.0s
 => CACHED [3/3] COPY spark_etl.py /opt/                                       0.0s
 => exporting to image                                                         0.0s
 => => exporting layers                                                        0.0s
 => => writing image sha256:42e0f26cacf921e2ce7a60ffd692c0934fa8ec25b0dc16716dc68becd9bdd72e  0.0s
 => => naming to docker.io/library/customspark:3.5.0                           0.0s
krrohit@Kumars-Mac-mini Kubernetes %
You can check whether your image was built successfully using the docker images command:

docker images | grep spark
The output would look like this:
krrohit@Kumars-Mac-mini Kubernetes % docker images | grep spark
customspark     3.5.0   24570c0d4cfa   3 days ago     1.73GB
bitnami/spark   3.5.0   de9ced01ed7b   5 months ago   1.73GB
Also, note that if this is your first time building this image, Docker will download the bitnami/spark:3.5.0 base image, which might take some time depending on your network.
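If you prefer, you can pre-pull the base image so the docker build itself stays quick:

docker pull bitnami/spark:3.5.0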
Import the Docker image into Minikube using the commands below:
eval $(minikube docker-env)
The above command configures your shell to use the Docker daemon inside the Minikube virtual machine. Let's see which images are present by running docker images again:

krrohit@Kumars-Mac-mini Kubernetes % docker images
REPOSITORY                                TAG        IMAGE ID       CREATED         SIZE
registry.k8s.io/kube-apiserver            v1.30.0    181f57fd3cdb   3 months ago    112MB
registry.k8s.io/kube-controller-manager   v1.30.0    68feac521c0f   3 months ago    107MB
registry.k8s.io/kube-proxy                v1.30.0    cb7eac0b42cc   3 months ago    87.9MB
registry.k8s.io/kube-scheduler            v1.30.0    547adae34140   3 months ago    60.5MB
registry.k8s.io/etcd                      3.5.12-0   014faa467e29   6 months ago    139MB
registry.k8s.io/coredns/coredns           v1.11.1    2437cf762177   12 months ago   57.4MB
registry.k8s.io/pause                     3.9        829e9de338bd   22 months ago   514kB
gcr.io/k8s-minikube/storage-provisioner   v5         ba04bb24b957   3 years ago     29MB
But where is the customspark:3.5.0 image that we just built in the step above? Well, we just reconfigured our shell to point to the Docker daemon running inside Minikube (and not the Docker daemon installed on our machine). That is why we do not see the image we built with our machine's Docker daemon.

To get the image into Minikube, you can either pull it from Docker Hub or use the minikube image load command to copy it from our machine's Docker daemon. Let's do the latter and verify that the image shows up inside Minikube:

minikube image load customspark:3.5.0
Let's list the images and validate.
krrohit@Kumars-Mac-mini Kubernetes % minikube image load customspark:3.5.0
krrohit@Kumars-Mac-mini Kubernetes % docker images
REPOSITORY                                TAG        IMAGE ID       CREATED         SIZE
customspark                               3.5.0      0496e9178425   3 days ago      1.73GB
registry.k8s.io/kube-apiserver            v1.30.0    181f57fd3cdb   3 months ago    112MB
registry.k8s.io/kube-controller-manager   v1.30.0    68feac521c0f   3 months ago    107MB
registry.k8s.io/kube-proxy                v1.30.0    cb7eac0b42cc   3 months ago    87.9MB
registry.k8s.io/kube-scheduler            v1.30.0    547adae34140   3 months ago    60.5MB
registry.k8s.io/etcd                      3.5.12-0   014faa467e29   6 months ago    139MB
registry.k8s.io/coredns/coredns           v1.11.1    2437cf762177   12 months ago   57.4MB
registry.k8s.io/pause                     3.9        829e9de338bd   22 months ago   514kB
gcr.io/k8s-minikube/storage-provisioner   v5         ba04bb24b957   3 years ago     29MB
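As an alternative to minikube image load, you can skip the copy step entirely by building the image while your shell is pointed at Minikube's Docker daemon; the image is then created directly inside Minikube:

# build against Minikube's Docker daemon (run from the parent folder of spark_etl)
eval $(minikube docker-env)
docker build -t customspark:3.5.0 ./spark_etl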
To configure the shell to use the Docker daemon of your local host again, you can always run the eval $(minikube docker-env -u) command.

Running the application on Minikube
Spark expects a Kubernetes service account when we want to run it on Kubernetes. If we do not create one, we will see errors like:

External scheduler cannot be instantiated
io.fabric8.kubernetes.client.KubernetesClientException
Message: Forbidden! Configured service account doesn't have access. Service account may have been revoked. pods "spark-etl-py-eeae8c915332553f-driver" is forbidden: User "system:serviceaccount:default:default" cannot get resource "pods" in API group "" in the namespace "default".
To avoid these errors, we need to create a Kubernetes service account, and on Minikube that's very straightforward too:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
The commands would produce the below results:
krrohit@Kumars-Mac-mini Kubernetes % kubectl create serviceaccount spark
serviceaccount/spark created
krrohit@Kumars-Mac-mini Kubernetes % kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
clusterrolebinding.rbac.authorization.k8s.io/spark-role created
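To double-check that the new service account has the access Spark needs, you can query the API server with kubectl auth can-i (purely a verification step; both commands should print "yes"):

kubectl auth can-i create pods --as=system:serviceaccount:default:spark --namespace=default
kubectl auth can-i get pods --as=system:serviceaccount:default:spark --namespace=default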
Let us finally run our Spark application by issuing:
${SPARK_HOME}/bin/spark-submit --master k8s://https://127.0.0.1:49679 --deploy-mode cluster \
--conf spark.kubernetes.container.image=customspark:3.5.0 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.executor.instances=2 local:///opt/spark_etl.py
Breakdown of the above command:
- --master k8s://https://127.0.0.1:49679 : the URI of the Spark master. Note that it starts with k8s, signalling to Spark that we are using Kubernetes as our cluster manager. 49679 is the port on which the Kubernetes API server is listening in our Minikube cluster.
- --deploy-mode cluster : specifies that we intend to run spark-submit in cluster mode.
- --conf spark.kubernetes.container.image=customspark:3.5.0 : the image with which we launch our Spark application. Kubernetes uses this image to create the driver and executor pods for our application.
- --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark : specifies the service account to use while running the Spark application.
- --conf spark.executor.instances=2 : specifies the number of executors we want for our Spark application.
- local:///opt/spark_etl.py : local:/// specifies the location of our source code bundled in the Docker image we are using to run the Spark application. This can be changed to hdfs, s3a and even https based on the use case and environment setup.
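Other Spark settings can be passed in the same way with --conf. For example, if you wanted to size the driver and executors explicitly, a variant of the command could look like this (the memory and core values are only illustrative; pick numbers that fit the resources you gave Minikube):

${SPARK_HOME}/bin/spark-submit --master k8s://https://127.0.0.1:49679 --deploy-mode cluster \
    --conf spark.kubernetes.container.image=customspark:3.5.0 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.driver.memory=1g \
    --conf spark.executor.memory=1g \
    --conf spark.executor.cores=1 \
    --conf spark.executor.instances=2 local:///opt/spark_etl.py

For this demo, though, we stick with the simpler command shown earlier.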
If you followed the steps up to this point correctly and ran the first spark-submit command, you will see output like this:
krrohit@Kumars-Mac-mini Kubernetes % ${SPARK_HOME}/bin/spark-submit --master k8s://https://127.0.0.1:49679 --deploy-mode cluster \
--conf spark.kubernetes.container.image=customspark:3.5.0 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.executor.instances=2 local:///opt/spark_etl.py
24/08/15 08:38:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/08/15 08:38:10 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
24/08/15 08:38:11 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
24/08/15 08:38:12 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: spark-etl-py-da235991537879be-driver
namespace: default
labels: spark-app-name -> spark-etl-py, spark-app-selector -> spark-0fa73161ceb4468dbe5108f4fb11bb8f, spark-role -> driver, spark-version -> 3.5.1
pod uid: e2f44312-3c62-4038-a2ff-b627308efb1f
creation time: 2024-08-15T00:38:11Z
...
...
24/08/15 08:38:12 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: spark-etl-py-da235991537879be-driver
namespace: default
labels: spark-app-name -> spark-etl-py, spark-app-selector -> spark-0fa73161ceb4468dbe5108f4fb11bb8f, spark-role -> driver, spark-version -> 3.5.1
pod uid: e2f44312-3c62-4038-a2ff-b627308efb1f
creation time: 2024-08-15T00:38:11Z
...
...
24/08/15 08:38:21 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: spark-etl-py-da235991537879be-driver
namespace: default
labels: spark-app-name -> spark-etl-py, spark-app-selector -> spark-0fa73161ceb4468dbe5108f4fb11bb8f, spark-role -> driver, spark-version -> 3.5.1
pod uid: e2f44312-3c62-4038-a2ff-b627308efb1f
creation time: 2024-08-15T00:38:11Z
service account name: spark
volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-vwwss
node name: minikube
start time: 2024-08-15T00:38:11Z
phase: Succeeded
container status:
container name: spark-kubernetes-driver
container image: customspark:3.5.0
container state: terminated
container started at: 2024-08-15T00:38:12Z
container finished at: 2024-08-15T00:38:20Z
exit code: 0
termination reason: Completed
...
...
24/08/15 08:38:21 INFO LoggingPodStatusWatcherImpl: Application spark_etl.py with application ID spark-0fa73161ceb4468dbe5108f4fb11bb8f and submission ID default:spark-etl-py-da235991537879be-driver finished
24/08/15 08:38:21 INFO ShutdownHookManager: Shutdown hook called
24/08/15 08:38:21 INFO ShutdownHookManager: Deleting directory /private/var/folders/p0/c8q1_p9s4bs6x9tbt3fssj_h0000gn/T/spark-53578c84-1937-4792-8259-02b129224ed3
krrohit@Kumars-Mac-mini Kubernetes %
Let us verify the pods that were launched for running this application by running kubectl get pods:

krrohit@Kumars-Mac-mini Kubernetes % kubectl get pods
NAME                                   READY   STATUS      RESTARTS   AGE
spark-etl-py-0edeb59153728267-driver   0/1     Completed   0          1m18s
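Tip: if you re-run the job, you can watch the driver and executor pods appear and terminate in real time with the watch flag:

kubectl get pods -w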
But wait, where is the dataframe that we printed in spark_etl.py? And why do we not see the executors being created? Let's dig into the logs to check whether the requested number of executors (2 in our case) were launched. We can check the logs of the above driver pod using:
kubectl logs spark-etl-py-0edeb59153728267-driver
krrohit@Kumars-Mac-mini Kubernetes % kubectl logs spark-etl-py-0edeb59153728267-driver
spark 00:31:42.11 INFO  ==>
spark 00:31:42.11 INFO  ==> Welcome to the Bitnami spark container
spark 00:31:42.11 INFO  ==> Subscribe to project updates by watching https://github.com/bitnami/containers
spark 00:31:42.11 INFO  ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
spark 00:31:42.11 INFO  ==>
24/08/15 00:31:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/08/15 00:31:43 INFO SparkContext: Running Spark version 3.5.0
24/08/15 00:31:43 INFO SparkContext: OS info Linux, 6.3.13-linuxkit, aarch64
24/08/15 00:31:43 INFO SparkContext: Java version 17.0.10
24/08/15 00:31:43 INFO ResourceUtils: ==============================================================
24/08/15 00:31:43 INFO ResourceUtils: No custom resources configured for spark.driver.
24/08/15 00:31:43 INFO ResourceUtils: ==============================================================
24/08/15 00:31:43 INFO SparkContext: Submitted application: Dummy ETL Process
24/08/15 00:31:43 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
24/08/15 00:31:43 INFO ResourceProfile: Limiting resource is cpus at 1 tasks per executor
...
...
24/08/15 00:31:43 INFO Utils: Successfully started service 'SparkUI' on port 4040.
24/08/15 00:31:43 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
24/08/15 00:31:44 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes for ResourceProfile Id: 0, target: 2, known: 0, sharedSlotFromPendingPods: 2147483647.
...
...
24/08/15 00:31:45 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: No executor found for 10.244.0.18:54704
24/08/15 00:31:45 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: No executor found for 10.244.0.17:55718
24/08/15 00:31:46 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.17:55720) with ID 1, ResourceProfileId 0
24/08/15 00:31:46 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.18:54714) with ID 2, ResourceProfileId 0
...
...
24/08/15 00:31:48 INFO CodeGenerator: Code generated in 7.086 ms
+-----+---+
| Name|Age|
+-----+---+
|Alice| 30|
|  Bob| 28|
|Cathy| 25|
+-----+---+

...
...
24/08/15 00:31:48 INFO SparkContext: Successfully stopped SparkContext
24/08/15 00:31:49 INFO ShutdownHookManager: Shutdown hook called
24/08/15 00:31:49 INFO ShutdownHookManager: Deleting directory /var/data/spark-55b00b7e-b523-4c21-84e5-97a91dd6a04b/spark-5a9a19d0-5ab1-4208-8436-4d32d5a8f561
24/08/15 00:31:49 INFO ShutdownHookManager: Deleting directory /tmp/spark-e2ca15cb-ba11-4d2e-b8d5-63e47c7ebc0d
24/08/15 00:31:49 INFO ShutdownHookManager: Deleting directory /var/data/spark-55b00b7e-b523-4c21-84e5-97a91dd6a04b/spark-5a9a19d0-5ab1-4208-8436-4d32d5a8f561/pyspark-6c2382eb-b8c0-4f56-9f20-d2ee2cc64361
krrohit@Kumars-Mac-mini Kubernetes %
Aah! Now we can see our dataframe printed!

Also, if we look carefully, we will see in the same logs that 2 executors were registered (try searching for Registered executor in the logs):

Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.17:55720) with ID 1, ResourceProfileId 0
Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.18:54714) with ID 2, ResourceProfileId 0
But we still didn't see the pods on which the executors were launched! Well, by default the Spark property spark.kubernetes.executor.deleteOnTermination is true, which means that once the application is over, the executor pods are removed automatically. If, for debugging purposes, we want to see the logs of the executor pods, we can set this property to false so that the executor pods are not deleted behind the scenes. With this property, the spark-submit command would look like this:

${SPARK_HOME}/bin/spark-submit --master k8s://https://127.0.0.1:49679 --deploy-mode cluster \
    --conf spark.kubernetes.container.image=customspark:3.5.0 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.executor.deleteOnTermination=false \
    --conf spark.executor.instances=2 local:///opt/spark_etl.py

If you run spark-submit with the above command and check the pods at the end of the execution, you will see that this time the executor pods were not removed:

krrohit@Kumars-Mac-mini Kubernetes % kubectl get pods
NAME                                        READY   STATUS      RESTARTS   AGE
dummy-etl-process-48ad599153788a2d-exec-1   0/1     Completed   0          14s
dummy-etl-process-48ad599153788a2d-exec-2   0/1     Completed   0          14s
spark-etl-py-0edeb59153728267-driver        0/1     Completed   0          6m48s
spark-etl-py-da235991537879be-driver        0/1     Completed   0          17s
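With the executor pods retained, you can now inspect their logs as well, and delete them once you are done (the pod names below come from my run; yours will differ):

# inspect one executor's logs
kubectl logs dummy-etl-process-48ad599153788a2d-exec-1

# clean up the completed pods when finished
kubectl delete pod dummy-etl-process-48ad599153788a2d-exec-1 dummy-etl-process-48ad599153788a2d-exec-2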
The Dockerfile, spark_etl.py and all the commands used above are also documented in my GitHub repository, so feel free to check them out.
Summary
So yeah, that is all we need to run your first Spark application on Minikube. Not only did we manage to run a Spark application on Minikube successfully, but we also learnt how to check logs and manage executor pods.
If you read this article till the end and it added some value to you, please consider leaving feedback in the comments section.
We will be building more cool data platform stuff using these technologies in upcoming posts in this series.
Until then, cheers!
Written by Kumar Rohit
I am a Data Engineer by profession and a lifelong learner by passion. To begin, I'd like to share some of the problems that have kept me pondering for a while and the valuable lessons I have learnt along the way.