JARK Stack for Gen AI on Kubernetes


You have probably heard of the LAMP, MEAN, MERN, or even JAM stack. But what is this new JARK stack?
JARK stands for Jupyter, Argo, Ray, and Kubernetes. Together, these tools help teams run generative AI pipelines on production-ready infrastructure. Let's unpack that.
The JARK stack brings together four open-source building blocks to manage end‑to‑end generative AI workloads:
JupyterHub: Multi-user notebook environment for experimentation, fine‑tuning, and analysis.
Argo Workflows: YAML‑defined DAG pipelines connecting notebook stages to training, packaging, and serving.
Ray & Ray Serve: Scalable distributed compute across GPU/CPU; Ray Serve handles high‑throughput inference.
Kubernetes (EKS): The orchestrator; handles scheduling, scaling, GPU node provisioning (often via Karpenter), multi-tenancy, and resilience.
This stack originated from AWS’s re:Invent 2023 presentation "Deploying Generative Models on Amazon EKS".
Why Teams Are Building JARK on EKS
Experimentation + Production in One Place
JupyterHub serves as a collaboration point for data scientists and ML engineers. They can interactively prototype models, fine-tune hyperparameters, and version notebooks—all inside the same Kubernetes cluster used for production workloads.
Declarative Pipelines That Scale
With Argo Workflows, you can "define once, run anywhere." Want to run fine-tuning, validation, packaging, and deployment as steps? Argo manages it as a directed graph, running in containers with GPU scheduling support.
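To make that concrete, here is a minimal sketch of such a pipeline as an Argo Workflow. The stage names, placeholder image, and echo commands are illustrative assumptions rather than part of the original walkthrough; GPU stages would additionally request nvidia.com/gpu in the container resources.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: genai-pipeline-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: fine-tune
            template: step
            arguments:
              parameters: [{name: stage, value: fine-tune}]
          - name: validate
            dependencies: [fine-tune]
            template: step
            arguments:
              parameters: [{name: stage, value: validate}]
          - name: package
            dependencies: [validate]
            template: step
            arguments:
              parameters: [{name: stage, value: package}]
          - name: deploy
            dependencies: [package]
            template: step
            arguments:
              parameters: [{name: stage, value: deploy}]
    - name: step
      inputs:
        parameters:
          - name: stage
      container:
        # Placeholder image; a real pipeline would use stage-specific training/serving images
        image: alpine:3.19
        command: [echo, "running {{inputs.parameters.stage}}"]
Each DAG task only runs once its dependencies have succeeded, which is exactly the fine-tune → validate → package → deploy ordering described above.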
Distributed Compute & Scalable Serving
For model training and inference, Ray lets tasks scale across nodes. Ray Serve adds autoscaling, streaming, and robust APIs for serving large language models or vision engines.
Cloud-Native and Cost-Aware
EKS nodes can autoscale thanks to Karpenter or managed nodegroups. GPU nodes can spin up on demand, minimizing idle costs. Kubernetes also provides namespaces, RBAC, monitoring, and resource isolation.
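As a hedged illustration of that isolation (the namespace name and quota values below are made up for the example), each team can get its own namespace with a ResourceQuota that caps GPU consumption:
apiVersion: v1
kind: Namespace
metadata:
  name: team-genai               # illustrative team namespace
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-genai-quota
  namespace: team-genai
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "4"   # caps how many GPUs this team can request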
Sample JARK stack deployment with EKS Auto Mode
Below is a broad overview of the EKS Auto Mode cluster setup.
Deploy EKS Auto Mode Cluster with autoModeConfig
Create the cluster configuration (eks-cluster-auto-mode.yaml):
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: jark-autocluster
  region: us-west-2
  version: "1.29"
autoModeConfig:
  enabled: true
  nodeRoleARN: arn:aws:iam::<account-id>:role/EKSAutoNodeRole
  nodePools:
    - general-purpose
    - system
  managed: false
vpc:
  nat:
    gateway: Single
Then create the cluster:
eksctl create cluster -f eks-cluster-auto-mode.yaml
This provisions a cluster in which Karpenter auto-provisions node pools and handles node upgrades automatically (Auto Mode recycles nodes roughly every 21 days).
Custom Node Pool Creation
In Auto Mode, custom capacity is managed through Karpenter. To handle GPU workloads and keep them isolated from general-purpose and system pods, define two provisioner CRDs (recent Karpenter releases express the same idea with the NodePool API).
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: jark-gpu
spec:
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["g5.xlarge", "g4dn.xlarge"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
  provider:
    subnetSelector:
      karpenter.sh/discovery: <your-vpc>
    securityGroupSelector:
      kubernetes.io/cluster/jark-autocluster: shared
  taints:
    - key: "ray.io/node-type"
      value: "worker"
      effect: "NoSchedule"
  limits:
    resources:
      cpu: "1000"
      memory: "2000Gi"
  ttlSecondsAfterEmpty: 60
And a CPU/general-purpose provisioner:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: jark-cpu
spec:
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["m5.xlarge", "m5.2xlarge"]
  ttlSecondsAfterEmpty: 30
Apply them:
kubectl apply -f gpu-provisioner.yaml
kubectl apply -f cpu-provisioner.yaml
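To sanity-check that the GPU provisioner actually brings up capacity, you can schedule a throwaway pod that requests a GPU and tolerates the Ray taint. This smoke-test pod is my own addition, assuming the public nvidia/cuda image below is acceptable in your environment:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: "ray.io/node-type"      # matches the taint set by the jark-gpu provisioner
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1        # forces Karpenter to launch a GPU instance
While the pod is Pending, watching kubectl get nodes should show a g5 or g4dn node joining; once the pod finishes and the node sits empty, ttlSecondsAfterEmpty removes it again.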
Install JupyterHub, Argo and Ray
Install JupyterHub
Use the Helm chart from the AWS Data-on-EKS blueprint to deploy JupyterHub:
helm repo add data-on-eks https://awslabs.github.io/data-on-eks
helm repo update
helm install jhub data-on-eks/jupyterhub \
  --namespace jupyterhub --create-namespace \
  --set singleuser.image.name=jupyter/minimal-notebook \
  --set proxy.service.type=LoadBalancer
This sets up a multi-user JupyterHub with a public LoadBalancer endpoint.
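If notebook users need GPUs, the chart appears to follow the zero-to-jupyterhub values schema (judging by the singleuser and proxy settings above), so a GPU profile could be added via values. The snippet below is an illustrative sketch, not part of the original setup, and may need a toleration matching your GPU node taints:
singleuser:
  profileList:
    - display_name: "CPU notebook"
      default: true
    - display_name: "GPU notebook"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: "1"    # schedules the notebook pod onto a GPU-backed node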
Install Argo Workflows
kubectl create namespace argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/stable/manifests/install.yaml
Verify that Argo is deployed and port-forward the server UI:
kubectl get pods -n argo
kubectl port-forward svc/argo-server -n argo 2746:2746
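As a quick smoke test of the install (my own example, not from the original post), submit a trivial workflow and watch it reach Succeeded:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-jark-
  namespace: argo
spec:
  entrypoint: hello
  templates:
    - name: hello
      container:
        image: alpine:3.19
        command: [echo, "Argo Workflows is up"]
Submit it with kubectl create -n argo -f hello-workflow.yaml (a hypothetical filename) and check progress with kubectl get workflows -n argo.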
Install Ray Operator and Ray Serve
helm repo add ray https://ray-project.github.io/kuberay-helm/
helm repo update
helm install ray-operator ray/ray-operator --namespace ray-system --create-namespace
Deploy a Ray cluster manifest (ray-cluster.yaml) that runs a Ray head plus Serve workers on GPU nodes:
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: jark-ray
  namespace: default
spec:
  rayVersion: "2.9.1"
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:latest
            ports:
              - containerPort: 8265   # Ray dashboard
  workerGroupSpecs:
    - groupName: gpu-workers
      minReplicas: 0
      maxReplicas: 2
      rayStartParams: {}
      template:
        spec:
          tolerations:
            - key: "ray.io/node-type"
              operator: "Exists"
              effect: "NoSchedule"
          containers:
            # The KubeRay operator injects the ray start command into head and
            # worker pods, so no explicit args are needed here.
            - name: ray-worker
              image: rayproject/ray:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
Apply:
kubectl apply -f ray-cluster.yaml
Sample Model from Hugging Face & Serve Inference
Create a Ray Serve application
In serve_app.py:
import time
from ray import serve
from starlette.requests import Request
from transformers import pipeline

@serve.deployment
class GenModel:
    def __init__(self):
        # Load the Hugging Face model once per replica
        self.model = pipeline("text-generation", model="gpt2")

    async def __call__(self, request: Request):
        data = await request.json()
        return {"result": self.model(data["prompt"], max_length=50)[0]["generated_text"]}

# Bind the HTTP proxy to all interfaces so the Kubernetes Service can reach it,
# then deploy the application under /generate (Ray Serve 2.x API)
serve.start(http_options={"host": "0.0.0.0", "port": 8000})
serve.run(GenModel.bind(), route_prefix="/generate")

# Keep the driver process alive so the container keeps serving
while True:
    time.sleep(60)
Build the Docker image and push it to ECR (inside a notebook or CI pipeline):
docker build -t ghserve:latest -f Dockerfile .
aws ecr create-repository --repository-name ghserve
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin <account>.dkr.ecr.us-west-2.amazonaws.com
docker tag ghserve:latest <account>.dkr.ecr.us-west-2.amazonaws.com/ghserve:latest
docker push <account>.dkr.ecr.us-west-2.amazonaws.com/ghserve:latest
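The Dockerfile referenced above is not shown in the walkthrough; a minimal sketch, assuming a plain Python base image and pinning Ray to the cluster's version, could look like this:
FROM python:3.10-slim
# Ray Serve plus the Hugging Face dependencies used by serve_app.py
RUN pip install --no-cache-dir "ray[serve]==2.9.1" "transformers>=4.36" torch
WORKDIR /app
COPY serve_app.py .
# Ray Serve's HTTP proxy listens on 8000, matching the Deployment's containerPort
EXPOSE 8000
CMD ["python", "serve_app.py"]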
Deploy it via a Kubernetes manifest, ray-serve-app.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: serve-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: serve-app
  template:
    metadata:
      labels:
        app: serve-app
    spec:
      containers:
        - name: serve-app
          image: <account>.dkr.ecr.us-west-2.amazonaws.com/ghserve:latest
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: serve-service
spec:
  type: LoadBalancer
  selector:
    app: serve-app
  ports:
    - port: 80
      targetPort: 8000
Apply:
kubectl apply -f ray-serve-app.yaml
Test inference
curl http://$(kubectl get svc serve-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')/generate \
-H "Content-Type: application/json" \
-d '{"prompt":"Hello world"}'
In summary, we now have:
An EKS cluster in Auto Mode, scaling nodes automatically via Karpenter.
Custom provisioners for GPU worker isolation and cost‑effective scaling.
A functional JARK stack: JupyterHub, Argo, and Ray Serve.
A sample Hugging Face model served via Ray, with inference exposed through LoadBalancer.