JARK Stack for Gen AI on Kubernetes

Cerulean Cloud · 4 min read

You may have heard of the LAMP, MEAN, MERN, or even JAM stack. But what is this new JARK stack?

JARK stands for Jupyter, Argo, Ray, and Kubernetes, and it helps teams launch gen‑AI pipelines on production-ready infrastructure.

Let’s unpack this further!

The JARK stack brings together four open-source building blocks to manage end‑to‑end generative AI workloads:

  • JupyterHub: Multi-user notebook environment for experimentation, fine‑tuning, and analysis.

  • Argo Workflows: YAML‑defined DAG pipelines connecting notebook stages to training, packaging, and serving.

  • Ray & Ray Serve: Scalable distributed compute across GPU/CPU; Ray Serve handles high‑throughput inference.

  • Kubernetes (EKS): The orchestrator; handles scheduling, scaling, GPU node provisioning (often via Karpenter), multi-tenancy, and resilience.

This stack originated from AWS’s re:Invent 2023 presentation "Deploying Generative Models on Amazon EKS".

Why Teams Are Building JARK on EKS

Experimentation + Production in One Place

JupyterHub serves as a collaboration point for data scientists and ML engineers. They can prototype models interactively, tune hyperparameters, and version notebooks, all inside the same Kubernetes cluster used for production workloads.

Declarative Pipelines That Scale

With Argo Workflows, you can "define once, run anywhere." Want to run fine-tuning, validation, packaging, and deployment as steps? Argo manages them as a directed acyclic graph (DAG), with each step running in its own container and GPU scheduling supported.
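For illustration (the step names and images here are placeholders, not from the original post), a two-step DAG wired up in Argo might look like this:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: finetune-pipeline-     # illustrative name
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: fine-tune
            template: fine-tune
          - name: package
            template: package
            dependencies: [fine-tune]   # runs only after fine-tune succeeds
    - name: fine-tune
      container:
        image: rayproject/ray:2.9.1     # placeholder training image
        command: [python, -c, "print('fine-tuning step')"]
        resources:
          limits:
            nvidia.com/gpu: 1           # GPU request handed to the scheduler
    - name: package
      container:
        image: busybox
        command: [sh, -c, "echo packaging model artifacts"]

Each task runs in its own pod, so the fine-tune step can land on a GPU node while packaging runs on cheap CPU capacity.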

Distributed Compute & Scalable Serving

For model training and inference, Ray lets tasks scale across nodes. Ray Serve adds autoscaling, streaming, and robust APIs for serving large language models or vision engines.

Cloud-Native and Cost-Aware

EKS nodes can autoscale thanks to Karpenter or managed nodegroups. GPU nodes can spin up on demand, minimizing idle costs. Kubernetes also provides namespaces, RBAC, monitoring, and resource isolation.
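For example (the namespace and numbers are illustrative), a ResourceQuota can cap how many GPUs a single team's namespace may claim, which keeps one tenant from hogging expensive accelerators:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota               # hypothetical quota for a tenant namespace
  namespace: team-a             # hypothetical tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "2"   # at most two GPUs requested across the namespace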

Sample JARK stack deployment with EKS Auto Mode

At a high level, EKS Auto Mode hands node provisioning (built on Karpenter), core networking and storage capabilities, and node lifecycle management over to AWS, so you mainly define node pools and workloads.

Deploy an EKS Auto Mode cluster with autoModeConfig enabled:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: jark-autocluster
  region: us-west-2
  version: "1.31"    # Auto Mode requires Kubernetes 1.29 or later
autoModeConfig:
  enabled: true
  nodeRoleARN: arn:aws:iam::<account-id>:role/EKSAutoNodeRole
  nodePools:
    - general-purpose
    - system
vpc:
  nat:
    gateway: Single

Then:

eksctl create cluster -f eks-cluster-auto-mode.yaml

This provisions a cluster in which EKS Auto Mode (built on Karpenter) launches nodes on demand and keeps them patched by recycling them automatically (nodes have a default maximum lifetime of 21 days).

Custom Node Pool Creation

In Auto Mode you don't manage node groups directly; instead you define Karpenter NodePool resources (the successor to the older Provisioner CRD). To keep GPU workloads separate from general-purpose pods, define two node pools, each referencing Auto Mode's built-in default NodeClass (which already knows the cluster's subnets and security groups).

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: jark-gpu
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com    # EKS Auto Mode's built-in NodeClass
        kind: NodeClass
        name: default
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["g5.xlarge", "g4dn.xlarge"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
      taints:
        - key: "ray.io/node-type"   # keeps general workloads off GPU nodes
          value: "worker"
          effect: NoSchedule
  limits:
    cpu: "1000"
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s           # scale empty GPU nodes down quickly

And a CPU/general-purpose pool:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: jark-cpu
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["m5.xlarge", "m5.2xlarge"]
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s

Apply them:

kubectl apply -f gpu-nodepool.yaml
kubectl apply -f cpu-nodepool.yaml
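
Workloads that should land on the GPU pool now need a toleration for the ray.io/node-type taint plus a GPU request (so Karpenter knows to launch a GPU instance). A quick smoke-test pod, purely illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name, for illustration only
spec:
  restartPolicy: Never
  tolerations:
    - key: "ray.io/node-type"     # matches the taint on the jark-gpu pool
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

The Ray worker group defined later in this post carries the same toleration.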

Install JupyterHub, Argo and Ray

Install JupyterHub

Deploy JupyterHub with its official Helm chart (the same chart the AWS Data-on-EKS JupyterHub blueprint wraps):

helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
helm repo update
helm install jhub jupyterhub/jupyterhub \
  --namespace jupyterhub --create-namespace \
  --set singleuser.image.name=jupyter/minimal-notebook \
  --set singleuser.image.tag=latest \
  --set proxy.service.type=LoadBalancer

This sets up a multi-user JupyterHub behind a public LoadBalancer endpoint.

Install Argo Workflows

kubectl create namespace argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/stable/manifests/install.yaml

Verify that Argo is deployed, then port-forward the server to reach the UI at https://localhost:2746:

kubectl get pods -n argo
kubectl port-forward svc/argo-server -n argo 2746:2746
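
To smoke-test the controller (this workflow isn't part of the original walkthrough), submit a minimal Workflow and watch it complete in the UI:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-     # illustrative smoke test
  namespace: argo
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: busybox
        command: [echo, "hello from argo"]

Create it with kubectl create (Workflows use generateName, so kubectl apply won't work) and it should reach the Succeeded phase within a minute.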

Install Ray Operator and Ray Serve

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --namespace ray-system --create-namespace

Deploy a Ray cluster manifest (ray-cluster.yaml) that runs a Ray head node plus GPU workers for Serve:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: jark-ray
  namespace: default
spec:
  rayVersion: "2.9.1"
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.1
            ports:
              - containerPort: 8265   # Ray dashboard
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 1
      minReplicas: 0
      maxReplicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.1-gpu   # GPU-enabled Ray image matching rayVersion
              resources:
                limits:
                  nvidia.com/gpu: 1
          tolerations:
            - key: "ray.io/node-type"   # allows scheduling onto the tainted GPU pool
              operator: "Exists"
              effect: "NoSchedule"
Apply:

kubectl apply -f ray-cluster.yaml

Sample Model from Hugging Face & Serve Inference

Create a Ray Serve application

In serve_app.py:

from ray import serve
from starlette.requests import Request
from transformers import pipeline


@serve.deployment
class GenModel:
    def __init__(self):
        print("Loading Hugging Face model...")
        # Each replica loads its own copy of the GPT-2 text-generation pipeline
        self.model = pipeline("text-generation", model="gpt2")

    async def __call__(self, request: Request) -> dict:
        data = await request.json()
        output = self.model(data["prompt"], max_length=50)
        return {"result": output[0]["generated_text"]}

# Expose the deployment under /generate
app = GenModel.bind()

if __name__ == "__main__":
    import time

    serve.run(app, route_prefix="/generate")
    # Keep the container alive so Serve keeps handling requests
    while True:
        time.sleep(60)

Build the Docker image and push it to ECR (from a notebook or CI pipeline).
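
The Dockerfile itself isn't shown in the original post; a minimal sketch (assuming the Ray base image and a CPU-only gpt2 pipeline) could look like this:

# Assumed Dockerfile for the Serve app above; adjust versions to match your cluster
FROM rayproject/ray:2.9.1

# transformers + torch power the text-generation pipeline; ray[serve] pulls Serve's HTTP deps
RUN pip install --no-cache-dir "ray[serve]==2.9.1" transformers torch

WORKDIR /home/ray
COPY serve_app.py .

# Ray Serve listens on port 8000 by default, matching the Service below
CMD ["python", "serve_app.py"]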

docker build -t ghserve:latest -f Dockerfile .
aws ecr create-repository --repository-name ghserve
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin <account>.dkr.ecr.us-west-2.amazonaws.com
docker tag ghserve:latest <account>.dkr.ecr.us-west-2.amazonaws.com/ghserve:latest
docker push <account>.dkr.ecr.us-west-2.amazonaws.com/ghserve:latest

Deploy it via a Kubernetes manifest, ray-serve-app.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: serve-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: serve-app
  template:
    metadata:
      labels:
        app: serve-app
    spec:
      containers:
      - name: serve-app
        image: <account>.dkr.ecr.us-west-2.amazonaws.com/ghserve:latest
        ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: serve-service
spec:
  type: LoadBalancer
  selector:
    app: serve-app
  ports:
    - port: 80
      targetPort: 8000

Apply:

kubectl apply -f ray-serve-app.yaml

Test inference

curl http://$(kubectl get svc serve-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello world"}'

In summary, we now have:

  • An EKS cluster in Auto Mode, scaling nodes automatically via Karpenter.

  • Custom node pools for GPU worker isolation and cost‑effective scaling.

  • A functional JARK stack: JupyterHub, Argo Workflows, and Ray with Ray Serve.

  • A sample Hugging Face model served via Ray Serve, with inference exposed through a LoadBalancer.
