Achieving Zero-Downtime Live Migration for Stateful Workloads with Drafter: An In-Depth Guide

Dhairya Patel

The Challenge with Stateful Workloads in Cloud

In today's cloud-native world, managing stateful workloads like databases presents unique challenges. While solutions like Kubernetes excel at handling stateless applications, managing stateful services—especially when it comes to live migration between nodes or even different cloud providers—remains complex.

One particular pain point is running databases on spot instances. While spot instances offer significant cost savings (often 60-90% cheaper than regular instances), their temporary nature poses a major challenge for stateful applications. When a cloud provider reclaims a spot instance, you get only a short warning (about two minutes on AWS, about 30 seconds on GCP) to migrate your stateful workload without losing data or experiencing downtime.

Enter Drafter: A New Approach to VM Migration

Drafter, an open-source project by Loophole Labs, offers a novel solution to this challenge. It's a compute primitive that enables live migration of virtual machines between heterogeneous nodes with effectively zero downtime. Think of it as a way to "lift and shift" running applications—including their entire state—between different machines or even different cloud providers.

Key Features That Make Drafter Special

  1. Fast Migration: Drafter achieves migrations in under 100ms within the same data center and around 500ms for cross-continental migrations

  2. Heterogeneous Support: Works across different cloud providers and CPU models (with PVM support)

  3. No Hardware Virtualization Requirement: Using PVM (Pagetable-based Virtual Machine), Drafter can run on instances without hardware virtualization support

  4. OCI Image Support: Can run container images directly as VMs without nested virtualization overhead

A Practical Example: Migrating Redis Between GCP Instances

Let's walk through a real-world example of using Drafter to migrate a Redis instance between two Google Cloud Platform (GCP) virtual machines with zero downtime.

The Setup

Creating GCP Instances in Different Regions

First, let's create two instances in different regions: one in us-east1 (South Carolina) and another in europe-west2 (London).

Through GCP Console

  1. Source Instance (US East):

    • Go to Google Cloud Console → Compute Engine → VM Instances

    • Click "CREATE INSTANCE"

    • Configure:

        Name: redis-source
        Region: us-east1 (South Carolina)
        Zone: us-east1-b
        Machine configuration: c2-standard-4 (4 vCPU, 16 GB memory)
        Boot disk:
          - Operating System: Rocky Linux
          - Version: Rocky Linux 9
          - Size: 20 GB
      
  2. Destination Instance (London):

    • Same process but with:

        Name: redis-destination
        Region: europe-west2 (London)
        Zone: europe-west2-a
      
    • Keep all other settings identical to ensure compatibility

  3. Networking:

    • Create a Firewall rule for migration:

        Name: allow-drafter-migration
        Network: default
        Targets: Specified target tags
        Target tags: drafter-migration
        Source IP ranges: [Destination instance IP]/32 (the destination peer initiates the connection to port 1337 on the source)
        Protocols and ports: tcp:1337
      
    • Add network tags to both instances:

      • Edit each instance

      • Add network tag: drafter-migration

Using gcloud CLI

For those who prefer the command line:

# Create source instance in US East
gcloud compute instances create redis-source \
    --project=your-project-id \
    --zone=us-east1-b \
    --machine-type=c2-standard-4 \
    --network-interface=network-tier=PREMIUM,subnet=default \
    --maintenance-policy=MIGRATE \
    --provisioning-model=STANDARD \
    --service-account=your-service-account \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --tags=drafter-migration \
    --create-disk=auto-delete=yes,boot=yes,device-name=redis-source,image=rocky-linux-9-optimized-gcp,mode=rw,size=20,type=pd-balanced \
    --no-shielded-secure-boot \
    --shielded-vtpm \
    --shielded-integrity-monitoring \
    --reservation-affinity=any

# Create destination instance in London
gcloud compute instances create redis-destination \
    --project=your-project-id \
    --zone=europe-west2-a \
    --machine-type=c2-standard-4 \
    --network-interface=network-tier=PREMIUM,subnet=default \
    --maintenance-policy=MIGRATE \
    --provisioning-model=STANDARD \
    --service-account=your-service-account \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --tags=drafter-migration \
    --create-disk=auto-delete=yes,boot=yes,device-name=redis-destination,image=rocky-linux-9-optimized-gcp,mode=rw,size=20,type=pd-balanced \
    --no-shielded-secure-boot \
    --shielded-vtpm \
    --shielded-integrity-monitoring \
    --reservation-affinity=any

# Create firewall rule for migration
gcloud compute firewall-rules create allow-drafter-migration \
    --direction=INGRESS \
    --priority=1000 \
    --network=default \
    --action=ALLOW \
    --rules=tcp:1337 \
    --source-tags=drafter-migration \
    --target-tags=drafter-migration

Verifying Connectivity

After creating the instances:

  1. SSH into source instance:

     gcloud compute ssh redis-source --zone=us-east1-b
    
  2. Test connectivity to destination:

     # From source instance
     ping $(gcloud compute instances describe redis-destination \
         --zone=europe-west2-a \
         --format='get(networkInterfaces[0].networkIP)')
    
  3. Check latency:

     # This gives a rough round-trip estimate for migration time. Nothing listens on port 1337 yet, so a "connection refused" result is expected.
     time nc -zv $(gcloud compute instances describe redis-destination \
         --zone=europe-west2-a \
         --format='get(networkInterfaces[0].networkIP)') 1337
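To keep the measured latency around for later comparison, the average round-trip time can be pulled out of the `ping` summary. The following is a minimal sketch: the helper name `avg_rtt_ms` and the `DEST_IP` placeholder are illustrative, and the parsing assumes the summary line printed by the iputils `ping` on Linux.

```shell
# avg_rtt_ms: read `ping` output on stdin and print the average RTT in ms.
# Assumes the iputils summary line: "rtt min/avg/max/mdev = a/b/c/d ms".
avg_rtt_ms() {
    # Splitting on "/" puts the average value in the fifth field.
    awk -F'/' '/^rtt / { print $5 }'
}

# Example usage on the source instance (DEST_IP is a placeholder):
#   ping -c 5 "$DEST_IP" | avg_rtt_ms
```

The round-trip time is only a lower bound on what a migration can achieve, since the final switchover also has to transfer the remaining dirty state.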
    

Now we have two GCP instances running Rocky Linux 9: redis-source in us-east1-b and redis-destination in europe-west2-a.

Both instances will use PVM for cross-machine compatibility and Drafter for orchestrating the migration.

Step 1: Setting Up PVM Support

First, we need to prepare our instances with PVM support. On both instances:

# Update system packages
sudo dnf update -y

# Add the PVM repository for Rocky Linux 9 on GCP
sudo dnf config-manager --add-repo 'https://loopholelabs.github.io/linux-pvm-ci/rocky/gcp/repodata/linux-pvm-ci.repo'

# Install the PVM kernel package
sudo dnf install -y kernel-6.7.12_pvm_host_rocky_gcp-1.x86_64

# Set the PVM kernel as default and configure kernel parameters
sudo grubby --set-default /boot/vmlinuz-6.7.12-pvm-host-rocky-gcp
sudo grubby --copy-default --args="pti=off nokaslr lapic=notscdeadline" --update-kernel /boot/vmlinuz-6.7.12-pvm-host-rocky-gcp

# Regenerate initramfs
sudo dracut --force --kver 6.7.12-pvm-host-rocky-gcp

# Configure kernel module blacklisting
sudo tee /etc/modprobe.d/kvm-intel-amd-blacklist.conf <<EOF
blacklist kvm-intel
blacklist kvm-amd
EOF

# Enable PVM module
echo "kvm-pvm" | sudo tee /etc/modules-load.d/kvm-pvm.conf

# Reboot the instance
sudo reboot

# After the reboot, verify the PVM installation:
# Check if PVM module is loaded
lsmod | grep pvm

# Expected output should show:
# kvm_pvm               57344  0
# kvm                 1388544  1 kvm_pvm
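The manual check above can also be scripted. This is a small sketch, not part of Drafter: the helper name `pvm_ready` is made up for illustration, and it only inspects module names in `lsmod` output, using the module names that appear in the blacklist and expected output above.

```shell
# pvm_ready: read `lsmod` output on stdin; succeed only if kvm_pvm is loaded
# and neither kvm_intel nor kvm_amd is present.
pvm_ready() {
    awk '
        $1 == "kvm_pvm"                      { pvm = 1 }
        $1 == "kvm_intel" || $1 == "kvm_amd" { hw = 1 }
        END { exit (pvm && !hw) ? 0 : 1 }
    '
}

# Example usage after the reboot:
#   uname -r    # should print 6.7.12-pvm-host-rocky-gcp
#   lsmod | pvm_ready && echo "PVM host ready" || echo "PVM host NOT ready"
```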

Step 2: Installing Drafter Components

Drafter consists of several components that work together:

mkdir -p /tmp/drafter-install

# Download and install Drafter binaries
for BINARY in drafter-nat drafter-forwarder drafter-snapshotter drafter-packager drafter-runner drafter-registry drafter-mounter drafter-peer drafter-terminator; do
    echo "Downloading $BINARY..."
    curl -L -o "/tmp/drafter-install/${BINARY}" "https://github.com/loopholelabs/drafter/releases/latest/download/${BINARY}.linux-$(uname -m)"
    echo "Installing $BINARY..."
    sudo install -v "/tmp/drafter-install/${BINARY}" /usr/local/bin
done

# Download and install Firecracker with PVM support
for BINARY in firecracker jailer; do
    echo "Downloading $BINARY..."
    curl -L -o "/tmp/drafter-install/${BINARY}" "https://github.com/loopholelabs/firecracker/releases/download/release-main-live-migration-pvm/${BINARY}.linux-$(uname -m)"
    echo "Installing $BINARY..."
    sudo install -v "/tmp/drafter-install/${BINARY}" /usr/local/bin
done

# Configure sudo path to include /usr/local/bin
sudo tee /etc/sudoers.d/preserve_path << EOF
Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin
EOF

# Load NBD module with increased device count
sudo modprobe nbd nbds_max=4096

# Clean up temporary files
rm -rf /tmp/drafter-install

# Verify installations
for CMD in drafter-nat drafter-forwarder drafter-snapshotter drafter-packager drafter-runner drafter-registry drafter-mounter drafter-peer drafter-terminator firecracker jailer; do
    echo "Checking $CMD version..."
    $CMD --version || echo "$CMD version check not supported"
done
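Note that the `modprobe nbd nbds_max=4096` call above only lasts until the next reboot. One way to make it persistent is sketched below using the standard `modules-load.d` and `modprobe.d` mechanisms. The helper name `write_nbd_config` and its `ROOT` argument are illustrative; the argument exists only so the function can be tried against a scratch directory first.

```shell
# write_nbd_config ROOT: write the nbd boot configuration under ROOT.
# Pass "" as ROOT to write to the real /etc (requires root privileges).
write_nbd_config() {
    root="$1"
    mkdir -p "$root/etc/modules-load.d" "$root/etc/modprobe.d"
    # Load the nbd module at boot.
    echo "nbd" > "$root/etc/modules-load.d/nbd.conf"
    # Apply the same device count as the manual modprobe call.
    echo "options nbd nbds_max=4096" > "$root/etc/modprobe.d/nbd.conf"
}

# Example usage:
#   write_nbd_config /tmp/nbd-preview   # inspect the files first
#   write_nbd_config ""                 # as root: write to /etc
```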

Step 3: Preparing the Redis VM

On the source instance:

# Install the Redis package, which provides redis-cli
sudo dnf install -y redis

# Prepare the environment: create working directories
mkdir -p out/blueprint out/package out/instance-0/{overlay,state}

# Download DrafterOS with PVM support
curl -Lo out/drafteros-oci.tar.zst "https://github.com/loopholelabs/drafter/releases/latest/download/drafteros-oci-$(uname -m)_pvm.tar.zst"
# Download the Valkey OCI image (Valkey is a Redis-compatible fork, so redis-cli works against it)
curl -Lo out/oci-valkey.tar.zst "https://github.com/loopholelabs/drafter/releases/latest/download/oci-valkey-$(uname -m).tar.zst"

# Extract DrafterOS blueprint
drafter-packager --package-path out/drafteros-oci.tar.zst --extract --devices '[
  {
    "name": "kernel",
    "path": "out/blueprint/vmlinux"
  },
  {
    "name": "disk",
    "path": "out/blueprint/rootfs.ext4"
  }
]'

sudo drafter-packager --package-path out/oci-valkey.tar.zst --extract --devices '[
  {
    "name": "oci",
    "path": "out/blueprint/oci.ext4"
  }
]'
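Before moving on to the snapshot, it can help to confirm that all three blueprint files were actually extracted. The sketch below uses a hypothetical helper, `missing_files`, that is not part of Drafter; it simply lists any path that is not a non-empty regular file.

```shell
# missing_files: print every argument that is not a non-empty regular file.
missing_files() {
    for f in "$@"; do
        if [ -f "$f" ] && [ -s "$f" ]; then
            continue
        fi
        echo "$f"
    done
}

# Example usage (no output means all blueprint files are in place):
#   missing_files out/blueprint/vmlinux out/blueprint/rootfs.ext4 out/blueprint/oci.ext4
```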

Step 4: The Migration Process

The actual migration is remarkably simple:

  1. On source instance:

Network Setup and Snapshot Creation

# Open a new terminal. This starts NAT so the VMs have network connectivity.
# Replace eth0 with the host interface that should carry outgoing traffic from the VMs.
sudo drafter-nat --host-interface eth0
# Open another terminal. This creates the initial snapshot.
sudo drafter-snapshotter --netns ark0 --cpu-template T2 --devices '[
  {
    "name": "state",
    "output": "out/package/state.bin"
  },
  {
    "name": "memory",
    "output": "out/package/memory.bin"
  },
  {
    "name": "kernel",
    "input": "out/blueprint/vmlinux",
    "output": "out/package/vmlinux"
  },
  {
    "name": "disk",
    "input": "out/blueprint/rootfs.ext4",
    "output": "out/package/rootfs.ext4"
  },
  {
    "name": "config",
    "output": "out/package/config.json"
  },
  {
     "name": "oci",
     "input": "out/blueprint/oci.ext4",
     "output": "out/package/oci.ext4"
   }
]'

Starting Redis with Migration Support

# Open a new terminal. This starts the Redis VM from the snapshot and listens on port 1337 for a migration peer.
sudo drafter-peer --netns ark0 --raddr '' --laddr ':1337' --devices '[
  {
    "name": "state",
    "base": "out/package/state.bin",
    "overlay": "out/instance-0/overlay/state.bin",
    "state": "out/instance-0/state/state.bin",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  },
  {
    "name": "memory",
    "base": "out/package/memory.bin",
    "overlay": "out/instance-0/overlay/memory.bin",
    "state": "out/instance-0/state/memory.bin",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  },
  {
    "name": "kernel",
    "base": "out/package/vmlinux",
    "overlay": "out/instance-0/overlay/vmlinux",
    "state": "out/instance-0/state/vmlinux",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  },
  {
    "name": "disk",
    "base": "out/package/rootfs.ext4",
    "overlay": "out/instance-0/overlay/rootfs.ext4",
    "state": "out/instance-0/state/rootfs.ext4",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  },
  {
    "name": "config",
    "base": "out/package/config.json",
    "overlay": "out/instance-0/overlay/config.json",
    "state": "out/instance-0/state/config.json",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  },
  {
    "name": "oci",
    "base": "out/package/oci.ext4",
    "overlay": "out/instance-0/overlay/oci.ext4",
    "state": "out/instance-0/state/oci.ext4",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  }
]'
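The six device entries above differ only in their name and file path, so the JSON can also be generated instead of typed out. The following is a sketch under that assumption: the helper name `peer_devices` is illustrative and not part of Drafter, and the values are copied unchanged from the command above. A `blockSize` of 65536 is 64 KiB; `expiry` and `cycleThrottle` look like nanosecond durations (1 s and 500 ms), but that reading is an assumption.

```shell
# peer_devices N: print the --devices JSON array for instance N (0, 1, ...).
# Each input line is "<device name> <file name under out/package>".
peer_devices() {
    instance="$1"
    printf '%s\n' \
        "state state.bin" \
        "memory memory.bin" \
        "kernel vmlinux" \
        "disk rootfs.ext4" \
        "config config.json" \
        "oci oci.ext4" |
    awk -v dir="out/instance-${instance}" '
        BEGIN { print "[" }
        {
            if (NR > 1) print "  ,"
            print "  {"
            printf "    \"name\": \"%s\",\n", $1
            printf "    \"base\": \"out/package/%s\",\n", $2
            printf "    \"overlay\": \"%s/overlay/%s\",\n", dir, $2
            printf "    \"state\": \"%s/state/%s\",\n", dir, $2
            print  "    \"blockSize\": 65536,"
            print  "    \"expiry\": 1000000000,"
            print  "    \"maxDirtyBlocks\": 200,"
            print  "    \"minCycles\": 5,"
            print  "    \"maxCycles\": 20,"
            print  "    \"cycleThrottle\": 500000000,"
            print  "    \"makeMigratable\": true,"
            print  "    \"shared\": false"
            print  "  }"
        }
        END { print "]" }
    '
}

# Example usage on the source instance:
#   sudo drafter-peer --netns ark0 --raddr '' --laddr ':1337' --devices "$(peer_devices 0)"
```

Generating the list this way keeps the source and destination commands in sync, since only the instance number changes between them.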

Start Port Forwarding for Redis on the Source VM

# Open a new terminal. This sets up port forwarding for Redis.
sudo drafter-forwarder --port-forwards '[
  {
    "netns": "ark0",
    "internalPort": "6379",
    "protocol": "tcp",
    "externalAddr": "127.0.0.1:3333"
  }
]'

Connect to Redis and Add Sample Data

redis-cli -p 3333 ping
redis-cli -p 3333

# Once connected, you can run these commands:
SET message "Hello from Rocky Linux!"
SET user "rockyuser"
SET counter 42

# Check the data
KEYS *
GET message
GET user
GET counter

  2. On destination instance:

# Open a new terminal. This starts NAT so the VMs have network connectivity.
# Replace eth0 with the host interface that should carry outgoing traffic from the VMs.
sudo drafter-nat --host-interface eth0

Migration Process

# Open a new terminal. This starts the migration receiver (replace SOURCE_VM_IP with the source instance's IP address).
sudo drafter-peer --netns ark1 --raddr 'SOURCE_VM_IP:1337' --laddr '' --devices '[
  {
    "name": "state",
    "base": "out/package/state.bin",
    "overlay": "out/instance-1/overlay/state.bin",
    "state": "out/instance-1/state/state.bin",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  },
  {
    "name": "memory",
    "base": "out/package/memory.bin",
    "overlay": "out/instance-1/overlay/memory.bin",
    "state": "out/instance-1/state/memory.bin",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  },
  {
    "name": "kernel",
    "base": "out/package/vmlinux",
    "overlay": "out/instance-1/overlay/vmlinux",
    "state": "out/instance-1/state/vmlinux",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  },
  {
    "name": "disk",
    "base": "out/package/rootfs.ext4",
    "overlay": "out/instance-1/overlay/rootfs.ext4",
    "state": "out/instance-1/state/rootfs.ext4",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  },
  {
    "name": "config",
    "base": "out/package/config.json",
    "overlay": "out/instance-1/overlay/config.json",
    "state": "out/instance-1/state/config.json",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  },
  {
    "name": "oci",
    "base": "out/package/oci.ext4",
    "overlay": "out/instance-1/overlay/oci.ext4",
    "state": "out/instance-1/state/oci.ext4",
    "blockSize": 65536,
    "expiry": 1000000000,
    "maxDirtyBlocks": 200,
    "minCycles": 5,
    "maxCycles": 20,
    "cycleThrottle": 500000000,
    "makeMigratable": true,
    "shared": false
  }
]'

After Migration Completes

# Open a new terminal. This sets up port forwarding on the destination instance.
sudo drafter-forwarder --port-forwards '[
  {
    "netns": "ark1",
    "internalPort": "6379",
    "protocol": "tcp",
    "externalAddr": "127.0.0.1:3333"
  }
]'

The migration itself begins as soon as the destination's drafter-peer connects to the source; the drafter-peer logs on the source VM show that it has started.

Verify Data on the Destination VM

Open a new terminal, connect to Redis, and check that the data written on the source instance is still there:
```bash
redis-cli -p 3333 ping
redis-cli -p 3333

# Check the data
KEYS *
GET message
GET user
GET counter
```

During migration, Drafter:

  1. Creates a snapshot of the running VM

  2. Transfers the VM state between instances

  3. Resumes the VM on the destination

All of this happens while Redis keeps running and accepting connections.

The Technical Magic Behind It

Drafter achieves this feat through several innovative approaches:

  1. Custom Firecracker Fork: Uses an optimized version of AWS's Firecracker VMM

  2. PVM Integration: Enables migration between different CPU models

  3. Hybrid Migration Strategy: Uses both pre-copy and post-copy strategies for optimal performance

  4. Block-Level Synchronization: Efficiently transfers only changed memory blocks
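To build some intuition for the pre-copy phase, here is a toy model of when a migration might decide to pause the VM and transfer the last blocks. It is not Drafter's implementation. The rule it encodes (switch over after `minCycles` consecutive cycles at or below `maxDirtyBlocks`, or after `maxCycles` cycles in total) is an assumption inferred from the parameter names used in the drafter-peer commands, and the helper name `switchover_cycle` is made up.

```shell
# switchover_cycle MAX_DIRTY MIN_CYCLES MAX_CYCLES
# Reads one dirty-block count per pre-copy cycle on stdin and prints the
# cycle number at which this toy model would perform the final switchover.
switchover_cycle() {
    awk -v max_dirty="$1" -v min_cycles="$2" -v max_cycles="$3" '
        {
            cycle = NR
            # Count consecutive cycles at or below the dirty-block limit.
            calm = ($1 + 0 <= max_dirty + 0) ? calm + 1 : 0
            if (calm >= min_cycles + 0 || cycle >= max_cycles + 0) {
                print cycle
                done = 1
                exit
            }
        }
        END { if (!done) print "not-converged" }
    '
}

# Example usage with the values from this guide and made-up dirty counts:
#   printf '%s\n' 900 400 150 120 90 60 40 30 | switchover_cycle 200 5 20
```

A workload that keeps dirtying more blocks than the limit never reaches a calm streak, which is why a hard cap on the number of cycles is useful as a fallback.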

Real-World Applications

This technology opens up several interesting possibilities:

  1. Spot Instance Optimization: Run databases on spot instances, migrating before termination

  2. Cross-Cloud Migration: Move workloads between cloud providers without downtime

  3. Maintenance Windows: Eliminate downtime during hardware maintenance

  4. Geographic Optimization: Move workloads closer to users during peak times
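For the spot instance case, the trigger could come from the GCP metadata server, which exposes a `preempted` value for each instance. The sketch below assumes the workload runs on a Spot VM (the instances in this guide use the STANDARD provisioning model). The function names are illustrative, and `start_migration` is a hypothetical hook that would have to tell the destination to start its drafter-peer.

```shell
# preemption_pending VALUE: succeed when the metadata server reports TRUE.
preemption_pending() {
    [ "$1" = "TRUE" ]
}

# wait_for_preemption: block until GCP marks this instance as preempted.
# wait_for_change=true makes the metadata server hold the request open
# until the value changes.
wait_for_preemption() {
    while :; do
        value=$(curl -s -H "Metadata-Flavor: Google" \
            "http://metadata.google.internal/computeMetadata/v1/instance/preempted?wait_for_change=true")
        if preemption_pending "$value"; then
            return 0
        fi
        sleep 1
    done
}

# Example usage on the instance itself (start_migration is hypothetical):
#   wait_for_preemption && start_migration
```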

Benchmarks and Performance

In our Redis migration example:

  • Local data center migration: ~100ms

  • Cross-continental migration: ~500ms

  • Zero data loss during migration

  • Continuous availability of Redis service

Security Considerations

When implementing Drafter in production:

  1. Network Security:

    • Use TLS for migration traffic

    • Implement proper firewall rules

    • Consider VPC peering for cross-region migrations

  2. Access Control:

    • Use proper file permissions for VM images

    • Implement authentication for migration endpoints

    • Follow principle of least privilege

Future Possibilities

The implications of this technology are significant:

  1. Multi-Cloud Orchestration: Seamless workload movement between clouds

  2. Cost Optimization: Dynamic workload placement based on spot pricing

  3. Geographic Redundancy: Easy movement of stateful services between regions

  4. Development Workflows: Instant environment replication with state

Conclusion

Drafter represents a significant advancement in managing stateful workloads in cloud environments. By enabling effectively zero-downtime migration of entire VMs, it solves one of the most challenging aspects of cloud computing: managing state across heterogeneous infrastructure.

Whether you're looking to optimize costs with spot instances, implement cross-cloud strategies, or simply need a better way to manage stateful workloads, Drafter provides a powerful new tool in your cloud architecture toolkit.

Resources

My GitHub. Feel free to comment or connect with me if you run into any issues.

Remember: While the setup might seem complex, the benefits of zero-downtime migration and the ability to run stateful workloads on spot instances can lead to significant cost savings and operational improvements in your cloud infrastructure.

Written by

Dhairya Patel

A DevOps Engineer with 2 years of experience in infrastructure support (AWS, Linux, Azure) and DevOps (build, CI/CD, and release management)