Host Setup for QEMU/KVM GPU Passthrough with VFIO on Linux


GPU passthrough shouldn't feel like sorcery. If you've ever lost a weekend to half-working configs, random resets, or a guest that only boots when the moon is right, this guide is for you. I have pulled out a lot of hair while hardening the CloudRift VM service for a variety of consumer (RTX 4090, 5090, PRO 6000) and data center (H100, B200) GPUs, so I wrote this guide to help you avoid the most common pitfalls.
I'll focus specifically on the host node configuration for GPU passthrough. Thus, this guide is relevant regardless of whether you're using Proxmox or plain libvirt/QEMU. The provided instructions have been tested on Ubuntu 22.04 and 24.04 with various NVIDIA GPUs.
To keep this guide manageable, I won't delve into lower-level details, such as specific domain XML tricks, Linux kernel builds, or GPU firmware flashing. In most cases, you don't need to fiddle with those.
1. Remove NVIDIA Drivers
The first step is to remove the NVIDIA drivers. This is not strictly required, but NVIDIA drivers tend to cause issues with passthrough in one way or another, so it's better to remove them altogether.
If you're configuring your own work PC with multiple GPUs, skip this step: without the NVIDIA drivers you won't be able to run GPU-accelerated UI applications, and passthrough robustness is likely not a priority for you in that case. However, I strongly recommend removing the NVIDIA drivers on headless servers.
If the NVIDIA driver is installed from the repository, you can remove it using the following commands:
sudo apt-get remove --purge '^nvidia-.*'
sudo apt autoremove
If you've installed the driver using the RUN file, remove it using:
sudo /usr/bin/nvidia-uninstall
Remove leftover configuration files, if any:
sudo rm -rf /etc/X11/xorg.conf
sudo rm -rf /etc/modprobe.d/nvidia*.conf
sudo rm -rf /lib/modprobe.d/nvidia*.conf
Reboot the system after the driver removal:
sudo reboot
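After the reboot, it's worth a quick sanity check that no NVIDIA or nouveau modules are loaded and no driver packages are left behind (exact output will vary by system):
lsmod | grep -Ei 'nvidia|nouveau' || echo "no NVIDIA/nouveau modules loaded"
dpkg -l | grep -i nvidia-driver || echo "no NVIDIA driver packages installed"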
2. Check BIOS, IOMMU Support and IOMMU Group Assignment
The next step is to check virtualization and IOMMU support. We need to check four things:
Virtualization is enabled (the AMD-Vi / Intel VT-d option is enabled in the BIOS). If present, enable the "Above 4G Decoding" and "Resizable BAR (ReBAR)" options in the BIOS as well.
IOMMU is active (groups exist).
Each GPU and its audio function are isolated in their own IOMMU group.
GPU groups contain only GPU/video-audio functions and PCI bridges — no NICs, NVMe, SATA, etc.
You can use the following handy-dandy script to check those preconditions.
AI goes overboard when generating helper scripts, doesn't it? I can't complain, though. It provides a lot of useful information.
#!/usr/bin/env bash
# VFIO host sanity check: IOMMU support + GPU-containing groups
set -u  # don't use -e so greps that find nothing don't abort

# --- helpers ---------------------------------------------------------------
have() { command -v "$1" >/dev/null 2>&1; }

read_klog() {
  if have journalctl; then journalctl -k -b 0 2>/dev/null
  else dmesg 2>/dev/null
  fi
}

trim() { sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'; }

# --- 1) CPU vendor + boot flags --------------------------------------------
CPU_VENDOR="$(
  (lscpu 2>/dev/null | awk -F: '/Vendor ID/{print $2}' | trim) ||
  (grep -m1 'vendor_id' /proc/cpuinfo 2>/dev/null | awk '{print $3}')
)"
[ -z "${CPU_VENDOR}" ] && CPU_VENDOR="(unknown)"

CMDLINE="$(cat /proc/cmdline 2>/dev/null || echo '')"
HAS_INTEL_FLAG=$(echo "$CMDLINE" | grep -q 'intel_iommu=on' && echo yes || echo no)
HAS_AMD_FLAG=$(echo "$CMDLINE" | grep -q 'amd_iommu=on' && echo yes || echo no)
HAS_PT_FLAG=$(echo "$CMDLINE" | grep -q 'iommu=pt' && echo yes || echo no)

# --- 2) Kernel log signals --------------------------------------------------
KLOG="$(read_klog)"
DISABLED_MSG=$(echo "$KLOG" | grep -Ei 'IOMMU.*disabled by BIOS|DMAR:.*disabled|AMD-Vi:.*disabled' || true)
ENABLED_MSG=$(echo "$KLOG" | grep -Ei 'DMAR: IOMMU enabled|AMD-Vi:.*IOMMU.*enabled|IOMMU: .*enabled' || true)
IR_MSG=$(echo "$KLOG" | grep -Ei 'Interrupt remapping enabled' || true)

# --- 3) IOMMU groups presence -----------------------------------------------
GROUPS_DIR="/sys/kernel/iommu_groups"
GROUP_COUNT=0
if [ -d "$GROUPS_DIR" ]; then
  GROUP_COUNT=$(find "$GROUPS_DIR" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | wc -l | awk '{print $1}')
fi

# Heuristic: active if groups exist (>0). Logs help explain state.
IOMMU_ACTIVE="no"
[ "$GROUP_COUNT" -gt 0 ] && IOMMU_ACTIVE="yes"

# --- 4) Report summary ------------------------------------------------------
echo "=== IOMMU Summary ==="
echo "CPU vendor : $CPU_VENDOR"
echo "Kernel cmdline : $CMDLINE"
echo "Boot flags : intel_iommu=$HAS_INTEL_FLAG amd_iommu=$HAS_AMD_FLAG iommu=pt=$HAS_PT_FLAG"
echo "Groups directory : $GROUPS_DIR (exists: $([ -d "$GROUPS_DIR" ] && echo yes || echo no))"
echo "IOMMU group count : $GROUP_COUNT"
echo "Kernel says enabled : $([ -n "$ENABLED_MSG" ] && echo yes || echo no)"
echo "Interrupt remapping : $([ -n "$IR_MSG" ] && echo yes || echo no)"
echo "Kernel says disabled : $([ -n "$DISABLED_MSG" ] && echo yes || echo no)"
echo "IOMMU ACTIVE? : $IOMMU_ACTIVE"
echo

if [ -n "$ENABLED_MSG" ]; then
  echo "--- Kernel enable lines ---"
  echo "$ENABLED_MSG"
  echo
fi
if [ -n "$DISABLED_MSG" ]; then
  echo "--- Kernel disable lines ---"
  echo "$DISABLED_MSG"
  echo
fi

# --- 5) List only GPU-containing groups --------------------------------------
echo "=== GPU-Containing IOMMU Groups ==="
if [ ! -d "$GROUPS_DIR" ] || [ "$GROUP_COUNT" -eq 0 ]; then
  echo "(no IOMMU groups found)"
else
  declare -A GPU_COUNT_BY_GROUP=()
  group_warnings=()
  for g in "$GROUPS_DIR"/*; do
    [ -d "$g" ] || continue
    group_num=$(basename "$g")
    gpu_found=false
    device_lines=""
    non_gpu_non_bridge=false
    gpu_count_in_this_group=0
    for d in "$g"/devices/*; do
      [ -e "$d" ] || continue
      pci_addr=$(basename "$d")
      # -nns prints class code [XXXX] and vendor:device [vvvv:dddd]
      line=$(lspci -nns "$pci_addr" 2>/dev/null || echo "$pci_addr (unlisted)")
      device_lines+="$line"$'\n'
      # Extract first [...] which is the class code, e.g. 0300, 0302, 0403, 0604, 0600
      class_code=$(echo "$line" | awk -F'[][]' '{print $2}')
      # Detect GPUs / 3D controllers and their HDA audio functions
      if echo "$line" | grep -qE 'VGA compatible controller|3D controller'; then
        gpu_found=true
        gpu_count_in_this_group=$((gpu_count_in_this_group+1))
      fi
      # Allowlist: 0300(VGA), 0302(3D), 0403(HDA audio), 0600(host bridge), 0604(PCI bridge)
      case "$class_code" in
        0300|0302|0403|0600|0604) : ;;
        *) non_gpu_non_bridge=true ;;
      esac
    done
    if $gpu_found; then
      echo "IOMMU Group $group_num:"
      echo "$device_lines"
      # Track GPUs per group
      GPU_COUNT_BY_GROUP["$group_num"]=$gpu_count_in_this_group
      # Warn if unexpected devices share the group with the GPU
      if $non_gpu_non_bridge; then
        group_warnings+=("WARN: Group $group_num contains non-GPU, non-audio, non-bridge devices (consider different slot/CPU root complex or ACS).")
      fi
    fi
  done
  # Post-checks
  # 1) Each GPU should be alone (one GPU per group)
  shared_groups=()
  for gnum in "${!GPU_COUNT_BY_GROUP[@]}"; do
    if [ "${GPU_COUNT_BY_GROUP[$gnum]}" -gt 1 ]; then
      shared_groups+=("$gnum")
    fi
  done
  if [ "${#shared_groups[@]}" -gt 0 ]; then
    echo
    echo "WARN: Multiple GPUs share these IOMMU groups: ${shared_groups[*]} (prefer one GPU per group for VFIO)."
  fi
  # 2) Any non-bridge co-residents?
  if [ "${#group_warnings[@]}" -gt 0 ]; then
    echo
    printf "%s\n" "${group_warnings[@]}"
  fi
fi
Here is what a good summary should look like:
=== IOMMU Summary ===
CPU vendor : AuthenticAMD
Kernel cmdline : BOOT_IMAGE=/boot/vmlinuz-6.8.0-71-generic root=/dev/mapper/vgroot-lvroot ro systemd.unified_cgroup_hierarchy=false default_hugepagesz=1G hugepages=576 hugepagesz=1G nomodeset video=efifb:off iommu=pt pci=realloc pcie_aspm=off amd_iommu=on vfio-pci.ids=10de:0000,10de:204b,10de:22e8,10de:2bb1 modprobe.blacklist=nouveau,nvidia,nvidiafb,snd_hda_intel
Boot flags : intel_iommu=no amd_iommu=yes iommu=pt=yes
Groups directory : /sys/kernel/iommu_groups (exists: yes)
IOMMU group count : 57
Kernel says enabled : no
Interrupt remapping : no
Kernel says disabled : no
IOMMU ACTIVE? : yes
=== GPU-Containing IOMMU Groups ===
IOMMU Group 13:
c1:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2bb1] (rev a1)
c1:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)
IOMMU Group 16:
c6:00.0 PCI bridge [0604]: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge [1a03:1150] (rev 06)
c7:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 52)
IOMMU Group 27:
81:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2bb1] (rev a1)
81:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)
IOMMU Group 42:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2bb1] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)
IOMMU Group 54:
41:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2bb1] (rev a1)
41:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)
As we can see, IOMMU support is enabled, and all GPUs and their corresponding audio devices are in separate IOMMU groups.
Sometimes you may see PCI bridges in the GPU IOMMU group. This is normal.
=== GPU-Containing IOMMU Groups ===
IOMMU Group 13:
40:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
40:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
41:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2b85] (rev a1)
41:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)
IOMMU Group 32:
20:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
20:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
25:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2b85] (rev a1)
25:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)
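If you'd rather not run the full script, a quick sysfs loop (a minimal sketch, assuming pciutils is installed) gives you the same group-to-device mapping for manual inspection:
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=$(basename "$(dirname "$(dirname "$d")")")
    printf 'group %s: %s\n' "$g" "$(lspci -nns "$(basename "$d")")"
done | sort -V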
3. Leverage 1G Huge Pages
This step is optional. However, if you have more than 512 GB of RAM in your system, it is highly encouraged. In my experience, aside from the performance benefit, 1 GiB huge pages make VM startup much more reliable on high-memory systems.
Rule of thumb
< 128 GB RAM: usually skip (benefit is small).
128–512 GB: optional; can reduce latency jitter.
> 512 GB: recommended for reliability and predictable performance.
Why 1 GiB pages help
Fewer page-table walks → fewer TLB misses.
Lower page management overhead.
More predictable VM start times on large RAM allocations.
3.1 Check Huge Page Support
To confirm 1G huge page support on your system, check the pdpe1gb CPU flag.
grep -m1 pdpe1gb /proc/cpuinfo >/dev/null && echo "✓ CPU supports 1GiB pages" || echo "✗ No 1GiB page support"
3.2 Allocate Huge Pages
Determine how much memory you want to reserve for the VMs. You need to reserve that much memory for huge pages plus a buffer.
Note that the memory reserved for huge pages will not be usable on the host system.
For example, if you want to dedicate 2000 GB to virtual machines with an 80 GB buffer, you would need 2080 huge pages.
I use the following empirically validated table to determine the huge page configuration on a high-memory multi-GPU system.
Total System RAM | VM Allocation | Buffer | Huge Pages | Left for System |
768 GB | 640 (8x80) GB | 60 GB | 700 | 68 GB |
1024 GB | 800 (8x100) GB | 80 GB | 880 | 144 GB |
1256 GB | 1040 (8x130) GB | 100 GB | 1140 | 116 GB |
1512 GB | 1280 (8x160) GB | 120 GB | 1300 | 212 GB |
2048 GB | 1760 (8x220) GB | 160 GB | 1920 | 128 GB |
4096 GB | 3680 (8x460) GB | 200 GB | 3880 | 216 GB |
Is there a reliable formula to determine the huge page buffer size? Good question. If you know one, let me know in the comments. It makes sense that we need to leave some memory for the system, but the gap between the memory dedicated to VM allocation and the number of huge pages feels unnecessary. After VM startup, the system reports exactly the requested number of huge pages allocated, so why do we need a buffer, and how big should it be? Is it because of fragmentation? Empirically, I've confirmed that it is needed: without a buffer, I occasionally ran into OOM errors.
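For what it's worth, here is the arithmetic behind the table as a small sketch (the helper name is made up; it just adds the VM allocation and buffer and shows what's left for the host):
# Given total RAM, desired VM allocation, and buffer (all in GiB, 1 GiB pages),
# print the huge page count and what remains for the host.
hugepages_plan() {
    local total_gb=$1 vm_gb=$2 buffer_gb=$3
    echo "nr_hugepages=$(( vm_gb + buffer_gb ))  left_for_host=$(( total_gb - vm_gb - buffer_gb )) GiB"
}
hugepages_plan 1024 800 80   # nr_hugepages=880  left_for_host=144 GiB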
Run the following command to allocate the 2080 pages from the example above (it will take a while):
echo 2080 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
To check that huge pages were allocated, run grep -i huge /proc/meminfo. Look at the Hugepagesize and Hugetlb values: they show the huge page size and the total amount of RAM reserved for huge pages. You should see output like this:
AnonHugePages: 79872 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 2080
HugePages_Free: 1580
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 2181038080 kB
To deallocate, invoke:
echo 0 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
3.3 Make Huge Pages Persistent
Edit the /etc/default/grub file and modify the line containing GRUB_CMDLINE_LINUX. Add default_hugepagesz=1G hugepagesz=1G hugepages=<num> to the GRUB_CMDLINE_LINUX options, where <num> is the number of huge pages to allocate. For example:
GRUB_CMDLINE_LINUX="... default_hugepagesz=1G hugepagesz=1G hugepages=200"
Be careful. If you specify more huge pages than the system can allocate, the machine will not boot.
Apply the GRUB changes, reboot, and verify that the huge pages are allocated (or leave this for the end, after all kernel command-line changes).
sudo update-grub
sudo reboot
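After the reboot, confirm that the flags made it onto the kernel command line and that the pool was actually reserved:
cat /proc/cmdline | tr ' ' '\n' | grep -i huge
grep -E 'HugePages_Total|Hugepagesize' /proc/meminfo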
3.4 (Optional) Mount Huge Page Table
Many systems already have /dev/hugepages. If not, or if you want a dedicated mount:
sudo mkdir -p /mnt/hugepages-1G
sudo mount -t hugetlbfs -o pagesize=1G none /mnt/hugepages-1G
Check that the mount point is present by running grep hugetlbfs /proc/mounts. You should see something like:
hugetlbfs /dev/hugepages hugetlbfs rw,nosuid,nodev,relatime,pagesize=1024M 0 0
hugetlbfs /mnt/hugepages-1G hugetlbfs rw,relatime,pagesize=1024M 0 0
To persist the mount across reboots, add an entry to /etc/fstab:
echo "none /mnt/hugepages-1G hugetlbfs pagesize=1G 0 0" | sudo tee -a /etc/fstab
3.5 Configure your Virtualization Software to use Huge Pages
Neither Proxmox nor libvirt uses huge pages by default. To use them with libvirt, add the following section to the domain XML:
<memoryBacking>
<hugepages>
<page size='1048576' unit='KiB'/>
</hugepages>
<locked/>
</memoryBacking>
In Proxmox CLI you do it as follows:
qm set <vmid> --hugepages 1024 # use 1GiB pages
qm set <vmid> --keephugepages 1 # optional: keep reserved after shutdown
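Either way, after starting a VM you can confirm it is really backed by the 1 GiB pool by watching the free-page counters drop in /proc/meminfo:
grep -E 'HugePages_(Total|Free|Rsvd)' /proc/meminfo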
4. Bind to VFIO Early
For maximum stability, have VFIO claim the GPU at boot so no runtime driver swaps occur (Proxmox/libvirt will otherwise bind/unbind around VM start/stop).
4.1 Identify the PCI IDs to bind
First, you need to determine the PCI vendor ID and device ID for your GPUs.
List all NVIDIA functions (display + audio, and any auxiliary functions):
lspci -nn | grep -i nvidia
Example (RTX 5090):
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2b85] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)
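If you have several GPU models in the box, this sketch collects all unique NVIDIA vendor:device pairs into a ready-to-paste vfio-pci.ids value (assumes GNU grep and paste):
lspci -nn | grep -i nvidia \
    | grep -oE '\[10de:[0-9a-f]{4}\]' \
    | tr -d '[]' | sort -u | paste -sd, -
# e.g. 10de:22e8,10de:2b85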
4.2 Give VFIO first claim
Add the following options to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, replacing the PCI vendor ID and device ID with the appropriate values. Keep other options if needed.
GRUB_CMDLINE_LINUX_DEFAULT="modprobe.blacklist=nouveau,nvidia,nvidiafb,snd_hda_intel vfio-pci.ids=10de:2b85,10de:22e8 ..."
Proxmox is likely using systemd-boot by default instead of GRUB. Check the bootloader you're using and adjust the kernel command line accordingly.
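On a systemd-boot Proxmox host, the rough equivalent (double-check against the Proxmox documentation for your version) is to append the same options to the single line in /etc/kernel/cmdline and then refresh the boot entries:
proxmox-boot-tool refresh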
Many online manuals suggest configuring VFIO through /etc/modprobe.d/vfio.conf, but this approach has not always worked for me. I recommend early binding via the kernel command line.
4.3 Ensure VFIO is in the initramfs
We need to make sure that the VFIO modules are loaded early in the boot process. To achieve this, we include them in the initramfs. (On newer kernels, vfio_virqfd has been merged into the core vfio module, so that entry can be omitted there.)
sudo tee -a /etc/initramfs-tools/modules >/dev/null <<'EOF'
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
EOF
4.4 Reboot and verify
Update the initramfs and GRUB, then reboot.
sudo update-initramfs -u -k all
sudo update-grub
sudo reboot
After the reboot, check that the VFIO driver is in use. Run lspci -k | grep -A 3 -i nvidia; you should see vfio-pci listed as the kernel driver in use:
81:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 (rev a1)
Subsystem: Gigabyte Technology Co., Ltd Device 416f
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
81:00.1 Audio device: NVIDIA Corporation Device 22e8 (rev a1)
Subsystem: NVIDIA Corporation Device 0000
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
To be fair, there was one machine where this technique to bind VFIO failed: the system aggressively bound the snd_hda_intel driver to the GPU audio function. However, this method worked for me in all other cases.
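If you run into the same fight with snd_hda_intel, one workaround worth trying is a sysfs driver_override followed by a re-probe (a hedged sketch; replace the address with your audio function's):
ADDR=0000:81:00.1                                             # hypothetical audio function address
echo vfio-pci | sudo tee /sys/bus/pci/devices/$ADDR/driver_override
echo "$ADDR" | sudo tee /sys/bus/pci/devices/$ADDR/driver/unbind 2>/dev/null || true   # detach current driver, if any
echo "$ADDR" | sudo tee /sys/bus/pci/drivers_probe            # rebind, honoring the override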
5. Other GRUB Options
Here is a summary of other kernel command line options that you may want to consider, along with my thoughts on each.
pci=realloc: Forces the kernel to reassign PCI bus resources (MMIO/IO BARs) from scratch, ignoring what the firmware/BIOS assigned. It helps avoid issues when the BIOS didn't allocate enough space for devices (common with large GPUs or multiple devices) and fixes "BAR can't be assigned" or "resource busy" errors. This option is helpful; I like to include it in the guest OS kernel params as well, where it occasionally helps to work around BAR allocation issues. However, there is no need to list it unless the system has PCI device enumeration issues.
iommu=pt: IOMMU passthrough mode. Tells the kernel to enable the IOMMU but use pass-through mode for DMA mappings by default. For VFIO GPU passthrough it lets the device access physical memory directly with minimal performance penalty. I haven't had a chance to measure the performance gains, so I can only say that this option didn't break anything.
pcie_aspm=off: Disables PCIe Active State Power Management, a power-saving feature that reduces PCIe link power in idle states. Some PCIe devices (especially GPUs) have trouble retraining links or waking from ASPM low-power states, leading to hangs or "device inaccessible" errors. This option made it into my configs after I lost a lot of time on the Reset Bug. It didn't help. I don't consider it helpful at the moment, but I am still evaluating it.
nomodeset: Disables kernel mode setting (KMS) for all GPUs and prevents DRM drivers from taking over the console. It is intended for headless servers only, since it can break desktop/console output. I typically use it since we're working with headless servers.
video=efifb:off: Disables the firmware EFI framebuffer so simpledrm/efifb won't grab the boot GPU before VFIO claims it. This option is outdated and has no effect on systems with modern kernels. I list it for completeness.
intel_iommu=on / amd_iommu=on: Enable IOMMU support on Intel and AMD platforms. These are typically enabled by default on recent kernels, so there is usually no need to add them to the kernel parameters.
Here is how the typical kernel command line should look on a headless server with over 500 GB of RAM (options split across lines for readability):
nomodeset
modprobe.blacklist=nouveau,nvidia,nvidiafb,snd_hda_intel
vfio-pci.ids=10de:2b85,10de:22e8
default_hugepagesz=1G hugepagesz=1G hugepages=400
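Put together in /etc/default/grub, that ends up looking something like this (the IDs and page count are examples; keep any distro defaults you still want):
GRUB_CMDLINE_LINUX_DEFAULT="nomodeset modprobe.blacklist=nouveau,nvidia,nvidiafb,snd_hda_intel vfio-pci.ids=10de:2b85,10de:22e8 default_hugepagesz=1G hugepagesz=1G hugepages=400"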
Conclusion
VFIO GPU passthrough is a finicky process. It is sensitive to host hardware and software configuration. However, with enough diligence, you can make it robust and reliable. I strongly believe in this approach and rely on VFIO GPU passthrough as the primary tool for our GPU rental service at cloudrift.ai.
I hope this guide helps you improve your homelab or data center setup. If you notice any inaccuracies or have suggestions, please don't hesitate to let me know so we can improve the workflow together.
Final host checklist:
Enable IOMMU, Above 4G, and (where applicable) ReBAR in the BIOS.
Verify clean IOMMU groups; each GPU (+ audio) isolated.
Bind to vfio-pci early.
Size huge pages (1 GiB on high-RAM hosts) and confirm in /proc/meminfo.
Configure other kernel command-line options as needed.