Exploring Direct NVMe Access through Virtio: An Engineer's Perspective

In my work optimizing data center infrastructure, I've been investigating methods to improve storage performance in virtualized environments. This article documents my exploration of using virtio for direct NVMe access, with implementation details verified against the Linux kernel source code and related projects.
The Storage Performance Challenge
Virtual machine storage performance presents significant challenges when running I/O-intensive workloads. Traditional storage virtualization adds multiple layers that impact performance:
The guest OS block layer
The virtualization layer (QEMU/KVM)
The host OS block layer
The physical device access
By mapping NVMe directly to user space applications via virtio, we can potentially bypass several of these layers and achieve near-native performance.
Understanding the Technical Components
NVMe Protocol
The NVMe protocol, implemented in the Linux kernel under drivers/nvme/host/, offers several advantages over legacy storage protocols:
Multiple submission and completion queues (see the per-queue handling in drivers/nvme/host/pci.c)
Reduced per-command overhead with a streamlined command set
PCIe transport for high bandwidth and low latency
A command set optimized for flash storage
As shown in the Linux kernel's headers, every NVMe command is a fixed 64-byte structure. The layout below is modeled on the common command format (struct nvme_common_command, which the kernel wraps in a union of per-opcode layouts):
/* Modeled on struct nvme_common_command in include/linux/nvme.h */
struct nvme_command {
    __u8    opcode;
    __u8    flags;
    __u16   command_id;
    __le32  nsid;
    __le32  cdw2[2];
    __le64  metadata;
    union nvme_data_ptr dptr;   /* PRP1/PRP2 pair or an SGL descriptor */
    __le32  cdw10;
    __le32  cdw11;
    __le32  cdw12;
    __le32  cdw13;
    __le32  cdw14;
    __le32  cdw15;
};
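As a quick usage example, here is how an Identify Controller command could be filled in with this layout and handed to the submit_admin_command() helper shown later in the article. identify_buf_iova is a hypothetical IOVA of a 4 KB buffer that has already been mapped for DMA through VFIO:
/* Build an Identify Controller admin command (opcode 0x06, CNS = 1 in CDW10).
 * Fields are shown without explicit endian conversion; on a big-endian host
 * use htole32()/htole64(). */
struct nvme_command cmd;
memset(&cmd, 0, sizeof(cmd));
cmd.opcode    = 0x06;               /* Identify (admin command set) */
cmd.nsid      = 0;                  /* not namespace-specific */
cmd.dptr.prp1 = identify_buf_iova;  /* 4 KB buffer for the returned identify data */
cmd.cdw10     = 1;                  /* CNS = 1: identify the controller */
submit_admin_command(dev, &cmd);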
Virtio Framework
Virtio provides a standardized interface for paravirtualized I/O devices in the Linux kernel:
Located in drivers/virtio/ in the kernel source
Defined by the OASIS Virtio Specification, with the ring transport implemented in drivers/virtio/virtio_ring.c
Uses shared memory rings (virtqueues) for efficient data exchange between driver and device
Includes a paravirtual block driver, drivers/block/virtio_blk.c, which we can study for insights
From the source code, virtio works via shared memory ring buffers:
/* From include/uapi/linux/virtio_ring.h */
struct vring {
    unsigned int num;           /* number of descriptors (a power of two) */
    struct vring_desc *desc;    /* descriptor table */
    struct vring_avail *avail;  /* driver -> device (available ring) */
    struct vring_used *used;    /* device -> driver (used ring) */
};
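To make the data path concrete, here is a minimal sketch of how a driver posts one buffer to a virtqueue, ignoring endianness conversion, memory barriers, and device notification. It assumes an already-initialized struct vring shared with the device; VRING_DESC_F_WRITE comes from the same header:
/* Minimal sketch: publish one buffer to a virtqueue */
static void vring_post_buffer(struct vring *vr, uint16_t desc_idx,
                              uint64_t buf_iova, uint32_t len, bool device_writes)
{
    /* Describe the buffer in the descriptor table */
    vr->desc[desc_idx].addr  = buf_iova;
    vr->desc[desc_idx].len   = len;
    vr->desc[desc_idx].flags = device_writes ? VRING_DESC_F_WRITE : 0;
    vr->desc[desc_idx].next  = 0;          /* no chaining in this sketch */

    /* Publish the descriptor index in the available ring */
    vr->avail->ring[vr->avail->idx % vr->num] = desc_idx;

    /* A write memory barrier belongs here before the index update */
    vr->avail->idx++;
}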
Implementation Approach
After studying the Linux kernel source code, SPDK (lib/nvme/), and QEMU's virtio implementation (hw/block/virtio-blk.c), I developed a test implementation that uses VFIO to map a virtio-NVMe device directly to user space.
Device Initialization with VFIO
The Linux VFIO framework (located in drivers/vfio/) enables direct device access from user space:
/* Open the VFIO container */
dev->vfio_container_fd = open("/dev/vfio/vfio", O_RDWR);
if (dev->vfio_container_fd < 0) {
    perror("Error opening VFIO container");
    return -1;
}

/* Find the IOMMU group for our device */
char path[PATH_MAX], group_path[PATH_MAX];
ssize_t len;
int group_id, group_fd;

snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/iommu_group", pci_addr);

/* Read the group number from the symlink */
if ((len = readlink(path, group_path, sizeof(group_path) - 1)) < 0) {
    perror("Error reading IOMMU group");
    close(dev->vfio_container_fd);
    return -1;
}
group_path[len] = '\0';
sscanf(basename(group_path), "%d", &group_id);

/* Open the VFIO group */
snprintf(path, sizeof(path), "/dev/vfio/%d", group_id);
group_fd = open(path, O_RDWR);
This code follows the VFIO API as defined in include/uapi/linux/vfio.h.
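The snippet above stops after opening the group. A minimal sketch of the remaining steps, using the standard VFIO ioctls (VFIO_GROUP_SET_CONTAINER, VFIO_SET_IOMMU, VFIO_GROUP_GET_DEVICE_FD) and the variable names assumed above:
/* Attach the group to the container and select the Type-1 IOMMU model */
if (ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &dev->vfio_container_fd) < 0) {
    perror("Error attaching group to container");
    return -1;
}
if (ioctl(dev->vfio_container_fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) < 0) {
    perror("Error setting IOMMU type");
    return -1;
}

/* Finally, get a file descriptor for the NVMe device itself */
dev->vfio_device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, pci_addr);
if (dev->vfio_device_fd < 0) {
    perror("Error getting device fd");
    return -1;
}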
Memory Mapping NVMe Registers
Next, we map the NVMe controller's registers into user space:
/* Map BAR0 (the NVMe controller registers) */
struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
reg_info.index = 0;    /* BAR0 */
if (ioctl(dev->vfio_device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info) < 0) {
    perror("Error getting VFIO region info");
    return -1;
}
dev->regs = mmap(NULL, reg_info.size, PROT_READ | PROT_WRITE,
                 MAP_SHARED, dev->vfio_device_fd, reg_info.offset);
The register offsets are defined in the NVMe specification and mirrored in the kernel's include/linux/nvme.h:
/* NVMe Controller Register Offsets */
#define NVME_REG_CAP     0x0000 /* Controller Capabilities */
#define NVME_REG_VS      0x0008 /* Version */
#define NVME_REG_INTMS   0x000c /* Interrupt Mask Set */
#define NVME_REG_INTMC   0x0010 /* Interrupt Mask Clear */
#define NVME_REG_CC      0x0014 /* Controller Configuration */
#define NVME_REG_CSTS    0x001c /* Controller Status */
#define NVME_REG_AQA     0x0024 /* Admin Queue Attributes */
#define NVME_REG_ASQ     0x0028 /* Admin Submission Queue Base Address */
#define NVME_REG_ACQ     0x0030 /* Admin Completion Queue Base Address */
#define NVME_REG_DBS     0x1000 /* Start of the doorbell registers */
#define NVME_REG_SQ0TDBL NVME_REG_DBS /* Admin SQ tail doorbell (first doorbell; local name, not a kernel define) */
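Before setting up any queues it is worth reading CAP once, since the maximum queue size and the doorbell stride come from it. A small sketch, assuming dev->regs points at the BAR0 mapping created above:
/* Read Controller Capabilities (CAP) and extract two useful fields */
uint64_t cap = *(volatile uint64_t *)((uint8_t *)dev->regs + NVME_REG_CAP);

uint32_t mqes      = (uint32_t)(cap & 0xffff) + 1;  /* CAP.MQES: max entries per queue (0-based field) */
uint32_t dstrd     = (uint32_t)((cap >> 32) & 0xf); /* CAP.DSTRD: doorbell stride exponent */
uint32_t db_stride = 4u << dstrd;                   /* bytes between consecutive doorbells */

printf("NVMe controller: up to %u entries per queue, doorbell stride %u bytes\n",
       mqes, db_stride);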
Setting Up NVMe Queues
Based on the NVMe specification and Linux driver implementation, we set up the admin queues:
/* Allocate the admin submission queue (4 KB aligned) */
if (posix_memalign(&dev->admin_sq, 4096, 4096) != 0) {
    perror("Failed to allocate admin SQ");
    return -1;
}
memset(dev->admin_sq, 0, 4096);

/* Allocate the admin completion queue (4 KB aligned) */
if (posix_memalign(&dev->admin_cq, 4096, 4096) != 0) {
    perror("Failed to allocate admin CQ");
    free(dev->admin_sq);
    return -1;
}
memset(dev->admin_cq, 0, 4096);
This parallels how the kernel's NVMe PCI driver allocates queue memory with dma_alloc_coherent() in drivers/nvme/host/pci.c:
/* From drivers/nvme/host/pci.c, nvme_alloc_queue() (simplified) */
nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(nvmeq),
                                 &nvmeq->cq_dma_addr, GFP_KERNEL);
nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(nvmeq),
                                    &nvmeq->sq_dma_addr, GFP_KERNEL);
The key difference in user space is that posix_memalign() only gives us page-aligned virtual memory; the device-visible (IOVA) mapping has to be established separately through VFIO, as shown later.
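For clarity, here is a sketch of the device-state structure these snippets assume. The virtio_nvme_device name and its fields are this article's own bookkeeping, not a kernel or SPDK type:
/* Hypothetical user-space device state used throughout these snippets */
struct virtio_nvme_device {
    int      vfio_container_fd;   /* /dev/vfio/vfio */
    int      vfio_device_fd;      /* from VFIO_GROUP_GET_DEVICE_FD */
    void    *regs;                /* mmap()ed BAR0 (controller registers) */
    void    *admin_sq;            /* 4 KB admin submission queue */
    void    *admin_cq;            /* 4 KB admin completion queue */
    uint16_t sq_tail;             /* next free admin SQ slot */
    uint16_t cq_head;             /* next admin CQ slot to inspect */
    uint8_t  cq_phase;            /* expected completion phase tag, starts at 1 */
    uint16_t command_id;          /* monotonically increasing command ID */
};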
Initializing the NVMe Controller
Following the protocol defined in the NVMe specification and implemented in drivers/nvme/host/core.c, we initialize the controller:
/* Set admin queue attributes: 16-entry SQ and CQ (the fields are 0-based) */
*(volatile uint32_t *)((uint8_t *)dev->regs + NVME_REG_AQA) =
    ((16 - 1) << 16) | (16 - 1);

/* Set the admin submission queue base address.
 * This must be the device-visible (IOVA) address; it equals the virtual
 * address here only because we map IOVA == vaddr through VFIO below. */
*(volatile uint64_t *)((uint8_t *)dev->regs + NVME_REG_ASQ) =
    (uint64_t)dev->admin_sq;

/* Set the admin completion queue base address */
*(volatile uint64_t *)((uint8_t *)dev->regs + NVME_REG_ACQ) =
    (uint64_t)dev->admin_cq;

/* Build the CC register value and enable the controller */
cc = (0x0 << 4)  |  /* CSS: NVM command set */
     (0x0 << 7)  |  /* MPS: 4 KB pages (2^(12 + 0)) */
     (0x0 << 11) |  /* AMS: round-robin arbitration */
     (6 << 16)   |  /* IOSQES: 2^6 = 64-byte SQ entries */
     (4 << 20)   |  /* IOCQES: 2^4 = 16-byte CQ entries */
     0x1;           /* EN: enable the controller */
*(volatile uint32_t *)((uint8_t *)dev->regs + NVME_REG_CC) = cc;
This mirrors the initialization sequence in nvme_enable_ctrl() within the kernel source.
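After writing CC.EN, the controller is not usable until CSTS.RDY reads 1. A short polling sketch follows; a production driver would derive its timeout from CAP.TO (reported in 500 ms units) instead of a fixed loop:
/* Wait for the controller to report ready (CSTS.RDY == 1) */
for (int i = 0; i < 1000; i++) {
    uint32_t csts = *(volatile uint32_t *)((uint8_t *)dev->regs + NVME_REG_CSTS);
    if (csts & 0x1)    /* RDY */
        break;
    if (csts & 0x2) {  /* CFS: controller fatal status */
        fprintf(stderr, "NVMe controller reported a fatal error\n");
        return -1;
    }
    usleep(1000);
}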
Command Submission Process
To submit commands to the NVMe controller, we follow the protocol defined in the NVMe specification:
/* Submit a command to the admin submission queue */
static int submit_admin_command(struct virtio_nvme_device *dev,
                                struct nvme_command *cmd)
{
    struct nvme_command *sq_entry;
    volatile uint32_t *doorbell;

    /* Get a pointer to the next submission queue entry */
    sq_entry = (struct nvme_command *)dev->admin_sq + dev->sq_tail;

    /* Copy the command into the submission queue */
    memcpy(sq_entry, cmd, sizeof(struct nvme_command));

    /* Assign a command ID so the completion can be matched later */
    sq_entry->command_id = dev->command_id++;

    /* Advance the submission queue tail (16-entry admin queue) */
    dev->sq_tail = (dev->sq_tail + 1) % 16;

    /* Ring the admin SQ tail doorbell (assumes CAP.DSTRD == 0) */
    doorbell = (volatile uint32_t *)((uint8_t *)dev->regs + NVME_REG_SQ0TDBL);
    *doorbell = dev->sq_tail;

    return 0;
}
This mirrors how the kernel copies a command into the submission queue and writes the doorbell in nvme_submit_cmd() within drivers/nvme/host/pci.c.
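Submission is only half of the story; completions have to be reaped from the admin completion queue by watching the phase tag. A minimal polling sketch, reusing the hypothetical device-state fields introduced earlier and the 16-byte completion layout from the NVMe specification:
/* NVMe completion queue entry (16 bytes, per the NVMe specification) */
struct nvme_completion {
    uint32_t result;      /* command-specific result (DW0) */
    uint32_t reserved;
    uint16_t sq_head;     /* how far the controller has consumed the SQ */
    uint16_t sq_id;
    uint16_t command_id;  /* matches the ID set at submission */
    uint16_t status;      /* bit 0 is the phase tag, bits 15:1 the status field */
};

/* Poll the admin completion queue for one completion */
static int poll_admin_completion(struct virtio_nvme_device *dev, uint16_t *cid)
{
    volatile struct nvme_completion *cqe =
        (volatile struct nvme_completion *)dev->admin_cq + dev->cq_head;

    /* New entries carry the expected phase tag in status bit 0 */
    if ((cqe->status & 0x1) != dev->cq_phase)
        return 0;    /* nothing new yet */

    *cid = cqe->command_id;

    /* Advance the head, flipping the expected phase on wrap-around */
    if (++dev->cq_head == 16) {
        dev->cq_head = 0;
        dev->cq_phase ^= 1;
    }

    /* Ring the admin CQ head doorbell (the next doorbell after the SQ tail,
     * again assuming CAP.DSTRD == 0) */
    *(volatile uint32_t *)((uint8_t *)dev->regs + NVME_REG_SQ0TDBL + 4) = dev->cq_head;

    return (cqe->status >> 1) == 0 ? 1 : -1;   /* 1 = success, -1 = error status */
}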
DMA Memory Management
For DMA memory management, we use VFIO's IOMMU support as defined in include/uapi/linux/vfio.h:
/* Pin and map the memory for DMA */
struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };

/* Map the admin SQ */
dma_map.vaddr = (uintptr_t)dev->admin_sq;
dma_map.iova  = (uintptr_t)dev->admin_sq;  /* using vaddr as IOVA for simplicity */
dma_map.size  = 4096;
dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

if (ioctl(dev->vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map) < 0) {
    perror("Failed to map admin SQ for DMA");
    return -1;
}
This follows the same pattern SPDK uses in its environment layer (lib/env_dpdk/) to register memory with the IOMMU for DMA.
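The same queues (and any data buffers handed to the controller via PRP entries) must eventually be unmapped; the corresponding teardown uses VFIO_IOMMU_UNMAP_DMA from the same header:
/* Remove the DMA mapping when tearing the device down */
struct vfio_iommu_type1_dma_unmap dma_unmap = { .argsz = sizeof(dma_unmap) };

dma_unmap.iova = (uintptr_t)dev->admin_sq;
dma_unmap.size = 4096;

if (ioctl(dev->vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap) < 0)
    perror("Failed to unmap admin SQ");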
Integration with Virtio
To leverage virtio for NVMe access, we need to understand how QEMU's virtio-blk device interfaces with NVMe. From examining hw/block/virtio-blk.c in QEMU's source code, virtio-blk provides a simpler block-device interface that can sit atop various block backends, including NVMe.
For a true virtio-NVMe implementation, we would:
Define a virtio-NVMe device that exposes NVMe semantics (rather than just block device semantics)
Create a shared memory region for virtqueues between guest and host
Implement the virtio PCI transport layer
This is partially modeled in QEMU's code, though not with a direct NVMe command interface; a rough sketch of what passing NVMe commands over a virtqueue could look like follows.
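Purely as an illustration of the idea (no such device exists in mainline QEMU or Linux), a hypothetical virtio-NVMe request could be carried as a three-descriptor virtqueue chain, mirroring how virtio-blk splits header, data, and status:
/* Hypothetical virtio-NVMe descriptor chain; illustrative only.
 *
 *   desc[0]  driver -> device   the 64-byte struct nvme_command
 *   desc[1]  in or out          the data buffer for the command (if any)
 *   desc[2]  device -> driver   the 16-byte struct nvme_completion,
 *                               written in place of virtio-blk's 1-byte status
 *
 * This keeps full NVMe command and completion semantics end to end while
 * reusing the standard virtio ring transport. */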
Performance Considerations
Based on the SPDK documentation and source code, several factors significantly impact NVMe performance:
Queue depth: the Linux NVMe driver uses a 32-entry admin queue and much deeper I/O queues, so a user-space driver should size its I/O queues generously
I/O size and alignment: aligning to the NVMe block size (typically 4 KB) is critical
Interrupt coalescing: configuring optimal interrupt thresholds can reduce CPU overhead (polling, as SPDK does, avoids interrupts entirely)
IOMMU overhead: the IOMMU adds protection but introduces some translation overhead
These considerations are reflected in the SPDK NVMe driver implementation in lib/nvme/nvme_pcie.c.
Testing and Validation
For validating this approach, a mix of kernel tooling and standard storage test tools works well:
fio for generating controlled I/O workloads and measuring latency and throughput
nvme-cli for issuing admin and I/O commands against a controller
blktests, whose nvme test group provides regression coverage
tools/perf/ in the kernel source tree for profiling CPU overhead
the io_uring examples (historically under tools/io_uring/ in the kernel tree, now maintained alongside liburing) for high-performance asynchronous I/O patterns
Future Directions
Future work could explore:
Implementing full NVMe queue management beyond admin queues
Integrating with io_uring for efficient asynchronous I/O
Exploring direct integration with SPDK's NVMe driver
Comparing with QEMU's vhost-user-blk implementation (hw/block/vhost-user-blk.c)
Conclusion
Direct NVMe access through virtio has significant potential for high-performance storage in virtualized environments. By carefully implementing the NVMe protocol and leveraging the virtio framework, we can achieve near-native storage performance while maintaining the benefits of virtualization.