Exploring Direct NVMe Access through Virtio: An Engineer's Perspective

Dove-Wing

In my work optimizing data center infrastructure, I've been investigating methods to improve storage performance in virtualized environments. This article documents my exploration of using virtio for direct NVMe access, with implementation details verified against the Linux kernel source code and related projects.

The Storage Performance Challenge

Virtual machine storage performance presents significant challenges when running I/O-intensive workloads. Traditional storage virtualization adds multiple layers that impact performance:

  1. The guest OS block layer

  2. The virtualization layer (QEMU/KVM)

  3. The host OS block layer

  4. The physical device access

By mapping NVMe directly to user space applications via virtio, we can potentially bypass several of these layers and achieve near-native performance.

Understanding the Technical Components

NVMe Protocol

The NVMe protocol, whose Linux host driver lives under drivers/nvme/host/, offers several advantages over legacy storage protocols:

  • Multiple submission and completion queue pairs (see struct nvme_queue in drivers/nvme/host/pci.c)

  • Reduced command overhead with a streamlined command set

  • PCIe transport for high bandwidth

  • Optimized command set for flash storage

Each NVMe command is a fixed 64-byte structure. The common layout, shown below as it appears in the kernel's include/linux/nvme.h (struct nvme_common_command), is what our user-space code reproduces; in the kernel, struct nvme_command is a union of this layout with per-opcode variants:

/* From include/linux/nvme.h */
struct nvme_common_command {
    __u8    opcode;
    __u8    flags;
    __u16   command_id;
    __le32  nsid;
    __le32  cdw2[2];
    __le64  metadata;
    union nvme_data_ptr dptr;   /* PRP1/PRP2 pair or an SGL descriptor */
    __le32  cdw10;
    __le32  cdw11;
    __le32  cdw12;
    __le32  cdw13;
    __le32  cdw14;
    __le32  cdw15;
};
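
As a concrete example, here is a sketch of filling in an Identify Controller admin command (opcode 0x06 with CNS = 1 in CDW10, per the NVMe specification). It assumes a user-space copy of the layout above, named struct nvme_command to match the code later in this article, and a 4KB DMA-mapped destination buffer whose IOVA is identify_buf_iova:

#include <string.h>
#include <stdint.h>

/* Sketch: build an Identify Controller admin command.
 * identify_buf_iova is assumed to be the IOVA of a 4KB DMA-mapped buffer. */
static void build_identify_ctrl(struct nvme_command *cmd, uint64_t identify_buf_iova)
{
    memset(cmd, 0, sizeof(*cmd));
    cmd->opcode    = 0x06;              /* Admin Identify */
    cmd->nsid      = 0;                 /* controller-scoped, not per-namespace */
    cmd->dptr.prp1 = identify_buf_iova; /* where the 4KB Identify data lands */
    cmd->cdw10     = 1;                 /* CNS = 1: Identify Controller */
}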

Virtio Framework

Virtio provides a standardized interface for virtual I/O devices in the Linux kernel:

  • Located in drivers/virtio/ in the kernel source

  • Defined by the OASIS Virtio Specification and implemented in drivers/virtio/virtio_ring.c

  • Uses shared memory rings (virtqueues) for efficient data exchange

  • Backs the paravirtual block driver in drivers/block/virtio_blk.c, which we can study for insights (there is no in-tree virtio-NVMe driver)

From the source code, virtio works via shared memory ring buffers:

/* From include/uapi/linux/virtio_ring.h */
struct vring {
    unsigned int num;
    struct vring_desc *desc;
    struct vring_avail *avail;
    struct vring_used *used;
};
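
To make the split-ring layout concrete, here is a minimal sketch of how a driver posts one buffer into a virtqueue: fill a descriptor, publish its index in the available ring, then bump the available index behind a write barrier. Field names follow include/uapi/linux/virtio_ring.h; the device notification is transport-specific and only stubbed here, and endianness handling is omitted for brevity:

#include <stdint.h>
#include <linux/virtio_ring.h>   /* struct vring, vring_desc, vring_avail */

/* Sketch: post a single buffer to a split virtqueue already shared with the device. */
static void post_buffer(struct vring *vr, uint16_t desc_idx,
                        uint64_t buf_iova, uint32_t len, int device_writes)
{
    /* 1. Describe the buffer in a free descriptor slot */
    vr->desc[desc_idx].addr  = buf_iova;
    vr->desc[desc_idx].len   = len;
    vr->desc[desc_idx].flags = device_writes ? VRING_DESC_F_WRITE : 0;
    vr->desc[desc_idx].next  = 0;              /* no chaining in this sketch */

    /* 2. Publish the descriptor index in the available ring */
    vr->avail->ring[vr->avail->idx % vr->num] = desc_idx;

    /* 3. Make the descriptor visible before updating the index the device polls */
    __sync_synchronize();
    vr->avail->idx++;

    /* 4. Notify the device (for virtio-pci, a write to the notify region) */
    /* notify_device(); -- transport-specific, omitted */
}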

Implementation Approach

After studying the Linux kernel source code, SPDK (lib/nvme/), and QEMU's virtio implementation (hw/block/virtio-blk.c), I developed a test implementation that uses VFIO to map a virtio-NVMe device directly to user space.

Device Initialization with VFIO

The Linux VFIO framework (located in drivers/vfio/) enables direct device access from user space:

/* Open VFIO container */
dev->vfio_container_fd = open("/dev/vfio/vfio", O_RDWR);
if (dev->vfio_container_fd < 0) {
    perror("Error opening VFIO container");
    return -1;
}

/* Find the IOMMU group for our device */
snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/iommu_group", pci_addr);
/* Read the group number from the symlink */
if ((len = readlink(path, group_path, sizeof(group_path) - 1)) < 0) {
    perror("Error reading IOMMU group");
    close(dev->vfio_container_fd);
    return -1;
}
group_path[len] = '\0';
sscanf(basename(group_path), "%d", &group_id);

/* Open the VFIO group */
snprintf(path, sizeof(path), "/dev/vfio/%d", group_id);
group_fd = open(path, O_RDWR);

This code follows the VFIO API as defined in include/uapi/linux/vfio.h.
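
The snippet above stops at the group fd; before the device can be used, the group must be attached to the container, an IOMMU backend selected, and a device fd obtained. A minimal sketch of those remaining steps (error paths and the VFIO_GROUP_GET_STATUS check trimmed):

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch: attach the group to the container, enable the Type1 IOMMU,
 * and get a device fd for the NVMe controller at pci_addr. */
static int finish_vfio_setup(struct virtio_nvme_device *dev, int group_fd,
                             const char *pci_addr)
{
    /* Attach the IOMMU group to the container */
    if (ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &dev->vfio_container_fd) < 0)
        return -1;

    /* Select the Type1 IOMMU backend for the container */
    if (ioctl(dev->vfio_container_fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) < 0)
        return -1;

    /* Get a file descriptor for the device itself (e.g. "0000:3b:00.0") */
    dev->vfio_device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, pci_addr);
    return dev->vfio_device_fd < 0 ? -1 : 0;
}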

Memory Mapping NVMe Registers

Next, we map the NVMe controller's registers into user space:

/* Map BAR0 (NVMe controller registers) */
struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
reg_info.index = VFIO_PCI_BAR0_REGION_INDEX; /* BAR0 */
if (ioctl(dev->vfio_device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info) < 0) {
    perror("Error getting VFIO region info");
    return -1;
}

dev->regs = mmap(NULL, reg_info.size, PROT_READ | PROT_WRITE,
                 MAP_SHARED, dev->vfio_device_fd, reg_info.offset);
if (dev->regs == MAP_FAILED) {
    perror("Error mapping BAR0");
    return -1;
}

The register offsets are defined in the NVMe specification and mirrored in the kernel's include/linux/nvme.h:

/* NVMe Controller Register Offsets */
#define NVME_REG_CAP     0x0000  /* Controller Capabilities */
#define NVME_REG_VS      0x0008  /* Version */
#define NVME_REG_INTMS   0x000c  /* Interrupt Mask Set */
#define NVME_REG_INTMC   0x0010  /* Interrupt Mask Clear */
#define NVME_REG_CC      0x0014  /* Controller Configuration */
#define NVME_REG_CSTS    0x001c  /* Controller Status */
#define NVME_REG_AQA     0x0024  /* Admin Queue Attributes */
#define NVME_REG_ASQ     0x0028  /* Admin Submission Queue Base Address */
#define NVME_REG_ACQ     0x0030  /* Admin Completion Queue Base Address */
#define NVME_REG_DBS     0x1000  /* Start of the doorbell registers */
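
Before programming the admin queues, it is worth sanity-checking the mapping by reading a couple of read-only registers. A small sketch, assuming the dev->regs mapping from above (bit positions per the NVMe specification: MQES in CAP bits 15:0, TO in bits 31:24, DSTRD in bits 35:32):

#include <stdint.h>
#include <stdio.h>

/* Sketch: read a few read-only controller registers to verify the BAR mapping. */
static void dump_controller_info(volatile void *regs)
{
    uint64_t cap = *(volatile uint64_t *)((volatile uint8_t *)regs + NVME_REG_CAP);
    uint32_t vs  = *(volatile uint32_t *)((volatile uint8_t *)regs + NVME_REG_VS);

    printf("NVMe version: %u.%u\n", (vs >> 16) & 0xffff, (vs >> 8) & 0xff);
    printf("Max queue entries: %llu\n", (unsigned long long)((cap & 0xffff) + 1));
    printf("Ready timeout: %llu x 500ms\n", (unsigned long long)((cap >> 24) & 0xff));
    printf("Doorbell stride: %llu bytes\n", (unsigned long long)(4ULL << ((cap >> 32) & 0xf)));
}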

Setting Up NVMe Queues

Based on the NVMe specification and Linux driver implementation, we set up the admin queues:

/* Allocate admin submission queue (4KB aligned) */
if (posix_memalign(&dev->admin_sq, 4096, 4096) != 0) {
    perror("Failed to allocate admin SQ");
    return -1;
}
memset(dev->admin_sq, 0, 4096);

/* Allocate admin completion queue (4KB aligned) */
if (posix_memalign(&dev->admin_cq, 4096, 4096) != 0) {
    perror("Failed to allocate admin CQ");
    free(dev->admin_sq);
    return -1;
}
memset(dev->admin_cq, 0, 4096);

This parallels the kernel's PCIe NVMe driver, which allocates its queue memory with dma_alloc_coherent() in nvme_alloc_queue() (drivers/nvme/host/pci.c). In our user-space version, the device gets access to these buffers through the VFIO DMA mapping described below.

Initializing the NVMe Controller

Following the protocol defined in the NVMe specification and implemented in drivers/nvme/host/core.c, we initialize the controller:

/* Set admin queue attributes: 16 entries (0's based) for both queues */
*(volatile uint32_t *)((uint8_t *)dev->regs + NVME_REG_AQA) = 
    ((16 - 1) << 16) | (16 - 1);

/* Set admin submission/completion queue base addresses.
 * These must be bus addresses (IOVAs); this only works here because we
 * map each queue with IOVA == virtual address via VFIO (see below). */
*(volatile uint64_t *)((uint8_t *)dev->regs + NVME_REG_ASQ) = 
    (uint64_t)(uintptr_t)dev->admin_sq;
*(volatile uint64_t *)((uint8_t *)dev->regs + NVME_REG_ACQ) = 
    (uint64_t)(uintptr_t)dev->admin_cq;

/* Set CC register to enable the controller with 4KB pages */
cc = (0x0 << 4)  | /* CSS: NVM command set */
     (0x0 << 7)  | /* MPS: 4KB pages (2^(12 + MPS)) */
     (0x0 << 11) | /* AMS: round-robin arbitration */
     (0x6 << 16) | /* IOSQES: 64-byte submission queue entries */
     (0x4 << 20) | /* IOCQES: 16-byte completion queue entries */
     0x1;          /* EN: enable the controller */

*(volatile uint32_t *)((uint8_t *)dev->regs + NVME_REG_CC) = cc;

This mirrors the initialization sequence in nvme_enable_ctrl() within the kernel source.
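
One step worth calling out explicitly: after setting CC.EN, the controller is not usable until CSTS.RDY goes to 1 (the kernel waits for this in nvme_wait_ready()). A minimal polling sketch, assuming the register mapping above and a timeout taken from CAP.TO:

#include <stdint.h>
#include <time.h>

/* Sketch: poll CSTS.RDY (bit 0) until the controller reports ready.
 * timeout_500ms should come from CAP.TO (bits 31:24 of NVME_REG_CAP). */
static int wait_controller_ready(volatile void *regs, unsigned int timeout_500ms)
{
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000 * 1000 }; /* 1 ms */
    unsigned long waited_ms = 0;

    while (waited_ms < (unsigned long)timeout_500ms * 500) {
        uint32_t csts = *(volatile uint32_t *)((volatile uint8_t *)regs + NVME_REG_CSTS);
        if (csts & 0x1)   /* CSTS.RDY: controller is ready */
            return 0;
        if (csts & 0x2)   /* CSTS.CFS: controller fatal status */
            return -1;
        nanosleep(&delay, NULL);
        waited_ms++;
    }
    return -1;            /* timed out */
}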

Command Submission Process

To submit commands to the NVMe controller, we follow the protocol defined in the NVMe specification:

/* Submit a command to the admin queue */
static int submit_admin_command(struct virtio_nvme_device *dev, 
                               struct nvme_command *cmd) {
    struct nvme_command *sq_entry;
    volatile uint32_t *doorbell;

    /* Get pointer to next submission queue entry */
    sq_entry = (struct nvme_command *)dev->admin_sq + dev->sq_tail;

    /* Copy command to submission queue */
    memcpy(sq_entry, cmd, sizeof(struct nvme_command));

    /* Set command ID */
    sq_entry->command_id = dev->command_id++;

    /* Update submission queue tail */
    dev->sq_tail = (dev->sq_tail + 1) % 16;

    /* Ring the admin SQ tail doorbell: the first entry of the doorbell array
     * at NVME_REG_DBS (0x1000), assuming CAP.DSTRD == 0 (4-byte stride) */
    __sync_synchronize(); /* make the SQ entry visible before the doorbell write */
    doorbell = (volatile uint32_t *)((uint8_t *)dev->regs + NVME_REG_DBS);
    *doorbell = dev->sq_tail;

    return 0;
}

This implementation aligns with how the kernel submits commands in nvme_submit_cmd() within drivers/nvme/host/pci.c.
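
The other half of the exchange is reaping completions. Each 16-byte completion queue entry carries a phase tag that the controller flips on every pass through the queue, which is how the driver distinguishes new entries from stale ones. Below is a polling sketch; the CQE layout follows the NVMe specification, while cq_head and cq_phase are assumed bookkeeping fields (cq_phase initialized to 1) not shown in the earlier structures:

#include <stdint.h>

/* 16-byte NVMe completion queue entry (layout per the NVMe specification) */
struct nvme_cq_entry {
    uint32_t result;      /* command-specific result (DW0) */
    uint32_t reserved;    /* DW1 */
    uint16_t sq_head;     /* SQ head pointer consumed by the controller */
    uint16_t sq_id;       /* submission queue identifier */
    uint16_t command_id;  /* matches the command_id we submitted */
    uint16_t status;      /* bit 0: phase tag, bits 15:1: status field */
};

/* Sketch: poll the admin CQ for one completion and update the CQ head doorbell. */
static int poll_admin_completion(struct virtio_nvme_device *dev,
                                 struct nvme_cq_entry *out)
{
    volatile struct nvme_cq_entry *cqe =
        (volatile struct nvme_cq_entry *)dev->admin_cq + dev->cq_head;

    /* A phase mismatch means the controller has not written this entry yet */
    if ((cqe->status & 0x1) != dev->cq_phase)
        return 1;                          /* nothing new yet; caller retries */

    *out = *(const struct nvme_cq_entry *)cqe;

    /* Advance our head, flipping the expected phase when we wrap (16 entries) */
    if (++dev->cq_head == 16) {
        dev->cq_head = 0;
        dev->cq_phase ^= 1;
    }

    /* Ring the admin CQ head doorbell: the second doorbell entry,
     * again assuming CAP.DSTRD == 0 (4-byte stride) */
    *(volatile uint32_t *)((uint8_t *)dev->regs + NVME_REG_DBS + 4) = dev->cq_head;

    return (out->status >> 1) ? -1 : 0;    /* 0 on success, -1 on NVMe error status */
}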

DMA Memory Management

For DMA memory management, we use VFIO's IOMMU support as defined in include/uapi/linux/vfio.h:

/* Pin the memory for DMA */
struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };

/* Map admin SQ */
dma_map.vaddr = (uintptr_t)dev->admin_sq;
dma_map.iova = (uintptr_t)dev->admin_sq; /* Using vaddr as iova for simplicity */
dma_map.size = 4096;
dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
if (ioctl(dev->vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map) < 0) {
    perror("Failed to map admin SQ for DMA");
    return -1;
}

This approach follows the same pattern used in SPDK's and DPDK's VFIO support, which build their DMA mappings on the same VFIO type1 ioctls.
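
For completeness, these mappings should be torn down on shutdown with the matching unmap ioctl; a short sketch for the admin SQ:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch: undo the DMA mapping for the admin SQ during teardown. */
static int unmap_admin_sq(struct virtio_nvme_device *dev)
{
    struct vfio_iommu_type1_dma_unmap dma_unmap = {
        .argsz = sizeof(dma_unmap),
        .iova  = (uintptr_t)dev->admin_sq,  /* same IOVA we mapped above */
        .size  = 4096,
    };

    if (ioctl(dev->vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap) < 0) {
        perror("Failed to unmap admin SQ");
        return -1;
    }
    return 0;
}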

Integration with Virtio

To leverage virtio for NVMe access, we need to understand how QEMU's virtio-blk device interfaces with NVMe. From examining hw/block/virtio-blk.c in QEMU's source code, virtio-blk provides a simpler block device interface that can sit atop various block device backends, including NVMe.

For a true virtio-NVMe implementation, we would:

  1. Define a virtio-NVMe device that exposes NVMe semantics rather than just block-device semantics (one hypothetical request layout is sketched after this list)

  2. Create a shared memory region for virtqueues between guest and host

  3. Implement the virtio PCI transport layer

This is partially modeled in QEMU's code, though not with a direct NVMe command interface.
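
To make point 1 above concrete, one plausible wire format would carry a raw NVMe command plus a device-written status footer over a virtqueue, much as virtio-blk places struct virtio_blk_outhdr ahead of the data buffers. This is entirely hypothetical; none of these type names exist in QEMU or the kernel:

#include <stdint.h>

/* Hypothetical virtio-NVMe request layout: a driver-written header (the raw
 * NVMe command), the data buffers, and a device-written footer, each carried
 * in its own virtqueue descriptor. */
struct virtio_nvme_req_hdr {      /* driver -> device (VRING_DESC_F_WRITE clear) */
    struct nvme_command cmd;      /* the 64-byte NVMe command, passed through as-is */
    uint16_t target_sqid;         /* which virtual NVMe submission queue to use */
    uint16_t reserved;
};

struct virtio_nvme_req_ftr {      /* device -> driver (VRING_DESC_F_WRITE set) */
    uint32_t result;              /* CQE DW0 (command-specific result) */
    uint16_t status;              /* NVMe status field */
    uint16_t reserved;
};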

Performance Considerations

Based on the SPDK documentation and source code, several factors significantly impact NVMe performance:

  1. Queue depth: keeping queues full is essential for throughput; the Linux NVMe PCIe driver uses a small admin queue but much deeper I/O queues (see the io_queue_depth module parameter in drivers/nvme/host/pci.c)

  2. I/O size alignment: Aligning to the NVMe block size (typically 4KB) is critical

  3. Interrupt coalescing: Configuring optimal interrupt thresholds can reduce CPU overhead

  4. IOMMU overhead: The IOMMU adds protection but introduces some overhead

These considerations are reflected in the SPDK NVMe driver implementation in lib/nvme/nvme_pcie.c.

Testing and Validation

For validating this approach, several existing tools are useful:

  1. nvme-cli and the blktests suite exercise NVMe functionality against the known-good kernel driver

  2. tools/perf/ in the kernel tree provides performance measurement capabilities

  3. fio and the liburing examples demonstrate high-performance asynchronous I/O patterns to compare against

Future Directions

Future work could explore:

  1. Implementing full NVMe queue management beyond admin queues

  2. Integrating with io_uring for efficient asynchronous I/O

  3. Exploring direct integration with SPDK's NVMe driver

  4. Comparing with QEMU's vhost-user-blk implementation (hw/block/vhost-user-blk.c)

Conclusion

Direct NVMe access through virtio has significant potential for high-performance storage in virtualized environments. By carefully implementing the NVMe protocol and leveraging the virtio framework, we can achieve near-native storage performance while maintaining the benefits of virtualization.
