Benchmarking Linux FUSE Overhead

Swebert Correa

"Our experiments indicate that depending on the workload and hardware used, performance degradation caused by FUSE can be completely imperceptible or as high as –83% even when optimized; and relative CPU utilization can increase by 31%."

Ref: To FUSE or Not to FUSE: Performance of User-Space File Systems | USENIX

Check Installed Storage Types

First, I wanted to check what kind of storage devices are installed, since the benchmarking workload has to be designed around them. The primary goal of this exercise is to find the bottlenecks in FUSE, not in the disk, so the tests should be configured according to the disk type.

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0     7:0    0    4K  1 loop /snap/bare/5
loop1     7:1    0 73.9M  1 loop /snap/core22/2045
loop2     7:2    0  516M  1 loop /snap/gnome-42-2204/202
loop3     7:3    0 91.7M  1 loop /snap/gtk-common-themes/1535
loop4     7:4    0 49.3M  1 loop /snap/snapd/24792
loop5     7:5    0 50.8M  1 loop /snap/snapd/25202
loop6     7:6    0 73.9M  1 loop /snap/core22/2082
sda       8:0    0   40G  0 disk
├─sda1    8:1    0   39G  0 part /
├─sda14   8:14   0    4M  0 part
├─sda15   8:15   0  106M  0 part /boot/efi
└─sda16 259:0    0  913M  0 part /boot

Apart from the snap loop devices, only sda is present in the output, so there is no NVMe SSD installed; the disk shows up as a plain SATA/SCSI block device, which I will treat as an HDD.
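The device name alone does not tell us whether the disk is rotational; a more direct check (not part of the original run) is the block layer's rotational flag in sysfs:

$ cat /sys/block/sda/queue/rotational
# 1 means rotational (HDD), 0 means non-rotational (SSD)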

Check FUSE Configuration

Many FUSE features are fixed at kernel compile time. These can be checked by looking at the kernel's build configuration shipped under /boot.

$ cat /boot/config-$(uname -r) | grep FUSE
CONFIG_FUSE_FS=y
CONFIG_FUSE_DAX=y
CONFIG_FUSE_PASSTHROUGH=y
CONFIG_FUSE_IO_URING=y

CONFIG_FUSE_FS and CONFIG_FUSE_PASSTHROUGH, the flags we are interested in, are enabled. Another interesting flag is CONFIG_FUSE_IO_URING, which hooks FUSE up with io_uring so that requests are exchanged with the user-space daemon over io_uring queues instead of reads and writes on /dev/fuse. This might be even more powerful than passthrough if we plan to keep the backing filesystem logic in user space.
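Note that compiling the feature in does not automatically activate it: per the kernel's fuse-io-uring documentation, it also has to be switched on through a fuse module parameter, which can be inspected at runtime:

$ cat /sys/module/fuse/parameters/enable_uring
# disabled (N) by default; enable by booting with fuse.enable_uring=1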

Benchmarking and Tools

A basic Google search revealed some tools that can be used to benchmark a filesystem and observe the kernel while doing so.

  1. Filesystem: fio, fio-plot

  2. Kernel: iostat, vmstat, pidstat
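These kernel-side tools are meant to run alongside the fio job. A minimal sketch of typical invocations (the pgrep pattern for the FUSE daemon process is an assumption):

$ iostat -x 1                              # extended per-device stats: utilization, queue size, await
$ vmstat 1                                 # system-wide context switches (cs) and CPU breakdown
$ pidstat -w -p $(pgrep -f fuse_mount) 1   # voluntary/involuntary context switches of the FUSE daemon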

Designing the fio Workload

  1. Small Random IO - Latency

    1. A small block size produces many small IOs, increasing the number of kernel context switches.

    2. A small IO depth forces latency measurement, as each IO operation has to complete before the next is issued, preventing any reordering optimizations.

    3. A large IO depth coupled with small block sizes approximates real-life workloads, where many small requests are in flight at once.

  2. Large Sequential IO - Throughput

    1. A large block size minimizes the number of IO operations and maximizes the data transferred per operation, focusing on throughput instead of latency.

    2. A large IO depth makes the kernel submit multiple IO operations at once, allowing the HDD to reorder IOs for minimal head movement.

  3. Metadata Operations - Overhead

    1. Purely metadata operations like create and stat involve no data transfer. This helps isolate the overhead added purely by FUSE.

    2. A large number of files helps in averaging out the per-operation cost.

exouser@dimuthu-fuse-benchmark:~/sbtc$ ls -l *.fio
-rw-rw-r-- 1 exouser exouser 325 Aug 27 08:30 latency.fio
-rw-rw-r-- 1 exouser exouser 180 Aug 27 08:30 overhead.fio
-rw-rw-r-- 1 exouser exouser 193 Aug 27 08:30 throughput.fio
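The contents of the job files are not shown in the post, so here is a minimal sketch of what each could look like, following the design above; every value below is an assumption, not the actual configuration used.

# latency.fio (sketch): small random IO at queue depth 1
[small-random]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=1
size=1g
time_based
runtime=60

# throughput.fio (sketch): large sequential IO with a deep queue
[large-sequential]
ioengine=libaio
direct=1
rw=read
bs=1m
iodepth=32
size=4g
time_based
runtime=60

# overhead.fio (sketch): metadata-only file creation; fio also ships
# filestat and filedelete engines for the stat/delete side
[metadata-create]
ioengine=filecreate
create_on_open=1
filesize=4k
nrfiles=10000
openfiles=1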

Running the Benchmarks

Each of the job files is run with fio, pointing the target file at the FUSE mountpoint and capturing the results as JSON. For example:

$ fio latency.fio --filename=$PWD/nexus-fs/build/fuse_mount/$PWD/fiotest.dat --output-format=json --output=fuse-latency.json
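The same job can then be pointed at a path outside the FUSE mount to get a native baseline, and fio's JSON output makes the comparison scriptable. A sketch, where native-latency.json is a hypothetical baseline output and the jq path follows fio's JSON schema:

$ fio latency.fio --filename=$PWD/fiotest.dat --output-format=json --output=native-latency.json
$ jq '.jobs[0].read.clat_ns.mean' fuse-latency.json native-latency.json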

Determining Bottlenecks

  1. Kernel Context Switching - every FUSE request travels from the application into the kernel and then out to the user-space daemon and back, at least doubling the context switches per IO.

  2. Data Copying - without passthrough or splicing, the request payload is copied between the kernel and the FUSE daemon, costing memory bandwidth and CPU.

  3. Latency for Small Operations - the per-request overhead above is fixed, so it dominates when each operation is small, which is exactly what the latency and metadata workloads are designed to expose.
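The context-switching cost in particular is straightforward to observe: run the same fio job once on the native filesystem and once on the FUSE mount while averaging the system-wide context-switch rate. A rough sketch (cs is column 12 of vmstat's default output):

$ vmstat 1 60 | awk 'NR>2 { sum += $12; n++ } END { print sum/n, "cs/s" }'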

