Waiting in Line: Food Trucks, CPU Steal, and Virtual Machine Performance

Let's say you've had a long day and you're at your favorite food truck, patiently waiting to place an order. The person in front of you gets their food and leaves. Just as you're about to step up, someone in a hurry cuts in line and gets served first. You were next, but now your turn is delayed. I don't know about you, but I wouldn't be very excited about that, especially if I were really hungry.

The same scenario plays out with computers. Virtual machines (VMs) are software-based emulations of physical computers that run within a host system’s environment. When you spin up a VM in a virtualized environment, more often than not it is not the only VM on the server: you could be sharing the server with tens or hundreds of other isolated VMs running different operating systems. Cloud providers use virtualization software called a hypervisor, which runs on the physical server and dynamically allocates resources such as CPU, memory, and storage to the VMs running on it. The aim is to make the most of the server’s capacity and minimize cost.

What then is CPU Steal?

CPU steal, reported as a percentage, is the share of time a VM’s virtual CPU (vCPU) is ready to run but has to wait because the hypervisor has given the physical CPU to something else, usually another VM. Steal time can range from 0% to 100%. If there are five equally sized VMs running on a hypervisor, each one is not necessarily capped at 20% of the physical CPU. A VM can request more CPU time depending on the processes it is running. That VM is like the person who cut the line in front of you at the food truck. If a VM shows no CPU steal, it simply means it never had to wait: it got CPU time whenever it asked for it. Conversely, a steal of 90% means that for 90% of the measured interval your VM had work ready to run but was waiting for the physical CPU, typically because other VMs on the hypervisor were using it. Your VM could make progress in at most the remaining 10% of the time; if its processes need more than that, they slow down while the VM contends with its neighbors for CPU.
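To make the percentage concrete, here is a minimal Python sketch of how a steal figure can be derived from two readings of the aggregate cpu line in /proc/stat on Linux. The first eight counters on that line are user, nice, system, idle, iowait, irq, softirq, and steal, all cumulative since boot. The function names and the sample numbers are made up for illustration, and the sketch assumes a kernel that reports all eight counters.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Return the first eight counters of the aggregate 'cpu' line.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    The trailing guest fields are skipped because guest time is
    already included in user and nice.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:9]]


def steal_percent(before: str, after: str) -> float:
    """Share of elapsed CPU time counted as steal between two samples."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [new - old for old, new in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return 0.0  # no time elapsed between the two samples
    return 100.0 * deltas[7] / total  # index 7 is the steal counter


# Hypothetical samples, not taken from a real machine.
SAMPLE_BEFORE = "cpu 1000 0 500 8000 100 0 50 350 0 0"
SAMPLE_AFTER = "cpu 1300 0 600 8400 100 0 50 550 0 0"

if __name__ == "__main__":
    print(f"steal: {steal_percent(SAMPLE_BEFORE, SAMPLE_AFTER):.1f}%")
```

With these made-up samples, 200 of the 1000 ticks that elapsed were steal, so the script prints steal: 20.0%. On a real VM you would read the first line of /proc/stat twice, a second or so apart, and pass both lines in.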

CPU steal is pretty common in virtualized environments. It is mainly caused by one or both of the following:

  • Your VM is too small for the workload it runs. For example, if your VM has 1 vCPU but is running a CPU-intensive workload, you might need to resize it to a plan with more vCPUs.

  • The physical server hosting your VM might be oversold (oversubscribed), so the virtual machines on it are contending for CPU resources. If you strongly suspect this is the case, reach out to your cloud provider. One way around it is to run your VM on a dedicated host, but that is usually the more expensive option.

How to Measure CPU Steal Time

On Linux machines, CPU steal time is reported by several command-line utilities. Some of them are:

top: The top utility periodically displays a sorted list of system processes. CPU steal on the virtual machine is shown as st in the %Cpu(s) row of the output.

vmstat: vmstat reports information about processes, memory, paging, block I/O, disks, and CPU activity. Steal is shown under the st column of the output. Piping the output through column -t (vmstat | column -t) aligns the columns and makes them easier to read.

iostat: This utility monitors system input/output device loading by observing the time the devices are active in relation to their average transfer rates. It also reports CPU utilization, with steal shown under the %steal column.

iostat may not come pre-installed on your Linux distribution. It is part of the sysstat package; to install it on Debian and Debian-based distros like Ubuntu, run apt install sysstat (as root or with sudo).
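If you want to collect the number from a script rather than read it off the screen, one option is to run vmstat and pick out the st column by its header name. The sketch below is an illustration, not a polished tool: the function names and the sample output are hypothetical, and it assumes the usual vmstat layout with one header row of column names followed by data rows.

```python
import subprocess


def parse_vmstat_steal(output: str) -> int:
    """Return the 'st' value from the last data row of vmstat output."""
    rows = [line.split() for line in output.strip().splitlines()]
    # The column-name row is the one that contains the token 'st'.
    header = next((row for row in rows if "st" in row), None)
    if header is None:
        raise ValueError("no 'st' column found in vmstat output")
    return int(rows[-1][header.index("st")])


def current_steal() -> int:
    """Run 'vmstat 1 2' and return steal for the most recent second.

    The first data row vmstat prints is an average since boot, so the
    second row is the one that reflects what is happening right now.
    """
    result = subprocess.run(
        ["vmstat", "1", "2"], capture_output=True, text=True, check=True
    )
    return parse_vmstat_steal(result.stdout)


# Hypothetical vmstat output, used only to exercise the parser.
SAMPLE_OUTPUT = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  52100 604200    0    0     5     9  120  240  3  1 95  0  1
 2  0      0 811900  52100 604210    0    0     0     0  310  590 40  5 43  0 12
"""

if __name__ == "__main__":
    print(f"steal in sample: {parse_vmstat_steal(SAMPLE_OUTPUT)}%")
```

Looking the column up by name rather than by position keeps the parser working if your vmstat version prints extra columns. On a VM with vmstat installed, calling current_steal() returns the live value.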

How Much CPU Steal is Too Much CPU Steal?

Typically, a CPU steal time of 1% to 10% is normal and shouldn’t significantly affect machine performance. However, if it stays above 10% for long periods, it could point to problems such as CPU contention or an overloaded physical host. As a rule of thumb:

  • 1 - 3%: Minimal impact, barely noticeable in most tasks

  • 4 - 7%: Somewhat noticeable in CPU-heavy tasks, but it shouldn't have a major impact on overall performance

  • 8 - 10%: It could lead to some performance reduction in CPU-intensive applications, but can be managed with adequate resource planning

  • >10%: The VM is probably running slower than anticipated, suggesting CPU contention, which can impact the performance of the workload deployed on it
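The rule of thumb above is easy to turn into a small check for a monitoring script. This is only a sketch: the function names are made up, readings that fall between the listed bands (3.5%, say) are assigned to the next band up, and "long periods" is approximated as five consecutive readings, which is an assumption you would tune to your own sampling interval.

```python
from typing import Sequence


def describe_steal(percent: float) -> str:
    """Map a steal percentage onto the rule-of-thumb bands."""
    if percent <= 3:
        return "minimal impact"
    if percent <= 7:
        return "somewhat noticeable in CPU-heavy tasks"
    if percent <= 10:
        return "some performance reduction in CPU-intensive applications"
    return "likely CPU contention"


def sustained_contention(
    samples: Sequence[float], threshold: float = 10.0, min_samples: int = 5
) -> bool:
    """True when the last min_samples readings all exceed the threshold."""
    recent = list(samples)[-min_samples:]
    return len(recent) == min_samples and all(v > threshold for v in recent)


if __name__ == "__main__":
    readings = [2.0, 12.0, 13.5, 15.0, 11.0, 14.0]  # hypothetical values
    for value in readings:
        print(f"{value:5.1f}% -> {describe_steal(value)}")
    print("sustained contention:", sustained_contention(readings))
```

A single spike above 10% is not flagged here; only a run of high readings is, which matches the idea that brief steal is normal and prolonged steal is the warning sign.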

Written by

Valentine Okonkwo