Understanding SIMD: A Guide to Vectorized Computing


In this blog, we will discuss two main instruction execution models of a CPU:
- Scalar Processing: Single Instruction, Single Data
- Vector Processing: Single Instruction, Multiple Data
Scalar Processing
Scalar processing is the default execution model for the code you write every day. The following code uses scalar processing (assuming the compiler applies no vectorizing optimizations).
#include <iostream>
using namespace std;

int main() {
    int a[5] = {2, 3, 4, 5, 6};
    int b[5] = {5, 6, 7, 8, 9};
    int c[5];

    // 5 iterations are needed to perform the sum of two arrays
    for (int i = 0; i < 5; i++) {
        c[i] = a[i] + b[i];
    }
    return 0;
}
Let's see what's happening here:
- The first element of array A is loaded into a register.
- The first element of array B is loaded into another register.
- The operation (in this case, addition) is performed on the two registers.
- This cycle repeats until all elements are processed (here, 5 times).
Now, imagine you are working with a large dataset—maybe millions of elements in both arrays. Do you think this code will be efficient?
The answer is no.
Let's explore how we can optimize this using Vector Processing.
Vector Processing (Single Instruction, Multiple Data)
Modern CPUs support parallel execution of instructions, and SIMD (Single Instruction, Multiple Data) is one form of parallelism.
Scalar Operations:
- Elements are loaded one by one into registers before the operation is performed on each pair.
SIMD Operations:
- Loading: Multiple elements of array A are loaded into a single wide register.
- Loading: The corresponding elements of array B are loaded into another wide register.
- Operation: The operation is performed on all loaded elements simultaneously, avoiding repetitive steps.
But how can we load multiple elements into a register at once? Let's dive deeper.
How does this work?
CPUs that support SIMD have specialized registers capable of loading multiple elements at once. These registers allow parallel processing, improving performance for data-heavy computations. SIMD registers vary across different CPU architectures. Here are some of the major SIMD register sets:
| Architecture | Register Set | Size | 32-bit Elements per Register |
| --- | --- | --- | --- |
| x86 SSE | XMM0-XMM15 | 128 bits | 4 |
| x86 AVX | YMM0-YMM15 | 256 bits | 8 |
| ARM NEON | Q0-Q31 | 128 bits | 4 |
For example, in the x86 SSE register set, the XMM0 register can hold four 32-bit float values in the following way:
- XMM0 → [a0, a1, a2, a3]
Among the register sets listed above, the YMM registers (from AVX) hold the most 32-bit elements, allowing operations on up to eight elements in parallel.
However, even with SIMD there is a limit to how many elements can be processed simultaneously, because only so many elements fit into these specialized registers. To push performance further, techniques like loop unrolling and instruction pipelining can be layered on top. These advanced strategies are beyond the scope of this blog but play a crucial role in maximizing SIMD efficiency.
Note: SIMD registers are wider than general-purpose registers, but the width alone is not the point. A SIMD register is divided into lanes, and a single SIMD instruction operates on every lane at once, whereas a general-purpose register holds just one value per instruction.
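To make this concrete, here is a minimal C++ sketch of the earlier array addition done with 128-bit XMM registers. It is only an illustration: it assumes an x86-64 CPU with SSE support and uses compiler intrinsics from the <xmmintrin.h> header (intrinsics are covered in more detail later in this post).

#include <xmmintrin.h>  // SSE intrinsics
#include <cstdio>

int main() {
    // 16-byte alignment is required by _mm_load_ps / _mm_store_ps
    alignas(16) float a[4] = {2.0f, 3.0f, 4.0f, 5.0f};
    alignas(16) float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    alignas(16) float c[4];

    __m128 va = _mm_load_ps(a);      // XMM register <- [a0, a1, a2, a3]
    __m128 vb = _mm_load_ps(b);      // XMM register <- [b0, b1, b2, b3]
    __m128 vc = _mm_add_ps(va, vb);  // one instruction adds all four lanes
    _mm_store_ps(c, vc);             // write the four results back to memory

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}

A single _mm_add_ps replaces four scalar additions, and each load or store moves four floats at a time.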
Implementation
If you have ever worked with datasets or mathematical computations, you might have come across a library called NumPy. NumPy implements vector processing by leveraging the SIMD capabilities of the host CPU. If the CPU supports SIMD, NumPy takes advantage of it to execute operations in parallel, significantly improving performance. However, if the CPU does not support SIMD, NumPy automatically falls back to scalar operations, processing one element at a time. This ensures compatibility across different hardware architectures while still optimizing performance when possible.
import numpy as np
import time

SIZE = 10_000_000

# Generate random float arrays
A = np.random.rand(SIZE).astype(np.float32)
B = np.random.rand(SIZE).astype(np.float32)

# Scalar Addition
def scalar_add(A, B):
    C = [A[i] + B[i] for i in range(SIZE)]
    return C

# SIMD Addition
def simd_add(A, B):
    return A + B  # NumPy automatically uses SIMD

# Benchmark function
def benchmark(func, *args):
    start = time.time()
    result = func(*args)
    end = time.time()
    return end - start

# Run tests
time_scalar = benchmark(scalar_add, A, B)
print(f"Scalar Addition Time: {time_scalar:.6f} sec")

time_simd = benchmark(simd_add, A, B)
print(f"SIMD (NumPy) Addition Time: {time_simd:.6f} sec")

# Speedup
speed = time_scalar / time_simd
print(f"Speedup: {speed:.2f}x")
Output
Scalar Addition Time: 2.396143 sec
SIMD (NumPy) Addition Time: 0.002513 sec
Speedup: 953.43x
The difference is real!
In C and C++, we can leverage SIMD using intrinsics. Intrinsics are special function calls that directly map to low-level CPU instructions, allowing fine-grained control over SIMD operations without writing assembly code.
By using intrinsics, developers can manually utilize SIMD registers and instructions to optimize performance for tasks like numerical computations, image processing, and data transformations.
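As an example, here is a minimal sketch of the same array addition written with AVX intrinsics from <immintrin.h>. It assumes a CPU with AVX support and a compiler flag such as -mavx (GCC/Clang); the function name add_arrays is just an illustrative choice. The loop walks the arrays eight floats at a time using 256-bit YMM registers, and a plain scalar loop handles any leftover elements.

#include <immintrin.h>  // AVX intrinsics
#include <cstddef>

void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    // Vectorized part: process 8 floats per iteration
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // YMM <- 8 floats from a
        __m256 vb = _mm256_loadu_ps(b + i);  // YMM <- 8 floats from b
        __m256 vc = _mm256_add_ps(va, vb);   // 8 additions in a single instruction
        _mm256_storeu_ps(c + i, vc);         // store 8 results into c
    }
    // Scalar tail: handle the remaining n % 8 elements one by one
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

The unaligned load/store variants (_mm256_loadu_ps and _mm256_storeu_ps) are used so the sketch works regardless of how the arrays are allocated.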
Conclusion
| When to Use SIMD | When NOT to Use SIMD |
| --- | --- |
| Large arrays or matrices | Small data sets (setup overhead is high) |
| Repetitive arithmetic operations | Dependent operations (e.g., recursion) |
| Image processing (filters, convolution) | If-else conditions (branching issues) |
| Deep learning & neural networks | Unaligned memory access |
Check Out These Links to Learn More About the Topic:
- Intrinsics: Improving Performance with SIMD Intrinsics in Three Use Cases
- Designing a SIMD Algorithm: SIMD Base64
- SIMD Introduction with Code - Watch the video on YouTube