Understanding SIMD: Guide to Vectorized Computing

Tanay Raj
5 min read

In this blog, we will discuss two main instruction execution models of the CPU:

  1. Scalar Processing: Single Instruction, Single Data

  2. Vector Processing: Single Instruction, Multiple Data

Scalar Processing

This is the execution model you get by default whenever you write ordinary code. The following code uses scalar processing (assuming the compiler applies no optimizations such as auto-vectorization).

#include <iostream>
using namespace std;

int main() {
    int a[5] = {2, 3, 4, 5, 6};
    int b[5] = {5, 6, 7, 8, 9};
    int c[5];
    // 5 iterations are needed to perform the sum of two arrays
    for(int i = 0; i < 5; i++) {
        c[i] = a[i] + b[i];
    }
    return 0;
}

Let's see what's happening here:

  1. The first element of array A is loaded into a register.

  2. The first element of array B is loaded into another register.

  3. The operation (in this case, addition) is performed on the two values.

  4. The result is stored back into the corresponding element of array C.

This cycle repeats until all elements are processed (here, 5 times).

Now, imagine you are working with a large dataset—maybe millions of elements in both arrays. Do you think this code will be efficient?

The answer is no.

Let's explore how we can optimize this using Vector Processing.

Vector Processing (Single Instruction, Multiple Data)

Modern CPUs support parallel execution of instructions, and SIMD (Single Instruction, Multiple Data) is one form of parallelism.

Scalar Operations:

  • Elements are loaded one by one into registers before performing operations.

SIMD Operations:

  • Loading: Multiple elements of array A are loaded into a single wide register.

  • Loading: The corresponding elements of array B are loaded into another wide register.

  • Operation: The required operation is performed on all of these elements simultaneously, with a single instruction.

But how can we load multiple elements into a register at once? Let’s dive deeper.

How does this work?

CPUs that support SIMD have specialized registers capable of loading multiple elements at once. These registers allow parallel processing, improving performance for data-heavy computations. SIMD registers vary across different CPU architectures. Here are some of the major SIMD register sets:

Architecture | Register Set | Size     | Elements (32-bit)
x86 SSE      | XMM0-XMM15   | 128 bits | 4
x86 AVX      | YMM0-YMM15   | 256 bits | 8
ARM NEON     | Q0-Q31       | 128 bits | 4

For example, in the x86 SSE register set, the XMM0 register can hold four 32-bit float values in the following way:

  • XMM0 → [a0, a1, a2, a3]

Among the register sets listed above, the YMM registers (from AVX) hold the most 32-bit elements, allowing operations on up to eight elements in parallel.
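
To make this layout concrete, here is a minimal sketch (using C++ SSE intrinsics, which we return to later) that loads four floats into a 128-bit XMM register and copies them back out; the variable names are only for illustration:

#include <immintrin.h>  // SSE intrinsics (_mm_loadu_ps, _mm_storeu_ps)
#include <iostream>

int main() {
    float a[4] = {2.0f, 3.0f, 4.0f, 5.0f};
    __m128 va = _mm_loadu_ps(a);   // all four floats now sit in one 128-bit register
    float out[4];
    _mm_storeu_ps(out, va);        // copy the register back to memory so we can inspect it
    for (float x : out) std::cout << x << " ";  // prints: 2 3 4 5
    return 0;
}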

However, even with SIMD there is a cap on how many elements can be processed simultaneously, because the width of these specialized registers is fixed. To go beyond this limit or to squeeze out further performance, techniques such as loop unrolling and pipelining can be used. These advanced strategies are beyond the scope of this blog but play a crucial role in maximizing SIMD efficiency.

Note: SIMD registers are not simply oversized general-purpose registers. They form a separate register file, and the CPU's vector execution units operate on every lane of a SIMD register with a single instruction; general-purpose registers hold only one value at a time and are not designed for such lane-wise operations.

Implementation

If you have ever worked with datasets or mathematical computations, you might have come across a library called NumPy. NumPy implements vector processing by leveraging the SIMD capabilities of the host CPU. If the CPU supports SIMD, NumPy takes advantage of it to execute operations in parallel, significantly improving performance. However, if the CPU does not support SIMD, NumPy automatically falls back to scalar operations, processing one element at a time. This ensures compatibility across different hardware architectures while still optimizing performance when possible.

import numpy as np
import time

SIZE = 10_000_000

# Generate random float arrays
A = np.random.rand(SIZE).astype(np.float32)
B = np.random.rand(SIZE).astype(np.float32)

# Scalar Addition
def scalar_add(A, B):
    C = [A[i] + B[i] for i in range(SIZE)]
    return C

# SIMD Addition
def simd_add(A, B):
    return A + B  # NumPy's vectorized addition uses SIMD when the CPU supports it

# Benchmark function
def benchmark(func, *args):
    start = time.time()
    result = func(*args)
    end = time.time()
    return end - start

# Run tests
time_scalar = benchmark(scalar_add, A, B)
print(f"Scalar Addition Time: {time_scalar:.6f} sec")

time_simd = benchmark(simd_add, A, B)
print(f"SIMD (NumPy) Addition Time: {time_simd:.6f} sec")

# Speedup
speed = time_scalar / time_simd
print(f"Speedup: {speed:.2f}x")

Output

Scalar Addition Time: 2.396143 sec

SIMD (NumPy) Addition Time: 0.002513 sec

Speedup: 953.43x

The difference is real! (Part of this gap comes from moving the loop out of the Python interpreter into compiled code; the SIMD instructions add a further speedup on top of that.)

In C and C++, we can leverage SIMD using intrinsics. Intrinsics are special function calls that directly map to low-level CPU instructions, allowing fine-grained control over SIMD operations without writing assembly code.

By using intrinsics, developers can manually utilize SIMD registers and instructions to optimize performance for tasks like numerical computations, image processing, and data transformations.
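
As a rough illustration, here is a minimal sketch of the earlier array addition written with SSE intrinsics. The function name add_arrays and the use of unaligned loads are choices made for this example, not the only way to do it:

#include <immintrin.h>  // SSE intrinsics

// Adds two float arrays, four elements per iteration, using 128-bit XMM registers.
void add_arrays(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // load a[i..i+3] into one wide register
        __m128 vb = _mm_loadu_ps(b + i);  // load b[i..i+3] into another wide register
        __m128 vc = _mm_add_ps(va, vb);   // add all four lanes in a single instruction
        _mm_storeu_ps(c + i, vc);         // store the four results back to memory
    }
    // Scalar tail for any leftover elements when n is not a multiple of 4
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Compilers can often generate similar code automatically (auto-vectorization), but intrinsics give you explicit control when the compiler cannot prove that vectorizing is safe.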

Conclusion

When to Use SIMD                        | When NOT to Use SIMD
Large arrays or matrices                | Small data sets (setup overhead is high)
Repetitive arithmetic operations        | Dependent operations (e.g., recursion)
Image processing (filters, convolution) | If-else conditions (branching issues)
Deep learning & neural networks         | Unaligned memory access