Beyond Big-O: How Hardware Shapes Code Performance


In programming, code performance tends to be tied to the algorithm. We often treat algorithmic complexity as the primary factor in code speed. The rule of thumb is:
The fewer operations a program performs, the faster it runs.
While this principle is valid, it overlooks a critical aspect: performance comparisons only make sense when the environment remains the same. And by environment, I mean the hardware: CPU, GPU, RAM, and so on. Ultimately, code is just instructions; it is the hardware that executes them. Code doesn't just define what tasks to perform, it can also tell the hardware how to do them.
Yet, most developers focus heavily on minimizing the what—reducing the number of operations—while completely ignoring the how. The reason is understandable: optimizing the how requires a solid understanding of hardware-level behavior, and as software engineers, we often:
- Tend to prioritize high-level features and functionality.
- Lack familiarity with low-level hardware operations.
- May not even realize how much hardware-aware coding can boost performance.
In this article, we'll see how a deeper understanding of hardware can turn even small code changes into significant performance gains.
Memory reading
Let’s look at a simple example.
Given this array of length 2 073 600 (1920×1080):
Note: this example only works with arrays, not linked lists— we’ll explore why later in the section.
```kotlin
private const val COLUMN = 1920
private const val ROW = 1080
private val length = COLUMN * ROW

// Fill the list with `length` sequential Ints (an ArrayList is backed by a plain array)
private val list = ArrayList<Int>(length).apply {
    for (i in 0 until length) add(i)
}
```
Now, let’s write a program that iterates through the entire array and processes each element. For this, we are going to study two approaches (sketched in code below):
- The first method iterates through items sequentially: 0, 1, 2, 3, 4, …
- The second method iterates in a leapfrogging pattern: 0, 1920, 3840, 5760, …, 1, 1921, 3841, …
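Here is a minimal sketch of the two traversals (the names match the benchmark output further down; `list`, `length`, `COLUMN`, and `ROW` are the declarations above, and each method returns a sum only so the loops can't be optimized away):

```kotlin
// Sequential: visit indices 0, 1, 2, 3, ...
fun sequentialIteration(): Long {
    var sum = 0L
    for (i in 0 until length) {
        sum += list[i]
    }
    return sum
}

// Leapfrogging: visit indices 0, 1920, 3840, ... then 1, 1921, 3841, ...
fun leapfroggingIteration(): Long {
    var sum = 0L
    for (col in 0 until COLUMN) {
        for (row in 0 until ROW) {
            sum += list[row * COLUMN + col]
        }
    }
    return sum
}
```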
These two approaches have the same complexity and perform exactly the same number of operations; the only difference is the order in which the items are accessed.
Before reading further, take a moment to think: what could possibly make one method faster than the other?
...
Alright, let’s reveal the answer. Here’s the benchmark comparison of the two methods:
Note: The benchmark was run with the kotlin-benchmark library in Mode.AverageTime.
```
Benchmark                          Mode  Cnt  Score    Error  Units
Benchmark.sequentialIteration      avgt    5  0.952  ± 0.003  ms/op
Benchmark.leapfroggingIteration    avgt    5  3.517  ± 0.015  ms/op
```
The benchmark shows that the sequential approach completes an operation in approximately 0.952 milliseconds, while the leapfrogging approach takes about 3.517 milliseconds. In other words, sequential reading is roughly 3.7 times faster than leapfrogging (the exact ratio varies with the hardware and the data structure used) 🔥🔥.
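To put those numbers in perspective: 0.952 ms spread over the 2,073,600 elements is roughly 0.46 ns per element, while 3.517 ms works out to about 1.7 ns per element. That extra nanosecond or so per read is, roughly, the cost of the memory behavior described below.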
Wondering why?
To understand the reason, we need to look at how data is read from memory, since the only difference between the two methods lies in the order in which data is accessed.
Although memory architecture varies across CPU designs, virtually all systems include RAM and multiple layers of cache. When a program requests data, the CPU first checks the cache; if the data isn't there, it is fetched from RAM, which is significantly slower. Crucially, data is fetched in fixed-size blocks (cache lines), so future requests that hit previously fetched blocks avoid those slow RAM accesses.
In short, the fewer memory fetches from RAM, the faster the program runs.
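To put rough numbers on it (these are assumptions, not measurements from the benchmark machine): with 64-byte cache lines and 4-byte Int values, one line holds 16 consecutive elements. The sequential loop therefore needs a RAM fetch only about once every 16 reads, while the leapfrogging loop jumps 1920 elements (roughly 7.5 KB) between consecutive reads and lands on a different line almost every time. (With a JVM ArrayList the Ints are actually boxed, so the real layout is messier, but the principle holds.)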
Below is what happens in the two above programs:
*(Original figures: diagrams of the cache and RAM fetches performed by the sequential and leapfrogging access patterns.)*
With the same amount of data requested, sequential reading typically requires far fewer RAM fetches than leapfrogging, making it significantly more efficient.
Important note: As mentioned earlier, arrays are the best fit for sequential reading because they store data in contiguous memory blocks (Kotlin's ArrayList, used above, is backed by a plain array). Non-contiguous structures (like LinkedList) don't offer the same performance benefits, since logically adjacent elements can live anywhere in memory.
Best-practice takeaway: Always aim to reduce the number of RAM fetches. When data is stored contiguously, sequential reading is the most efficient approach. You can push this further by taking the cache's capacity into account (newly fetched data is loaded into the cache while older data is evicted to make room) and tailoring the reading pattern accordingly, as in the sketch below.
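As an illustration of tailoring the pattern to the cache, here is a hedged sketch of a tiled traversal of the same 1080-row by 1920-column grid; `TILE` is an assumed value you would tune per machine. Whatever order you need inside a tile, the handful of rows it touches tend to stay cached while you work on it, which helps when a purely sequential order isn't an option:

```kotlin
// Assumption: a tile small enough that the rows it spans stay resident in cache.
const val TILE = 64

fun tiledIteration(): Long {
    var sum = 0L
    // Visit the grid tile by tile instead of column by column
    for (rowStart in 0 until ROW step TILE) {
        for (colStart in 0 until COLUMN step TILE) {
            for (row in rowStart until minOf(rowStart + TILE, ROW)) {
                for (col in colStart until minOf(colStart + TILE, COLUMN)) {
                    sum += list[row * COLUMN + col]
                }
            }
        }
    }
    return sum
}
```

This is the same general idea that cache-aware libraries apply internally, as we'll see in the next section.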
By understanding how cache fetching works, you can significantly improve your program’s performance with just a tiny code change.
Numpy vectorization operations
I have recently started exploring some Machine Learning basics. While studying the Linear Regression model (with multiple features), I came across a problem that involves multiplying two arrays element-wise and summing the products:
$$ \mathbf{w} = [w_1, w_2, \ldots, w_n],\quad \mathbf{x} = [x_1, x_2, \ldots, x_n] $$
$$ \mathbf{w} \cdot \mathbf{x} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n $$
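For example, with two small vectors:
$$ \mathbf{w} = [1, 2, 3],\quad \mathbf{x} = [4, 5, 6] \quad\Rightarrow\quad \mathbf{w} \cdot \mathbf{x} = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32 $$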
From a programming perspective, this is a straightforward problem. The easiest and most intuitive approach is to use a for loop to multiply each pair of elements and then sum all the products.
```python
f = 0
for i in range(0, n):      # n = len(w) = len(x)
    f = f + w[i] * x[i]
```
To me, iterating over every element and doing the math seemed inevitable. Therefore, I believed the for loop was the most efficient solution and that no better alternative existed. And then, I got introduced to the Numpy library with its built-in dot() operation:
```python
f = np.dot(w, x)
```
At first, I assumed the dot() function was just a fancy wrapper around a for loop—until the instructor made this bold claim:
Numpy's dot function can be hundreds or even thousands of times faster than a for-loop.
Wait what? 😮😮 Like how?
Out of curiosity, I looked it up and discovered how it was made possible. It turns out that the Numpy library uses a variety of sophisticated techniques that take advantage of modern hardware capabilities to maximize performance when working with massive numerical data.
Pre-compiled code
Even though Numpy is a Python library, its vectorized methods are pre-compiled C (and sometimes Fortran) code. This compiled code runs much faster than a Python for loop, where each iteration is interpreted at runtime.
Vectorization (SIMD)
In an ML context, data is usually vectorized, meaning it is transformed into numerical vectors that models can process efficiently. Vectorization allows computations on entire vectors at once, leveraging the parallel processing capabilities of modern CPUs and GPUs. This is enabled by Single Instruction, Multiple Data (SIMD), a technique where one instruction operates on multiple data points at the same time. For example, in a dot product, the multiplications \( w_i x_i \) are performed in groups of 4, 8, or more per instruction (depending on the SIMD width), and the partial results are summed. The result is significantly faster computation and simpler code.
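As a rough sketch of the mechanics (the 4-lane width is an assumption; real hardware may offer 8, 16, or more lanes), the dot product can be regrouped as:

$$ \mathbf{w} \cdot \mathbf{x} = \sum_{k=0}^{n/4 - 1} \big( w_{4k+1} x_{4k+1} + w_{4k+2} x_{4k+2} + w_{4k+3} x_{4k+3} + w_{4k+4} x_{4k+4} \big) $$

Each parenthesized group of four products can be produced by a single SIMD multiply, so on the order of \( n/4 \) multiply instructions are issued instead of \( n \), with the partial sums combined at the end (assuming \( n \) is a multiple of 4; any leftover elements are handled separately).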
Efficient Memory Management
Memory layout
Numpy's data structures—primarily the ndarray—store data in contiguous memory blocks. As analyzed in the previous section, this memory layout allows for much faster reading and processing of large datasets.
Cache optimization
We've seen how utilizing the cache can improve code performance. NumPy uses blocking/tile algorithms to maximize data locality and cache utilization. The idea is to divide computations into smaller, more manageable chunks called blocks or tiles. Each block is processed entirely before moving on to the next, helping ensure that the necessary data stays in the cache for as long as possible.
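For a rough sense of scale (the cache sizes here are assumptions and vary by CPU): a 32 × 32 tile of double-precision values occupies 32 × 32 × 8 B = 8 KB, so a few such tiles, covering the inputs and the output of a computation, fit comfortably in a typical 32 KB L1 data cache and can be reused many times before being evicted.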
Parallelism and Multi-Core Utilization
In addition to parallel data processing within a single CPU core, Numpy can also use multiple cores: its heavier operations, such as dot products and other linear algebra, are typically delegated to multi-threaded BLAS libraries, which apply the same optimization techniques across cores.
And more...
There are still many other algorithms tailored to specific hardware, memory designs, and device architectures. All of them contribute to creating a sophisticated solution for processing large datasets.
Takeaways
- Algorithm complexity isn't everything: Reducing the number of operations helps, but operation counts only predict performance when the hardware environment stays the same.
- Hardware matters: A solid understanding of fundamental hardware operations—such as data access, cache behavior, memory management, and CPU architecture—can help you design significantly more efficient solutions.
- Code optimization has limits, thus hardware matters even more: At some point, algorithmic improvements reach their theoretical minimum. Hardware, on the other hand, continues to evolve. Major tech companies invest heavily in better processors, memory, and accelerators—giving you new tools to push performance further.
- Hybrid optimization is essential: For large-scale or performance-critical tasks, neither algorithmic efficiency nor hardware tuning alone is sufficient. Combining both—writing efficient code and aligning it with hardware behavior—yields the best results.