Transform Your Code: Steps for Rapid Performance Improvement

Raman Singh
5 min read

Have you ever wondered what makes some code run instantly while other code seems to take forever? Here at ML with Brar, we love digging into these kinds of questions. At the heart of many complex applications, from AI and scientific computing to video games, lies a fundamental operation: matrix multiplication.

It might sound like something straight out of a linear algebra textbook, but how we tell a computer to perform this task can mean the difference between waiting seconds or waiting hours.

I decided to put this to the test. I took this one simple task and tried to solve it in several different ways, from the most basic Python code to highly optimized C and powerful GPU programming. The goal? To see just how fast we can go and to understand why some methods are so much faster than others.

Let's dive in!

The Contenders: Our Programming Approaches

We have a lineup of six contenders, each with a different strategy for multiplying matrices:

  1. The Beginner (Naive Python): Just simple, plain for loops. It's the first thing you'd think of, but it comes with a lot of baggage from the Python interpreter.

  2. The Pro (NumPy): Using Python's premier scientific computing library, which hands off the hard work to highly optimized, pre-compiled code.

  3. The Old-Schooler (Naive C): Writing the same simple loops but in C, a language known for being closer to the hardware and much faster than Python.

  4. The Strategists (Optimized C): Two "smarter" versions of our C code. One uses tiling and the other uses transposition—classic techniques to work with the CPU's memory layout, not against it.

  5. The Heavy Hitter (GPU with CuPy): Taking the problem off the CPU entirely and giving it to a powerful NVIDIA Tesla P100 GPU, a piece of hardware born for this kind of parallel number-crunching. We tested this in both standard single precision and high-accuracy double precision.

The Arena: Our Test Machine

All tests were run on a Kaggle Notebook powered by:

  • CPU: An Intel Xeon Processor (Skylake) with AVX-512 instructions.

  • GPU: An NVIDIA Tesla P100.

The Main Event: Performance Showdown

We measured performance in GFLOPS (giga floating-point operations per second, i.e., billions of operations per second). Higher is better!
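
For context: multiplying two N×N matrices with the standard algorithm takes about 2·N³ floating-point operations (one multiply and one add per inner step), so GFLOPS falls straight out of a wall-clock timing. A tiny helper (my own, not from the post's repository):

```python
def gflops(n: int, seconds: float) -> float:
    """GFLOPS for an n x n matrix multiply: ~2*n^3 floating-point ops."""
    return (2 * n**3) / (seconds * 1e9)

# Example: a 1024 x 1024 multiply that took 10 ms
print(gflops(1024, 0.01))  # ~214.7 GFLOPS
```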

1. The Starting Line: Naive Python

[Graph of Naive Python Performance]

As you can see, our naive Python implementation is barely on the chart, clocking in at a minuscule 0.0035 GFLOPS. This is our baseline. It works, but it's the equivalent of walking in a race against supercars. The overhead of the Python interpreter for every single calculation is just too much.
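
The post doesn't inline the source, but the naive version is essentially the textbook triple loop below (a sketch; the repository's code may differ in details):

```python
def matmul_naive(A, B):
    """Textbook O(n^3) matrix multiply over nested Python lists."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                # Every iteration pays for bytecode dispatch, bounds checks,
                # and boxed float objects: pure interpreter overhead.
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C
```

With n³ inner iterations each carrying that overhead, 0.0035 GFLOPS is about what you'd expect.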

2. A Little Better: Naive C

[Graph of Naive C Performance]

By switching to C, we get a significant speedup over Python, but we're still performing at less than 1 GFLOPS. Why so slow? Because of memory. The standard loop causes the CPU to jump all over memory to find the numbers it needs, creating a "traffic jam" in the CPU's cache. This proves that just using a "fast" language isn't a magic bullet.
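
You can see the layout problem directly with NumPy's strides (just an illustration of row-major memory, not the C benchmark itself):

```python
import numpy as np

B = np.zeros((1024, 1024), dtype=np.float64)
print(B.strides)  # (8192, 8): the next row is 8192 bytes away, the next column only 8

# The naive i-j-k loop reads B[k][j] with k increasing, i.e. it walks *down*
# a column, jumping 8 KB per element and evicting cache lines constantly.
```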

3. Getting Smarter: Optimized C

[Graph of Tiled C Performance]

[Graph of Transposed C Performance]

Our two "smarter" C versions show a noticeable improvement over the naive C code. By using techniques like tiling and transposition, we're telling the CPU how to access memory more efficiently. While they still don't hold a candle to the next contender, they prove a crucial point: how you write your algorithm matters… a lot.
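
Neither C listing appears in the post, so here are both ideas sketched in Python for readability; the C versions follow the same loop structures (and, unlike pure Python, actually collect the cache payoff):

```python
def matmul_transposed(A, B):
    """Pre-transpose B so the hot inner loop reads both operands row-wise."""
    n, m, p = len(A), len(B), len(B[0])
    BT = [[B[k][j] for k in range(m)] for j in range(p)]  # one O(n^2) pass up front
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        Ai = A[i]
        for j in range(p):
            Bj = BT[j]
            C[i][j] = sum(Ai[k] * Bj[k] for k in range(m))  # contiguous reads
    return C

def matmul_tiled(A, B, T=64):
    """Blocked multiply over square matrices: T x T tiles stay hot in cache."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):
        for k0 in range(0, n, T):
            for j0 in range(0, n, T):
                for i in range(i0, min(i0 + T, n)):
                    Ai, Ci = A[i], C[i]
                    for k in range(k0, min(k0 + T, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(j0, min(j0 + T, n)):
                            Ci[j] += a * Bk[j]  # accumulate tile by tile
    return C
```

The tile size T is a tunable knob: it should be small enough that three T×T blocks fit in cache, which is exactly the kind of hardware awareness the naive version never exposes.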

4. The CPU Champion: NumPy

[Graph of NumPy Performance]

This is where things get serious. NumPy, running on the CPU, hits a stunning ~250 GFLOPS. That's hundreds of times faster than our naive C code! NumPy uses highly tuned, low-level libraries (like Intel's MKL) that are specifically designed to use every trick in the CPU's playbook, like its AVX-512 vector instructions. In fact, this result is remarkably close to the CPU's theoretical maximum speed. For CPU-based work, this is the gold standard.
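
Reproducing the NumPy measurement is a one-liner plus timing. A minimal sketch (the 4096×4096 float64 size is my assumption; the post doesn't state the sizes used):

```python
import time
import numpy as np

n = 4096
A = np.random.rand(n, n)  # float64 by default
B = np.random.rand(n, n)

t0 = time.perf_counter()
C = A @ B                 # dispatches to the underlying BLAS (MKL, OpenBLAS, ...)
dt = time.perf_counter() - t0
print(f"{2 * n**3 / (dt * 1e9):.1f} GFLOPS")
```

If you're curious which BLAS your install is linked against, `np.show_config()` will tell you.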

5. The Undisputed King: The GPU

[Graph of CuPy FP32 Performance]

[Graph of CuPy FP64 Performance]

And finally, the main event. Handing the work over to the NVIDIA P100 GPU gives us performance that is simply in another league.

  • Single Precision: We hit over 4000 GFLOPS.

  • Double Precision: We still achieved a massive 3700 GFLOPS.

This is the power of massive parallelism. A GPU is like an army of thousands of workers all tackling a small piece of the problem at the same time. For a task like matrix multiplication, this approach is unbeatable. The GPU was about 15-16 times faster than our fully-optimized CPU implementation.
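
The CuPy code looks almost identical to NumPy; the one benchmarking trap is that GPU kernels launch asynchronously, so you must synchronize before reading the clock. A sketch (the matrix size is my assumption):

```python
import time
import cupy as cp

n = 8192
for dtype in (cp.float32, cp.float64):   # single- and double-precision runs
    A = cp.random.rand(n, n, dtype=dtype)
    B = cp.random.rand(n, n, dtype=dtype)

    C = A @ B                            # warm-up call to initialize kernels
    cp.cuda.Stream.null.synchronize()

    t0 = time.perf_counter()
    C = A @ B                            # executed by cuBLAS on the GPU
    cp.cuda.Stream.null.synchronize()    # wait until the kernel actually finishes
    dt = time.perf_counter() - t0
    print(f"{dtype.__name__}: {2 * n**3 / (dt * 1e9):.1f} GFLOPS")
```

Without those synchronize() calls, the timer would only measure the kernel launch, and the numbers would look absurdly high.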

Final Thoughts

This experiment tells a clear story:

  • The tools you use are critical. A library like NumPy can instantly give you performance that would take ages to write and tune by hand.

  • Algorithms are more important than the language. A smart algorithm in C will reliably beat a naive one, but a professionally optimized library often beats them both.

  • For massive data parallelism, nothing beats a GPU. When you need to do the same calculation over and over on huge datasets, a GPU is the right tool for the job.

It's a fascinating look at how software and hardware dance together to produce incredible results!

Want to see the code behind the tests? Check out the full project on my GitHub Repository.

Thanks for reading!
