Benchmarking Assembler Code

clockwork
Time is an abstract thing with a million analogies. From Einstein’s theory of relativity to radio synchronization, from 9th-level D&D spells to the Doomsday Clock, time is something universal and at the same time highly individual. While time moves onward universally, it can bend and stretch, depending on the situation we find ourselves in.

And while many services promise to save time for us, we often find ourselves wasting it away by watching useless ads or doomscrolling at night. How much of it is left for you, for me, for the world? Let’s just hope we’ll spend enough time with our loved ones before it runs out.

Time is of utmost importance in this article too, since we are going to benchmark code to see whether an optimization is useful or a waste of time. I chose a real-world example this time to demonstrate how to benchmark assembly instructions.

And which song would fit the occasion better than an all-time classic:

https://youtu.be/Qr0-7Ds79zo?si=5WDloS35j9wiKEli

Preface

This time we are going to look at something I worked on at my job. The problem we are going to investigate stems from the need to swap bytes from one architecture to another in the context of an emulator. When we emulate a system that uses big-endian byte ordering, we have to swap these bytes whenever we load them on an x86 CPU, and again after we have finished whichever operations we were going to execute.

The straightforward solution is to [ mov ] the bytes from memory, then apply the [ bswap ] instruction to them. We ignore the fact that 16-bit values need to be swapped with [ xchg ] for this article.

First, let’s put on our thinking caps and think real hard about how we could achieve a byteswap only with rotate operations. For a 32-bit value in [ eax ], the classic sequence takes three rotates:

  1. ror [ ax ], 8 (swaps the two low bytes)

  2. ror [ eax ], 16 (swaps the upper and lower 16-bit halves)

  3. ror [ ax ], 8 (swaps the two low bytes again)

Mov + Bswap

Since rotating bytes is cumbersome and inefficient, the [ bswap ] instruction was introduced (with the 486) to do that kind of work for us. bswap is a 3-byte instruction (in 64-bit mode) with the following opcode: [ REX.W + 0F C8+rd ]. The register is encoded directly in the opcode, so it needs no extra operand byte.

The [ mov ] instruction in 64-bit mode uses 2 bytes: [ REX.W + 8B ] when loading from memory into a register, or [ REX.W + 89 ] when storing a register back to memory.

So we have 5 opcode bytes in total to load a memory location and swap the bytes loaded this way, split across 2 processor instructions. In a real program the mov also carries a ModRM byte and a 1-byte displacement for the stack operand, so we’d have a total memory size of 7 bytes for both instructions.

MovBe

Because the endian swap is such a common operation, there is of course a dedicated instruction for exactly this: [ movbe ] = move big endian. The opcode looks like this: [ REX.W + 0F 38 F0 ], so we have 4 bytes in total (in 64-bit mode). If we want to store to memory instead, we replace the final F0 with F1.

So in total we have 4 opcode bytes and only 1 instruction. In theory, this should be faster than using mov + bswap. In a real program we again find a ModRM byte and a 1-byte displacement, so our total memory size would be 6 bytes.

Comparison: [ movbe ] vs. [ mov + bswap ]

Technique      Size in memory                 Instructions
Mov + Bswap    4 byte mov + 3 byte bswap = 7  2
MovBe          6 bytes                        1

Setup

Now that we know what we are going to examine, we can create a skeleton for benchmarking. I played around with __asm inline assembly and consulted Stack Overflow, but I really hate this style of coding and encountered really … unique … fellas over there, so I’ll stick to extern “C” style calls.

The background is that you sometimes won’t have the luxury of a separate assembler and need to use the gcc-style /bin/as. This really, really sucks, so you’d be better off just using the __asm keyword and trying your best not to bite your keyboard while handling AT&T syntax. So if you’re one of the guys from my thread, please start breathing heavily now and type aggressively on your keyboard.

The class from which we are going to call our assembly looks like this:

#include <time.h>
#include <stdio.h>

extern "C" void movBeBenchmark(void);
extern "C" void movBswapBenchmark(void);

/*
    benchmarks the movbe instruction with rcx as counter
    uses easy-to-read values for easy debugging
*/
void startBenchmarkMovBe(){

    //start a timer for benchmarking
    double startTime = (double)clock()/CLOCKS_PER_SEC;

    movBeBenchmark();

    //stop timer
    double endTime = (double)clock()/CLOCKS_PER_SEC;

    fprintf(stdout, "Function took %.6f seconds to execute!\n", endTime - startTime);
}

/*
    benchmarks the mov + bswap combination with rcx as counter
    uses easy-to-read values for easy debugging
*/
void startBenchmarkMovBswap(){

    //start a timer for benchmarking
    double startTime = (double)clock()/CLOCKS_PER_SEC;

    movBswapBenchmark();

    //stop timer
    double endTime = (double)clock()/CLOCKS_PER_SEC;

    fprintf(stdout, "Function took %.6f seconds to execute!\n", endTime - startTime);
}

int main(){

    fprintf(stdout, "Starting MovBe Benchmark ...\n");
    startBenchmarkMovBe();
    fprintf(stdout, "MovBe Benchmark finished ...\n-------------------------------\n");

    fprintf(stdout, "Starting MovBswap Benchmark ...\n");
    startBenchmarkMovBswap();
    fprintf(stdout, "MovBswap Benchmark finished ...\n");

    return 0;
}

As you can see, this is really no witchcraft. It’s crucial to use some form of timing, and since I didn’t want to code around more problems than I already had, I decided to just use the C clock() timer for this.

The benchmark code

Now we need to think about how we can measure both instructions against each other. The premise is that both functions must do the same workload, and it should be more than just one instruction; it would also be nice if it were somewhat easy to debug. Plus, we need a way to measure operations that run at nanosecond speed. So, how can we achieve this?

First, we build up the stack in both functions to contain the same data. Since this setup uses the exact same instructions, it should cost the same in both functions. Let’s use the eight registers [ r8 - r15 ]; note that while r8 - r11 are caller-saved, r12 - r15 are callee-saved under the System V ABI, so strictly we should preserve them before clobbering them. We can place values from memory into these registers, swap their bytes, then swap them again and place them back into memory. In the end, we’re doing the same thing a real emulator does, with some extra steps.

And how are we going to measure this? I’d say we loop these operations a bazillion times and see which function terminates faster. Since both functions use the same looping logic, the loop should not distort our measurements. Basically, if we do the exact same work, the same number of times, and only use a different tool, we should be able to see which tool is better for the job. Let’s go:

section .text
global movBswapBenchmark

movBswapBenchmark:
; init a stack frame pointer    (prologue)

    push    rbp
    mov     rbp, rsp

; r12 - r15 are callee-saved in the System V ABI, so stash them in the red zone
    mov qword[rbp-72],  r12
    mov qword[rbp-80],  r13
    mov qword[rbp-88],  r14
    mov qword[rbp-96],  r15

; prepare a stack with ez-to-find numbers
    mov qword[rbp-8],   0x11112222
    mov qword[rbp-16],  0x33334444
    mov qword[rbp-24],  0x55556666
    mov qword[rbp-32],  0x77778888
    mov qword[rbp-40],  0x33445566
    mov qword[rbp-48],  0x12332123
    mov qword[rbp-56],  0x11001100
    mov qword[rbp-64],  0x12345678

; define a counter for testing
    mov rcx,    0x100000000

; worker loop for benchmarking
    .swap_loop:
        mov r8,     qword[rbp-8]
        bswap r8
        mov r9,     qword[rbp-16]
        bswap r9
        mov r10,    qword[rbp-24]
        bswap r10
        mov r11,    qword[rbp-32]
        bswap r11
        mov r12,    qword[rbp-40]
        bswap r12
        mov r13,    qword[rbp-48]
        bswap r13
        mov r14,    qword[rbp-56]
        bswap r14
        mov r15,    qword[rbp-64]
        bswap r15

        ; swap back before storing, so memory keeps its original byte order
        bswap r8
        mov qword[rbp-8],   r8
        bswap r9
        mov qword[rbp-16],  r9
        bswap r10
        mov qword[rbp-24],  r10
        bswap r11
        mov qword[rbp-32],  r11
        bswap r12
        mov qword[rbp-40],  r12
        bswap r13
        mov qword[rbp-48],  r13
        bswap r14
        mov qword[rbp-56],  r14
        bswap r15
        mov qword[rbp-64],  r15
    loop .swap_loop

; restore callee-saved registers and the stack frame (epilogue)

    mov r12,    qword[rbp-72]
    mov r13,    qword[rbp-80]
    mov r14,    qword[rbp-88]
    mov r15,    qword[rbp-96]

    pop     rbp
    ret

We’re using rcx as a counter to create a simple loop around our workload. Each slot on the stack is 8 bytes wide, a “standard” quadword. I think it’s easier to only fill them up to DWORD boundaries, since we can find the trailing zeroes more easily when debugging this code.

The [ movbe ] part looks almost identical but obviously uses fewer instructions:

section .text
global movBeBenchmark

movBeBenchmark:
; init a stack frame pointer    (prologue)

    push    rbp
    mov     rbp, rsp

; r12 - r15 are callee-saved in the System V ABI, so stash them in the red zone
    mov qword[rbp-72],  r12
    mov qword[rbp-80],  r13
    mov qword[rbp-88],  r14
    mov qword[rbp-96],  r15

; prepare a stack with ez-to-find numbers
    mov qword[rbp-8],   0x11112222
    mov qword[rbp-16],  0x33334444
    mov qword[rbp-24],  0x55556666
    mov qword[rbp-32],  0x77778888
    mov qword[rbp-40],  0x33445566
    mov qword[rbp-48],  0x12332123
    mov qword[rbp-56],  0x11001100
    mov qword[rbp-64],  0x12345678

; define a counter for testing
    mov rcx,    0x100000000

; worker loop for benchmarking
    .swap_loop:
        movbe r8,   [rbp-8]
        movbe r9,   [rbp-16]
        movbe r10,  [rbp-24]
        movbe r11,  [rbp-32]
        movbe r12,  [rbp-40]
        movbe r13,  [rbp-48]
        movbe r14,  [rbp-56]
        movbe r15,  [rbp-64]

        movbe [rbp-8],  r8
        movbe [rbp-16], r9
        movbe [rbp-24], r10
        movbe [rbp-32], r11
        movbe [rbp-40], r12
        movbe [rbp-48], r13
        movbe [rbp-56], r14
        movbe [rbp-64], r15
    loop .swap_loop

; restore callee-saved registers and the stack frame (epilogue)

    mov r12,    qword[rbp-72]
    mov r13,    qword[rbp-80]
    mov r14,    qword[rbp-88]
    mov r15,    qword[rbp-96]

    pop     rbp
    ret

So here we go, doing the exact same thing again. Is this madness already?

No matter which version is faster, the [ movbe ] one is more elegant in every way: it’s shorter and uses fewer instructions. It’s also somewhat easier to read, but that could just be me.

That’s basically it; now we can go on and compile this into a benchmark.

Testing both functions

You might think to yourself, “well, that wasn’t very complicated” - and you’d be right. The idea with any benchmark (and any test in general) is to produce the smallest possible, yet still runnable, example. Don’t let yourself be confused by Unreal Engine X benchmarks producing stunning graphics and animations - in the end, these are just fancy tests for the engine and more advertisement than real benchmark, although they do their job quite well. If my computer stutters along at 11 fps, the benchmark did what it should: benchmark my ancient hardware.

However, we are more interested in how a specific instruction performs against another one, not in how fast specific hardware can execute our code. Let’s wrap our benchmark into a little script:

#!/bin/sh
nasm -f elf64 movBswapBenchmark.asm -o movBswapBench.o
nasm -f elf64 movBeBenchmark.asm -o movBeBench.o
g++ main.cpp movBeBench.o movBswapBench.o -o movBeBenchmark
./movBeBenchmark

A dedicated Makefile would be overkill for this, and typing the commands by hand would be cumbersome, so a little script is just the right thing to have. Let’s run it:

Success! As we see here, the [ movbe ] instruction is indeed faster than the [ mov ] + [ bswap ] combo.

I admit that the number of iterations is abnormally high and could be reduced to around 100K, although I noticed that the number of rounds we test has some influence on the result. For now, I am satisfied with this result, since it shows that this micro-optimization does indeed save energy and time.

Epilogue

I hope I could convey the general idea of a benchmark here. There might be room for improvement, since our timers and call overhead produce some measurement noise, but I guess this benchmark is good enough to showcase how we can compare instructions against each other.

I hope you liked this article and could take something away from here. Saving time, even small amounts, can save much energy in the long run. Time is really valuable, so don’t waste yours.
