Week 5 - Compiler Optimizations & Intro to 64-Bit Systems


Compiler Optimizations? Sounds complicated and advanced, right? Don't worry, that's exactly why this article exists: to help you and me get a better understanding of compiler optimizations. I'll also briefly go over what 64-bit systems are. However, please note that this article will not go in depth. This article is simply my way of explaining things in plain language to help you understand better. To learn more, please visit Professor Chris Tyler's articles on compiler optimizations and 64-bit systems.
What are compiler optimizations?
In simple terms, these are changes made by a compiler to achieve the same result with better performance (SPO600 Wiki). So what changes does the compiler make, and how does it perform these optimizations? Today, I will discuss two types of optimizations: "Code Rewriting Optimizations" and "Machine Code Optimizations".
What is a “Code Rewriting Optimization”?
The answer lies in the phrase itself: the compiler takes the code you have written in C++ and rewrites it for better performance. Please note that the compiler actually performs these optimizations on an internal representation that looks drastically different from the C++ you wrote, so the examples you see below for each type are simplified illustrations to help you and me understand.
There are a number of Code Rewriting Optimizations which are listed below:
Dead Code Elimination
Strength Reduction
Hoisting - Loop Invariant Variable
Hoisting - Loop Invariant Expression
Hoisting - Loop Invariant Expression in Loop Condition
Pre-Calculation of Constants
Loop Unswitching
Loop Interchange
Loop Unrolling
Inlining
Common Subexpression Elimination
Jump Threading
Short-Circuit Evaluation
Test Reordering
Dead Store Elimination
Strength Reduction
This is a fancy way of saying the compiler converts an expensive operation, like a multiplication, into cheaper ones, such as additions, because multiplication costs more processor cycles than addition.
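Since the original slide isn't reproduced here, here's a minimal C++ sketch of the idea (the function names and values are my own, not the compiler's actual output):

```cpp
// Before: a multiply runs on every iteration.
void scale(int *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] = i * 8;
}

// After strength reduction: a running addition replaces the multiply.
void scale_reduced(int *a, int n) {
    int step = 0;
    for (int i = 0; i < n; i++) {
        a[i] = step;  // same values as i * 8
        step += 8;    // cheaper add instead of a multiply
    }
}
```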
Hoisting 1 - Loop Invariant Variable
This just means that inside a loop there may be a line of code that doesn't actually need to be there. The compiler moves it outside of the loop, because why execute it that many times when once will do?
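Here's a simplified illustration in C++ (my own example of the pattern):

```cpp
// Before: y gets the same value on every pass through the loop.
void fill(int *a, int n, int x) {
    for (int i = 0; i < n; i++) {
        int y = x + 10;  // loop-invariant: never changes inside the loop
        a[i] = y;
    }
}

// After hoisting: y is computed once, outside the loop.
void fill_hoisted(int *a, int n, int x) {
    int y = x + 10;
    for (int i = 0; i < n; i++)
        a[i] = y;
}
```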
Hoisting 2 - Loop Invariant Expression
Like the example above, this is an optimization whereby the compiler finds a calculation inside a loop, creates a variable outside of it, and stores the value there so that the calculation doesn't need to happen on every pass through the loop.
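A rough C++ sketch of what the compiler effectively does (my own illustration):

```cpp
// Before: scale * offset is recalculated on every iteration.
void adjust(double *a, int n, double scale, double offset) {
    for (int i = 0; i < n; i++)
        a[i] += scale * offset;
}

// After: the invariant expression is computed once, up front.
void adjust_hoisted(double *a, int n, double scale, double offset) {
    double delta = scale * offset;
    for (int i = 0; i < n; i++)
        a[i] += delta;
}
```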
Hoisting 3 - Loop Invariant Expression in Loop Condition
This one finds a calculation inside the for loop's condition and performs it outside the loop, so that it's only calculated once rather than on every check of the condition.
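For example (again, my own simplified version):

```cpp
// Before: n * 4 is evaluated every time the loop condition is checked.
void walk(int *a, int n) {
    for (int i = 0; i < n * 4; i++)
        a[i] = i;
}

// After: the limit is computed once, before the loop starts.
void walk_hoisted(int *a, int n) {
    int limit = n * 4;
    for (int i = 0; i < limit; i++)
        a[i] = i;
}
```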
Pre-Calculation of Constants
This one is really simple. The compiler pre-calculates any expression made entirely of constants at compile time and inserts the result directly.
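A tiny C++ sketch of the idea (the values are my own):

```cpp
// What you write: two constants multiplied at runtime... except they
// never are, because the compiler folds them at compile time.
double circumference(double r) {
    return 2.0 * 3.14159 * r;
}

// What the compiler effectively compiles:
double circumference_folded(double r) {
    return 6.28318 * r;  // 2.0 * 3.14159 precalculated
}
```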
Loop Unswitching - Inner/Outer Swap
The term loop unswitching means the compiler will take a condition inside a loop that doesn't change for the entire duration of the loop and move it outside. In the example below, ctl is checked against 0 on every one of 10,000 iterations; that's redundant, so the test is moved outside the loop.
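(The code here is my own reconstruction of the idea, since the original slide isn't reproduced in this post.)

```cpp
// Before: ctl is compared to 0 on all 10000 passes, even though it
// never changes inside the loop.
void clear_if(int *a, int ctl) {
    for (int i = 0; i < 10000; i++)
        if (ctl == 0)
            a[i] = 0;
}

// After unswitching: the if and the for trade places; one test total.
void clear_if_unswitched(int *a, int ctl) {
    if (ctl == 0)
        for (int i = 0; i < 10000; i++)
            a[i] = 0;
}
```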
Loop Unswitching 2 - Inner/Outer Swap with Code Repetition
Sometimes it's faster to unswitch the loop even when that means repeating some of the code in each branch. Look at the example below:
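(As before, this is my own sketch of the pattern.)

```cpp
// Before: only part of the body depends on ctl, but it is tested
// on every pass.
void mix(int *a, int ctl) {
    for (int i = 0; i < 10000; i++) {
        a[i] = i;        // always runs
        if (ctl == 0)
            a[i] += 100; // conditional part
    }
}

// After: the test happens once, at the cost of repeating the
// unconditional code in both branches.
void mix_unswitched(int *a, int ctl) {
    if (ctl == 0) {
        for (int i = 0; i < 10000; i++) {
            a[i] = i;
            a[i] += 100;
        }
    } else {
        for (int i = 0; i < 10000; i++)
            a[i] = i;
    }
}
```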
Loop Interchange
Let's say you have two nested loops. The compiler may swap them so that the inner loop becomes the outer loop and vice versa, typically to make the memory accesses sequential and cache-friendly.
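A common illustration (my own example) uses a 2D array, where swapping the loops turns scattered memory accesses into sequential ones:

```cpp
const int N = 1000;

// Before: the inner loop walks down a column, jumping N ints at a time.
void zero(int m[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            m[i][j] = 0;
}

// After interchange: each row is walked sequentially, which is
// much friendlier to the cache.
void zero_interchanged(int m[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = 0;
}
```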
Loop Unrolling 1 - Guaranteed Multiple Iterations
This is when the compiler reduces the number of times the loop control condition is evaluated. If the loop is guaranteed to run a multiple of, say, two iterations, the compiler can unroll it so that each pass does two iterations' worth of work, meaning the condition is checked only half as often.
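A minimal sketch, assuming a trip count that is known to be even (my own example):

```cpp
// Before: the condition i < 1000 is tested 1000 times.
void inc(int *a) {
    for (int i = 0; i < 1000; i++)
        a[i]++;
}

// After unrolling by 2: two elements per pass, 500 condition tests.
// Safe because 1000 is guaranteed to be a multiple of 2.
void inc_unrolled(int *a) {
    for (int i = 0; i < 1000; i += 2) {
        a[i]++;
        a[i + 1]++;
    }
}
```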
Loop Unrolling 2 - Pairs of Iterations Plus Conditional
So you know how we reduced the number of condition checks by executing multiple iterations at once? This approach still does two iterations per pass, but if the total number of iterations is odd, one extra conditional block after the main loop executes the remaining iteration.
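Here's my own sketch of that shape:

```cpp
// Unrolling by 2 when n might be odd: pairs in the main loop,
// plus one conditional pass for the leftover element.
void inc_n(int *a, int n) {
    int i = 0;
    for (; i + 1 < n; i += 2) {  // two iterations per pass
        a[i]++;
        a[i + 1]++;
    }
    if (i < n)                   // n was odd: one element remains
        a[i]++;
}
```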
Loop Unrolling 3 - Large Number of Iterations
This is when we reduce the number of passes substantially: instead of doing two iterations per pass like the previous example, we might do ten at once, for example.
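For instance (my own example with an unroll factor of 4; the idea scales to 10 or more copies per pass):

```cpp
void inc_big(int *a, int n) {
    int i = 0;
    // Four iterations' worth of work per pass through the loop.
    for (; i + 3 < n; i += 4) {
        a[i]++;
        a[i + 1]++;
        a[i + 2]++;
        a[i + 3]++;
    }
    // Clean-up loop for whatever is left over.
    for (; i < n; i++)
        a[i]++;
}
```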
Inlining
Inlining means the compiler takes the code from a function and pastes it directly into the place where it's called, so there's no function call overhead at all.
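A quick C++ illustration (my own):

```cpp
// Before: every call to square() pays call/return overhead.
int square(int x) { return x * x; }

int sum_squares(int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += square(i);
    return total;
}

// After inlining: the body is pasted at the call site; no call at all.
int sum_squares_inlined(int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i * i;
    return total;
}
```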
Common Subexpression Elimination
Let's say you have an expression in which the same subexpression is evaluated twice. The compiler will compute that value once, store it in a temporary variable, and use that variable to simplify the expression.
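For example (my own sketch):

```cpp
// Before: (b + c) appears, and would be evaluated, twice.
int combine(int b, int c, int d) {
    return (b + c) * d + (b + c);
}

// After CSE: the shared subexpression is computed once and reused.
int combine_cse(int b, int c, int d) {
    int t = b + c;
    return t * d + t;
}
```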
Jump Threading
Jump threading eliminates unnecessary comparisons, for example by turning a redundant second if statement into an else instead.
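A small sketch of the kind of pattern it catches (my own example):

```cpp
// Before: the second condition is just the opposite of the first,
// yet it is still tested.
int pick(int a) {
    int r = 0;
    if (a > 10)
        r = 1;
    if (a <= 10)  // redundant comparison
        r = 2;
    return r;
}

// After jump threading: the second test becomes a plain else.
int pick_threaded(int a) {
    int r;
    if (a > 10)
        r = 1;
    else
        r = 2;
    return r;
}
```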
Short-Circuit Evaluation - AND & OR
When we evaluate a condition inside an if statement that uses &&, we know that if the first part of the expression is false, the entire expression is false. So the compiler generates code that skips the second half of the condition whenever the first half fails.
Likewise for OR, it's the same concept: if the first half is true, we're good to go and there's no need to evaluate the second half.
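In code terms (my own illustration), the compiler effectively turns the && into nested tests:

```cpp
// The source says:  if (x != 0 && n / x > 5) ...
// The generated logic behaves like nested ifs, so the division is
// skipped whenever x != 0 is already false:
int guard(int x, int n) {
    if (x != 0) {
        if (n / x > 5)
            return 1;
    }
    return 0;
}
// For ||, it's mirrored: if the first test is true, the branch is
// taken without evaluating the second test at all.
```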
Test Reordering
The compiler looks at a compound if condition and finds the comparison that's cheapest to evaluate. It reorders the tests so the cheap one runs first; if that test already decides the outcome, the expensive test is skipped entirely, saving time.
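Something like this (my own example; note the compiler can only reorder tests like this when doing so can't change the program's behaviour):

```cpp
#include <string>

// Before: the expensive string search runs first.
bool ready(const std::string &name, int flag) {
    return name.find("ready") != std::string::npos && flag != 0;
}

// After reordering: the cheap integer test goes first; when it's
// false, the costly search is skipped entirely.
bool ready_reordered(const std::string &name, int flag) {
    return flag != 0 && name.find("ready") != std::string::npos;
}
```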
Dead Store Elimination
This is similar to dead code elimination, but it targets stores: if a value is written to a variable and then overwritten before it's ever read, the first write is dead and the compiler simply eliminates it.
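For example (my own sketch):

```cpp
// Before: the first store to x is never read; it's a dead store.
int dead_store() {
    int x = 5;   // overwritten before any use
    x = 42;
    return x;
}

// After: only the store that matters survives.
int dead_store_eliminated() {
    int x = 42;
    return x;
}
```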
Dead Store Elimination 2 - Unused Values
Say you assign a value like 0 to an integer and then reassign it before the 0 is ever read. The compiler won't bother storing the 0; it simply declares the variable and then assigns the real value, like below:
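(My own reconstruction of the pattern:)

```cpp
// Before: a is initialized to 0, but that 0 is never used.
int unused_value(int n) {
    int a = 0;   // wasted work
    a = n * 2;
    return a;
}

// After: the variable is simply declared, then given its real value.
int unused_value_optimized(int n) {
    int a;
    a = n * 2;
    return a;
}
```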
Dead Store Elimination 3 - Special Case
This looks for a pattern whereby a variable is initialized outside an if statement and then reassigned inside it (note: the condition doesn't rely on the variable's value in this case). What the compiler does is create an else branch and move the initial assignment into it, so only one of the two stores ever executes.
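Roughly like this (my own illustration):

```cpp
// Before: when t > 5, the initial store of 0 is immediately
// overwritten (and the condition doesn't depend on a).
int special_case(int t) {
    int a = 0;
    if (t > 5)
        a = 99;
    return a;
}

// After: the 0 moves into an else, so exactly one store runs.
int special_case_optimized(int t) {
    int a;
    if (t > 5)
        a = 99;
    else
        a = 0;
    return a;
}
```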
Machine Code Optimizations
Machine code optimizations are performed at the machine language level. There are two I will go over briefly today:
Block Rearrangement
Instruction Selection
Block Rearrangement
Block rearrangement is when the compiler rearranges your code so that the frequently executed parts are packed together and loaded into the cache for faster access, while rarely executed code is moved out of the way. Refer to the example below:
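The compiler does this on its own using heuristics or profiling data, but as a hedged sketch of the idea, GCC and Clang also let you hint at it with the `__builtin_expect` builtin (the function and values here are my own):

```cpp
// Hint that value < 0 is the rare case. The compiler can then place
// the error path away from the hot instruction stream, keeping the
// frequently executed code packed together in the cache.
int process(int value) {
    if (__builtin_expect(value < 0, 0))
        return -1;       // cold block, moved out of the way
    return value * 2;    // hot path stays compact
}
```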
Instruction Selection
Instruction Selection is a type of optimization whereby the compiler swaps out a slower instruction for a faster one that performs the same task and achieves the same result. (Pretty cool!)
If you look at the example, the initial instruction simply loads 0 into a register (`eax`). The compiler instead emits `xor eax, eax`, which XORs the register with itself; since any value XORed with itself is 0, the register still ends up holding 0. This is faster!
Introduction to 64-Bit Systems
In this section of the article, I will give a brief overview of what I learned during the Introduction to 64-Bit Systems lecture by Professor Chris Tyler.
As you may know, we use two types of servers in this course: one is x86_64 and the other is AArch64. To learn more about these servers, please click here. Note that these two systems have different processor designs; that's essentially what makes them different.
When it comes to the design of processors, there are two main approaches, listed below:
RISC - Reduced Instruction Set Computing
CISC - Complex Instruction Set Computing
RISC - Reduced Instruction Set Computing
This design gives the processor a smaller set of simple, efficient instructions, which tends to make it faster and more power efficient. ARM (AArch64) processors are based on this architecture.
CISC - Complex Instruction Set Computing
This design uses a larger instruction set whose individual instructions can perform more complex operations. x86_64 processors are based on this architecture.
So now that we know a bit more about what RISC and CISC are, let's take a look at how these two designs are implemented at the hardware level by learning about something called the Instruction Set Architecture (ISA).
Instruction Set Architecture (ISA)
"ISA is how the system is engineered from the perspective of the instruction it understands" - Professor Chris Tyler. In simple terms, it defines the instructions the processor can execute from a software perspective. For example, it may have built-in instructions that perform arithmetic and logical operations.
The ISA also defines how many registers there are and how large those registers are. Let's compare the registers in x86_64 and AArch64.
x86_64 Registers
There are a total of 16 general purpose registers. Please see the image below:
AArch64 Registers
AArch64 has 31 general purpose registers plus one special register. R0 to R30 are general purpose, and register 31 serves as either the stack pointer (SP) or the zero register (ZR), depending on the instruction (it has two purposes). The zero register works differently for reads and writes: reading it always gives 0, and writing to it discards the value.
Now let's go over the sizes of the registers.
x86 Register Sizes
Let's take a closer look at the register sizes for x86_64. Here's an image from Professor Chris Tyler's lecture:
AArch64 Register Sizes
Okay, so we've discussed the ISA and how it defines the instructions the processor can perform. Let's dive into another important concept called the Application Binary Interface (ABI).
Application Binary Interface (ABI)
The ABI basically defines how software interacts with the operating system and with hardware at the binary level, so things like how functions are called are part of what the ABI defines. To learn more about this, we need to discuss syscalls (system calls).
System Calls - Syscalls
Let's say you have a function you want to call that reads a file. How does this work? Your program needs to talk to the OS, and for that it makes something we call a syscall. Note that the call's arguments are stored in registers. Refer to the image below for the x86_64 conventions:
This is what happens for your system to set up the call! On entry, the RAX register holds the syscall number, and on exit it holds the return value!
The Preserved column just tells us which registers keep their values while the call is happening; registers marked as preserved are not changed or altered.
So how do the registers work for syscalls on AArch64 processors?
Note that AArch64 is different: R8 holds the syscall number, and R0 to R7 hold the parameters, while R0 holds the return value on exit.
See how different syscalls are on these two architectures? This also shows one reason applications compiled for one architecture won't run on another: the syscall conventions use different registers.
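As a small, hedged example of how this is hidden from everyday code: glibc provides a `syscall()` wrapper (declared in `<unistd.h>`) that loads the syscall number and arguments into the right registers for whichever architecture you compile on:

```cpp
#include <unistd.h>      // syscall()
#include <sys/syscall.h> // SYS_write

int main() {
    const char msg[] = "hello from a raw syscall\n";
    // SYS_write is number 1 on x86_64 but 64 on AArch64; the wrapper
    // places the number in the right register (RAX on x86_64, R8 on
    // AArch64) along with the arguments before trapping into the kernel.
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}
```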
Now we will look at what's inside a built program file and see how programs are stored and executed on systems. For that, we'll look into a different concept called ELF - the Executable and Linkable Format.
Executable and Linkable Format
This is a file format that contains multiple sections holding components like debug information, initial values of variables, build IDs, and so on. It stores everything needed to run a program in one file. To read an ELF file we can use the command `readelf -S outputFile`. If you run this command, you will see the different sections, but to get a nicer list of the section headers use the command `objdump -h outputFile`. Running that command will show you a list similar to this:
The objdump command also has another flag that prints the assembly instructions of the program: `objdump -d outputFile`. Its output should look similar to this:
If you look closely you can see the assembly instructions in the third column. You'll see instructions like mov, call, sub, and push. Note: if you run this command on an AArch64 processor, some of the instructions are named the same but behave differently!
Reflection
This week's topics sounded a bit complicated, but I feel like I now have a good basic understanding of compiler optimizations and 64-bit systems. It was quite intriguing to learn how the compiler rewrites code and does things like loop unrolling and strength reduction. I also had fun learning about RISC and CISC; although the two terms sound complex, they're really simple. Lastly, the section on the ABI taught me how function calls and syscalls interact with the OS and which registers play a role on each type of system, whereas ELF gave me an idea of how a program executable is structured and loaded. Commands like `objdump` and `readelf` allowed me to analyze the files. In conclusion, this topic educated me on how compilers work and the differences in system architecture. I hope to learn more in next week's lectures. And as always, stay tuned for more educational content.
References
The cover image was taken from a research paper by Mohamed Boussaa from https://www.researchgate.net/publication/320223187_Automatic_Non-functional_Testing_and_Tuning_of_Configurable_Generators
All images were sourced directly from Professor Chris Tyler’s presentation. All credit goes to him. To access such content, please visit SPO600 wiki to learn more.