How a compiler works - GCC Compiler


It is common to say that a compiler simply converts a high-level language into machine language, but that is only part of the picture.
A compiler doesn't just translate; it also analyzes, optimizes, and organizes code in stages before producing the final machine instructions.
Let's walk through how a compiler works, using GCC as the example. GCC is one of the most widely used compilers for C and C++, and it also supports Go and other languages.
Going from high-level source code to machine-readable code takes four main steps:
Source Code (.c)
↓ (Preprocessor) → Preprocessed Code (.i)
↓ (Compiler) → Assembly Code (.s)
↓ (Assembler) → Object Code (.o)
↓ (Linker) → Executable (a.out)
Preprocessing:
Preprocessing happens before actual compilation. It’s handled by the C Preprocessor (cpp).
👉 What it does:
- Expands macros: replaces #define constants/macros with their actual values.
- Includes header files: replaces #include <...> or #include "..." with the full contents of those files.
- Removes comments: strips out // and /*...*/ comments.
- Handles conditional compilation: processes directives like #if, #ifdef, #ifndef, #else, #endif to include/exclude parts of code.
👉 Output:
The result is a pure C code file (.i) — no macros, no includes, no comments, just expanded C code.
Example:
Input:
#include <stdio.h>
#define SQUARE(x) ((x) * (x))
#define DEBUG 1
#ifndef PI
#define PI 3.14159
#endif
int main() {
#if DEBUG
    printf("Debug mode ON\n");
#endif
    int a = 5;
    printf("Square of %d = %d\n", a, SQUARE(a));
#ifdef PI
    printf("PI value = %f\n", PI);
#endif
    return 0;
}
Output:
extern int printf(const char *__restrict __format, ...);
extern int scanf(const char *__restrict __format, ...);
.....
int main() {
    printf("Debug mode ON\n");
    int a = 5;
    printf("Square of %d = %d\n", a, ((a) * (a)));
    printf("PI value = %f\n", 3.14159);
    return 0;
}
🔹 What happened here:
- #include <stdio.h> → expanded into the full contents of the standard I/O header file (very large, not shown here).
- Macro expansion: SQUARE(a) → ((a) * (a)), and PI → 3.14159.
- Conditional compilation: since DEBUG is defined as 1, the printf("Debug mode ON\n"); line was included. The #ifdef PI block was included because PI was defined.
- Comments removed: none in this example, but they'd all be stripped.
Compiler
Once preprocessing is done, GCC passes the expanded code to the compiler proper (e.g., cc1 for C).
This stage is much more than just “conversion” — it’s where most of the heavy lifting happens.
What happens here:
Lexical Analysis (Tokenizing)
- Breaks the preprocessed code into tokens (keywords, identifiers, operators, literals, etc.).
Example: int a = 5; → tokens int, a, =, 5, and ;
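To make the tokenizing step concrete, here is a minimal sketch of a token counter. The function name count_tokens is hypothetical and has nothing to do with GCC's real lexer; it assumes only identifiers/keywords, integer literals, and single-character punctuation exist, which is far less than a real C lexer handles.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts tokens in a tiny C-like snippet.
   Recognizes identifiers/keywords, integer literals, and
   single-character operators or punctuation only. */
size_t count_tokens(const char *src) {
    size_t count = 0;
    while (*src != '\0') {
        unsigned char c = (unsigned char)*src;
        if (isspace(c)) {
            src++;                      /* whitespace only separates tokens */
        } else if (isalpha(c) || c == '_') {
            while (isalnum((unsigned char)*src) || *src == '_') {
                src++;
            }
            count++;                    /* identifier or keyword */
        } else if (isdigit(c)) {
            while (isdigit((unsigned char)*src)) {
                src++;
            }
            count++;                    /* integer literal */
        } else {
            src++;
            count++;                    /* operator or punctuation */
        }
    }
    return count;
}
```

Calling count_tokens("int a = 5;") walks over five tokens: int, a, =, 5, and the semicolon.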
Syntax Analysis (Parsing)
Builds a Parse Tree / Abstract Syntax Tree (AST) from tokens.
Ensures code follows C language grammar rules.
Example: int a = 5; → tree representing a variable declaration with initialization.
Semantic Analysis
Checks meaning & correctness.
Ensures types match, variables are declared, functions have correct arguments, etc.
Intermediate Representation (IR)
- GCC translates code into an internal format (like GIMPLE or RTL) for optimization.
Optimization
Performs improvements without changing behavior:
- Constant folding (2+3 → 5)
- Dead code elimination
- Loop optimizations
- Inlining functions
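As a rough illustration of these optimizations, the sketch below shows source patterns that an optimizing compiler is generally able to simplify. The function names are made up for this example, and whether GCC actually applies each transformation depends on the optimization level and version.

```c
/* Constant folding: 2 + 3 can be computed at compile time. */
int folded(void) {
    return 2 + 3;
}

/* Dead code elimination: the branch can never run, so it can be dropped. */
int without_dead_code(int x) {
    if (0) {
        x = x * 100;
    }
    return x + 1;
}

/* Inlining: a small static function is a common inlining candidate. */
static int square(int x) {
    return x * x;
}

int inlined(void) {
    return square(4);
}
```

One way to see the effect is to compare the assembly from gcc -O0 -S and gcc -O2 -S on the same file.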
Code Generation (Assembly)
- Finally converts IR → assembly code for the target CPU architecture.
Example input (example.c):
int main() {
    int a = 5;
    int b = 10;
    int c = a + b;
    return c;
}
Generated assembly (simplified x86-64):
    .file "example.c"
    .text
    .globl main
    .type main, @function
main:
    pushq %rbp
    movq %rsp, %rbp
    movl $5, -4(%rbp)
    movl $10, -8(%rbp)
    movl -4(%rbp), %edx
    movl -8(%rbp), %eax
    addl %edx, %eax
    popq %rbp
    ret
Assembler
Once the compiler produces assembly code (.s file), it's still just human-readable text. The assembler (as) takes that and converts it into machine-readable binary instructions, stored in an object file (.o).
What happens here:
Assembly Translation
- The .s file (assembly code) is translated into opcodes (binary machine instructions).
- Example: movl $5, -4(%rbp) → gets turned into bytes like C7 45 FC 05 00 00 00.
Object File Creation (.o)
- The assembler outputs a relocatable object file.
- Contains:
  - Machine code instructions
  - Symbol table (function names, variable references)
  - Relocation info (addresses not yet resolved; the linker will handle this later).
Still Incomplete
- The .o file can't run yet because external functions (like printf) are still unresolved.
- Linking (next step) will handle that.
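To show how the operands of movl $5, -4(%rbp) are packed into the bytes C7 45 FC 05 00 00 00, here is a small sketch that reads them back out. It assumes exactly this encoding form (opcode byte, ModRM byte, one-byte displacement, four-byte immediate); the names MOVL_BYTES, displacement, and immediate are hypothetical, and this is not a general x86 decoder.

```c
#include <stdint.h>

/* Bytes of: movl $5, -4(%rbp) */
const uint8_t MOVL_BYTES[7] = {0xC7, 0x45, 0xFC, 0x05, 0x00, 0x00, 0x00};

/* The third byte is the 8-bit displacement, interpreted as signed. */
int displacement(const uint8_t *insn) {
    return (int8_t)insn[2];
}

/* The last four bytes are the 32-bit immediate, least significant byte first. */
uint32_t immediate(const uint8_t *insn) {
    return (uint32_t)insn[3]
         | ((uint32_t)insn[4] << 8)
         | ((uint32_t)insn[5] << 16)
         | ((uint32_t)insn[6] << 24);
}
```

For these bytes, the displacement reads back as -4 (0xFC as a signed byte) and the immediate as 5, matching the two operands in the assembly text.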
Assembly (example.s, simplified x86-64):
    .globl add
    .type add, @function
add:
    pushq %rbp
    movq %rsp, %rbp
    movl %edi, -4(%rbp)
    movl %esi, -8(%rbp)
    movl -4(%rbp), %edx
    movl -8(%rbp), %eax
    addl %edx, %eax
    popq %rbp
    ret
Object file (example.o):
Binary file (not human-readable).
If you inspect it with objdump -d example.o, you'll see machine instructions like:
0000000000000000 <add>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 89 7d fc mov %edi,-0x4(%rbp)
7: 89 75 f8 mov %esi,-0x8(%rbp)
a: 8b 55 fc mov -0x4(%rbp),%edx
d: 8b 45 f8 mov -0x8(%rbp),%eax
10: 01 d0 add %edx,%eax
12: 5d pop %rbp
13: c3 ret
🔹 Raw Binary (Hex Bytes)
If you dump the .o with objdump -d, you'll see exactly these opcodes:
55
48 89 e5
89 7d fc
89 75 f8
8b 55 fc
8b 45 f8
01 d0
5d
c3
Concatenated as a full sequence:
55 48 89 e5 89 7d fc 89 75 f8 8b 55 fc 8b 45 f8 01 d0 5d c3
🔹 In Binary (0s and 1s)
Each hex digit = 4 bits. Converting the above to binary:
01010101
01001000 10001001 11100101
10001001 01111101 11111100
10001001 01110101 11111000
10001011 01010101 11111100
10001011 01000101 11111000
00000001 11010000
01011101
11000011
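The hex-to-binary conversion above can be reproduced with a few lines of C. This is a minimal sketch; the helper name byte_to_bits is an assumption made for the example.

```c
/* Writes the 8 bits of `byte` (most significant bit first) into `out`,
   which must have room for 9 chars including the terminator. */
void byte_to_bits(unsigned char byte, char out[9]) {
    for (int i = 0; i < 8; i++) {
        out[i] = (byte & (0x80u >> i)) ? '1' : '0';
    }
    out[8] = '\0';
}
```

Passing 0x55 (the push %rbp opcode) fills the buffer with 01010101, the first line of the listing above.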
Linker
A linker is a program that takes one or more object files (produced by the earlier compile and assemble stages) and combines them into a single executable, shared library, or another object file.
Think of it as the stage after compilation that “glues” all pieces of code together.
Role of the Linker in GCC
When you run a command like:
gcc main.c helper.c -o myprogram
Here’s what happens internally:
Compilation phase
- Each .c file is compiled separately into an object file (.o), containing machine code and symbolic references (like function names, global variables).
- Example:
gcc -c main.c # produces main.o
gcc -c helper.c # produces helper.o
Linking phase
- The linker ld (called internally by GCC) takes all .o files:
  - Resolves symbols: matches function calls and global variables across files.
  - Includes library code if needed (-lm for math, etc.).
  - Produces the final executable myprogram.
Key Responsibilities of the Linker
Symbol Resolution
- Finds where every function or variable is defined.
- Example: main.o calls helper() → linker finds helper() in helper.o.
Address Assignment
- Assigns memory addresses to code and data sections (.text, .data, .bss).
- Ensures all references point to the correct addresses.
Library Linking
- Links against static libraries (.a) or dynamic/shared libraries (.so).
- For shared libraries, records which libraries are needed so the dynamic loader can resolve them at runtime.
Relocation
- Patches the address references in machine code so everything works when loaded into memory.
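To connect the section names to source code, the sketch below shows where different kinds of definitions typically end up. The variable and function names are made up for illustration, and exact placement can vary with the compiler, target, and flags.

```c
int initialized_global = 42;        /* initialized data: typically .data   */
int zeroed_global = 0;              /* zero-initialized: typically .bss    */
const char greeting[] = "hello";    /* read-only data: typically .rodata   */

/* Function bodies are machine code: typically .text */
int read_globals(void) {
    return initialized_global + zeroed_global;
}
```

After gcc -c, running objdump -h on the object file lists the sections and their sizes, which is one way to check where each definition landed.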
Types of Linking
Static Linking
All code from libraries is copied into the executable.
Larger file size but no runtime dependencies.
Dynamic Linking
Code resides in shared libraries (.so).
Executable is smaller, and libraries can be updated independently.
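Viewing Linker Actions
You can see what the linker does using:
gcc -v main.c helper.c -o myprogram
Or run ld manually. Note that a bare invocation like the one below usually fails or produces a non-working program for C code, because the C runtime startup files and libc must also be passed to ld; the gcc -v output shows the full command line GCC actually uses:
ld main.o helper.o -o myprogram
You can also use nm or objdump on object files to inspect symbols before linking:
nm main.o
objdump -d helper.o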
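To give nm something recognizable to show, here is a small sketch with one symbol of each common kind. The function names are hypothetical; the letters in the comments are what nm typically prints for an unoptimized object file, and an optimizing build may inline the static helper so that it disappears from the listing.

```c
#include <stdio.h>

/* Internal linkage: only this file can call it (nm typically lists it as 't'). */
static int private_helper(int x) {
    return x * 2;
}

/* External linkage: other object files can call it (nm typically lists it as 'T'). */
int public_api(int x) {
    return private_helper(x) + 1;
}

/* puts is defined in libc, not here, so nm typically lists it as 'U' (undefined). */
void report(void) {
    puts("report called");
}
```

The 'U' entries are exactly the references the linker still has to resolve against other object files or libraries.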
Now let's look at how the linker lets us combine multiple languages in one project to get performance-oriented code.
How the linker can be used to achieve high performance in a single project with multiple codebases:
1. Understand the Role of the Linker
- Each language's compiler produces object files (.o) with machine code.
- The linker combines all object files into a single executable or library.
Key requirements:
Symbols (function names, global variables) must be visible across languages.
Calling conventions must be compatible (how functions pass arguments/return values).
2. General Steps for Multi-Language Projects
Step 1: Choose a “bridge” language
Usually C is the bridge, because almost all languages can call C functions and be called from C.
For example:
Rust → C → C++
Assembly → C → Python
Step 2: Write code in multiple languages
Example: C and Assembly
Assembly (add.asm):
global add
section .text
add:
    mov rax, rdi
    add rax, rsi
    ret
C (main.c):
#include <stdio.h>
extern long add(long a, long b); // link to assembly function
int main() {
    printf("5 + 7 = %ld\n", add(5, 7));
    return 0;
}
Step 3: Compile each language separately
# Assemble Assembly code
nasm -f elf64 add.asm -o add.o
# Compile C code
gcc -c main.c -o main.o
Step 4: Link all object files
# Link object files into an executable
gcc main.o add.o -o program
- The linker resolves the symbol add referenced in main.o with its definition in add.o.
- Produces a final executable program.
Step 5: Run
./program
Output:
5 + 7 = 12
3. Tips for Multi-Language Linking
Use extern "C" in C++
- C++ mangles function names; extern "C" prevents that so C/C++ can link properly.
Match calling conventions
- For example, cdecl on 32-bit x86, or the platform default (such as the System V AMD64 convention on 64-bit Linux, which the assembly example above relies on).
Use libraries if needed
- You can link static libraries (.a) or shared libraries (.so/.dll) across languages.
Keep track of symbols
- Tools like nm or objdump help check available symbols in object files.
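The extern "C" tip is usually applied in a shared header so that the same declaration works for both C and C++ callers. The sketch below shows that common pattern for the add function from the earlier example; the C definition here is only a stand-in for the assembly version, so the snippet compiles on its own.

```c
/* Header-style declaration usable from both C and C++. */
#ifdef __cplusplus
extern "C" {            /* seen only by a C++ compiler: disables name mangling */
#endif

long add(long a, long b);

#ifdef __cplusplus
}
#endif

/* Stand-in C definition (the earlier example defines add in assembly). */
long add(long a, long b) {
    return a + b;
}
```

A C compiler never defines __cplusplus, so it sees only the plain declaration, while a C++ compiler sees it wrapped in extern "C".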
4. Example: Three languages in one project
Assembly: low-level math functions
C: glue code
C++: main application and I/O
Steps:
- Write Assembly functions → assemble to .o
- Write C glue code → compile to .o
- Write C++ main → compile to .o
- Link all .o files → executable
Here are the advantages of using multiple languages in one project:
1. Performance Optimization
Use low-level languages (like Assembly or C) for performance-critical parts.
Use high-level languages (like C++, Python, or Rust) for easier development elsewhere.
Example: Assembly for math routines, C++ for UI.
2. Reuse Existing Code
- You can link in existing libraries written in different languages without rewriting them.
Example: Use a legacy C library in a C++ or Rust project.
3. Flexibility
Choose the best language for each component:
High-level for productivity and readability.
Low-level for speed or hardware access.
4. Modularity
Separate functionality into different language modules:
Easier maintenance
Independent development and testing
5. Interoperability
Allows multiple teams to work in their preferred languages while still producing a single executable.
Makes it easier to integrate specialized libraries (e.g., graphics, math, AI).
6. Smaller Executables (with dynamic linking)
- Using shared libraries across languages can reduce the final executable size and allow library updates without recompiling everything.
7. Access to Language-Specific Features
Leverage unique features of each language:
Rust → memory safety
C → low-level control
Python → rapid prototyping
Assembly → ultra-optimized routines
Now let's talk about the ABI (Application Binary Interface) and the important role it plays when using multiple languages:
The ABI defines the low-level interface between binary program modules, specifying how they interact. It covers:
Calling conventions
How functions receive arguments (registers vs stack)
How return values are passed
Data types and alignment
- Size and memory layout of structs, arrays, and other data types
Name mangling and symbol decoration
- How function names appear in object files
Register usage and stack cleanup
Which registers a function must preserve
Who cleans up the stack after a function call
💡 Think of ABI as the “contract” between compiled modules. If two modules don’t follow the same ABI, they cannot safely call each other.
ABI Problems in Multi-Language Projects
When mixing languages, ABI mismatches are a common source of bugs:
a) Calling convention mismatch
- Example: C uses cdecl, but another module uses stdcall.
- Result: stack corruption → crashes or incorrect results.
b) Data layout mismatch
- Structs may be aligned differently in C and C++ or across compilers.
- Example: One module expects 4-byte alignment, another 8-byte → incorrect memory reads.
c) Name mangling issues
- C++ compilers mangle names to support overloading.
- If C++ code calls a C function without extern "C", the linker cannot find the symbol.
d) Type size differences
- Example: int in C is typically 4 bytes, but long in another language may be 8 bytes.
- Passing mismatched types leads to garbled data.
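A data layout mismatch is easy to reproduce inside a single C file. In the sketch below, two structs with identical members get different layouts because one is declared under #pragma pack(1); the struct and function names are made up, and the sketch assumes a compiler that supports #pragma pack (GCC, Clang, and MSVC do).

```c
#include <stddef.h>
#include <stdint.h>

/* Natural layout: the compiler may insert padding after `tag`. */
struct NaturalRecord {
    char tag;
    int32_t value;
};

/* Packed layout: #pragma pack(1) asks for no padding between members. */
#pragma pack(push, 1)
struct PackedRecord {
    char tag;
    int32_t value;
};
#pragma pack(pop)

/* Byte offset of `value` in each layout. */
size_t natural_value_offset(void) {
    return offsetof(struct NaturalRecord, value);
}

size_t packed_value_offset(void) {
    return offsetof(struct PackedRecord, value);
}
```

With packing, value sits at offset 1; in the natural layout it is typically at offset 4 on x86-64. If one module writes the struct with one layout and another reads it with the other, value is read from the wrong bytes.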
How to Avoid ABI Problems
Use C as the bridge
- C has a simple and stable ABI.
- Declare extern "C" in C++ for interoperability.
Match calling conventions
- Use calling-convention specifiers like __cdecl or __stdcall (MSVC keywords; GCC offers equivalents such as __attribute__((cdecl)) on 32-bit x86) in C/C++ when needed.
Check data layout
- Use #pragma pack or attributes to control struct alignment if necessary.
Avoid passing complex objects across language boundaries
- Prefer simple types: integers, floats, pointers.
- Complex types like classes, strings, or STL containers can have different ABIs.
Use shared libraries carefully
- ABI must be compatible with the compiler that built the library.
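As a sketch of the "prefer simple types" advice, the function below exposes an operation over an array using only a pointer, a length, and fixed-width integer types, so that callers in other languages do not have to guess at sizes. The name sum_i32 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical boundary function: takes only a pointer and a length,
   and uses fixed-width types so both sides agree on sizes. */
int64_t sum_i32(const int32_t *values, size_t count) {
    int64_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}
```

A caller in Rust, C++, or another language only needs to pass a contiguous buffer of 32-bit integers and its element count, rather than its own container type.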
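Example of an ABI Problem
A C++ program calling a C function; the extern "C" on the declaration is what makes it link:
// C++: main.cpp
#include <iostream>
int add(int a, int b) { return a + b; } // C++ function: its symbol name is mangled
extern "C" int add_c(int a, int b); // tells C++ to look for the unmangled C symbol
int main() {
    std::cout << add_c(5, 7) << std::endl;
}
// C: add.c
int add_c(int a, int b) { return a + b; }
- If extern "C" is removed from the add_c declaration, the C++ compiler mangles the name and looks for a C++ symbol instead.
- The linker cannot match it to the unmangled add_c symbol in add.o → linking error (undefined reference).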
While this discussion focused on GCC, the fundamental process applies to most compilers for natively compiled languages. They translate high-level code into machine code and rely on a linker in the final stage to produce an executable. Because the linker works at the binary level, it allows us to connect code from different programming languages, even if those languages don't use GCC. This is why multi-language projects that combine C, C++, Rust, Assembly, and others are possible, as long as the compiled object files follow compatible ABIs and calling conventions.
Written by

Fakhrul Siddiqei
I am a Software Engineer from Bangladesh, working in mobile development since 2017. I love to learn new technologies and use them in my development. Learning about new technologies and programming techniques and sharing them with others is my hobby.