It is often said that a compiler simply converts a high-level language into machine language, but that is only part of the story.
A compiler doesn’t just translate; it also analyzes, optimizes, and organizes code in stages before producing the final machine instructions.

Let’s take a brief look at how a compiler works, using GCC as the example. GCC is one of the most widely used compilers for C and C++, and it also supports Go and other languages.

A compiler toolchain such as GCC turns high-level source code into machine-readable code in four main steps:

Source Code (.c)

↓ (Preprocessor) → Preprocessed Code (.i)

↓ (Compiler) → Assembly Code (.s)

↓ (Assembler) → Object Code (.o)

↓ (Linker) → Executable (a.out)

Preprocessing:

Preprocessing happens before actual compilation. It’s handled by the C Preprocessor (cpp).

👉 What it does:

Expands macros: Replaces #define constants/macros with their actual values.
Includes header files: Replaces #include <...> or "..." with the full contents of those files.
Removes comments: Strips out // and /*...*/ comments.
Handles conditional compilation: Processes directives like #if, #ifdef, #ifndef, #else, #endif to include/exclude parts of code.

👉 Output:
The result is a pure C code file (.i) — no macros, no includes, no comments, just expanded C code.

Example:

Input:

#include <stdio.h>

#define SQUARE(x) ((x) * (x))
#define DEBUG 1

#ifndef PI
    #define PI 3.14159
#endif

int main() {
#if DEBUG
    printf("Debug mode ON\n");
#endif

    int a = 5;
    printf("Square of %d = %d\n", a, SQUARE(a));

#ifdef PI
    printf("PI value = %f\n", PI);
#endif

    return 0;
}

Output:

extern int printf(const char *__restrict __format, ...);
extern int scanf(const char *__restrict __format, ...);
.....
int main() {
    printf("Debug mode ON\n");

    int a = 5;
    printf("Square of %d = %d\n", a, ((a) * (a)));

    printf("PI value = %f\n", 3.14159);

    return 0;
}

🔹 What happened here:

#include <stdio.h>
→ Expanded into the full contents of the standard I/O header file (very large, not shown here).
Macro expansion
- SQUARE(a) → ((a) * (a)).
- PI → 3.14159.
Conditional compilation
- Since DEBUG is defined as 1, the printf("Debug mode ON\n"); line was included.
- #ifdef PI block included because PI was defined.
Comments removed
(None in this example, but they’d all be stripped).

Compiler

Once preprocessing is done, GCC passes the expanded code to the compiler proper (e.g., cc1 for C).
This stage is much more than just “conversion” — it’s where most of the heavy lifting happens.

What happens here:

Lexical Analysis (Tokenizing)
- Breaks the preprocessed code into tokens (keywords, identifiers, operators, literals, etc.).
  Example: int a = 5; → tokens int, a, =, 5, ;
Syntax Analysis (Parsing)
- Builds a Parse Tree / Abstract Syntax Tree (AST) from tokens.
- Ensures code follows C language grammar rules.
  Example: int a = 5; → Tree representing a variable declaration with initialization.
Semantic Analysis
- Checks meaning & correctness.
- Ensures types match, variables are declared, functions have correct arguments, etc.
Intermediate Representation (IR)
- GCC translates code into an internal format (like GIMPLE or RTL) for optimization.
Optimization
- Performs improvements without changing behavior:
  - Constant folding (2+3 → 5)
  - Dead code elimination
  - Loop optimizations
  - Inlining functions
Code Generation (Assembly)
- Finally converts IR → Assembly code for the target CPU architecture

Example (example.c):

int main() {
    int a = 5;
    int b = 10;
    int c = a + b;
    return c;
}

Generated assembly (example.s, simplified x86-64, no optimization):

    .file   "example.c"
    .text
    .globl  main
    .type   main, @function
main:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $5, -4(%rbp)
    movl    $10, -8(%rbp)
    movl    -4(%rbp), %edx
    movl    -8(%rbp), %eax
    addl    %edx, %eax
    popq    %rbp
    ret

Assembler

Once the compiler produces Assembly code (.s file), it’s still just human-readable text. The Assembler (as) takes that and converts it into machine-readable binary instructions, stored in an object file (.o).

What happens here:
1. Assembly Translation
  - The .s file (assembly code) is translated into opcodes (binary machine instructions).
  - Example: movl $5, -4(%rbp) → gets turned into binary like C7 45 FC 05 00 00 00.
2. Object File Creation (.o)
  - The assembler outputs a relocatable object file.
  - Contains:
    - Machine code instructions
    - Symbol table (function names, variable references)
    - Relocation info (addresses not yet resolved — linker will handle this later).
3. Still Incomplete
  - The .o file can’t run yet because external functions (like printf) are still unresolved.
  - Linking (next step) will handle that.

Assembly (example.s, simplified x86-64) for a small function add(int a, int b) that returns a + b:

      .globl add
      .type  add, @function
  add:
      pushq   %rbp
      movq    %rsp, %rbp
      movl    %edi, -4(%rbp)
      movl    %esi, -8(%rbp)
      movl    -4(%rbp), %edx
      movl    -8(%rbp), %eax
      addl    %edx, %eax
      popq    %rbp
      ret

Object file (example.o):

Binary file (not human-readable).
If you inspect it with objdump -d example.o, you’ll see machine instructions like

    0000000000000000 <add>:
       0: 55                      push   %rbp
       1: 48 89 e5                mov    %rsp,%rbp
       4: 89 7d fc                mov    %edi,-0x4(%rbp)
       7: 89 75 f8                mov    %esi,-0x8(%rbp)
       a: 8b 55 fc                mov    -0x4(%rbp),%edx
       d: 8b 45 f8                mov    -0x8(%rbp),%eax
      10: 01 d0                   add    %edx,%eax
      12: 5d                      pop    %rbp
      13: c3                      ret

🔹 Raw Binary (Hex Bytes)

The hex column in the objdump -d listing above holds the raw opcodes. Concatenated as a full sequence:

    55 48 89 e5 89 7d fc 89 75 f8 8b 55 fc 8b 45 f8 01 d0 5d c3

🔹 In Binary (0s and 1s)

Each hex digit = 4 bits. Converting the above to binary:

    01010101
    01001000 10001001 11100101
    10001001 01111101 11111100
    10001001 01110101 11111000
    10001011 01010101 11111100
    10001011 01000101 11111000
    00000001 11010000
    01011101
    11000011


Linker

A linker is a program that takes one or more object files (produced by the compiler) and combines them into a single executable, shared library, or another object file.

Think of it as the stage after compilation that “glues” all pieces of code together.

Role of the Linker in GCC

When you run a command like:

    gcc main.c helper.c -o myprogram

Here’s what happens internally:

Compilation phase
Each .c file is compiled separately into an object file (.o), containing machine code and symbolic references (like function names, global variables).

Example:
```
 gcc -c main.c   # produces main.o
 gcc -c helper.c # produces helper.o
```
Linking phase
The linker ld (called internally by GCC) takes all .o files:
- Resolves symbols: matches function calls and global variables across files.
- Includes library code if needed (-lm for math, etc.).
- Produces the final executable myprogram.

Key Responsibilities of the Linker

Symbol Resolution
- Finds where every function or variable is defined.
- Example: main.o calls helper() → linker finds helper() in helper.o.
Address Assignment
- Assigns memory addresses to code and data sections (.text, .data, .bss).
- Ensures all references point to the correct addresses.
Library Linking
- Links against static libraries (.a) or dynamic/shared libraries (.so).
- Handles dynamic linking if needed at runtime.
Relocation
- Adjusts relative addresses in machine code so everything works when loaded into memory.

Types of Linking

Static Linking
- All code from libraries is copied into the executable.
- Larger file size but no runtime dependencies.
Dynamic Linking
- Code resides in shared libraries (.so).
- Executable is smaller, and libraries can be updated independently.

Viewing Linker Actions

You can see what the linker does using:

    gcc -v main.c helper.c -o myprogram

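Or run ld manually. Note that a bare invocation like the one below leaves out the C runtime startup files and the standard C library that gcc adds automatically, so for an ordinary C program it usually fails with undefined references (the full linker command line that gcc uses is visible in the gcc -v output):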

    ld main.o helper.o -o myprogram

You can also use nm or objdump on object files to inspect symbols before linking:

    nm main.o
    objdump -d helper.o

Now let’s talk about how the linker lets us combine multiple languages in one project to get performance-oriented code.

How the linker can be used to achieve high performance in a single project with multiple codebases:

1. Understand the Role of the Linker

Each language’s compiler produces object files (.o) with machine code.
The linker combines all object files into a single executable or library.
Key requirements:
1. Symbols (function names, global variables) must be visible across languages.
2. Calling conventions must be compatible (how functions pass arguments/return values).

2. General Steps for Multi-Language Projects

Step 1: Choose a “bridge” language

Usually C is the bridge, because almost all languages can call C functions and be called from C.
For example:
- Rust → C → C++
- Assembly → C → Python

Step 2: Write code in multiple languages

Example: C and Assembly

Assembly (add.asm):

global add
section .text
add:
    mov rax, rdi
    add rax, rsi
    ret

C (main.c):

#include <stdio.h>
extern long add(long a, long b); // link to assembly function

int main() {
    printf("5 + 7 = %ld\n", add(5,7));
    return 0;
}

Step 3: Compile each language separately

# Assemble Assembly code
nasm -f elf64 add.asm -o add.o

# Compile C code
gcc -c main.c -o main.o

Step 4: Link all object files

# Link object files into an executable
gcc main.o add.o -o program

The linker resolves the symbol add from main.o with add.o.
Produces a final executable program.

Step 5: Run

./program

Output:

5 + 7 = 12

3. Tips for Multi-Language Linking

Use extern "C" in C++
- C++ mangles function names; extern "C" prevents that so C/C++ can link properly.
Match calling conventions
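- For example, cdecl on 32-bit x86, or the platform default such as the System V AMD64 convention on 64-bit Linux (arguments in rdi, rsi, ..., as in the assembly example above).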
Use libraries if needed
- You can link static libraries (.a) or shared libraries (.so / .dll) across languages.
Keep track of symbols
- Tools like nm or objdump help check available symbols in object files.

4. Example: Three languages in one project

Assembly: low-level math functions
C: glue code
C++: main application and I/O

Steps:

Write Assembly functions → assemble to .o
Write C glue code → compile to .o
Write C++ main → compile to .o
Link all .o files → executable

Below are the advantages we can get from using multiple languages:

1. Performance Optimization

Use low-level languages (like Assembly or C) for performance-critical parts.
Use high-level languages (like C++, Python, or Rust) for easier development elsewhere.
Example: Assembly for math routines, C++ for UI.

2. Reuse Existing Code

You can link in existing libraries written in different languages without rewriting them.
Example: Use a legacy C library in a C++ or Rust project.

3. Flexibility

Choose the best language for each component:
- High-level for productivity and readability.
- Low-level for speed or hardware access.

4. Modularity

Separate functionality into different language modules:
- Easier maintenance
- Independent development and testing

5. Interoperability

Allows multiple teams to work in their preferred languages while still producing a single executable.
Makes it easier to integrate specialized libraries (e.g., graphics, math, AI).

6. Smaller Executables (with dynamic linking)

Using shared libraries across languages can reduce the final executable size and allow library updates without recompiling everything.

7. Access to Language-Specific Features

Leverage unique features of each language:
- Rust → memory safety
- C → low-level control
- Python → rapid prototyping
- Assembly → ultra-optimized routines

Now let’s talk about the ABI and how it plays an important role when using multiple languages:

The Application Binary Interface (ABI) defines the low-level interface between binary program modules, specifying how they interact.

Calling conventions
- How functions receive arguments (registers vs stack)
- How return values are passed
Data types and alignment
- Size and memory layout of structs, arrays, and other data types
Name mangling and symbol decoration
- How function names appear in object files
Register usage and stack cleanup
- Which registers a function must preserve
- Who cleans up the stack after a function call

💡 Think of ABI as the “contract” between compiled modules. If two modules don’t follow the same ABI, they cannot safely call each other.

ABI Problems in Multi-Language Projects

When mixing languages, ABI mismatches are a common source of bugs:

a) Calling convention mismatch

Example (32-bit x86): one module is compiled with cdecl, where the caller cleans up the stack, but another module uses stdcall, where the callee does.
Result: Stack corruption → crashes or incorrect results.

b) Data layout mismatch

Structs may be aligned differently in C and C++ or across compilers.
Example: One module expects 4-byte alignment, another 8-byte → incorrect memory reads.

c) Name mangling issues

C++ compilers mangle names to support overloading.
If C++ code declares a C function without extern "C", the compiler emits a reference to a mangled name and the linker cannot find the symbol.

d) Type size differences

Example: int in C is usually 4 bytes, but the integer type on the other side of the boundary (for instance long on 64-bit Linux) may be 8 bytes.
Passing mismatched types leads to garbled data.

How to Avoid ABI Problems

Use C as the bridge
- C has a simple and stable ABI.
- Declare extern "C" in C++ for interoperability.
Match calling conventions
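- Use calling-convention specifiers such as __cdecl or __stdcall (MSVC keywords), or __attribute__((cdecl)) and __attribute__((stdcall)) in GCC on 32-bit x86, when needed.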
Check data layout
- Use #pragma pack or attributes to control struct alignment if necessary.
Avoid passing complex objects across language boundaries
- Prefer simple types: integers, floats, pointers.
- Complex types like classes, strings, or STL containers can have different ABIs.
Use shared libraries carefully
- ABI must be compatible with the compiler that built the library.

Example of an ABI Problem

C++ code calling a C function. The declaration of add_c below uses extern "C", which is what lets the link succeed:

// C++: main.cpp
#include <iostream>

int add(int a, int b) { return a + b; }  // mangled name

extern "C" int add_c(int a, int b);      // expecting C symbol

int main() {
    std::cout << add_c(5, 7) << std::endl;
}

C: add.c

int add_c(int a, int b) { return a + b; }

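The C++ function add gets a mangled name, which is fine because only C++ code calls it.
Without extern "C" on the declaration of add_c, the C++ compiler would mangle that name as well (for example to _Z5add_cii under the Itanium C++ ABI that GCC uses) and look for the mangled symbol.
The linker cannot find it, because add.c only defines the plain C symbol add_c → undefined reference error at link time.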

While this discussion focused on GCC, the fundamental process applies to most native compilers. They translate high-level code into machine code and rely on a linker in the final stage to produce an executable. Because the linker works at the binary level, it allows us to connect code from different programming languages, even if those languages don’t use GCC. This is why multi-language projects that combine C, C++, Rust, Assembly, and others are possible, as long as the compiled object files follow compatible ABIs and calling conventions.

How a compiler works - GCC Compiler

Fakhrul Siddiqei
