How a compiler works - GCC Compiler

Fakhrul Siddiqei

We usually learn early on that a compiler simply converts a high-level language into machine language, but that is not the full picture.
A compiler doesn’t just translate; it also analyzes, optimizes, and organizes code in stages before producing the final machine instructions.

Let’s briefly walk through how a compiler works, using GCC as the example. GCC is one of the most commonly used compilers for C, C++, Go, and other languages.

Broadly, GCC turns high-level source code into machine-readable code in four main steps:

Source Code (.c)

↓ (Preprocessor) → Preprocessed Code (.i)

↓ (Compiler) → Assembly Code (.s)

↓ (Assembler) → Object Code (.o)

↓ (Linker) → Executable (a.out)

Preprocessing:

Preprocessing happens before actual compilation. It’s handled by the C Preprocessor (cpp).

👉 What it does:

  • Expands macros: Replaces #define constants/macros with their actual values.

  • Includes header files: Replaces #include <...> or "..." with the full contents of those files.

  • Removes comments: Strips out // and /*...*/ comments.

  • Handles conditional compilation: Processes directives like #if, #ifdef, #ifndef, #else, #endif to include/exclude parts of code.

👉 Output:
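The result is a single expanded C file (.i): macros are replaced, included headers are pasted in, and comments are gone. GCC also leaves linemarker lines (starting with #) that record which file and line each piece came from.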

Example:

Input:

#include <stdio.h>

#define SQUARE(x) ((x) * (x))
#define DEBUG 1

#ifndef PI
    #define PI 3.14159
#endif

int main() {
#if DEBUG
    printf("Debug mode ON\n");
#endif

    int a = 5;
    printf("Square of %d = %d\n", a, SQUARE(a));

#ifdef PI
    printf("PI value = %f\n", PI);
#endif

    return 0;
}

Output:

extern int printf(const char *__restrict __format, ...);
extern int scanf(const char *__restrict __format, ...);
.....
int main() {
    printf("Debug mode ON\n");

    int a = 5;
    printf("Square of %d = %d\n", a, ((a) * (a)));

    printf("PI value = %f\n", 3.14159);

    return 0;
}

🔹 What happened here:

  1. #include <stdio.h>
    → Expanded into the full contents of the standard I/O header file (very large, not shown here).

  2. Macro expansion

    • SQUARE(a) → ((a) * (a)).

    • PI → 3.14159.

  3. Conditional compilation

    • Since DEBUG is defined as 1, the printf("Debug mode ON\n"); line was included.

    • #ifdef PI block included because PI was defined.

  4. Comments removed
    (None in this example, but they’d all be stripped).

Compiler

Once preprocessing is done, GCC passes the expanded code to the compiler proper (e.g., cc1 for C).
This stage is much more than just “conversion” — it’s where most of the heavy lifting happens.

What happens here:

  1. Lexical Analysis (Tokenizing)

    • Breaks the preprocessed code into tokens (keywords, identifiers, operators, literals, etc.).
      Example: int a = 5; → int, a, =, 5, ;
  2. Syntax Analysis (Parsing)

    • Builds a Parse Tree / Abstract Syntax Tree (AST) from tokens.

    • Ensures code follows C language grammar rules.
      Example: int a = 5; → Tree representing a variable declaration with initialization.

  3. Semantic Analysis

    • Checks meaning & correctness.

    • Ensures types match, variables are declared, functions have correct arguments, etc.

  4. Intermediate Representation (IR)

    • GCC translates code into an internal format (like GIMPLE or RTL) for optimization.
  5. Optimization

    • Performs improvements without changing behavior:

      • Constant folding (2+3 → 5)

      • Dead code elimination

      • Loop optimizations

      • Inlining functions

  6. Code Generation (Assembly)

    • Finally converts IR → Assembly code for the target CPU architecture.

Example:

Input (example.c):

int main() {
    int a = 5;
    int b = 10;
    int c = a + b;
    return c;
}

Output (example.s, simplified x86-64):

    .file   "example.c"
    .text
    .globl  main
    .type   main, @function
main:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $5, -4(%rbp)
    movl    $10, -8(%rbp)
    movl    -4(%rbp), %edx
    movl    -8(%rbp), %eax
    addl    %edx, %eax
    popq    %rbp
    ret

Assembler

Once the compiler produces assembly code (.s file), it’s still just human-readable text. The assembler (as) takes that and converts it into machine-readable binary instructions, stored in an object file (.o).


    What happens here:

    1. Assembly Translation

      • The .s file (assembly code) is translated into opcodes (binary machine instructions).

      • Example: movl $5, -4(%rbp) → gets turned into binary like C7 45 FC 05 00 00 00.

    2. Object File Creation (.o)

      • The assembler outputs a relocatable object file.

      • Contains:

        • Machine code instructions

        • Symbol table (function names, variable references)

        • Relocation info (addresses not yet resolved — linker will handle this later).

    3. Still Incomplete

      • The .o file can’t run yet because external functions (like printf) are still unresolved.

      • Linking (next step) will handle that.

Assembly (example.s, simplified x86-64):

          .globl add
          .type  add, @function
      add:
          pushq   %rbp
          movq    %rsp, %rbp
          movl    %edi, -4(%rbp)
          movl    %esi, -8(%rbp)
          movl    -4(%rbp), %edx
          movl    -8(%rbp), %eax
          addl    %edx, %eax
          popq    %rbp
          ret
    

    Object file (example.o):

    • Binary file (not human-readable).

    • If you inspect it with objdump -d example.o, you’ll see machine instructions like

    0000000000000000 <add>:
       0: 55                      push   %rbp
       1: 48 89 e5                mov    %rsp,%rbp
       4: 89 7d fc                mov    %edi,-0x4(%rbp)
       7: 89 75 f8                mov    %esi,-0x8(%rbp)
       a: 8b 55 fc                mov    -0x4(%rbp),%edx
       d: 8b 45 f8                mov    -0x8(%rbp),%eax
      10: 01 d0                   add    %edx,%eax
      12: 5d                      pop    %rbp
      13: c3                      ret

🔹 Raw Binary (Hex Bytes)

If you dump the .o with objdump -d, you’ll see exactly these opcodes:

    55
    48 89 e5
    89 7d fc
    89 75 f8
    8b 55 fc
    8b 45 f8
    01 d0
    5d
    c3

Concatenated as a full sequence:

    55 48 89 e5 89 7d fc 89 75 f8 8b 55 fc 8b 45 f8 01 d0 5d c3

🔹 In Binary (0s and 1s)

Each hex digit = 4 bits. Converting the above to binary:

    01010101
    01001000 10001001 11100101
    10001001 01111101 11111100
    10001001 01110101 11111000
    10001011 01010101 11111100
    10001011 01000101 11111000
    00000001 11010000
    01011101
    11000011

Linker

A linker is a program that takes one or more object files (produced by the compiler and assembler) and combines them into a single executable, shared library, or another object file.

Think of it as the stage after compilation that “glues” all pieces of code together.

Role of the Linker in GCC

When you run a command like:

    gcc main.c helper.c -o myprogram

Here’s what happens internally:

  1. Compilation phase
    Each .c file is compiled separately into an object file (.o), containing machine code and symbolic references (like function names, global variables).

    Example:

     gcc -c main.c   # produces main.o
     gcc -c helper.c # produces helper.o
    
  2. Linking phase
    The linker ld (called internally by GCC) takes all .o files:

    • Resolves symbols: matches function calls and global variables across files.

    • Includes library code if needed (-lm for math, etc.).

    • Produces the final executable myprogram.


Key Responsibilities of the Linker

  1. Symbol Resolution

    • Finds where every function or variable is defined.

    • Example: main.o calls helper() → linker finds helper() in helper.o.

  2. Address Assignment

    • Assigns memory addresses to code and data sections (.text, .data, .bss).

    • Ensures all references point to the correct addresses.

  3. Library Linking

    • Links against static libraries (.a) or dynamic/shared libraries (.so).

    • For shared libraries, records which libraries the executable needs; the dynamic loader then resolves those symbols at runtime.

  4. Relocation

    • Adjusts relative addresses in machine code so everything works when loaded into memory.

Types of Linking

  1. Static Linking

    • All code from libraries is copied into the executable.

    • Larger file size but no runtime dependencies.

  2. Dynamic Linking

    • Code resides in shared libraries (.so).

    • Executable is smaller, and libraries can be updated independently.


Viewing Linker Actions

You can see what the linker does using:

    gcc -v main.c helper.c -o myprogram

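Or run ld manually. For a normal C program this bare form is not enough, because gcc also passes the C runtime startup files and libc to the linker (the gcc -v output shows the full command):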

    ld main.o helper.o -o myprogram

You can also use nm or objdump on object files to inspect symbols before linking:

    nm main.o
    objdump -d helper.o

Now let’s talk about how the linker lets us combine multiple languages in one project to get performance-oriented code.

How the linker can be used to achieve high performance in a single project with multiple codebases:

1. Understand the Role of the Linker

  • Each language’s compiler produces object files (.o) with machine code.

  • The linker combines all object files into a single executable or library.

  • Key requirements:

    1. Symbols (function names, global variables) must be visible across languages.

    2. Calling conventions must be compatible (how functions pass arguments/return values).


2. General Steps for Multi-Language Projects

Step 1: Choose a “bridge” language

  • Usually C is the bridge, because almost all languages can call C functions and be called from C.

  • For example:

    • Rust → C → C++

    • Assembly → C → Python


Step 2: Write code in multiple languages

Example: C and Assembly

Assembly (add.asm):

global add
section .text
add:
    mov rax, rdi
    add rax, rsi
    ret

C (main.c):

#include <stdio.h>
extern long add(long a, long b); // link to assembly function

int main() {
    printf("5 + 7 = %ld\n", add(5,7));
    return 0;
}

Step 3: Compile each language separately

# Assemble Assembly code
nasm -f elf64 add.asm -o add.o

# Compile C code
gcc -c main.c -o main.o

Step 4: Link the object files into an executable

gcc main.o add.o -o program

  • The linker resolves the symbol add referenced in main.o against its definition in add.o.

  • Produces a final executable program.


Step 5: Run

./program

Output:

5 + 7 = 12

3. Tips for Multi-Language Linking

  1. Use extern "C" in C++

    • C++ mangles function names; extern "C" prevents that so C/C++ can link properly.
  2. Match calling conventions

    • For example, cdecl on 32-bit x86, or the platform default such as the System V AMD64 convention on 64-bit Linux.
  3. Use libraries if needed

    • You can link static libraries (.a) or shared libraries (.so / .dll) across languages.
  4. Keep track of symbols

    • Tools like nm or objdump help check available symbols in object files.

4. Example: Three languages in one project

  • Assembly: low-level math functions

  • C: glue code

  • C++: main application and I/O

Steps:

  1. Write Assembly functions → assemble to .o

  2. Write C glue code → compile to .o

  3. Write C++ main → compile to .o

  4. Link all .o files → executable

Below are the advantages of using multiple languages:

1. Performance Optimization

  • Use low-level languages (like Assembly or C) for performance-critical parts.

  • Use high-level languages (like C++, Python, or Rust) for easier development elsewhere.
    Example: Assembly for math routines, C++ for UI.

2. Reuse Existing Code

  • You can link in existing libraries written in different languages without rewriting them.
    Example: Use a legacy C library in a C++ or Rust project.

3. Flexibility

  • Choose the best language for each component:

    • High-level for productivity and readability.

    • Low-level for speed or hardware access.

4. Modularity

  • Separate functionality into different language modules:

    • Easier maintenance

    • Independent development and testing

5. Interoperability

  • Allows multiple teams to work in their preferred languages while still producing a single executable.

  • Makes it easier to integrate specialized libraries (e.g., graphics, math, AI).


6. Smaller Executables (with dynamic linking)

  • Using shared libraries across languages can reduce the final executable size and allow library updates without recompiling everything.

7. Access to Language-Specific Features

  • Leverage unique features of each language:

    • Rust → memory safety

    • C → low-level control

    • Python → rapid prototyping

    • Assembly → ultra-optimized routines

Now let’s talk about the ABI and the important role it plays when using multiple languages:

The ABI (Application Binary Interface) defines the low-level interface between binary program modules, specifying how they interact. It covers:

  1. Calling conventions

    • How functions receive arguments (registers vs stack)

    • How return values are passed

  2. Data types and alignment

    • Size and memory layout of structs, arrays, and other data types
  3. Name mangling and symbol decoration

    • How function names appear in object files
  4. Register usage and stack cleanup

    • Which registers a function must preserve

    • Who cleans up the stack after a function call

💡 Think of ABI as the “contract” between compiled modules. If two modules don’t follow the same ABI, they cannot safely call each other.


ABI Problems in Multi-Language Projects

When mixing languages, ABI mismatches are a common source of bugs:

a) Calling convention mismatch

  • Example: on 32-bit x86, C code uses cdecl (the caller cleans up the stack), but another module uses stdcall (the callee cleans it up).

  • Result: Stack corruption → crashes or incorrect results.

b) Data layout mismatch

  • Structs may be aligned differently in C and C++ or across compilers.

  • Example: One module expects 4-byte alignment, another 8-byte → incorrect memory reads.

c) Name mangling issues

  • C++ compilers mangle names to support overloading.

  • If C++ code calls a C function without extern "C", the linker cannot find the symbol.

d) Type size differences

  • Example: int is typically 4 bytes in C, but long is 8 bytes on 64-bit Linux and 4 bytes on 64-bit Windows, and other languages define their own integer sizes.

  • Passing mismatched types leads to garbled data.


How to Avoid ABI Problems

  1. Use C as the bridge

    • C has a simple and stable ABI.

    • Declare extern "C" in C++ for interoperability.

  2. Match calling conventions

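    • Use calling-convention specifiers such as __cdecl or __stdcall in C/C++ when needed. These are keywords written in the declaration, not compiler flags; GCC offers __attribute__((cdecl)) and __attribute__((stdcall)) on 32-bit x86.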
  3. Check data layout

    • Use #pragma pack or attributes to control struct alignment if necessary.
  4. Avoid passing complex objects across language boundaries

    • Prefer simple types: integers, floats, pointers.

    • Complex types like classes, strings, or STL containers can have different ABIs.

  5. Use shared libraries carefully

    • ABI must be compatible with the compiler that built the library.

Example of an ABI Problem

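C++ code calling a C function without extern "C":

// C++: main.cpp
#include <iostream>

int add(int a, int b) { return a + b; }  // C++ function: emitted under a mangled name (e.g. _Z3addii with GCC)

int add_c(int a, int b);                 // BUG: missing extern "C", so a mangled name is expected here too
// Fix: extern "C" int add_c(int a, int b);

int main() {
    std::cout << add_c(5, 7) << std::endl;
}

C: add.c

int add_c(int a, int b) { return a + b; }  // compiled as C, so the symbol is plain add_c

  • Without extern "C", the C++ compiler refers to add_c by a mangled name.

  • add.o only defines the unmangled symbol add_c, so the linker cannot find a match → linking error.

  • Declaring it as extern "C" int add_c(int a, int b); in main.cpp makes both sides use the same symbol name.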

While this discussion focused on GCC, the fundamental process applies to most compilers. All compilers translate high-level code into machine code and rely on a linker in the final stage to produce an executable. Because the linker works at the binary level, it allows us to connect code from different programming languages, even if those languages don’t use GCC. This is why multi-language projects—combining C, C++, Rust, Assembly, and others—are possible, as long as the compiled object files follow compatible ABIs and calling conventions.
