How Git Works: Git Internals Explained - Blobs, Commits and Branches

If you're already familiar with Git basics, feel free to skip to the next chapter where we dive into how Git really works under the hood.

Problem: You start a file (e.g. thesis_report.doc). You save a "snapshot" of Version 1, then make changes to it. But now you need to experiment with two diverging versions (e.g., adding new results vs. refining grammar after a proofreading session) starting from the same snapshot. Chaos ensues!

Solutions:

Rely on Copy/Paste and a naming convention that only God and you know. thesis_report_FINAL.doc → thesis_report_FINAL_gram.doc → thesis_report_FINAL_gram_methodA.doc, etc. Sound familiar? (Yes, your Final Year Project Report Trauma ^-^)
If there are multiple files, you can be creative and use subfolders or symbolic links.

Over time, this becomes a nightmare to manage, wastes disk space, and makes collaboration with others painful. Happily, Git solves this efficiently and elegantly.

Book definition:

Git is a Version Control System (VCS) used to track changes in a working directory over time. It offers ease of collaboration + history transparency (WHAT changed, WHEN, by WHOM).

The 4 stages of Git:

Working Directory: where we create, edit, and delete files freely.
We run git init in a folder to tell Git that this will be our working directory.
Staging Area: a preparation zone, where we group files with finalized changes so that they are saved together in the same snapshot.
Once you’ve worked freely in your working directory and feel comfortable with a change, you can stage it with git add documentation.doc other-script.sh, or stage all the changed files at once with git add .
Local Repository: where we save our snapshots of the working directory locally. We can move these snapshots around as needed, but they remain in our private vault and haven’t been shared with other collaborators.
When you’re done with your change objective, and all the related files are staged, you can save a snapshot by committing them to your local repo via: git commit -m "Fix script and update documentation".
Remote Repository: a shared vault (e.g. GitHub, GitLab, Bitbucket). It stores copies of everyone's snapshots, so others can access them, collaborate on them, or keep them as a backup.
Now that you have made several modifications in different places in your working directory, feel free to share them with others or save them for later review. First, link your local repo to a remote one with git remote add origin [remote-repo-url], then send your snapshots with git push --set-upstream origin main. The --set-upstream flag is only needed the first time; afterwards, a plain git push is sufficient. 'main' is a branch name; we will discuss this concept in detail later.
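To make the four stages concrete, here is a minimal sketch you can run in a scratch folder. It assumes a local bare repository standing in for GitHub/GitLab; the paths, the file name, and the "Demo" identity are placeholders invented for the illustration.

```shell
work=$(mktemp -d)                          # throwaway folder for the demo
git init -q --bare "$work/remote.git"      # hypothetical stand-in for a hosted remote
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/main      # make sure the first branch is called "main"
git config user.name "Demo"; git config user.email "demo@example.com"

echo "draft" > documentation.doc           # 1) working directory: edit freely
git add documentation.doc                  # 2) staging area: prepare the snapshot
git commit -q -m "Add documentation"       # 3) local repository: save the snapshot
git remote add origin "$work/remote.git"
git push -q --set-upstream origin main     # 4) remote repository: share it

local_head=$(git rev-parse HEAD)
remote_head=$(git --git-dir="$work/remote.git" rev-parse refs/heads/main)
echo "local:  $local_head"
echo "remote: $remote_head"
```

After the push, both repositories should report the same commit hash for main, which is what "sharing a snapshot" boils down to.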

You can think of Git as a distributed transactional environment where we can COMMIT and ROLLBACK our working directory.

We can save working directory snapshots via Commits
Commits can still be rewritten before a push (on the local repository, as long as they haven’t been shared or pushed to a remote repository).
Commits should be treated as immutable after a push (on the remote repository), since others may already depend on them.
Commits can be ROLLED BACK via reset (when local) or revert (when pushed to a remote repository).
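The difference between the two rollback styles is easier to see in a scratch repository. The sketch below is only an illustration (the file name and commit messages are made up): reset moves the branch back so the commit drops out of the history, while revert adds a new commit that undoes an earlier one.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
echo "v1" > notes.txt && git add notes.txt && git commit -q -m "v1"
echo "v2" > notes.txt && git commit -q -am "v2"

# Local-only history: reset moves the branch back one commit.
git reset -q --hard HEAD~1
after_reset=$(git rev-list --count HEAD)      # only "v1" is left in the history

# Shared history: revert keeps everything and appends an "undo" commit.
echo "v3" > notes.txt && git commit -q -am "v3"
git revert --no-edit HEAD > /dev/null
after_revert=$(git rev-list --count HEAD)     # v1, v3, and the revert of v3
content=$(cat notes.txt)                      # file content is back to "v1"
echo "$after_reset $after_revert $content"
```

That is why revert is the safer choice once a commit has been pushed: nobody else's history is rewritten.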

Figure: The 4 stages of Git (source: ByteByteGo)

Now that we’ve seen how Git behaves from the outside, let’s pop the hood and see what’s happening underneath.

Git ≈ Content-Addressable Store + diff & patch

Git is one of the most powerful tools in a developer’s toolkit — and yet, it’s also one of the most misunderstood. Many people use Git daily without fully grasping how it works. That’s not a criticism; it’s a call to curiosity.

This chapter is about unlearning the myths, clarifying common misconceptions, and preparing your mental model for what’s to come.

Behind the scenes, Git is a “content-addressable file system”. Yes, a file system, with its own logic and mechanisms.

Myths vs Reality

1) Commit: a set of changes

Most people (my past self included) thought: “A commit is a set of changes on different files”. But here’s the twist:

Reality: a commit is a snapshot.

It’s not merely a delta of changes, but a full snapshot of the entire project’s state.

Figure: Commit myths

2) Staging Area: the magical place

When we execute the git add command, it moves files to a special area!

Reality: git add writes the file's content as a blob into the object database and updates the index (.git/index) with that blob's hash. We will discuss that in detail later on.

3) Branch: independent sandbox

Most people think of a branch as an isolated place from the source branch, with its own copy of the files, so that they can work on their features freely/carelessly.

Reality: a branch is a movable pointer to a commit; it advances from one commit to the next as we work.

So, if we create 3 branches from the same branch, each one starts as a pointer to the same commit: the tip of the source branch (e.g. main). Uncommitted changes are branch-agnostic, since they live only in the working directory, which is shared by all branches. As proof, Git carries your uncommitted changes along when you switch branches, and refuses to switch only when doing so would overwrite them.

# On the feature branch, where the committed file.txt differs from main's version
echo "a change occurred" > file.txt  # Uncommitted change to a tracked file
# Switch to the main branch
git switch main
# Output (abridged):
# error: Your local changes to the following files would be overwritten by checkout:
#         file.txt
# Please commit your changes or stash them before you switch branches.

Git stores blobs — not diffs

In Git’s object database (a key-value store .git/objects), the smallest object is the blob — the raw content of a file. When you commit, Git doesn't save a list of changes like file X changed lines 12–16. Instead, it saves the entire content of every changed file as a new blob.

Every Git object (blob, tree, commit) is stored under the SHA-1 hash (20 bytes) of its content, prefixed with a short header giving the object's type and size. This is what makes Git a content-addressable store, and it means Git deduplicates by design.

So, if two files have the same content (e.g. README.md and COPY-README.md) Git stores one and only one blob.

same content = same hash => stored once
content changed = new hash => stored separately
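You can check this yourself with git hash-object, which computes (and, with -w, stores) a blob. The following is a small sketch in a throwaway repository; the file contents are made up for the example.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q

echo "hello world" > README.md
cp README.md COPY-README.md
h1=$(git hash-object -w README.md)        # write the blob, print its hash
h2=$(git hash-object -w COPY-README.md)   # same content, so same hash, nothing new stored
echo "$h1"
echo "$h2"
git cat-file -t "$h1"                     # object type: blob
git cat-file -p "$h1"                     # object content: hello world

echo "hello world!" > README.md           # change the content
h3=$(git hash-object -w README.md)        # new hash, stored as a separate blob
count=$(git count-objects | cut -d' ' -f1)
echo "loose objects in the store: $count"
```

Three writes, but only two objects end up in .git/objects: the two identical files share one blob.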

We explained that a blob stores a file’s compressed content and is uniquely identified by its hash. The remaining three fundamental objects in Git’s data model are the tree, the commit, and the tag.

1) Tree

A tree records the structure of our project: it maps filenames to blobs and links to other trees (the equivalent of files & directories in a standard file system).

If you rename README.md to DOCS.md, Git doesn't change the blob. Only the tree changes: a new tree links the new name to the existing blob.

my-project/
├── README.md
└── src/
    └── app.js

In Git’s world:

Blobs:
- A blob stores the raw content of README.md.
- A blob stores the raw content of app.js.
Trees:
- A root tree (for my-project/) points to:
  - The README.md blob (assigned the name "README.md").
  - A sub-tree for the src/ directory.
- The sub-tree (for src/) points to:
  - The app.js blob (assigned the name "app.js").
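The layout above can be inspected with git ls-tree. Here is a minimal sketch that rebuilds my-project in a temporary folder (the file contents and the "Demo" identity are assumptions for the demo) and then performs the rename to show the blob staying put.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
mkdir src
echo "# My project" > README.md
echo "console.log('hi');" > src/app.js
git add . && git commit -q -m "Initial snapshot"

git ls-tree HEAD           # root tree: a blob named README.md and a tree named src
git ls-tree HEAD src/      # sub-tree: a blob named app.js
src_type=$(git cat-file -t HEAD:src)

before=$(git rev-parse HEAD:README.md)    # blob hash before the rename
git mv README.md DOCS.md && git commit -q -m "Rename README.md to DOCS.md"
after=$(git rev-parse HEAD:DOCS.md)       # blob hash after the rename
echo "$before"
echo "$after"
```

The two hashes printed at the end should match: the rename produced a new tree, not a new blob.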

2) Commit

It represents a snapshot of our project's entire structure and content at a specific moment in time, which records:

What the project looks like: the current root tree object.
Metadata:
- Who changed it: the author.
- Why it changed: the message.
- When the change occurred: date and time.
- Where it fits in the history: the parent/base commit(s). A commit usually has exactly one parent (or none if it's the initial commit), but it can have 2 or more, as we will see later.

So when we commit, we're saying:

Here's WHAT the project looks like NOW. I changed it for this REASON, and it came from this PREVIOUS snapshot(s).

All that information is stored in the object database under the commit's unique hash. Afterwards, we can use that hash to get the state of the project at any point in the timeline with git checkout <COMMIT_ID>. It is also the mechanism behind the history shown by git log, because commits, over time, form a graph via their parent links.
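A commit object is small enough to read in full. The sketch below (scratch repository, invented file and messages) prints one with git cat-file -p and pulls out the fields discussed above.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
echo "one" > file.txt && git add file.txt && git commit -q -m "First"
echo "two" > file.txt && git commit -q -am "Second"

git cat-file -p HEAD       # tree, parent, author, committer, blank line, message

first_field=$(git cat-file -p HEAD | head -n 1 | cut -d' ' -f1)   # "tree"
parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')          # hash of "First"
first=$(git rev-parse HEAD~1)
root_parents=$(git cat-file -p HEAD~1 | grep -c '^parent ')      # initial commit: 0
```

Note that the commit holds a single tree hash, not a list of changed files: the "what changed" view is derived later by comparing that tree with the parent's tree.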

Branches

Commit hashes (like ef4abe7...) are inconvenient to remember, which makes it hard to track and switch between commits; that's why Git introduced the concept of branches.

A branch is not a Tree structure (like the UI tools fooled us for a long time); it's not even a Git object; it's just a pointer to a commit hash.

Simply, it's a human-friendly label (e.g. my-feature or bug-fix) that we can stick on a commit hash. It's also mutable, so we can reassign it to point to another commit at any time. Hence, each time we commit, Git moves the current branch to that new commit.

Technically, branches are small text files stored under .git/refs/heads, each containing the hash of the commit it points to.

This way, when we need to get a given state, we can check it out by its branch name instead of the commit hash, e.g. git checkout my-feature-branch (or the main branch we've seen earlier).
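Here is a small sketch of a branch behaving like a sticky label. It assumes a freshly created repository where refs are still stored as loose files (Git can also pack refs or use another ref storage format, in which case the cat line simply prints a fallback message); the branch and file names are illustrative.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main      # make sure the first branch is called "main"
git config user.name "Demo"; git config user.email "demo@example.com"
echo "one" > file.txt && git add . && git commit -q -m "First"

git branch my-feature                      # creating a branch copies no files
start=$(git rev-parse my-feature)          # it just points at the current commit
cat .git/refs/heads/my-feature 2>/dev/null || echo "(ref not stored as a loose file here)"

git switch -q my-feature
echo "two" >> file.txt && git commit -q -am "Work on the feature"
moved=$(git rev-parse my-feature)          # the label moved to the new commit
main_tip=$(git rev-parse main)             # main did not move
echo "$start -> $moved (main still at $main_tip)"
```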

3) Tag

Branches are temporary bookmarks that mark progress and are expected to move. Tags, on the other hand, are meant to stay!
A tag is a permanent mark on a specific commit; by convention it is never moved, and it marks a milestone in the history of a project. Furthermore, a tag is stored independently as a Git object (*) with its own computed hash.
The SHA-1 hash of a tag is computed from its content:

The hash of the tagged commit.
Metadata (like a commit): tagger identity, date & time, and a message.
An optional GPG signature (GNU Privacy Guard).

This type of tag is called an Annotated Tag. We can create it with git tag -a v1.0.0 -m "Official Release v1.0.0", and we can sign it by passing -s instead of -a. This tag type is like a sealed envelope of metadata that is stapled to the commit. Usually, annotated tags are used for version releases, public milestones, or any scenario requiring auditability.

(*) The other type of tag is the Lightweight Tag; it's not a Git object, just a reference like a branch, stored under .git/refs/tags/. You can imagine it as a fluorescent marker on a commit, commonly used for local and transient needs. We can create a lightweight tag via git tag candidate-beta.

| Feature | Lightweight Tag | Annotated Tag |
| --- | --- | --- |
| Creation Command | `git tag <tagname>` | `git tag -a <tagname> -m "message"` |
| Stored Data | Only a pointer to a commit. | Full Git object with metadata + commit SHA. |
| Use Case | Temporary or local tags. | Permanent tags (e.g., releases, versions). |
| Metadata | No author, date, or message. | Includes tagger name, email, date, and message. |
| GPG Signing | Cannot be signed. | Can be signed with `-s` instead of `-a`. |
| Git Object Type | Reference (like branches). | Full Git object (stored in `.git/objects`). |
| When to Use | Quick, throwaway references. | Public releases, historical milestones. |

If you need to update a tag, you must delete and recreate it (though this is discouraged for public tags).

By default, git push does not push tags (neither lightweight nor annotated). You must explicitly push tags using: git push --tags or push just a specific one git push origin <TAG>.
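The difference between the two tag types shows up as soon as you ask Git what each name resolves to. This is a minimal sketch in a scratch repository, reusing the tag names from the text (the file and the "Demo" identity are placeholders).

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
echo "release me" > app.txt && git add . && git commit -q -m "Ready for release"

git tag candidate-beta                            # lightweight: just a reference
git tag -a v1.0.0 -m "Official Release v1.0.0"    # annotated: a real Git object

light_type=$(git cat-file -t candidate-beta)      # commit (the name resolves straight to it)
annot_type=$(git cat-file -t v1.0.0)              # tag (its own object, with its own hash)
git cat-file -p v1.0.0                            # object, type, tag, tagger, then the message

tag_hash=$(git rev-parse v1.0.0)                  # hash of the tag object
target=$(git rev-parse 'v1.0.0^{commit}')         # hash of the commit it is stapled to
echo "$light_type / $annot_type"
```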

4) Staging Area

We know that before committing a snapshot, we must stage the affected files via git add. Let's see what's happening down there.

What it is: The .git/index (also known as "staging area" or "cache") is a binary file within our .git directory. It's a crucial intermediary between the working directory and the local repository.
What it stores: The index doesn't store the actual content of the files. Instead, it stores metadata about the files that are currently staged for the next commit. This metadata includes:
- File path (relative to repo root)
- File permissions
- File size
- Timestamps (including ctime and mtime from the filesystem)
- The file's inode number
- The SHA-1 hash of the file's content, which doubles as the pointer to the "blob" object in Git's object database (.git/objects) that holds the staged content.

Let's add two files, COPY-README.md (a second copy of README.md) and utils.js inside the src folder, stage them, and see what happens before the commit.

Figure: Staging Area

The git ls-files --stage command displays all files currently tracked in Git's index, not just the newly added ones. This is because the .git/index file stores a full snapshot of the project’s staging area, representing the exact state that will be committed. So even though only COPY-README.md and utils.js were recently staged, the command lists all files in the index — including unchanged ones like README.md and app.js — as part of the complete picture that Git maintains for the next commit.

Additionally, notice that the SHA-1 hash for COPY-README.md is identical to that of README.md. This indicates that both files have the same content. Git knows they are duplicates at the content level, even though they have different filenames or paths, so only one blob is stored in the objects database.

The git diff --staged --raw command shows a low-level summary of changes that have been added to the staging area.

Mode change: :000000 100644 (old mode, then new mode; 000000 means the file did not exist before)
SHA-1 change: 0000000 134dfe3 (old blob SHA-1, then new blob SHA-1, abbreviated)
Status code: A = added, M = modified, D = deleted, R = renamed, C = copied
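To reproduce the experiment, here is a rough sketch. It assumes COPY-README.md sits next to README.md at the repository root and utils.js goes under src/ (the exact placement is a guess, and the file contents are invented).

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
mkdir src
echo "# My project" > README.md
echo "console.log('hi');" > src/app.js
git add . && git commit -q -m "Initial snapshot"

cp README.md COPY-README.md            # same content as README.md
echo "export {};" > src/utils.js
git add COPY-README.md src/utils.js

git ls-files --stage                   # every tracked file, not only the two new ones
git diff --staged --raw                # only what changed in the index: two additions

readme_hash=$(git ls-files --stage README.md | cut -d' ' -f2)
copy_hash=$(git ls-files --stage COPY-README.md | cut -d' ' -f2)
entries=$(git ls-files --stage | wc -l)
added=$(git diff --staged --name-status | grep -c '^A')
```

The index lists four entries, two of which share one blob hash, while the staged diff reports just the two new paths.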

diff & patch mystery!

In Git, the concepts of "diff" & "patch" are central to how the version control mechanism works. When we run git diff, Git retrieves two snapshots (trees), compares them, and displays the differences between them. So, in the Git ecosystem, a diff is COMPUTED, not STORED. The output of that computation is what's called a patch.

A patch is a textual representation of changes (diffs) between two states of a file. Think of it as a set of instructions for modifying a file from one state A → B:

Add line 5 with this content "console.log('An Update');"
Modify line 7 to say X instead of Y
Delete lines [10,12]
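As a rough illustration of "computed, then applied", the sketch below asks Git to compute a patch between two snapshots, rewinds the file to its old content, and replays the patch with git apply. The file name and contents are made up for the demo.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Demo"; git config user.email "demo@example.com"
printf 'line 1\nline 2\n' > app.js
git add app.js && git commit -q -m "v1"
printf 'line 1\nline 2 changed\nline 3\n' > app.js
git commit -q -am "v2"

patch_file=$(mktemp)
git diff HEAD~1 HEAD > "$patch_file"   # computed on demand from two snapshots
cat "$patch_file"                      # the instructions to go from v1 to v2

git checkout -q HEAD~1 -- app.js       # put the v1 content back in place
git apply "$patch_file"                # replay the patch on the working directory
result=$(cat app.js)
expected=$(git show HEAD:app.js)       # the v2 content stored in the snapshot
```

After git apply, the working copy of app.js matches the v2 snapshot again, even though the patch itself was never stored in the repository.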

1) Merge & Rebase

Git operations like merge & rebase rely, conceptually, on generating and applying patches.

Other common Git routines that use patches:

git apply: we're applying a patch directly to our working directory.
git cherry-pick <COMMIT_ID>: it simply generates a patch from one commit and tries to apply it on top of another.
git stash: temporarily shelves our uncommitted changes, so we can bring them back and re-apply them later on.

A) What is a Merge?

When we do a git merge Git combines changes from two commits by:

Finding the common ancestor commit (the point where the two snapshots diverged).
Generating two patches:
- A diff between the ancestor and the source commit (the commit that asked to be merged e.g. feature).
- A diff between the ancestor and the target commit (the recent commit of the hosting timeline e.g. main).
Applying both patches to the ancestor:
- If the patches don't conflict (i.e. they modify different parts of the tree/blob), Git automatically combines them into a new merge commit, which will have two parents in its metadata, in this order: [target_SHA-1, source_SHA-1]
- If patches overlap (conflict), Git pauses and asks us to resolve those conflicts manually.

main:    A -- B -- C -- D -- (M) (merge commit, combines F & D patches)  
                   | (common ancestor)      
feature: A -- B -- C -- E -- F

NB: A merge commit can have more than 2 parents. Yes, Git supports merging 3+ branches at once with git merge branch1 branch2 branch3; this is called an octopus merge. There is a funny story about Linux kernel commit 2cde51fbd0f3.
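The merge scenario from the diagram can be replayed in a scratch repository. This is a simplified sketch (one commit per side, invented file names) that shows the common ancestor and the two parents recorded in the merge commit.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo"; git config user.email "demo@example.com"
echo "base" > base.txt && git add . && git commit -q -m "C"
c=$(git rev-parse HEAD)

git switch -q -c feature
echo "feature work" > feature.txt && git add . && git commit -q -m "F"
git switch -q main
echo "main work" > main.txt && git add . && git commit -q -m "D"

ancestor=$(git merge-base main feature)        # commit C, where the two lines diverged
d=$(git rev-parse main); f=$(git rev-parse feature)
git merge -q --no-edit feature                 # no overlap, so Git creates M on its own
parents=$(git rev-list --parents -n 1 HEAD)    # "<M> <D> <F>"
echo "$parents"
```

The first parent is the commit main was on before the merge (D); the second is the tip of the branch that was merged in (F).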

B) What is a Rebase?

When we do a git rebase Git rewrites history (in a linear way) by:

Finding the common ancestor (like a merge).
Extracting patches (diffs between each commit and its direct parent) in our current branch.
Replaying those patches on top of the target branch (e.g., main), as if the changes were made sequentially, so we have a linear timeline on the current branch history.

main:    A -- B -- C -- D
                   | (common ancestor)
feature: A -- B -- C -- E -- F (E & F must be replayed)
feature (rebased): A -- B -- C -- D -- E' -- F' (E & F replayed on top of D)

NB: The replayed commits get different hashes (their parent changed), hence they are stored as new commits, separate from the originals.
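The same diagram, as a runnable sketch: commits E and F on feature are replayed on top of D. Everything here (file names, messages, identity) is illustrative.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo"; git config user.email "demo@example.com"
echo "base" > base.txt && git add . && git commit -q -m "C"

git switch -q -c feature
echo "e" > e.txt && git add . && git commit -q -m "E"
echo "f" > f.txt && git add . && git commit -q -m "F"
git switch -q main
echo "d" > d.txt && git add . && git commit -q -m "D"

old_tip=$(git rev-parse feature)               # the original F
git switch -q feature
git rebase -q main                             # replay E and F on top of D
new_tip=$(git rev-parse feature)               # F', a brand new commit
history=$(git log --format=%s feature | tr '\n' ' ')
echo "$history"                                # newest first: F E D C
```

Same messages, same changes, but the feature tip now has a different hash and sits directly on top of main's latest commit.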

Important Takeaways

Git stores snapshots, not diffs.
Blobs hold content, trees hold structure, and commits link it all with history.
Branches are movable pointers; tags are permanent markers.
Diffs are computed - not stored - and used heavily in merges, rebases, cherry-picks, and more.

A Final Analogy: Git is Like a Book

To wrap it all up, think of Git as a versioned book you're writing.

Your repository is the entire book, with a local draft on your machine and a remote copy shared with your editor, co-authors, and reviewers.
Each commit is like a chapter, made up of paragraphs and pages (blobs and trees) — self-contained and complete at the time it was written.
Branches are your colorful bookmarks or Post-Its, pointing to different narrative threads or timelines you're exploring, and you can move them and replace them as you want.
A lightweight tag is like a fluorescent highlight — a quick mark to find something important.
An annotated tag is a chapter title page with a summary explaining why that moment in the story mattered.
A merge is when two co-authors hand in their chapters, and you combine them into one manuscript, resolving overlapping content as needed.
A rebase is when you take your co-author’s updated chapter order, then rewrite your chapters to follow that new flow, as if you started writing from their latest draft.

This analogy isn't 100% perfect, but it helps: Git is not a list of changes — it's the evolving draft of your whole project, with structure, snapshots, and storytelling.

One Last Call: Stay Curious ^-^

Whether you're using Git, writing code, or just driving your car — don't settle for just knowing what to do. Ask why ?!

You press the clutch before shifting gears — but have you ever wondered why? Once you understand how a clutch works (disconnecting the engine from the gearbox to change speed smoothly), the whole driving experience feels different — more connected, more in control.

The same goes for tools like Git. When you dig beneath the commands and understand the design — snapshots, patches, hashes — you're no longer just pushing buttons. You're mastering the craft.

Raptor Lake or M2, always remember that a semiconductor is winking 😜😉 for your happiness.

Stay curious. It's the most powerful skill you'll ever develop ~ Badreddine.
