How does git work internally?

Sidharth Sunu

What is Git?

Most of us have used Git at some point. For those who haven't, or who just haven't thought much about it, Git is a version control system. So, what is a version control system? Imagine you're building a project. You reach a point where the code does something useful, so you decide it's a good time to save. Git records that state as a snapshot called a commit, stored locally in your repository, and you can optionally push it to a hosting site like GitHub. As you continue working, you keep adding commits, and you can go back to any previous commit whenever you want.

The next concept in Git is branching. What is branching? Let's consider an example: you're making a browser and want to create a gamer version of it. You create a branch and work on that, developing the gamer-themed version of your browser, which is separate from the main browser development.

If you decide you want the gamer version to become your main browser because you like it so much, you merge the gamer version with the normal one. This is called a Git merge of two branches.

So far, we've covered the basics, and now we can move on to how Git actually works. For anyone interested in creating their own version of Git, I'll share a link to a version I made called pytrk.
https://github.com/sidharth-sunu/pytrk/tree/master

As a side note, we will only cover the essential Git commands for now:

  • init

  • add

  • commit

  • checkout

  • branch

  • merge

Git Internals

You may or may not be familiar with using Git, but we’ll cover how these commands work on the inside.

git init

Let's start from the beginning. What do we do when we have a baseline for our project and want to commit it to Git? Well, we first initialize a Git repository. Essentially, we run the git init command, which creates a .git folder. This folder contains a few key files and folders:

  • objects: This is where Git's data is stored: the file contents (blobs) written by an add, as well as the trees and commits written by a commit.

  • refs/heads: This is where each branch has a file holding the hash of its latest commit (its "head").

  • HEAD: This is a file that points to the current branch, typically with a line like ref: refs/heads/master.

  • index file: This is where the files you've staged so far are tracked. (In my project, I used index.txt for simplicity, but in real Git it's a binary file called index, and it only appears after the first git add.)

So the file structure becomes something like this:

.git/
├── HEAD
├── objects/
├── refs/
│   └── heads/
└── index
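
To make the layout concrete, here is a minimal Python sketch of what an init step could look like. The function name init, the root parameter, and the master branch name are assumptions made for illustration; this is not the code from pytrk or from Git itself.

```python
from pathlib import Path


def init(root: Path) -> Path:
    """Create a minimal .git-style layout under `root` (hypothetical helper)."""
    git_dir = root / ".git"
    (git_dir / "objects").mkdir(parents=True, exist_ok=True)
    (git_dir / "refs" / "heads").mkdir(parents=True, exist_ok=True)

    head = git_dir / "HEAD"
    if not head.exists():
        # HEAD names the branch file that will receive the first commit.
        # The branch name "master" is an assumption of this sketch.
        head.write_text("ref: refs/heads/master\n")

    # Real Git creates the index on the first `git add`; creating an empty
    # file up front just keeps this sketch simple.
    (git_dir / "index").touch()
    return git_dir
```

Calling init(Path(".")) inside an empty project folder would produce the structure shown above. Note that refs/heads stays empty until the first commit is made.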

git add

This command stages files: either specific files, or all files that are new or have changed since you last added them.

This first reads the file and then creates a header for it, which consists of the object type (blob), a space, the length of the data in bytes, and a \0 character (for separation). The encoded header is prepended to the file's data, which gives us a blob object. Then we generate a hash of this blob (SHA-1 in standard Git) and compress the blob data (with zlib in real Git).

This hash is then used to create a folder within the objects folder. The first two characters of the hash become the folder's name, and the rest of the hash is used as the filename where the compressed blob data is stored. Then, in a file called the index, the hash of the file and the file's name are stored for later reference. This is how each file added using git add is stored.
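
The steps above can be sketched in Python roughly as follows. The names hash_object and add, and the plain-text index format of one "hash path" line per file, are assumptions for illustration (real Git's index is binary).

```python
import hashlib
import zlib
from pathlib import Path


def hash_object(git_dir: Path, data: bytes, obj_type: str = "blob") -> str:
    """Store `data` as a Git-style object and return its hex hash."""
    header = f"{obj_type} {len(data)}\0".encode()  # e.g. b"blob 6\x00"
    store = header + data
    sha = hashlib.sha1(store).hexdigest()

    # First two hex characters name the folder, the rest name the file.
    obj_path = git_dir / "objects" / sha[:2] / sha[2:]
    if not obj_path.exists():  # identical content is only stored once
        obj_path.parent.mkdir(parents=True, exist_ok=True)
        obj_path.write_bytes(zlib.compress(store))
    return sha


def add(git_dir: Path, root: Path, rel_path: str) -> str:
    """Stage one file: store its blob and record "hash path" in the index."""
    sha = hash_object(git_dir, (root / rel_path).read_bytes())

    index = git_dir / "index"
    entries = {}
    if index.exists():
        for line in index.read_text().splitlines():
            entry_sha, entry_path = line.split(" ", 1)
            entries[entry_path] = entry_sha

    entries[rel_path] = sha  # re-adding a changed file replaces its old hash
    index.write_text(
        "".join(f"{h} {p}\n" for p, h in sorted(entries.items()))
    )
    return sha
```

Because the hash is computed from the header plus the content, two files with identical content map to the same object, which is why the existence check before writing is safe.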

git commit

Now that we have added files to the repo, we need to commit the changes to create a snapshot. This is a very useful feature, as we can roll back to any commit we like.

So how does it work? Well, when we run git commit -m <commit message>, we don't directly create a commit. First, we have to make a "tree."

So what is a tree? When we run commit, a write-tree function is called first. It reads the index file and extracts the hash and path of the files we added earlier. Then we create an entry for each file, which consists of: file mode + a space + the path + a \x00 null byte + bytes(hash). Here bytes(hash) is the raw 20-byte form of the hash, not its 40-character hex string. The file mode is essentially a number (100644) used to indicate it's a regular file.

Then, each entry is appended to a list called entries.

Now we need to generate the actual tree content and hash. For that, we make the tree_content by joining all the data from the entries list end-to-end. We also create a tree_header, which is going to be a recurring theme. All headers have a similar format: the object type, a space, the length of the content, and a null character.

Then the header and the content are added together to create the full tree body. This body is then passed to our hashing function, which, as we've seen, takes this data, creates a hash from it, compresses it, and then stores it in the objects directory. The function returns the hash.

This hash is taken as the tree hash. Now, finally, this tree hash can be used in the commit function.
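
As a rough sketch of the tree-building step, assuming the index has already been parsed into (hash, path) pairs: the function name build_tree is hypothetical, and the sketch only handles a flat list of regular files (real Git represents subdirectories as nested tree objects). Storing the result works the same way as for blobs, so only the hashing is shown.

```python
import hashlib


def build_tree(index_entries: list[tuple[str, str]]) -> tuple[str, bytes]:
    """Return (tree_hash, full_tree_object) for flat (hex_hash, path) pairs."""
    entries = []
    # Git keeps tree entries sorted by name; sorting also makes the
    # resulting hash independent of the order files were added in.
    for sha_hex, path in sorted(index_entries, key=lambda e: e[1].encode()):
        entry = (
            b"100644 "                # mode for a regular file, then a space
            + path.encode()
            + b"\x00"                 # null byte separates path from hash
            + bytes.fromhex(sha_hex)  # raw 20-byte hash, not hex text
        )
        entries.append(entry)

    tree_content = b"".join(entries)
    tree_header = f"tree {len(tree_content)}\0".encode()
    body = tree_header + tree_content
    return hashlib.sha1(body).hexdigest(), body
```

Each entry here takes 7 bytes for the mode and space, then the path, one null byte, and 20 bytes of hash, which is why trees stay compact even for many files.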

For the commit itself, we first take a few details, like the time, name, and email of the user, as well as two other key fields: parent and tree.

  • parent is the commit hash of the previous commit, if there is one. We check the HEAD file and go to the path specified in it. If that path exists, it means a commit has already occurred in the current branch, and we can use that commit's hash as the parent. If not, we understand that this is the first commit and leave the parent empty.

  • tree is the tree hash we just got a moment ago.

Then we create a body using all these details (parent, tree, author, email, etc.), create a header as usual using the length of the body, and join these two for the final commit content. We pass this over to our hashing function again, which creates a file with the commit data and returns the commit hash. This hash is now stored in the file that HEAD points to.

And voila, we have got a commit!
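
Putting the commit steps together, a minimal sketch might look like the following. The helper names store_object, current_branch_file, and commit are hypothetical, the timezone is fixed to +0000 for simplicity, and the sketch assumes HEAD always contains a ref: line pointing at a branch file.

```python
import hashlib
import time
import zlib
from pathlib import Path
from typing import Optional


def store_object(git_dir: Path, obj_type: str, content: bytes) -> str:
    """Prefix the header, hash, compress, and write the object."""
    full = f"{obj_type} {len(content)}\0".encode() + content
    sha = hashlib.sha1(full).hexdigest()
    path = git_dir / "objects" / sha[:2] / sha[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(full))
    return sha


def current_branch_file(git_dir: Path) -> Path:
    """Follow HEAD (assumed to be "ref: refs/heads/<branch>") to its file."""
    head = (git_dir / "HEAD").read_text().strip()
    return git_dir / head.removeprefix("ref: ")


def commit(git_dir: Path, tree_sha: str, name: str, email: str,
           message: str, timestamp: Optional[int] = None) -> str:
    branch_file = current_branch_file(git_dir)
    # No branch file yet means this is the first commit: no parent.
    parent = branch_file.read_text().strip() if branch_file.exists() else None

    when = int(time.time()) if timestamp is None else timestamp
    who = f"{name} <{email}> {when} +0000"

    lines = [f"tree {tree_sha}"]
    if parent:
        lines.append(f"parent {parent}")
    # A blank line separates the header fields from the message.
    lines += [f"author {who}", f"committer {who}", "", message, ""]

    sha = store_object(git_dir, "commit", "\n".join(lines).encode())

    # Move the current branch forward to the new commit.
    branch_file.parent.mkdir(parents=True, exist_ok=True)
    branch_file.write_text(sha + "\n")
    return sha
```

Since each commit stores its parent's hash, following parent lines from any commit walks the history backwards, which is what the merge step later relies on.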

git checkout

So checkout essentially covers two cases: switching to another branch, or moving to a previous (or specific) commit by its hash.

If it's a branch name, we go to the branch file in refs/heads/<name-of-branch> and get the latest commit hash. Otherwise, we already have the commit hash we want to move to.

Next, we might clean up the working directory by removing the files from the current commit before replacing them with files from the new one.

Now our job is to go to the target commit hash and do the reverse of the steps we've done before: read the commit to find its tree hash, read the tree to get each file's path and blob hash, and then read each blob. In every case, it's about finding the right hash in the objects folder, decompressing the data, figuring out if it's a commit or a tree or a blob, and using that info to write the files back into our project folder.

The goal of this guide is to make you aware of the internal workings of Git, not to present the full code. That will be available in the next blog post, which I will link below! This is essentially the theory behind the practical.
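
Even so, a compact sketch can make the reverse steps easier to picture. The function names below (read_object, parse_tree, resolve_target, checkout) are hypothetical, the sketch only handles flat trees of regular files, and it skips the cleanup of files from the previous commit.

```python
import zlib
from pathlib import Path


def read_object(git_dir: Path, sha: str) -> tuple[str, bytes]:
    """Decompress an object and split it into (type, content)."""
    raw = zlib.decompress((git_dir / "objects" / sha[:2] / sha[2:]).read_bytes())
    header, _, content = raw.partition(b"\x00")
    obj_type, _size = header.decode().split(" ")
    return obj_type, content


def parse_tree(content: bytes) -> list[tuple[str, str, str]]:
    """Split tree content into (mode, path, hex_hash) entries."""
    entries, pos = [], 0
    while pos < len(content):
        space = content.index(b" ", pos)
        nul = content.index(b"\x00", space)
        mode = content[pos:space].decode()
        path = content[space + 1:nul].decode()
        sha = content[nul + 1:nul + 21].hex()  # 20 raw bytes after the null
        entries.append((mode, path, sha))
        pos = nul + 21
    return entries


def resolve_target(git_dir: Path, target: str) -> str:
    """A branch name resolves through refs/heads; anything else is a hash."""
    branch_file = git_dir / "refs" / "heads" / target
    return branch_file.read_text().strip() if branch_file.exists() else target


def checkout(git_dir: Path, root: Path, target: str) -> str:
    commit_sha = resolve_target(git_dir, target)
    obj_type, body = read_object(git_dir, commit_sha)
    if obj_type != "commit":
        raise ValueError(f"{commit_sha} is a {obj_type}, not a commit")

    # The first line of a commit body is "tree <hash>".
    tree_sha = body.decode().split("\n", 1)[0].split(" ")[1]
    _, tree = read_object(git_dir, tree_sha)
    for _mode, path, blob_sha in parse_tree(tree):
        _, data = read_object(git_dir, blob_sha)
        out = root / path
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_bytes(data)

    if (git_dir / "refs" / "heads" / target).exists():
        (git_dir / "HEAD").write_text(f"ref: refs/heads/{target}\n")
    else:
        # Real Git stores the raw hash here (a "detached HEAD").
        (git_dir / "HEAD").write_text(commit_sha + "\n")
    return commit_sha
```

Tree entries are parsed by offset rather than by splitting on separators, because the 20 raw hash bytes can themselves contain space or null bytes.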

git branch

I already explained in brief what git branch is at the beginning of this guide. Now let's get into the details!

This is pretty straightforward, as we've covered most of the complex parts earlier. Essentially, all we have to do is get the latest commit of the current branch, create a new file in the refs/heads directory with the name of the new branch, and then put the hash of that commit in this new file.

To move to this branch, as we saw in checkout, we just move the HEAD file pointer to this new branch file. And voila, we have created a new branch which is separate from the old one, and all the changes here won't affect the original branch.
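
Both steps are small enough to sketch in a few lines of Python. The names create_branch and switch_branch are hypothetical, and the sketch assumes HEAD contains a ref: line and that the current branch already has at least one commit.

```python
from pathlib import Path


def create_branch(git_dir: Path, name: str) -> str:
    """Create refs/heads/<name> pointing at the current branch's commit."""
    head = (git_dir / "HEAD").read_text().strip()
    current_file = git_dir / head.removeprefix("ref: ")
    sha = current_file.read_text().strip()

    new_file = git_dir / "refs" / "heads" / name
    if new_file.exists():
        raise FileExistsError(f"branch {name!r} already exists")
    new_file.write_text(sha + "\n")
    return sha


def switch_branch(git_dir: Path, name: str) -> None:
    """Point HEAD at another branch file (working files are not touched)."""
    if not (git_dir / "refs" / "heads" / name).exists():
        raise FileNotFoundError(f"no branch named {name!r}")
    (git_dir / "HEAD").write_text(f"ref: refs/heads/{name}\n")
```

Creating a branch copies only a 40-character hash into a new file, which is why branching is so cheap compared with copying the project.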

Now, what happens when we develop these branches separately?

Well, the original branch’s new commits will link back to the commit where we split. Meanwhile, from that same split point, another chain of commits will be formed for the new branch. Once you get a picture of this in your head, it’ll be much more of an "ooohh, now I get it!" moment.

git merge

Time for our final part... the git merge. There are two main types of merges: a fast-forward merge and a 3-way merge. To keep things simple, I'll explain the fast-forward merge for now.

Essentially, a fast-forward merge works like this: say we are on the main branch, and we also have a feature branch. We take the latest commit hash of the main branch. Then we start at the feature branch's latest commit and trace our way back (using the parent commits). If any of those commits matches the main branch's latest commit hash, then voila, we know that the feature branch is a direct extension of the current branch!

So what do we do now?

Well, it's pretty straightforward, actually. We just take the feature branch's latest commit hash and update the file for the main branch to point to it. As in checkout, the working directory is then refreshed to match that commit. Bam, we are done!
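
The ancestry check and the pointer update can be sketched as below. The names parent_of and fast_forward are hypothetical, the sketch follows only the first parent line of each commit, and it leaves the working directory alone.

```python
import zlib
from pathlib import Path
from typing import Optional


def parent_of(git_dir: Path, sha: str) -> Optional[str]:
    """Return the first parent hash recorded in a commit, or None."""
    raw = zlib.decompress((git_dir / "objects" / sha[:2] / sha[2:]).read_bytes())
    body = raw.split(b"\x00", 1)[1].decode()
    for line in body.split("\n"):
        if line == "":  # a blank line ends the commit's header fields
            break
        if line.startswith("parent "):
            return line.split(" ", 1)[1]
    return None


def fast_forward(git_dir: Path, current: str, other: str) -> bool:
    """Fast-forward branch `current` to `other` if `other` descends from it."""
    heads = git_dir / "refs" / "heads"
    ours = (heads / current).read_text().strip()
    theirs = (heads / other).read_text().strip()

    # Walk back from the other branch's tip looking for our tip.
    sha: Optional[str] = theirs
    while sha is not None:
        if sha == ours:
            (heads / current).write_text(theirs + "\n")
            return True
        sha = parent_of(git_dir, sha)

    # Our tip is not in the other branch's history: the branches have
    # diverged, and a 3-way merge would be needed instead.
    return False
```

A False result leaves the branch file untouched, so the caller can fall back to a different merge strategy without any cleanup.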
