Building a Simple Git(crowGit)-like Version Control System in Python

Biohacker0
26 min read

This is a simple version, but it does work, and it has the most commonly used basic capabilities. I built an earlier version that physically moved files around; it was more of a fancy file mover LOL. This one works like actual Git.

Also, some things might not work as expected, so apologies from my side. I am still learning as I build, and any suggestions are appreciated.

Besides the Git logic itself, it also has CLI commands.

Credits/References: I used lots of references to build this. I don't wanna lie and say I built this all by myself with no help and the code just came to my mind. I used ChatGPT to debug and read these blogs:

https://benhoyt.com/writings/pygit/

https://wyag.thb.lt/

I spent 2 days just reading blogs about how others built theirs, what language would be best for my case, etc.

This is the third project I built without watching any YouTube tutorials. Everything was done by reading articles, the actual Python docs, and blogs. Slowly I am getting into the habit of reading PDFs now; it's more fun, and faster.


To use it : GitHub Link

The commands are mentioned on my GitHub.

Use the mygit_cli.py version to test the CLI. There is another Python file there called mygit.py; it does not have a CLI and instead relies on user prompts and print statements.


MyGit CLI Overview:

  • File System: MyGit works with your local file system, tracking changes and creating snapshots.

  • Git Data Structures:

    • Tree: Represents the directory structure of your project.

    • Commit: Snapshots of your project at specific points in time.

    • Blob: Individual file contents stored as objects.

  • Hashing:

    • SHA-1 Hashing: Identifies each object by a hash of its content, so identical content always gets the same ID and any change produces a different one.
  • Commands:

    • mygit init: Initializes a new MyGit repository.

    • mygit add: Stages changes for commit, creating Blobs and updating the Tree.

    • mygit commit: Creates a new Commit by referencing the current Tree.

    • mygit create_branch: Adds new branches.

    • mygit switch_branch: Switches between branches.

    • mygit log: Displays the commit history.

How MyGit Works - Step by Step:

  1. Initialization (mygit init):

    • Creates an empty MyGit repository with essential directories.

    • Initializes the initial branch (named "default" in the code below).

  2. Adding Changes (mygit add):

    • Stages changes by creating Blobs for each file.

    • Updates the Tree to reflect the current directory structure.

  3. Committing (mygit commit):

    • Creates a Commit object:

      • References the current Tree.

      • Records authorship and commit message.

      • Optionally references the parent commit for history.

    • Updates the current branch reference to the new Commit.

  4. Branching (mygit create_branch):

    • Adds a new branch, creating a branch reference.

    • Allows parallel development on separate branches.

  5. Switching Branches (mygit switch_branch):

    • Changes the current working branch.

    • Enables developers to work on different features or versions.

  6. Viewing Commit History (mygit log):

    • Lists available branches.

    • Displays detailed commit history for the selected branch.


Code working and Explanation:

ayo master, this is the main stuff you came for, also gib me moni and job if you like me


import hashlib
import os

REPO_PATH = "mygit"
  • Input: There is no direct input to this constant. It simply defines the path to the Git repository.

  • Output: No output is produced. It's a constant value used throughout the code.

  • Algorithm (Initialization):

    • The constant REPO_PATH is set to "mygit," which represents the path to the Git repository.
  • Mathematics and Intuition:

    • REPO_PATH serves as a placeholder for the path to the Git repository, making it easier to reference throughout the code.

TreeEntry Class:

class TreeEntry:
    def __init__(self, name, is_directory, hash):
        self.name = name
        self.mode = "40000" if is_directory else "100644"
        self.hash = hash
  • Input:

    • name: Name of the file or directory.

    • is_directory: Boolean indicating whether it's a directory (True) or a file (False).

    • hash: Hash of the file content (if it's a file) or the tree object (if it's a directory).

  • Output: An instance of the TreeEntry class is created with the specified attributes.

  • Algorithm (Initialization):

    • The __init__ method initializes a TreeEntry object with three attributes: name, mode, and hash.

    • name represents the name of the file or directory.

    • is_directory is a boolean flag that determines if the entry is a directory or a file.

    • hash stores the hash associated with the object (file content hash for files and tree object hash for directories).

    • The mode attribute is determined based on whether the entry is a directory or a file.

  • Mathematics and Intuition:

    • The TreeEntry class encapsulates file and directory information in Git's tree structure.

    • It abstracts the complexity of storing mode, name, and hash for each entry.

    • The mode attribute is set based on whether the entry represents a directory or a file, following Git's conventions.
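
As a quick illustration of how the mode is chosen, here is a minimal sketch. The class is repeated so the snippet runs on its own, and the entry names and hash strings are made-up placeholders, not real digests.

```python
class TreeEntry:
    """Copy of the class above so this snippet is self-contained."""

    def __init__(self, name, is_directory, hash):
        self.name = name
        self.mode = "40000" if is_directory else "100644"
        self.hash = hash


# Hypothetical entries; the hash strings are placeholders, not real digests.
file_entry = TreeEntry("README.md", False, "abc123")
dir_entry = TreeEntry("src", True, "def456")

print(file_entry.mode)  # 100644 (regular, non-executable file)
print(dir_entry.mode)   # 40000 (directory, i.e. a nested tree)
```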


Tree Class:

class Tree:
    def __init__(self):
        self.entries = {}

    def hash(self):
        data = []
        for name, entry in sorted(self.entries.items()):
            data.append(f"{entry.mode} {name}\0{entry.hash}")
        return hash_object("".join(data).encode(), "tree")
  • Input: The Tree class has no direct input. It initializes an empty tree with an entries dictionary.

  • Output: The hash method returns the SHA-1 hash of the tree object represented by the entries.

  • Algorithm (Initialization):

    • The __init__ method initializes an empty Tree object with an empty entries dictionary.
  • Algorithm (Hash Calculation - hash method):

    • To calculate the hash of the Tree object, follow these steps:

      1. Create an empty list data to store data entries.

      2. Iterate through the entries dictionary, sorted by entry name.

      3. For each entry, build a string of the form "<mode> <name>\0<hash>": a space separates the mode and the name, and a null character separates the name and the hash.

      4. Append the concatenated entry to the data list.

      5. Join all entries in data into a single string and encode it.

      6. Compute the SHA-1 hash of the encoded data using the hash_object function.

      7. Return the computed SHA-1 hash.

  • Mathematics and Intuition:

    • The Tree class represents a Git tree object, which organizes file and directory entries within a Git repository.

    • The hash method calculates a unique identifier (hash) for the tree object by concatenating the mode, name, and hash of each entry. This hash is crucial for tracking and managing changes within the repository.
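
To see why the entries are sorted before hashing, here is a small sketch. The helper tree_hash is hypothetical: it lays out entries the same way as Tree.hash() but only computes the digest in memory and writes nothing to disk. The child hashes are placeholders.

```python
import hashlib


def tree_hash(entries):
    """Hash a {name: (mode, child_hash)} mapping using the Tree.hash() layout.

    Hypothetical helper: it only computes the digest and never writes a file.
    """
    body = "".join(
        f"{mode} {name}\0{child_hash}"
        for name, (mode, child_hash) in sorted(entries.items())
    ).encode()
    header = f"tree {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


a = tree_hash({"b.txt": ("100644", "h2"), "a.txt": ("100644", "h1")})
b = tree_hash({"a.txt": ("100644", "h1"), "b.txt": ("100644", "h2")})
c = tree_hash({"a.txt": ("100644", "h1"), "b.txt": ("100644", "CHANGED")})

print(a == b)  # True: sorting makes insertion order irrelevant
print(a == c)  # False: a changed child hash changes the tree hash
```

Because every child hash feeds into the parent's hash, a change anywhere below a directory shows up in that directory's tree hash.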


hash_object Function:

def hash_object(data, obj_type):
    header = f"{obj_type} {len(data)}\0"
    full_data = header.encode() + data
    sha1_hash = hashlib.sha1(full_data).hexdigest()
    with open(f"{REPO_PATH}/objects/{sha1_hash}", "wb") as f:
        f.write(full_data)
    return sha1_hash
  • Input:

    • data: The data to be hashed, typically the content of a Git object (e.g., file content or tree structure).

    • obj_type: The type of the Git object, such as "blob" for file content or "tree" for tree structure.

  • Output: The function returns the SHA-1 hash of the provided data.

  • Algorithm:

    • To calculate the SHA-1 hash of the provided data for a Git object, follow these steps:

      1. Construct a header string that includes the object type and the length of the data, separated by a null character.

      2. Combine the header and data by encoding them as bytes and storing them in the full_data variable.

      3. Calculate the SHA-1 hash of the full_data variable.

      4. Create a file in the Git repository's "objects" directory with the SHA-1 hash as the filename, and write the full object content to the file.

      5. Return the computed SHA-1 hash.

  • Mathematics and Intuition:

    • The hash_object function generates a unique identifier (hash) for Git objects, which is essential for Git's version control system.

    • It constructs a header that includes metadata about the object type and data length.

    • The SHA-1 hash is computed over the combined data, resulting in a unique identifier for the Git object.

    • This hash is then used for storing and retrieving Git objects within the repository.
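
The header format can be checked against real Git. The sketch below uses a hypothetical helper, object_id, which does the same hashing as hash_object but skips the file write. For an empty blob it should give the same ID that real Git uses.

```python
import hashlib


def object_id(data: bytes, obj_type: str) -> str:
    """Compute the ID that hash_object would return, without touching disk."""
    header = f"{obj_type} {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


empty_blob = object_id(b"", "blob")
print(empty_blob)
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the ID real Git reports for an empty blob

# The type is part of the hashed bytes, so the same payload stored as a
# different object type gets a different ID.
print(object_id(b"", "tree") == empty_blob)  # False
```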


create_tree Function:

def create_tree():
    tree = Tree()
    for root, dirs, files in os.walk(".", topdown=True):
        # Skip the repository's own storage so stored objects are not snapshotted
        dirs[:] = [d for d in dirs if d != REPO_PATH]
        subtree = Tree()
        for dir_name in dirs:
            subtree.entries[dir_name] = TreeEntry(dir_name, True, "")
        for file_name in files:
            file_path = os.path.join(root, file_name)
            # Read the file in binary mode and close it promptly
            with open(file_path, "rb") as f:
                file_content = f.read()
            file_hash = hash_object(file_content, "blob")
            subtree.entries[file_name] = TreeEntry(file_name, False, file_hash)
        dir_name = os.path.relpath(root, start=".")
        if dir_name == ".":
            dir_name = ""
        tree.entries[dir_name] = TreeEntry(dir_name, True, subtree.hash())
    return tree.hash()
  • Input: None

  • Output: The function returns the SHA-1 hash of a tree object that represents the current state of the working directory.

  • Algorithm:

    1. Create an empty tree object.

    2. Traverse the directory structure using os.walk, starting from the current directory (".").

    3. For each directory encountered, create a subtree object and add entries for subdirectories (with empty hashes).

    4. For each file encountered, calculate the SHA-1 hash of its content using hash_object and add it to the subtree.

    5. Determine the relative directory path (dir_name) and add an entry for it in the tree, pointing to the subtree object.

    6. Repeat steps 3-5 for all directories and files in the working directory.

    7. Return the SHA-1 hash of the tree object.

  • Mathematics and Intuition:

    • The create_tree function generates a tree object that represents the current state of the working directory in a Git repository.

    • It constructs the tree by recursively traversing the directory structure and calculating the SHA-1 hash of file contents.

    • The tree's SHA-1 hash serves as a unique identifier for the state of the working directory at a specific point in time.
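
In the walk above, subdirectory entries are stored with empty hashes. As an alternative (not the approach used in this project), here is a rough sketch of a recursive builder that hashes children first, so each directory entry carries its subtree's hash. The helpers object_id and build_tree are hypothetical and work purely in memory; the skip list assumes the repository folder is called "mygit".

```python
import hashlib
import os
import tempfile


def object_id(data: bytes, obj_type: str) -> str:
    """Hash header + data like hash_object, but without writing a file."""
    header = f"{obj_type} {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


def build_tree(path: str, skip=("mygit",)) -> str:
    """Return a tree ID for `path`, hashing children before their parent."""
    lines = []
    for name in sorted(os.listdir(path)):
        if name in skip:
            continue  # do not snapshot the repository's own storage
        full = os.path.join(path, name)
        if os.path.isdir(full):
            lines.append(f"40000 {name}\0{build_tree(full, skip)}")
        else:
            with open(full, "rb") as f:
                lines.append(f"100644 {name}\0{object_id(f.read(), 'blob')}")
    return object_id("".join(lines).encode(), "tree")


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "src"))
    with open(os.path.join(root, "src", "main.py"), "w") as f:
        f.write("print('v1')\n")
    before = build_tree(root)

    with open(os.path.join(root, "src", "main.py"), "w") as f:
        f.write("print('v2')\n")
    after = build_tree(root)

print(before != after)  # True: a change deep in src/ changes the root tree ID
```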


commit Function:

def commit(args):
    repo_path = args.repo_path  # Get the repository path from args
    tree_hash = create_tree()

    # Determine the current branch dynamically if not provided
    current_branch = args.branch
    if current_branch is None:
        current_branch = get_current_branch(repo_path)

    parent_commit = get_head_commit(current_branch)
    # Header lines go before the blank line; the message comes after it
    commit_data = f"tree {tree_hash}\n"
    if parent_commit:
        commit_data += f"parent {parent_commit}\n"
    commit_data += f"author {args.author}\ncommitter {args.author}\n\n{args.message}\n"
    commit_hash = hash_object(commit_data.encode(), "commit")

    # Update the branch reference to the new commit
    branch_ref_path = os.path.join(repo_path, "refs", "heads", current_branch)
    with open(branch_ref_path, "w") as ref_file:
        ref_file.write(commit_hash)

    print(f"Committed: {commit_hash[:7]} to branch {current_branch}")
    return current_branch
  • Input:

    • args: A dictionary-like object containing information about the commit, including author, message, and optional branch name.
  • Output: The function returns the name of the branch to which the commit was made.

  • Algorithm:

    1. Get the repository path from the provided args.

    2. Calculate the SHA-1 hash of the tree object representing the current state of the working directory using the create_tree function.

    3. Use the branch name from args if one was provided; otherwise look up the currently checked-out branch with the get_current_branch function.

    4. Get the SHA-1 hash of the head commit on the current branch (if it exists) using the get_head_commit function.

    5. Construct commit data:

      • Include the tree hash, author, committer, and commit message.

      • Optionally, include the parent commit hash if there is a previous commit on the branch.

    6. Calculate the SHA-1 hash of the commit data using the hash_object function.

    7. Update the branch reference to point to the new commit in the repository.

    8. Print a confirmation message indicating the commit's SHA-1 hash and the branch it was committed to.

    9. Return the name of the current branch.

  • Mathematics and Intuition:

    • The commit function is responsible for creating a new commit in a Git repository.

    • It calculates the SHA-1 hash of the tree object representing the current state of the working directory and constructs commit data that includes this tree hash, author information, committer information, and a commit message.

    • If it's not the initial commit, it also includes a reference to the parent commit.

    • The resulting commit object is identified by its SHA-1 hash, and the branch reference is updated to point to this new commit, effectively advancing the branch.
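
Here is a small sketch of how the commit text could be assembled, assuming every header line (tree, parent, author, committer) goes before the blank line and the message after it. The function build_commit_text and the sample values are hypothetical and only for illustration.

```python
def build_commit_text(tree_hash, author, message, parent=None):
    """Assemble commit text with all header lines before the blank line."""
    headers = [f"tree {tree_hash}"]
    if parent:
        headers.append(f"parent {parent}")
    headers.append(f"author {author}")
    headers.append(f"committer {author}")
    return "\n".join(headers) + f"\n\n{message}\n"


# Placeholder hashes; real ones would be 40-character SHA-1 digests.
first = build_commit_text("t1", "Ada", "Initial commit")
second = build_commit_text("t2", "Ada", "Add feature", parent="c1")

print(second)
```

Keeping the parent in the header block means a reader of the object can stop parsing at the first blank line and treat everything after it as the message.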


get_head_commit Function:

def get_head_commit(branch="master"):
    head_file_path = os.path.join(REPO_PATH, "refs", "heads", branch)
    if os.path.isfile(head_file_path):
        with open(head_file_path, "r") as ref_file:
            return ref_file.read().strip()
    return None
  • Input:

    • branch (optional): The name of the branch to retrieve the head commit from. If not provided, it defaults to "master."
  • Output: The function returns the SHA-1 hash of the commit at the head of the specified branch, or None if the branch doesn't exist.

  • Algorithm:

    1. Determine the path to the branch reference file based on the provided or default branch name.

    2. If the branch reference file exists: a. Read the content of the branch reference file, which contains the SHA-1 hash of the head commit. b. Return the SHA-1 hash.

    3. If the branch reference file doesn't exist (branch doesn't exist), return None.

  • Mathematics and Intuition:

    • The get_head_commit function retrieves the SHA-1 hash of the commit at the head of a specified branch.

    • It does this by reading the content of the branch reference file, which stores the SHA-1 hash of the head commit.

    • If the branch doesn't exist, it returns None.
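
A rough sketch of the same lookup against a throwaway repository layout. The helper read_branch_head is hypothetical: unlike get_head_commit it takes the repository path as a parameter, and it also treats an empty reference file (a branch with no commits yet) as None.

```python
import os
import tempfile


def read_branch_head(repo_path, branch):
    """Return the commit ID stored in refs/heads/<branch>, or None."""
    ref_path = os.path.join(repo_path, "refs", "heads", branch)
    if not os.path.isfile(ref_path):
        return None
    with open(ref_path) as ref_file:
        return ref_file.read().strip() or None  # empty ref: no commits yet


with tempfile.TemporaryDirectory() as repo:
    heads = os.path.join(repo, "refs", "heads")
    os.makedirs(heads)
    with open(os.path.join(heads, "default"), "w") as f:
        f.write("abc123\n")  # placeholder commit ID
    open(os.path.join(heads, "empty"), "w").close()

    results = (
        read_branch_head(repo, "default"),
        read_branch_head(repo, "empty"),
        read_branch_head(repo, "missing"),
    )

print(results)  # ('abc123', None, None)
```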


create_default_branch Function:

def create_default_branch(repo_path):
    branch_name = "default"
    branch_ref_path = os.path.join(repo_path, "refs", "heads", branch_name)
    with open(branch_ref_path, "w") as ref_file:
        ref_file.write("")
    update_head_ref(repo_path, branch_name)
    return branch_name
  • Input:

    • repo_path: The path to the Git repository.
  • Output: The function returns the name of the default branch, which is "default."

  • Algorithm:

    1. Set the default branch name as "default."

    2. Determine the path to the branch reference file for the default branch.

    3. Create an empty branch reference file for the default branch.

    4. Update the HEAD reference to point to the default branch.

    5. Return the name of the default branch, which is "default."

  • Mathematics and Intuition:

    • The create_default_branch function initializes a Git repository by creating a default branch named "default."

    • It ensures that the HEAD reference points to the default branch, making it the initial branch of the repository.


create_branch Function:

def create_branch(args):
    branch_name = args.branch_name
    branch_ref_path = os.path.join(args.repo_path, "refs", "heads", branch_name)
    if os.path.isfile(branch_ref_path):
        print(f"Branch '{branch_name}' already exists.")
    else:
        with open(branch_ref_path, "w") as ref_file:
            ref_file.write("")
        print(f"Created branch '{branch_name}'.")

    # Automatically switch to the newly created branch
    switch_branch(args)

    return branch_name
  • Input:

    • args: A dictionary-like object containing the name of the branch to be created (branch_name).
  • Output: The function returns the name of the newly created branch.

  • Algorithm:

    1. Get the desired branch name from the provided args.

    2. Determine the path to the branch reference file for the new branch.

    3. Check if the branch reference file already exists.

    4. If the branch reference file exists, print a message indicating that the branch already exists.

    5. If the branch reference file doesn't exist (new branch), create an empty branch reference file for it.

    6. Print a message indicating the successful creation of the new branch.

    7. Automatically switch to the newly created branch using the switch_branch function.

    8. Return the name of the newly created branch.

  • Mathematics and Intuition:

    • The create_branch function creates a new branch in the Git repository.

    • It checks if the branch already exists and, if not, creates a new branch reference file for it.

    • After the check, it always calls switch_branch, so you start working on the requested branch immediately (even if it already existed).


switch_branch Function:

def switch_branch(args):
    branch_name = args.branch_name
    # Remember the checked-out branch so it can be returned if the switch is aborted
    current_branch = get_current_branch(args.repo_path)
    if not branch_exists(branch_name, args.repo_path):
        print(f"Branch '{branch_name}' does not exist. Would you like to create it? (y/n)")
        choice = input().strip()
        if choice.lower() == 'y':
            create_branch(args)
        else:
            print("Aborted branch switch.")
            return current_branch
    update_head_ref(args.repo_path, branch_name)
    return branch_name
  • Input:

    • args: A dictionary-like object containing the name of the branch to switch to (branch_name).
  • Output: The function returns the name of the branch that was switched to.

  • Algorithm:

    1. Get the desired branch name from the provided args.

    2. Look up the currently checked-out branch using get_current_branch, so it can be returned if the switch is aborted.

    3. Check if the desired branch exists using the branch_exists function.

    4. If the desired branch does not exist: a. Print a prompt asking if you want to create the branch. b. Read the user's choice ('y' for yes, 'n' for no). c. If the user chooses 'y' (yes), create the branch using the create_branch function. d. If the user chooses 'n' (no), print a message indicating that the branch switch was aborted and return the current branch.

    5. If the desired branch exists, update the HEAD reference to point to the desired branch.

    6. Return the name of the branch that was switched to.

  • Mathematics and Intuition:

    • The switch_branch function allows you to switch to a different branch within the Git repository.

    • It first checks if the desired branch exists by calling branch_exists.

    • If the branch exists, it updates the HEAD reference to point to the desired branch, effectively switching branches.

    • If the branch doesn't exist, it offers to create the branch if the user chooses to do so.


branch_exists Function:

def branch_exists(branch_name, repo_path):
    branch_ref_path = os.path.join(repo_path, "refs", "heads", branch_name)
    return os.path.isfile(branch_ref_path)
  • Input:

    • branch_name: The name of the branch to check for existence.

    • repo_path: The path to the Git repository.

  • Output: The function returns True if the branch exists, and False otherwise.

  • Algorithm:

    1. Determine the path to the branch reference file for the specified branch.

    2. Check if the branch reference file exists.

    3. Return True if the file exists (branch exists), or False if it doesn't.

  • Mathematics and Intuition:

    • The branch_exists function checks whether a specified branch exists within the Git repository.

    • It does this by checking for the presence of the branch's reference file.

    • If the file exists, it returns True, indicating that the branch exists. Otherwise, it returns False.


log Function:

def log(args):
    repo_path = args.repo_path  # Get the repository path from args
    branch_name = args.branch

    if branch_name is None:
        branches = [branch for branch in os.listdir(os.path.join(repo_path, "refs", "heads")) if os.path.isfile(os.path.join(repo_path, "refs", "heads", branch))]
        if not branches:
            print("No branches found.")
            return
        print("Available branches:")
        for branch in branches:
            print(f"- {branch}")
        branch_choice = input("Enter the branch name to view commits (or 'default' for default): ").strip()
        branch_name = branch_choice if branch_choice else "default"

    if not branch_exists(branch_name, repo_path):
        print(f"Branch '{branch_name}' does not exist.")
        return
    with open(os.path.join(repo_path, "refs", "heads", branch_name), "r") as ref_file:
        latest_commit = ref_file.read().strip()
    commit = latest_commit
    while commit:
        commit_path = os.path.join(repo_path, "objects", commit)
        with open(commit_path, "r") as commit_file:
            commit_data = commit_file.read()
        print(f"Commit: {commit[:7]}")
        lines = commit_data.split("\n")
        for line in lines:
            if line.startswith("author "):
                print(line)
        print(commit_data)
        parent_commit = None
        for line in lines:
            # commit() writes "parent <hash>", without a colon
            if line.startswith("parent "):
                parent_commit = line.split(" ", 1)[1]
                break
        if parent_commit:
            print(f"Parent: {parent_commit}\n")
            commit = parent_commit
        else:
            break
  • Input:

    • args: A dictionary-like object containing information about the branch to log (branch) and the repository path (repo_path).
  • Output: The function displays the commit history for the specified branch or the default branch.

  • Algorithm:

    1. Get the repository path and branch name from the provided args.

    2. If no branch is specified, list the available branches in the repository.

    3. Prompt the user to enter the name of the branch they want to view commits for.

    4. If the user doesn't specify a branch, default to "default."

    5. Check if the specified branch exists using the branch_exists function.

    6. If the branch doesn't exist, print a message indicating that it doesn't exist.

    7. Read the SHA-1 hash of the latest commit on the branch from the branch's reference file.

    8. Start a loop to iterate through the commit history.

    9. Read and display commit information, including author, commit message, and parent commit.

    10. Update the commit variable to point to the parent commit and repeat the loop until there are no more parent commits.

  • Mathematics and Intuition:

    • The log function provides a commit history log for a specified branch (or the default branch if not specified).

    • It retrieves commit information, including the author, commit message, and parent commit(s), and displays them in a structured format.

    • The function iterates through the commit history by following parent commits, if they exist.
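
The parent-following loop can be tried without any files. This is a minimal sketch that assumes the parent line is stored among the header lines, before the blank line that starts the message. The helpers parse_parent and walk_history, and the in-memory objects dictionary with its short IDs, are hypothetical.

```python
def parse_parent(commit_text):
    """Return the parent ID from a commit's header block, or None."""
    header_block = commit_text.split("\n\n", 1)[0]
    for line in header_block.split("\n"):
        if line.startswith("parent "):
            return line.split(" ", 1)[1]
    return None


def walk_history(objects, start):
    """Yield commit IDs from `start` back to the root commit."""
    commit = start
    while commit:
        yield commit
        commit = parse_parent(objects[commit])


# Hypothetical in-memory object store: commit ID -> commit text.
objects = {
    "c1": "tree t1\nauthor Ada\ncommitter Ada\n\nInitial commit\n",
    "c2": "tree t2\nparent c1\nauthor Ada\ncommitter Ada\n\nAdd README\n",
    "c3": "tree t3\nparent c2\nauthor Ada\ncommitter Ada\n\nFix typo\n",
}

print(list(walk_history(objects, "c3")))  # ['c3', 'c2', 'c1']
```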


add Function:

def add(args):
    if args.filename == ".":
        # Adds all files in the current directory and subdirectories
        for root, dirs, files in os.walk(".", topdown=True):
            for file_name in files:
                file_path = os.path.join(root, file_name)
                add_file(args.repo_path, file_path)
            if not args.recursive:
                break
    else:
        # Adds the specified file
        add_file(args.repo_path, args.filename)
  • Input:

    • args: A dictionary-like object containing information about the file to be added (filename) and whether to add files recursively (recursive).
  • Output: The function adds the specified file(s) to the Git repository's index.

  • Algorithm:

    1. Check if the filename in args is set to "." (indicating adding all files in the current directory and subdirectories).

    2. If the filename is ".", iterate through the directory structure using os.walk:

      • For each file encountered, call the add_file function to add it to the repository's index.

      • If the recursive flag is not set, break the loop after processing the current directory.

    3. If the filename is not ".", call the add_file function to add the specified file to the repository's index.

  • Mathematics and Intuition:

    • The add function is responsible for adding files to the Git repository's index, preparing them for the next commit.

    • It can add either a single specified file or all files in the current directory and its subdirectories.

    • The function iterates through the directory structure to find and add files, and it respects the recursive flag to control the depth of file searching.
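
To show what the recursive flag changes, here is a small sketch of the same os.walk pattern on a temporary directory. The helper collect_files is hypothetical; it only lists the paths that would be visited and does not stage anything.

```python
import os
import tempfile


def collect_files(start, recursive):
    """List files in the order of add's os.walk loop, relative to `start`."""
    found = []
    for root, dirs, files in os.walk(start, topdown=True):
        for file_name in sorted(files):
            found.append(os.path.relpath(os.path.join(root, file_name), start))
        if not recursive:
            break  # stop after the first (top-level) directory
    return sorted(found)


with tempfile.TemporaryDirectory() as work:
    os.makedirs(os.path.join(work, "docs"))
    open(os.path.join(work, "a.txt"), "w").close()
    open(os.path.join(work, "docs", "b.txt"), "w").close()

    top_only = collect_files(work, recursive=False)
    everything = collect_files(work, recursive=True)

print(top_only)    # ['a.txt']
print(everything)  # ['a.txt', 'docs/b.txt'] with POSIX path separators
```

With topdown=True the first directory yielded is always the starting one, which is why a single break is enough to limit the walk to top-level files.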


init Function:

def init(repo_path):
    os.makedirs(os.path.join(repo_path, "objects"))
    os.makedirs(os.path.join(repo_path, "refs", "heads"))
    create_default_branch(repo_path)
    tree_hash = create_tree()
    commit_data = f"tree {tree_hash}\nauthor Anonymous\ncommitter Anonymous\n\nInitial commit\n"
    initial_commit_hash = hash_object(commit_data.encode(), "commit")
    update_head_ref(repo_path, "default", initial_commit_hash)
    print("Initialized empty MyGit repository with an initial commit.")
  • Input:

    • repo_path: The path where the Git repository will be initialized.
  • Output: The function initializes an empty Git repository and prints a message to confirm the initialization.

  • Algorithm:

    1. Create the directory structure for the Git repository:

      • Create the "objects" directory to store Git objects.

      • Create the "refs/heads" directory to store branch references.

    2. Create the default branch using the create_default_branch function.

    3. Calculate the SHA-1 hash of the tree object representing the initial state of the working directory using the create_tree function.

    4. Construct commit data for the initial commit, including the tree hash, author, committer, and commit message.

    5. Calculate the SHA-1 hash of the commit data using the hash_object function.

    6. Update the HEAD reference to point to the default branch and the initial commit.

    7. Print a confirmation message indicating the successful initialization of the repository.

  • Mathematics and Intuition:

    • The init function creates a new, empty Git repository and sets up the initial default branch.

    • It calculates the tree hash and creates an initial commit that represents the initial state of the working directory.

    • The function initializes the HEAD reference to point to the default branch, making it the starting point for version control.


update_head_ref Function:

def update_head_ref(repo_path, branch_name, commit_hash=None):
    head_ref_path = os.path.join(repo_path, "HEAD")
    if commit_hash is not None:
        with open(head_ref_path, "w") as head_file:
            head_file.write(f"ref: refs/heads/{branch_name}\n")
        branch_ref_path = os.path.join(repo_path, "refs", "heads", branch_name)
        with open(branch_ref_path, "w") as branch_file:
            branch_file.write(commit_hash)
    else:
        with open(head_ref_path, "w") as head_file:
            head_file.write(f"ref: refs/heads/{branch_name}\n")
  • Input:

    • repo_path: The path to the Git repository.

    • branch_name: The name of the branch to update the HEAD reference to.

    • commit_hash (optional): The SHA-1 hash of the commit to point the branch to. If not provided, only HEAD is rewritten and the branch reference file is left untouched.

  • Output: The function updates the HEAD reference to point to the specified branch and commit (if provided).

  • Algorithm:

    1. Determine the path to the HEAD reference file.

    2. If a commit_hash is provided: a. Write a reference to the specified branch in the HEAD reference file. b. Determine the path to the branch reference file for the specified branch. c. Write the commit_hash to the branch reference file, updating the branch to point to the specified commit.

    3. If no commit_hash is provided:

      • Write a reference to the specified branch in the HEAD reference file, indicating that it's the current branch.
  • Mathematics and Intuition:

    • The update_head_ref function is responsible for updating the HEAD reference in a Git repository.

    • HEAD always stores a branch name, never a commit hash. When a commit_hash is passed, it is written to the branch's own reference file instead.

    • Together, these two files determine the current state: HEAD names the branch, and the branch file names its latest commit.
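
The get_current_branch helper used elsewhere in the code is not shown in this post. As a guess at what such a reader could look like, here is a hedged sketch that writes HEAD in the same "ref: refs/heads/<branch>" format and parses it back. Both write_head and read_current_branch are hypothetical names.

```python
import os
import tempfile


def write_head(repo_path, branch_name):
    """Point HEAD at a branch, using the same text format as update_head_ref."""
    with open(os.path.join(repo_path, "HEAD"), "w") as head_file:
        head_file.write(f"ref: refs/heads/{branch_name}\n")


def read_current_branch(repo_path):
    """Hypothetical reader: recover the branch name from HEAD, or None."""
    head_path = os.path.join(repo_path, "HEAD")
    if not os.path.isfile(head_path):
        return None
    with open(head_path) as head_file:
        content = head_file.read().strip()
    prefix = "ref: refs/heads/"
    return content[len(prefix):] if content.startswith(prefix) else None


with tempfile.TemporaryDirectory() as repo:
    before = read_current_branch(repo)
    write_head(repo, "default")
    first = read_current_branch(repo)
    write_head(repo, "feature")
    second = read_current_branch(repo)

print(before, first, second)  # None default feature
```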


code ends here buddy, below is just the theory of data structure I used.


Data Structure Used:

1. Tree Structure and TreeEntry

In the Git code, the tree structure (Tree class) represents the hierarchical organization of files and directories within the repository. Mathematically, it can be defined as follows:

  • Let T represent a Git tree object.

  • T can be defined as an ordered set of TreeEntry objects: T = {entry_1, entry_2, ..., entry_n}, where each entry_i corresponds to a file or directory.

  • Each TreeEntry object has the following attributes:

    • name_i: The name of the file or directory (a string).

    • mode_i: The mode or permissions of the entry (a string, e.g., "100644" for files).

    • hash_i: The SHA-1 hash of the entry's content (a string).

Example:

Consider a simple directory structure within the repository:

Root/
|-- File1.txt (mode: 100644, hash: abc123)
|-- Directory1/
|   |-- File2.txt (mode: 100644, hash: def456)
|-- Directory2/
|   |-- File3.txt (mode: 100644, hash: ghi789)

In this example, the Tree object T can be represented as:

T = {
    TreeEntry(name="File1.txt", mode="100644", hash="abc123"),
    TreeEntry(name="Directory1", mode="40000", hash="..."),
    TreeEntry(name="Directory2", mode="40000", hash="..."),
}

Why This Choice?

  • Efficient Representation: The tree structure efficiently captures the hierarchical nature of files and directories within a repository. It organizes them into a structured format that is easy to traverse and manipulate.

  • Hashing: Each entry's hash ensures that changes in file content or structure result in a different tree hash, providing a means to track changes.


2. Hash Calculation (hash_object Function)

In the Git code, the hash_object function calculates the SHA-1 hash of Git objects, including blobs (file content), trees (directory structure), and commits (version snapshots). Here's a more mathematical explanation:

  • Mathematical Explanation:

    • The SHA-1 hash function, denoted as SHA1, takes an input binary string data and computes a 160-bit (20-byte) hash value.

    • It can be expressed as SHA1(data) = hash_value, where hash_value is a 160-bit value, usually written as a 40-character hexadecimal string.

    • Mathematically, the function SHA1 operates on binary sequences, and it produces a fixed-size output.

Example:

Let's say we have a blob object with binary data data_blob:

data_blob = "This is some content."

Using the SHA-1 hash calculation, we can compute the hash:

SHA1(data_blob) = <40-character hexadecimal digest>

Why This Choice?

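  • Cryptographic Properties: SHA-1 is a cryptographic hash function, so two different inputs are extremely unlikely to produce the same hash by accident. It does not guarantee uniqueness, though: practical collision attacks have been demonstrated (SHAttered, 2017), so SHA-1 is no longer recommended where resistance to deliberate collisions matters. For identifying objects in a learning project like this one, it is sufficient.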

  • Efficiency: SHA-1 provides a fixed-size hash output (160 bits), which is efficient for Git's purposes. It balances the need for uniqueness and efficiency in tracking changes.
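
The fixed output size is easy to check with Python's hashlib. This is just a quick sketch; note that it hashes the raw bytes only, without the "blob <size>\0" header that hash_object adds.

```python
import hashlib

digest = hashlib.sha1(b"This is some content.")

print(digest.digest_size)       # 20 bytes, i.e. 160 bits
print(len(digest.hexdigest()))  # 40 hexadecimal characters

# A one-character change gives a completely different digest.
other = hashlib.sha1(b"This is some content!")
print(digest.hexdigest() == other.hexdigest())  # False
```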


3. Commit Data Structure

In the Git code, a commit object represents a version snapshot of the repository. Here's a mathematical view:

  • Mathematical Explanation:

    • A Git commit can be represented as a tuple of attributes: Commit = (tree_hash, author, committer, message, parent_commit).

    • Each attribute can be expressed mathematically:

      • tree_hash: The SHA-1 hash of the corresponding tree object (a string).

      • author: Author information (a string).

      • committer: Committer information (a string).

      • message: Commit message (a string).

      • parent_commit: An optional reference to the parent commit (a string or null for the initial commit).

Example:

Consider a simple commit:

Tree Hash: abc123
Author: John Doe
Committer: John Doe
Message: Initial commit
Parent Commit: None

Mathematically, this commit can be represented as:

Commit = ("abc123", "John Doe", "John Doe", "Initial commit", None)
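The tuple above maps naturally onto a Python `NamedTuple`. This is only a sketch: the `serialize` helper and its line layout are loosely modeled on real Git's commit format and are assumptions, not necessarily what MyGit writes to disk.

```python
from typing import NamedTuple, Optional


class Commit(NamedTuple):
    tree_hash: str
    author: str
    committer: str
    message: str
    parent_commit: Optional[str] = None  # None for the initial commit


def serialize(commit: Commit) -> bytes:
    """Turn a commit into bytes that can be hashed and stored (hypothetical layout)."""
    lines = [f"tree {commit.tree_hash}"]
    if commit.parent_commit is not None:
        lines.append(f"parent {commit.parent_commit}")
    lines.append(f"author {commit.author}")
    lines.append(f"committer {commit.committer}")
    lines.append("")  # blank line separates the headers from the message
    lines.append(commit.message)
    return "\n".join(lines).encode()


initial = Commit("abc123", "John Doe", "John Doe", "Initial commit")
second = Commit("def456", "John Doe", "John Doe", "Second commit", parent_commit="abc999")

print(serialize(initial).decode())
```

The initial commit has no `parent` line, while every later commit carries the hash of its parent, which is what links the history together.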

Why This Choice?

  • Structured Data: Using a structured tuple to represent commit data allows Git to organize and store essential information in a consistent format.

  • Parent Commit Reference: Including a reference to the parent commit allows Git to build a directed acyclic graph (DAG) of commits, representing the entire history of the repository.


4. Branching and HEAD Reference

  • Mathematical Explanation:

    • A Git branch can be viewed as a set of commits. Let B be a branch, and let Commits(B) be the set of commits reachable from the latest commit of B by following parent links.

    • The HEAD reference (HEAD) can be thought of as a pointer to a branch. It signifies the currently checked-out branch. In mathematical terms, it can be represented as HEAD -> B, where B is the active branch.

    • Switching branches involves updating the HEAD reference, effectively changing its target branch: HEAD -> B', where B' is the new branch.

Example:

Consider two branches, master and feature, with their respective commit histories:

Master Branch: A -- B -- C
Feature Branch:     \
                     D -- E

If the feature branch is checked out, the HEAD reference can be represented as:

HEAD -> Feature
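The same picture can be modeled in a few lines of Python. This is an in-memory sketch only: the dictionaries and the helper `commits_in` are hypothetical, and it assumes the feature branch forks off at commit B.

```python
# child -> parent links for the history drawn above
parents = {"A": None, "B": "A", "C": "B", "D": "B", "E": "D"}

# each branch only stores its latest commit (its tip)
branches = {"master": "C", "feature": "E"}

# HEAD names the branch that is currently checked out
HEAD = "feature"


def commits_in(branch):
    """Return Commits(B): everything reachable from the branch tip."""
    result = []
    commit = branches[branch]
    while commit is not None:
        result.append(commit)
        commit = parents[commit]
    return result


print(commits_in(HEAD))      # ['E', 'D', 'B', 'A']
print(commits_in("master"))  # ['C', 'B', 'A']
```

Switching branches is then just `HEAD = "master"`; no commit data is touched.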

Why This Choice?

  • Branch as a Set of Commits: Viewing a branch as the set of commits reachable from its latest commit is mathematically elegant and fits Git's internal data structure, which is essentially a directed acyclic graph (DAG) of commits.

  • HEAD as a Pointer: The concept of HEAD as a pointer to a branch makes it easy to switch between branches and denote the currently active branch.


5. Traversal of File System (os.walk)

In the Git code, the os.walk function is used to traverse the directory structure of the working directory. Here's a mathematical perspective on directory traversal:

  • Mathematical Explanation:

    • Directory traversal can be viewed as a directed graph traversal, where directories and files are nodes, and containment relationships are edges.

    • Let G be a directed graph representing the file system hierarchy, where nodes represent directories or files, and edges represent parent-child relationships.

    • Traversal involves visiting nodes in a specific order, similar to graph traversal algorithms like depth-first search (DFS) or breadth-first search (BFS).

Example:

Consider a directory structure:

Root/
|-- File1.txt
|-- Directory1/
|   |-- File2.txt
|-- Directory2/
|   |-- File3.txt

In graph terms, this structure can be represented as:

G = {
    Nodes: {Root, File1.txt, Directory1, File2.txt, Directory2, File3.txt},
    Edges: {(Root, File1.txt), (Root, Directory1), (Directory1, File2.txt), (Root, Directory2), (Directory2, File3.txt)}
}
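Here is a small sketch of how `os.walk` can produce exactly this node and edge set. The helper name `build_graph` is hypothetical, and the example builds the directory structure in a temporary folder so it can be run anywhere. The root is represented as `"."`.

```python
import os
import tempfile


def build_graph(root):
    """Walk `root` and return (nodes, edges) with paths relative to root."""
    nodes, edges = set(), set()
    for dirpath, dirnames, filenames in os.walk(root):
        parent = os.path.relpath(dirpath, root)  # "." for the root itself
        nodes.add(parent)
        for name in dirnames + filenames:
            child = os.path.normpath(os.path.join(parent, name))
            nodes.add(child)
            edges.add((parent, child))
    return nodes, edges


# Recreate the example layout in a temporary directory.
root = tempfile.mkdtemp()
for rel in ["File1.txt",
            os.path.join("Directory1", "File2.txt"),
            os.path.join("Directory2", "File3.txt")]:
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("demo")

nodes, edges = build_graph(root)
print(len(nodes), len(edges))  # 6 5
```

By default `os.walk` visits directories top-down, so a parent directory is always seen before anything inside it.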

Why This Choice?

  • Graph Representation: Representing the file system as a graph allows for efficient traversal and exploration of directory structures.

  • Graph Algorithms: Concepts from graph theory, like DFS and BFS, provide well-defined strategies for navigating and manipulating the file system.


Functions I Used

## `init()`

- **Purpose:** Initializes a new Git repository in the specified directory.
- **Function Calls:**
  - `os.makedirs()`: Creates necessary directories for the repository.
  - `hash_object()`: Creates the initial tree and commit objects.
  - `create_tree()`: Generates the initial tree structure.
  - `commit()`: Creates the initial commit.

## `create_tree()`

- **Purpose:** Creates a tree object that represents the current state of the working directory.
- **Function Calls:**
  - `Tree()`: Initializes a new tree object.
  - `add_file()`: Adds files and directories to the tree.
  - `hash_object()`: Creates a tree object by hashing the serialized tree data.

## `add(filename, recursive=False)`

- **Purpose:** Stages changes made to files or directories for the next commit.
- **Function Calls:**
  - `add_file()`: Stages a specific file.
  - `create_tree()`: Updates the tree structure.

## `add_file(file_path)`

- **Purpose:** Stages a specific file for the next commit.
- **Function Calls:**
  - `hash_object()`: Hashes the file's content.
  - Updates the staging area.

## `commit(message, author="Anonymous", current_branch="master")`

- **Purpose:** Creates a new commit with the staged changes.
- **Function Calls:**
  - `create_tree()`: Creates the tree object representing the current state.
  - `get_head_commit()`: Retrieves the current branch's latest commit (if it exists).
  - `hash_object()`: Creates a new commit object.
  - Updates the branch reference.

## `get_head_commit(branch="master")`

- **Purpose:** Retrieves the latest commit hash for the specified branch.
- **Function Calls:**
  - Reads the branch reference file.

## `create_branch(branch_name)`

- **Purpose:** Creates a new branch with the given name.
- **Function Calls:**
  - Checks if the branch already exists.
  - Creates a new branch reference.

## `switch_branch(branch_name, current_branch)`

- **Purpose:** Switches to the specified branch.
- **Function Calls:**
  - Checks if the branch exists.

## `branch_exists(branch_name)`

- **Purpose:** Checks if the specified branch exists.
- **Function Calls:**
  - Checks if the branch reference file exists.

## `log(current_branch="master")`

- **Purpose:** Displays the commit history for the current branch or the specified branch.
- **Function Calls:**
  - Reads commit objects and displays commit details.

Working of HEAD:

  1. Branch References:

    In MyGit, branch references are used to keep track of the latest commit in each branch. These branch references are files stored in the refs/heads directory. Each file corresponds to a branch and contains the SHA-1 hash of the latest commit in that branch.

    For example, if you have a branch called "master," there will be a file named refs/heads/master, and its content will be the SHA-1 hash of the latest commit on the "master" branch.

  2. HEAD Reference:

    The HEAD reference points to the currently checked-out branch. In MyGit, this is usually stored in the HEAD file in the main directory of your repository.

    • When you create a new branch (e.g., "feature-branch"), the HEAD reference is updated to point to this new branch. This means you are now on the "feature-branch."

    • When you make a new commit on the current branch, the branch reference for that branch is updated to point to the new commit's hash. HEAD itself does not change, because you are still on the same branch.

    • When you switch branches using commands like switch_branch, the HEAD reference is updated to point to the new branch, and the working directory is updated to match the files in that branch.

    • When you use commands like log, they read the HEAD reference to determine which branch you are currently on, and then they use the branch reference to retrieve the commit history for that branch.
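The behaviour described above can be sketched with plain files. Everything here is a simplified assumption: the helper names and signatures are hypothetical (they take an explicit `repo` path, unlike the functions listed earlier), the `ref: refs/heads/<branch>` format for HEAD is borrowed from real Git, and the commit hashes are placeholders.

```python
import os
import tempfile


def write_file(path, text):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)


def read_file(path):
    with open(path) as f:
        return f.read().strip()


def current_branch(repo):
    # HEAD holds something like "ref: refs/heads/master"
    return read_file(os.path.join(repo, "HEAD")).split("refs/heads/", 1)[1]


def get_head_commit(repo, branch):
    path = os.path.join(repo, "refs", "heads", branch)
    return read_file(path) if os.path.exists(path) else None


def update_branch(repo, branch, commit_hash):
    # Used both to create a branch and to move it after a commit.
    write_file(os.path.join(repo, "refs", "heads", branch), commit_hash)


def switch_branch(repo, branch):
    # Only HEAD changes; the branch files stay as they are.
    write_file(os.path.join(repo, "HEAD"), f"ref: refs/heads/{branch}")


repo = tempfile.mkdtemp()
switch_branch(repo, "master")
update_branch(repo, "master", "abc123")          # first commit on master
update_branch(repo, "feature-branch", "abc123")  # new branch starts at the same commit
switch_branch(repo, "feature-branch")
update_branch(repo, "feature-branch", "def456")  # commit on the feature branch

print(current_branch(repo))             # feature-branch
print(get_head_commit(repo, "master"))  # abc123
```

Committing on "feature-branch" moved only that branch's file; "master" still points at its old commit, and HEAD still names "feature-branch".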
