talkToGIT: An Overview

Introduction
Have you ever joined a new GitHub project and spent days, or even weeks, trying to understand the codebase? Or perhaps you've had to make changes to a feature implemented by a developer who's no longer with the team? As developers, we spend a significant portion of our time reading and understanding code rather than writing it.
Our Capstone Project, talkToGIT, tackles this universal challenge by creating an intelligent question-answering system for codebases. We've combined Generative AI with modern retrieval techniques to build a tool that can answer natural-language questions about any GitHub repository.
The Problem
Modern software projects are complex:
Large codebases often span hundreds of files and thousands of lines of code
Documentation may be sparse or outdated
For student developers and contributors, understanding repositories where code and implementation details are scattered across multiple files and folders is overwhelming
These issues discourage students from contributing to open-source projects and significantly reduce productivity. Our goal is to drastically cut this ramp-up time by creating an AI assistant that can instantly answer questions about code implementation.
Our Solution: talkToGIT
Our solution creates an easy interface between developers and repositories through natural language. Here's how it works:
Repository Analysis: We clone a GitHub repository and extract all its code files
Code Understanding: We analyze the metadata, structure, functions, classes, and dependencies of the code files (currently, we provide this analysis for Python files)
AI-Powered Explanations: Each file, regardless of programming language, is converted to text and sent to Gemini, which interprets and explains it
Vector Database Storage: Each generated explanation is embedded and stored in a vector database
Intelligent Retrieval and Comprehensive Answers: When a user asks a question, we find the most relevant code explanations. The system generates detailed, contextual responses based on the retrieved information
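The ingestion side of these steps can be sketched as a small orchestration loop. This is an illustrative outline, not our exact code: `explain` and `store` are hypothetical stand-ins for the Gemini call and the vector-database write.

```python
def ingest_repository(file_contents, explain, store):
    """Sketch of the ingestion pipeline: explain each file with the model,
    then store the explanation under an id derived from the file name."""
    for i, (file_name, code) in enumerate(file_contents.items()):
        explanation = explain(code)  # in the real system, a Gemini call
        store(doc_id=f"{file_name}_{i}", text=explanation, file_name=file_name)
```

Injecting the model call and the storage write as parameters keeps the pipeline testable without API keys.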
Gen AI Capabilities We Used
1. Document Understanding
Our solution demonstrates advanced document understanding by processing entire code files as documents with specific structure and meaning. The system doesn't just treat code as text: it comprehends the structural and functional elements within it. For Python files, we extract structured information about functions, classes, and imports, while also capturing metadata and relationships between files:
def extract_top_level_comments_and_docstrings(file_path):
    comments = []
    docstring = None
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    # Process comments and docstrings...
    return {'docstring': docstring, 'comments': comments}
This allows our system to understand the purpose and functionality of code blocks, relationships between different components, and answer questions that need information from multiple parts of the codebase. The Gemini model then generates natural language explanations that capture the semantic meaning and functional purpose of each file.
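The structural extraction described above can be done entirely with Python's standard ast module. The function name `analyze_python_file` matches the one referenced in our example response later, but this body is an illustrative reconstruction, not the exact implementation:

```python
import ast

def analyze_python_file(source: str) -> dict:
    """Walk the AST of a Python source string and collect the names of
    top-level structural elements: functions, classes, and imports."""
    tree = ast.parse(source)
    functions, classes, imports = [], [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            functions.append(node.name)
        elif isinstance(node, ast.ClassDef):
            classes.append(node.name)
        elif isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.append(node.module)
    return {"functions": functions, "classes": classes, "imports": imports}
```

Because `ast.parse` builds a real syntax tree rather than matching text, this survives formatting differences that would defeat regex-based extraction.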
2. Few-Shot Prompting
Our implementation leverages few-shot prompting to dramatically improve the quality and consistency of code explanations. Rather than simply asking the model to explain code, we provide carefully crafted examples of well-explained code snippets:
prompt0 = """
You are a code interpreter. Describe what the code given by the user does in simple, detailed English. Use points when necessary. Include function names, logic, purpose, and any key structures in as much detail as possible. Only output the "Explanation" part. Nothing less, Nothing more.
"""

example1 = f"""
code:
{code1}
Explanation:
This code defines a function called getHeight that calculates the height of a binary tree…
"""
By providing these examples, we "teach" the model our expected format and depth of explanation. Each code file is then processed with these examples as context:
prompt = prompt0 + example1 + example2 + f"""
code:
{code_content}
Explanation:
"""
This few-shot approach yields consistent, detailed explanations across file types, so our system builds a coherent understanding of the codebase even when the code implementations are very diverse.
3. Embeddings
We have used Google's text-embedding-004 model, which transforms code explanations into high-dimensional vectors that capture semantic meaning. This approach allows us to capture the semantic essence of code explanations, making it possible to find relevant information even when queries don't use the exact terminology present in the codebase.
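Retrieval over these vectors comes down to measuring how close two embeddings point in the same direction. ChromaDB handles this internally, but the underlying comparison is essentially cosine similarity, sketched here with the standard library only (the vectors shown are toy stand-ins, not real text-embedding-004 outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors:
    1.0 means identical direction, 0.0 means unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Because similarity is measured on meaning-bearing vectors rather than keywords, a query like "how is tree depth computed?" can match an explanation that only ever says "height of a binary tree".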
4. Vector Database/Storage
We used ChromaDB to store and efficiently query our embeddings:
embed_func = GeminiEmbeddingFunction()
embed_func.document_mode = True
chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=db_name, embedding_function=embed_func)
For each file in the repository, we generate AI explanations and store them with metadata:
db.add(documents=[response.text], ids=[document_id], metadatas=[{"file_name": file_name}])
This approach allows for fast similarity searches when users ask questions.
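The `document_mode` flag toggled above (and again at query time) is how one embedding function serves both indexing and search. Here is a sketch of that mode-switching logic; unlike the real class, which calls text-embedding-004 directly, this version takes the embedding call as a constructor argument so the sketch runs without an API key:

```python
class GeminiEmbeddingFunction:
    """Sketch of a ChromaDB-compatible embedding function: a __call__
    that maps a list of texts to a list of vectors, with a flag that
    selects the task type for documents vs. queries."""
    def __init__(self, embed):
        self.embed = embed          # callable(text, task_type) -> vector
        self.document_mode = True   # True while indexing, False while querying
    def __call__(self, input):
        task = "retrieval_document" if self.document_mode else "retrieval_query"
        return [self.embed(text, task) for text in input]
```

Distinguishing the two task types lets the embedding model optimize documents and queries for retrieval against each other rather than treating both identically.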
5. Retrieval Augmented Generation (RAG)
The real magic happens when a user asks a question. We transform their query into the same vector space and retrieve the most relevant code explanations:
def query_code_database(query_text, n_results=15):
    embed_func.document_mode = False
    result = db.query(query_texts=[query_text], n_results=n_results)
    return result
Then, we use these retrieved explanations to generate a comprehensive answer using Gemini. This RAG approach ensures that responses are grounded in the actual codebase while benefiting from the language capabilities of large language models.
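The grounding step can be sketched as a prompt-assembly helper. `build_answer_prompt` is a hypothetical name for illustration; the source describes the idea (answer from retrieved explanations) without showing this exact code:

```python
def build_answer_prompt(question, retrieved):
    """Assemble retrieved (explanation, metadata) pairs into a single
    grounded prompt for the final answer-generation call."""
    context = "\n\n".join(
        f"[{meta['file_name']}]\n{doc}" for doc, meta in retrieved
    )
    return (
        "You are a codebase assistant. Answer the question using only "
        "the file explanations below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Labeling each explanation with its file name is what lets the final answer point users to the exact files where functionality lives.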
RAG Implementation Highlights
We have decided to focus on understanding the code rather than just code search. Here's how we processed each file in the repository:
for i, file_name in enumerate(files):
    file_path = os.path.join(input_dir, file_name)
    code_content = read_code_from_file(file_path)
    prompt = prompt0 + example1 + example2 + f"""
code: {code_content}
Explanation:
"""
    response = chat.send_message(prompt)
    document_id = f"{file_name}_{i}"
    db.add(documents=[response.text], ids=[document_id], metadatas=[{"file_name": file_name}])
This implements the core of our RAG pipeline: we create AI-generated explanations of each code file and store them in a vector database for later retrieval when a user asks a question. It's essentially the "knowledge ingestion" phase of the system. We used few-shot learning, providing example code explanations to help the model generate high-quality interpretations of each file.
Results and Impact
The results are very promising. talkToGIT can now:
Answer complex questions about implementation details and metadata in seconds
Point developers to the exact files where specific functionality is implemented
Explain architectural choices encoded in the repository structure
Here's an example of how our system responds to a user query:
Query: "How does the system analyze Python files?"
Response: The system analyzes Python files using the ast (Abstract Syntax Tree) module
to parse and extract structural information. In the code_analyzer.py file, the analyze_python_file function:
1. Opens and reads the Python file
2. Parses the code into an AST using ast.parse()
3. Walks through the AST to identify:
- Function definitions (storing their names)
- Class definitions (storing their names)
- Import statements (tracking imported modules)
4. Returns a structured dictionary with this information
This analysis provides a comprehensive view of each file's structure and components, which is then enriched with AI-generated explanations and stored in the vector database for efficient retrieval.
Additionally, for Python files, several dedicated functions extract further information, including classes, modules, and dependencies. For example, extract_top_level_comments_and_docstrings pulls documentation from a file's header section.
Future Work
While our current implementation delivers significant value, we recognize several areas for improvement:
Code Change Awareness: The system currently provides a static view of the codebase. We plan to implement continuous integration to update explanations as code evolves.
Language Coverage: We've optimized for Python code analysis but aim to expand robust support for other programming languages.
Self-Improvement: We're exploring ways for the system to refine its understanding through user feedback and corrections.
Dual Embedding: Currently, we embed natural language descriptions of each of the code files. We plan to extend this approach to embed both the code files themselves along with their natural language descriptions to make the system more robust.
Conclusion
talkToGIT shows how Gen AI can help developers by bridging the gap between natural language and code understanding. We have combined embeddings, vector storage, and Retrieval-Augmented Generation to build a tool that makes unfamiliar code far less daunting. We believe tools like this will become essential components of the developer toolkit, letting engineers focus more on creative problem-solving and less on the nitty-gritty of implementation.
This project shows just one application of Gen AI in the software development lifecycle, but we believe it is of paramount importance. We're excited about the potential to enhance developer productivity and make codebases more accessible to everyone. Feel free to reach out with questions or collaboration opportunities!
Written by Sravani Bobba