It’s been over six weeks since I started my Outreachy internship at Mozilla Firefox, where I've been working on improving search functionality for PerfCompare. This experience has been more than implementing a new feature—it has been about overcoming challenges, learning new tools, and adapting goals. Here's a quick update on my progress, the roadblocks faced, and how persistence and great mentorship led to a scalable solution.

Project Overview

The goal: Allow users to search for content related to patches such as bug number, summary, author, and so on.

Original Plan:

The original plan for my internship included the following key milestones:

Change the Database Schema: Add a field to the database to hold all the searchable data.
Update the Push Ingestion Script: Modify the script to populate the new database field.
Enhance the Querying API: Ensure the API utilizes this new searchable field while handling entries where the field might be empty.
Update the Frontend: Implement changes to the frontend to consume the updated API.

Understanding Full-Text Search

A full-text search is a comprehensive search method that compares every word of the search request against every word within the document or database. It’s particularly useful for searching large text fields or multiple fields at once.

In PostgreSQL, full-text search can be implemented using:

SearchVector: Combines multiple fields (e.g., revision, author, and comments) into a single vector that can be queried.
SearchQuery converts the user’s search input into a query object that is compared against the SearchVector

For example, if a user searches for revision, author, or comments, these terms can all be matched against a SearchVector containing the fields revision, author, and comments.

Implementation:

   search_param = filter_params.get("search")
        if search_param:       
            pushes = pushes.annotate(
                  search=SearchVector(
                  "revision", "author", "comments", config="english"
                )
            ).filter(search=SearchQuery(search_param, config="english"))

This works for small datasets but becomes slow at scale.

First Attempt: Adding a Dedicated `search_vector` Column

The initial approach involved adding a search_vector field to the Commit model:

class Commit(models.Model):
    """
    A single commit in a push
    """

    push = models.ForeignKey(Push, on_delete=models.CASCADE, related_name="commits")
    revision = models.CharField(max_length=40, db_index=True)
    author = models.CharField(max_length=150)
    comments = models.TextField()
    search_vector = SearchVectorField(null)

    # Populate search_vector on save  
    def update_search_vector(self):  
        self.search_vector = SearchVector('revision', 'author', 'comments')  

    def save(self, *args, **kwargs):  
        self.update_search_vector()  
        super().save(*args, **kwargs)

The search query implementation looked like this:

search_term = filter_params.get("search")
        if search_term:
            search_query = SearchQuery(search_term)  
            pushes =    pushes.filter(commits__search_vector=search_query)

The search_term can be any general search input (author, bug number, hash, summary, etc.). The commits__search_vector is used to search across the combined fields that are part of the SearchVector.

Issue Encountered

However, I faced an issue where the search_vector field remained empty when a new commit was added, even though the logic put in place should have populated it. After reaching out to my mentors for guidance, it was pointed out that the approach of using an additional column for search_vector could add unnecessary overhead in the database because it requires additional space and constant updates to the column when the data changes.

The Improved Approach: GIN Indexes and Dynamic SearchVectors

Instead of persisting search_vector, we index the fields directly using a GIN (Generalized Inverted Index).

Why Use GIN Indexes?

Speed: GIN indexes optimize full-text search by avoiding sequential scans.
Efficiency: No extra storage—vectors are computed on the fly during queries

GIN Index Implementation:

 class Meta:
        db_table = "commit"
        unique_together = ("push", "revision")
        indexes = [
            GinIndex(
                SearchVector('revision', 'author', 'comments', config='english'),
                name='search_vector_idx',
            ),
        ]

To use the GIN index effectively, the search query must reference the exact fields indexed in the SearchVector

Implementation:

The GIN index lives on the Commit model, so we first search within Commit records. The search returns push_id values from matching commits, which we then use to filter the Push queryset.

   search_param = filter_params.get("search")
        if search_param:
            filtered_commits = Commit.objects.annotate(
             # Fields MUST match the GIN index's fields (revision, author, comments)
                search=SearchVector("revision", "author", "comments", config="english")
            ).filter(
                search=SearchQuery(search_param, config="english")
            ).values_list("push_id", flat=True)
            pushes = pushes.filter(id__in=filtered_commits)

This allows flexible and efficient searches without modifying the database schema.

**Update*: Optimizing the Search Approach**

After further refinement, we updated our search implementation to improve efficiency and relevance. The initial approach effectively leveraged the GIN index but lacked repository-specific filtering and returned all matching results, which could impact performance.

Our final implementation introduces the following improvements:

Repository Filtering
- We now filter commits by push__repository=repository, ensuring the search is scoped to a specific repository.
- This reduces unnecessary database scans and improves query precision.
Sorting by Most Recent
- Results are ordered by -push__time to prioritize the latest commits.
- This makes the most relevant updates appear first.
Limiting Results to 200
- We cap the results at 200 using [:200] to prevent performance degradation when handling large datasets.
- This ensures a balance between efficiency and comprehensive results.

Here’s the final implementation:

search_param = filter_params.get("search")
        if search_param:
            repository = Repository.objects.get(name=project)
            filtered_commits = (
                Commit.objects.annotate(
                    search=SearchVector("revision", "author", "comments", config="english")
                )
                .filter(
                    search=SearchQuery(search_param, config="english"),
                    push__repository=repository,
                )
                .values_list("push_id", flat=True)
                # Get most recent results and limit result to 200
                .order_by("-push__time")
                .distinct()[:200]
            )
            pushes = pushes.filter(id__in=filtered_commits)

Challenges encountered

Initially, I tried querying from the Push model:


pushes = Push.objects.annotate(
    search_vector=SearchVector(
        'commits__revision', 'commits__author', 'commits__comments', config='english'
    )
).filter(search_vector=SearchQuery(search_param, config='english'))

Why This Failed:

The GIN index was defined on Commit fields (revision, author, comments), but the query used commits__revision, etc., through the Push model’s reverse relationship.
PostgreSQL couldn’t use the index because the SearchVector in the query did not match the indexed fields exactly.

The Fix: Query Commit directly, then map results to Push.

Final Outcome

By using SearchVector with a GIN index:

We achieve efficient full-text search across multiple fields.
We avoid adding additional fields to the database schema.
We reduce the performance overhead associated with large datasets.

Remaining Tasks

Write thorough tests to ensure functionality and performance.
Integrate the updated search functionality with the frontend.

Conclusion

This internship has been a challenging yet rewarding journey. Implementing, researching, and receiving feedback hasn’t been easy, but it has been incredibly fulfilling. I’ve learned so much, and persistence has truly been my backbone through it all.

Here are some of the resources that guided me during the implementation:

I’m excited to see how the final implementation will enhance PerfCompare’s user search experience!

Optimizing Full-Text Search in Django: My Outreachy Internship Journey at Firefox

Project Overview

Understanding Full-Text Search

First Attempt: Adding a Dedicated `search_vector` Column

Issue Encountered

The Improved Approach: GIN Indexes and Dynamic SearchVectors

**Update*: Optimizing the Search Approach**

Challenges encountered

Final Outcome

Remaining Tasks

Conclusion

Subscribe to my newsletter

Chineta Adinnu

Chineta Adinnu

Optimizing Full-Text Search in Django: My Outreachy Internship Journey at Firefox

Project Overview

Understanding Full-Text Search

First Attempt: Adding a Dedicated search_vector Column

Issue Encountered

The Improved Approach: GIN Indexes and Dynamic SearchVectors

Update*: Optimizing the Search Approach

Challenges encountered

Final Outcome

Remaining Tasks

Conclusion

Subscribe to my newsletter

Chineta Adinnu

Chineta Adinnu

First Attempt: Adding a Dedicated `search_vector` Column

**Update*: Optimizing the Search Approach**