Optimizing Full-Text Search in Django: My Outreachy Internship Journey at Firefox

Chineta AdinnuChineta Adinnu
5 min read

It’s been over six weeks since I started my Outreachy internship at Mozilla Firefox, where I've been working on improving search functionality for PerfCompare. This experience has been more than implementing a new feature—it has been about overcoming challenges, learning new tools, and adapting goals. Here's a quick update on my progress, the roadblocks faced, and how persistence and great mentorship led to a scalable solution.

Project Overview

The goal: Allow users to search for content related to patches such as bug number, summary, author, and so on.

Original Plan:

The original plan for my internship included the following key milestones:

  • Change the Database Schema: Add a field to the database to hold all the searchable data.

  • Update the Push Ingestion Script: Modify the script to populate the new database field.

  • Enhance the Querying API: Ensure the API utilizes this new searchable field while handling entries where the field might be empty.

  • Update the Frontend: Implement changes to the frontend to consume the updated API.

A full-text search is a comprehensive search method that compares every word of the search request against every word within the document or database. It’s particularly useful for searching large text fields or multiple fields at once.

In PostgreSQL, full-text search can be implemented using:

  • SearchVector: Combines multiple fields (e.g., revision, author, and comments) into a single vector that can be queried.

  • SearchQuery converts the user’s search input into a query object that is compared against the SearchVector

For example, if a user searches for revision, author, or comments, these terms can all be matched against a SearchVector containing the fields revision, author, and comments.

Implementation:

   search_param = filter_params.get("search")
        if search_param:       
            pushes = pushes.annotate(
                  search=SearchVector(
                  "revision", "author", "comments", config="english"
                )
            ).filter(search=SearchQuery(search_param, config="english"))

This works for small datasets but becomes slow at scale.

First Attempt: Adding a Dedicated search_vector Column

The initial approach involved adding a search_vector field to the Commit model:

class Commit(models.Model):
    """
    A single commit in a push
    """

    push = models.ForeignKey(Push, on_delete=models.CASCADE, related_name="commits")
    revision = models.CharField(max_length=40, db_index=True)
    author = models.CharField(max_length=150)
    comments = models.TextField()
    search_vector = SearchVectorField(null)

    # Populate search_vector on save  
    def update_search_vector(self):  
        self.search_vector = SearchVector('revision', 'author', 'comments')  

    def save(self, *args, **kwargs):  
        self.update_search_vector()  
        super().save(*args, **kwargs)

The search query implementation looked like this:

search_term = filter_params.get("search")
        if search_term:
            search_query = SearchQuery(search_term)  
            pushes =    pushes.filter(commits__search_vector=search_query)

The search_term can be any general search input (author, bug number, hash, summary, etc.). The commits__search_vector is used to search across the combined fields that are part of the SearchVector.

Issue Encountered

However, I faced an issue where the search_vector field remained empty when a new commit was added, even though the logic put in place should have populated it. After reaching out to my mentors for guidance, it was pointed out that the approach of using an additional column for search_vector could add unnecessary overhead in the database because it requires additional space and constant updates to the column when the data changes.

The Improved Approach: GIN Indexes and Dynamic SearchVectors

Instead of persisting search_vector, we index the fields directly using a GIN (Generalized Inverted Index).

Why Use GIN Indexes?

  • Speed: GIN indexes optimize full-text search by avoiding sequential scans.

  • Efficiency: No extra storage—vectors are computed on the fly during queries

GIN Index Implementation:

 class Meta:
        db_table = "commit"
        unique_together = ("push", "revision")
        indexes = [
            GinIndex(
                SearchVector('revision', 'author', 'comments', config='english'),
                name='search_vector_idx',
            ),
        ]

To use the GIN index effectively, the search query must reference the exact fields indexed in the SearchVector

Implementation:

The GIN index lives on the Commit model, so we first search within Commit records. The search returns push_id values from matching commits, which we then use to filter the Push queryset.

   search_param = filter_params.get("search")
        if search_param:
            filtered_commits = Commit.objects.annotate(
             # Fields MUST match the GIN index's fields (revision, author, comments)
                search=SearchVector("revision", "author", "comments", config="english")
            ).filter(
                search=SearchQuery(search_param, config="english")
            ).values_list("push_id", flat=True)
            pushes = pushes.filter(id__in=filtered_commits)

This allows flexible and efficient searches without modifying the database schema.

Challenges encountered

Initially, I tried querying from the Push model:


pushes = Push.objects.annotate(
    search_vector=SearchVector(
        'commits__revision', 'commits__author', 'commits__comments', config='english'
    )
).filter(search_vector=SearchQuery(search_param, config='english'))

Why This Failed:

  • The GIN index was defined on Commit fields (revision, author, comments), but the query used commits__revision, etc., through the Push model’s reverse relationship.

  • PostgreSQL couldn’t use the index because the SearchVector in the query did not match the indexed fields exactly.

The Fix: Query Commit directly, then map results to Push.

Final Outcome

By using SearchVector with a GIN index:

  • We achieve efficient full-text search across multiple fields.

  • We avoid adding additional fields to the database schema.

  • We reduce the performance overhead associated with large datasets.

Remaining Tasks

  • Verify indexing is working as intended.

  • Optimize Subquery Performance for Large Datasets(add ordering and limit results)

  • Write thorough tests to ensure functionality and performance.

  • Integrate the updated search functionality with the frontend.

Conclusion

This internship has been a challenging yet rewarding journey. Implementing, researching, and receiving feedback hasn’t been easy, but it has been incredibly fulfilling. I’ve learned so much, and persistence has truly been my backbone through it all.

Here are some of the resources that guided me during the implementation:

I’m excited to see how the final implementation will enhance PerfCompare’s user search experience!

2
Subscribe to my newsletter

Read articles from Chineta Adinnu directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Chineta Adinnu
Chineta Adinnu

Hi, I’m Chineta Adinnu! I’m a frontend developer with a passion for creating dynamic and user-friendly web experiences. On this blog, I share my journey in frontend development, from learning new technologies to implementing them in real projects. I dive into frameworks like React and Next.js, explore state management with Redux Toolkit, and offer practical tips based on my hands-on experience. My goal is to provide valuable insights and tutorials that can help others navigate the ever-evolving world of web development. Join me as I document my learning process, share challenges and solutions, and offer guidance on the latest frontend technologies. If you’re interested in tech or looking for practical advice, you’re in the right place! Feel free to connect if you have questions or want to discuss tech! Check out some of my articles on Medium: https://medium.com/@chinetaadinnu."