Optimizing Full-Text Search in Django: My Outreachy Internship Journey at Firefox
data:image/s3,"s3://crabby-images/26a17/26a17bb8dc8564e6f91ea6c2039c375681457890" alt="Chineta Adinnu"
data:image/s3,"s3://crabby-images/8b3f7/8b3f72926ce29a6fee1127da6569d11c10c20c06" alt=""
It’s been over six weeks since I started my Outreachy internship at Mozilla Firefox, where I've been working on improving search functionality for PerfCompare. This experience has been more than implementing a new feature—it has been about overcoming challenges, learning new tools, and adapting goals. Here's a quick update on my progress, the roadblocks faced, and how persistence and great mentorship led to a scalable solution.
Project Overview
The goal: Allow users to search for content related to patches such as bug number, summary, author, and so on.
Original Plan:
The original plan for my internship included the following key milestones:
Change the Database Schema: Add a field to the database to hold all the searchable data.
Update the Push Ingestion Script: Modify the script to populate the new database field.
Enhance the Querying API: Ensure the API utilizes this new searchable field while handling entries where the field might be empty.
Update the Frontend: Implement changes to the frontend to consume the updated API.
Understanding Full-Text Search
A full-text search is a comprehensive search method that compares every word of the search request against every word within the document or database. It’s particularly useful for searching large text fields or multiple fields at once.
In PostgreSQL, full-text search can be implemented using:
SearchVector
: Combines multiple fields (e.g.,revision
,author
, andcomments
) into a single vector that can be queried.SearchQuery
converts the user’s search input into a query object that is compared against theSearchVector
For example, if a user searches for revision, author, or comments, these terms can all be matched against a SearchVector
containing the fields revision
, author
, and comments
.
Implementation:
search_param = filter_params.get("search")
if search_param:
pushes = pushes.annotate(
search=SearchVector(
"revision", "author", "comments", config="english"
)
).filter(search=SearchQuery(search_param, config="english"))
This works for small datasets but becomes slow at scale.
First Attempt: Adding a Dedicated search_vector
Column
The initial approach involved adding a search_vector
field to the Commit
model:
class Commit(models.Model):
"""
A single commit in a push
"""
push = models.ForeignKey(Push, on_delete=models.CASCADE, related_name="commits")
revision = models.CharField(max_length=40, db_index=True)
author = models.CharField(max_length=150)
comments = models.TextField()
search_vector = SearchVectorField(null)
# Populate search_vector on save
def update_search_vector(self):
self.search_vector = SearchVector('revision', 'author', 'comments')
def save(self, *args, **kwargs):
self.update_search_vector()
super().save(*args, **kwargs)
The search query implementation looked like this:
search_term = filter_params.get("search")
if search_term:
search_query = SearchQuery(search_term)
pushes = pushes.filter(commits__search_vector=search_query)
The search_term
can be any general search input (author, bug number, hash, summary, etc.). The commits__search_vector
is used to search across the combined fields that are part of the SearchVector
.
Issue Encountered
However, I faced an issue where the search_vector
field remained empty when a new commit was added, even though the logic put in place should have populated it. After reaching out to my mentors for guidance, it was pointed out that the approach of using an additional column for search_vector
could add unnecessary overhead in the database because it requires additional space and constant updates to the column when the data changes.
The Improved Approach: GIN Indexes and Dynamic SearchVectors
Instead of persisting search_vector
, we index the fields directly using a GIN (Generalized Inverted Index).
Why Use GIN Indexes?
Speed: GIN indexes optimize full-text search by avoiding sequential scans.
Efficiency: No extra storage—vectors are computed on the fly during queries
GIN Index Implementation:
class Meta:
db_table = "commit"
unique_together = ("push", "revision")
indexes = [
GinIndex(
SearchVector('revision', 'author', 'comments', config='english'),
name='search_vector_idx',
),
]
To use the GIN index effectively, the search query must reference the exact fields indexed in the SearchVector
Implementation:
The GIN index lives on the Commit
model, so we first search within Commit
records. The search returns push_id
values from matching commits, which we then use to filter the Push
queryset.
search_param = filter_params.get("search")
if search_param:
filtered_commits = Commit.objects.annotate(
# Fields MUST match the GIN index's fields (revision, author, comments)
search=SearchVector("revision", "author", "comments", config="english")
).filter(
search=SearchQuery(search_param, config="english")
).values_list("push_id", flat=True)
pushes = pushes.filter(id__in=filtered_commits)
This allows flexible and efficient searches without modifying the database schema.
Challenges encountered
Initially, I tried querying from the Push
model:
pushes = Push.objects.annotate(
search_vector=SearchVector(
'commits__revision', 'commits__author', 'commits__comments', config='english'
)
).filter(search_vector=SearchQuery(search_param, config='english'))
Why This Failed:
The GIN index was defined on
Commit
fields (revision
,author
,comments
), but the query usedcommits__revision
, etc., through thePush
model’s reverse relationship.PostgreSQL couldn’t use the index because the
SearchVector
in the query did not match the indexed fields exactly.
The Fix: Query Commit
directly, then map results to Push
.
Final Outcome
By using SearchVector
with a GIN index:
We achieve efficient full-text search across multiple fields.
We avoid adding additional fields to the database schema.
We reduce the performance overhead associated with large datasets.
Remaining Tasks
Verify indexing is working as intended.
Optimize Subquery Performance for Large Datasets(add ordering and limit results)
Write thorough tests to ensure functionality and performance.
Integrate the updated search functionality with the frontend.
Conclusion
This internship has been a challenging yet rewarding journey. Implementing, researching, and receiving feedback hasn’t been easy, but it has been incredibly fulfilling. I’ve learned so much, and persistence has truly been my backbone through it all.
Here are some of the resources that guided me during the implementation:
Stack Overflow: Indexing a SearchVector vs. Having a SearchVectorField
Stack Overflow: Use of Full-Text Search GIN in a View (Django)
I’m excited to see how the final implementation will enhance PerfCompare’s user search experience!
Subscribe to my newsletter
Read articles from Chineta Adinnu directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
data:image/s3,"s3://crabby-images/26a17/26a17bb8dc8564e6f91ea6c2039c375681457890" alt="Chineta Adinnu"
Chineta Adinnu
Chineta Adinnu
Hi, I’m Chineta Adinnu! I’m a frontend developer with a passion for creating dynamic and user-friendly web experiences. On this blog, I share my journey in frontend development, from learning new technologies to implementing them in real projects. I dive into frameworks like React and Next.js, explore state management with Redux Toolkit, and offer practical tips based on my hands-on experience. My goal is to provide valuable insights and tutorials that can help others navigate the ever-evolving world of web development. Join me as I document my learning process, share challenges and solutions, and offer guidance on the latest frontend technologies. If you’re interested in tech or looking for practical advice, you’re in the right place! Feel free to connect if you have questions or want to discuss tech! Check out some of my articles on Medium: https://medium.com/@chinetaadinnu."