MongoDB Indexing Explained: Types, Best Practices & Performance Tips

When searching in a vast database, it is incredibly slow and inefficient to search the data entries one by one. The searching is done similarly to how linear search works in an array. We needed to introduce something that would help optimize searching in the database.

What is Indexing?

Indexing is a technique that is used to organize the data in such a way that it is easy for the system to search the records in the database. Just like how Binary Search helps in optimizing the searching indexing does the same. And what does it require? It needs the data to be sorted.

Similarly, the indexing technique optimizes searching by creating a new data structure that stores pointers to the records, organized according to the sorted values of the selected attribute.

Key Points About Indexing:

We do not create indexes for all attributes in a collection. Instead, indexes are created only for selected attributes that are likely to be frequently queried.
Indexes are typically stored in a B-tree or other tree-like data structures, allowing for efficient search, insertion, and deletion operations.

How Does Indexing Work?

When we apply an index to a field, it stores the index key and pointer document to the collection. These keys are stored in a sorted manner. So whenever a search query is performed using that particular field, an index scan is performed instead of a linear scan, just like how binary search works. Thus we get an optimized search through the B-tree.

Let's see how we can implement indexing in our database

Before proceeding further I will encourage you to go to open your terminal and type mongosh to enter the mongodb database. and try to code along for practical understanding.

Copy the dummy data into the database. The data below was generated using ChatGPT.

Enter any database and run the following query:

db.users.insertMany([
  {
    "userId": "U001",
    "name": "John Doe",
    "email": "johndoe@example.com",
    "age": 28,
    "gender": "male",
    "status": "active"
  },
  {
    "userId": "U002",
    "name": "Jane Smith",
    "email": "janesmith@example.com",
    "age": 34,
    "gender": "female",
    "status": "inactive"
  },
  {
    "userId": "U003",
    "name": "Alice Brown",
    "email": "alicebrown@example.com",
    "age": 23,
    "gender": "female",
    "status": "active"
  },
  {
    "userId": "U004",
    "name": "Bob Johnson",
    "email": "bobjohnson@example.com",
    "age": 40,
    "gender": "male",
    "status": "active"
  },
  {
    "userId": "U005",
    "name": "Charlie Davis",
    "email": "charliedavis@example.com",
    "age": 27,
    "gender": "male",
    "status": "inactive"
  },
  {
    "userId": "U006",
    "name": "Diana Evans",
    "email": "dianaevans@example.com",
    "age": 32,
    "gender": "female",
    "status": "active"
  },
  {
    "userId": "U007",
    "name": "Ethan Green",
    "email": "ethangreen@example.com",
    "age": 25,
    "gender": "male",
    "status": "inactive"
  },
  {
    "userId": "U008",
    "name": "Fiona Harris",
    "email": "fionaharris@example.com",
    "age": 29,
    "gender": "female",
    "status": "active"
  },
  {
    "userId": "U009",
    "name": "George Knight",
    "email": "georgeknight@example.com",
    "age": 36,
    "gender": "male",
    "status": "active"
  },
  {
    "userId": "U010",
    "name": "Hannah Lee",
    "email": "hannahlee@example.com",
    "age": 22,
    "gender": "female",
    "status": "inactive"
  }
])

Before creating indexing I want you to run this query on the Mongo playground, and notice the output that it generates,

db.users.find().explain("executionStats")

You will now receive a substantial object containing statistics about the query execution performed by MongoDB.

The explain() method is used to get statistics about the query. If we pass the executionStats parameter, it provides an additional object called executionStages, which helps us understand the search efficiency in depth.

The docExamined key shows the number of documents scanned. Since we provided no condition for the selection, it scanned all the documents. The nReturned key tells us how many records were returned that match the condition provided.

Firstly let’s perform a simple search age < 30 along with the explain() method. You will notice that the docExamined is 10, which is correct according to our understanding since it will search all the documents in the collection.

db.users.find({ age: { "$lte": 30 }}).explain("executionStats")

You will notice that 6 records are returned to us, which are below the age of 30, and the total documents scanned is 10, which is expected since we have a total of 10 documents.

Now, we will understand how creating indexes can help optimize the search.

Run the following query to create an index on the age field.

db.users.createIndex({"age": 1})

This will create an index on the age field and return the name of the index created. The value 1 here means that the index will be sorted in ascending order.

You can also view the existing indexes by running:

db.users.getIndexes()

You can see that we have two existing indexes: one that we created on the age field and another on id that was created by MongoDB by default.

NOTE: MongoDB does not provide a feature to view the index collection directly, but we can observe its effects. You can also delete an index by running db.collectionName.dropIndex("indexName").

Now, let’s run the same query to find the users with an age below 30.

db.users.find({ age: { "$lte": 30 }}).explain("executionStats")

Now you will be the power of indexing,

You can clearly see that instead of a Collection Scan, an Index Scan is performed.

You can clearly observe that now MongoDB scanned 6 documents and returned 6 documents, meaning only the required documents are searched and returned to us. Unlike previously, when MongoDB was scanning every document. Currently, we only have 10 documents, but imagine if we had millions of documents indexing would greatly help in searching for the documents.

The Drawbacks of Indexing

Indexes should be created wisely. While they help speed up query performance, there are some drawbacks:

Increased Storage: Indexes consume additional storage because they store pointers and index keys separately.
Decreased Write Speed: Since the index needs to be updated along with the data in the collection, write operations (like insertions, updates, and deletions) become slower as both the collection and the index B-tree need to be modified.

When Not to Use Indexing?

When the collection is Small
When the collection is Frequently Updated
Queries are Complex (Multiple Fields)
The collection is Large (make fewer indexes)

I think by now, you might have started to understand how indexing actually helps in query optimization. If you understand this basic concept, then you are ready to implement it in your own projects. But we are not done yet. We will dive deeper into indexing, understand everything about it, and explore how we can use indexing in more complex projects.

Bonus: Indexing not only speeds up data retrieval but also helps with sorting. Most databases use B-tree indexes, which store values in a sorted order by default. So, when you run a query to fetch records in a sorted manner (like ORDER BY), the database doesn’t have to do extra work, it simply walks through the already sorted index.

Compound Indexing:

Compound indexing is when we create an index on multiple fields together. Before proceeding, please delete the index we created on the age field by running db.users.dropIndex('age_1').

Now, let’s create an index on both the age and gender fields. And well ~~size-matters~~ 😏— order matters!

db.users.createIndex({"age": 1, "gender" :1})

Now, let’s understand how this compound index is stored. MongoDB will first sort the documents by the age field. Within each group of documents with the same age, the documents will then be sorted by the gender field. This allows MongoDB to efficiently search for documents based on both age and gender in the order they are indexed.

Search using both age and gender:
This triggers an Index Scan using the compound index, ensuring an efficient search.
Search using age only:
This will also trigger an Index Scan, as age is part of the compound index.
Search using gender only:
This will result in a Collection Scan, since gender is not indexed by itself.

By running these queries, you'll notice differences in performance, as MongoDB will choose different types of scans (Index Scan vs. Collection Scan) depending on the fields queried. You can observe the difference in the stage key inside the executionStages. It will have two possible values: IXSCAN (Index Scan) and COLLSCAN (Collection Scan).

db.users.find({age: {$gte: 27}, gender: "male"}).explain("executionStats")
db.users.find({age: {$gte: 27}}).explain("executionStats")
db.users.find({gender: "male"}).explain("executionStats")

Additional concepts in indexing

Partial Filter Expression:

In a situation where you want to create an index only for a limited set of data or exclude certain data, you can use a partialFilterExpression. This will include only the records that pass the filter for the creation of the index.

For example, suppose you have a list of teachers who are potential candidates for hiring, and you know that you will only hire teachers whose age is above 25. In this case, you can apply a filter on age (e.g., age > 25) for the index. This helps save space and optimizes performance.

db.users.createIndex({ age: 1 }, { partialFilterExpression: { age: { $gt: 22 } } })

Covered Query:

In a covered query, all the fields involved in the query are part of the index. When performing the search, MongoDB doesn’t need to access the actual documents in the collection and can directly retrieve the results from the index. In a covered query, the _id field can be excluded from the results if not needed, and only the required fields are used for the search. This helps MongoDB quickly find the required record without scanning the entire collection.

For example, let’s say your app will only search by name. You create an index on name, and when you query for it, MongoDB will quickly fetch the result directly from the index, without going through the documents. This makes the query faster. The _id index is still there, but MongoDB can ignore it for this query, which saves time and resources.

This can be done by specifying _id: 0 in the query, which tells MongoDB not to include the _id field in the result.

db.users.find({name: "Jane Smith"}, {_id: 0, name: 1})

Winning Plan

In case of multiple indexes on the same field suppose we have name_1 and name_1_age_1 and we perform a search on the name, MongoDB checks the performance of each index on a sample of documents when the query is run. It then sets the best-performing one as the winning plan.

For the second query of a similar type, it doesn't race them again. Instead, it uses the previously selected winning plan from the cache.

This cache gets reset in a few cases:

After 1000 write operations
If the index is dropped or recreated
If the MongoDB server is restarted
If other indexes are added or modified

So when any of these happen, MongoDB will re-evaluate the indexes again and decide a new winning plan if needed.

Multikey Index

A multikey index is an index that can be created on an array field. When this is done, MongoDB will create a separate index entry for each value in each array, so it can quickly look up documents that match a specific value.

It is not recommended to use this since it consumes a lot of space if not used properly.

Text Search

Text search is used to search the text stored in a field even if the whole text is not provided. Suppose each user has a bio field that contains data like "I like reading books". If I want to search for that user, I would have to write the whole string "I like reading books", which is not very practical for larger texts. To counter this problem, we use text search.

We use text search to perform a full-text search.

There can only be one text index per collection.
The string is tokenized — each word in the string is broken down, and suffixes are removed. For example, "running" will be stored as "run".

As mentioned above, we can only create a single text index per collection, but within that one text index, we can add multiple fields.

In the example below, we are creating the text index for both bio and name. We are only creating one index, but it is applied on two fields:

db.users.createIndex({ name: "text", bio: "text" })
db.users.find({ $text: { $search: "Running" } })

It will return all records that have the word "run" in the bio.

We can also give one more argument — weights. It helps assign priority to different fields.

So here, we can give more priority to name compared to bio, so that matches in the name field get more weight than matches in bio:

db.users.createIndex(
  { name: "text", bio: "text" },
  { weights: { name: 1000, bio: 1 } }
)

Note: When you create an index, it will lock the collection until the index is created. So, if you create an index in a collection that has millions of documents, it will lock the entire collection during the process.

To prevent this, we can set background: true while creating the index, so that the index is created in the background without locking the collection.

db.user.createIndex({ ... }, { background: true })

Bas bhai, indexing ki kahaani yahin khatam hoti hai!
Agar blog pasand aaya toh ek like, ek share aur thoda pyaar dikhana mat bhoolna. 😄
And haan agar kuch chhoot gaya ho, toh comment mein bata dena. Tech baatein baatne se hi badhti hain!

This blog is heavily influenced by insights from a video I watched on YouTube. Link: [Click here]

How MongoDB Indexing Improves Query Performance: A Complete Guide for Optimizing Searches

Table of contents