MongoDB Aggregation: The Secret Weapon for Data Transformation

Table of contents
- What's an Aggregation Pipeline?
- Most Common Aggregation Pipeline Operators
- 1. $match: Your Data Filter
- 2. $group: Summarize Your Data
- 3. $project: Reshape Your Documents
- 4. $sort: Order Your Results
- 5. $limit: Control Your Output Size
- 6. $skip: Pagination Power
- 7. $unwind: Flatten Your Arrays
- 8. $lookup: Join Collections (SQL-style)
- 9. $addFields: Add New Fields with Expressions
- 10. $count: Get a Quick Count
- 11. $out: Store Aggregated Results in a New Collection
- 12. $merge: Intelligently Write Aggregated Results to a Collection
- Putting It All Together: A Complete Example

Ever looked at your raw data and wished you could transform it, summarize it, and get meaningful insights out of it? That's exactly what MongoDB's Aggregation Pipelines are for! Think of it like an assembly line for your data, where each "station" (called a "stage") performs a specific operation, transforming the data as it moves along, until you get exactly the output you need.
Whether you're just starting with MongoDB or have some experience, this guide will help you understand the magic of aggregation pipelines, focusing on the most commonly used operators. We'll break down what each operator does, give you clear examples, and show you real-world scenarios where they shine. Let's dive in!
What's an Aggregation Pipeline?
Imagine you have a huge box of LEGOs. If you want to build a specific castle, you wouldn't just dump them all out and hope for the best. Instead, you might:
Sort by color.
Filter out all the tiny pieces you don't need.
Group similar shapes together.
Assemble specific sections.
That's precisely what an aggregation pipeline does for your data. You chain together different operations (stages) to process your documents step-by-step, ultimately delivering a refined, insightful result. Each stage takes a stream of documents as input, processes them, and passes the modified documents to the next stage.
Most Common Aggregation Pipeline Operators
Let's explore the workhorses of the aggregation pipeline:
1. $match
: Your Data Filter
What it does:
$match
is like the bouncer at a club. It lets only the documents that meet your specified conditions pass through to the next stage. It's similar to thefind()
method but used within the aggregation pipeline.When to use it: Always try to use
$match
as early as possible in your pipeline! Why? Because it reduces the number of documents that need to be processed by subsequent stages, making your queries much faster and more efficient.Example: Imagine you have a collection of "orders" and you only want to analyze orders placed in "2024" with a "totalAmount" greater than "100".
db.orders.aggregate([ { $match: { orderDate: { $gte: ISODate("2024-01-01T00:00:00Z"), $lt: ISODate("2025-01-01T00:00:00Z") }, totalAmount: { $gt: 100 } } } ]);
Scenario: Filtering large datasets for specific time periods or based on certain criteria before performing complex calculations.
2. $group
: Summarize Your Data
What it does:
$group
is your go-to for summarizing data. It groups documents by a specified key (or multiple keys) and then lets you perform calculations on those groups using "accumulator" operators (like$sum
,$avg
,$max
,$min
,$count
).When to use it: When you need to calculate totals, averages, counts, or find minimum/maximum values across categories or groups in your data.
Example: You want to find the total sales for each product category
db.products.aggregate([ { $group: { _id: "$category", // Group by the 'category' field totalSales: { $sum: "$price" }, // Calculate the sum of 'price' for each category numberOfProducts: { $sum: 1 } // Count the number of products in each category } } ]);
Scenario: Generating sales reports by product type, calculating average customer ratings per region, or counting unique users.
3. $project
: Reshape Your Documents
What it does:
$project
allows you to reshape the documents in your pipeline. You can:Include specific fields.
Exclude specific fields.
Rename fields.
Create new fields based on existing ones or expressions.
When to use it: When you only need a subset of fields from your documents, or when you want to compute new fields for further analysis or a cleaner output.
Example: You have user documents with sensitive information and you only want to show their
name
,email
, andregistrationDate
. You also want to combine theirfirstName
andlastName
into a singlefullName
field.db.users.aggregate([ { $project: { _id: 0, // Exclude the default _id field fullName: { $concat: ["$firstName", " ", "$lastName"] }, // Create a new field email: 1, // Include the email field registrationDate: 1 // Include the registrationDate field } } ]);
Scenario: Preparing data for display in a user interface, creating new calculated fields (e.g.,
totalPrice = quantity * unitPrice
), or simplifying document structure.
4. $sort
: Order Your Results
What it does:
$sort
arranges the documents in your pipeline in a specific order, either ascending (1) or descending (-1), based on one or more fields.When to use it: When you need to present your data in a particular order, like sorting products by price, users by registration date, or articles by popularity.
Example: You want to see the top 5 most expensive products.
db.products.aggregate([ { $sort: { price: -1 } } // Sort by 'price' in descending order ]);
Scenario: Creating leaderboards, displaying search results in a meaningful order, or preparing data for reports that require ordered presentation.
5. $limit
: Control Your Output Size
What it does:
$limit
restricts the number of documents that pass through to the next stage. It's like saying, "Just give me the first 10 results."When to use it: Often used with
$sort
for pagination or to get the "top N" or "bottom N" results. It's crucial for performance when you only need a small subset of results.Example: After sorting products by price, you want to see only the top 3.
db.products.aggregate([ { $sort: { price: -1 } }, { $limit: 3 } // Limit to the first 3 documents ]);
Scenario: Implementing pagination for web applications, finding the top-selling items, or displaying a limited number of most recent activities.
6. $skip
: Pagination Power
What it does:
$skip
bypasses a specified number of documents at the beginning of the stream. It's like saying, "Skip the first 10 items and then give me the rest."When to use it: Primarily used for pagination in combination with
$limit
. If you're on "page 2" and each page shows 10 items, you'd$skip
the first 10 and then$limit
to the next 10.Example: You want to get the second page of products, with each page showing 5 items.
db.products.aggregate([ { $sort: { name: 1 } }, // Sort for consistent pagination { $skip: 5 }, // Skip the first 5 documents { $limit: 5 } // Get the next 5 documents ]);
Scenario: Building pagination features for search results, product listings, or news feeds.
7. $unwind
: Flatten Your Arrays
What it does:
$unwind
deconstructs an array field from the input documents to output a document for each element. This means if a document has an array with three items,$unwind
will create three separate documents, each containing one of those array items, along with all the other fields from the original document.When to use it: When you need to process individual elements within an array as if they were separate documents, often before grouping or filtering based on array elements.
Example: You have a
books
collection where each book document has anauthors
array. You want to see each author as a separate entry to count how many books each author has.// Original document: // { "_id": 1, "title": "The Great Novel", "authors": ["Alice", "Bob"] } db.books.aggregate([ { $unwind: "$authors" } // Deconstructs the 'authors' array ]); // Output documents: // { "_id": 1, "title": "The Great Novel", "authors": "Alice" } // { "_id": 1, "title": "The Great Novel", "authors": "Bob" }
Scenario: Analyzing individual tags associated with products, processing line items in an order, or calculating statistics for array elements.
8. $lookup
: Join Collections (SQL-style)
What it does:
$lookup
performs a left outer join to an unsharded collection in the same database. It brings in documents from another collection based on a matching field, similar to aJOIN
operation in SQL.When to use it: When your data is distributed across multiple collections (e.g., orders and customers), and you need to combine related information into a single view for analysis.
Example: You have an
orders
collection withcustomerId
and acustomers
collection with customer details. You want to see order details along with the customer's name and email.db.orders.aggregate([ { $lookup: { from: "customers", // The collection to join with localField: "customerId", // Field from the input documents (orders) foreignField: "_id", // Field from the "customers" collection as: "customerInfo" // The name of the new array field to add to the input documents } }, { $unwind: "$customerInfo" // Unwind if customerInfo is an array (typical for $lookup) }, { $project: { _id: 0, orderId: "$_id", totalAmount: 1, customerName: "$customerInfo.name", customerEmail: "$customerInfo.email" } } ]);
Scenario: Fetching customer details with their orders, linking product reviews to product information, or combining user profiles with their activity logs.
9. $addFields
: Add New Fields with Expressions
What it does:
$addFields
adds new fields to documents while preserving all existing fields. It's very similar to$project
but specifically designed for adding fields without removing others, making your pipeline more readable if your goal is just augmentation.When to use it: When you want to compute new fields based on existing data, or add static values to your documents, without losing any of the original fields.
Example: You want to add a
totalPrice
field to yourorderItems
documents, calculated asquantity * price
.db.orderItems.aggregate([ { $addFields: { totalPrice: { $multiply: ["$quantity", "$price"] } // Calculate totalPrice } } ]);
Scenario: Calculating subtotals, applying discounts, or adding timestamps to documents within the pipeline.
10. $count
: Get a Quick Count
What it does:
$count
simply counts the number of documents that pass through its stage and outputs a single document containing that count.When to use it: When you need a quick count of documents that meet certain criteria after they've gone through previous pipeline stages (like
$match
).Example: You want to count how many "active" users you have.
db.users.aggregate([ { $match: { status: "active" } }, { $count: "activeUserCount" } // Output field name for the count ]); // Output: { "activeUserCount": 123 }
Scenario: Getting the number of items after filtering, counting the results of a complex aggregation, or generating quick statistics for dashboards.
11. $out
: Store Aggregated Results in a New Collection
What it does: $out
takes all the documents that are the result of the preceding stages in the aggregation pipeline and writes them into a specified new collection. If a collection with that name already exists, $out
will completely drop it and then create a new one with the pipeline's results.
When to use it:
When you need to create a new collection containing the aggregated or transformed data.
When you want to replace an existing collection entirely with fresh, aggregated data.
For generating snapshot reports or denormalized views that don't need incremental updates.
As a simple way to export complex aggregation results for further querying or analysis.
Example: You've calculated monthly average sales for electronics in 2024 and want to save this summary into a new collection.
db.sales.aggregate([
// ... (previous aggregation stages: $lookup, $unwind, $match, $addFields, $group, $sort, $project) ...
{
$out: "electronicsMonthlySales2024" // The name of the collection to create/overwrite
}
]);
Scenario: Generating daily/weekly/monthly reports (e.g., daily_sales_summary
), creating a materialized view of complex joins, or preparing data for export where the previous data is no longer relevant.
12. $merge
: Intelligently Write Aggregated Results to a Collection
What it does: $merge
writes the aggregated results into a specified collection. Unlike $out
, $merge
offers fine-grained control over how new documents interact with existing ones if the target collection already exists. It can insert new documents, replace existing ones, merge fields into existing documents, or even perform a custom update pipeline.
When to use it:
When you need to incrementally update a summary or materialized view collection.
When you want to combine new aggregated data with existing data in a smart way without dropping the entire collection.
For ETL (Extract, Transform, Load) processes where you transform data and load it into a target collection while maintaining its history or specific update rules.
When you need to preserve indexes on the target collection, as
$merge
does not drop and recreate the collection.
Example: You have a daily_inventory_snapshot
collection and want to update it with today's inventory data. You want to replace records for products that already exist for today and insert new records for products not seen before.
db.inventory_logs.aggregate([
// ... (aggregation stages to calculate today's inventory, e.g., $match, $group) ...
{
$project: {
_id: 0,
productId: "$_id",
snapshotDate: ISODate("2025-07-30T00:00:00Z"), // Assuming current date for snapshot
currentStock: "$totalStock"
}
},
{
$merge: {
into: "daily_inventory_snapshot", // The target collection
on: ["productId", "snapshotDate"], // Fields to match on for existing documents
whenMatched: "replace", // If a match is found, replace the old document with the new one
whenNotMatched: "insert" // If no match, insert the new document
}
}
]);
Scenario: Maintaining real-time dashboards, managing slowly changing dimensions, updating materialized views hourly/daily/weekly, or implementing robust data warehousing strategies within MongoDB.
Putting It All Together: A Complete Example
Let's imagine you have a collection of sales
documents, and you want to find the average sale amount for "electronics" products sold in "2024", grouped by month.
Sales Document Example:
{
"item": "Laptop",
"category": "electronics",
"price": 1200,
"quantity": 1,
"saleDate": ISODate("2024-03-15T10:00:00Z")
}
Aggregation Pipeline:
db.sales.aggregate([
// Stage 1: Filter for electronics sales in 2024
{
$match: {
category: "electronics",
saleDate: { $gte: ISODate("2024-01-01T00:00:00Z"), $lt: ISODate("2025-01-01T00:00:00Z") }
}
},
// Stage 2: Add a 'month' field for grouping
{
$addFields: {
month: { $month: "$saleDate" } // Extracts the month from saleDate
}
},
// Stage 3: Group by month and calculate average sale amount
{
$group: {
_id: "$month", // Group by the newly created 'month' field
averageSaleAmount: { $avg: { $multiply: ["$price", "$quantity"] } }, // Calculate total sale amount for each item and then average
totalSalesCount: { $sum: 1 } // Count sales per month
}
},
// Stage 4: Sort the results by month
{
$sort: { _id: 1 } // Sort by month (ascending)
},
// Stage 5: Project to rename _id to monthName for better readability
{
$project: {
_id: 0, // Exclude the default _id
month: "$_id", // Rename _id to month
averageSaleAmount: { $round: ["$averageSaleAmount", 2] }, // Round to 2 decimal places
totalSalesCount: 1
}
}
]);
Explanation of the pipeline:
$match
: We start by filtering out only the "electronics" sales that happened within the year 2024. This makes sure we're only working with relevant data from the beginning.$addFields
: We then create a new field calledmonth
by extracting the month number from thesaleDate
. This will be useful for grouping.$group
: Now, we group our filtered and augmented documents by themonth
field. For each month, we calculate theaverageSaleAmount
(price * quantity) and thetotalSalesCount
.$sort
: We sort the grouped results bymonth
in ascending order, so the output is chronological.$project
: Finally, we reshape the output for better readability. We exclude the default_id
field (which is the month number from$group
), rename it tomonth
, round theaverageSaleAmount
to two decimal places, and includetotalSalesCount
.
If you were to run this pipeline with the single sales document provided, the output would be:
[
{
"month": 3,
"averageSaleAmount": 1200.00,
"totalSalesCount": 1
}
]
This output correctly identifies the average sale amount for electronics in March 2024 based on your single sample document. If you had more documents, the averages and counts for each month would reflect all matching sales.
The above pipeline which we have created does not include $lookup
stage. If you want to integrate data from another collection, a $lookup
stage would be essential.
Let's imagine a scenario where your sales documents only contain an item_id
, and the actual item details (like category
and price
) are stored in a separate products
collection.
New Scenario:
sales
Collection Document Example:**JSON{ "item_id": "P001", "quantity": 1, "saleDate": ISODate("2024-03-15T10:00:00Z") }
products
Collection Document Example:**JSON{ "_id": "P001", "name": "Laptop", "category": "electronics", "price": 1200 }
Here's how you would modify the aggregation pipeline to include a $lookup
stage to retrieve the category
and price
from the products
collection before performing the calculations:
Aggregation Pipeline with $lookup
:
db.sales.aggregate([
// Stage 1: Lookup product details from the 'products' collection
{
$lookup: {
from: "products", // The collection to join with
localField: "item_id", // Field from the input documents (sales)
foreignField: "_id", // Field from the "from" documents (products)
as: "productDetails" // The name of the new array field to add to the input documents
}
},
// Stage 2: Unwind the productDetails array (since $lookup returns an array)
{
$unwind: "$productDetails"
},
// Stage 3: Filter for electronics sales in 2024 (using the looked-up category)
{
$match: {
"productDetails.category": "electronics", // Access category from the looked-up productDetails
saleDate: { $gte: ISODate("2024-01-01T00:00:00Z"), $lt: ISODate("2025-01-01T00:00:00Z") }
}
},
// Stage 4: Add a 'month' field for grouping
{
$addFields: {
month: { $month: "$saleDate" } // Extracts the month from saleDate
}
},
// Stage 5: Group by month and calculate average sale amount
{
$group: {
_id: "$month", // Group by the newly created 'month' field
averageSaleAmount: { $avg: { $multiply: ["$productDetails.price", "$quantity"] } }, // Use looked-up price
totalSalesCount: { $sum: 1 } // Count sales per month
}
},
// Stage 6: Sort the results by month
{
$sort: { _id: 1 } // Sort by month (ascending)
},
// Stage 7: Project to rename _id to monthName for better readability
{
$project: {
_id: 0, // Exclude the default _id
month: "$_id", // Rename _id to month
averageSaleAmount: { $round: ["$averageSaleAmount", 2] }, // Round to 2 decimal places
totalSalesCount: 1
}
}
]);
Explanation of the added $lookup
and $unwind
stages:
$lookup
(Stage 1):from: "products"
: Specifies that we want to join with theproducts
collection.localField: "item_id"
: This is the field in thesales
collection that contains the value we want to match.foreignField: "_id"
: This is the field in theproducts
collection that contains the value to match againstitem_id
.as: "productDetails"
: The results of the join (the matching product documents) will be added as an array namedproductDetails
to eachsales
document.
$unwind: "$productDetails"
(Stage 2):- Since
$lookup
creates an array (even if there's only one match),$unwind
deconstructs that array. For each element in theproductDetails
array, it outputs one document. If aproductDetails
array has multiple elements (due to multiple matches in the$lookup
),$unwind
would create multiple documents, one for each matched product. In this specific scenario, where_id
is unique, it essentially "flattens" the array so you can directly accessproductDetails.category
andproductDetails.price
.
- Since
This modified pipeline addresses your request to include $lookup
, demonstrating how to enrich your sales data with product information from another collection before performing the aggregation.
Here's what $lookup
does:
It performs a left outer join in memory, within the aggregation pipeline. This means it combines documents from one collection (
sales
in your example) with documents from another collection (products
) based on a specified common field (item_id
insales
and_id
inproducts
).It adds a new array field to the input documents. In your example, for each
sales
document, it adds a new field namedproductDetails
. ThisproductDetails
field is an array that contains all the matching documents from theproducts
collection.This new field is temporary. The
productDetails
field (and its contents) exists only for the duration of the aggregation pipeline. It's an ephemeral construct that allows you to work with joined data during the aggregation process. It does not modify your originalsales
orproducts
collections, nor does it create a new collection in your database.
To be clear:
Your
sales
collection remains as it is.Your
products
collection remains as it is.No new collection named
productDetails
(or any other name) is created on your disk or in your database.
The data that $lookup
and subsequent stages (like $unwind
) process is streamed through the pipeline. When the pipeline finishes, the transformed data is returned as the result of the aggregate()
command, but no permanent changes are made to your collections unless you explicitly write the results to a new collection using an additional stage like $out
or $merge
.
In your provided pipeline, the productDetails
array is immediately deconstructed by the $unwind
stage, effectively "flattening" the joined data so you can directly access fields like productDetails.category
in the subsequent $match
and $group
stages.
And thats a wrap!
MongoDB's aggregation pipelines are incredibly powerful for data analysis and transformation. By mastering these fundamental operators, you'll be well on your way to unlocking deep insights from your NoSQL data!
Happy aggregating!
You can learn more about mongodb aggreation pipelines over here: Link
Subscribe to my newsletter
Read articles from Vedant Swami directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Vedant Swami
Vedant Swami
developer, designer, blogger,Ex. Web Dev @ startup