Part 3 – Optimizing Data Mesh

Harshit Garg
13 min read

Recap: Where We Stand

So far in this series, we have established a secure, governed AWS data mesh platform, onboarded a real-world dataset, and enabled both publishers and consumers to collaborate seamlessly using S3, Glue, Lake Formation, and Athena. Our workflows now support secure, flexible data onboarding and self-service analytics, setting the stage for scalable growth.

The Challenge: Scaling Discovery, Performance, and Cost

As our data mesh evolves and more data products are onboarded, our initial approach will start to show its limitations:

  • Inefficiency and Increased Cost: Each crawler run scans all data in the specified path, leading to higher compute usage and AWS charges as our data grows.

  • Time-Consuming Operations: Full-path scans delay metadata updates, slowing down data onboarding and access for consumers.

  • Lack of Responsiveness: Manual scans can’t react instantly to new data arrivals, introducing latency for analytics and reporting.

  • Unoptimized Queries: Querying large, unpartitioned tables in Athena results in unnecessary data scans and higher costs, especially as datasets scale.

  • Missed Optimization Opportunities: Without partitioning, queries that could target specific slices of data (like by date or region) end up reading the whole table, wasting both time and resources.

Clearly, we need a smarter, more scalable approach to keep our metadata catalog up to date and optimize query performance as our data mesh grows.

The Next Phase: Evolving to Event-Driven Discovery and Partitioned Data

To overcome these challenges, we’re introducing several key optimizations:

  • Event-Driven Crawlers: By leveraging S3 event notifications and SQS, our Glue crawler now processes only new or changed data, rather than scanning the entire path every time.

  • Scheduled Automation: Crawlers are now scheduled to run at regular intervals, ensuring timely updates with minimal manual intervention.

  • Partitioned Data Layout: We’re restructuring our tables by partitioning on key columns (for our first dataset, we will choose call_date), allowing Athena to scan only the relevant slices of data for each query.

  • Optimized Query Performance: With partitioned tables, queries that filter on the partition column are significantly faster and less expensive, as Athena reads only the necessary data.

  • Reduced Costs and Overhead: This hybrid approach minimizes unnecessary scans, reduces compute costs, and keeps your metadata catalog always up to date.

With these enhancements, our data mesh becomes more responsive, scalable, and cost-effective, empowering teams to deliver faster insights and better business outcomes.


Choosing the Right Partition Columns: A Key to Efficient Data Mesh

Before implementing partitioned tables in our AWS Data Mesh, it’s crucial to select the right partition columns. The right choice can dramatically improve query speed and reduce costs.

How to Pick Partition Columns

  • Align with Query Patterns: Choose columns most commonly used in query filters (for example, call_date if most queries filter by date).

  • Balance Cardinality: Avoid too many partitions (high-cardinality) or too few (low-cardinality). Aim for a manageable number that matches your data size and usage.

  • Consider Data Distribution: Ensure partitions are evenly sized. Avoid columns where most data falls into a single partition.

  • Match Business Needs: Time-based data often benefits from date-based partitions (year, month, day). For other datasets, consider geography, category, or other business-relevant fields.

Example: For the San Francisco Safety Data, partitioning by call_date makes sense since most queries filter by date, enabling faster, cheaper queries.

Best Practice: Review query logs and data distribution regularly, and adjust your partitioning strategy as your data mesh evolves.

With our partitioning strategy in place and a clear understanding of why event-driven discovery is essential, we are ready to put these concepts into action.


Automating Metadata Discovery: Event-Based Crawlers

In this phase, we will modernize our data mesh by automating metadata discovery using AWS-native, event-driven Glue crawlers. Instead of relying on manual or scheduled full-path scans, our new approach ensures that metadata is updated instantly and efficiently whenever new data lands in S3.

What We’re About to Do

  • Automate Discovery: Set up S3 event notifications and connect them to an SQS queue, so every new data file triggers a notification.

  • Event-Driven Crawlers: Configure AWS Glue to listen to the SQS queue and process only new or changed files, not the entire dataset.

  • Continuous Updates: Schedule the crawler for regular runs, but with event-driven triggers, it only works when there’s something new, saving both time and cost.

This automation ensures our metadata catalog is always current, reduces operational overhead, and only incurs costs when actual updates are needed. It also lays the foundation for a scalable, production-ready data mesh that can handle growth without bottlenecks.

Step 1: Setting Up SQS and Dead-Letter Queue for Event Notifications

To enable real-time metadata updates, we need a way to capture every new data file that lands in S3. By using Amazon SQS with a Dead-Letter Queue (DLQ), we ensure reliable delivery of event notifications and safe handling of any processing failures.

💡
A Dead-Letter Queue (DLQ) is used alongside SQS to capture messages that cannot be processed successfully, making it easier to troubleshoot and prevent data loss in our workflow.

Create Dead-letter queue

  1. Navigate to the AWS SQS Console. Update the region as needed.

  2. Click Create queue.

  3. Select Standard. (The difference between Standard and FIFO queues can be checked here.)

  4. Name the queue aws-data-mesh-crawler-queue-dlq.

  5. Keep the Configuration settings at their defaults; for our use case, they suffice.

  6. For Encryption settings, select SSE-KMS and provide the key aws-data-mesh-key, which we created in Part 1.

  7. For Access policy, click Advanced and copy the policy from the GitHub link.

    1. Replace <region> with the region in which we are building this setup. For me, it is us-east-1.

    2. Replace <account_number> with our account ID.

  8. Keep the Redrive allow policy setting Disabled, which means any queue can use this one as its DLQ.

  9. Keep the Dead-letter queue setting Disabled, since this queue is itself the DLQ.

  10. Leave Tags empty and click Create queue.

Create Crawler Queue

  1. Go to the homepage of the SQS Console.

  2. Follow steps 2-6 from above, naming the queue aws-data-mesh-crawler-queue.

  3. For Access policy, click Advanced and copy the policy from the GitHub link.

    1. Replace <region> with the region in which we are building this setup. For me, it is us-east-1.

    2. Replace <account_number> with our account ID.

  4. Set the Dead-letter queue setting to Enabled, select the queue created above, and set Maximum receives to 1. This ensures that as soon as a message fails processing, it moves to the DLQ.

  5. Leave Tags empty and click Create queue.

This forms the backbone of our event-driven discovery crawlers.

Copy the ARNs of the queue and its DLQ; we will need them when we configure the crawler in Step 4.
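If you prefer scripting this setup, here is a minimal boto3 sketch of the same two queues. It assumes the KMS key from Part 1 is addressable via the alias aws-data-mesh-key, and it omits the access policy, which should be copied from the linked GitHub repo.

    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")  # adjust the region as needed

    # Dead-letter queue, encrypted with the KMS key from Part 1
    dlq_url = sqs.create_queue(
        QueueName="aws-data-mesh-crawler-queue-dlq",
        Attributes={"KmsMasterKeyId": "alias/aws-data-mesh-key"},
    )["QueueUrl"]
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # Crawler queue: same encryption, plus a redrive policy so a message that
    # fails once is moved straight to the DLQ (maxReceiveCount = 1).
    # Add the access policy from the GitHub link as the "Policy" attribute.
    queue_url = sqs.create_queue(
        QueueName="aws-data-mesh-crawler-queue",
        Attributes={
            "KmsMasterKeyId": "alias/aws-data-mesh-key",
            "RedrivePolicy": json.dumps(
                {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "1"}
            ),
        },
    )["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    print("Queue ARN:", queue_arn)
    print("DLQ ARN:  ", dlq_arn)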

Step 2: Allow Access to KMS Key from S3

In this step, we will update the KMS key policy to allow Amazon S3 to send encrypted notifications to our SQS queue. This is necessary because S3 needs permission to use the KMS key when messages are sent to the encrypted SQS queue.

  1. Navigate to the AWS KMS Console and click Customer managed keys.

  2. Select the KMS key used for encrypting the SQS queue. (From Part 1)

  3. Go to the Key Policy tab and switch to policy view.

  4. Click Edit to modify the key policy.

  5. We can see four existing statements; add a new statement to grant S3 the necessary access.

  6. Visit the provided GitHub link, copy the required policy statement, and paste it into the key policy.

  7. Save the changes to apply the updated policy.
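For reference, the statement being added typically looks like the sketch below; treat it as an illustration and copy the exact statement from the GitHub link. It grants the S3 service principal permission to use the key so it can deliver notifications to the KMS-encrypted queue.

    import json
    import boto3

    kms = boto3.client("kms", region_name="us-east-1")
    key_id = "<kms_key_id>"  # the aws-data-mesh-key from Part 1

    # Illustrative statement letting S3 use the key for encrypted SQS notifications
    s3_statement = {
        "Sid": "AllowS3ToUseKeyForEventNotifications",
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": ["kms:GenerateDataKey*", "kms:Decrypt"],
        "Resource": "*",
    }

    policy = json.loads(kms.get_key_policy(KeyId=key_id, PolicyName="default")["Policy"])
    policy["Statement"].append(s3_statement)  # keeps the four existing statements
    kms.put_key_policy(KeyId=key_id, PolicyName="default", Policy=json.dumps(policy))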

Step 3: Configuring S3 to Send Events

Next, we want S3 to automatically notify our system whenever new data arrives, so our metadata catalog stays current without manual intervention.

We will use S3 Event Notifications to implement this.

  1. Navigate to the AWS S3 Console. Update the region as needed.

  2. Select our datalake bucket (named aws-data-mesh-<account_number>-<region>).

  3. Go to Properties tab

  4. Scroll down to Event notifications

  5. Click on Create event notification

    1. Name the event aws-crawler-notify.

    2. If we want to apply this notification to a specific prefix or suffix, we can provide the details here. For now, we will leave both blank so it applies to all objects.

    3. Select All object create events; we don’t need notifications for any other event types.

    4. Scroll down to Destination, choose SQS queue, and select aws-data-mesh-crawler-queue, which we created in Step 1.

    5. Click Save changes.

This ensures every new or updated file is tracked and ready for discovery.
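The same notification can be configured with boto3, as in this sketch; the bucket name placeholder follows the naming convention above, and the queue ARN is the one copied in Step 1. Note that this call replaces the bucket’s existing notification configuration.

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    s3.put_bucket_notification_configuration(
        Bucket="aws-data-mesh-<account_number>-<region>",  # our datalake bucket
        NotificationConfiguration={
            "QueueConfigurations": [
                {
                    "Id": "aws-crawler-notify",
                    "QueueArn": "arn:aws:sqs:<region>:<account_number>:aws-data-mesh-crawler-queue",
                    "Events": ["s3:ObjectCreated:*"],  # all object create events
                    # no prefix/suffix filter, so this applies to every object
                }
            ]
        },
    )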

Step 4: Updating Glue Crawler for Event-Driven Discovery

Currently, our crawler scans all data in the configured path, which is slow and costly as datasets grow. By making the crawler event-driven, we process only new or changed files, keeping operations fast and efficient.

First, we will add SQS permissions to our crawler role: the crawler needs access to SQS to read messages and to purge them once they have been processed (see the sketch after these steps).

  1. Navigate to Amazon IAM Console.

  2. Under Access Management, Go to Roles.

  3. Select AWSGlueServiceRole-datamesh.

  4. Under the Permissions tab, open the Add permissions dropdown and click Create inline policy.

  5. Click JSON and copy the policy from this GitHub link.

    1. Replace <region> with the region in which we are building this setup. For me, it is us-east-1.

    2. Replace <account_number> with our account ID.

  6. Click Next.

  7. Name the policy as SQSAccessPolicy.

  8. Click Create Policy.
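Here is that sketch: a minimal boto3 version of the inline policy. The action list is an assumption about what the linked policy contains, so prefer the exact JSON from the GitHub repo.

    import json
    import boto3

    iam = boto3.client("iam")

    # Read, delete, and purge access on the crawler queue and its DLQ
    sqs_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage",
                    "sqs:GetQueueAttributes",
                    "sqs:GetQueueUrl",
                    "sqs:ListDeadLetterSourceQueues",
                    "sqs:PurgeQueue",
                ],
                "Resource": [
                    "arn:aws:sqs:<region>:<account_number>:aws-data-mesh-crawler-queue",
                    "arn:aws:sqs:<region>:<account_number>:aws-data-mesh-crawler-queue-dlq",
                ],
            }
        ],
    }

    iam.put_role_policy(
        RoleName="AWSGlueServiceRole-datamesh",
        PolicyName="SQSAccessPolicy",
        PolicyDocument=json.dumps(sqs_policy),
    )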

Now, we will update our existing Glue crawler with these changes:

  1. Navigate to the AWS Glue Console. Update the region as needed.

  2. Under Data Catalog, Go to Crawlers console.

  3. Click on sf_safety_data_crawler. Click Edit in the top right corner.

  4. In the wizard, let’s update the settings step by step.

    1. Proceed to Step 2: Choose data sources and classifiers and click Edit.

    2. Select existing S3 data source and click Edit.

    3. Under Subsequent crawler runs, choose Crawl based on events and paste the SQS and DLQ ARNs we copied earlier.

    4. Click on Update S3 data source.

This configures our crawler as an event-driven crawler.

  1. In the side navigation, go to Set output and scheduling.

  2. Click on Advanced options.

  3. Select Update all new and existing partitions with metadata from the table.
    This ensures all partitions use a consistent schema and helps prevent issues like HIVE_PARTITION_SCHEMA_MISMATCH, which can occur if partitions have mismatched or inferred schemas.

  4. Under Crawler schedule, choose Daily from the Frequency dropdown.

  5. Set the schedule to run once daily, choosing a time about 10 minutes after the current time so we can watch the first run shortly.
    I have set mine to midnight, so it shows At 12:00 AM.

  6. Save the changes to complete the update.

    This schedules our crawler to run once every day.

    💡
    The optimal crawler schedule depends on how frequently new data is published for each dataset. Adjust the frequency to match the dataset with the most frequent updates, ensuring timely metadata discovery without unnecessary runs.

  7. Click Next, then click Update.

With this configuration, the Glue crawler no longer scans our entire S3 path on every run. Instead, it checks the SQS queue for new S3 event notifications and processes only the files that have been added or changed since the last crawl. If there are no new events in the queue, the crawler stops immediately, saving both time and cost.
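The console wizard maps to a single UpdateCrawler call, so the whole configuration can also be expressed as the boto3 sketch below: event-mode recrawl, the two queue ARNs, the partition-inheritance output option, and a daily midnight schedule. The S3 path is an assumption based on this series, so adjust it to your setup.

    import json
    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.update_crawler(
        Name="sf_safety_data_crawler",
        Targets={
            "S3Targets": [
                {
                    "Path": "s3://aws-data-mesh-<account_number>-<region>/sf_safety_data/",
                    "EventQueueArn": "arn:aws:sqs:<region>:<account_number>:aws-data-mesh-crawler-queue",
                    "DlqEventQueueArn": "arn:aws:sqs:<region>:<account_number>:aws-data-mesh-crawler-queue-dlq",
                }
            ]
        },
        # "Crawl based on events" in the console
        RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},
        # "Update all new and existing partitions with metadata from the table"
        Configuration=json.dumps({
            "Version": 1.0,
            "CrawlerOutput": {"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}},
        }),
        # Daily at midnight UTC (the console shows "At 12:00 AM")
        Schedule="cron(0 0 * * ? *)",
    )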

This event-driven approach ensures our Data Catalog is always up to date, enables near real-time metadata updates, and scales efficiently as our data grows, making our data mesh more responsive and cost-effective.


Optimizing Data Layout: Data Partitioning

As our dataset grows, scanning the entire table for every query becomes increasingly slow and expensive. To address this, we are now partitioning our table by the call_date column, which allows us to optimize both performance and cost in our data mesh.

Why We’re Doing This

Full-table scans are inefficient, slow and expensive, especially as data volumes increase. Partitioning allows query engines like Athena to read only the relevant slices of data, dramatically improving performance and reducing costs. This approach also aligns with common query patterns, where users frequently filter data by date.

What We’re About to Do

  • Use Provided Partitioned Files: For this step, we’ll use the partitioned CSV files, which are already organized by the call_date column.

  • Upload to S3 with Partitioned Prefixes: Upload these files to our datalake bucket, following a partitioned directory structure.

  • Let the Event-Driven Crawler Run: Once uploaded, the event-driven Glue crawler will automatically detect new partitions and update the Glue Data Catalog, ensuring our metadata is always current without manual effort.

By implementing this partitioned data layout now, we’re setting the stage for scalable, efficient analytics. In future phases, we’ll explore ETL processes and further optimizations, but for now, simply use the provided partitioned files to experience these immediate benefits.
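As a concrete illustration, the partitioned layout follows the Hive-style key=value convention, which lets the crawler register call_date as a partition column automatically; the file names below are just placeholders.

    s3://aws-data-mesh-<account_number>-<region>/sf_safety_data/sf_safety_data_csv_partitioned/
        call_date=2024-03-30/data.csv
        call_date=2024-03-31/data.csv
        call_date=2024-04-01/data.csv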

Step 1: Download Partitioned Files

  1. Download the zip file from here, containing the partitioned dataset.

  2. Unzip the file on your system.

Step 2: Upload Partitioned Dataset

  1. Navigate to the AWS S3 console.

  2. Select our datalake bucket. Click on sf_safety_data.

  3. Click on Create folder and name it sf_safety_data_csv_partitioned. (The crawler will detect this as our table name.)

  4. Upload 10 to 15 partition folders from your system to S3.

Uploading may take several minutes depending on the size.
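If you prefer to upload from a script instead of the console, this boto3 sketch walks the unzipped folder and preserves the call_date=... partition prefixes; the local path is a placeholder for wherever you extracted the zip.

    from pathlib import Path

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    bucket = "aws-data-mesh-<account_number>-<region>"
    base_prefix = "sf_safety_data/sf_safety_data_csv_partitioned"
    local_root = Path("./sf_safety_data_partitioned")  # wherever you unzipped the files

    # Upload each local partition folder (call_date=YYYY-MM-DD) under the same prefix,
    # so the crawler can register call_date as the partition column.
    for csv_file in sorted(local_root.rglob("*.csv")):
        relative_key = csv_file.relative_to(local_root).as_posix()
        s3.upload_file(str(csv_file), bucket, f"{base_prefix}/{relative_key}")
        print(f"Uploaded s3://{bucket}/{base_prefix}/{relative_key}")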

Step 3: Data Discovery

The event-driven Glue crawler will automatically pick up the new table and its partitions.

  1. Navigate to the AWS Glue console and go to the Crawlers console.
    Since we scheduled the crawler to run about 10 minutes out in the previous step, let’s wait until it runs.

Notice that the crawler runs exactly at the scheduled time we configured, creating a new table and registering all the uploaded partitions as expected.


Partitioning in Practice: Efficient Queries with Athena

With partitioned data in place, let’s see how partitioning works in practice and how to run Athena queries that take full advantage of this structure for both simple and complex analytics.

Athena Table Overview

After logging in as DataMeshUser, we can see two tables in the Athena console:

  • sf_safety_data_csv_non_partitioned

  • sf_safety_data_csv_partitioned

We’ll run queries on both tables and observe the "Data scanned" value to compare efficiency.

💡
Note: In the partitioned table, the date column has been renamed to call_date (format: yyyy-mm-dd). Ensure you use the correct column name and format in your queries.

Example Athena Queries

  1. Get list of Partitions

     SHOW PARTITIONS sf_safety_data_csv_partitioned
    

    This lists all partitions loaded in our table, which we can use to compare results between both tables.

  2. Filtered Query by Partition Column

    I will use the 2024-03-31 partition for my example; make sure to pick a value that matches a partition actually loaded in your table.

     SELECT * 
     FROM sf_safety_data_csv_partitioned
     WHERE call_date = '2024-03-31'
    

     SELECT * 
     FROM sf_safety_data_csv_non_partitioned
     WHERE "Call Date" = '03/31/2024'
    

    The partitioned table scans only the relevant partition (413.23 KB), while the non-partitioned table scans the entire dataset before filtering (2.68 GB). The sketch below shows how to read this statistic programmatically.
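This boto3 sketch runs a query and reads DataScannedInBytes from the Athena query statistics; the database name and results location are placeholders for your own setup.

    import time

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    def data_scanned(query: str, database: str, output: str) -> int:
        """Run an Athena query and return the number of bytes scanned."""
        qid = athena.start_query_execution(
            QueryString=query,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output},
        )["QueryExecutionId"]
        while True:
            execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
            if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
                return execution["Statistics"]["DataScannedInBytes"]
            time.sleep(2)

    print(data_scanned(
        "SELECT * FROM sf_safety_data_csv_partitioned WHERE call_date = '2024-03-31'",
        database="<glue_database>",
        output="s3://<athena-query-results-bucket>/",
    ))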

How to Experiment

Try uploading additional partitioned files to your datalake bucket and let the crawler run as scheduled.

  • Query the new partitions in Athena and observe the data scanned and time taken.

  • Screenshots of your query results can help visualize the impact.

  • This hands-on practice will deepen your understanding of how event-driven discovery and partitioning drive efficiency in a modern data mesh.

By following these steps, you’ll see firsthand how partitioning and event-driven metadata discovery make your data mesh more scalable, efficient, and cost-effective.


Summary

In this part of the series, we took major steps to modernize and optimize our AWS data mesh platform for both performance and scalability. Here’s a quick summary of what we accomplished:

  • Automated Metadata Discovery: We set up an event-driven workflow using S3 event notifications, SQS, and AWS Glue crawlers. This ensures that new data is discovered and cataloged instantly, without the need for manual or full-path scans.

  • Partitioned Data Layout: We organized our dataset by the call_date column, uploaded partitioned files to S3, and let the event-driven crawler update the Glue Data Catalog. This structure enables Athena to scan only relevant data, making queries faster and more cost-effective.

  • Hands-On Athena Querying: We compared queries on both partitioned and non-partitioned tables, demonstrating how partitioning drastically reduces the amount of data scanned and improves query performance.


What’s Next: Auditing and Monitoring

In the next part of this series, we’ll focus on strengthening our data mesh with robust auditing and monitoring.

  • Monitoring - We will create alarms to notify us of Glue crawler failures, ensuring timely awareness and response to issues.

  • Auditing - We’ll set up AWS CloudTrail to capture and analyze key events, enabling detailed tracking of activity and changes across your platform.

Stay tuned as we take our data mesh to the next level of reliability and transparency!


If you have questions, run into issues, or want to share your results, please leave a comment below. Your feedback and insights help us all build better data platforms together.
