Part 4 – Auditing & Monitoring in Data Mesh


Recap: Where We Stand
Welcome back to the AWS Data Mesh series. So far, we have designed, built, and optimized a scalable data mesh architecture using AWS native services. If you are just joining, you can start from the first part of the series here.
Why Monitoring and Auditing Matter
As our data mesh grows and becomes central to business operations, the risks of not having proper monitoring and auditing increase dramatically. Without these capabilities, we may face:
Undetected Failures: Glue crawlers or ETL jobs can fail silently, leading to stale or missing data that impacts analytics and decision making.
Security Blind Spots: Unauthorized changes to tables or permissions may go unnoticed, exposing sensitive data or disrupting operations.
Compliance Risks: Without a clear audit trail, demonstrating regulatory compliance becomes difficult and time consuming.
Delayed Troubleshooting: Identifying the root cause of issues is challenging without detailed logs and real time alerts.
Lack of Accountability: When actions are not tracked, it is hard to know who made critical changes or when they occurred.
To address these challenges, in this part, we shift our attention to observability, which will make our data mesh transparent, secure, and operationally robust. We will be setting up monitoring for Glue crawlers, auditing activity with CloudTrail, analyzing logs using Athena, and capturing key governance events with EventBridge. By the end, we will have a hands on framework for auditing and monitoring, and a foundation for future enhancements such as disaster recovery and resilience.
Monitoring Glue Crawlers with CloudWatch
Before we can respond to issues in our data mesh, we first need to know when and where they happen. AWS Glue crawlers play a critical role in discovering and cataloging our data, but failures can easily go unnoticed without proper monitoring. By leveraging CloudWatch logs and metric filters, we can automatically detect crawler errors and receive instant notifications, helping us maintain data freshness and reliability.
What we are about to do:
Explore AWS Glue crawler logs in CloudWatch for real time and historical monitoring.
Create a metric filter in CloudWatch to automatically scan crawler logs for error patterns, such as the keyword "ERROR".
Set up a custom CloudWatch metric that increments each time an error is detected in the logs.
Use this metric to trigger alarms and notifications, ensuring immediate awareness of crawler failures.
This setup will give us real time visibility into the health of our Glue crawlers. By automatically detecting and alerting us to errors, we can address issues quickly before they impact our data mesh’s reliability. This proactive monitoring reduces manual effort, helps maintain data quality, and ensures our team can respond rapidly to operational problems.
Step 1: Setting Up CloudWatch Logs and Metric Filters
AWS Glue automatically sends crawler logs to CloudWatch, making it easy to monitor crawler activity and issues.
Let’s first explore the logs generated by Glue crawlers:
Open the AWS CloudWatch Console and navigate to Logs > Log groups in the sidebar.
Locate and select the log group named /aws-glue/crawlers.
Click on the log stream corresponding to your crawler, named sf_safety_data_crawler.
Review the logs to see details such as the number of partitions added or any new tables created.
Now, we will create a metric filter for these logs:
In the log stream, use the Filter events search bar to look for the keyword ERROR. For our crawler, there should be no errors yet, provided all the steps during crawler creation were followed correctly.
Click the Create metric filter button in the top right corner.
Define the metric filter to count occurrences of the word ERROR:
Filter name: CrawlerErrors
Metric namespace: Create new, name it DataMesh/Crawlers
Metric name: CrawlerErrorCount
Metric value: 1
Default value: 0
Unit: None
Return to the log group and select the Metric filters tab to confirm our filter is active. This metric will now track the count of error keywords in our crawler logs.
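If you prefer to script this step, a minimal boto3 sketch of the same metric filter could look like the following. The names mirror the console walkthrough above; the region is an assumption, so adjust it to your setup.

import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Names match the console walkthrough above.
logs.put_metric_filter(
    logGroupName="/aws-glue/crawlers",
    filterName="CrawlerErrors",
    filterPattern="ERROR",                  # match any log event containing the keyword
    metricTransformations=[{
        "metricNamespace": "DataMesh/Crawlers",
        "metricName": "CrawlerErrorCount",
        "metricValue": "1",                 # emit 1 for every matching event
        "defaultValue": 0.0,                # report 0 when nothing matches
        "unit": "None",
    }],
)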
Step 2: Creating an Alarm and Notification
With the metric filter in place, we can now set up an alarm to notify us if any errors are detected by our Glue crawler.
Wait a few minutes for the metric to populate.
From the Metric filters tab, select the metric we created and click Create alarm.
In the alarm configuration window:
Set the Period to 5 minutes (or adjust as needed).
For the threshold, choose Greater than and set the value to 0 (this will trigger the alarm if any error is detected).
💡We can also select Greater/Equal as the option with a threshold value of 1.
Under Additional configuration, for Missing data treatment, select Maintain the alarm state.
This ensures that when there is no data, the alarm keeps its current state.
Click Next to move to notifications.
For Alarm state trigger, keep it as In alarm.
For notifications, choose Create new SNS topic.
Name the topic, for example, CrawlerErrorsTopic.
Enter your email address in the email field.
Click Create topic.
(Optional) If you want notifications when the alarm returns to OK, add another notification for that state.
Scroll down and click Next.
Name the alarm, such as CrawlerErrorAlarm, then click Next.
Review your configuration and click Create alarm.
Our alarm is now created. It may show as "Insufficient data" for a few minutes, then it will move to "OK" once the metric updates.
After creating the SNS topic, we will receive an email from AWS at the address we provided. Open the email and click Confirm subscription to start receiving alarm notifications.
This completes our notification setup for crawler failures, ensuring we receive an email alert whenever an error is detected in our Glue crawler logs.
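For repeatable setups, the SNS topic and alarm can also be created with boto3. This is a minimal sketch assuming the names used above and a placeholder email address; TreatMissingData set to "ignore" corresponds to the Maintain the alarm state option in the console.

import boto3

sns = boto3.client("sns", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholder email; the subscription still has to be confirmed from the inbox.
topic_arn = sns.create_topic(Name="CrawlerErrorsTopic")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="you@example.com")

cloudwatch.put_metric_alarm(
    AlarmName="CrawlerErrorAlarm",
    Namespace="DataMesh/Crawlers",
    MetricName="CrawlerErrorCount",
    Statistic="Sum",
    Period=300,                        # 5-minute evaluation period
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="ignore",         # keep the current alarm state when there is no data
    AlarmActions=[topic_arn],
)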
Step 3: Testing our Alarm
In Lake Formation, revoke the Data Location permission for our crawler’s IAM role.
Upload a few partition folders to our datalake bucket.
Allow the crawler to run. When it fails due to missing permissions, we should receive an email notification within a few minutes.
This test confirms that our monitoring and alerting setup is working as expected.
Example errors: failures due to schema complexity or connector issues are among the most common.
Tips for Troubleshooting Crawler Failures
Check CloudWatch logs for detailed error messages.
Segment large schemas and run crawlers on smaller parts to isolate issues.
Verify JDBC driver compatibility if connecting to external databases.
Increase timeouts or resources for complex crawls.
Re-run failed crawlers after addressing issues to confirm resolution.
Auditing: CloudTrail Setup
Now that we have established proactive monitoring for Glue crawler errors using CloudWatch, the next step is to effectively track and analyze activity in our data mesh. To do so, it is essential to capture a comprehensive audit trail of all activity within our data mesh environment. AWS CloudTrail provides this visibility by recording API calls and resource modifications across our AWS environment. In this phase, we will set up CloudTrail to log activity, organize metadata in Glue, enforce access controls, and analyze events using Athena.
What we will do in this phase:
Set up AWS CloudTrail to record all management events and store logs in a secure S3 bucket for auditing and compliance.
Create a dedicated Glue database and table to catalog CloudTrail logs, making them searchable in Athena.
Restrict access to the Glue database and table so only administrators can view or modify audit data, ensuring sensitive logs remain protected.
Use Athena to run SQL queries on CloudTrail logs for deep analysis, such as tracking user actions, identifying changes, and supporting investigations.
Capturing and organizing CloudTrail logs allows us to maintain a comprehensive, tamper-evident record of all activity in our data mesh. By restricting access, we protect sensitive audit data from unauthorized users. Querying these logs with Athena empowers us to quickly answer critical questions, troubleshoot issues, and demonstrate compliance with security and governance requirements.
Step 1: Allow Access to KMS Key from CloudTrail
Since we are encrypting our CloudTrail logs, CloudTrail needs permission to use our KMS key.
Navigate to the AWS KMS Console, select Customer managed keys, and choose our KMS key.
Under the Key policy tab, click Edit.
Add a new statement to the policy that grants CloudTrail the necessary permissions. We can find a sample policy statement at the provided GitHub link.
Make sure not to modify or remove any existing statements in your key policy.
Click Save changes.
Tip: If we create the KMS key from the CloudTrail console, the required policy is added automatically. As we created ours via the KMS console, this manual step is necessary.
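As a rough illustration of what that key-policy edit looks like, here is a boto3 sketch that appends a typical CloudTrail statement to the existing policy. The statement shown is only an example of the usual CloudTrail grant; use the exact statement from the GitHub link for your setup, and note that the key ID and account ID are placeholders.

import json
import boto3

kms = boto3.client("kms", region_name="us-east-1")
key_id = "<kms_key_id>"          # placeholder: the key used for the trail
account_id = "<account_id>"      # placeholder

# Fetch the existing policy so no current statements are lost.
policy = json.loads(kms.get_key_policy(KeyId=key_id, PolicyName="default")["Policy"])

# Illustrative CloudTrail statement; take the exact one from the GitHub link.
policy["Statement"].append({
    "Sid": "AllowCloudTrailEncryptLogs",
    "Effect": "Allow",
    "Principal": {"Service": "cloudtrail.amazonaws.com"},
    "Action": ["kms:GenerateDataKey*", "kms:DescribeKey"],
    "Resource": "*",
    "Condition": {
        "StringLike": {
            "kms:EncryptionContext:aws:cloudtrail:arn": f"arn:aws:cloudtrail:*:{account_id}:trail/*"
        }
    },
})

kms.put_key_policy(KeyId=key_id, PolicyName="default", Policy=json.dumps(policy))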
Step 2: Create S3 Bucket for Audit Logs
We will create a dedicated S3 bucket to store our CloudTrail audit logs. Using a dedicated bucket for audit logs helps keep our logs organized and secure. We can further restrict access to this bucket as needed.
Follow the steps given in previous parts to create an S3 bucket and name it aws-data-mesh-audit-trail-<account_id>-<region>. Add a bucket policy to this bucket as well.
Step 3: Setting Up CloudTrail
Navigate to the AWS CloudTrail Console. From the sidebar, click Dashboard, then click Create trail.
Let’s name the trail data-mesh-audit-trail.
For Storage location, select Use existing S3 bucket and browse to the bucket created in the previous step.
We will use our existing KMS key: select Existing and choose the key.
Leave the remaining settings at their default values and click Next.
For Event type, select only Management events, to avoid unnecessary costs from data and insight events.
Select Exclude AWS KMS events and Exclude Amazon RDS Data API events to filter out high-volume, less relevant events.
Click Next, review the configuration, and click Create trail.
Go to the S3 Console and open the bucket we created. We should now see that audit logs are being saved to this bucket.
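For reference, a boto3 sketch of the same trail configuration might look like this; the bucket name and KMS key ARN are placeholders, and the console walkthrough above remains the source of truth.

import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")
bucket = "aws-data-mesh-audit-trail-<account_id>-<region>"   # placeholder bucket name

# Create the trail and encrypt log files with the existing KMS key.
cloudtrail.create_trail(
    Name="data-mesh-audit-trail",
    S3BucketName=bucket,
    KmsKeyId="<kms_key_arn>",        # placeholder: the key CloudTrail was granted access to
    IsMultiRegionTrail=True,
)

# Log management events only, excluding the high-volume sources chosen in the console.
cloudtrail.put_event_selectors(
    TrailName="data-mesh-audit-trail",
    EventSelectors=[{
        "ReadWriteType": "All",
        "IncludeManagementEvents": True,
        "ExcludeManagementEventSources": ["kms.amazonaws.com", "rdsdata.amazonaws.com"],
    }],
)

cloudtrail.start_logging(Name="data-mesh-audit-trail")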
Step 4: Setup Glue Database & Table
Follow the steps from the previous part to create a new Glue database named data-mesh-logging, with the location set to s3://aws-data-mesh-audit-trail-<account_id>-<region>/AWSLogs/<account_id>/.
For the CloudTrail table, we will run a DDL statement in Athena with partition projection.
Before running the DDL to create the table, we need to perform the following preparatory steps:
Register our CloudTrail S3 bucket in Lake Formation to enable data access control.
Add a new policy named CloudTrailBucketAccess to the LFServiceRole. Copy the policy from the provided GitHub link.
Grant the Administrator user access to data-mesh-logging in Lake Formation.
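If you prefer to script the registration and the grant (the CloudTrailBucketAccess policy itself still comes from the GitHub link), a rough boto3 sketch could look like the following. The resource names match this walkthrough, but the exact permission set granted to the Administrator user is an assumption, so align it with your governance model.

import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

# Register the audit-trail bucket as a Lake Formation data location,
# using the LFServiceRole that now carries the CloudTrailBucketAccess policy.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::aws-data-mesh-audit-trail-<account_id>-<region>",
    RoleArn="arn:aws:iam::<account_id>:role/LFServiceRole",
)

# Grant the Administrator user access to the logging database.
# "ALL" is an assumed permission set; narrow it down if your model requires.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::<account_id>:user/Administrator"},
    Resource={"Database": {"Name": "data-mesh-logging"}},
    Permissions=["ALL"],
)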
Create the table via a DDL statement:
Navigate to the AWS Athena Console.
Select data-mesh-consumer as the workgroup.
Run the following DDL statement:
Replace <region> with the region in which we are building this setup. For me, it is us-east-1.
Replace <account_id> with our account ID.
CREATE EXTERNAL TABLE cloudtrail_logs (
eventVersion STRING,
userIdentity STRUCT<
type: STRING,
principalId: STRING,
arn: STRING,
accountId: STRING,
invokedBy: STRING,
accessKeyId: STRING,
userName: STRING,
sessionContext: STRUCT<
attributes: STRUCT<
mfaAuthenticated: STRING,
creationDate: STRING
>,
sessionIssuer: STRUCT<
type: STRING,
principalId: STRING,
arn: STRING,
accountId: STRING,
userName: STRING
>
>
>,
eventTime STRING,
eventSource STRING,
eventName STRING,
awsRegion STRING,
sourceIpAddress STRING,
userAgent STRING,
errorCode STRING,
errorMessage STRING,
requestParameters STRING,
responseElements STRING,
additionalEventData STRING,
requestId STRING,
eventId STRING,
resources ARRAY<STRUCT<
ARN: STRING,
accountId: STRING,
type: STRING
>>,
eventType STRING,
apiVersion STRING,
readOnly BOOLEAN,
recipientAccountId STRING,
serviceEventDetails STRING,
sharedEventID STRING,
vpcEndpointId STRING
)
COMMENT 'CloudTrail table with JsonSerDe and Partition Projection'
PARTITIONED BY (
`region` string,
`year` string,
`month` string,
`day` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://aws-data-mesh-audit-trail-<account_id>-<region>/AWSLogs/<account_id>/CloudTrail/'
TBLPROPERTIES (
'projection.enabled' = 'true',
'projection.region.type' = 'enum',
'projection.region.values' = '<region>',
'projection.year.type' = 'integer',
'projection.year.range' = '2025,2040',
'projection.month.type' = 'integer',
'projection.month.range' = '01,12',
'projection.month.digits' = '2',
'projection.day.type' = 'integer',
'projection.day.range' = '01,31',
'projection.day.digits' = '2',
'storage.location.template' = 's3://aws-data-mesh-audit-trail-<account_id>-<region>/AWSLogs/<account_id>/CloudTrail/${region}/${year}/${month}/${day}/'
);
Explanation of Partition Projection Properties:
projection.enabled: Activates partition projection for this table.
projection.<partition_key>.type: Defines the data type (integer, enum, date, or injected).
projection.<partition_key>.range: Specifies the valid range of values for the partition key. Athena uses this to calculate valid partition locations.
storage.location.template: This is the most important property. It defines the S3 path template that Athena uses to find our data. The ${variable} placeholders correspond to the partition keys defined in the PARTITIONED BY clause.
Step 5: Analyzing Activity with Athena
Athena lets you run SQL queries directly on CloudTrail logs stored in S3. This is invaluable for historical audits and answering questions like "Who deleted a table?" or "When were permissions changed?"
Let’s go to the Athena Console and run a few sample queries.
Example Queries:
List all Lake Formation access:
SELECT eventTime,
       eventName,
       userIdentity,
       params.tableArn AS tableArn,
       params.permissions AS permissions
FROM (
    SELECT eventTime,
           eventName,
           userIdentity.arn AS userIdentity,
           CAST(
               json_parse(requestparameters) AS ROW(
                   tableArn VARCHAR,
                   permissions ARRAY<VARCHAR>,
                   auditContext ROW(
                       additionalAuditContext VARCHAR,
                       cellLevelSecurityEnforced BOOLEAN,
                       expectedTableId VARCHAR
                   )
               )
           ) AS params
    FROM cloudtrail_logs
    WHERE region = 'us-east-1'
      AND year = '2025'
      AND month = '06'
      AND day BETWEEN '21' AND '30'
      AND eventName = 'GetDataAccess'
)
ORDER BY eventTime DESC
Find all permission grants:
SELECT eventTime,
       eventName,
       userIdentity.arn AS userIdentity,
       requestparameters
FROM cloudtrail_logs
WHERE region = 'us-east-1'
  AND year = '2025'
  AND month = '06'
  AND day BETWEEN '21' AND '30'
  AND eventName = 'BatchGrantPermissions'
Capturing Real Time Governance Events with EventBridge
While CloudTrail provides an essential historical record for auditing, many security and operational use cases require immediate action. To achieve this, we will use Amazon EventBridge to build rules that filter the stream of events from CloudTrail. This allows us to automatically capture critical governance changes, such as a table being deleted or permissions being granted, the moment they happen. This capability is the foundation for building automated alerts, security responses, and the resilient, self healing data platforms we will explore in future parts.
What we will do in this phase:
Create two distinct EventBridge rules to monitor different types of activity.
Our first rule, created using the AWS Management Console, will capture critical write operations from AWS Glue and Lake Formation, such as creating or deleting tables and changing permissions.
Our second rule, created using the AWS CLI, will specifically monitor read access requests (GetDataAccess) from Lake Formation, which helps in auditing who is querying the data.
For both rules, we will configure a CloudWatch log group as the target to create dedicated, real time log streams for each type of event.
Setting up EventBridge to capture governance events provides immediate visibility into critical changes across our data mesh. Instead of waiting to discover an issue during a periodic audit, we can log it, or even act on it, the moment it occurs. This enhances our security posture by enabling rapid investigation of unauthorized activity and provides an immutable log for compliance. Most importantly, it creates the event driven foundation needed for future automation, such as triggering alerts or disaster recovery workflows.
Step 1: Capturing Write Operations
First, we will initiate the rule creation process from the EventBridge console.
Navigate to the Amazon EventBridge console.
From the sidebar, select Rules and then click Create rule.
Name the rule DataMeshWriteEvents.
For Rule type, ensure Rule with an event pattern is selected, and click Next.
For Event source, select AWS events or EventBridge partner events.
Scroll down to the Event pattern section.
Select Custom patterns (JSON editor) and paste the following JSON pattern to target write operations.
{ "source": ["aws.glue", "aws.lakeformation"], "detail-type": ["AWS API Call via CloudTrail"], "detail": { "eventName": [ "CreateTable", "UpdateTable", "DeleteTable", "BacthDeleteTable", "CreatePartition", "BatchCreatePartition", "DeletePartition", "BatchDeletePartition", "UpdatePartition", "BatchUpdatePartition", "GrantPermissions", "BatchGrantPermissions", "RevokePermissions", "BatchRevokePermissions" ] } }
Click Next.
In the Target 1 box, select AWS service.
Under Select a target, choose CloudWatch log group.
For Log group, let’s create a dedicated log group for these events and name it /aws/events/datamesh-write-events.
Click Next, review the configuration, and click Create rule.
Next, we will create the rule for read events and then test both rules.
Step 2: Capturing Data Access Operations
Since EventBridge rules for read events like GetDataAccess cannot be created directly through the AWS Management Console, we will use the AWS CLI/Python to create this rule.
Note: capturing these events relies on CloudTrail recording read-only management events, which the trail we created earlier already does.
To create this rule, we will use AWS CloudShell, a convenient, browser based environment that comes preconfigured with the AWS CLI and Python.
What the Script Will Do
We have provided two scripts that perform the same five actions. We only need to run one of them. The script will:
Define the event pattern for GetDataAccess events.
Create the EventBridge rule.
Create a dedicated CloudWatch Log Group.
Set the log group as the target for the rule.
Add a resource policy to the log group to grant EventBridge permission to write logs. This is a crucial step that the console handles automatically but must be done manually when using a script.
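Before downloading the real script, here is a condensed boto3 sketch of those five actions, shown purely for orientation. It assumes the rule name DataMeshAccessRule and the log group name used later in this part; the full scripts on GitHub remain the reference.

import json
import boto3

region = "<region>"          # placeholders, as in the downloadable scripts
account_id = "<account_id>"
rule_name = "DataMeshAccessRule"
log_group = "/aws/events/datamesh_access_log_group"

events = boto3.client("events", region_name=region)
logs = boto3.client("logs", region_name=region)

# 1. Event pattern for Lake Formation read access requests.
pattern = {
    "source": ["aws.lakeformation"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {"eventName": ["GetDataAccess"]},
}

# 2. Create the EventBridge rule.
events.put_rule(Name=rule_name, EventPattern=json.dumps(pattern), State="ENABLED")

# 3. Create a dedicated CloudWatch log group (skip if it already exists).
try:
    logs.create_log_group(logGroupName=log_group)
except logs.exceptions.ResourceAlreadyExistsException:
    pass

# 4. Allow EventBridge to write to the log group (the console does this automatically).
logs.put_resource_policy(
    policyName="EventBridgeToCWLogs",
    policyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": ["events.amazonaws.com", "delivery.logs.amazonaws.com"]},
            "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:*",
        }],
    }),
)

# 5. Point the rule at the log group.
events.put_targets(
    Rule=rule_name,
    Targets=[{
        "Id": "datamesh-access-log-target",
        "Arn": f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}",
    }],
)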
First, launch a new AWS CloudShell session.
Follow these general steps, using the specific link and filename for the chosen script from the options provided below.
Option A: Shell Script → GitHub Link, File Name → create_event_bridge_rule.sh
Option B: Python Script → GitHub Link, File Name → create_event_bridge_rule.py
Download the script from GitHub using curl.
# Replace [GitHub Link] with the link for your chosen script
curl -O [GitHub Link]
Edit the Placeholders. Replace the <region> and <account_id> placeholders. Here are two ways to do this:
Option A: Using sed for Quick Replacement (Recommended)
The sed command finds and replaces text directly from the command line.
# Replace placeholders in your script file.
# IMPORTANT: Update your-aws-region and your-aws-account-id with your actual values.
sed -i 's/<region>/your-aws-region/g' [script_filename]
sed -i 's/<account_id>/your-aws-account-id/g' [script_filename]
Option B: Using a Text Editor
If you prefer to edit the file interactively, you can use a command line editor like nano or vim.
# Open the file with your preferred editor
nano [script_filename]
# OR
vim [script_filename]
In nano, use Ctrl + \ to search and replace.
In vim, use the command :%s/<placeholder>/your-value/g to replace all instances.
Run the Script
chmod +x [script_filename]   # Make the script executable
# If you chose the shell script, run:
./[script_filename]
# If you chose the Python script, run:
python3 [script_filename]
After running either script, our DataMeshAccessRule rule will be active and logging data access events to the specified CloudWatch log group, providing a clean and organized way to monitor data access patterns.
Step 3: Testing the EventBridge Rules
With both EventBridge rules active and logging to their respective CloudWatch log groups, we now have a structured and efficient way to monitor governance changes and data access events.
To verify the setup, perform the following tests:
Test Rule 1 (Write Operations): Revoke a permission in Lake Formation to trigger a governance event. We can then check the /aws/events/datamesh-write-events log group to see the captured event. Remember to restore the revoked permission afterward to maintain proper access.
Test Rule 2 (Read Access): Log in as the DataMeshUser and execute a query in the Athena Query Editor. This will generate a GetDataAccess event, which we can find in the /aws/events/datamesh_access_log_group log group.
With our EventBridge rules now active, we have powerful, real time visibility into critical management and governance events across our data platform. We can see when tables are altered or permissions are granted the moment it happens.
However, this only tells part of the story. To get a complete picture of our data lake’s activity, we need to go one level deeper and monitor who is accessing the individual data objects within our S3 buckets.
Auditing Object Level Access with S3 Server Access Logs
In this next section, we will set up S3 Server Access Logging to capture a detailed, object level audit trail. This will complement our management event monitoring by providing a granular record of every file interaction, which is essential for comprehensive security analysis and usage auditing.
S3 Server Access Logs can tell us who is accessing the individual files within that table, how frequently, and from where. This provides a granular, low level audit trail that is invaluable for security analysis, performance tuning, and understanding data usage patterns.
Why S3 Server Access Logs Instead of CloudTrail Data Events?
A common question is why we would use S3 Server Access Logs when CloudTrail Data Events can also track object level operations. While both are powerful, they serve different primary purposes, and for our goal of comprehensive, cost effective auditing, S3 Server Access Logs are the better choice for several reasons:
Cost: This is the most significant differentiator. S3 Server Access Logging is free, with costs incurred only for storing the log files in S3. CloudTrail Data Events, on the other hand, incur a direct cost for every event processed, which can become very expensive for high traffic buckets.
Log Detail: S3 access logs capture specific operational details not found in CloudTrail, such as object size, total request time, turn around time, and the HTTP referrer. They are also the only way to log authentication failures, which are critical for detecting security threats.
Comprehensive Record: S3 logging provides a detailed, best effort record of all requests made directly to your S3 objects, including those from presigned URLs, which may not always appear in CloudTrail.
While CloudTrail data events are delivered faster (within 5 minutes vs. a few hours for S3 logs) and integrate seamlessly with CloudWatch Events, their primary strength is for real time, high priority security alerts across multiple AWS services. For detailed, historical, and cost effective analysis of data access patterns, S3 Server Access Logging is the superior tool.
What we will do in this phase:
Enable S3 Server Access Logging on our data lake bucket to capture all object level requests.
Configure a separate S3 bucket as the destination for these logs to maintain organization and prevent logging loops.
Create a new database and a partitioned table in AWS Athena to make these raw log files easily analyzable with standard SQL.
Enabling S3 Server Access Logging gives us a complete and cost effective audit trail of every GET, PUT, and DELETE request made to our data objects. By structuring these logs in Athena, we empower our security, operations, and analytics teams to troubleshoot access issues, analyze usage patterns, optimize performance, and meet strict compliance requirements without running up the significant costs associated with high volume event processing.
Step 1: Enable Server Access Logging on Your S3 Bucket
First, we will configure our main data lake bucket to send access logs to a separate destination bucket.
Navigate to the Amazon S3 console.
Create a new S3 bucket to serve as the destination for our logs; we will name it aws-data-mesh-access-logs-<account_id>-<region>. This bucket must be in the same region as our source bucket and should not have logging enabled on it, to avoid recursive logging.
💡For this bucket, keep encryption as SSE-S3, as server access logging doesn’t support SSE-KMS as of now.
Go to the properties of our datalake S3 bucket.
Find the Server access logging section and click Edit.
Enable logging and select the destination bucket we just created. We can optionally specify a target prefix to keep the logs organized within the bucket. For now, we will keep it blank.
For Log object key format, select [SourceAccountId]/[SourceRegion]/[SourceBucket]/[YYYY]/[MM]/[DD]/[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[UniqueString], as this will help with partitioning.
Click Save changes. Amazon S3 will automatically update the destination bucket's policy to allow the S3 logging service principal to write log files to it.
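If you want to automate this step, a boto3 sketch of the same logging configuration could look like the following. The bucket names are placeholders, the date-based key format requires a recent boto3 version, and note that, unlike the console, the API call does not update the destination bucket's policy for you.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

source_bucket = "<datalake_bucket>"                                 # placeholder
log_bucket = "aws-data-mesh-access-logs-<account_id>-<region>"      # placeholder

# Enable server access logging with date-based (partitioned) log object keys.
# The destination bucket must separately allow the logging.s3.amazonaws.com principal to write.
s3.put_bucket_logging(
    Bucket=source_bucket,
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": log_bucket,
            "TargetPrefix": "",  # kept blank, as in the console walkthrough
            "TargetObjectKeyFormat": {
                "PartitionedPrefix": {"PartitionDateSource": "EventTime"}
            },
        }
    },
)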
Step 2: Create a Table in Athena
To analyze the logs, we will define a table structure in Athena that maps to the S3 access log format. Just as we did for the CloudTrail logs, we will use the same database to create this table.
Open the Athena console.
Run the following CREATE TABLE query to define the table structure.
Replace <region> with the region in which we are building this setup. For me, it is us-east-1.
Replace <account_id> with our account ID.

CREATE EXTERNAL TABLE `s3_access_logs`(
  `bucketowner` STRING,
  `bucket_name` STRING,
  `requestdatetime` STRING,
  `remoteip` STRING,
  `requester` STRING,
  `requestid` STRING,
  `operation` STRING,
  `key` STRING,
  `request_uri` STRING,
  `httpstatus` STRING,
  `errorcode` STRING,
  `bytessent` BIGINT,
  `objectsize` BIGINT,
  `totaltime` STRING,
  `turnaroundtime` STRING,
  `referer` STRING,
  `useragent` STRING,
  `versionid` STRING,
  `hostid` STRING,
  `sigv` STRING,
  `ciphersuite` STRING,
  `authtype` STRING,
  `endpoint` STRING,
  `tlsversion` STRING,
  `accesspointarn` STRING,
  `aclrequired` STRING
)
PARTITIONED BY (
  `year` INT,
  `month` INT,
  `day` INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$'
)
LOCATION 's3://aws-data-mesh-access-logs-<account_id>-<region>/<account_id>/<region>/<datalake_bucket>/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2025,2040',
  'projection.month.type' = 'integer',
  'projection.month.range' = '01,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '01,31',
  'projection.day.digits' = '2',
  'storage.location.template' = 's3://aws-data-mesh-access-logs-<account_id>-<region>/<account_id>/<region>/<datalake_bucket>/${year}/${month}/${day}/'
);
With our s3_access_logs table set up, we can now run highly efficient queries by filtering on the partition keys (year, month, day) in the WHERE clause. This is crucial for controlling costs and improving query performance, as it allows Athena to calculate the exact S3 locations to scan without listing all partitions.
Step 3: Analyzing Activity with Athena
Remember to adjust the year, month, and day values in the WHERE clause to match the time range you wish to analyze.
1. Top 10 IP Addresses by Request Count
This query identifies the top 10 IP addresses making requests on a specific day, which is useful for identifying top traffic sources or potential security anomalies.
SELECT remoteip,
COUNT(*) AS request_count
FROM s3_access_logs
WHERE year = 2025
AND month = 06
AND day = 22
GROUP BY remoteip
ORDER BY request_count DESC
LIMIT 10;
2. Count Requests by HTTP Status Code
This helps us monitor the health of our S3 access by identifying the frequency of successful requests (2xx), client-side errors (4xx), or server-side errors (5xx).
SELECT httpstatus,
COUNT(*) AS status_count
FROM s3_access_logs
WHERE year = 2025
AND month = 06
AND day = 22
GROUP BY httpstatus
ORDER BY status_count DESC;
3. Find All Requests for a Specific Object
Use this for detailed auditing of every access attempt on a single, specific file on a given day.
SELECT *
FROM s3_access_logs
WHERE key = 'path/to/your/specific-object.csv'
AND year = 2025
AND month = 06
AND day = 22;
4. Count Requests by Operation Type
This query shows the breakdown of operations (e.g., REST.GET.OBJECT, REST.PUT.OBJECT, REST.DELETE.OBJECT) performed on our data, helping us understand usage patterns.
SELECT operation,
COUNT(*) AS operation_count
FROM s3_access_logs
WHERE year = 2025
AND month = 06
AND day = 22
GROUP BY operation
ORDER BY operation_count DESC;
5. List Unique Users Accessing Data
This query lists the unique IAM users or roles that accessed data on a given day, providing a clear audit of who is interacting with our data.
SELECT DISTINCT requester
FROM s3_access_logs
WHERE year = 2025
AND month = 06
AND day = 22;
Summary
In this part, we built a comprehensive observability framework, transforming our data mesh from a black box into a transparent, auditable, and secure platform. We now have the tools and knowledge to monitor every layer of our data ecosystem, from high level operational health down to individual file access.
Specifically, we learned how to:
Proactively Monitor Glue Crawlers by creating CloudWatch metric filters and alarms to get immediate notifications for crawler failures.
Establish a Complete Audit Trail of all API calls using AWS CloudTrail and analyze these logs with Athena to answer critical security and operational questions.
Capture Critical Events in Real Time with Amazon EventBridge, setting up separate rules for both governance changes and data access requests.
Implement Granular Object Level Logging in a cost effective way using S3 Server Access Logs, and build an automated partition projection table in Athena for easy analysis.
What's Next: A Security Challenge for You
With our observability framework in place, we've made huge strides in meeting our security responsibilities. According to the AWS Shared Responsibility Model, while AWS secures the cloud infrastructure itself, we are responsible for securing what we put in the cloud. Our current setup provides visibility, but visibility alone isn't prevention.
So, before we move on, I want to pose a challenge to you: Looking at the architecture we have just built, what potential security gaps or misconfigurations might still exist? Specifically, Network Security.
Network security is a critical layer of defense that protects data in transit, controls access, and ensures secure communication between services and domains. This moves us from monitoring activity to proactively controlling access, a key aspect of the "security in the cloud" model.
In our next blog, we will dive deep into hardening our platform by addressing this very question. We will start by locking down our network and implementing stricter encryption controls to ensure our data mesh is not just observable, but truly resilient and secure from the ground up.
Stay tuned as we build a secure network foundation, ensuring our data is accessed not just by the right identities, but also from the right network locations.
If you have questions, run into issues, or want to share your results, please leave a comment below. Your feedback and insights help us all build better data platforms together.