Part 4 – Auditing & Monitoring in Data Mesh

Harshit Garg

Recap: Where We Stand

Welcome back to the AWS Data Mesh series. So far, we have designed, built, and optimized a scalable data mesh architecture using AWS native services. If you are just joining, you can start from the first part of the series here.

Why Monitoring and Auditing Matter

As our data mesh grows and becomes central to business operations, the risks of operating without proper monitoring and auditing increase dramatically. Without these capabilities, we may face:

  • Undetected Failures: Glue crawlers or ETL jobs can fail silently, leading to stale or missing data that impacts analytics and decision making.

  • Security Blind Spots: Unauthorized changes to tables or permissions may go unnoticed, exposing sensitive data or disrupting operations.

  • Compliance Risks: Without a clear audit trail, demonstrating regulatory compliance becomes difficult and time consuming.

  • Delayed Troubleshooting: Identifying the root cause of issues is challenging without detailed logs and real time alerts.

  • Lack of Accountability: When actions are not tracked, it is hard to know who made critical changes or when they occurred.

To address these challenges, in this part we shift our attention to observability, which will make our data mesh transparent, secure, and operationally robust. We will set up monitoring for Glue crawlers, audit activity with CloudTrail, analyze logs using Athena, and capture key governance events with EventBridge. By the end, we will have a hands on framework for auditing and monitoring, and a foundation for future enhancements such as disaster recovery and resilience.


Monitoring Glue Crawlers with CloudWatch

Before we can respond to issues in our data mesh, we first need to know when and where they happen. AWS Glue crawlers play a critical role in discovering and cataloging our data, but failures can easily go unnoticed without proper monitoring. By leveraging CloudWatch logs and metric filters, we can automatically detect crawler errors and receive instant notifications, helping us maintain data freshness and reliability.

What we are about to do:

  • Explore AWS Glue crawler logs in CloudWatch for real time and historical monitoring.

  • Create a metric filter in CloudWatch to automatically scan crawler logs for error patterns, such as the keyword "ERROR".

  • Set up a custom CloudWatch metric that increments each time an error is detected in the logs.

  • Use this metric to trigger alarms and notifications, ensuring immediate awareness of crawler failures.

This setup will give us real time visibility into the health of our Glue crawlers. By automatically detecting and alerting us to errors, we can address issues quickly before they impact our data mesh’s reliability. This proactive monitoring reduces manual effort, helps maintain data quality, and ensures our team can respond rapidly to operational problems.

Step 1: Setting Up CloudWatch Logs and Metric Filters

AWS Glue automatically sends crawler logs to CloudWatch, making it easy to monitor crawler activity and issues.
Let’s first explore the logs generated by Glue crawlers.

  1. Open the AWS CloudWatch Console and navigate to Logs > Log groups in the sidebar.

  2. Locate and select the log group named /aws-glue/crawlers

  3. Click on the log stream corresponding to your crawler, sf_safety_data_crawler.

  4. Review the logs to see details such as the number of partitions added or any new tables created.

Now, we will create a metric filter for these logs.

  1. In the log stream, use the Filter events search bar to look for the keyword ERROR. For our crawler, there should be no errors at this point if all the steps during crawler creation were followed correctly.

  2. Click the Create Metric filter button in the top right corner.

  3. Define the metric filter to count occurrences of the word ERROR:

    1. Filter name: CrawlerErrors.

    2. Metric namespace: Create new, name it DataMesh/Crawlers.

    3. Metric name: CrawlerErrorCount.

    4. Metric value: 1.

    5. Default value: 0.

    6. Unit: None.

Return to the log group and select the Metric filters tab to confirm our filter is active. This metric will now track the count of error keywords in our crawler logs.
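
For those who prefer to script this step, here is a minimal boto3 sketch equivalent to the console workflow above; it assumes the default /aws-glue/crawlers log group and the filter, namespace, and metric names we just chose.

import boto3

logs = boto3.client("logs")

# Emit 1 to DataMesh/Crawlers::CrawlerErrorCount whenever the term ERROR
# appears in the Glue crawler log group; report 0 when nothing matches.
logs.put_metric_filter(
    logGroupName="/aws-glue/crawlers",
    filterName="CrawlerErrors",
    filterPattern="ERROR",
    metricTransformations=[
        {
            "metricName": "CrawlerErrorCount",
            "metricNamespace": "DataMesh/Crawlers",
            "metricValue": "1",
            "defaultValue": 0.0,
        }
    ],
)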

Step 2: Creating an Alarm and Notification

With the metric filter in place, we can now set up an alarm to notify us if any errors are detected by our Glue crawler.

  1. Wait a few minutes for the metric to populate.

  2. From the Metric filters tab, select the metric we created and click Create alarm.

  3. In the alarm configuration window:

    • Set the Period to 5 minutes (or adjust as needed).

    • For the threshold, choose Greater than and set the value to 0 (this will trigger the alarm if any error is detected).

      💡
      We could also select Greater/Equal with a threshold value of 1.
    • Under Additional configuration, select Maintain the alarm state for missing data treatment.
      This ensures that when no data points are reported, the alarm keeps its current state.

  4. Click Next to move to notifications.

  5. For Alarm state trigger, keep it as In alarm.

  6. For notifications, choose Create new SNS topic.

    • Name the topic, for example, CrawlerErrorsTopic.

    • Enter your email address in the email field.

    • Click Create topic.

  7. (Optional) If you want notifications when the alarm returns to OK, add another notification for that state.

  8. Scroll down and click Next.

  9. Name the alarm, such as CrawlerErrorAlarm, then click Next.

  10. Review your configuration and click Create alarm.

Our alarm is now created. It may show as "Insufficient data" for a few minutes, then it will move to "OK" once the metric updates.

After creating the SNS topic, we will receive an email from AWS at the address we provided. Open the email and click Confirm subscription to start receiving alarm notifications.

This completes our notification setup for crawler failures, ensuring we receive an email alert whenever an error is detected in our Glue crawler logs.
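
The same alarm and subscription can be sketched with boto3 as shown below; the email address is a placeholder, the Sum statistic is an assumption, and TreatMissingData set to ignore mirrors the Maintain the alarm state option selected above.

import boto3

sns = boto3.client("sns")
cloudwatch = boto3.client("cloudwatch")

# Create the SNS topic and subscribe an email address (placeholder); AWS will
# send a confirmation email that must be accepted before notifications flow.
topic_arn = sns.create_topic(Name="CrawlerErrorsTopic")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="you@example.com")

# Alarm whenever the error count rises above 0 in a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="CrawlerErrorAlarm",
    Namespace="DataMesh/Crawlers",
    MetricName="CrawlerErrorCount",
    Statistic="Sum",                     # assumption: sum errors over the period
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="ignore",           # "Maintain the alarm state" in the console
    AlarmActions=[topic_arn],
)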

Step 3: Testing our Alarm

  1. In Lake Formation, revoke the Data Location permission for our crawler’s IAM role.

  2. Upload a few partition folders to our datalake bucket.

  3. Allow the crawler to run. When it fails due to missing permissions, we should receive an email notification within a few minutes.

This test confirms that our monitoring and alerting setup is working as expected.

💡
Make sure to restore the Data Location permission for your crawler role in Lake Formation after you receive the notification, so your crawler can run successfully in future operations.

Example errors: Failures due to schema complexity or connector issues are the most common.

Tips for Troubleshooting Crawler Failures

  • Check CloudWatch logs for detailed error messages.

  • Segment large schemas and run crawlers on smaller parts to isolate issues.

  • Verify JDBC driver compatibility if connecting to external databases.

  • Increase timeouts or resources for complex crawls.

  • Re-run failed crawlers after addressing issues to confirm resolution.


Auditing: CloudTrail Setup

Now that we have established proactive monitoring for Glue crawler errors using CloudWatch, the next step is to capture a comprehensive audit trail of all activity within our data mesh environment so that we can track and analyze it effectively. AWS CloudTrail provides this visibility by recording API calls and resource modifications across our AWS environment. In this phase, we will set up CloudTrail to log activity, organize metadata in Glue, enforce access controls, and analyze events using Athena.

What we will do in this phase:

  • Set up AWS CloudTrail to record all management events and store logs in a secure S3 bucket for auditing and compliance.

  • Create a dedicated Glue database and table to catalog CloudTrail logs, making them searchable in Athena.

  • Restrict access to the Glue database and table so only administrators can view or modify audit data, ensuring sensitive logs remain protected.

  • Use Athena to run SQL queries on CloudTrail logs for deep analysis, such as tracking user actions, identifying changes, and supporting investigations.

Capturing and organizing CloudTrail logs allows us to maintain a comprehensive, tamper-evident record of all activity in our data mesh. By restricting access, we protect sensitive audit data from unauthorized users. Querying these logs with Athena empowers us to quickly answer critical questions, troubleshoot issues, and demonstrate compliance with security and governance requirements.

Step 1: Allow Access to KMS Key from CloudTrail

Since we are encrypting our CloudTrail logs, CloudTrail needs permission to use our KMS key.

  1. Navigate to the AWS KMS Console, select Customer managed keys, and choose our KMS key.

  2. Under the Key policy tab, click Edit.

  3. Add a new statement to the policy that grants CloudTrail the necessary permissions. We can find the policy statement at the GitHub link.

  4. Make sure not to modify or remove any existing statements in your key policy.

  5. Click Save changes.

Tip: If we create the KMS key from the CloudTrail console, the required policy is added automatically. As we created ours via the KMS console, this manual step is necessary.
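
The exact statement lives at the GitHub link above; as a reference, a statement along the following lines is what CloudTrail typically needs. This boto3 sketch appends it to the existing policy without touching the other statements; the key ID and account ID are placeholders.

import json

import boto3

kms = boto3.client("kms")
KEY_ID = "<kms_key_id>"        # placeholder: our customer managed key
ACCOUNT_ID = "<account_id>"    # placeholder

# Fetch the current key policy so all existing statements are preserved.
policy = json.loads(kms.get_key_policy(KeyId=KEY_ID, PolicyName="default")["Policy"])

# Typical statement allowing CloudTrail to encrypt log files with this key,
# scoped to trails in our account via the encryption context condition.
policy["Statement"].append({
    "Sid": "AllowCloudTrailToEncryptLogs",
    "Effect": "Allow",
    "Principal": {"Service": "cloudtrail.amazonaws.com"},
    "Action": "kms:GenerateDataKey*",
    "Resource": "*",
    "Condition": {
        "StringLike": {
            "kms:EncryptionContext:aws:cloudtrail:arn": f"arn:aws:cloudtrail:*:{ACCOUNT_ID}:trail/*"
        }
    },
})

kms.put_key_policy(KeyId=KEY_ID, PolicyName="default", Policy=json.dumps(policy))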

Step 2: Create S3 Bucket for Audit Logs

We will create a dedicated S3 bucket to store our CloudTrail audit logs. Using a dedicated bucket for audit logs helps keep our logs organized and secure. We can further restrict access to this bucket as needed.

Follow the steps from the previous parts to create an S3 bucket named aws-data-mesh-audit-trail-<account_id>-<region>, and add a bucket policy to it as well.
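
If the bucket policy does not already allow CloudTrail delivery, statements along the following lines are typically required. This is a boto3 reference sketch with placeholder names, not necessarily the exact policy used in this series.

import json

import boto3

s3 = boto3.client("s3")
BUCKET = "aws-data-mesh-audit-trail-<account_id>-<region>"   # placeholder
ACCOUNT_ID = "<account_id>"                                  # placeholder

# Standard CloudTrail delivery policy: an ACL check on the bucket, plus write
# access limited to objects under our account's AWSLogs/ prefix.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AWSCloudTrailAclCheck",
            "Effect": "Allow",
            "Principal": {"Service": "cloudtrail.amazonaws.com"},
            "Action": "s3:GetBucketAcl",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            "Sid": "AWSCloudTrailWrite",
            "Effect": "Allow",
            "Principal": {"Service": "cloudtrail.amazonaws.com"},
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/AWSLogs/{ACCOUNT_ID}/*",
            "Condition": {"StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}},
        },
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(bucket_policy))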

Step 3: Setting Up CloudTrail

  1. Navigate to the AWS CloudTrail Console. From the sidebar, click Dashboard, then click Create trail.

  2. Let’s name the trail as data-mesh-audit-trail.

  3. For Storage location, select Use existing S3 bucket and browse to the bucket created in the previous step.

  4. We will use our existing KMS Key. Select Existing and select the key.

  5. Leave the remaining settings at their default values and click Next.

  6. For Event type, select only Management events, to avoid unnecessary costs from data and insight events.

  7. Select Exclude AWS KMS events and Exclude Amazon RDS Data API events to filter out high-volume, less relevant events.

  8. Click Next, review configuration and click Create trail.

Go to the S3 Console and open the bucket we created. We should now see that audit logs are being saved to this bucket.
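
For reference, the trail created above can also be sketched with boto3; the bucket name and KMS key ARN are placeholders, the multi-region setting matches the console default, and the KMS and RDS Data API exclusions are expressed through event selectors.

import boto3

cloudtrail = boto3.client("cloudtrail")

# Create the trail against the audit bucket with our existing KMS key.
cloudtrail.create_trail(
    Name="data-mesh-audit-trail",
    S3BucketName="aws-data-mesh-audit-trail-<account_id>-<region>",   # placeholder
    KmsKeyId="<kms_key_arn>",                                         # placeholder
    IsMultiRegionTrail=True,
)

# Log management events only, excluding high-volume KMS and RDS Data API events.
cloudtrail.put_event_selectors(
    TrailName="data-mesh-audit-trail",
    EventSelectors=[
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "ExcludeManagementEventSources": [
                "kms.amazonaws.com",
                "rdsdata.amazonaws.com",
            ],
        }
    ],
)

# The console starts logging automatically; via the API it is explicit.
cloudtrail.start_logging(Name="data-mesh-audit-trail")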

Step 4: Setup Glue Database & Table

Follow the steps from the previous part to create a new Glue database named data-mesh-logging with the location set to s3://aws-data-mesh-audit-trail-<account_id>-<region>/AWSLogs/<account_id>/.

For cloudtrail table, we will run DDL statement in Athena, with partition projection.

Before running DDL to create table, we need to perform following preparatory steps:

  • Register our CloudTrail S3 bucket in Lake Formation to enable data access control.

  • Add a new policy named CloudTrailBucketAccess to the LFServiceRole. Copy the policy from the provided GitHub link.

  • Grant the Administrator user access to the data-mesh-logging database in Lake Formation (a scripted sketch of these steps follows below).
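
A boto3 sketch of these preparatory steps might look like the following; the LFServiceRole and Administrator principal names are carried over from earlier parts, and the exact ARNs (shown with placeholders) are assumptions.

import boto3

lf = boto3.client("lakeformation")
ACCOUNT_ID = "<account_id>"                                   # placeholder
BUCKET = "aws-data-mesh-audit-trail-<account_id>-<region>"    # placeholder

# Register the CloudTrail bucket with Lake Formation using the service role
# that carries the CloudTrailBucketAccess policy.
lf.register_resource(
    ResourceArn=f"arn:aws:s3:::{BUCKET}",
    RoleArn=f"arn:aws:iam::{ACCOUNT_ID}:role/LFServiceRole",
    UseServiceLinkedRole=False,
)

# Grant the Administrator user access to the data-mesh-logging database.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": f"arn:aws:iam::{ACCOUNT_ID}:user/Administrator"},
    Resource={"Database": {"Name": "data-mesh-logging"}},
    Permissions=["ALL"],
)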

Create Table via DDL statement:

  1. Navigate to AWS Athena Console.

  2. Select data-mesh-consumer as workgroup.

  3. Run the following DDL statement.

    1. Replace <region> with the region in which we are building this setup. For me, it is us-east-1.

    2. Replace <account_id> with our account ID.

    CREATE EXTERNAL TABLE cloudtrail_logs (
        eventVersion STRING,
        userIdentity STRUCT<
            type: STRING,
            principalId: STRING,
            arn: STRING,
            accountId: STRING,
            invokedBy: STRING,
            accessKeyId: STRING,
            userName: STRING,
            sessionContext: STRUCT<
                attributes: STRUCT<
                    mfaAuthenticated: STRING,
                    creationDate: STRING
                >,
                sessionIssuer: STRUCT<
                    type: STRING,
                    principalId: STRING,
                    arn: STRING,
                    accountId: STRING,
                    userName: STRING
                >
            >
        >,
        eventTime STRING,
        eventSource STRING,
        eventName STRING,
        awsRegion STRING,
        sourceIpAddress STRING,
        userAgent STRING,
        errorCode STRING,
        errorMessage STRING,
        requestParameters STRING,
        responseElements STRING,
        additionalEventData STRING,
        requestId STRING,
        eventId STRING,
        resources ARRAY<STRUCT<
            ARN: STRING,
            accountId: STRING,
            type: STRING
        >>,
        eventType STRING,
        apiVersion STRING,
        readOnly BOOLEAN,
        recipientAccountId STRING,
        serviceEventDetails STRING,
        sharedEventID STRING,
        vpcEndpointId STRING
    )
    COMMENT 'CloudTrail table with JsonSerDe and Partition Projection'
    PARTITIONED BY (
        `region` string,
        `year` string,
        `month` string,
        `day` string
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3://aws-data-mesh-audit-trail-<account_id>-<region>/AWSLogs/<account_id>/CloudTrail/'
    TBLPROPERTIES (
        'projection.enabled' = 'true',
        'projection.region.type' = 'enum',
        'projection.region.values' = '<region>',
        'projection.year.type' = 'integer',
        'projection.year.range' = '2025,2040',
        'projection.month.type' = 'integer',
        'projection.month.range' = '01,12',
        'projection.month.digits' = '2',
        'projection.day.type' = 'integer',
        'projection.day.range' = '01,31',
        'projection.day.digits' = '2',
        'storage.location.template' = 's3://aws-data-mesh-audit-trail-<account_id>-<region>/AWSLogs/<account_id>/CloudTrail/${region}/${year}/${month}/${day}/'
    );

Explanation of Partition Projection Properties:

  • projection.enabled: Activates partition projection for this table.

  • projection.<partition_key>.type: Defines the data type (integer, enum, date, or injected).

  • projection.<partition_key>.range: Specifies the valid range of values for the partition key. Athena uses this to calculate valid partition locations.

  • storage.location.template: This is the most important property. It defines the S3 path template that Athena uses to find our data. The ${variable} placeholders correspond to the partition keys defined in the PARTITIONED BY clause.

Step 5: Analyzing Activity with Athena

Athena lets you run SQL queries directly on CloudTrail logs stored in S3. This is invaluable for historical audits and answering questions like "Who deleted a table?" or "When were permissions changed?"

Let’s go to the Athena Console and run a few sample queries.

Example Queries:

  • List all Lake Formation access:

      SELECT eventTime,
          eventName,
          userIdentity,
          params.tableArn AS tableArn,
          params.permissions AS permissions
      FROM (
              SELECT eventTime,
                  eventName,
                  userIdentity.arn as userIdentity,
                  CAST(
                      json_parse(requestparameters) AS ROW(
                          tableArn VARCHAR,
                          permissions ARRAY < VARCHAR >,
                          auditContext ROW(
                              additionalAuditContext VARCHAR,
                              cellLevelSecurityEnforced boolean,
                              expectedTableId VARCHAR
                          )
                      )
                  ) AS params
              from cloudtrail_logs
              WHERE region = 'us-east-1'
                  AND year = '2025'
                  AND month = '06'
                  AND day between '21' and '30'
                  and eventName = 'GetDataAccess'
          )
      ORDER BY eventTime DESC
    
  • Find all permission grants:

      SELECT eventTime,
          eventName,
          userIdentity.arn as userIdentity,
          requestparameters
      from cloudtrail_logs
      WHERE region = 'us-east-1'
          AND year = '2025'
          AND month = '06'
          AND day between '21' and '30'
          and eventName = 'BatchGrantPermissions'
    
💡
As we are using partition projection, we must always include the partition columns in the WHERE clause; without these conditions, we won't get any results.
💡
If required, we can also build dashboards in QuickSight from this table.

Capturing Real Time Governance Events with EventBridge

While CloudTrail provides an essential historical record for auditing, many security and operational use cases require immediate action. To achieve this, we will use Amazon EventBridge to build rules that filter the stream of events from CloudTrail. This allows us to automatically capture critical governance changes, such as a table being deleted or permissions being granted, the moment they happen. This capability is the foundation for building automated alerts, security responses, and the resilient, self healing data platforms we will explore in future parts.

What we will do in this phase:

  • Create two distinct EventBridge rules to monitor different types of activity.

  • Our first rule, created using the AWS Management Console, will capture critical write operations from AWS Glue and Lake Formation, such as creating or deleting tables and changing permissions.

  • Our second rule, created using the AWS CLI, will specifically monitor read access requests (GetDataAccess) from Lake Formation, which helps in auditing who is querying the data.

  • For both rules, we will configure a CloudWatch Log Group as the target to create dedicated, real time log streams for each type of event.

Setting up EventBridge to capture governance events provides immediate visibility into critical changes across our data mesh. Instead of waiting to discover an issue during a periodic audit, we can log it, or even act on it, the moment it occurs. This enhances our security posture by enabling rapid investigation of unauthorized activity and provides an immutable log for compliance. Most importantly, it creates the event driven foundation needed for future automation, such as triggering alerts or disaster recovery workflows.

Step 1: Capturing Write Operations

First, we will initiate the rule creation process from the EventBridge console.

  1. Navigate to the Amazon EventBridge console.

  2. From the sidebar, select Rules and then click Create rule.

  3. Name the rule as DataMeshWriteEvents.

  4. For Rule type, ensure Rule with an event pattern is selected, and click Next.

  5. For Event source, select AWS events or EventBridge partner events.

  6. Scroll down to the Event pattern section.

  7. Select Custom patterns (JSON editor) and paste the following JSON pattern to target write operations.

     {
       "source": ["aws.glue", "aws.lakeformation"],
       "detail-type": ["AWS API Call via CloudTrail"],
       "detail": {
         "eventName": [
           "CreateTable",
           "UpdateTable",
           "DeleteTable",
           "BacthDeleteTable",
           "CreatePartition",
           "BatchCreatePartition",
           "DeletePartition",
           "BatchDeletePartition",
           "UpdatePartition",
           "BatchUpdatePartition",
           "GrantPermissions",
           "BatchGrantPermissions",
           "RevokePermissions",
           "BatchRevokePermissions"
         ]
       }
     }
    
  8. Click Next.

  9. In the Target 1 box, select AWS service.

  10. Under Select a target, choose CloudWatch log group.

  11. For Log group, let’s create a dedicated log group for these events named /aws/events/datamesh-write-events.

  12. Click Next, review the configuration, and click Create rule.

Next, we will create a rule for read events and then test both rules.

Step 2: Capturing Data Access Operations

Since EventBridge rules for read events like GetDataAccess cannot be created directly through the AWS Management Console, we will use the AWS CLI/Python to create this rule.

💡
Note: AWS announced support for capturing read events in EventBridge in November 2023. This feature is currently accessible only via the AWS CLI or SDKs, not through the console. Once this rule is created, the console will show its state as Enabled (with CloudTrail read-only Management events).

To create this rule, we will use AWS CloudShell, a convenient, browser based environment that comes preconfigured with the AWS CLI and Python.

What the Script Will Do

We have provided two scripts that perform the same five actions. We only need to run one of them. The script will:

  1. Define the event pattern for GetDataAccess events.

  2. Create the EventBridge rule.

  3. Create a dedicated CloudWatch Log Group.

  4. Set the log group as the target for the rule.

  5. Add a resource policy to the log group to grant EventBridge permission to write logs. This is a crucial step that the console handles automatically but must be done manually when using a script.
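
For orientation, a condensed Python sketch of these five actions is shown below. The actual scripts from the GitHub links remain the source of truth; the rule and log group names match those referenced later in this part, and the special ENABLED_WITH_ALL_CLOUDTRAIL_MANAGEMENT_EVENTS state is what opts the rule into CloudTrail read-only management events.

import json

import boto3

events = boto3.client("events")
logs = boto3.client("logs")

REGION = "<region>"          # placeholder
ACCOUNT_ID = "<account_id>"  # placeholder
RULE_NAME = "DataMeshAccessRule"
LOG_GROUP = "/aws/events/datamesh_access_log_group"

# 1. Event pattern for Lake Formation GetDataAccess calls recorded by CloudTrail.
pattern = {
    "source": ["aws.lakeformation"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {"eventName": ["GetDataAccess"]},
}

# 2. Create the rule; this state cannot be selected in the console.
events.put_rule(
    Name=RULE_NAME,
    EventPattern=json.dumps(pattern),
    State="ENABLED_WITH_ALL_CLOUDTRAIL_MANAGEMENT_EVENTS",
)

# 3. Dedicated log group for the captured read events.
logs.create_log_group(logGroupName=LOG_GROUP)
log_group_arn = f"arn:aws:logs:{REGION}:{ACCOUNT_ID}:log-group:{LOG_GROUP}"

# 4. Point the rule at the log group.
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "access-log-group", "Arn": log_group_arn}],
)

# 5. Resource policy so EventBridge can deliver events to the log group
#    (the policy name here is illustrative).
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": ["events.amazonaws.com", "delivery.logs.amazonaws.com"]},
        "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
        "Resource": f"{log_group_arn}:*",
    }],
}
logs.put_resource_policy(policyName="EventBridgeToCWLogs", policyDocument=json.dumps(policy))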

First, launch a new AWS CloudShell session.

Follow these general steps, using the specific link and filename for the chosen script from the options provided below.

  • Option A: Shell Script → GitHub Link, File Name → create_event_bridge_rule.sh

  • Option B: Python Script → GitHub Link, File Name → create_event_bridge_rule.py

  1. Download the script from GitHub using curl.

     # Replace [GitHub Link] with the link for your chosen script
     curl -O [GitHub Link]
    
  2. Edit the Placeholders. Replace the <region> and <account_id> placeholders. Here are two ways to do this:

    • Option A: Using sed for Quick Replacement (Recommended)
      The sed command finds and replaces text directly from the command line.

        # Replace placeholders in your script file.
        # IMPORTANT: Update your-aws-region and your-aws-account-id with your actual values.
      
        sed -i 's/<region>/your-aws-region/g' [script_filename]
        sed -i 's/<account_id>/your-aws-account-id/g' [script_filename]
      
    • Option B: Using a Text Editor
      If you prefer to edit the file interactively, you can use a command line editor like nano or vim.

        # Open the file with your preferred editor
        nano [script_filename]
        # OR
        vim [script_filename]
      
      • In nano, use Ctrl + \ to search and replace.

      • In vim, use the command :%s/<placeholder>/your-value/g to replace all instances.

  3. Run the Script

      # Make the script executable
      chmod +x [script_filename]

      # If using the shell script, run:
      ./[script_filename]

      # If using the Python script, run:
      python3 [script_filename]
    

After running either script, our DataMeshAccessRule rule will be active and logging data access events to the specified CloudWatch log group, providing a clean and organized way to monitor data access patterns.

Step 3: Testing the EventBridge Rules

With both EventBridge rules active and logging to their respective CloudWatch log groups, we now have a structured and efficient way to monitor governance changes and data access events.

To verify the setup, perform the following tests:

  • Test Rule 1 (Write Operations): Revoke a permission in Lake Formation to trigger a governance event. We can then check the /aws/events/datamesh-write-events log group to see the captured event.

    Remember to restore the revoked permission afterward to maintain proper access.

  • Test Rule 2 (Read Access): Log in as the DataMeshUser and execute a query in the Athena Query Editor. This will generate a GetDataAccess event, which we can find in the /aws/events/datamesh_access_log_group log group.

With our EventBridge rules now active, we have powerful, real time visibility into critical management and governance events across our data platform. We can see when tables are altered or permissions are granted the moment it happens.

However, this only tells part of the story. To get a complete picture of our data lake’s activity, we need to go one level deeper and monitor who is accessing the individual data objects within our S3 buckets.


Auditing Object Level Access with S3 Server Access Logs

In this next section, we will set up S3 Server Access Logging to capture a detailed, object level audit trail. This will complement our management event monitoring by providing a granular record of every file interaction, which is essential for comprehensive security analysis and usage auditing.

S3 Server Access Logs can tell us who is accessing the individual files within that table, how frequently, and from where. This provides a granular, low level audit trail that is invaluable for security analysis, performance tuning, and understanding data usage patterns.

Why S3 Server Access Logs Instead of CloudTrail Data Events?

A common question is why we would use S3 Server Access Logs when CloudTrail Data Events can also track object level operations. While both are powerful, they serve different primary purposes, and for our goal of comprehensive, cost effective auditing, S3 Server Access Logs are the better choice for several reasons:

  • Cost: This is the most significant differentiator. S3 Server Access Logging is free, with costs incurred only for storing the log files in S3. CloudTrail Data Events, on the other hand, incur a direct cost for every event processed, which can become very expensive for high traffic buckets.

  • Log Detail: S3 access logs capture specific operational details not found in CloudTrail, such as object size, total request time, turn around time, and the HTTP referrer. They are also the only way to log authentication failures, which are critical for detecting security threats.

  • Comprehensive Record: S3 logging provides a detailed, best effort record of all requests made directly to your S3 objects, including those from presigned URLs, which may not always appear in CloudTrail.

While CloudTrail data events are delivered faster (within 5 minutes vs. a few hours for S3 logs) and integrate seamlessly with CloudWatch Events, their primary strength is for real time, high priority security alerts across multiple AWS services. For detailed, historical, and cost effective analysis of data access patterns, S3 Server Access Logging is the superior tool.

What we will do in this phase:

  • Enable S3 Server Access Logging on our data lake bucket to capture all object level requests.

  • Configure a separate S3 bucket as the destination for these logs to maintain organization and prevent logging loops.

  • Create a new database and a partitioned table in AWS Athena to make these raw log files easily analyzable with standard SQL.

Enabling S3 Server Access Logging gives us a complete and cost effective audit trail of every GET, PUT, and DELETE request made to our data objects. By structuring these logs in Athena, we empower our security, operations, and analytics teams to troubleshoot access issues, analyze usage patterns, optimize performance, and meet strict compliance requirements without incurring the significant costs associated with high volume event processing.

Step 1: Enable Server Access Logging on Your S3 Bucket

First, we will configure our main data lake bucket to send access logs to a separate destination bucket.

  1. Navigate to the Amazon S3 console.

  2. Create a new S3 bucket to serve as the destination for our logs; we will name it aws-data-mesh-access-logs-<account_id>-<region>. This bucket must be in the same region as the source bucket and should not have logging enabled on it, to avoid recursive logging.

    💡
    For this bucket, keep encryption as SSE-S3, as server access logging doesn’t support SSE-KMS as of now
  3. Go to the properties of our datalake S3 bucket.

  4. Find the Server access logging section and click Edit.

  5. Enable logging and select the destination bucket we just created. We can optionally specify a target prefix to keep the logs organized within the bucket; for now, we will keep it blank.

    • For Log object key format, select [SourceAccountId]/[SourceRegion]/[SourceBucket]/[YYYY]/[MM]/[DD]/[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[UniqueString], as this will help with partitioning.
  6. Click Save changes. Amazon S3 will automatically update the destination bucket's policy to allow the S3 logging service principal to write log files to it.
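
The same configuration can be applied with boto3 (a recent SDK version that supports the partitioned log object key format is assumed); the bucket names are placeholders, and the choice of EventTime as the date source is an assumption to match the date-based prefixes used by the Athena table below.

import boto3

s3 = boto3.client("s3")
SOURCE_BUCKET = "<datalake_bucket>"                               # placeholder
LOG_BUCKET = "aws-data-mesh-access-logs-<account_id>-<region>"    # placeholder

# Enable server access logging with the date-based (partitioned) key format,
# which the partition projection template in the next step relies on.
# Note: the console updates the destination bucket policy automatically;
# via the API that policy may need to be added separately.
s3.put_bucket_logging(
    Bucket=SOURCE_BUCKET,
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": LOG_BUCKET,
            "TargetPrefix": "",
            "TargetObjectKeyFormat": {
                "PartitionedPrefix": {"PartitionDateSource": "EventTime"}
            },
        }
    },
)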

Step 2: Create a Table in Athena

To analyze the logs, we will define a table structure in Athena that maps to the S3 access log format. Just as we did for the CloudTrail logs, we will create the table in the same database.

  1. Open the Athena console.

  2. Run the following CREATE TABLE query to define the table structure.

    1. Replace <region> with the region in which we are building this setup. For me, it is us-east-1.

    2. Replace <account_id> with our account ID.

    3. Replace <datalake_bucket> with the name of our data lake bucket.

       CREATE EXTERNAL TABLE `s3_access_logs`(
         `bucketowner` STRING,
         `bucket_name` STRING,
         `requestdatetime` STRING,
         `remoteip` STRING,
         `requester` STRING,
         `requestid` STRING,
         `operation` STRING,
         `key` STRING,
         `request_uri` STRING,
         `httpstatus` STRING,
         `errorcode` STRING,
         `bytessent` BIGINT,
         `objectsize` BIGINT,
         `totaltime` STRING,
         `turnaroundtime` STRING,
         `referer` STRING,
         `useragent` STRING,
         `versionid` STRING,
         `hostid` STRING,
         `sigv` STRING,
         `ciphersuite` STRING,
         `authtype` STRING,
         `endpoint` STRING,
         `tlsversion` STRING,
         `accesspointarn` STRING,
         `aclrequired` STRING
       )
       PARTITIONED BY ( 
         `year` INT,
         `month` INT,
         `day` INT
       )
       ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
       WITH SERDEPROPERTIES (
         'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$'
       )
       LOCATION 's3://aws-data-mesh-access-logs-<account_id>-<region>/<account_id>/<region>/<datalake_bucket>/'
       TBLPROPERTIES (
         'projection.enabled' = 'true',
         'projection.year.type' = 'integer',
         'projection.year.range' = '2025,2040',
         'projection.month.type' = 'integer',
         'projection.month.range' = '01,12',
         'projection.month.digits' = '2',
         'projection.day.type' = 'integer',
         'projection.day.range' = '01,31',
         'projection.day.digits' = '2',
         'storage.location.template' = 's3://aws-data-mesh-access-logs-<account_id>-<region>/<account_id>/<region>/<datalake_bucket>/${year}/${month}/${day}/'
       );
      

With our s3_access_logs table set up, we can now run highly efficient queries by filtering on the partition keys (year, month, and day) in the WHERE clause. This is crucial for controlling costs and improving query performance, as it allows Athena to calculate the exact S3 locations to scan without listing all partitions.

Step 3: Analyzing Activity with Athena

Remember to adjust the year, month, and day values in the WHERE clause to match the time range you wish to analyze.

1. Top 10 IP Addresses by Request Count
This query identifies the top 10 IP addresses making requests on a specific day, which is useful for identifying top traffic sources or potential security anomalies.

SELECT remoteip,
    COUNT(*) AS request_count
FROM s3_access_logs
WHERE year = 2025
    AND month = 06
    AND day = 22
GROUP BY remoteip
ORDER BY request_count DESC
LIMIT 10;

2. Count Requests by HTTP Status Code
This helps us monitor the health of our S3 access by identifying the frequency of successful requests (2xx), client-side errors (4xx), or server-side errors (5xx).

SELECT httpstatus,
    COUNT(*) AS status_count
FROM s3_access_logs
WHERE year = 2025
    AND month = 06
    AND day = 22
GROUP BY httpstatus
ORDER BY status_count DESC;

3. Find All Requests for a Specific Object
Use this for detailed auditing of every access attempt on a single, specific file on a given day.

SELECT *
FROM s3_access_logs
WHERE key = 'path/to/your/specific-object.csv'
    AND year = 2025
    AND month = 06
    AND day = 22;

4. Count Requests by Operation Type
This query shows the breakdown of operations (e.g., REST.GET.OBJECT, REST.PUT.OBJECT, REST.DELETE.OBJECT) performed on our data, helping us understand usage patterns.

SELECT operation,
    COUNT(*) AS operation_count
FROM s3_access_logs
WHERE year = 2025
    AND month = 06
    AND day = 22
GROUP BY operation
ORDER BY operation_count DESC;

5. List Unique Users Accessing Data
This query lists the unique IAM users or roles that accessed data on a given day, providing a clear audit of who is interacting with our data.

SELECT DISTINCT requester
FROM s3_access_logs
WHERE year = 2025
    AND month = 06
    AND day = 22;

Summary

In this part, we built a comprehensive observability framework, transforming our data mesh from a black box into a transparent, auditable, and secure platform. We now have the tools and knowledge to monitor every layer of our data ecosystem, from high level operational health down to individual file access.

Specifically, we learned how to:

  • Proactively Monitor Glue Crawlers by creating CloudWatch metric filters and alarms to get immediate notifications for crawler failures.

  • Establish a Complete Audit Trail of all API calls using AWS CloudTrail and analyze these logs with Athena to answer critical security and operational questions.

  • Capture Critical Events in Real Time with Amazon EventBridge, setting up separate rules for both governance changes and data access requests.

  • Implement Granular Object Level Logging in a cost effective way using S3 Server Access Logs, and build an automated partition projection table in Athena for easy analysis.


What's Next: A Security Challenge for You

With our observability framework in place, we've made huge strides in meeting our security responsibilities. According to the AWS Shared Responsibility Model, while AWS secures the cloud infrastructure itself, we are responsible for securing what we put in the cloud. Our current setup provides visibility, but visibility alone isn't prevention.

So, before we move on, I want to pose a challenge to you: Looking at the architecture we have just built, what potential security gaps or misconfigurations might still exist? Specifically, Network Security.

Network security is a critical layer of defense that protects data in transit, controls access, and ensures secure communication between services and domains. This moves us from monitoring activity to proactively controlling access, a key aspect of the "security in the cloud" model.

In our next blog, we will dive deep into hardening our platform by addressing this very question. We will start by locking down our network and implementing stricter encryption controls to ensure our data mesh is not just observable, but truly resilient and secure from the ground up.

Stay tuned as we build a secure network foundation, ensuring our data is accessed not just by the right identities, but also from the right network locations.


If you have questions, run into issues, or want to share your results, please leave a comment below. Your feedback and insights help us all build better data platforms together.
