๐๐š๐ซ๐ญ ๐Ÿ โ€“ ๐‹๐š๐ฒ๐ข๐ง๐  ๐ญ๐ก๐ž ๐…๐จ๐ฎ๐ง๐๐š๐ญ๐ข๐จ๐ง๐ฌ

Harshit Garg

Welcome to Part 1 of our hands-on series on building a data mesh on AWS. In this part, we'll set up a secure, governed foundation for the mesh.

This part focuses on building the central catalog components from the architecture diagram below.

Prerequisites

Step 1: Start with Security: Create a KMS Key

Security is foundational in any modern data platform. Creating a KMS key first ensures that all data and metadata stored in AWS services are encrypted, which supports compliance and protects sensitive information. If you prefer to script along, each console walkthrough in this part is followed by a small boto3 sketch.

  1. Navigate to the AWS KMS Console. Switch to your preferred region if needed.

  2. Click Create key.

  3. Select Symmetric and click Next.

  4. Name the key aws-data-mesh-key. Click Next.

  5. Add your admin IAM user as both a key administrator and a key user. Click Next.

  6. Review the key policy. Click Finish.
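
Here is a minimal boto3 sketch of the same step. It assumes your AWS credentials and region are already configured, and unlike the console wizard it leaves the key policy at its default instead of adding your admin user explicitly:

```python
import boto3

kms = boto3.client("kms")

# Create a symmetric encryption key for the data mesh
resp = kms.create_key(
    Description="aws-data-mesh encryption key",
    KeySpec="SYMMETRIC_DEFAULT",  # "Symmetric" in the console wizard
    KeyUsage="ENCRYPT_DECRYPT",
)
key_arn = resp["KeyMetadata"]["Arn"]

# Attach the name we chose as an alias
kms.create_alias(
    AliasName="alias/aws-data-mesh-key",
    TargetKeyId=resp["KeyMetadata"]["KeyId"],
)

print(key_arn)  # note this ARN; later steps reference it
```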

Step 2: Organize our Storage: Create S3 Bucket

With encryption in place, let's set up our datalake storage. We will name the bucket aws-data-mesh-<account_number>-<region>.

  1. Navigate to the AWS S3 Console. Switch to your preferred region if needed.

  2. Click Create Bucket.

  3. Name the bucket as discussed above.

  4. Under Default encryption, select SSE-KMS and choose the KMS key created in Step 1 (this ensures objects in the bucket are always encrypted).

  5. Click on Create Bucket.
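
A sketch of the same two actions (create the bucket, then enforce SSE-KMS as its default encryption), assuming KEY_ARN holds the ARN of the key from Step 1:

```python
import boto3

REGION = "us-east-1"          # adjust as needed
KEY_ARN = "arn:aws:kms:..."   # KMS key ARN from Step 1

account = boto3.client("sts").get_caller_identity()["Account"]
bucket = f"aws-data-mesh-{account}-{REGION}"

s3 = boto3.client("s3", region_name=REGION)
if REGION == "us-east-1":
    s3.create_bucket(Bucket=bucket)  # us-east-1 takes no LocationConstraint
else:
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# Default encryption: SSE-KMS with the key from Step 1
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": KEY_ARN,
            },
        }]
    },
)
```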

Configure Bucket Policy

Once the bucket is created, we will add a bucket policy to deny unsecured (non-HTTPS) access and prevent the use of pre-signed URLs:

💡
A presigned URL in Amazon S3 is a special link that temporarily grants access to a specific S3 object, even if the object is private and the person accessing it does not have AWS credentials or permissions.
  1. Select the datalake bucket from the bucket list.

  2. Click on the Permissions tab.

  3. Scroll down to the Bucket policy section.

  4. Click Edit to modify the bucket policy.

  5. Copy and paste the policy from the GitHub link.

    1. Replace REPLACE_ME_WITH_DATALAKE_BUCKET with the bucket name.

  6. Click Save changes.
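
The policy at the GitHub link is the source of truth. As an illustration of the two intents (deny non-HTTPS traffic, deny presigned-URL access), here is a sketch; the s3:authType condition for blocking query-string (presigned) authentication is my assumption about how the linked policy does it:

```python
import boto3, json

bucket = "aws-data-mesh-<account_number>-<region>"  # your datalake bucket
arns = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # deny any request not made over HTTPS
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": arns,
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
        {   # deny query-string authentication, i.e. presigned URLs
            "Sid": "DenyPresignedUrls",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": arns,
            "Condition": {"StringEquals": {"s3:authType": "REST-QUERY-STRING"}},
        },
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```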

Step 3. Protect our Metadata: Configure Glue Encryption

The Glue Data Catalog stores metadata about our datasets. Encrypting this metadata with the KMS key ensures that sensitive information about our data structure and schema is protected.

  1. Navigate to the AWS Glue Console. Switch to your preferred region if needed.

  2. In the sidebar, under Data Catalog, go to Catalog settings.

  3. Select the Metadata encryption checkbox and choose the KMS key created in Step 1.

  4. Click on Save.
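
The same setting via the API, as a sketch (KEY_ARN again being the key ARN from Step 1):

```python
import boto3

KEY_ARN = "arn:aws:kms:..."  # KMS key ARN from Step 1

boto3.client("glue").put_data_catalog_encryption_settings(
    DataCatalogEncryptionSettings={
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": KEY_ARN,
        }
    }
)
```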

Step 4. Centralize Governance: Set Up Lake Formation

Lake Formation centralizes data governance, allowing fine-grained access control and simplifying permissions management across AWS data services.

💡
By default, Lake Formation is disabled in all regions, and permissions are managed via IAM.
💡
Managing permissions via IAM quickly becomes tedious: every permission change requires an IAM policy update, and as permissions grow more complex over time, it becomes almost impossible to manage them through IAM alone.

Step 4a. Enable Lake Formation

  1. Navigate to the AWS Lake Formation Console. Switch to your preferred region if needed.

  2. On first-time service access, a welcome dialog will appear. Select Add myself from the available options and click Get Started to proceed.

  3. Under Administration, go to Data catalog settings. We will see a WARNING sign beside it because the default permissions are set to IAM.

  4. Uncheck both boxes under Default permissions for newly created databases and tables. The WARNING sign is now gone.

  5. Click Save.
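
These clicks map onto a single settings API. A sketch that registers the current identity as a data lake administrator and clears both default-permission lists (assuming you run it as an IAM user; for an assumed role you would pass the underlying IAM role ARN instead):

```python
import boto3

me = boto3.client("sts").get_caller_identity()["Arn"]
lf = boto3.client("lakeformation")

# Fetch current settings, then overwrite only what we change
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings["DataLakeAdmins"] = [{"DataLakePrincipalIdentifier": me}]
settings["CreateDatabaseDefaultPermissions"] = []  # the "databases" checkbox
settings["CreateTableDefaultPermissions"] = []     # the "tables" checkbox

lf.put_data_lake_settings(DataLakeSettings=settings)
```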

Step 4b. Remove broad IAM Permissions

Next, we need to remove any broad IAM permissions that were automatically granted.

  1. Under Administration, go to Administrative roles and tasks.

  2. Revoke IAMAllowedPrincipals from Catalog creators and Database creators.

    1. Select the radio button for IAMAllowedPrincipals.

    2. Click the Revoke button.

    3. In the pop-up, select Create catalog/Create database.

    4. Click Revoke.

Next, we will create an IAM role that serves as the bridge between Lake Formation and our datalake buckets. This role operates through the following workflow:

  • Registration: We use this role to register our S3 bucket with Lake Formation.

  • Authorization: When users are authorized successfully, Lake Formation returns temporary security tokens for this role, scoped to only the required S3 path.

  • Access Control: Users use these temporary tokens to access the datalake bucket, ensuring no direct S3 bucket access is granted.

This approach maintains secure, auditable access to our Data Mesh resources.

Step 4c. Create LF Service Role

  1. Navigate to the AWS IAM Console. IAM is a global service, so we don't need to switch regions.

  2. Under Access Management, go to Roles.

  3. Click on Create Role.

  4. Select AWS Service and search for Lake Formation in the drop-down list. Click Next.

  5. For now, don't select any managed policy, as we will create our own policy with least privilege. Click Next.

  6. Name the role as LFServiceRole.

  7. Click Create role.
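
Behind the scenes, choosing AWS Service > Lake Formation gives the role a trust policy that lets Lake Formation assume it. A sketch of the equivalent API calls:

```python
import boto3, json

# Trust policy allowing the Lake Formation service to assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lakeformation.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

boto3.client("iam").create_role(
    RoleName="LFServiceRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Role Lake Formation uses to access the datalake bucket",
)
```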

Now, we will add the following permissions to this role:

  • KMS permissions: this role will be used to read and write data in the datalake bucket via Lake Formation, so we grant Decrypt, GenerateDataKey, and Encrypt.

  • S3 permissions: all access will be governed by Lake Formation, which internally uses this role to reach the S3 bucket, so we add GetObject, PutObject, and ListBucket.

We will follow the steps below to add these permissions:

  1. Select LFServiceRole.

  2. Under the Permissions tab, open the Add permissions drop-down. Click Create inline policy.

  3. Click on JSON and copy the policy from the GitHub link.

    1. Replace REPLACE_WITH_KMS_KEY_ARN_CREATED_IN_STEP_1 with the KMS key ARN created in Step 1. Check the KMS Console; it will start with arn:aws…

    2. Replace REPLACE_ME_WITH_DATALAKE_BUCKET with the bucket created in Step 2.

  4. Click Next.

  5. Name the policy as LFServiceRolePolicy.

  6. Click Create Policy.
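
The linked policy is authoritative; as a least-privilege sketch of the two bullet points above (fill in the placeholders with your key ARN and bucket name):

```python
import boto3, json

KEY_ARN = "arn:aws:kms:..."                         # from Step 1
BUCKET = "aws-data-mesh-<account_number>-<region>"  # from Step 2

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # decrypt/encrypt data with the data-mesh key
            "Sid": "KmsAccess",
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
            "Resource": KEY_ARN,
        },
        {   # read and write objects in the datalake bucket
            "Sid": "DatalakeObjectAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {   # list the bucket itself
            "Sid": "DatalakeListAccess",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="LFServiceRole",
    PolicyName="LFServiceRolePolicy",
    PolicyDocument=json.dumps(policy),
)
```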

Step 4d. Register Datalake Bucket in Lake Formation

  1. Return to the AWS Lake Formation Console.

  2. Under Administration, select Data lake locations.

    💡
    Important: Verify that we are in the Administration section, not the "Data locations" option found under the Permissions section.
  3. Click Register Location to begin the registration process.

  4. Browse to our Datalake S3 bucket and click Select.

  5. From the IAM role dropdown, select the role created in the previous step.

  6. Click Register Location to complete the bucket registration.
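
The registration itself is one API call; a sketch with placeholder ARNs:

```python
import boto3

BUCKET = "aws-data-mesh-<account_number>-<region>"
ROLE_ARN = "arn:aws:iam::<account_number>:role/LFServiceRole"

# Register the datalake bucket with Lake Formation using LFServiceRole
boto3.client("lakeformation").register_resource(
    ResourceArn=f"arn:aws:s3:::{BUCKET}",
    RoleArn=ROLE_ARN,
)
```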

Step 5. Automate Discovery: Create a Glue Crawler

With governance in place, the last part of this guide is to automate the discovery of our data using a Glue crawler. Crawlers scan the data in S3, infer its schema, and populate the Glue Data Catalog, making the data available for querying in Athena and other analytics tools.

Step 5a. Create Glue Database

  1. Return to the AWS Glue Console. Switch to your preferred region if needed.

  2. Under Data Catalog, Click on Databases.

  3. Click on Add Database.

  4. Name the database as sf_safety_data.

    • We will use San Francisco Fire Department Emergency Response data.

    • We will discuss more about data in next part.

  5. Provide the location as s3://<datalake_bucketname>/sf_safety_data/.

    💡
    Important: Although this field is marked as optional on the console, it is mandatory here; without it, the crawler cannot scan the database's S3 location.
  6. Click on Create Database.
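
Scripted, with the location set explicitly (the detail the console marks optional but the crawler needs):

```python
import boto3

BUCKET = "aws-data-mesh-<account_number>-<region>"

boto3.client("glue").create_database(
    DatabaseInput={
        "Name": "sf_safety_data",
        # mandatory for the crawler, even though the console marks it optional
        "LocationUri": f"s3://{BUCKET}/sf_safety_data/",
    }
)
```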

Step 5b. Create Glue Crawler Role

  1. Navigate to the AWS IAM Console.

  2. Under Access Management, Go to Roles.

  3. Click Create Role.

  4. Select AWS Service and search for Glue in the drop-down list. Click Next.

  5. For now, don't select any managed policy, as we will create our own policy with least privilege. Click Next.

  6. Name the role as AWSGlueServiceRole-datamesh.

  7. Click Create role.

Now, we will add the following permissions to this role:

  • KMS permissions: this role will scan encrypted data in our bucket, so we grant Decrypt; the crawler will also use it to update the Data Catalog, which is encrypted, so we add GenerateDataKey and Encrypt as well.

  • Glue permissions: the crawler will create/update tables in our database, along with partitions (which we haven't discussed yet), so it needs Glue permissions.

  • Lake Formation permissions: all access is governed by Lake Formation, so we need to grant the Lake Formation GetDataAccess permission.

  • CloudWatch permissions: our Glue crawler will write logs to CloudWatch, so it needs permission to put events there. (Again, we haven't discussed this service yet.)

💡
We haven't granted any direct access to our S3 bucket, yet the crawler we create in the next step will still be able to scan the data. I will explain this whole flow in the next part, when we run Athena queries.

We will follow the steps below to add these permissions:

  1. Select AWSGlueServiceRole-datamesh.

  2. Under the Permissions tab, open the Add permissions drop-down. Click Create inline policy.

  3. Click on JSON and copy the policy from this GitHub link.

    1. Replace REPLACE_WITH_KMS_KEY_ARN_CREATED_IN_STEP_1 with the KMS key ARN created in Step 1. Check the KMS Console; it will start with arn:aws…

    2. Replace <region> with the region in which we are building this setup. For me, it is us-east-1.

    3. Replace <account_number> with our account ID.

  4. Click Next.

  5. Name the policy as CrawlerAccessPolicy.

  6. Click Create Policy.
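
Again, copy the linked policy for the real setup. As a sketch of the four bullet points, with the specific Glue actions being my guess at a reasonable minimum rather than the linked policy's exact list:

```python
import boto3, json

KEY_ARN = "arn:aws:kms:..."                  # from Step 1
REGION, ACCOUNT = "us-east-1", "<account_number>"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "Kms", "Effect": "Allow",
         "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
         "Resource": KEY_ARN},
        {"Sid": "Glue", "Effect": "Allow",
         "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetTables",
                    "glue:CreateTable", "glue:UpdateTable",
                    "glue:BatchCreatePartition", "glue:BatchGetPartition"],
         "Resource": f"arn:aws:glue:{REGION}:{ACCOUNT}:*"},
        {"Sid": "LakeFormation", "Effect": "Allow",
         "Action": "lakeformation:GetDataAccess", "Resource": "*"},
        {"Sid": "Logs", "Effect": "Allow",
         "Action": ["logs:CreateLogGroup", "logs:CreateLogStream",
                    "logs:PutLogEvents"],
         "Resource": f"arn:aws:logs:{REGION}:{ACCOUNT}:log-group:/aws-glue/*"},
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="AWSGlueServiceRole-datamesh",
    PolicyName="CrawlerAccessPolicy",
    PolicyDocument=json.dumps(policy),
)
```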

Step 5c. Add Glue Crawler Role Permissions in Lake Formation

  1. Grant Glue Crawler Role Access to Database and Tables

    💡
    Currently, Lake Formation can't grant database and table permissions in the same request, which is why we perform the steps below in two parts. At a later stage of the series, we will move to tag-based access control, which lets us grant database and table permissions together.
  • Database Permissions

    1. Navigate to the Amazon Lake Formation Console.

    2. Under Permissions, select Data permissions.

    3. Click Grant to initiate permission assignment.

    4. From the drop-down list of IAM users and roles, select AWSGlueServiceRole-datamesh.
      Note: Because we are granting access to an IAM role within the same account, we selected the IAM users and roles radio button. For a different type of principal, select the corresponding option.

    5. Choose Named Data Catalog resources.

    6. For Catalog, select account ID (only option available).

    7. For Database, select sf_safety_data.

    8. Leave the Table dropdown blank for database-level permissions.

    9. Select Super permission under Database permissions.

    10. Click Grant to apply database permissions.

  • Table Permissions

    1. Repeat the above process for table access.

    2. Follow steps 1-7 from the database permissions section.

    3. For Tables, select All tables.

    4. Select Super permission under Table permissions.

    5. Click Grant to apply table permissions.

Filter the Data permissions list by the role name, and we will see two permissions: one for the database and one for the table.

  2. Grant Glue Crawler Role Access to S3 Location

    1. Under Permissions, select Data locations.

    2. Click Grant to begin location permission setup.

    3. Select AWSGlueServiceRole-datamesh from the dropdown list.

    4. Browse and select our datalake bucket.

    5. Click Grant to finalize S3 location permissions.
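
All three grants from Step 5c can be scripted too. In the API, the console's Super permission is ALL, all tables is a TableWildcard, and the location grant is DATA_LOCATION_ACCESS; a sketch with a placeholder role ARN:

```python
import boto3

ROLE_ARN = "arn:aws:iam::<account_number>:role/AWSGlueServiceRole-datamesh"
BUCKET = "aws-data-mesh-<account_number>-<region>"

lf = boto3.client("lakeformation")
principal = {"DataLakePrincipalIdentifier": ROLE_ARN}

# Database-level grant ("Super" in the console == "ALL" in the API)
lf.grant_permissions(
    Principal=principal,
    Resource={"Database": {"Name": "sf_safety_data"}},
    Permissions=["ALL"],
)

# Table-level grant on all current and future tables in the database
lf.grant_permissions(
    Principal=principal,
    Resource={"Table": {"DatabaseName": "sf_safety_data", "TableWildcard": {}}},
    Permissions=["ALL"],
)

# Data location grant on the registered bucket
lf.grant_permissions(
    Principal=principal,
    Resource={"DataLocation": {"ResourceArn": f"arn:aws:s3:::{BUCKET}"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)
```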

Step 5d. Create Glue Crawler

  1. Navigate to Data Catalog > Crawlers in the AWS Glue Console.

  2. Click Create Crawler to initiate the crawler setup.

  3. Enter sf_safety_data_crawler as the crawler name and click Next.

  4. Configure the data source:

    1. Click Add a data source.

    2. Select the datalake bucket from the available options.

    3. Click Add an S3 data source to confirm the selection.

    4. Click Next to proceed.

  5. Select AWSGlueServiceRole-datamesh from the IAM role dropdown list.

  6. Enable the Use Lake Formation credentials for crawling S3 data source checkbox and click Next.

  7. Choose the database created in previous steps as the Target Database.

  8. Configure Advanced Options:

    • Set Table level to 3.
  9. Review the configuration and click Next.

  10. Click Create Crawler to finalize the crawler creation.

โš ๏ธ Note: Setting Table level to 3 prevents Glue Crawler from treating multiple datasets as a single table when using the same S3 location for different tables in the future


Summary

In this first part, we laid the groundwork for building a modern Data Mesh on AWS using key services such as S3, Glue, and Lake Formation. We covered the essential concepts and practical steps required to set up a secure, scalable, and well-governed Data Mesh environment. This foundational setup ensures that our data is organized, discoverable, and ready for further processing and analytics.

💡
You will notice that wherever we use buckets, I say "datalake", and for all other resources, I say "Data Mesh". This is because our S3 buckets are the storage layer, which works purely as a datalake; the Data Mesh is the governed layer on top of these buckets that controls how the data is accessed.

What We Achieved So Far

  • Set up the initial AWS account structure for data mesh operations.

  • Configured foundational storage using Amazon S3.

  • Established governance and access controls with AWS Lake Formation.

  • Created a central data catalog using AWS Glue for better data discoverability.

  • Built a secure and organized environment to support future data publishing and analytics.

What's Next

In the next part of this series, we will focus on publishing data, updating the central data catalog, and demonstrating how to query the data. This will include practical steps for making our datasets available, ensuring they are discoverable, and enabling efficient data access for analytics and business intelligence.


If you face any challenges or have any doubts, feel free to mention them in the comments below. I'll be happy to answer and help you out!
