Part 1: Laying the Foundations


Welcome to Part 1 of our hands-on series for building a data mesh on AWS. In this guide, we'll set up the foundations of a secure, governed data mesh.
This part focuses on building the central catalog components shown in the architecture diagram below.
Prerequisites
AWS Account
If you don't have an AWS account, follow the official guide:
👉 Create and Activate an AWS Account (AWS Docs)
For all setup, create and use an administrator IAM user instead of your root account:
👉 Creating Your First IAM Admin User and Group (AWS Docs)
Only use your root user for billing and account management, and secure your root credentials with MFA.
Step 1. Start with Security: Create a KMS Key
Security is foundational in any modern data platform. Creating a KMS key first is critical to ensure that all data and metadata stored in AWS services is encrypted and secure. This supports compliance and protects sensitive information.
Navigate to the AWS KMS Console. Update the region as needed.
Click Create key.
Select Symmetric and click Next.
Name the key aws-data-mesh-key. Click Next.
Add the admin IAM user as both a key administrator and a key user. Click Next.
Review the key policy and click Finish.
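If you prefer to script this step, here is a minimal boto3 sketch of the same key creation (the alias name, region, and client setup are assumptions that mirror the walkthrough; the console wizard additionally lets you pick key administrators and users in the key policy):

```python
import boto3

# Assumes credentials for the admin IAM user are configured locally.
kms = boto3.client("kms", region_name="us-east-1")

# Create a symmetric encryption key for the data mesh.
resp = kms.create_key(
    Description="Data mesh encryption key",
    KeySpec="SYMMETRIC_DEFAULT",
    KeyUsage="ENCRYPT_DECRYPT",
)
key_arn = resp["KeyMetadata"]["Arn"]

# Give the key a friendly alias so later steps can reference it easily.
kms.create_alias(
    AliasName="alias/aws-data-mesh-key",
    TargetKeyId=resp["KeyMetadata"]["KeyId"],
)
print(key_arn)  # note this ARN; we will need it in later steps
```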
Step 2. Organize Our Storage: Create an S3 Bucket
With encryption in place, let's set up our datalake storage. We will name the bucket aws-data-mesh-<account_number>-<region>.
Navigate to the Amazon S3 Console. Update the region as needed.
Click Create Bucket.
Name the bucket as discussed above.
Under Default encryption settings, select SSE-KMS and choose the KMS key created in Step 1 (this ensures the bucket contents are always encrypted).
Click on Create Bucket.
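The same bucket setup can be scripted as well; a minimal sketch, assuming us-east-1 and the KMS key ARN from Step 1 (for any other region, create_bucket also needs a CreateBucketConfiguration):

```python
import boto3

region = "us-east-1"          # assumption: adjust to your region
key_arn = "arn:aws:kms:..."   # KMS key ARN from Step 1

account_id = boto3.client("sts").get_caller_identity()["Account"]
bucket = f"aws-data-mesh-{account_id}-{region}"

s3 = boto3.client("s3", region_name=region)
s3.create_bucket(Bucket=bucket)  # outside us-east-1: CreateBucketConfiguration={"LocationConstraint": region}

# Default encryption: SSE-KMS with the key from Step 1.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": key_arn,
            },
            "BucketKeyEnabled": True,
        }]
    },
)
```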
Configure Bucket Policy
Once the bucket is created, we will add a bucket policy to deny unsecured (non-HTTPS) access and prevent the use of pre-signed URLs:
Select the datalake bucket from the bucket list.
Click on the Permissions tab.
Scroll down to the Bucket policy section.
Click Edit to modify the bucket policy.
Copy and paste the policy from the GitHub link.
Replace REPLACE_ME_WITH_DATALAKE_BUCKET with your bucket name.
Click Save changes.
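The policy in the GitHub link is the source of truth; as a rough sketch, such a policy usually combines a deny on non-HTTPS requests (aws:SecureTransport) with a deny on query-string authentication, which is how pre-signed URLs are signed:

```python
import json
import boto3

bucket = "aws-data-mesh-<account_number>-<region>"  # replace with your bucket name
s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Deny any request that is not sent over HTTPS.
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
        {
            # Deny query-string (pre-signed URL) authentication.
            "Sid": "DenyPresignedUrlAccess",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {"StringEquals": {"s3:authType": "REST-QUERY-STRING"}},
        },
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```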
Step 3. Protect our Metadata: Configure Glue Encryption
The Glue Data Catalog stores metadata about our datasets. Encrypting this metadata with the KMS key ensures that sensitive information about our data structure and schema is protected.
Navigate to the AWS Glue Console. Update the region as needed.
In the sidebar, under Data Catalog, go to Catalog settings.
Select the Metadata encryption checkbox and choose the KMS key created in Step 1.
Click on Save.
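The same setting can be applied with a single boto3 call; a sketch assuming the key ARN from Step 1:

```python
import boto3

key_arn = "arn:aws:kms:..."  # KMS key ARN from Step 1

glue = boto3.client("glue", region_name="us-east-1")
glue.put_data_catalog_encryption_settings(
    DataCatalogEncryptionSettings={
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": key_arn,
        }
    }
)
```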
Step 4. Centralize Governance: Set Up Lake Formation
Lake Formation centralizes data governance, allowing fine-grained access control and simplifying permissions management across AWS data services.
Step 4a. Enable Lake Formation
Navigate to the AWS Lake Formation Console. Update the region as needed.
On first-time service access, a welcome dialog will appear. Select Add myself from the available options and click Get Started to proceed.
Under Administration, go to Data catalog settings. We will see a WARNING sign beside it; this is because the default permissions are set to use IAM access control.
Uncheck both boxes under Default permissions for newly created databases and tables. The WARNING sign is now gone.
Click Save.
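If you want to script this, a minimal sketch of the same change (the admin principal ARN is a placeholder you would replace):

```python
import boto3

admin_arn = "arn:aws:iam::<account_number>:user/<admin_user>"  # your admin IAM principal

lf = boto3.client("lakeformation", region_name="us-east-1")
settings = lf.get_data_lake_settings()["DataLakeSettings"]

# "Add myself" as a data lake administrator.
settings["DataLakeAdmins"] = [{"DataLakePrincipalIdentifier": admin_arn}]

# Uncheck both default-permission boxes, so new databases and tables
# no longer receive IAMAllowedPrincipals grants.
settings["CreateDatabaseDefaultPermissions"] = []
settings["CreateTableDefaultPermissions"] = []

lf.put_data_lake_settings(DataLakeSettings=settings)
```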
Step 4b. Remove broad IAM Permissions
Next, we need to remove any broad IAM permissions that were automatically granted.
Under Administration, go to Administrative roles and tasks.
Revoke IAMAllowedPrincipals from Catalog creators and Database creators:
Select the radio button for IAMAllowedPrincipals.
Click the Revoke button.
In the pop-up, select Create Catalog / Create Database.
Click Revoke.
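The equivalent API call is revoke_permissions against the special IAMAllowedPrincipals group; a hedged sketch (the exact permission names shown in your console pop-up are authoritative):

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Revoke database-creation rights from IAMAllowedPrincipals on the account's Data Catalog.
lf.revoke_permissions(
    Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
    Resource={"Catalog": {}},
    Permissions=["CREATE_DATABASE"],
)
```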
Next, we will create an IAM role that serves as the bridge between Lake Formation and our datalake bucket. This role operates through the following workflow:
Registration: We use this role to register our S3 bucket with Lake Formation.
Authorization: When a user is authorized successfully, Lake Formation returns temporary security credentials for this role, scoped to only the required S3 paths.
Access Control: Users then access the datalake bucket with these temporary credentials, so no direct S3 bucket access is ever granted.
This approach maintains secure, auditable access to our Data Mesh resources.
Step 4c. Create LF Service Role
Navigate to the AWS IAM Console. IAM is a global service, so we don't need to update the region.
Under Access Management, go to Roles.
Click on Create Role.
Select AWS service and search for Lake Formation in the drop-down list. Click Next.
For now, don't select any managed policy, as we will create our own policy with least privilege. Click Next.
Name the role as LFServiceRole.
Click Create role.
Now, we will add the following permissions to this role:
KMS permissions, as this role will be used to read and write data in the datalake bucket via Lake Formation, we will provide Decrypt, GenerateDataKey and Encrypt permissions.
S3 permissions, as all access will be governed by Lake Formation, which internally uses this role to access the S3 bucket, we will add GetObject, PutObject and ListBucket permissions.
We will follow the steps below to add these permissions:
Select LFServiceRole.
Under the Permissions tab, open the Add permissions drop-down and click Create inline policy.
Click on JSON and copy the policy from the GitHub Link.
Replace REPLACE_WITH_KMS_KEY_ARN_CREATED_IN_STEP_1 with the KMS key ARN created in Step 1. Check the KMS Console; it will start with arn:aws…
Replace REPLACE_ME_WITH_DATALAKE_BUCKET with the bucket created in Step 2.
Click Next.
Name the policy as LFServiceRolePolicy.
Click Create Policy.
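The policy linked on GitHub remains the reference; as a sketch, the role and an inline policy covering the actions listed above could look like this (key ARN and bucket name are placeholders from Steps 1 and 2):

```python
import json
import boto3

key_arn = "arn:aws:kms:..."                         # from Step 1
bucket = "aws-data-mesh-<account_number>-<region>"  # from Step 2

iam = boto3.client("iam")

# Trust policy: Lake Formation is allowed to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lakeformation.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName="LFServiceRole", AssumeRolePolicyDocument=json.dumps(trust_policy))

# Least-privilege inline policy: KMS plus S3 access to the datalake bucket only.
role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "KmsAccess",
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
            "Resource": key_arn,
        },
        {
            "Sid": "DatalakeObjectAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
        },
        {
            "Sid": "DatalakeBucketAccess",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
        },
    ],
}
iam.put_role_policy(
    RoleName="LFServiceRole",
    PolicyName="LFServiceRolePolicy",
    PolicyDocument=json.dumps(role_policy),
)
```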
Step 4d. Register Datalake Bucket in Lake Formation
Return to the AWS Lake Formation Console.
Under Administration, select Data lake locations.
💡 Important: Verify that we are in the Administration section, not the "Data locations" option found under the Permissions section.
Click Register Location to begin the registration process.
Browse to our Datalake S3 bucket and click Select.
From the IAM role dropdown, select the role created in the previous step.
Click Register Location to complete the bucket registration.
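Scripted, the registration is a single call (bucket and account placeholders assumed):

```python
import boto3

bucket = "aws-data-mesh-<account_number>-<region>"
account_id = "<account_number>"

lf = boto3.client("lakeformation", region_name="us-east-1")
lf.register_resource(
    ResourceArn=f"arn:aws:s3:::{bucket}",
    RoleArn=f"arn:aws:iam::{account_id}:role/LFServiceRole",
    UseServiceLinkedRole=False,  # we supply our own LFServiceRole instead of the service-linked role
)
```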
Step 5. Automate Discovery: Create a Glue Crawler
With governance in place, the last part of this guide is to automate the discovery of our data using Glue Crawlers. These crawlers scan data in S3, infer the schema, and populate the Glue Data Catalog, making the data available for querying in Athena and other analytics tools.
Step 5a. Create Glue Database
Return to the AWS Glue Console. Update the region as needed.
Under Data Catalog, Click on Databases.
Click on Add Database.
Name the database as sf_safety_data.
We will use San Francisco Fire Department Emergency Response data.
We will discuss more about data in next part.
Provide the location as s3://<datalake_bucketname>/sf_safety_data/.
💡 Important: Although this field is marked as optional on the console, it is mandatory for crawlers to be able to scan the database's S3 location.
Click on Create Database.
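A scripted equivalent, with the mandatory location included (bucket placeholder assumed):

```python
import boto3

bucket = "aws-data-mesh-<account_number>-<region>"

glue = boto3.client("glue", region_name="us-east-1")
glue.create_database(
    DatabaseInput={
        "Name": "sf_safety_data",
        # The location is what lets the crawler scope its scan to this prefix.
        "LocationUri": f"s3://{bucket}/sf_safety_data/",
    }
)
```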
Step 5b. Create Glue Crawler Role
Navigate to the AWS IAM Console.
Under Access Management, go to Roles.
Click Create Role.
Select AWS service and search for Glue in the drop-down list. Click Next.
For now, don't select any managed policy, as we will create our own policy with least privilege. Click Next.
Name the role as AWSGlueServiceRole-datamesh.
Click Create role.
Now, we will add the following permissions to this role:
KMS permissions - as this role will be used to scan encrypted data in our bucket, we will provide Decrypt permission; the crawler will also use this role to update the Data Catalog, which is encrypted, so we will add GenerateDataKey and Encrypt permissions as well.
Glue permissions - the crawler will create/update tables in our database along with partitions (we haven't discussed these yet), so it will need Glue permissions.
Lake Formation permissions - as all access is governed by Lake Formation, we need to provide the Lake Formation GetDataAccess permission.
CloudWatch permissions - our Glue crawler will write logs to CloudWatch, so it will need permissions to put log events (again, we haven't discussed this service yet).
We will follow the steps below to add these permissions:
Select AWSGlueServiceRole-datamesh.
Under the Permissions tab, open the Add permissions drop-down and click Create inline policy.
Click on JSON and copy the policy from this GitHub Link.
Replace REPLACE_WITH_KMS_KEY_ARN_CREATED_IN_STEP_1 with the KMS key ARN created in Step 1. Check the KMS Console; it will start with arn:aws…
Replace <region> with region in which we are building this setup. For me, it is us-east-1.
Replace <account_number> with our account_id.
Click Next.
Name the policy as CrawlerAccessPolicy.
Click Create Policy.
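As before, the GitHub policy is authoritative; a hedged sketch of an inline policy covering the four permission groups described above might look like this (the specific Glue actions listed are an assumption of what a crawler typically needs):

```python
import json
import boto3

key_arn = "arn:aws:kms:..."      # from Step 1
region = "us-east-1"             # your region
account_id = "<account_number>"  # your account ID

crawler_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # KMS: read encrypted data and write to the encrypted Data Catalog
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
            "Resource": key_arn,
        },
        {   # Glue: create/update tables and partitions in the catalog
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase", "glue:GetTable", "glue:GetTables",
                "glue:CreateTable", "glue:UpdateTable",
                "glue:BatchCreatePartition", "glue:BatchGetPartition",
            ],
            "Resource": f"arn:aws:glue:{region}:{account_id}:*",
        },
        {   # Lake Formation: obtain temporary data-access credentials
            "Effect": "Allow",
            "Action": "lakeformation:GetDataAccess",
            "Resource": "*",
        },
        {   # CloudWatch Logs: write crawler logs
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": f"arn:aws:logs:{region}:{account_id}:log-group:/aws-glue/*",
        },
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="AWSGlueServiceRole-datamesh",
    PolicyName="CrawlerAccessPolicy",
    PolicyDocument=json.dumps(crawler_policy),
)
```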
Step 5c. Add Glue Crawler Role Permissions in Lake Formation
Grant Glue Crawler Role Access to Database and Tables
💡 Currently, in Lake Formation, we can't grant database and table permissions at the same time, which is why we need to perform the steps below in two parts. At a later stage of the series, we will move to tag-based access control, which will let us grant database and table permissions together.
Database Permissions
Navigate to the Amazon Lake Formation Console.
Under Permissions, select Data permissions.
Click Grant to initiate permission assignment.
From dropdown list of IAM users and roles, Select AWSGlueServiceRole-datamesh.
Note: As we are granting access to an IAM role within the same account, we selected the IAM users and roles radio button. For a different type of principal, select the option accordingly.
Choose Named Data Catalog resources.
For Catalog, select the account ID (the only option available).
For Database, select sf_safety_data.
Leave the Table dropdown blank for database-level permissions.
Select Super permission under Database permissions.
Click Grant to apply database permissions.
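The same grant can be made through the API; the console's Super permission corresponds to ALL in the API (account and region placeholders assumed):

```python
import boto3

account_id = "<account_number>"
role_arn = f"arn:aws:iam::{account_id}:role/AWSGlueServiceRole-datamesh"

lf = boto3.client("lakeformation", region_name="us-east-1")

# Database-level "Super" (ALL) permission for the crawler role.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role_arn},
    Resource={"Database": {"Name": "sf_safety_data"}},
    Permissions=["ALL"],
)
```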
Table Permissions
Repeat the above process for table access.
Follow steps 1-7 from the database permissions section.
For Tables, select All tables.
Select Super permission under Table permissions.
Click Grant to apply table permissions.
Add a filter as shown in the image below, and we will see two permissions, one for the database and one for the tables.
Grant Glue Crawler Role Access to S3 Location
Under Permissions, select Data locations.
Click Grant to begin location permission setup.
Select AWSGlueServiceRole-datamesh from the dropdown list.
Browse and select our datalake bucket.
Click Grant to finalize S3 location permissions.
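The table-level grant (All tables maps to a table wildcard) and the data location grant can be scripted the same way; a sketch with the same placeholders:

```python
import boto3

account_id = "<account_number>"
bucket = "aws-data-mesh-<account_number>-<region>"
role_arn = f"arn:aws:iam::{account_id}:role/AWSGlueServiceRole-datamesh"

lf = boto3.client("lakeformation", region_name="us-east-1")

# "All tables" in sf_safety_data: Super (ALL) permission via a table wildcard.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role_arn},
    Resource={"Table": {"DatabaseName": "sf_safety_data", "TableWildcard": {}}},
    Permissions=["ALL"],
)

# Data location access on the registered datalake bucket.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role_arn},
    Resource={"DataLocation": {"ResourceArn": f"arn:aws:s3:::{bucket}"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)
```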
Step 5d. Create Glue Crawler
Navigate to Data Catalog > Crawlers in the AWS Glue Console.
Click Create Crawler to initiate the crawler setup.
Enter sf_safety_data_crawler as the crawler name and click Next.
Configure the data source:
Click Add a data source.
Select the datalake bucket from the available options.
Click Add an S3 data source to confirm the selection.
Click Next to proceed.
Select AWSGlueServiceRole-datamesh from the IAM role dropdown list.
Enable the Use Lake Formation credentials for crawling S3 data source checkbox and click Next.
Choose the database created in previous steps as the Target Database.
Configure Advanced options:
Set Table level to 3.
Review the configuration and click Next.
Click Create Crawler to finalize the crawler creation.
⚠️ Note: Setting Table level to 3 prevents the Glue Crawler from treating multiple datasets as a single table when the same S3 location is used for different tables in the future.
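For completeness, a sketch of the equivalent create_crawler call, including the Lake Formation credentials option and the table-level setting (the data source path mirrors the console selection above; placeholders assumed):

```python
import json
import boto3

bucket = "aws-data-mesh-<account_number>-<region>"

glue = boto3.client("glue", region_name="us-east-1")
glue.create_crawler(
    Name="sf_safety_data_crawler",
    Role="AWSGlueServiceRole-datamesh",
    DatabaseName="sf_safety_data",
    # Data source: the datalake bucket, as selected in the console steps.
    Targets={"S3Targets": [{"Path": f"s3://{bucket}/"}]},
    # Crawl through Lake Formation credentials instead of direct S3 access.
    LakeFormationConfiguration={"UseLakeFormationCredentials": True},
    # Table level 3, matching the Advanced options setting above.
    Configuration=json.dumps({"Version": 1.0, "Grouping": {"TableLevelConfiguration": 3}}),
)
```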
Summary
In this first part, we laid the groundwork for building a modern Data Mesh on AWS using key services such as S3, Glue, and Lake Formation. We covered the essential concepts and practical steps required to set up a secure, scalable, and well-governed Data Mesh environment. This foundational setup ensures that our data is organized, discoverable, and ready for further processing and analytics.
What We Achieved So Far
Set up the initial AWS account structure for data mesh operations.
Configured foundational storage using Amazon S3.
Established governance and access controls with AWS Lake Formation.
Created a central data catalog using AWS Glue for better data discoverability.
Built a secure and organized environment to support future data publishing and analytics.
Whatโs Next
In the next part of this series, we will focus on publishing data, updating the central data catalog, and demonstrating how to query the data. This will include practical steps for making our datasets available, ensuring they are discoverable, and enabling efficient data access for analytics and business intelligence.
If you face any challenges or have any doubts, feel free to mention them in the comments below; I'll be happy to answer and help you out!