Part 2 – Operationalizing Data Mesh

Harshit Garg
10 min read

In the previous part, we laid the secure and governed foundations for our data mesh on AWS. If you missed it, check out Part 1 – Laying the Foundations for all the setup details, including KMS encryption, S3 buckets, Lake Formation, and Glue configuration.

Now, let’s move from platform setup to real data workflows: publishing and consuming data.

In this series, we’ll be using the official San Francisco Safety Dataset, which you can explore here:
Fire Department and Emergency Medical Services Dispatched Calls.

  • This dataset contains 36 columns and roughly 6.98M rows. Each row captures the details of a single dispatched response.

  • For this part, we will upload a single CSV file to our datalake bucket.

  • In upcoming parts, we will transform this file to optimize query performance.

Publisher Workflow

Publishers in a data mesh architecture are responsible for preparing, managing, and sharing high-quality data products with the organization. They are typically domain teams (data owners) who ingest raw data, clean and transform it, document metadata, and register datasets into the data platform for discoverability and governed access. Their duties include ensuring data quality, maintaining data lineage, and keeping documentation up to date.

Let’s revisit the architecture diagram from the previous part, where our focus was platform setup. This time, let’s focus on the publisher side.

Step 1: Sourcing Real-World Data: Download the Dataset

To make our data mesh valuable, we need to bring in a real dataset that can be published and consumed. The San Francisco Safety dataset is large and well-structured, making it ideal for demonstrating data onboarding, cataloging, and analytics.

Step 2: Ingesting Data: Upload the Dataset

Once we have the dataset, we will upload it via the AWS Console to our datalake bucket created in Part 1 (named aws-data-mesh-<account_number>-<region>). A scripted alternative using boto3 is sketched at the end of this step.

Organizing our data in S3 using meaningful prefixes makes it easier to manage, automate, and later partition. This structure supports scalable data engineering practices.

  1. Navigate to the S3 console.

  2. Select our datalake bucket.

  3. Click on Create folder and name it sf_safety_data (we used this as the prefix in the Database Location in Part 1).

  4. Click on sf_safety_data, then click Create folder again and name it sf_safety_data_csv_non_partitioned.

  5. Click on sf_safety_data_csv_non_partitioned. (The crawler will use this folder name as our table name.)

    1. I included csv for the file type and non_partitioned because this table won’t have any partitions.

    2. This naming is not mandatory, but it makes it easier to understand how data is loaded into the table.

  6. Click on Upload, then drag and drop or select the CSV file downloaded in Step 1.
    Note: Make sure you are inside sf_safety_data_csv_non_partitioned before you start the upload.

Uploading may take several minutes due to the file size.

⚠️ Important: Ensure server-side encryption is enabled (it will be enabled if you followed Part 1).
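
If you prefer to script the upload instead of using the console, here is a minimal boto3 sketch. It assumes the bucket and KMS key from Part 1; the local file name and the ARN are placeholders you would replace with your own values.

```python
import boto3

# Placeholder values -- replace with your own bucket, key ARN, and local file path
BUCKET = "aws-data-mesh-<account_number>-<region>"
KMS_KEY_ARN = "arn:aws:kms:<region>:<account_number>:key/<key-id>"
LOCAL_FILE = "sf_fire_ems_dispatched_calls.csv"  # the CSV downloaded in Step 1

s3 = boto3.client("s3")

# Upload under the prefix the crawler expects; SSE-KMS keeps the object encrypted
# with the same customer-managed key used for the rest of the data lake.
s3.upload_file(
    Filename=LOCAL_FILE,
    Bucket=BUCKET,
    Key=f"sf_safety_data/sf_safety_data_csv_non_partitioned/{LOCAL_FILE}",
    ExtraArgs={
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": KMS_KEY_ARN,
    },
)
```

upload_file handles multipart uploads automatically, which helps with a file of this size.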

Step 3: Metadata Discovery: Run the Glue Crawler

Using the Glue crawler, we will detect the newly uploaded data and update the Data Catalog. This step ensures that our data is discoverable and ready for analytics without manual schema management. If you prefer to automate the run, a boto3 sketch follows the steps below.

  1. Navigate to the AWS Glue console.

  2. Select the crawler we created in Part 1.

  3. Click on Run.

It will take a few minutes to scan the bucket. Once it has succeeded, check the Table Changes column; it should show “1 created”.
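
The same run can be triggered from code. The sketch below assumes a crawler name of sf-safety-data-crawler, which is a hypothetical placeholder; use whatever name you gave the crawler in Part 1.

```python
import time
import boto3

CRAWLER_NAME = "sf-safety-data-crawler"  # hypothetical name -- use your crawler from Part 1

glue = boto3.client("glue")
glue.start_crawler(Name=CRAWLER_NAME)

# Poll until the crawler returns to the READY state
while True:
    state = glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"]
    if state == "READY":
        break
    time.sleep(30)

# LastCrawl holds the status of the run that just finished (SUCCEEDED / FAILED / CANCELLED)
print(glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["LastCrawl"]["Status"])
```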

Step 4: Understanding the Data: Explore the Schema

Before running analytics or building dashboards, it’s crucial to understand the structure of our data like what columns are available, their data types, and any potential data quality issues. Exploring the schema in Glue ensures the crawler has correctly inferred the structure and helps us plan efficient queries and transformations.

  1. In Glue Console, select Tables under Databases, then click on sf_safety_data_csv_non_partitioned.

  2. Now, under Table overview, explore the details.

  3. Scroll down to the Schema tab and check the column names and data types.

💡
Since CSV files do not contain embedded schema information, the Glue crawler infers column data types based on the data it scans. When the crawler cannot confidently determine a specific data type for a column, it defaults to assigning the type as string. For instance, the date_as_of column is detected as a string, even though it actually contains timestamp values. We will address this limitation in future parts by transforming the data into Parquet format, which preserves schema and data types more accurately.
💡
Additionally, you may notice that our inferred schema lists 37 columns, whereas the official website mentions only 36. Upon inspecting the data file, we find an extra column, analysis_neighborhoods, at the end, which is present in the file but not documented on the website. This confirms that the crawler has correctly detected all column names as they appear in the source data.
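
We can also pull the inferred schema programmatically to double-check the column count and types. This sketch assumes the Glue database is named sf_safety_data; substitute the database name you actually created in Part 1.

```python
import boto3

DATABASE = "sf_safety_data"  # assumed database name from Part 1 -- adjust if yours differs
TABLE = "sf_safety_data_csv_non_partitioned"

glue = boto3.client("glue")
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

columns = table["StorageDescriptor"]["Columns"]
print(f"{len(columns)} columns detected")   # expect 37, as noted above
for col in columns:
    print(f"{col['Name']}: {col['Type']}")  # date_as_of will show as string
```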

This step completes the Publisher Workflow and our data is now discoverable and ready for analytics.


Consumer Workflow

The consumer workflow focuses on discovering, accessing, and analyzing data published by various teams. Consumers, such as analysts, data scientists, or downstream applications, are responsible for requesting access, querying data according to governance policies, and providing feedback on data usability and quality. Their duties include adhering to access controls, using data responsibly, and collaborating with publishers to improve data quality, ensuring the organization derives maximum value from the data mesh.

Just like we did for the publisher workflow, let’s revisit the architecture diagram from the previous part. This time, let’s focus on the consumer side.

Step 1: Storing Query Results: Create an Athena Results Bucket

Storing Athena query results in a dedicated, encrypted S3 bucket keeps analytics outputs secure and separate from our datalake. Within this bucket, we can optionally create a separate prefix for each consumer, which is a best practice for both security and organization.

Follow the Create S3 Bucket steps from Part 1 of this series, with only one change:

  1. Name of bucket → aws-data-mesh-athena-results-<account_number>-<region>.
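
For reference, here is a minimal boto3 sketch of the same bucket setup, assuming the us-east-1 region used in this series; the bucket name and KMS key ARN are placeholders to replace with your own values.

```python
import boto3

BUCKET = "aws-data-mesh-athena-results-<account_number>-<region>"
KMS_KEY_ARN = "arn:aws:kms:<region>:<account_number>:key/<key-id>"  # key created in Part 1

s3 = boto3.client("s3", region_name="us-east-1")

# In us-east-1 no LocationConstraint is needed; other regions require
# CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket=BUCKET)

# Default SSE-KMS encryption so every query result object is encrypted at rest
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ARN,
                }
            }
        ]
    },
)
```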

Step 2: Enabling Analytics: Setup Athena Workgroup

Athena workgroups allow us to isolate workloads, enforce encryption, manage costs, and maintain separate query histories. This is essential for multi-user or multi-team environments.

  1. Navigate to the Athena console.

  2. Click the menu button at the top left and, under Administration, click Workgroups.

  3. Click on Create Workgroup.

  4. Name the workgroup data-mesh-consumer.

  5. Scroll down to Query result configuration.

  6. Select Customer managed.

  7. For Location of query result, browse to the bucket created in the step above.

  8. Select Encrypt query results. Then select SSE-KMS.

  9. Select Choose a different AWS KMS key (advanced), then select the KMS key created in Part 1.

  10. We can use Additional configuration to manage usage limits for the workgroup, but for now we will keep the defaults.

  11. Click Create Workgroup.
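
The equivalent workgroup can also be created with boto3, as sketched below; the output location and KMS key ARN are placeholders for the bucket and key from the earlier steps.

```python
import boto3

RESULTS_BUCKET = "aws-data-mesh-athena-results-<account_number>-<region>"
KMS_KEY_ARN = "arn:aws:kms:<region>:<account_number>:key/<key-id>"

athena = boto3.client("athena")

athena.create_work_group(
    Name="data-mesh-consumer",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": f"s3://{RESULTS_BUCKET}/",
            "EncryptionConfiguration": {
                "EncryptionOption": "SSE_KMS",  # encrypt query results with our KMS key
                "KmsKey": KMS_KEY_ARN,
            },
        },
        # Force queries in this workgroup to use these settings rather than
        # whatever the client supplies.
        "EnforceWorkGroupConfiguration": True,
    },
    Description="Workgroup for data mesh consumers",
)
```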

Step 3: Getting Access to Datalake: IAM User for the Consumer

In a data mesh architecture, it’s important to give each user or team only the access they need. For a large organization that already has an identity provider such as Active Directory, federated IAM roles are a better approach. But since we are just starting our journey, we will create a dedicated IAM user for data consumers and assign least-privilege permissions. (A boto3 sketch of the user-creation steps follows this section.)

Create Managed Policy

  1. Navigate to the IAM console.

  2. In the sidebar, click Policies.

  3. Click on Create Policy.

  4. Click on JSON and copy the policy from this GitHub Link.

    1. Replace REPLACE_WITH_KMS_KEY_ARN_CREATED_IN_PART_1 with the KMS key ARN created in Part 1.

    2. Replace REPLACE_ME_WITH_QUERY_RESULTS_BUCKET with the bucket created in Step 1 of the consumer workflow.

    3. Replace <region> with the region in which we are building this setup. For me, it is us-east-1.

    4. Replace <account_number> with our account ID.

  5. Click Next.

  6. Name the policy as ConsumerUserPolicy.

  7. Click on Create Policy.

Create IAM User

  1. In the sidebar, click Users.

  2. Click on Create User.

  3. Name the user as DataMeshUser.

  4. Select Provide user access to the AWS Management Console - optional.

  5. Choose I want to create an IAM user.

  6. Select Autogenerated password.

  7. Select Users must create a new password at next sign-in - Recommended. Click on Next.

  8. Select Attach policies directly.

  9. Search for ConsumerUserPolicy which we created above. Click Next.

  10. Click on Create user.

💡
Copy the password and sign-in URL and keep them safe; they will be required when we log in as this IAM user.
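
If you would rather script the user creation, a minimal sketch is below. It assumes the ConsumerUserPolicy already exists (created in the previous sub-section) and uses a placeholder password; boto3 also exposes iam.create_policy if you want to script that part as well.

```python
import boto3

ACCOUNT_ID = "<account_number>"
USER_NAME = "DataMeshUser"
POLICY_ARN = f"arn:aws:iam::{ACCOUNT_ID}:policy/ConsumerUserPolicy"

iam = boto3.client("iam")

iam.create_user(UserName=USER_NAME)

# Console access with a one-time password the user must rotate at first sign-in
iam.create_login_profile(
    UserName=USER_NAME,
    Password="<initial-password>",  # placeholder -- generate a strong temporary password
    PasswordResetRequired=True,
)

# Attach the least-privilege managed policy created above
iam.attach_user_policy(UserName=USER_NAME, PolicyArn=POLICY_ARN)
```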

Step 4: Governing Data Access: Grant Lake Formation Permissions

Now that our consumer IAM user exists, it’s time to explicitly grant them access to query the relevant data. Assigning Lake Formation permissions at this point ensures that only authorized users can read from the database and tables we’ve prepared, enforcing fine-grained, auditable access control as our data mesh grows. This step is crucial for maintaining security and clear data ownership as we enable new consumers.

We have to perform the same steps for DataMeshUser that we performed in Part 1 to give the Glue crawler role access to our database and tables. Perform these steps, and let me know in the comments if you face any issues. (A boto3 sketch of these grants is included below.)

💡
Remember to grant only the DESCRIBE permission on the database, and only SELECT and DESCRIBE permissions on the table. Don’t grant Data Location permissions to consumers.

If this is done correctly, this is how the permissions will look in the Data permissions section.
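
For reference, the same grants can be issued with boto3. The sketch below assumes the database is named sf_safety_data; adjust the names and account number to match your setup.

```python
import boto3

ACCOUNT_ID = "<account_number>"
DATABASE = "sf_safety_data"  # assumed database name from Part 1
TABLE = "sf_safety_data_csv_non_partitioned"
CONSUMER_ARN = f"arn:aws:iam::{ACCOUNT_ID}:user/DataMeshUser"

lf = boto3.client("lakeformation")

# DESCRIBE only on the database
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_ARN},
    Resource={"Database": {"Name": DATABASE}},
    Permissions=["DESCRIBE"],
)

# SELECT and DESCRIBE only on the table -- no data location permissions
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_ARN},
    Resource={"Table": {"DatabaseName": DATABASE, "Name": TABLE}},
    Permissions=["SELECT", "DESCRIBE"],
)
```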

Congratulations! We have successfully built our basic Data Mesh setup. Now it is time to query our data.


Consuming Data: Login and Query in Athena

Let’s log in to our account as DataMeshUser.

  1. Logout from Administrator user.

  2. Go to the console sign-in URL we stored previously.

  3. Enter the IAM username DataMeshUser.

  4. Paste the password stored earlier and sign in.

  5. Reset the password to a new one you can remember.

Query Data

  1. Navigate to the Athena console and switch to the region in which our data mesh is available.

  2. From the menu button at the top left, open the sidebar and select Query editor.
    Note: A red error banner will be visible. This is expected, as we have only granted access to the data-mesh-consumer workgroup; by default, Athena selects the primary workgroup.

  3. From the Workgroup drop-down at the top right, select data-mesh-consumer.

  4. Now, we can see our table sf_safety_data_csv_non_partitioned.

  5. In Query Editor, we can now run queries on this table.

💡
Don’t run a SELECT * query without a LIMIT, as this will scan the whole table (about 2.5 GB) and incur unnecessary cost.
💡
Try running ALTER and DROP queries. They will fail, as we haven’t granted those permissions in Lake Formation.

Explore Athena Documentation to learn about more queries.
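
Queries can also be submitted programmatically with the consumer’s credentials. The sketch below assumes the database is named sf_safety_data (adjust to your Part 1 database name) and uses a LIMIT so only a small amount of data is scanned.

```python
import time
import boto3

DATABASE = "sf_safety_data"  # assumed database name from Part 1
QUERY = "SELECT * FROM sf_safety_data_csv_non_partitioned LIMIT 10"

# Use a session or profile that carries the DataMeshUser credentials
athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": DATABASE},
    WorkGroup="data-mesh-consumer",  # results land in the encrypted workgroup bucket
)["QueryExecutionId"]

# Wait for the query to finish
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
    if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if status["State"] == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([field.get("VarCharValue") for field in row["Data"]])
else:
    print(status.get("StateChangeReason", status["State"]))
```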


Summary

In this part, we advanced our data mesh journey by operationalizing both the publisher and consumer workflows. Here’s what we achieved:

  • Sourced and uploaded a real-world dataset, organizing it in S3 with a logical prefix for better management.

  • Cataloged the dataset using AWS Glue, making it discoverable and ready for analytics.

  • Explored the schema in Glue to ensure the data structure was correctly inferred and understood.

  • Established secure, role-based access for consumers by creating a dedicated IAM user with least-privilege permissions.

  • Set up a dedicated Athena workgroup and an encrypted results bucket to isolate and secure query outputs.

  • Granted granular permissions using Lake Formation, ensuring only authorized users can access the data.

  • Validated the end-to-end workflow by running queries as a consumer and noting query performance and data scanned.

With these steps, we have moved beyond foundational setup by demonstrating how data owners can publish high-quality, discoverable data, and how consumers can securely access and analyze it within a governed environment.

Now that our data mesh platform is set up, try ingesting any other open-source dataset, upload it to our S3 bucket, run the Glue crawler, and access the data in Athena. This is a great way to explore the flexibility and scalability of our architecture and see how easily new data can be onboarded and queried.

What’s Next

In the next part, now that our platform and workflows are in place, we’ll refine the publishing process by partitioning our data and transforming it to Parquet format. We’ll then compare query performance before and after these optimizations, highlighting the tangible benefits of efficient data engineering within a data mesh.


If you encounter any challenges, have questions, or want to share your experience, please drop a comment below. Your feedback and insights are always welcome as we continue building and refining this data mesh journey together.
