Part 5 - Network Security in Data Mesh


Recap: Where We Stand
In our AWS Data Mesh series so far, we have established the foundational concepts of data mesh, set up the base AWS environment, and ingested and cataloged a single table with AWS Glue so it can be queried within the mesh. We then optimized the setup for scalability and performance, and implemented auditing and monitoring to keep data operations secure and compliant.
You can explore the full series here: AWS Data Mesh Series
Introduction
As organizations scale their data mesh on AWS, network security becomes a foundational pillar for protecting sensitive data and ensuring compliance. In this post, we focus on how to secure data mesh components, especially the AWS Glue Crawler, by leveraging AWS networking features. We’ll introduce key networking concepts, explain why they matter, and provide a step-by-step guide to moving our Glue Crawler inside a VPC.
Networking Primer
A solid understanding of AWS networking components is essential for building a secure, scalable data mesh. Here’s a concise overview of the key elements we’ll encounter when securing data mesh resources like Glue Crawlers and ETL pipelines.
Virtual Private Cloud (VPC) - A VPC is our own isolated network within AWS. It lets us define IP address ranges, create subnets, and control how resources communicate internally and externally. VPCs provide the foundation for network security, segmentation, and compliance.
CIDR (Classless Inter-Domain Routing) - A CIDR block defines the range of IP addresses available within our VPC or subnet. It uses a notation like 10.0.0.0/16, where the prefix (/16) specifies the number of bits used for the network portion of the address; the remaining bits identify individual addresses. This allows us to efficiently allocate and manage IP address ranges for our AWS resources.
Example:
10.0.0.0/16 provides 65,536 IP addresses for our VPC.
10.0.1.0/24 provides 256 IP addresses for a subnet.
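The arithmetic behind these two examples can be checked with Python's standard `ipaddress` module. This is a minimal sketch; the one detail it adds beyond the figures above is that AWS reserves five addresses in every subnet, so a /24 subnet leaves 251 addresses for our resources.

```python
import ipaddress

# AWS reserves five addresses in every subnet: the network address,
# the VPC router, the DNS server, one held for future use, and the
# broadcast address.
AWS_RESERVED_PER_SUBNET = 5

vpc = ipaddress.ip_network("10.0.0.0/16")
subnet = ipaddress.ip_network("10.0.1.0/24")

# A /16 leaves 32 - 16 = 16 host bits, so 2**16 addresses.
print(vpc.num_addresses)       # 65536
# A /24 leaves 8 host bits, so 2**8 addresses.
print(subnet.num_addresses)    # 256

usable = subnet.num_addresses - AWS_RESERVED_PER_SUBNET
print(usable)                  # 251

# The subnet range must sit inside the VPC range.
print(subnet.subnet_of(vpc))   # True
```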
Subnet - A Subnet is a segment of our VPC’s IP address range. Subnets can be:
Public: Connected to the internet via an Internet Gateway.
Private: No direct internet access, ideal for sensitive workloads.
Segmenting resources into subnets allows fine-grained control over access and security.
Internet Gateway - An Internet Gateway enables communication between our VPC and the internet. Only subnets with a route to the Internet Gateway can send/receive traffic from the internet, making this a key control point for exposure.
S3 Gateway Endpoint - An S3 Gateway Endpoint allows private, secure connectivity between our VPC and Amazon S3. With this endpoint, data traffic to S3 never leaves the AWS network or traverses the public internet, which is crucial for keeping our S3 buckets private and secure.
Interface Endpoint - An Interface Endpoint (powered by AWS PrivateLink) provides private connectivity to supported AWS services (such as Glue, KMS, Secrets Manager) from within our VPC. It creates elastic network interfaces in our subnets, ensuring service traffic stays within AWS’s private network.
Elastic Network Interface (ENI) - An Elastic Network Interface (ENI) is a virtual network card that provides network connectivity to resources in our VPC, such as EC2 instances, VPC-connected Lambda functions, and Glue jobs. An ENI has a primary private IP address, optional secondary private IP addresses, one or more security groups, and a MAC address. When we create an Interface Endpoint, AWS automatically provisions an ENI in each subnet we select, allowing secure, private communication with AWS services.
Route Tables - Route Tables determine how network traffic is directed within our VPC. They define which subnets can reach the internet, S3, and other AWS services, and are critical for enforcing network boundaries and security policies.
Using these components, we can design a secure, compliant, and robust network architecture for our AWS data mesh, ensuring that our data products and processing jobs remain protected and efficient.
Why Network Security Matters in Data Mesh
Effective network security is a cornerstone of a robust data mesh architecture. As organizations distribute data ownership and processing across multiple domains and AWS accounts, the importance of secure, well-designed networking grows significantly. Here’s a deeper look at why network security is essential in this context:
Protects Data Products:
Access Control: By leveraging VPCs, private subnets, and security groups, we can ensure that only authorized AWS resources such as specific EC2 instances, Lambda functions, or Glue jobs can access sensitive data products.
Minimizes Exposure: Using VPC endpoints and private connectivity ensures that data never traverses the public internet, reducing the risk of interception or unauthorized access.
Enables Compliance:
Regulatory Alignment: Many compliance frameworks (such as GDPR, HIPAA, or PCI DSS) require that sensitive data is processed and transferred within private, controlled networks. Network security features like VPC endpoints and private subnets help meet these requirements.
Auditability: Keeping data flows within the VPC allows for more granular logging and monitoring via VPC Flow Logs, CloudTrail, and other AWS tools, supporting audits and incident response.
Data Residency: Network controls enable us to enforce data residency and sovereignty policies by restricting data movement to approved paths and regions.
Supports Future ETL Pipelines:
Secure Processing: As our data mesh matures, we’ll introduce more complex ETL pipelines using AWS Glue or similar services. Running these jobs inside a secure VPC ensures that intermediate and final datasets remain protected throughout processing.
Scalable Integration: A well-secured network foundation makes it easier to integrate new data sources, consumers, and processing tools without exposing sensitive data to unnecessary risk.
Consistent Security Posture: By standardizing network security practices early (such as using S3 Gateway Endpoints and Interface Endpoints for service access), we ensure that all future data processing workloads inherit strong security controls by default.
By prioritizing network security in our data mesh, we not only safeguard sensitive data and meet compliance obligations, but also create a flexible, secure environment that can scale with our evolving data needs.
Building a Secure Network Foundation
Before we dive into the technical steps, it’s important to understand that our network architecture is the bedrock of our data mesh’s security and scalability. By thoughtfully designing each layer from VPC creation to endpoint configuration, we ensure that our data products remain protected, our compliance needs are met, and our environment is ready for future data processing demands. The following guide walks through each critical step, providing not only the “how,” but also the “why,” so we can confidently build a robust, secure network for our AWS Data Mesh.
Step 1. Build Your Fortress: Create a New VPC
A Virtual Private Cloud (VPC) is the foundation of our AWS network security. It gives us complete control over our network environment, including IP address ranges, subnets, route tables, and gateways.
Navigate to VPC Console. Click on VPCs from Dashboard.
Click Create VPC. Select VPC Only radio button.
💡Note: While the "VPC and more" wizard can automate much of the setup, we’ll configure each component manually here for better understanding.
Name our VPC data-mesh-vpc.
Specify the IPv4 CIDR block as 10.0.0.0/16.
For IPv6 CIDR block, select No IPv6 CIDR block since we are working with IPv4 only.
Set Tenancy to Default.
Click Create VPC to finish the process.
Our new VPC is now ready for further configuration!
To prepare our VPC for creating Interface Endpoints, we need to enable DNS hostnames. Private DNS for Interface Endpoints requires both DNS resolution (enabled by default on a new VPC) and DNS hostnames to be turned on, so that AWS service names can resolve to private addresses inside our VPC. This is a foundational step for achieving secure, private communication between our VPC resources and AWS services.
With your VPC selected, click on the Actions dropdown menu.
Choose Edit VPC Settings.
In the settings window, check the box for Enable DNS hostnames.
Click Save to apply the changes.
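For readers who prefer scripting, the console steps above map onto two EC2 API calls. The sketch below only builds the request parameters, so it runs without AWS access; the helper names `create_vpc_params` and `enable_dns_hostnames_params` are hypothetical, and the commented usage assumes boto3 is installed with credentials configured for us-east-1.

```python
def create_vpc_params(name: str = "data-mesh-vpc",
                      cidr: str = "10.0.0.0/16") -> dict:
    """Keyword arguments for the EC2 create_vpc call (IPv4 only)."""
    return {
        "CidrBlock": cidr,
        "InstanceTenancy": "default",  # matches Tenancy: Default
        "TagSpecifications": [
            {"ResourceType": "vpc",
             "Tags": [{"Key": "Name", "Value": name}]}
        ],
    }


def enable_dns_hostnames_params(vpc_id: str) -> dict:
    """Keyword arguments for modify_vpc_attribute to enable DNS hostnames."""
    return {"VpcId": vpc_id, "EnableDnsHostnames": {"Value": True}}


# Usage (assumption: boto3 installed, credentials configured):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   vpc_id = ec2.create_vpc(**create_vpc_params())["Vpc"]["VpcId"]
#   ec2.modify_vpc_attribute(**enable_dns_hostnames_params(vpc_id))
```

No IPv6 parameter is passed, which corresponds to choosing No IPv6 CIDR block in the console.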
Step 2. Divide and Conquer: Set Up Public and Private Subnets
Subnets allow us to segment our network for better security and availability. Public subnets can host resources that need internet access, while private subnets keep sensitive workloads isolated.
From sidebar in VPC Console, Click on Subnets.
Click on Create Subnet.
Choose the data-mesh-vpc VPC.
Add subnets:
For the first subnet:
- Subnet Name: PrivateSubnet01, IPv4 CIDR block: 10.0.1.0/24, Availability Zone: Select one (e.g., us-east-1a)
Click Add new subnet.
For the second subnet:
- Subnet Name: PrivateSubnet02, IPv4 CIDR block: 10.0.2.0/24, Availability Zone: Select a different one (e.g., us-east-1b)
Click Add new subnet.
For the third subnet:
- Subnet Name: PublicSubnet01, IPv4 CIDR block: 10.0.101.0/24, Availability Zone: Choose the first zone (e.g., us-east-1a)
Click Add new subnet.
For the fourth subnet:
- Subnet Name: PublicSubnet02, IPv4 CIDR block: 10.0.102.0/24, Availability Zone: Choose the second zone (e.g., us-east-1b)
Verify that both private subnets and both public subnets are distributed across different Availability Zones for high availability.
Once all subnet details are entered, click Create Subnet to complete the process.
Our VPC now contains two private and two public subnets, with each pair spread across two Availability Zones, providing a resilient and secure network foundation for our data mesh environment.
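Before creating subnets, it can help to confirm that the plan is internally consistent. The following sketch checks that every subnet from this step falls inside the VPC range and that none overlap, then builds parameters for the EC2 `create_subnet` call. The helper names are hypothetical, and the Availability Zones are the example values used above.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# (name, CIDR block, Availability Zone) as entered in the console.
SUBNET_PLAN = [
    ("PrivateSubnet01", "10.0.1.0/24", "us-east-1a"),
    ("PrivateSubnet02", "10.0.2.0/24", "us-east-1b"),
    ("PublicSubnet01", "10.0.101.0/24", "us-east-1a"),
    ("PublicSubnet02", "10.0.102.0/24", "us-east-1b"),
]


def validate_plan(plan, vpc=VPC_CIDR) -> None:
    """Raise ValueError if a subnet is outside the VPC or overlaps another."""
    nets = [(name, ipaddress.ip_network(cidr)) for name, cidr, _ in plan]
    for name, net in nets:
        if not net.subnet_of(vpc):
            raise ValueError(f"{name} ({net}) is outside {vpc}")
    for i, (name_a, net_a) in enumerate(nets):
        for name_b, net_b in nets[i + 1:]:
            if net_a.overlaps(net_b):
                raise ValueError(f"{name_a} overlaps {name_b}")


def create_subnet_params(vpc_id: str, name: str, cidr: str, az: str) -> dict:
    """Keyword arguments for the EC2 create_subnet call."""
    return {
        "VpcId": vpc_id,
        "CidrBlock": cidr,
        "AvailabilityZone": az,
        "TagSpecifications": [
            {"ResourceType": "subnet",
             "Tags": [{"Key": "Name", "Value": name}]}
        ],
    }


validate_plan(SUBNET_PLAN)  # passes silently for the plan above

# Usage (assumption: boto3 installed, credentials configured, vpc_id known):
#   for name, cidr, az in SUBNET_PLAN:
#       ec2.create_subnet(**create_subnet_params(vpc_id, name, cidr, az))
```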
Step 3. Direct the Traffic: Create an Additional Route Table
A dedicated route table for our private subnets ensures we control how traffic flows, keeping private resources isolated from the internet.
From sidebar in VPC Console, Click on Route Tables.
Click Create Route Table.
Name it data-mesh-private-rt.
Choose the data-mesh-vpc VPC.
Click on Create Route Table.
Step 4. Unlock Private S3 Access: Create an S3 Gateway Endpoint
An S3 Gateway Endpoint allows our private subnets to access S3 securely, without sending traffic over the public internet.
From sidebar in the VPC Console, Click on Endpoints.
Click on Create Endpoint.
Give it the name s3-gateway-endpoint.
Select AWS services and search for com.amazonaws.us-east-1.s3.
Select the endpoint with type Gateway.
Select our VPC and select the route table created above (data-mesh-private-rt).
Keep the Full access policy.
Click Create Endpoint.
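The same endpoint can be described as parameters for the EC2 `create_vpc_endpoint` call. This is a sketch only: the helper name is hypothetical, and the VPC and route table IDs must come from the earlier steps. Leaving out a policy document keeps the default full-access policy, matching the console choice above.

```python
def s3_gateway_endpoint_params(vpc_id: str,
                               route_table_id: str,
                               region: str = "us-east-1",
                               name: str = "s3-gateway-endpoint") -> dict:
    """Keyword arguments for create_vpc_endpoint (Gateway type, S3)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Gateway endpoints work by adding a route to these route tables.
        "RouteTableIds": [route_table_id],
        "TagSpecifications": [
            {"ResourceType": "vpc-endpoint",
             "Tags": [{"Key": "Name", "Value": name}]}
        ],
    }


# Usage (assumption: boto3 installed, credentials configured):
#   ec2.create_vpc_endpoint(**s3_gateway_endpoint_params(vpc_id, rt_id))
```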
Step 5. Seal the Gates: Remove Internet Access from Private Route Table
To ensure true isolation, our private subnets must not have a route to the internet. This prevents accidental data leaks or unauthorized access.
Because we created a new route table, it should not contain a route to an Internet Gateway. If one is present, we need to delete it.
From sidebar in VPC Console, Click on Route Tables.
Select the private route table (data-mesh-private-rt).
Click on the Routes tab, then Edit routes, and remove any route to the Internet Gateway (0.0.0.0/0 via IGW).
Ensure the only external route is to S3 via the Gateway Endpoint.
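This check can also be automated. The sketch below scans one route table entry, in the shape returned by the EC2 `describe_route_tables` call, and lists any routes that target an Internet Gateway. The function name and the sample IDs are hypothetical.

```python
def internet_routes(route_table: dict) -> list:
    """Return destinations of routes that point at an Internet Gateway."""
    found = []
    for route in route_table.get("Routes", []):
        # Internet Gateway targets have IDs starting with "igw-";
        # the local route uses "local" and gateway endpoints use "vpce-".
        if route.get("GatewayId", "").startswith("igw-"):
            found.append(route.get("DestinationCidrBlock")
                         or route.get("DestinationIpv6CidrBlock"))
    return found


# Sample data with made-up IDs: a clean private route table...
private_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationPrefixListId": "pl-00000000", "GatewayId": "vpce-0abc"},
]}
# ...and one that still exposes a default route to the internet.
leaky_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc"},
]}

print(internet_routes(private_rt))  # []
print(internet_routes(leaky_rt))    # ['0.0.0.0/0']

# Usage (assumption: boto3 installed, credentials configured):
#   rt = ec2.describe_route_tables(RouteTableIds=[rt_id])["RouteTables"][0]
#   for cidr in internet_routes(rt):
#       ec2.delete_route(RouteTableId=rt_id, DestinationCidrBlock=cidr)
```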
Step 6. Assign the Guards: Associate Private Route Table with Private Subnets
Associating the private route table with our private subnets enforces our custom routing and security rules.
Continuing from the previous step, click on the Subnet Associations tab.
Click on Edit subnet associations. Select two private subnets.
Save the association.
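Creating the private route table and associating it with both private subnets corresponds to the EC2 `create_route_table` and `associate_route_table` calls. The helpers below are a hypothetical sketch that only builds the parameters; the subnet IDs in the usage comment are placeholders.

```python
def create_route_table_params(vpc_id: str,
                              name: str = "data-mesh-private-rt") -> dict:
    """Keyword arguments for the EC2 create_route_table call."""
    return {
        "VpcId": vpc_id,
        "TagSpecifications": [
            {"ResourceType": "route-table",
             "Tags": [{"Key": "Name", "Value": name}]}
        ],
    }


def association_params(route_table_id: str, subnet_ids: list) -> list:
    """One associate_route_table request per private subnet."""
    return [{"RouteTableId": route_table_id, "SubnetId": subnet_id}
            for subnet_id in subnet_ids]


# Usage (assumption: boto3 installed, credentials configured):
#   rt_id = ec2.create_route_table(
#       **create_route_table_params(vpc_id))["RouteTable"]["RouteTableId"]
#   for params in association_params(rt_id, [private_subnet_1, private_subnet_2]):
#       ec2.associate_route_table(**params)
```

A subnet can be associated with only one route table at a time, so this association replaces the implicit link to the VPC's main route table.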
Step 7. Connect Privately to AWS Services: Create Interface Endpoints
Interface Endpoints (powered by PrivateLink) allow our private subnets to communicate with AWS services like STS and Lake Formation without traversing the public internet.
From sidebar in the VPC Console, Click on Endpoints.
Click on Create Endpoint.
Name it sts-endpoint.
Select AWS services and search for com.amazonaws.us-east-1.sts.
Select Interface as the endpoint type.
Select our VPC.
Keep Enable DNS name selected, which enables private DNS.
To keep costs down, select only one Availability Zone and choose the private subnet in that zone.
Click on Create endpoints.
Follow the same steps for the endpoint below.
Lake Formation
Name - lf-endpoint
Service Name - com.amazonaws.us-east-1.lakeformation
Enabling private DNS for interface endpoints is essential because it ensures that requests to the standard regional AWS service DNS names (such as sts.us-east-1.amazonaws.com or lakeformation.us-east-1.amazonaws.com) automatically resolve to the private IP addresses of our VPC endpoints, rather than to the public service endpoints. This allows our applications and AWS resources to securely and transparently use private connectivity to AWS services, without requiring any code changes or manual DNS configuration, helping us maintain strict security boundaries and compliance within our VPC environment.
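The two interface endpoints from this step differ only in name and service, so they can be generated from a small table. This is a sketch with hypothetical helper names; it assumes a single private subnet, as chosen above for cost reasons, and omits security groups so that the VPC's default security group applies.

```python
# Endpoint name -> AWS service short name, as used in this step.
INTERFACE_ENDPOINTS = {
    "sts-endpoint": "sts",
    "lf-endpoint": "lakeformation",
}


def interface_endpoint_params(vpc_id: str,
                              subnet_id: str,
                              region: str = "us-east-1") -> list:
    """One create_vpc_endpoint request per interface endpoint."""
    return [
        {
            "VpcEndpointType": "Interface",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            "SubnetIds": [subnet_id],    # one ENI is created per subnet
            "PrivateDnsEnabled": True,   # requires DNS hostnames on the VPC
            "TagSpecifications": [
                {"ResourceType": "vpc-endpoint",
                 "Tags": [{"Key": "Name", "Value": name}]}
            ],
        }
        for name, service in INTERFACE_ENDPOINTS.items()
    ]


# Usage (assumption: boto3 installed, credentials configured):
#   for params in interface_endpoint_params(vpc_id, private_subnet_id):
#       ec2.create_vpc_endpoint(**params)
```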
All Set: Our VPC Is Ready for Secure Crawling
At this point, we’ve established a robust network foundation: our VPC, subnets, route tables, S3 Gateway Endpoint, and Interface Endpoints are all in place. This setup ensures that our environment is fully prepared to support private, secure operations for our data mesh workloads.
What’s Next?
With the groundwork complete, the next step is to isolate our AWS Glue Crawler within our VPC. This means configuring the crawler to operate exclusively in our private subnets, leveraging all the security controls and private connectivity we’ve just set up. By doing so, we’ll ensure that our data discovery and processing remain protected, compliant, and ready for future expansion.
Step 8. Set Up Required Permissions for the Crawler Role
For our AWS Glue Crawler to operate successfully inside a VPC, it must have an IAM role with all necessary permissions. These permissions allow the crawler to access data sources, manage network interfaces, and interact with AWS services securely and efficiently.
Navigate to the AWS IAM Console.
Under Access Management, Go to Roles.
Select AWSGlueServiceRole-datamesh.
Under Permissions tab, Drop down for Add permissions. Click Create inline policy.
Click on JSON and copy policy from this GitHub Link.
Click Next.
Name the policy PrivateConnectionAccessPolicy.
Click Create Policy.
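The linked policy is not reproduced in this post, so the following is only an illustrative guess at its general shape, not the author's actual policy. It assumes the crawler mainly needs EC2 permissions to create and inspect network interfaces in the VPC; the action list and helper names are assumptions and should be replaced with the contents of the linked policy.

```python
import json

# ASSUMPTION: an illustrative set of EC2 actions a VPC-attached crawler
# typically needs. This is NOT copied from the linked GitHub policy.
ENI_ACTIONS = [
    "ec2:CreateNetworkInterface",
    "ec2:DeleteNetworkInterface",
    "ec2:DescribeNetworkInterfaces",
    "ec2:DescribeSubnets",
    "ec2:DescribeSecurityGroups",
    "ec2:DescribeVpcEndpoints",
    "ec2:DescribeRouteTables",
    "ec2:DescribeVpcAttribute",
]


def private_connection_policy() -> str:
    """Return an IAM policy document as a JSON string."""
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ENI_ACTIONS, "Resource": "*"}
        ],
    }
    return json.dumps(document, indent=2)


def put_role_policy_params(role_name: str = "AWSGlueServiceRole-datamesh",
                           policy_name: str = "PrivateConnectionAccessPolicy") -> dict:
    """Keyword arguments for the IAM put_role_policy call (inline policy)."""
    return {
        "RoleName": role_name,
        "PolicyName": policy_name,
        "PolicyDocument": private_connection_policy(),
    }


# Usage (assumption: boto3 installed, credentials configured):
#   import boto3
#   boto3.client("iam").put_role_policy(**put_role_policy_params())
```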
Step 9. Establish Secure Links: Create a Network Connection on Private Subnet
A network connection on our private subnet ensures that Glue and other resources can securely access data and services within our VPC.
Navigate to the AWS Glue Console, go to Connections under Data Catalog in sidebar.
Click Create connection.
Search Network in the Data sources. Click Next.
Select our VPC, the private subnet in which the Interface Endpoints were created, and the default security group. Click Next.
Name it PrivateConnection. Click Next.
Click Create Connection.
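In API terms, this console flow creates a Glue connection of type NETWORK. The sketch below builds the input for the Glue `create_connection` call; the helper name is hypothetical, and the subnet, security group, and Availability Zone values must match the private subnet chosen above.

```python
def network_connection_input(subnet_id: str,
                             security_group_id: str,
                             availability_zone: str,
                             name: str = "PrivateConnection") -> dict:
    """Keyword arguments for the Glue create_connection call."""
    return {
        "ConnectionInput": {
            "Name": name,
            "ConnectionType": "NETWORK",
            # A NETWORK connection carries no credentials or URLs.
            "ConnectionProperties": {},
            "PhysicalConnectionRequirements": {
                "SubnetId": subnet_id,
                "SecurityGroupIdList": [security_group_id],
                "AvailabilityZone": availability_zone,
            },
        }
    }


# Usage (assumption: boto3 installed, credentials configured):
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")
#   glue.create_connection(
#       **network_connection_input(private_subnet_id, default_sg_id, "us-east-1a"))
```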
Step 10. Move the Crawler Behind the Walls: Update Existing Glue Crawler
Updating our Glue Crawler to run inside the private subnet ensures it operates securely within our protected network, leveraging all our new security enhancements.
Navigate to the AWS Glue Console.
Under Data Catalog, Go to Crawlers console.
Click on sf_safety_data_crawler. Click Edit in the top right corner.
In the wizard, let’s update settings step-by-step.
Proceed to Step 2: Choose data sources and classifiers and click Edit.
Select existing S3 data source and click Edit.
Select PrivateConnection under Network Connections.
Click on Update S3 data source.
Go to Step 5 (Review and update) from sidebar.
Click Update.
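The edit above amounts to setting a connection name on each S3 target of the crawler. The sketch below shows that transformation for the Glue `update_crawler` call. The helper names are hypothetical, and the S3 path in the example is a placeholder, since the real bucket path is not stated in this post.

```python
def attach_connection(s3_targets: list,
                      connection_name: str = "PrivateConnection") -> list:
    """Return copies of the S3 targets with ConnectionName set."""
    return [{**target, "ConnectionName": connection_name}
            for target in s3_targets]


def update_crawler_params(crawler_name: str, s3_targets: list) -> dict:
    """Keyword arguments for the Glue update_crawler call."""
    return {
        "Name": crawler_name,
        "Targets": {"S3Targets": attach_connection(s3_targets)},
    }


# Placeholder path: replace with the crawler's real S3 target.
example_targets = [{"Path": "s3://example-data-lake-bucket/example-prefix/"}]
params = update_crawler_params("sf_safety_data_crawler", example_targets)
print(params["Targets"]["S3Targets"][0]["ConnectionName"])  # PrivateConnection

# Usage (assumption: boto3 installed, credentials configured):
#   current = glue.get_crawler(Name="sf_safety_data_crawler")
#   targets = current["Crawler"]["Targets"]["S3Targets"]
#   glue.update_crawler(**update_crawler_params("sf_safety_data_crawler", targets))
```

Reading the existing targets first and reusing them keeps the path and any exclusions unchanged. If the crawler also had non-S3 targets, they would need to be passed along in `Targets` as well.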
Step 11. Testing our Setup by Adding Partitions and Running the Crawler
Testing our setup ensures that the Glue Crawler can successfully access the data from within our VPC, detect new partitions, and update the Data Catalog accordingly. This validation confirms that all networking, permissions, and configurations are correctly in place.
Upload new data partitions to our existing partitioned table path in S3 data lake bucket.
In the AWS Glue Console, navigate to Crawlers.
Run the crawler manually by selecting it and clicking the Run crawler button.
Monitor the crawler run status and logs to ensure it completes successfully without errors.
Verify that the new partitions are reflected in the Glue Data Catalog table.
By completing this step, we validated that our secure network setup and permissions are functioning as intended, and our data mesh environment is ready for production workloads.
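Monitoring the run can be scripted as a simple polling loop. In the sketch below, `fetch` is a hypothetical callable that returns the `Crawler` portion of a Glue `get_crawler` response; the loop waits until the crawler is back in the READY state and then reports whether the last crawl succeeded.

```python
import time


def wait_for_crawler(fetch, poll_seconds: int = 15, max_polls: int = 80,
                     sleep=time.sleep) -> bool:
    """Poll until the crawler is READY; return True if the last crawl succeeded."""
    for _ in range(max_polls):
        crawler = fetch()
        # A crawler moves through RUNNING and STOPPING before READY.
        if crawler.get("State") == "READY":
            return crawler.get("LastCrawl", {}).get("Status") == "SUCCEEDED"
        sleep(poll_seconds)
    raise TimeoutError("crawler did not return to READY in time")


# Usage (assumption: boto3 installed, credentials configured):
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")
#   glue.start_crawler(Name="sf_safety_data_crawler")
#   ok = wait_for_crawler(
#       lambda: glue.get_crawler(Name="sf_safety_data_crawler")["Crawler"])
#   print("crawl succeeded" if ok else "crawl failed, check the crawler logs")
```

A failed run in this setup often points to networking: for example, a missing endpoint or a route table not associated with the crawler's subnet.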
Summary
Through this blog, we have taken a significant leap in fortifying the network security and operational readiness of our AWS Data Mesh environment. Our efforts not only enhance the protection of sensitive data but also lay a strong foundation for scalable and compliant data operations in the cloud.
VPC Setup:
By establishing a carefully designed VPC with private networking, we’ve created a secure and compliant environment that forms the backbone of our data mesh. This foundational network layer not only isolates and protects our data resources but also enables seamless integration with essential AWS services, ensuring all operations remain within a trusted boundary.
Crawler Configuration:
Configuring the Glue Crawler to operate within this secure network ensures that data discovery and cataloging processes inherit the same level of protection and governance. This alignment between network architecture and data operations is key to maintaining data integrity, supporting regulatory requirements, and enabling our data mesh to scale confidently for future analytics and ETL workloads.
What's Next: ETL Workflows for the Modern Data Mesh
Building on the secure and scalable network foundation we established in this post, our next step is to bring data transformation workflows into the heart of our data mesh. In the upcoming part, we will design and implement an ETL pipeline that converts raw CSV files into the highly efficient Parquet format using AWS Glue, all within the protected boundaries of our VPC.
This transition is more than just a file format change; it’s a critical move for any modern data platform. By converting data to Parquet, we not only reduce storage costs but also dramatically improve query performance, making large-scale analytics faster and more cost-effective. Automating this transformation process ensures that as new data lands in our data lake, it is immediately optimized and ready for analysis, supporting both current and future business needs.
This next chapter will demonstrate how the secure infrastructure we’ve built enables robust, scalable ETL operations, and sets the stage for advanced analytics and data-driven innovation across our data mesh.
If you have questions, run into issues, or want to share your results, please leave a comment below. Your feedback and insights help us all build better data platforms together.
