24: AWS Glue Building Blocks 🧱

Hey fellow learners! 👋
This blog demonstrates how to work with AWS Glue through practical, real-world use cases. From setting up crawlers and managing schema changes to building ETL pipelines with transformations and handling sensitive data, each section focuses on solving common data engineering tasks using serverless tools.
Definitions
AWS Glue is a serverless data integration service that eliminates the need to provision and manage infrastructure by automatically scaling based on the workload.
The data integration capability of AWS Glue revolves around the following three services:
AWS Glue Crawler: It detects column names and data types, and thus infers the structure (schema) of the dataset.
AWS Glue Data Catalog: It simplifies managing and discovering datasets by acting as a central repository for metadata.
ETL Jobs: These facilitate seamless movement and transformation of data across systems, using a visual interface, scripts, or interactive notebooks.
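The console handles all of this visually, but it can help to see the moving parts programmatically. Below is a minimal boto3 sketch (not part of the walkthrough) that lists what the Data Catalog knows about; the ‘retaildb’ database name matches the one used later, while the region is an assumption.

```python
# Minimal sketch: inspect the Data Catalog with boto3.
# The region and database name are assumptions for illustration.
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

# List databases registered in the Data Catalog
for db in glue.get_databases()["DatabaseList"]:
    print("Database:", db["Name"])

# List tables (and the schema the crawler inferred) for one database
for table in glue.get_tables(DatabaseName="retaildb")["TableList"]:
    columns = [(c["Name"], c["Type"]) for c in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], columns)
```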
Crawler and Data Catalog
- Consider that we have the following file – ‘orders_data1.csv’.
- We upload it into the bucket ‘aws-glue-mehul-demo-01’ at the path ‘retaildb/orders/’.
- Inside AWS Glue, we create a new database ‘retaildb’ and provide the location of our ‘retaildb’ folder that we created inside the S3 bucket.
- We create a new crawler and provide the location of ‘orders’ folder inside our S3 bucket. We also configure the ‘AWSGlueServiceRole’ in the security settings, to allow AWS Glue to access the S3 data.
- Our new crawler ‘aws-glue-mehul-crawler-01’ is ready to run.
- The crawler crawls the data in the S3 bucket and reports one table change (a creation), giving us our first table in the Data Catalog.
- The ‘orders’ table gets created inside ‘retaildb’ database, with the expected four columns.
- Now, we upload a new file ‘orders_data2.csv’, which has an extra column ‘order_amount’.
- By uploading this new file at the same path, we want to observe the impact of schema evolution in Data Catalog.
- We run the crawler again, expecting a new version of the ‘orders’ table with the updated schema.
- The new version of the ‘orders’ table indeed has 5 columns, so the schema has evolved seamlessly.
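For reference, the same crawler setup can also be expressed with boto3. This is a hedged sketch rather than anything the console generates - the names and S3 path mirror the walkthrough, but the role ARN is a placeholder.

```python
# Hypothetical boto3 equivalent of the crawler setup above.
# The role ARN is a placeholder; names and paths mirror the walkthrough.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="aws-glue-mehul-crawler-01",
    Role="arn:aws:iam::123456789012:role/AWSGlueServiceRole",  # placeholder account ID
    DatabaseName="retaildb",
    Targets={"S3Targets": [{"Path": "s3://aws-glue-mehul-demo-01/retaildb/orders/"}]},
    # Update the table in place when the schema evolves (as seen with orders_data2.csv)
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
)

glue.start_crawler(Name="aws-glue-mehul-crawler-01")
```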
ETL Jobs
- Out of the three options to run ETL jobs, Visual ETL is the most intuitive one.
- In Visual ETL, we have three kinds of nodes - Sources, Transforms and Targets.
- For creating our first Visual ETL job, we select AWS Glue Data Catalog as the source node and name it ‘Source - Glue Data Catalog’. We select the database ‘retaildb’ and the table ‘orders’, which will serve as the source for our ETL job.
- Once we provide an IAM role with the necessary access policies, the data preview session starts and fetches the data.
- Next, from the Transforms node types, we select the Change Schema transformation node and rename it to ‘Transform - Change Schema’. We set our ‘Source - Glue Data Catalog’ node as its parent, so the two get connected as shown.
- The Change Schema transformation applies a mapping in order to change the schema as required.
- Now, from the Targets node types, we select Amazon S3 and rename it to ‘Target - Amazon S3’. We configure its parent node to be our ‘Transform - Change Schema’ node. We select Parquet as the format, so the data gets written in Parquet format at the S3 target location ‘s3://aws-glue-mehul-demo-01/retaildb/output/’.
- We can rename the ETL job to ‘csv-to-parquet’, provide the necessary IAM role and save it.
- Now, we run the job and see that we don’t encounter any errors.
- We can confirm that the data gets ingested inside the S3 path that we provided earlier, hence signifying that our Visual ETL job has run successfully.
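Behind the Visual ETL editor, Glue Studio generates a PySpark script. The sketch below is a rough, hand-written equivalent of the ‘csv-to-parquet’ job, not the generated code itself; the column names and types in the mapping are illustrative, so check them against your catalog table.

```python
# Rough script-mode equivalent of the 'csv-to-parquet' Visual ETL job.
# Column names/types in the mapping are illustrative assumptions.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the 'orders' table registered by the crawler in the Data Catalog
orders = glueContext.create_dynamic_frame.from_catalog(
    database="retaildb", table_name="orders"
)

# Transform: a Change Schema style mapping (illustrative column names)
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "long", "order_id", "long"),
        ("order_date", "string", "order_date", "string"),
        ("order_customer_id", "long", "order_customer_id", "long"),
        ("order_status", "string", "order_status", "string"),
    ],
)

# Target: write Parquet to the output prefix used in the walkthrough
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://aws-glue-mehul-demo-01/retaildb/output/"},
    format="parquet",
)
job.commit()
```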
Transformations
- There are various useful transformations available within Visual ETL.
Join Transformation
Consider the following scenario.
- At the path ‘s3://aws-glue-mehul-demo-01/retaildb/customers/’, we have the file ‘customers.csv’ that contains the following data.
- At the path ‘s3://aws-glue-mehul-demo-01/retaildb/orders/’, we have the file ‘orders.csv’ that contains the following data.
- We want to join these two datasets and load the results in Parquet format into S3.
- We create two data sources - ‘Orders Source - Amazon S3’, which contains the S3 path of the ‘orders’ folder, and ‘Customers Source - Amazon S3’, which contains the S3 path of the ‘customers’ folder.
- We add a Join transformation node and provide both of our source nodes as its parents. The editor offers a quick fix to resolve overlapping field names between the two sources.
- We click on ‘Resolve it’, and a Change Schema transformation gets applied which renames the columns in the customers dataset by prefixing them with ‘right_’. For instance, ‘customer_id’ becomes ‘right_customer_id’ to avoid conflicts.
- We select the Inner Join and configure the join condition to merge the datasets on ‘customer_id’ and ‘right_customer_id’.
- In the data preview session, we can see the results of the join even before running the job.
- In order to drop the redundant columns and get the desired field names, we can add one more Change Schema transformation.
- We set the target as Amazon S3 and provide the final output S3 path as ‘s3://aws-glue-mehul-demo-01/retaildb/output/’.
- Now, we save and run the job.
- We can see the required files in the ‘output’ folder.
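The same pipeline can be sketched in script form using Glue's Join transform. The paths and the ‘customer_id’ key follow the walkthrough, but the rest (format options, setup) is an assumption-laden outline rather than the generated job script.

```python
# Sketch of the join pipeline in script form; paths and the join key
# follow the walkthrough, other details are assumptions.
from awsglue.transforms import Join
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

orders = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://aws-glue-mehul-demo-01/retaildb/orders/"]},
    format="csv",
    format_options={"withHeader": True},
)
customers = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://aws-glue-mehul-demo-01/retaildb/customers/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Inner join on customer_id; redundant columns can afterwards be
# dropped with drop_fields(), similar to the extra Change Schema node above
joined = Join.apply(orders, customers, "customer_id", "customer_id")

glueContext.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://aws-glue-mehul-demo-01/retaildb/output/"},
    format="parquet",
)
```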
SQL Query Transformation
- Consider that we select the SQL Query transformation node and set ‘Orders - Amazon S3’ and ‘Customers - Amazon S3’ as its parents.
- On running a SQL query and previewing the data, we can see the desired results.
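Continuing from the join sketch above, the SQL Query transform can be approximated by registering the two frames as temporary views and running Spark SQL. The query and column names here are illustrative, not the exact ones used in the walkthrough.

```python
# Approximation of the SQL Query transform: register temp views and run Spark SQL.
# 'orders' and 'customers' are the DynamicFrames from the join sketch above;
# the selected column names are illustrative assumptions.
orders_df = orders.toDF()
customers_df = customers.toDF()
orders_df.createOrReplaceTempView("orders")
customers_df.createOrReplaceTempView("customers")

spark = glueContext.spark_session
result = spark.sql("""
    SELECT o.order_id, o.order_date, o.order_status, c.customer_fname
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""")
result.show(5)
```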
Aggregate Transformation
- In the ‘orders’ dataset, we provide the field on which grouping should happen (order_status), the field to aggregate (again order_status), and the aggregation function (count).
- Data preview shows us the results - the count of each ‘order_status’.
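In script form, the same aggregation is a one-liner on the DataFrame from the SQL sketch above; only the ‘order_status’ column name comes from the walkthrough.

```python
# Count of orders per status, mirroring the Aggregate transform.
# orders_df comes from the SQL Query sketch above.
status_counts = orders_df.groupBy("order_status").count()
status_counts.show()
```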
Filter Transformation
- We select the Global AND function and provide the filter condition that ‘order_status’ should match the value ‘CLOSED’.
- We get the following result on previewing the data - only the records where ‘order_status’ is ‘CLOSED’.
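Glue's Filter transform offers the scripted equivalent; the sketch below reuses the ‘orders’ DynamicFrame from the join example and keeps only CLOSED orders.

```python
# Scripted equivalent of the Filter transform: keep only CLOSED orders.
# 'orders' is the DynamicFrame read in the join sketch above.
from awsglue.transforms import Filter

closed_orders = Filter.apply(
    frame=orders,
    f=lambda row: row["order_status"] == "CLOSED",
)
print(closed_orders.count())  # number of CLOSED orders
```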
Handling Sensitive Data
Sometimes, a dataset may contain personal information such as credit card numbers, phone numbers, or email addresses, which needs to be properly protected.
In our use case, we have a bucket ‘aws-glue-sensitive-data-mehul-01’, in which we have two folders - ‘input’ and ‘output’.
- In the ‘input’ folder, we upload the file ‘candidate_details.csv’, which quite evidently contains sensitive information.
- We create a new Visual ETL job and add an ‘Input - Amazon S3’ data source, where we give the path to the ‘candidate_details.csv’ file. Then we add the Detect Sensitive Data transformation, with the S3 source node as its parent. Note that we will look for sensitive data in each row, based on a specific pattern.
- We select Email Address as the pattern, perform the Redact operation on the detected entities, and replace them with ‘*******’. Afterwards, we store the result in the ‘output’ folder in S3.
- After running the job, we can query the data inside the ‘output’ folder using Athena and see the redacted values in the Email column. Note that only the values matching the selected pattern get redacted.
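The Detect Sensitive Data transform is configured entirely in the visual editor. As a rough, hand-rolled stand-in (not the managed transform), the same email redaction idea can be expressed with a regex in plain PySpark; the ‘email’ column name and the regex pattern are assumptions.

```python
# Hand-rolled stand-in for the redaction idea (NOT the managed
# Detect Sensitive Data transform). Column name 'email' is an assumption.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv(
    "s3://aws-glue-sensitive-data-mehul-01/input/", header=True
)

# Replace anything that looks like an email address with the mask used above
redacted = df.withColumn(
    "email",
    regexp_replace("email", r"[\w.+-]+@[\w-]+\.[\w.]+", "*******"),
)

redacted.write.mode("overwrite").csv(
    "s3://aws-glue-sensitive-data-mehul-01/output/", header=True
)
```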
Evaluating Data Quality
AWS Glue provides us with two ways of evaluating data quality:
- Data Quality at Rest (rules evaluated against a table in the Data Catalog)
- Data Quality in Transit (rules evaluated within an ETL job)
In this blog, we will cover evaluating data quality at rest.
- Consider the following ‘customers_data.csv’ file.
- We have a bucket ‘aws-glue-data-quality-01’, in which there are two folders - ‘customers’ and ‘dq-results’. We upload the ‘customers_data.csv’ file into the ‘customers’ folder.
- In order to catalog this data, we create a new crawler that writes to the ‘retaildb’ database, giving it the S3 path of the ‘customers’ folder we just created.
- After running the crawler, the ‘customers’ table gets created successfully with 8 columns.
- Now, we create Data Quality rules from recommendations, as follows.
- We get the following 27 recommended rules.
- Out of these 27 recommended rules, we keep the following ones and save the ruleset as ‘customers_ruleset’.
- Now, we run the ruleset on our data, providing the path of the ‘dq-output’ folder as the result location.
- We can see that our data passed 14 out of 15 rules, and we can make necessary modifications accordingly.
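The same ‘data quality at rest’ flow can also be driven through boto3: register a DQDL ruleset against the catalog table and start an evaluation run. The two rules shown and the role ARN below are illustrative placeholders, not the actual rules kept in the walkthrough.

```python
# Hedged boto3 sketch of data quality at rest: create a DQDL ruleset
# on the catalog table and start an evaluation run.
# The rules and role ARN are placeholders for illustration.
import boto3

glue = boto3.client("glue")

ruleset = """Rules = [
    IsComplete "customer_id",
    IsUnique "customer_id"
]"""

glue.create_data_quality_ruleset(
    Name="customers_ruleset",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "retaildb", "TableName": "customers"},
)

glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "retaildb", "TableName": "customers"}},
    Role="arn:aws:iam::123456789012:role/AWSGlueServiceRole",  # placeholder
    RulesetNames=["customers_ruleset"],
    AdditionalRunOptions={"ResultsS3Prefix": "s3://aws-glue-data-quality-01/dq-results/"},
)
```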
Conclusion
This guide shows how AWS Glue can simplify complex data workflows—from ingesting and transforming data to ensuring quality and security. With step-by-step examples, you now have a clear path to designing efficient, scalable ETL pipelines using AWS Glue features.
Stay tuned for more!