Building a Modern Weather Data Pipeline on AWS (S3 + Glue + Redshift + MWAA)

Rohan

I built a weather data pipeline on AWS to automate the collection and analysis of weather forecasts from the OpenWeather API. Raw data lands in S3, Glue transforms it, Airflow (MWAA) orchestrates the whole flow, and the curated data ends up in Redshift for analytics. The setup is cloud-native, scalable, secure, and mirrors real enterprise standards.


Architecture Diagram

Here's a visual of the pipeline showing how S3, Glue, Redshift, Airflow, IAM, VPC, and networking come together:


How Data Moves

  1. Airflow (MWAA) fetches raw weather data from the OpenWeather API and saves it to S3 (raw/).

  2. Airflow triggers a Glue job, which processes and transforms the data, storing outputs in S3 (processed/).

  3. Glue then loads this curated dataset into Redshift, where analytics queries can run.
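
To make the orchestration concrete, here is a minimal sketch of what the MWAA DAG can look like for these three steps, assuming a daily schedule. The bucket name, Glue job names, city, and connection defaults are placeholders for illustration; the actual DAG and configuration live in the GitHub repo.

```python
# Minimal sketch of the MWAA DAG: fetch -> transform (Glue) -> load (Glue -> Redshift).
# Bucket, job, and city names below are placeholders, not the repo's exact values.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

BUCKET = "weather-data-lake"  # hypothetical bucket name


def fetch_weather(**context):
    """Pull the forecast from OpenWeather and drop the raw JSON into s3://<bucket>/raw/."""
    resp = requests.get(
        "https://api.openweathermap.org/data/2.5/forecast",
        params={"q": "Toronto", "appid": "<OPENWEATHER_API_KEY>"},
        timeout=30,
    )
    resp.raise_for_status()
    S3Hook().load_string(
        resp.text,
        key=f"raw/weather_{context['ds']}.json",
        bucket_name=BUCKET,
        replace=True,
    )


with DAG(
    dag_id="weather_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Step 1: pull raw JSON from the API into S3 (raw/).
    fetch = PythonOperator(task_id="fetch_weather", python_callable=fetch_weather)

    # Step 2: Glue transforms raw/ JSON into curated output under processed/.
    transform = GlueJobOperator(
        task_id="glue_transform",
        job_name="weather_transform_job",      # hypothetical Glue job name
        wait_for_completion=True,
    )

    # Step 3: a second Glue job loads the curated dataset into Redshift.
    load = GlueJobOperator(
        task_id="glue_load_to_redshift",
        job_name="weather_load_redshift_job",  # hypothetical Glue job name
        wait_for_completion=True,
    )

    fetch >> transform >> load
```

Keeping the untouched API payload in raw/ means the Glue transform can be re-run later without calling the API again.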


Real AWS Billing Lessons

Controlling costs is vital with cloud pipelines! Below are screenshots from my AWS console showing:

  • Billing breakdown by service: VPC, Redshift, Glue, S3, etc.

  • Total monthly spend, which exceeded $30 due to setup delays and an idle Redshift cluster.

A key lesson: I had already launched my Redshift cluster before creating the MWAA environment. MWAA setup can take a long time, and I ran into errors multiple times, so the Redshift cluster kept running (and billing) throughout. In my case, that idle time added $15–$20 or more in charges and pushed the total bill over $30.

Best Practice:
If possible, avoid launching the Redshift cluster until you know your MWAA environment is ready and working. The pipeline only needs Redshift at the end, when Glue writes the processed data, so you can minimize costs by pausing or not starting Redshift until that final step. This simple adjustment can significantly reduce AWS bills, especially during repeated setup or troubleshooting.
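
One way to act on this, assuming a provisioned RA3 cluster, is to pause it programmatically while the MWAA environment is being created and resume it only shortly before the load step runs. A minimal boto3 sketch; the cluster identifier and region are placeholders:

```python
# Minimal sketch: pause/resume a provisioned Redshift cluster with boto3 so it
# isn't billed for compute while MWAA setup drags on. The cluster identifier and
# region are placeholders; pause/resume applies to RA3 provisioned clusters.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")
CLUSTER_ID = "weather-warehouse"  # hypothetical cluster identifier


def pause_cluster():
    """Stop compute billing while the MWAA environment is still being created."""
    redshift.pause_cluster(ClusterIdentifier=CLUSTER_ID)


def resume_cluster(wait: bool = True):
    """Bring the cluster back shortly before the Glue load step runs."""
    redshift.resume_cluster(ClusterIdentifier=CLUSTER_ID)
    if wait:
        # Poll describe_clusters until the cluster reports 'available' again.
        redshift.get_waiter("cluster_available").wait(ClusterIdentifier=CLUSTER_ID)


if __name__ == "__main__":
    pause_cluster()  # e.g. right after kicking off MWAA environment creation
```

Redshift Serverless sidesteps the problem differently, since compute is only billed while the warehouse is actually processing work.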


Solution Approach

  • Airflow provides orchestration with retries and alerting.

  • Glue handles serverless transformation that scales with the data volume (a minimal PySpark sketch follows this list).

  • Redshift stores analytic-ready datasets.

  • Secure VPC, private subnets, and least-privilege IAM ensure privacy and compliance.

  • Managed services mean less operational headache and more reliability.
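
To ground the Glue point, here is a minimal PySpark sketch of the transformation step: it flattens the raw OpenWeather forecast JSON under raw/ and writes date-partitioned Parquet to processed/. The bucket argument and column selection are assumptions for illustration, not the exact script in the repo.

```python
# Minimal Glue PySpark sketch: read raw OpenWeather JSON from s3://<bucket>/raw/,
# flatten the forecast list, and write Parquet to s3://<bucket>/processed/.
# Paths and column names are assumptions, not the repo's exact script.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME", "bucket"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

raw_path = f"s3://{args['bucket']}/raw/"
processed_path = f"s3://{args['bucket']}/processed/"

# Each raw file holds one API response; 'list' contains the 3-hourly forecasts.
raw = spark.read.option("multiLine", "true").json(raw_path)
forecasts = (
    raw.select(F.col("city.name").alias("city"), F.explode("list").alias("f"))
       .select(
           "city",
           F.to_timestamp(F.col("f.dt_txt")).alias("forecast_ts"),
           (F.col("f.main.temp") - 273.15).alias("temp_c"),  # Kelvin -> Celsius
           F.col("f.main.humidity").alias("humidity"),
           F.col("f.wind.speed").alias("wind_speed"),
       )
)

# Partition by date so downstream loads and queries can prune efficiently.
(forecasts
    .withColumn("forecast_date", F.to_date("forecast_ts"))
    .write.mode("overwrite")
    .partitionBy("forecast_date")
    .parquet(processed_path))

job.commit()
```

Writing Parquet keeps the processed layer compact and columnar, which makes the final load into Redshift cheap and fast.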


Cost Optimization Tips

  • Route S3 traffic through a Gateway Endpoint to avoid NAT Gateway data-processing charges (see the sketch after this list).

  • Use Redshift Serverless or pause clusters during setup downtime.

  • Clean up/stop resources when not needed.

  • Sequence setup tasks so dependent services (like Redshift) aren't left running idle.
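
As an example of the first tip, an S3 Gateway Endpoint can be attached to the VPC's route tables with a few lines of boto3; gateway endpoints carry no hourly or per-GB charge, unlike a NAT gateway. The IDs and region below are placeholders:

```python
# Minimal sketch: add an S3 Gateway Endpoint so Glue/MWAA traffic to S3 stays
# off the NAT gateway. VPC ID, route table ID, and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # hypothetical VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",  # region-specific S3 service name
    RouteTableIds=["rtb-0123456789abcdef0"],   # private subnet route table(s)
)
print("Created endpoint:", response["VpcEndpoint"]["VpcEndpointId"])
```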



Source & Setup Guide

Find the code, setup steps, and documentation in my GitHub repo:
aws-weather-data-warehouse
All required steps are explained in the README for easy implementation.


Closing Note

This project shows how to build a production-style, scalable weather data pipeline using AWS best practices—with practical insights on cost management. Plan your resource lifecycle carefully, especially during setup, to keep expenses under control while achieving enterprise-grade performance.
