How I Built a Stock Data Pipeline


Have you ever looked at stock charts and wondered what goes on behind the scenes to make them work? Maybe not, but I did, especially after starting crypto trading late last year. That curiosity led me, a few weeks back, to start a personal data engineering project: an automated stock data pipeline.
First, I pulled raw market data and cleaned it up with Python and Pandas. After that, I stored it safely in the cloud and used Apache Airflow to automate the process. To wrap it all up, I built some interactive dashboards with Grafana so I could easily track what’s happening.
I wanted to build something meaningful, something close to what professionals use in real-world environments, with tools widely adopted across the industry. So, if you’re curious about data engineering or just love a good “how I built this” story, you’re in the right place. I’ll walk you through the entire journey: the tools, the wins, what went wrong, and how I fixed it all.
Below is the architecture diagram of the entire project, showing the tools used and how they fit together:
Setup and Installation
To get this project up and running, here are some of the installation steps I followed:
Clone the repository:
git clone https://github.com/yourusername/stock-data-pipeline.git
cd stock-data-pipeline
Create and activate a virtual environment:
python3 -m venv venv_project
source venv_project/bin/activate
Install the Python dependencies:
pip install -r requirements.txt
The main dependencies include:
- requests for API calls
- python-dotenv to manage environment variables
- pandas for data manipulation
- apache-airflow for workflow orchestration
- psycopg2-binary for PostgreSQL database connection
Set up your environment variables:
Create a .env file in the root directory with your Alpha Vantage API key.
Configure PostgreSQL database:
Use AWS RDS or a local PostgreSQL instance. Update your database connection details in the .env configuration files used by the scripts and Airflow.
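For reference, here is a minimal sketch of how the scripts can pick up those settings; the variable names (ALPHA_VANTAGE_API_KEY, DB_HOST, and so on) are illustrative placeholders, so match them to whatever your own .env actually uses:

```python
# Minimal sketch: load settings from .env and open a PostgreSQL connection.
# The environment variable names here are illustrative placeholders.
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

API_KEY = os.getenv("ALPHA_VANTAGE_API_KEY")

conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),            # e.g. your AWS RDS endpoint
    port=os.getenv("DB_PORT", "5432"),
    dbname=os.getenv("DB_NAME", "stock_data_db"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)
```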
Start Apache Airflow
Initialize the database, then start the web server and scheduler with the following commands:
airflow db init
airflow webserver -p 8080
airflow scheduler
OR
Use the standalone command to do it all at once:
airflow standalone
Run the DAG:
Access the Airflow UI at http://localhost:8080, enable the DAG named stock_data_pipeline_dag, and trigger it to start the pipeline.
With the installations complete, let's proceed to the project overview.
Phase 1: Data Sourcing - Getting Reliable Market Data
Every data pipeline begins with a source. For this project, I needed reliable daily stock market data. After exploring various options, I chose the Alpha Vantage API. It offers a generous free tier, provides comprehensive historical and daily time-series data (open, high, low, close prices, and volume), and is relatively simple to integrate.
The Tooling: My primary tool for this phase was Python, leveraging the requests library to handle the HTTP GET requests to the Alpha Vantage API. The core logic for fetching and initial handling of the data resides in my Fetch_stock_data.py script.
The Process: The script is designed to be flexible, accepting a stock symbol (e.g., 'IBM', 'AAPL', 'MSFT') as a parameter. This makes the pipeline easy to configure for tracking different stocks. The API returns data in JSON format, which the script then parses.
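The full script is on GitHub, but a minimal sketch of the fetch step could look like this (assuming the standard TIME_SERIES_DAILY endpoint and the illustrative environment variable name from earlier):

```python
# Sketch of the fetch step, not the full Fetch_stock_data.py script.
import os

import pandas as pd
import requests
from dotenv import load_dotenv

load_dotenv()

def fetch_daily(symbol: str, output_size: str = "compact") -> pd.DataFrame:
    """Fetch daily OHLCV data for one symbol from Alpha Vantage."""
    response = requests.get(
        "https://www.alphavantage.co/query",
        params={
            "function": "TIME_SERIES_DAILY",
            "symbol": symbol,
            "outputsize": output_size,  # "compact" = recent data, "full" = full history
            "apikey": os.getenv("ALPHA_VANTAGE_API_KEY"),
        },
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()["Time Series (Daily)"]
    # Rows are keyed by date, with fields like "1. open", "4. close", "5. volume".
    return pd.DataFrame.from_dict(payload, orient="index").sort_index()

print(fetch_daily("AAPL").head())
```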
Security First: API keys are critical credentials. To manage my Alpha Vantage API key securely, I used a .env file to store it. The python-dotenv library loads this key into the script's environment at runtime, ensuring the key isn't hardcoded or accidentally committed to version control.
What went wrong?
At first, when I tried to get data for many stocks or the full history, I hit Alpha Vantage’s limits on how often I could ask for data. The API only lets you make a certain number of requests in a set time, and I was making too many too fast.
The Fix
To fix this, I made sure to follow the API’s rules more carefully. For this project, pulling data for just a few stock symbols with the smaller “compact” dataset was enough. But for bigger projects, I’d need to add error checks when the API limits are hit, add delays between requests, or schedule the data fetching better with Airflow. This experience taught me the importance of designing systems that can handle limits and keep running smoothly.
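I didn't need it for these small, compact pulls, but a simple guard against the rate limit could look something like the sketch below; the retry count and delay are arbitrary choices, not Alpha Vantage's documented limits.

```python
# Illustrative retry/delay guard (not part of my pipeline); the attempt count
# and sleep interval are arbitrary placeholders.
import time

def fetch_daily_with_backoff(symbol, max_attempts=3, wait_seconds=60):
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_daily(symbol)  # fetch_daily from the sketch above
        except KeyError:
            # A throttled Alpha Vantage response typically omits the
            # "Time Series (Daily)" key and returns a note instead.
            if attempt == max_attempts:
                raise
            print(f"Rate limit hit for {symbol}, retrying in {wait_seconds}s...")
            time.sleep(wait_seconds)

# Spacing out requests for several symbols also helps stay under the cap.
for sym in ["IBM", "AAPL", "MSFT"]:
    df = fetch_daily_with_backoff(sym)
    time.sleep(15)
```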
Here's a look at the Fetch_stock_data.py script in action, successfully retrieving data for AAPL and displaying the initial DataFrame structure:
Phase 2: Data Transformation – Making Raw Data Useful
Once I had stock data loading into the daily_stock_data table, I ran some quick checks using my local script fetch_and_analyze_stock_data.py. Simple SQL queries helped me confirm the data was clean and usable. I was also able to view average closing prices, daily highs/lows, etc.
After that, I realized that if I wanted meaningful, consistent insights (and later, good-looking Grafana dashboards), I couldn't rely on manual SQL every time. So I built a dedicated transformation script: clean_and_transform_data.py
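The exact queries in fetch_and_analyze_stock_data.py aren't reproduced here, but the quick checks looked roughly like this; the high_price and low_price column names are my guesses based on the schema described later:

```python
# Rough idea of the sanity checks; the actual script's queries differ.
import pandas as pd

# conn is the psycopg2 connection from the earlier sketch
checks = pd.read_sql(
    """
    SELECT symbol,
           COUNT(*)         AS trading_days,
           AVG(close_price) AS avg_close,
           MAX(high_price)  AS highest_high,
           MIN(low_price)   AS lowest_low
    FROM daily_stock_data
    GROUP BY symbol;
    """,
    conn,
)
print(checks)
```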
Key Transformations Implemented in clean_and_transform_data.py:
20-day Moving Average (MA): A common technical indicator that smooths out short-term price fluctuations and highlights longer-term trends (the overall direction of a stock's price). It's calculated by averaging the closing prices over the past 20 trading days. In Pandas, this is done with df['close_price'].rolling(window=20).mean(), which computes the average over a sliding 20-day window.
Daily Percentage Change: This metric shows the day-over-day volatility and return of a stock, calculated as (current_day_close - previous_day_close) / previous_day_close * 100.
The Output: The clean_and_transform_data.py script prepares a refined dataset, combining the original price and volume data with these additional insights. This clean and structured data is then ready for automated processing, ensuring that every run of the pipeline delivers up-to-date, insightful information.
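Putting the two calculations together, the core of the transformation looks roughly like this. It's a sketch, not the full clean_and_transform_data.py: the real script also reads from and writes back to PostgreSQL, and may process one symbol at a time rather than grouping by symbol as I do here.

```python
# Sketch of the core transformations described above.
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Add the 20-day moving average and daily % change for each symbol."""
    df = df.sort_values(["symbol", "trade_date"]).copy()
    closes = df.groupby("symbol")["close_price"]
    df["ma_20_day"] = closes.transform(lambda s: s.rolling(window=20).mean())
    df["daily_pct_change"] = closes.transform(lambda s: s.pct_change() * 100)
    return df
```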
Phase 3: Data Storage & Management - A Reliable Cloud Foundation
After transforming the data, the next critical step was storing it in a system that allows fast, reliable access and easy integration with visualization tools.
A reliable and easily accessible storage system is the backbone of any data pipeline and ensures that the entire pipeline remains stable and scalable as more data flows in. Here's a breakdown of my storage management:
Database Choice: I selected PostgreSQL for this because it’s a powerful, open-source object-relational database system and is known for its reliability and rich feature set.
Cloud Hosting with AWS RDS: To simulate a production-like environment and leverage the benefits of a managed service (like automated backups, patching, and scalability), I hosted my PostgreSQL instance on AWS RDS (Relational Database Service). This significantly reduces the operational overhead compared to self-hosting a database.
Database Management with pgAdmin: Throughout development, I used pgAdmin to manage the PostgreSQL database. It handled everything from schema design and table creation to manual data inspection and query testing. The tool provided a clear and efficient way to interact with the database throughout the project.
Schema Design: Using pgAdmin, I designed and implemented the following table structures within my stock_data_db database on RDS:
daily_stock_data: This table stores the raw daily values as fetched by Fetch_stock_data.py. It includes columns for symbol, trade_date, open, high, low, close prices, and volume, with a unique constraint on symbol and trade_date to prevent duplicates. Below is an image of the daily_stock_data table in pgAdmin:
transformed_stock_data: This table is updated by clean_and_transform_data.py and holds the enriched data, including all fields from daily_stock_data plus the calculated ma_20_day and daily_pct_change. It also has a unique constraint on symbol and trade_date. Below is an image of the transformed_stock_data table in pgAdmin:
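Based on that description, the table definitions would be close to the sketch below; the exact column names and types in my database may differ slightly, so treat this as an approximation rather than the real DDL.

```python
# Approximate table definitions inferred from the schema description above.
DDL = """
CREATE TABLE IF NOT EXISTS daily_stock_data (
    symbol      VARCHAR(10) NOT NULL,
    trade_date  DATE        NOT NULL,
    open_price  NUMERIC,
    high_price  NUMERIC,
    low_price   NUMERIC,
    close_price NUMERIC,
    volume      BIGINT,
    UNIQUE (symbol, trade_date)
);

CREATE TABLE IF NOT EXISTS transformed_stock_data (
    symbol           VARCHAR(10) NOT NULL,
    trade_date       DATE        NOT NULL,
    open_price       NUMERIC,
    high_price       NUMERIC,
    low_price        NUMERIC,
    close_price      NUMERIC,
    volume           BIGINT,
    ma_20_day        NUMERIC,
    daily_pct_change NUMERIC,
    UNIQUE (symbol, trade_date)
);
"""

with conn.cursor() as cur:  # conn from the earlier connection sketch
    cur.execute(DDL)
conn.commit()
```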
Phase 4: Orchestration - Automating with Apache Airflow
Having built the individual components for data ingestion, processing, and storage, the next step was to bring them all together under an automated workflow. Instead of running scripts manually, I implemented Apache Airflow to schedule, track, and manage the pipeline end-to-end. The breakdown is as follows:
Apache Airflow for orchestration: I chose Airflow because it's well-suited for managing complex data workflows. Its use of Directed Acyclic Graphs (DAGs) offers a clean way to define task dependencies and execution order.
The DAG Definition:
I created a DAG, defined in stock_data_pipeline_dag.py, to represent the full pipeline. Some of its tasks are as follows:
fetch_stock_data_task: This uses Airflow's BashOperator to run Fetch_stock_data.py. It is parameterized to accept a stock symbol.
transform_and_load_task: This task runs after the fetch task completes successfully. It executes the clean_and_transform_data.py script, which reads the newly fetched raw data, transforms it, and loads the results into the transformed_stock_data table.
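A stripped-down version of that DAG might look like the sketch below; the script paths, schedule, and default parameter values are placeholders rather than the exact contents of stock_data_pipeline_dag.py.

```python
# Stripped-down sketch of the DAG; paths and defaults are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="stock_data_pipeline_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    params={"stock_symbol": "IBM", "output_size": "compact"},
) as dag:

    fetch_stock_data_task = BashOperator(
        task_id="fetch_stock_data_task",
        bash_command=(
            "python /path/to/Fetch_stock_data.py "
            "{{ params.stock_symbol }} {{ params.output_size }}"
        ),
    )

    transform_and_load_task = BashOperator(
        task_id="transform_and_load_task",
        bash_command=(
            "python /path/to/clean_and_transform_data.py {{ params.stock_symbol }}"
        ),
    )

    # The transform task only runs once the fetch task has succeeded.
    fetch_stock_data_task >> transform_and_load_task
```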
Below is an image of the stock_data_pipeline's DAG:
Key Features of the Airflow Setup:
Flexible Configuration: The DAG is designed to be triggered with configurable parameters like the stock_symbol and output_size (compact or full history for Alpha Vantage). This makes the pipeline highly flexible for fetching data for different stocks without modifying the DAG code.
Task Dependencies: Airflow ensures that the transformation task only runs after the corresponding fetch task for a given symbol has succeeded. It enforces execution order.
Retries & Logging: Airflow's built-in retry mechanisms help handle temporary failures (like API downtime). Its extensive logging capabilities are invaluable for debugging and monitoring pipeline health.
Web UI: The Airflow web interface provides a great overview of DAG runs, task statuses, logs, and allows for manual triggering and management.
What went wrong?
This was a classic “works on my machine” moment until automation revealed the missing pieces. It all started when my clean_and_transform_data.py script, which heavily relies on Pandas, kept failing when triggered by Airflow, even though it ran perfectly from my terminal inside venv_project.
The culprit? Airflow's BashOperator was pointing to a different Python environment, one that didn't have Pandas installed.
The Fix
I resolved it by installing Pandas in the environment Airflow was using to run the script. Once the correct dependencies were available, the task executed without errors.
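Another way to avoid that mismatch is to point the BashOperator at the virtual environment's interpreter explicitly, along these lines (the paths are placeholders, and this slots into the DAG sketch from earlier):

```python
from airflow.operators.bash import BashOperator

# Hypothetical variation: call the venv's Python explicitly so the task always
# sees the same packages as local runs. The paths are placeholders.
transform_and_load_task = BashOperator(
    task_id="transform_and_load_task",
    bash_command=(
        "/path/to/venv_project/bin/python "
        "/path/to/clean_and_transform_data.py {{ params.stock_symbol }}"
    ),
)
```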
Phase 5: Visualization - Bringing Data to Life with Grafana
The final step in this data pipeline was making the insights easily accessible and understandable. For this, I chose to build an interactive dashboard that would bring the stock data to life.
The Visualization Tool: I selected Grafana for its strength in time-series visualizations, open-source flexibility, and excellent compatibility with PostgreSQL. Its dynamic dashboards and rich querying capabilities made it ideal for this use case.
Connecting to Data: I added my PostgreSQL database on AWS RDS as a Grafana data source, so the dashboard panels could query the transformed_stock_data table directly.
Building the Dashboard ("Multi-Stock Price & MA Dashboard"):
The dashboard was created by setting up multiple panels and writing SQL queries within Grafana’s panel editor to retrieve and visualize the data.
For example, to display the closing price and 20-day moving average for IBM, a typical query selected trade_date (aliased as time), close_price, and ma_20_day from transformed_stock_data where symbol = 'IBM'.
Each panel used the Time series visualization type, enabling a clear display of stock price movements and trends over time.
Currently, the dashboard includes dedicated panels for IBM and AAPL, each visualizing their respective closing prices alongside the 20-day moving average. This setup allows for easy visual comparison across different stocks.
What went wrong?
1. The Missing Table
I encountered an error in Grafana:
“Error: transformed_stock_data doesn't exist.”
Despite confirming the table existed in my database, Grafana couldn't find it. After some investigation, I discovered that the data source was mistakenly configured to connect to the default postgres database on AWS RDS, not my intended stock_data_db.
The Fix:
I updated the Grafana data source configuration to point to the correct database (stock_data_db). Once connected to the right database, the error disappeared, and the panels loaded as expected.
2. The Comma Culprit
While trying to visualize both close_price and ma_20_day on the same chart, the panel rendered nothing. The query inspector flagged a syntax error, which turned out to be caused by a missing comma between the columns in the SELECT clause.
The Fix:
I added the missing comma in the SQL query:
SELECT trade_date AS time, close_price, ma_20_day FROM transformed_stock_data WHERE symbol = 'IBM';
Final Thoughts
Building this stock data pipeline taught me far more than I expected, from managing API limitations and orchestrating tasks with Airflow to debugging environment issues and designing clear visual dashboards in Grafana. Each phase brought its challenges, but also valuable lessons that deepened my understanding of real-world data engineering workflows.
If you’re someone looking to break into data engineering, or just curious about how raw data becomes usable insight, I hope this walkthrough gives you a solid foundation and inspiration to start your own project. There’s nothing more rewarding than seeing your ideas run smoothly from start to finish, automated and visualized.
All the code for this project, including the Python scripts (Fetch_stock_data.py, clean_and_transform_data.py, fetch_and_analyze_stock_data.py), the Airflow DAG (stock_data_pipeline_dag.py), a detailed README.md (with setup instructions and the system architecture diagram), and a screenshot of the Grafana dashboard, is available on my GitHub repository: Link to Data Pipeline project
I hope sharing this journey has been insightful. I’d love to hear your feedback, answer any questions, or simply connect. Thanks for reading!