How I Built a Stock Data Pipeline


Have you ever looked at stock charts and wondered what goes on behind the scenes to make them work? Maybe not, but I did, especially after starting crypto trading late last year. That curiosity led me, a few weeks back, to start a personal data engineering project: an automated stock data pipeline.
First, I pulled raw market data and cleaned it up with Python and Pandas. After that, I stored it safely in the cloud and used Apache Airflow to automate the process. To wrap it all up, I built some interactive dashboards with Grafana so I could easily track what’s happening.
I wanted to build something meaningful, something close to what professionals use in real-world environments, with tools widely adopted across the industry. So, if you’re curious about data engineering or just love a good “how I built this” story, you’re in the right place. I’ll walk you through the entire journey: the tools, the wins, what went wrong, and how I fixed it all.
Below is the architecture diagram of the entire project, showing the tools used and how they fit together:
Setup and Installation
To get this project up and running, here are some of the installation steps I followed:
Clone the repository:
git clone https://github.com/yourusername/stock-data-pipeline.git
cd stock-data-pipeline
Create and activate a virtual environment:
python3 -m venv venv_project
source venv_project/bin/activate
Install the Python dependencies:
pip install -r requirements.txt
The main dependencies include:
- requests for API calls
- python-dotenv to manage environment variables
- pandas for data manipulation
- apache-airflow for workflow orchestration
- psycopg2-binary for PostgreSQL database connection
Set up your environment variables:
Create a .env file in the root directory with your Alpha Vantage API key.
Configure PostgreSQL database:
Use AWS RDS or a local PostgreSQL instance. Update your database connection details in the .env configuration files used by the scripts and Airflow.
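For reference, here is a minimal sketch of how the scripts can pick up those settings; the variable names (ALPHA_VANTAGE_API_KEY, DB_HOST, and so on) are illustrative placeholders, so match them to whatever your own .env actually uses:

```python
# Minimal sketch: load settings from .env and open a PostgreSQL connection.
# The environment variable names here are illustrative placeholders.
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

API_KEY = os.getenv("ALPHA_VANTAGE_API_KEY")

conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),            # e.g. your AWS RDS endpoint
    port=os.getenv("DB_PORT", "5432"),
    dbname=os.getenv("DB_NAME", "stock_data_db"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)
```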
Start Apache Airflow
Initialize the database, then start the web server and scheduler with the following commands:
airflow db init
airflow webserver -p 8080
airflow scheduler
OR
Use the standalone command to do it all at once:
airflow standalone
Run the DAG:
Access the Airflow UI at http://localhost:8080, enable the DAG named stock_data_pipeline_dag, and trigger it to start the pipeline.
With the installations complete, let's proceed to the project overview.
Phase 1: Data Sourcing - Getting Reliable Market Data
Every data pipeline begins with a source. For this project, I needed reliable daily stock market data. After exploring various options, I chose the Alpha Vantage API. It offers a generous free tier, provides comprehensive historical and daily time-series data (open, high, low, close prices, and volume), and is relatively simple to integrate.
The Tooling: My primary tool for this phase was Python, leveraging the requests library to handle the HTTP GET requests to the Alpha Vantage API. The core logic for fetching and initial handling of the data resides in my Fetch_stock_data.py script.
The Process: The script is designed to be flexible, accepting a stock symbol (e.g., 'IBM', 'AAPL', 'MSFT') as a parameter. This makes the pipeline easy to configure for tracking different stocks. The API returns data in JSON format, which the script then parses.
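The full script is on GitHub, but a minimal sketch of the fetch step could look like this (assuming the standard TIME_SERIES_DAILY endpoint and the illustrative environment variable name from earlier):

```python
# Sketch of the fetch step, not the full Fetch_stock_data.py script.
import os

import pandas as pd
import requests
from dotenv import load_dotenv

load_dotenv()

def fetch_daily(symbol: str, output_size: str = "compact") -> pd.DataFrame:
    """Fetch daily OHLCV data for one symbol from Alpha Vantage."""
    response = requests.get(
        "https://www.alphavantage.co/query",
        params={
            "function": "TIME_SERIES_DAILY",
            "symbol": symbol,
            "outputsize": output_size,  # "compact" = recent data, "full" = full history
            "apikey": os.getenv("ALPHA_VANTAGE_API_KEY"),
        },
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()["Time Series (Daily)"]
    # Rows are keyed by date, with fields like "1. open", "4. close", "5. volume".
    return pd.DataFrame.from_dict(payload, orient="index").sort_index()

print(fetch_daily("AAPL").head())
```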
Security First: API keys are critical credentials. To manage my Alpha Vantage API key securely, I used a .env file to store it. The python-dotenv library loads this key into the script's environment at runtime, ensuring the key isn't hardcoded or accidentally committed to version control.
What went wrong?
At first, when I tried to get data for many stocks or the full history, I hit Alpha Vantage’s limits on how often I could ask for data. The API only lets you make a certain number of requests in a set time, and I was making too many too fast.
The Fix
To fix this, I made sure to follow the API’s rules more carefully. For this project, pulling data for just a few stock symbols with the smaller “compact” dataset was enough. But for bigger projects, I’d need to add error checks when the API limits are hit, add delays between requests, or schedule the data fetching better with Airflow. This experience taught me the importance of designing systems that can handle limits and keep running smoothly.
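I didn't need it for these small, compact pulls, but a simple guard against the rate limit could look something like the sketch below; the retry count and delay are arbitrary choices, not Alpha Vantage's documented limits.

```python
# Illustrative retry/delay guard (not part of my pipeline); the attempt count
# and sleep interval are arbitrary placeholders.
import time

def fetch_daily_with_backoff(symbol, max_attempts=3, wait_seconds=60):
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_daily(symbol)  # fetch_daily from the sketch above
        except KeyError:
            # A throttled Alpha Vantage response typically omits the
            # "Time Series (Daily)" key and returns a note instead.
            if attempt == max_attempts:
                raise
            print(f"Rate limit hit for {symbol}, retrying in {wait_seconds}s...")
            time.sleep(wait_seconds)

# Spacing out requests for several symbols also helps stay under the cap.
for sym in ["IBM", "AAPL", "MSFT"]:
    df = fetch_daily_with_backoff(sym)
    time.sleep(15)
```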
Here's a look at the Fetch_stock_data.py script in action, successfully retrieving data for AAPL and displaying the initial DataFrame structure:
Phase 2: Data Transformation – Making Raw Data Useful
Once I had stock data loading into the daily_stock_data table, I ran some quick checks using my local script fetch_and_analyze_stock_data.py. Simple SQL queries helped me confirm the data was clean and usable. I was also able to view average closing prices, daily highs/lows, etc.
After that, I realized that if I wanted meaningful, consistent insights (and later, good-looking Grafana dashboards), I couldn't rely on manual SQL every time. So I built a dedicated transformation script: clean_and_transform_data.py
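The exact queries in fetch_and_analyze_stock_data.py aren't reproduced here, but the quick checks looked roughly like this; the high_price and low_price column names are my guesses based on the schema described later:

```python
# Rough idea of the sanity checks; the actual script's queries differ.
import pandas as pd

# conn is the psycopg2 connection from the earlier sketch
checks = pd.read_sql(
    """
    SELECT symbol,
           COUNT(*)         AS trading_days,
           AVG(close_price) AS avg_close,
           MAX(high_price)  AS highest_high,
           MIN(low_price)   AS lowest_low
    FROM daily_stock_data
    GROUP BY symbol;
    """,
    conn,
)
print(checks)
```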
Key Transformations Implemented in clean_and_transform_data.py:
20-day Moving Average (MA): A common technical indicator that smooths out short-term price fluctuations and highlights longer-term trends (the overall direction of a stock's price). It's calculated by averaging the closing prices over the past 20 trading days. In Pandas, this is done with df['close_price'].rolling(window=20).mean(), which computes the average over a sliding 20-day window.
Daily Percentage Change: This metric shows the day-over-day volatility and return of a stock, calculated as (current_day_close - previous_day_close) / previous_day_close * 100.
The Output: The clean_and_transform_data.py script prepares a refined dataset, combining the original price and volume data with these additional insights. This clean and structured data is then ready for automated processing, ensuring that every run of the pipeline delivers up-to-date, insightful information.
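Putting the two calculations together, the core of the transformation looks roughly like this. It's a sketch, not the full clean_and_transform_data.py: the real script also reads from and writes back to PostgreSQL, and may process one symbol at a time rather than grouping by symbol as I do here.

```python
# Sketch of the core transformations described above.
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Add the 20-day moving average and daily % change for each symbol."""
    df = df.sort_values(["symbol", "trade_date"]).copy()
    closes = df.groupby("symbol")["close_price"]
    df["ma_20_day"] = closes.transform(lambda s: s.rolling(window=20).mean())
    df["daily_pct_change"] = closes.transform(lambda s: s.pct_change() * 100)
    return df
```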
Phase 3: Data Storage & Management - A Reliable Cloud Foundation
After transforming the data, the next critical step was storing it in a system that allows fast, reliable access and easy integration with visualization tools.
A reliable and easily accessible storage system is the backbone of any data pipeline and ensures that the entire pipeline remains stable and scalable as more data flows in. Here's a breakdown of my storage management:
Database Choice: I selected PostgreSQL for this because it’s a powerful, open-source object-relational database system and is known for its reliability and rich feature set.
Cloud Hosting with AWS RDS: To simulate a production-like environment and leverage the benefits of a managed service (like automated backups, patching, and scalability), I hosted my PostgreSQL instance on AWS RDS (Relational Database Service). This significantly reduces the operational overhead compared to self-hosting a database.
Database Management with pgAdmin: Throughout development, I used pgAdmin to manage the PostgreSQL database. It handled everything from schema design and table creation to manual data inspection and query testing. The tool provided a clear and efficient way to interact with the database throughout the project.
Schema Design: Using pgAdmin, I designed and implemented the following table structures within my stock_data_db database on RDS:
daily_stock_data: This table stores the raw daily values as fetched by Fetch_stock_data.py. It includes columns for symbol, trade_date, open, high, low, close prices, and volume, with a unique constraint on symbol and trade_date to prevent duplicates. Below is an image of the daily_stock_data table in pgAdmin:
transformed_stock_data: This table is updated by clean_and_transform_data.py and holds the enriched data, including all fields from daily_stock_data plus the calculated ma_20_day and daily_pct_change. It also has a unique constraint on symbol and trade_date. Below is an image of the transformed_stock_data table in pgAdmin:
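Based on that description, the table definitions would be close to the sketch below; the exact column names and types in my database may differ slightly, so treat this as an approximation rather than the real DDL.

```python
# Approximate table definitions inferred from the schema description above.
DDL = """
CREATE TABLE IF NOT EXISTS daily_stock_data (
    symbol      VARCHAR(10) NOT NULL,
    trade_date  DATE        NOT NULL,
    open_price  NUMERIC,
    high_price  NUMERIC,
    low_price   NUMERIC,
    close_price NUMERIC,
    volume      BIGINT,
    UNIQUE (symbol, trade_date)
);

CREATE TABLE IF NOT EXISTS transformed_stock_data (
    symbol           VARCHAR(10) NOT NULL,
    trade_date       DATE        NOT NULL,
    open_price       NUMERIC,
    high_price       NUMERIC,
    low_price        NUMERIC,
    close_price      NUMERIC,
    volume           BIGINT,
    ma_20_day        NUMERIC,
    daily_pct_change NUMERIC,
    UNIQUE (symbol, trade_date)
);
"""

with conn.cursor() as cur:  # conn from the earlier connection sketch
    cur.execute(DDL)
conn.commit()
```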
Phase 4: Orchestration - Automating with Apache Airflow
Having built the individual components for data ingestion, processing, and storage, the next step was to bring them all together under an automated workflow. Instead of running scripts manually, I implemented Apache Airflow to schedule, track, and manage the pipeline end-to-end. The breakdown is as follows:
Apache Airflow for orchestration: I chose Airflow because it's well-suited for managing complex data workflows. Its use of Directed Acyclic Graphs (DAGs) offers a clean way to define task dependencies and execution order.
The DAG Definition:
I created a DAG, defined in stock_data_pipeline_dag.py, to represent the full pipeline. Some of its tasks are as follows:
fetch_stock_data_task: This uses Airflow's BashOperator to run Fetch_stock_data.py. It is parameterized to accept a stock symbol.
transform_and_load_task: This task runs after the fetch task completes successfully. It executes the clean_and_transform_data.py script, which reads the newly fetched raw data, transforms it, and loads the results into the transformed_stock_data table.
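A stripped-down version of that DAG might look like the sketch below; the script paths, schedule, and default parameter values are placeholders rather than the exact contents of stock_data_pipeline_dag.py.

```python
# Stripped-down sketch of the DAG; paths and defaults are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="stock_data_pipeline_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    params={"stock_symbol": "IBM", "output_size": "compact"},
) as dag:

    fetch_stock_data_task = BashOperator(
        task_id="fetch_stock_data_task",
        bash_command=(
            "python /path/to/Fetch_stock_data.py "
            "{{ params.stock_symbol }} {{ params.output_size }}"
        ),
    )

    transform_and_load_task = BashOperator(
        task_id="transform_and_load_task",
        bash_command=(
            "python /path/to/clean_and_transform_data.py {{ params.stock_symbol }}"
        ),
    )

    # The transform task only runs once the fetch task has succeeded.
    fetch_stock_data_task >> transform_and_load_task
```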
Below is an image of the stock_data_pipeline's DAG:
Key Features of the Airflow Setup:
Flexible Configuration: The DAG is designed to be triggered with configurable parameters like the stock_symbol and output_size (compact or full history for Alpha Vantage). This makes the pipeline highly flexible for fetching data for different stocks without modifying the DAG code.
Task Dependencies: Airflow ensures that the transformation task only runs after the corresponding fetch task for a given symbol has succeeded. It enforces execution order.
Retries & Logging: Airflow's built-in retry mechanisms help handle temporary failures (like API downtime). Its extensive logging capabilities are invaluable for debugging and monitoring pipeline health.
Web UI: The Airflow web interface provides a great overview of DAG runs, task statuses, logs, and allows for manual triggering and management.
What went wrong?
This was a classic “works on my machine” moment until automation revealed the missing pieces. It all started when my clean_and_transform_data.py script, which heavily relies on Pandas, kept failing when triggered by Airflow, even though it ran perfectly from my terminal inside venv_project.
The culprit? Airflow's BashOperator was pointing to a different Python environment, one that didn't have Pandas installed.
The Fix
I resolved it by installing Pandas in the environment Airflow was using to run the script. Once the correct dependencies were available, the task executed without errors.
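Another way to avoid that mismatch is to point the BashOperator at the virtual environment's interpreter explicitly, along these lines (the paths are placeholders, and this slots into the DAG sketch from earlier):

```python
from airflow.operators.bash import BashOperator

# Hypothetical variation: call the venv's Python explicitly so the task always
# sees the same packages as local runs. The paths are placeholders.
transform_and_load_task = BashOperator(
    task_id="transform_and_load_task",
    bash_command=(
        "/path/to/venv_project/bin/python "
        "/path/to/clean_and_transform_data.py {{ params.stock_symbol }}"
    ),
)
```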
Phase 5: Visualization - Bringing Data to Life with Grafana
The final step in this data pipeline was making the insights easily accessible and understandable. For this, I chose to build an interactive dashboard that would bring the stock data to life.
The Visualization Tool: I selected Grafana for its strength in time-series visualizations, open-source flexibility, and excellent compatibility with PostgreSQL. Its dynamic dashboards and rich querying capabilities made it ideal for this use case.
Connecting to Data: I added my PostgreSQL database on AWS RDS as a Grafana data source, so the dashboard panels could query the transformed_stock_data table directly.
Building the Dashboard ("Multi-Stock Price & MA Dashboard"):
The dashboard was created by setting up multiple panels and writing SQL queries within Grafana’s panel editor to retrieve and visualize the data.
For example, to display the closing price and 20-day moving average for IBM, a typical query selected trade_date (aliased as time), close_price, and ma_20_day from transformed_stock_data where symbol = 'IBM'.
Each panel used the Time series visualization type, enabling a clear display of stock price movements and trends over time.
Currently, the dashboard includes dedicated panels for IBM and AAPL, each visualizing their respective closing prices alongside the 20-day moving average. This setup allows for easy visual comparison across different stocks.
What went wrong?
1. The Missing Table
I encountered an error in Grafana:
“Error: transformed_stock_data doesn't exist.”
Despite confirming the table existed in my database, Grafana couldn't find it. After some investigation, I discovered that the data source was mistakenly configured to connect to the default postgres database on AWS RDS, not my intended stock_data_db.
The Fix:
I updated the Grafana data source configuration to point to the correct database (stock_data_db). Once connected to the right database, the error disappeared, and the panels loaded as expected.
2. The Comma Culprit
While trying to visualize both close_price and ma_20_day on the same chart, the panel rendered nothing. The query inspector flagged a syntax error, which turned out to be caused by a missing comma between the columns in the SELECT clause.
The Fix:
I added the missing comma in the SQL query:
SELECT trade_date AS time, close_price, ma_20_day FROM transformed_stock_data WHERE symbol = 'IBM';
Final Thoughts
Building this stock data pipeline taught me far more than I expected, from managing API limitations and orchestrating tasks with Airflow to debugging environment issues and designing clear visual dashboards in Grafana. Each phase brought its challenges, but also valuable lessons that deepened my understanding of real-world data engineering workflows.
If you’re someone looking to break into data engineering, or just curious about how raw data becomes usable insight, I hope this walkthrough gives you a solid foundation and inspiration to start your own project. There’s nothing more rewarding than seeing your ideas run smoothly from start to finish, automated and visualized.
All the code for this project, including the Python scripts (Fetch_stock_data.py, clean_and_transform_data.py, fetch_and_analyze_stock_data.py), the Airflow DAG (stock_data_pipeline_dag.py), a detailed README.md (with setup instructions and the system architecture diagram), and a screenshot of the Grafana dashboard, is available on my GitHub repository: Link to Data Pipeline project
I hope sharing this journey has been insightful. I’d love to hear your feedback, answer any questions, or simply connect. Thanks for reading!