Building a Real-Time Stock Market Analytics Platform on GCP (Part 1)

Introduction
In today's fast-paced financial markets, the ability to process and analyze stock market data in real time provides a significant competitive advantage. This article describes our journey building a comprehensive Stock Market Analytics Platform on Google Cloud Platform (GCP), combining batch processing, real-time streaming, and advanced analytics components to deliver actionable insights from market data.
We'll walk through the architecture, implementation details, and key technical decisions that went into creating this end-to-end data engineering solution.
Project Overview
Our Stock Market Analytics Platform processes data from both historical stock feeds and real-time market updates. It transforms raw market data into meaningful analytics through a combination of batch and streaming pipelines, stores the results in BigQuery, and visualizes them through interactive dashboards.
The system handles millions of records daily, processes complex transformations through dbt, and delivers insights ranging from basic price trends to sophisticated technical indicators.
Architecture
Here's the high-level architecture of our stock market analytics platform:
Stock Market Data Files   --->  Dataproc (Spark Jobs)        ---+
                                                                |
Stock Market Data Stream  --->  Kafka VM (Producer/Consumer) ---+-->  BigQuery (Data Storage)
                                                                |                |
Airflow + DBT (Analytics) --->  Transformed Data Models      ---+                v
                                                                   Looker Studio Visualizations

(Airflow triggers the Dataproc batch processing once daily.)
This architecture combines:
Batch Processing: Using Dataproc (managed Spark) for processing historical data
Real-time Streaming: Using Kafka for streaming real-time stock market data
Data Storage: BigQuery for storing raw and processed data
Analytics: Airflow and DBT for orchestration and data transformation
Daily Batch Updates: Airflow triggers Dataproc jobs once a day to process the latest stock data
Visualization: Connected to Looker Studio for data visualization
Infrastructure as Code (Terraform)
For our cloud infrastructure, we implemented Infrastructure as Code (IaC) using Terraform. This allowed us to define, version, and deploy our GCP resources programmatically.
Our main Terraform configuration provisions:
Cloud Storage buckets for raw data and processing artifacts
BigQuery datasets for data storage and analytics
Service accounts with appropriate IAM permissions
Network configurations for secure communication
VM instance templates for Kafka
Firewall rules for network security
By using Terraform, we achieved several key benefits:
Reproducibility: We can recreate the entire infrastructure consistently
Version Control: Infrastructure changes are tracked in Git
Automation: Deployment can be automated in CI/CD pipelines
Documentation: The infrastructure is self-documenting through code
Batch Processing with Dataproc and Spark
The batch processing component handles historical stock market data, primarily focusing on daily OHLCV (Open, High, Low, Close, Volume) data points for Fortune 500 companies.
Data Source
We utilize historical data from financial APIs, which provide daily stock price information going back several years. This data is initially stored in Cloud Storage as CSV or Parquet files.
Spark Processing
Our Spark jobs deployed on Dataproc perform several key functions:
Data Cleansing: Removing duplicates, handling missing values, and standardizing formats
Feature Engineering: Creating technical indicators such as moving averages (7-day, 30-day, 50-day), relative strength index (RSI), and volatility measures (see the sketch after this list)
Aggregations: Calculating sector-level and market-wide metrics
Data Enrichment: Joining with company metadata and sector classifications
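To make the feature-engineering step concrete, here is a simplified PySpark sketch of how moving averages, daily returns, and a volatility measure could be computed over the OHLCV data. The bucket paths, column names, and window sizes are illustrative assumptions, not our exact production job.

```python
# Illustrative PySpark sketch (not our exact production job): compute rolling
# moving averages, daily returns, and a simple volatility measure over OHLCV data.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("stock-batch-features").getOrCreate()

# Assumed input location and layout: one row per symbol per trading day.
df = spark.read.parquet("gs://<your-bucket>/raw/stock_prices/")  # placeholder path

# Data cleansing: deduplicate and drop rows with missing closing prices.
df = df.dropDuplicates(["symbol", "trade_date"]).filter(F.col("close").isNotNull())

# Trailing windows per symbol, ordered by trading day.
def trailing_window(days):
    return Window.partitionBy("symbol").orderBy("trade_date").rowsBetween(-(days - 1), 0)

features = (
    df.withColumn("ma_7", F.avg("close").over(trailing_window(7)))
      .withColumn("ma_30", F.avg("close").over(trailing_window(30)))
      .withColumn("ma_50", F.avg("close").over(trailing_window(50)))
      .withColumn(
          "daily_return",
          F.col("close") / F.lag("close").over(
              Window.partitionBy("symbol").orderBy("trade_date")) - 1,
      )
      .withColumn("volatility_30d", F.stddev("daily_return").over(trailing_window(30)))
)

# Write the enriched dataset back to Cloud Storage for loading into BigQuery.
features.write.mode("overwrite").parquet("gs://<your-bucket>/processed/stock_features/")
```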
Dataproc Cluster Configuration
Our Dataproc cluster (provisioned as sketched after this list) is configured with:
1 master node (n1-standard-4)
2-4 worker nodes (n1-standard-4), auto-scaling based on workload
Pre-installed libraries for financial analysis
Custom initialization scripts for environment setup
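For illustration, a cluster like this can also be provisioned from Python with the google-cloud-dataproc client. The sketch below mirrors the configuration above, but the project ID, region, autoscaling policy name, and initialization script path are placeholders rather than our actual values.

```python
# Illustrative sketch: create a Dataproc cluster matching the configuration above.
# Project, region, policy name, and GCS paths are placeholders.
from google.cloud import dataproc_v1

project_id = "your-project-id"   # placeholder
region = "us-central1"           # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "stock-analytics-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Autoscaling policy (2-4 workers) defined separately and referenced here.
        "autoscaling_config": {
            "policy_uri": f"projects/{project_id}/regions/{region}/autoscalingPolicies/stock-batch-policy"
        },
        # Custom environment setup, e.g. installing financial analysis libraries.
        "initialization_actions": [
            {"executable_file": "gs://your-bucket/scripts/install_libs.sh"}
        ],
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is ready
```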
Real-time Data Processing with Kafka
While historical data provides context, our real-time streaming pipeline delivers immediate insights from market movements.
Kafka Setup
We deployed Kafka on a dedicated VM instance with the following components:
Zookeeper: For Kafka cluster management
Kafka Broker: The message broker service running on ports 9092/9093
Kafka Connect: For simplified integrations with external systems
Schema Registry: To ensure data consistency
The VM configuration includes:
n1-standard-4 machine type (4 vCPU, 15 GB memory)
100 GB SSD persistent disk
Ubuntu 20.04 LTS operating system
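Once the broker and supporting services are running, the topic that carries the quote stream has to exist before the producer starts publishing. A minimal sketch using the kafka-python admin client is shown below; the broker address, topic name, and partition count are assumptions for illustration.

```python
# Illustrative sketch: create the topic used for real-time stock quotes.
# Broker address, topic name, and partition count are assumed values.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="kafka-vm-internal-ip:9092")

admin.create_topics([
    NewTopic(
        name="stock-quotes",      # hypothetical topic name
        num_partitions=3,         # allows several consumers to read in parallel
        replication_factor=1,     # single-broker setup on one VM
    )
])
admin.close()
```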
Data Producers
Our main producer is a Python service that connects to financial APIs to fetch real-time stock quote updates. It continuously polls for new price and volume data for selected Fortune 500 companies and publishes these updates to Kafka topics. The producer incorporates rate limiting and error handling to ensure reliable data flow.
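A stripped-down version of that producer loop might look like the sketch below. The fetch_quote helper, topic name, symbol list, and polling interval are placeholders; the real service has more thorough rate limiting and retry logic.

```python
# Simplified producer sketch: poll a quotes API and publish updates to Kafka.
# fetch_quote(), the topic name, and the symbol list are illustrative only.
import json
import time
from kafka import KafkaProducer

def fetch_quote(symbol: str) -> dict:
    """Stand-in for the real financial API call (hypothetical)."""
    return {"symbol": symbol, "price": 0.0, "volume": 0, "quote_time": time.time()}

producer = KafkaProducer(
    bootstrap_servers="kafka-vm-internal-ip:9092",
    key_serializer=lambda key: key.encode("utf-8"),
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

SYMBOLS = ["AAPL", "MSFT", "AMZN"]  # subset of the tracked Fortune 500 tickers

while True:
    for symbol in SYMBOLS:
        try:
            quote = fetch_quote(symbol)
            producer.send("stock-quotes", key=symbol, value=quote)
        except Exception as exc:
            # Basic error handling: log the failure and keep the loop alive.
            print(f"failed to fetch/publish {symbol}: {exc}")
    producer.flush()
    time.sleep(5)  # crude rate limiting to respect API quotas
```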
Data Consumers
On the consumer side, we have a Python service that reads from Kafka topics and streams data directly to BigQuery. The consumer implements exactly-once semantics to prevent data duplication and uses batch processing for improved efficiency. It handles backpressure gracefully by adjusting its consumption rate to match downstream capacity.
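In outline, the consumer behaves roughly like the sketch below: it polls messages in batches, writes them to BigQuery, and commits Kafka offsets only after a successful insert. The table and topic names are placeholders, and the full exactly-once and backpressure handling is omitted for brevity.

```python
# Simplified consumer sketch: read quote messages in batches and load them
# into BigQuery. Offsets are committed only after a successful insert; the
# production service layers deduplication on top of this.
import json
from kafka import KafkaConsumer
from google.cloud import bigquery

consumer = KafkaConsumer(
    "stock-quotes",
    bootstrap_servers="kafka-vm-internal-ip:9092",
    group_id="bq-loader",
    enable_auto_commit=False,
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
bq = bigquery.Client()
TABLE_ID = "your-project.market_data.realtime_quotes"  # placeholder table

while True:
    # poll() returns a dict of partition -> list of messages.
    batches = consumer.poll(timeout_ms=1000, max_records=500)
    rows = [msg.value for msgs in batches.values() for msg in msgs]
    if not rows:
        continue
    errors = bq.insert_rows_json(TABLE_ID, rows)  # streaming insert
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
    consumer.commit()  # acknowledge only after the batch is safely stored
```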
Streaming to BigQuery
The real-time data is streamed into BigQuery tables optimized for streaming inserts (table setup is sketched after this list):
Partitioned by ingestion date for optimal query performance
Clustered by symbol for efficient filtering
Real-time availability with minimal latency
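As an illustration, a table with these properties could be created with the BigQuery Python client as sketched below; the dataset, table name, and schema fields are assumptions rather than our exact definitions.

```python
# Illustrative sketch: create a streaming-friendly quotes table that is
# partitioned by ingestion time and clustered by symbol. Names and schema
# are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "your-project.market_data.realtime_quotes",  # placeholder table ID
    schema=[
        bigquery.SchemaField("symbol", "STRING"),
        bigquery.SchemaField("price", "FLOAT"),
        bigquery.SchemaField("volume", "INTEGER"),
        bigquery.SchemaField("quote_time", "TIMESTAMP"),
    ],
)

# Ingestion-time (daily) partitioning: no field is specified, so BigQuery
# partitions on the insert timestamp.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
# Clustering by symbol keeps rows for the same ticker physically close,
# which makes per-symbol filters cheaper.
table.clustering_fields = ["symbol"]

client.create_table(table)
```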
Coming Next
In Part 2 of this article, we'll explore:
Data transformation with dbt
Workflow orchestration with Apache Airflow
Visualization with Looker Studio
Performance optimization strategies
Monitoring and alerting
We'll also dive into the specific analytics models we've developed, including technical indicators, sentiment analysis, and sector performance tracking.
#DEZOOMCAMP