Building a Serverless Clickstream Analytics Pipeline on AWS

Hello everyone!
I’m an aspiring Cloud Engineer who’s always on the lookout for projects that push my limits and challenge my understanding of cloud technologies. This was one such project—or dare I say, THE project—that checked all the boxes in terms of difficulty and learning.
I received this project as a problem statement from my mentor, a Senior Solutions Architect at AWS India. At first glance, it seemed straightforward:
Modernize an application that was previously hosted on an IaaS stack by leveraging DynamoDB for fast and scalable structured data storage. Ensure low-latency operations, automate data pipelines, and enable SQL-based querying on archived data using Athena and S3. Finally, visualize insights through QuickSight dashboards for better decision-making.
Seemed simple enough, right? Well, not quite. As I started working on it, I realized the true complexity of designing a system that was not only scalable and efficient but also capable of automation and real-time analytics.
After brainstorming, I landed on a click-based web app—a simple UI with buttons, where every click would generate and log structured data. But this wasn’t just any random button-click tracker. The focus was on internal stakeholders of an organization—CXOs, CTOs, and other decision-makers—who could analyze the data to gain meaningful insights into user engagement, behavioral patterns, and operational efficiency.
Here's a deep dive into my journey—every challenge, every solution, and the invaluable lessons I took away.
Before I break down the challenges and solutions, let me first take you through how my thought process evolved. Initially, I sketched out a rough idea of the architecture, thinking the project wouldn’t be too complex. However, as I progressed, I came to understand the complexities involved, which helped me develop a more refined architecture.
I had a rough idea of how to structure the project, so I quickly sketched out an initial design on paper. Here’s how it looked:
And this is the final architecture I built after understanding the complexities and refining my approach:
Comparing the two diagrams, you can see how much my understanding evolved. Now, let’s dive into the challenges I faced while building this.
1. Brainstorming the Application
The problem statement had a few key requirements:
An application that produces structured data
Low-latency and scalable data ingestion
Automated data movement for analysis
Visualization for end users or stakeholders
But here’s where things got interesting—the problem statement didn’t specify what the app should be or what kind of data it should collect. At first, this felt like a small detail, but the more I thought about it, the more I realized that this decision would shape the entire project.
After discussing with my mentor and exploring different ideas, I settled on a click-based web app with a set of buttons. Every time a user clicked a button, it would generate structured data with:
Timestamp – The exact moment of the click
Button Name – (e.g., "Google," "Netflix," etc.)
Device – Desktop, Mobile, or Tablet
Browser – Extracted from User-Agent
Location – City and Country
Page URL – The page where the click happened
I wanted to keep it simple yet meaningful. With this data, I could analyze things like:
How many Netflix clicks came from mobile vs. desktop?
Which day of the week gets the most YouTube clicks?
First Major Challenge: Defining the Data
At first, it seemed like a small decision, but choosing the right data points was tricky.
Too many attributes would make the design overly complicated.
Too few would make the insights boring, or even pointless.
After a lot of trial and error, I landed on the handful of fields you saw above—enough to capture meaningful trends without overcomplicating things.
2. Building the Frontend and Backend
Frontend:
I built the frontend using React.js with a simple design—just a few buttons labeled “Google,” “YouTube,” “Netflix,” etc. (I'll add an image of my frontend with this).
The idea was straightforward: when someone clicks any button, the app would capture some details and send them to the backend. The attributes collected were:
Timestamp (ISO 8601 format)
Button Name (Google, Netflix, etc.)
Device & Browser (navigator.userAgent)
Location (City, Country via Geolocation API)
Page URL (window.location.href)
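To make that concrete, here's a minimal sketch of the click handler. The /api endpoint URL and the getLocation() helper are placeholders, not my exact code:

```javascript
// Rough sketch of the click handler.
// The endpoint URL and getLocation() helper are placeholders.
async function handleClick(buttonName) {
  const clickEvent = {
    timestamp: new Date().toISOString(),          // ISO 8601
    button: buttonName,                           // "Google", "Netflix", ...
    device: /Mobi|Android|Tablet|iPad/i.test(navigator.userAgent)
      ? "Mobile/Tablet"
      : "Desktop",
    browser: navigator.userAgent,                 // raw User-Agent string
    location: await getLocation(),                // e.g. "Bengaluru, India" via a geolocation API
    pageUrl: window.location.href,
  };

  // POST to the backend over HTTPS (this is exactly where the
  // mixed-content issue described later bit me).
  await fetch("https://api.example.com/api/clicks", {   // placeholder custom domain
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(clickEvent),
  });
}
```

Each button in the UI just calls this handler with its own label, e.g. <button onClick={() => handleClick("Netflix")}>Netflix</button>.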
Backend:
For the backend, I built an Express.js server to process the incoming click data and send it to AWS. After testing everything locally, I had to deploy it in the cloud—but where?
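Before getting to where it would run, here's a minimal sketch of what that server does, assuming the AWS SDK v3 Document Client and placeholder names (a ClickEvents table and an /api/clicks route):

```javascript
// Minimal sketch of the Express backend. Table and route names are placeholders.
const express = require("express");
const cors = require("cors");
const crypto = require("crypto");
const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const { DynamoDBDocumentClient, PutCommand } = require("@aws-sdk/lib-dynamodb");

const app = express();
app.use(cors());
app.use(express.json());

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Receive a click event from the frontend and store it in DynamoDB.
app.post("/api/clicks", async (req, res) => {
  const { timestamp, button, device, browser, location, pageUrl } = req.body;
  try {
    await ddb.send(new PutCommand({
      TableName: "ClickEvents",   // placeholder table name
      Item: { id: crypto.randomUUID(), timestamp, button, device, browser, location, pageUrl },
    }));
    res.status(201).json({ ok: true });
  } catch (err) {
    console.error("Failed to store click event", err);
    res.status(500).json({ ok: false });
  }
});

// A simple endpoint the ALB health check can hit.
app.get("/health", (_req, res) => res.send("ok"));

app.listen(process.env.PORT || 3000);
```

In a real deployment, the table name, region, and CORS settings would come from environment variables rather than being hard-coded.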
I decided to containerize my backend using Docker, push the image to ECR (Elastic Container Registry), and run it on ECS Fargate since it’s serverless and requires no infrastructure management.
Here’s what I did:
Created an ECS cluster.
Defined a task to use my Docker image from ECR.
Created a service to run my server as containers on Fargate.
Attached an Application Load Balancer (ALB) to distribute traffic.
This last part—ALB—turned into a big challenge.
How I debugged the ALB:
To keep the app highly available, I wanted multiple containers to handle requests. An ALB (Application Load Balancer) would make sure requests were distributed evenly. Simple, right?
Wrong.
Everything was set up (or so I thought), but requests from the frontend never reached my backend. I spent 2 days debugging, trying to figure out where the issue was. Finally, I opened the browser's developer console and found the problem:
My frontend was served over HTTPS, but it was calling the ALB over plain HTTP. Browsers treat that as mixed content and block insecure (HTTP) requests made from a secure (HTTPS) page, so every call to the ALB was blocked before it ever left the browser.
The Fix
Once I knew the problem, the solution became clear:
Route 53: I already had a hosted zone for my custom domain, along with an SSL certificate for it in ACM.
Created an alias record in Route 53 mapping a custom API subdomain to the ALB's DNS name.
Attached the SSL certificate to the ALB's HTTPS listener, so the new API endpoint could serve traffic over HTTPS.
This was the light-bulb moment, and as soon as I made these changes, everything worked. The whole episode sharpened my debugging skills and showed me how one small detail can break an entire system.
Here’s what the backend looks like:
3. Data Storage and Processing
Now that the click data was flowing into DynamoDB, the next step was analysis and visualization using QuickSight.
I first wrote a Lambda function to extract data from DynamoDB and upload it to an S3 bucket. Initially, I stored it as JSON, but after researching how Athena and QuickSight handle data, I learned that JSON is a row-oriented format (better suited to transactional processing), while column-oriented formats are preferred for analytics, which was exactly my use case. That's when I came across Parquet, a columnar format optimized for fast queries and compression.
I modified my Lambda function to export the DynamoDB data in Parquet format and store it in S3.
(I'll include an image of the Lambda logs showing some exported data.)
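Here's a minimal sketch of what that export Lambda can look like in Node.js, assuming the parquetjs library is bundled with the function and using placeholder names for the DynamoDB table (ClickEvents) and the destination bucket (clickevents-archive):

```javascript
// Nightly export Lambda: DynamoDB -> Parquet -> S3.
// Table and bucket names are placeholders; parquetjs is assumed to be bundled.
const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const { DynamoDBDocumentClient, ScanCommand } = require("@aws-sdk/lib-dynamodb");
const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");
const parquet = require("parquetjs");
const fs = require("fs");

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const s3 = new S3Client({});

// Schema mirrors the click-event attributes described earlier.
const schema = new parquet.ParquetSchema({
  timestamp: { type: "UTF8", optional: true },
  button:    { type: "UTF8", optional: true },
  device:    { type: "UTF8", optional: true },
  browser:   { type: "UTF8", optional: true },
  location:  { type: "UTF8", optional: true },
  pageUrl:   { type: "UTF8", optional: true },
});

exports.handler = async () => {
  // 1. Scan the table, paginating until LastEvaluatedKey is empty.
  const items = [];
  let lastKey;
  do {
    const page = await ddb.send(new ScanCommand({
      TableName: "ClickEvents",          // placeholder table name
      ExclusiveStartKey: lastKey,
    }));
    items.push(...page.Items);
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);

  // 2. Write the items to a Parquet file in Lambda's /tmp scratch space.
  const localPath = "/tmp/clicks.parquet";
  const writer = await parquet.ParquetWriter.openFile(schema, localPath);
  for (const item of items) {
    await writer.appendRow({
      timestamp: item.timestamp, button: item.button, device: item.device,
      browser: item.browser, location: item.location, pageUrl: item.pageUrl,
    });
  }
  await writer.close();

  // 3. Upload to S3 under a date-based prefix so Athena can partition by day.
  const day = new Date().toISOString().slice(0, 10);
  await s3.send(new PutObjectCommand({
    Bucket: "clickevents-archive",       // placeholder bucket name
    Key: `clicks/dt=${day}/clicks.parquet`,
    Body: fs.readFileSync(localPath),
  }));
  return { exported: items.length };
};
```

A full table Scan is fine at this scale; for much larger tables, DynamoDB's native export-to-S3 feature or an incremental query by date would be cheaper.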
Switching to Parquet significantly boosted query speeds and also helped cut costs, since Athena charges per query based on the amount of data scanned, and the compressed, columnar Parquet files meant far less data to scan.
Next, I had to automate this process, so I set up an EventBridge rule with a CRON expression to trigger my Lambda function every night at 11:59 PM. This ensured fresh data was available in S3 each morning for analysis.
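For reference, in EventBridge that nightly schedule is a cron expression along the lines of cron(59 23 * * ? *) (minutes, hours, day-of-month, month, day-of-week, year). One caveat: EventBridge evaluates cron schedules in UTC, so the hour field may need adjusting if the 11:59 PM target is in local time rather than UTC.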
4. Querying the data with Athena
With data in S3, the next step was to run SQL queries. Athena was the obvious choice, but I faced a challenge—Athena needs a predefined schema and partitions to efficiently query data from S3. I initially set up a manual schema but quickly realized it wasn't scalable.
That's when I explored AWS Glue. I learned about crawlers and the Data Catalog, which can automatically detect the schema, pick up partition attributes, and organize the data from my Parquet files. Once I created a Glue database and pointed a crawler at the S3 data, it handled the rest: creating the table and keeping its partitions up to date.
Now I could write SQL queries in Athena to extract insights from the clickstream data, such as:
Total clicks per button:
SELECT button, COUNT(*) FROM clickevents GROUP BY button;
Clicks on a specific day:
SELECT * FROM clickevents WHERE day='02' LIMIT 10;
and so on.
5. Visualization with QuickSight
The final step was creating visual dashboards in Amazon QuickSight.
I started by creating a QuickSight account and setting up a dataset using Athena as the data source. After validating the connection, I selected the appropriate database and table—everything seemed good to go.
But then, I ran into an issue—I wasn’t able to generate any visuals. I couldn’t figure out the problem at first, but after some debugging, I checked the IAM role I had assigned to QuickSight. I had granted it read access to S3, but I had missed write access.
This was the exact error I saw in QuickSight:
requestId: cf658afd-3905-41e2-b0db-0f31ea3c35eb
sourceErrorCode: GENERIC_SQL_EXCEPTION
sourceErrorMessage: An error has been thrown from the AWS Athena client. Unable to verify/create output bucket clickevents-athena [Execution ID not available]
After analyzing the error and reading up on Stack Overflow, I found that QuickSight (through Athena) needed permission to write query results to S3. I updated the IAM role to include write access, and as soon as I did that, the connection worked perfectly.
For anyone facing this error, check IAM permissions first—that’s most likely where the issue is.
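For context, read-only S3 access (actions like s3:GetObject, s3:ListBucket, and s3:GetBucketLocation) isn't enough here: Athena also needs write-side actions such as s3:PutObject and s3:AbortMultipartUpload on its query-results bucket (clickevents-athena in my case) so it can store the results that QuickSight then reads back.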
After fixing this, Athena and QuickSight were successfully linked, and I was finally able to generate visual dashboards from the clickstream data. Seeing all the visuals come to life was incredibly satisfying!
Here are a few screenshots from QuickSight:
These insights may look simple (or even boring to some) for my application, a basic clickstream tracker. But the same pipeline becomes genuinely valuable for complex applications, such as large e-commerce platforms, where stakeholders like CXOs and CTOs can use these insights to improve their product, applying AI or ML on top of the data to understand user behavior: what users tend to like, what drives them to click on a product, and so on.
6. Challenges & Lessons Learned
To me, this project wasn't just about setting up AWS services—it was about debugging real-world issues and actually building something unique.
Key Challenges and the lessons I learned:
Defining the Scope: The open-ended problem statement was a puzzle in itself; in the end, a simple button-click app was enough to generate useful analytics.
Securing ALB + HTTPS: Took two days to fix SSL issues. Route 53 + ACM + ALB must be connected properly.
Data Format Optimization: Raw JSON was inefficient and expensive. Parquet is better for scalable analytics.
Automating Data Processing: Manually scanning DynamoDB was a no-go. EventBridge + Lambda automated it.
IAM Headaches: QuickSight wasn’t working due to missing S3 permissions. Always check IAM first.
7. Final Thoughts and Call to Action
If there’s one thing this project taught me, it’s that every problem has a solution—you just have to keep going. No matter how frustrating things get, if you stay on it, you’ll figure it out. And trust me, the satisfaction of seeing everything finally work is absolutely worth it.
More than anything, this project forced me to think like an architect—not just solving problems but designing solutions from the ground up.
For anyone attempting something like this—do it. These are the projects that push you beyond tutorials, forcing you to think at every level of abstraction. You’ll struggle, break things, and probably want to quit a few times, but in the end, you’ll have built something real—something you can definitely be proud of.
If you’ve made it this far, thank you for reading! I really appreciate your time, and I hope you found this blog insightful. If you have any questions, thoughts, or suggestions to improve this, feel free to reach out—I’d love to hear from you.