First hello world ETL DE project


I just watched Zach's videos (the famous data engineer who has worked at Airbnb and Facebook) and wanted to build something with what I learned. I lack data engineering experience, so I decided to build something while learning. Previously, I felt that data engineering was not that interesting. However, after watching his series, my perspective changed, and I realized that maybe I just had not dug deep enough into the topic, because it is interesting. Something systematically sophisticated always grabs my attention, and now I have almost completed the project.

I started by looking for a free API that provides real-time info. Speaking of real time, Zach is right about how you define real time. Stakeholders know almost nothing about it. Real time, near real time, batching, and micro-batching are always assumed to be the same thing. For them, as long as it is ‘quick’, it can be real time. Why is understanding real time important? Technically, because of the complexity: the closer you get to real time, the higher the complexity. I can't say much more about this, because the API I used is not actually a real-time API since it only updates every 30 seconds. Even so, it has shown its complexity compared to my daily job, which only requires fetching data from a database as per requirement.

I started by playing around with the API. Using plain Python, you can fetch information about the feeder buses in KL, with fields returned such as the trip time, latitude, longitude, bearing, and speed (a rough sketch of this fetch is shown after the list below). Then I decided to build an internal monitoring dashboard that can

  1. Monitor the behavior of the drivers - classify safe, cautious, and dangerous driving behavior

  2. Monitor the condition of the buses - flag buses with more than 10,000 km of mileage that require service

  3. Monitor the current location of the buses - observe the density of bus movement
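
Here is a minimal sketch of what the polling step could look like. The endpoint URL and field names are placeholders for illustration, not the actual API I used:

```python
# Minimal sketch of polling the feeder bus feed with plain Python.
# FEED_URL and the field names are placeholders, not the real endpoint.
import time

import requests

FEED_URL = "https://example.com/feeder-bus/positions"  # hypothetical endpoint


def fetch_feed() -> list[dict]:
    """Fetch the latest vehicle positions as a list of JSON records."""
    response = requests.get(FEED_URL, timeout=30)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # The feed refreshes roughly every 30 seconds, so there is no point polling faster.
    while True:
        for vehicle in fetch_feed():
            print(vehicle.get("latitude"), vehicle.get("longitude"), vehicle.get("speed"))
        time.sleep(30)
```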

First draft

To build this, I started by drawing a first draft of the data orchestration infrastructure.

(yeah, it is quite rough, but I will provide the final version at the end of this article, drawn with Excalidraw)

The first plan was to use Airflow, Dbt, Postgres, and Apache Superset, containerizing all of them. There is no particular reason why I chose Postgres other than it being commonly used and something I had used in a previous project. Airflow, Dbt, and Apache Superset were new to me, but they are learnable.

Database and Dbt

I started by setting up the database first, then playing with SQL in a Jupyter Notebook. The first problem I faced was ensuring data integrity when inserting into the database. This is because I initially converted the JSON returned by the request into a dataframe and inserted it with Pandas .to_sql(). That method can change the schema of the database and does not enforce the data types, since it will insert any kind of data format (including columns with different names) into the database. So I switched to native SQL instead of the Pandas method.
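
As a rough sketch, the native insert looks something like this. The connection string, table, and column names below are simplified placeholders rather than my exact schema:

```python
# Simplified sketch of the native-SQL insert that replaced Pandas .to_sql().
# Connection details, table, and columns are placeholders for illustration.
import psycopg2

INSERT_SQL = """
    INSERT INTO bus_positions (trip_time, latitude, longitude, bearing, speed)
    VALUES (%s, %s, %s, %s, %s)
"""


def insert_positions(records: list[dict]) -> None:
    """Insert API records with a fixed column list and parameterized values."""
    conn = psycopg2.connect(
        "dbname=bus_monitoring user=postgres password=postgres host=localhost"
    )
    try:
        with conn.cursor() as cur:
            # A fixed column list means an unexpected field or type fails loudly
            # instead of silently reshaping the table like .to_sql() can.
            cur.executemany(
                INSERT_SQL,
                [
                    (r["trip_time"], r["latitude"], r["longitude"], r["bearing"], r["speed"])
                    for r in records
                ],
            )
        conn.commit()
    finally:
        conn.close()
```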

Then I artificially normalized the data by adding bus and driver tables. The bus data comes from the original feed, but the driver data is generated. After that, I wrote the scripts for the recurring and cumulative tables, which update daily for the monitoring. All of these SQL queries are written following the Dbt methodology, so I can modularize the scripts and reuse them whenever I want. Personally, this is a good way to have version control, especially for a potentially scalable project that will require many changes in the future. Other than modularization, Dbt has a testing feature that lets me check that data types and column names match what is expected (which is good, because I don't need extra packages like Pydantic or Pytest).

Orchestration

My first plan for implementing data orchestration was to use Airflow (of course, after watching Zach). The tool provides a complete orchestration solution, but the image is too heavy, at least for my local machine to endure. This is because it ships with a lot of batteries-included features, which in my case are overkill (what I want is just automation). Then I found Prefect. Based on Reddit threads and my own testing, the Prefect image is much lighter, which allows me to develop the pipeline locally. The syntax is also much simpler, only requiring tasks and flows; you just need to figure out how to test your flow locally, as the docs are a bit confusing for beginners.
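
For illustration, here is a minimal sketch of how the pipeline could be wrapped in Prefect tasks and a flow (Prefect 2.x syntax; the endpoint and load step are placeholders, not my actual project code):

```python
# Minimal Prefect sketch of the fetch-and-load pipeline (Prefect 2.x syntax).
# The endpoint and load step are placeholders for illustration.
import requests
from prefect import flow, task

FEED_URL = "https://example.com/feeder-bus/positions"  # hypothetical endpoint


@task(retries=3, retry_delay_seconds=10)
def fetch_positions() -> list[dict]:
    """Pull the latest vehicle positions from the API."""
    response = requests.get(FEED_URL, timeout=30)
    response.raise_for_status()
    return response.json()


@task
def load_positions(records: list[dict]) -> int:
    """Load records into Postgres (e.g. the native-SQL insert shown earlier)."""
    # insert_positions(records)  # reuse the psycopg2 sketch from the Dbt section
    return len(records)


@flow(log_prints=True)
def bus_etl_flow():
    records = fetch_positions()
    count = load_positions(records)
    print(f"Loaded {count} records")


if __name__ == "__main__":
    # Testing locally is as simple as calling the flow like a normal function.
    bus_etl_flow()
```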

Visualization

As you can see on the paper, it stated Apache Superset. I also changed my plan away from it, as it is quite unstable in terms of container building, with errors around npm and build-essential. All the docker-compose files provided surprisingly produced errors (I did find a workaround, but only after I had already implemented the alternative), and that made me switch to Metabase. It is also much simpler, more stable, and easier to use. The only catch is that it only syncs with the database daily, which is not what I wanted in the first plan, where I wanted the visualization to update more frequently.

Progress

The project is currently on hold. I need to purchase a server to set up a cluster in order to play with big data and move to production. (I pray and hope so)
