Parquet batch ingestion with Python, Postgres and Docker.
Introduction
We'll use Python to ingest the NYC TLC green taxi trip records for September 2019 into a Postgres database. The records are available as a Parquet file at the following URL:
(https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2019-09.parquet)
The ingestion will be done in batches, which lets us ingest large files efficiently even when the memory of the compute resource is limited.
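The core idea, as a minimal sketch (the actual `ingest_data.py` in the repository may differ; the connection string, table name, and batch size here are illustrative):

```python
import pyarrow.parquet as pq
from sqlalchemy import create_engine

# Illustrative connection string; the real values come from environment variables.
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

parquet_file = pq.ParquetFile("green_tripdata_2019-09.parquet")

# iter_batches() yields RecordBatch objects of bounded size, so only one
# batch needs to be held in memory at a time.
for i, batch in enumerate(parquet_file.iter_batches(batch_size=100_000)):
    df = batch.to_pandas()
    # The first batch replaces any existing table; later batches append to it.
    df.to_sql("green_taxi_data", engine,
              if_exists="replace" if i == 0 else "append", index=False)
    print(f"Inserted batch {i} ({len(df)} rows)")
```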
We will also ingest a small CSV file containing taxi zone data for NYC. Finally, we will run SQL queries on the ingested data using pgAdmin.
The GitHub repository of this project can be found at the following URL:
(https://github.com/Aditya-Tiwari-07/parquet_batch_ingestion)
Development Environment Exploration and Setup for GitHub Codespaces.
Fork the GitHub repository of the project:
(https://github.com/Aditya-Tiwari-07/parquet_batch_ingestion)
Open a codespace with the forked repository.
Explore the codespace:
- In the codespace terminal, check whether Docker, Python, and Git are installed by running `docker -v`, `python --version`, and `git -v` respectively. You should get their respective versions if they are installed.
- Run `docker image ls` and `docker ps -a` to check whether any images or containers are present in the codespace.
- Run `which python` to see the path to the Python binary used by your current Python kernel.
Explore the files in the project directory `/workspaces/parquet_batch_ingestion`:

- `postgres.yaml`: Compose file for the Postgres service.
- `ingest.yaml`: Compose file for the ingestion service.
- `pgadmin.yaml`: Compose file for the pgAdmin service.
- `ingest_data.py`: Python script that downloads the Parquet file and ingests it in batches into the Postgres database.
- `dockerfile`: Dockerfile for containerizing the `ingest_data.py` script and building the `taxi_ingest:v1` image.
- `dev.env`: Default environment variable file.
- `zone_ingest.py`: Python script that downloads a small CSV file containing taxi zone data for NYC and ingests it into Postgres (a sketch of such a script follows this list).
- `.gitignore`: Files and folders to be ignored by Git version control.
- `README.md`: Markdown file with a brief description of the project.
- `LICENSE`: The Apache license file for the project code.
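As a rough idea of what `zone_ingest.py` could look like (a hedged sketch, not the repository's exact code; the URL and connection string are assumptions):

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumed URL for the TLC taxi zone lookup file; adjust if the project uses another source.
ZONES_URL = "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"

# Illustrative credentials; the real script reads them from the environment.
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# The zone lookup file is tiny, so a single to_sql call is enough -- no batching needed.
df = pd.read_csv(ZONES_URL)
df.to_sql("zones", engine, if_exists="replace", index=False)
print(f"Ingested {len(df)} zone rows")
```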
Set up the codespace for the project:
- In the terminal, make sure you're in the project directory `/workspaces/parquet_batch_ingestion`.
- Run `mkdir green_taxi_data && mkdir pgadmin_data` to create the bind mount directories that persist data for Postgres and pgAdmin.
- Run `cp dev.env .env` to copy the default environment variable file to a `.env` file. You may change these values as per your requirements.
- Run `source .env` to load the environment variables into the current terminal session.
- Run `pip install sqlalchemy psycopg2 python-dotenv` to install the Python libraries the codespace needs for this project (see the sketch after this list for how a script can load the `.env` values with `python-dotenv`).
- Run `docker build -t taxi_ingest:v1 .` to build the `taxi_ingest:v1` Docker image. This image needs to be rebuilt every time the `ingest_data.py` file changes.
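As a hedged sketch of how a script can pick up those values with `python-dotenv` (the variable names such as `POSTGRES_USER` are assumptions about what `dev.env` defines):

```python
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Load the variables from the .env file into the process environment.
load_dotenv()

# Variable names are assumed; match them to whatever dev.env actually defines.
user = os.environ["POSTGRES_USER"]
password = os.environ["POSTGRES_PASSWORD"]
host = os.environ.get("POSTGRES_HOST", "localhost")
port = os.environ.get("POSTGRES_PORT", "5432")
db = os.environ["POSTGRES_DB"]

# The engine is now ready for to_sql / read_sql calls.
engine = create_engine(f"postgresql://{user}:{password}@{host}:{port}/{db}")
```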
Parquet file batch ingestion and querying the data from Postgres.
- Run `docker compose -f postgres.yaml -f ingest.yaml -f pgadmin.yaml up`. This:
    - Pulls the `postgres:13` and `dpage/pgadmin4` Docker images from the online Docker registry. We already have the `taxi_ingest:v1` image in our local Docker registry.
    - Creates the `pg-network` network and the respective containers for the `pg-database`, `pg-ingest`, and `pgadmin` services.
    - Attaches the network to the three containers.
    - Starts the three containers.
- If the ingestion is successful and the Postgres and pgAdmin services are up, go to the Ports tab of the codespace and open port `8080` for pgAdmin in the browser.
- Log in to pgAdmin, create a server for the Postgres database, and check whether the `green_taxi_data` table has been ingested. If yes, try running some queries on the table (an example query follows this list).
- Use `docker compose -f postgres.yaml -f ingest.yaml -f pgadmin.yaml down` to bring down the three containers and their network.
- Execute `docker compose -f postgres.yaml -f pgadmin.yaml up` to relaunch the Postgres and pgAdmin containers in `pg-network` and check whether the server configuration and the ingested data persist in the local bind mounts `pgadmin_data` and `green_taxi_data` respectively.
- Execute `python zone_ingest.py` to ingest the small CSV file with the NYC taxi zone data, which happens in one go.
- Check the `zones` table in the Postgres database using the running pgAdmin service.
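For example, a query joining trips to pickup zones can be run in the pgAdmin query tool, or from Python as sketched here (a sketch assuming the mixed-case column names `PULocationID` and `LocationID` from the TLC schemas were preserved during ingestion; credentials are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")  # illustrative credentials

# Top 10 pickup zones by trip count. Mixed-case column names carried over
# from the Parquet/CSV schemas must be double-quoted in Postgres.
query = """
SELECT z."Zone" AS pickup_zone, COUNT(*) AS trips
FROM green_taxi_data t
JOIN zones z ON t."PULocationID" = z."LocationID"
GROUP BY z."Zone"
ORDER BY trips DESC
LIMIT 10;
"""

print(pd.read_sql(query, engine))
```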
We now have a robust setup for downloading large Parquet files of taxi trip records, ingesting them into Postgres in batches, and analyzing the data with SQL queries through pgAdmin.