ClickHouse CSV Ingestion with Python: Explicit Schema


Introduction
Hello everyone! Today, I am going to show you how you can create a table in ClickHouse from a CSV using Python.
We use WSL to run ClickHouse on Windows. If you want to know how to run ClickHouse on Windows locally, you can check this article https://blog.siddhantbobde.com/clickhouse-windows-installation
Python Setup for clickhouse
I am using VSCode as my Python editor. You can use any editor you like. Create a new Python virtual environment, or you can use your default one.
To connect to the ClickHouse client locally, we need to install the Python library clickhouse-connect
.
Run the command below to install clickhouse-connect
.
pip install clickhouse-connect
Now, let's connect to our local ClickHouse client using Python.
Create a Python file and add the code below to establish a connection with ClickHouse.
Make sure your ClickHouse is running locally.
# import the package
import clickhouse_connect
# Create a connection
clickhouse_client = clickhouse_connect.get_client(host='localhost', database="default", username='default')
Creating table using csv file.
Lets create a csv having random data. I am using https://www.mockaroo.com/ to generate random data.
I have uploaded my CSV file here. After creating your own test data or downloading test data from the link I've provided, you have to copy the file into your Ubuntu environment.
You can access your ubuntu environment from your file explorer or you can type \\wsl$
in windows run command.
then you have to put csv file in user-files folder in ubuntu environment you have created.
The path will be \Ubuntu\home\{your ubuntu user}\user_files
Now lets write one python script to create a table from csv and insert data into it.
import clickhouse_connect
# Connect to clickhouse client. Below are the default credentials of clickhouse.
clickhouse_client = clickhouse_connect.get_client(host='localhost', database="default", username='default')
# put your table name and file name
table_name = 'testdata'
file_name = 'testdata.csv'
# Query to column names and their data types from file.
result = clickhouse_client.query('DESCRIBE TABLE file({0}, CSVWithNames)'.format('testdata.csv'))
columns_and_types = result.result_rows
# Create table query
create_table_query = 'CREATE TABLE {0} ('.format(table_name)
# Select query to select data from file
select_data_query = 'SELECT '
for column in columns_and_types:
column_name = column[0]
column_type = column[1]
create_table_query += column_name + ' ' + column_type + ', '
select_data_query += column_name + ', '
create_table_query = create_table_query[:-2] + ') ENGINE = MergeTree() ORDER BY tuple() AS '
select_data_query += " FROM file('{0}', CSVWithNames)".format(file_name)
create_table_query += select_data_query
print(create_table_query)
clickhouse_client.query(create_table_query)
After running above python script you can check if your table is created or not. Run following query to print all table names
show tables
After running above query you will see all the table names and check if your table name is present here.
To get the data from the table we created we can run a simple sql like query.
SELECT * FROM testdata limit 5
Note: I have added limit 5 so that i can show you the full output of query. You can run query without adding the limit.
If you have any doubts regarding the blog, comment down below. I will do my best to answer them.
Subscribe to my newsletter
Read articles from SIDDHANT BOBDE directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

SIDDHANT BOBDE
SIDDHANT BOBDE
👋Hi, my name is Siddhant. 👩💻I am a Software Engineer. I am passionate about technology, and coding. ✍I also like to read and write about technology and productivity. In my free time, I play guitar🎸 and also like to play badminton🏸.