ClickHouse: The Key to Faster Insights

Arin ZingadeArin Zingade
11 min read

ClickHouse is rapidly gaining traction for its unmatched speed and efficiency in processing big data. Cloudflare, for example, uses ClickHouse to process millions of rows per second and reduce memory usage by over four times, making it a key player in large-scale analytics. With its advanced features and real-time query performance, ClickHouse is becoming a go-to choice for companies handling massive datasets. In this article, we'll explore why ClickHouse is increasingly favored for analytics, its key features, and how to deploy it on Kubernetes. We'll also cover some best practices for scaling ClickHouse to handle growing workloads and maximize performance.

Introduction

ClickHouse is a high-performance, column-oriented SQL database management system (DBMS) designed for online analytical processing (OLAP), excelling in handling large datasets with remarkable speed, particularly for filtering and aggregating data. By utilizing columnar storage, it enables rapid data access and efficient compression, making it ideal for industries that demand fast data retrieval and analysis. Its common use cases include web analytics, where it processes vast amounts of tracking data, business intelligence to power high-speed decision-making, and log analysis for large-scale monitoring and troubleshooting.

Key Features of Clickhouse

  • Columnar Storage: Enables fast data access and efficient compression, enhancing the speed of analytical queries and efficient compressions.

  • High Performance and Scalability: Optimized for handling massive datasets and complex queries with unique table engines that determine how data is stored.

  • Real-Time Analytics: Supports real-time data processing and analytics.

  • Maximizing Hardware Usage: ClickHouse is designed to utilize all available resources of the system effectively.

  • Rich Functionality: Offers a wide array of built-in functions that enhance data manipulation and analysis.

How Does ClickHouse Work?

ClickHouse is designed for speed and scalability, making it ideal for handling vast amounts of data. Its distributed nature allows for data replication across multiple nodes, ensuring both fault tolerance and high availability.

Architecture

ClickHouse operates on a distributed architecture where data is partitioned and replicated across nodes. It employs a Shared Nothing Architecture, moving towards a decoupled compute and storage model, facilitating parallel and vectorized execution.

An Example of Shared Nothing ClickHouse Cluster with 3 replica servers

Storage Mechanism

ClickHouse uses columnar storage which allows it to read and compress large amounts of data quickly. Organizations migrating from row-based systems like Postgres can benefit significantly in terms of performance. Tables utilize unique Table Engines—notably the MergeTree engine family—to store data effectively, leveraging ClickHouse’s strengths in analytical processing.

Query Execution

ClickHouse utilizes a unique query engine optimized for high-speed data retrieval, leveraging Single Instruction, Multiple Data (SIMD) instructions to process multiple data points simultaneously. This parallel processing significantly enhances performance, especially for complex queries. As demonstrated in the video A Day in the Life of a Query, ClickHouse efficiently breaks down and executes queries, focusing on answering specific questions rather than merely retrieving raw data. To further understand query execution, we can use the EXPLAIN clause. The EXPLAIN clause in SQL is used to display the execution plan of a query. When you run a query with EXPLAIN, the database doesn't actually execute the query. Instead, it shows a detailed breakdown of how the query would be executed, including the steps the query optimizer will take.

For ClickHouse query execution steps look like:

Performance introspection EXPLAIN clause

Source: Performance introspection EXPLAIN clause

EXPLAIN PLAN: The query plan shows the in a generic way the stages that need to be executed for the query, but the query plan does not show how the ClickHouse executes the query using the available resources on the machine, its handy to check in what order the clauses are getting executed, read the plan from bottom to top.

For demonstration purposes, we will be using the UK Property Prices dataset.

EXPLAIN PLAN indexes = 1
SELECT
    postcode1,
    type,
    COUNT(*) AS property_count,
    AVG(price) AS avg_price
FROM
    uk_price_paid
WHERE
    is_new = 1  
    AND date >= '2023-01-01'
GROUP BY
    postcode1, type
ORDER BY
    avg_price DESC;

for the above query, we get output as:

Expression (Project names)
Limit (preliminary LIMIT (without OFFSET))
Sorting (Sorting for ORDER BY)
Expression ((Before ORDER BY + Projection))
Aggregating
Expression (Before GROUP BY)
Expression
ReadFromMergeTree (default.uk_price_paid)

Indexes:
 PrimaryKey
 Condition: true
 Parts: 1/1
 Granules: 3598/3598

In analyzing the query execution plan, it's essential to interpret the steps from the bottom up (in this case from ReadMergeTree to Limit) , as each layer represents a sequential operation performed on the data.

EXPLAIN AST: With this clause, we can explore the Abstract Syntax Tree, we can also visualize this via Graphviz

For the query:

EXPLAIN AST graph = 1
SELECT
    postcode1,
    type,
    COUNT(*) AS property_count,
    AVG(price) AS avg_price
FROM
    uk_price_paid
WHERE
    is_new = 1  
    AND date >= '2023-01-01'
GROUP BY
    postcode1, type
ORDER BY
    avg_price DESC;

we get Abstract Syntax Tree as:

Abstract Syntax Tree

EXPLAIN PIPELINE: Introspecting the query pipeline can help you identify where the bottle necks of the query.

For the query:

EXPLAIN PIPELINE graph = 1
SELECT
    postcode1,
    type,
    COUNT(*) AS property_count,
    AVG(price) AS avg_price
FROM
    uk_price_paid
WHERE
    is_new = 1  
    AND date >= '2023-01-01'
GROUP BY
    postcode1, type
ORDER BY
    avg_price DESC;

we get output as:

ClickHouse naturally parallelizes queries, with each step utilizing multiple threads by default. In this example, the stages are handled by 4 threads, meaning each thread processes roughly one-fourth of the data in parallel before combining the results. This approach speeds up execution significantly. For instance, identifying stages that run in a single thread is key to optimizing slow queries. By isolating these bottlenecks, we can target specific parts of the query for performance improvements, ensuring faster and more efficient execution overall.

Integration Capabilities

ClickHouse is highly compatible with a wide range of data tools, including ETL/ELT processes and BI tools like Apache Superset. It supports virtually all common data formats, making integration seamless across diverse ecosystems.

Why Choose ClickHouse and Migrate?

Choosing ClickHouse offers significant advantages, particularly for organizations dealing with large-scale data analytics. Its unique combination of performance, cost-effectiveness, and community support makes it a compelling choice for migrating from traditional databases.

Performance Advantages

ClickHouse is optimized for OLAP workloads, delivering exceptional speed in both data ingestion and query execution, offering sub-second query performance even when processing billions of rows. This makes it ideal for real-time analytics and decision-making in data-intensive industries. The primary key in ClickHouse plays a crucial role in determining how data is stored and searched. It's important to select columns that are frequently queried, as the primary key should optimize query execution, especially for the WHERE clause. In ClickHouse, primary key is not unique to each row.

Real-World Success Stories

Many organizations have successfully migrated to ClickHouse, achieving substantial improvements in performance and cost savings. From e-commerce giants to financial companies, success stories highlight ClickHouse’s ability to transform data analytics capabilities at scale. For more details, refer to ClickHouse Adopters.

Running ClickHouse on Kubernetes

In this guide, we’ll walk through the process of running ClickHouse on a Kubernetes cluster in 7 steps:

Step 1: Install Kubectl

First, we need to install kubectl, the command-line tool for interacting with Kubernetes clusters. Run the following commands in your terminal:

sudo apt-get update
sudo apt-get install -y kubectl
# Download Minikube
# Please check your OS configuration and download from:
# https://minikube.sigs.k8s.io/docs/start/?arch=%2Flinux%2Fx86-64%2Fstable%2Fbinary+download
sudo install minikube-linux-amd64 /usr/local/bin/minikube
minikube version
minikube start

At this point, you have set up Kubernetes locally.

Step 2: Install Altinity ClickHouse Operator

Next, we will download and install the Altinity ClickHouse operator to manage our ClickHouse deployment:


kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml

kubectl get pods -n kube-system

You should see the ClickHouse operator pod running, which indicates that the operator is successfully deployed.

Step 3: Install the ClickHouse Database

Now we need to install the ClickHouse database itself. Follow these steps:

A basic configuration example for our demo:


apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: my-clickhouse
  namespace: test-clickhouse-operator
spec:
  configuration:
    clusters:
      - name: cluster
        layout:
          shardsCount: 1
          replicasCount: 1
  templates:
    podTemplates:
      - name: clickhouse-pod-template
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:latest
              resources:
                requests:
                  cpu: "100m"
                  memory: "1Gi"
                limits:
                  cpu: "1"
                  memory: "2Gi"
  defaults:
    templates:
      podTemplate: clickhouse-pod-template
  1. Now apply the configuration and check the status of the pods and services:
cat clickhouse-install.yaml | kubectl apply -f -
kubectl get pods -n test-clickhouse-operator
kubectl get services -n test-clickhouse-operator

You should see services running as defined in your installation configuration.

Step 4: Connect to ClickHouse Database

To interact with the ClickHouse database, we need to install the ClickHouse client on our local machine. If you are using a different operating system, refer to the official ClickHouse installation guide.

Run the following commands to install ClickHouse:

sudo apt-get install -y apt-transport-https ca-certificates curl gnupg

curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg

echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list

sudo apt-get update
sudo apt-get install -y clickhouse-server clickhouse-client
sudo clickhouse start

kubectl -n test-clickhouse-operator port-forward <pod_name> 9000:9000 &

clickhouse-client

Step 5: Test Your Services

To verify that everything is running correctly, execute the following commands:

kubectl get pods -n test-clickhouse-operator
kubectl get services -n test-clickhouse-operator

Step 6: Execute Queries

Now, let’s create a table and execute some queries in ClickHouse:

clickhouse-client
CREATE TABLE test_table (
    id UInt32,
    name String
) ENGINE = MergeTree()
ORDER BY id;

INSERT INTO test_table VALUES (1, 'CloudRaft'), (2, 'ClickHouse');
SELECT * FROM test_table;

You should see the results in the CLI with the changed path, indicating that you are interacting directly with the ClickHouse cluster.

Step 7: Load Testing

To further evaluate the performance of your ClickHouse installation, consider using load testing tools like Apache JMeter or k6 to simulate increased query loads. Measure how query response times change as you add more nodes to the cluster.

Key Differences between PostgreSQL and ClickHouse

While both Postgres and ClickHouse serve different purposes, the key distinction lies in how they handle replication and sharding. Postgres is primarily designed for transactional workloads (OLTP), where data consistency and durability are prioritized. On the other hand, ClickHouse is tailored for analytical workloads (OLAP), and optimized for high-speed querying and large-scale data analysis.

Materialized Views

In ClickHouse, Materialized Views are a powerful feature designed to improve query performance by pre-aggregating and storing data. Unlike regular views, which are calculated on-the-fly during query execution, materialized views physically store the results of a query, allowing for faster reads. These views can also leverage the efficient compression and fast access capabilities of the columnar storage model, further enhancing performance.

Materialized views are particularly useful in environments where query performance is critical, as they provide pre-computed results that save time during execution. Postgres’s Materialized Views need to be manually re-updated, whereas ClickHouse automatically updates them with insert-and-optimize-later philosophy.

Scaling ClickHouse

In ClickHouse, scaling can be achieved through replication and sharding mechanisms. These help distribute data and queries across multiple nodes for performance and fault tolerance.

ClickHouse traditionally relies on ZooKeeper, a centralized service for coordinating distributed systems. ZooKeeper ensures that data replicas are in sync across nodes by maintaining metadata, managing locks, and handling failovers. It acts as a key component to keep the cluster’s state consistent, ensuring that replicas do not diverge and that read and write operations are properly distributed.

Replication

Replication ensures that copies of the same data are stored across multiple nodes to provide redundancy and improve fault tolerance. Replication in ClickHouse is at the Table Level.

  • ReplicatedMergeTree is the engine used for replicated tables.

  • Each table has a replica on multiple servers, and these replicas are kept in sync.

  • Clickhouse-Keeper manages the coordination between these replicas, ensuring consistency by managing locks, transactions, and metadata related to replication.

  • In case one replica goes down, the system can still read from and write to the available replicas.

Replication Process Example:

  • Let’s assume there are two replicas, A and B. A write to Replica A will be logged and replicated to Replica B, ensuring that both have the same data. This happens asynchronously to avoid latency issues.

Sharding

Sharding in Clickhouse is the process of dividing the data horizontally into smaller parts and distributing it across different servers (shards). This allows Clickhouse to handle very large datasets by spreading the load.

  • Distributed Table: Clickhouse uses a distributed table to achieve sharding. A distributed table is a logical table that sits on top of local tables (sharded across different nodes) and acts as a query router.

  • When a query is executed on a distributed table, it is automatically routed to the relevant shard(s) b ased on the sharding key.

Sharding Process Example:

  • Suppose you have 3 nodes (Node 1, Node 2, Node 3), and data is sharded by a key such as user ID. A distributed table will split the data based on the user ID and store different users’ data on different nodes. Queries on user-specific data will be routed directly to the shard holding that user’s data, improving performance.

Conclusion

In conclusion, ClickHouse offers a powerful solution for businesses seeking high-speed, large-scale analytics. With its columnar storage, real-time query performance, and scalability through replication and sharding, it serves as an excellent alternative for organizations transitioning from traditional row-based databases like Postgres. Particularly effective in industries such as web analytics, business intelligence, and log analysis, ClickHouse meets the demands for rapid data retrieval and analysis.

However, while ClickHouse excels in query performance and scalability, it may introduce complexities in data insertion compared to traditional databases, and it’s not well-suited for OLTP use cases. Organizations considering migration to ClickHouse should weigh these trade-offs, especially if they require frequent real-time inserts or updates. Ultimately, its scalability, cost-effectiveness, and growing community support make ClickHouse a compelling choice for modern data-driven applications, transforming how businesses manage and analyze data.


0
Subscribe to my newsletter

Read articles from Arin Zingade directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Arin Zingade
Arin Zingade