Load Data into StarRocks from Any Database
Table of contents
Introduction
Let's look at how we can easily export load data into a StarRocks from most major databases with Sling, a versatile CLI data integration tool which allows you to quickly extract and load data right from the terminal.
Sling is a tool with a goal of making the experience of ingesting data a positive, even pleasant one. Sling focuses on 3 of data types interfaces:
From File Systems to Databases
From Databases to Databases
From Databases to File Systems
The list of connections that Sling supports continues to grow. You can see the full list here, but it supports all the major platforms including Clickhouse, DuckDB, Google BigQuery, Google BigTable, MariaDB, MongoDB, MotherDuck, MySQL, Oracle, PostgreSQL, Prometheus, Redshift, Snowflake, SQL Server, SQLite, StarRocks and Trino.
StarRocks
StarRocks is a powerful distributed, columnar storage database system designed for real-time analytics. One of the key features is that it supports several table type designs which can meet varying business requirements:
- Duplicate Key table: each of the records as a separate row
- Aggregate table: the aggregated record as a row
- Primary Key table: only the most recently loaded record as a row (based on key)
This makes it suitable for many analytics use cases, such as business intelligence, ad hoc querying, real-time events and even machine learning / AI-driven data processing.
Load Data with Sling
First, let us install Sling. See here on details how to do so. It is usually a simple command, such as:
# On Mac
brew install slingdata-io/sling/sling
# On Windows Powershell
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling
# On Linux
curl -LO 'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz' \
&& tar xf sling_linux_amd64.tar.gz \
&& rm -f sling_linux_amd64.tar.gz \
&& chmod +x sling
You should be able to run the sling
command at this point.
Next, we'll set up StarRocks. If you don't have a StarRocks instance running, you can quickly launch a development instance on your machine with docker and load data into it.
docker run --rm -p 9030:9030 -p 8030:8030 -p 8040:8040 -it starrocks/allin1-ubuntu
Now we can create a connection for our StarRocks database with Sling and test connectivity. It is important to set the fe_url
so we can Stream Load into StarRocks See here for more details on configuration.
$ sling conns set starrocks url="starrocks://root:@localhost:9030/sling" fe_url="http://root:@localhost:8030"
8:01PM INF connection `starrocks` has been set in /Users/me/.sling/env.yaml. Please test with `sling conns test starrocks`
# test our connection
$ sling conns test starrocks
8:01PM INF success!
Great. In this tutorial, we will be extracting data from a MySQL database. We can set up the connection as shown below (see here for more details).
$ sling conns set mysql url="mysql://admin:password@localhost:3306/mysql"
8:02PM INF connection `mysql` has been set in /Users/me/.sling/env.yaml. Please test with `sling conns test mysql`
# test our connection
$ sling conns test mysql
8:02PM INF success!
# discover our tables
$ sling conns discover mysql
+-----+--------------------+----------------------------------------------+-------+---------+
| # | SCHEMA | NAME | TYPE | COLUMNS |
+-----+--------------------+----------------------------------------------+-------+---------+
| 1 | information_schema | ADMINISTRABLE_ROLE_AUTHORIZATIONS | table | 9 |
| 2 | information_schema | APPLICABLE_ROLES | table | 9 |
| 3 | information_schema | CHARACTER_SETS | table | 4 |
| 4 | information_schema | CHECK_CONSTRAINTS | table | 4 |
| 5 | information_schema | COLLATIONS | table | 7 |
| 6 | information_schema | COLLATION_CHARACTER_SET_APPLICABILITY | table | 2 |
.....
Great, we're ready to load some data into StarRocks!
Replication
We'll be using a replication to define what tables Sling should load. Replications are the best way to use sling in a reusable manner. The defaults
key allows reusing your inputs with the ability to override any of them in a particular stream. Both YAML or JSON files are accepted.
# replication.yaml
source: mysql
target: starrocks
defaults:
object: main.{stream_schema}_{stream_table}
mode: full-refresh
streams:
# all tables in schema `mysql`
mysql.*:
# only one table with specific duplicate keys
finance.account_sales:
mode: truncate
target_options:
table_keys:
duplicate: [account_id, sale_id]
We can run the replication like this:
sling run -r replication.yaml
Output
09:14PM INF Sling Replication [5 streams] | mysql -> starrocks
09:14PM INF [1 / 5] running stream "mysql"."accounts"
09:14PM INF connecting to source database (mysql)
09:14PM INF connecting to target database (starrocks)
09:14PM INF reading from source database
09:14PM INF writing to target database [mode: full-refresh]
09:14PM INF streaming data
09:14PM INF importing into StarRocks via stream load
09:14PM INF created table main
.mysql_accounts
09:14PM INF inserted 62347 rows into main
.mysql_accounts
in 10 secs [6234 r/s] [20.0 MB]
09:14PM INF execution succeeded
09:14PM INF [2 / 5] running stream "mysql"."orders"
09:14PM INF connecting to source database (mysql)
09:14PM INF connecting to target database (starrocks)
09:14PM INF reading from source database
09:14PM INF writing to target database [mode: full-refresh]
09:14PM INF streaming data
09:14PM INF importing into StarRocks via stream load
09:14PM INF created table main
.mysql_orders
09:14PM INF inserted 716540 rows into main
.mysql_orders
in 40 secs [17,973 r/s] [120 MB]
09:14PM INF execution succeeded
.......
09:15PM INF [5 / 5] running stream "finance"."account_sales"
09:15PM INF connecting to source database (mysql)
09:15PM INF connecting to target database (starrocks)
09:15PM INF reading from source database
09:15PM INF writing to target database [mode: full-refresh]
09:15PM INF streaming data
09:15PM INF importing into StarRocks via stream load
09:15PM INF created table main
.finance_account_sales
09:15PM INF inserted 11718 rows into main
.finance_account_sales
in 0 secs [14,126 r/s] [2.0 MB]
09:15PM INF execution succeeded
09:15PM INF Sling Replication Completed in 1m 34s | mysql -> starrocks | 5 Successes | 0 Failures
So easy! See many more examples here: https://docs.slingdata.io/sling-cli/run/examples
Conclusion
We went over on how easy it was to install Sling, and export data from MySQL into StarRocks. Feel free to check out other examples here: https://docs.slingdata.io.
Subscribe to my newsletter
Read articles from Fritz Larco directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
Fritz Larco
Fritz Larco
A technical data engineer / passionate technologist who loves tackling interesting problems with code, exploration & persistence.