[SQL] Writing Efficient SQL Queries


Imagine you are building a program. It could be anything: a website, a mobile app, and so on. These applications usually need a database to serve the data a user requests.
For example, picture a table that stores GPS data. The longitude and latitude are written to the database every second, so each user adds one row per second. Now say 10,000 users run your application. Your database receives 10,000 rows a second.
How do you deal with that kind of load? In this article, I will introduce three key strategies you can use to make your SQL queries more efficient.
1. Only Select Columns You Want
It is tempting to start your query with SELECT * FROM. But what if your table has a column that stores large amounts of text?
For example, suppose the table reviews has a column called description. Even if your application enforces a character limit on restaurant reviews, selecting the description column when you do not need it is very inefficient.
Example code:
star_query = "SELECT * FROM `bigquery-public-data.github_repos.contents`"
show_amount_of_data_scanned(star_query)
basic_query = "SELECT size, binary FROM `bigquery-public-data.github_repos.contents`"
show_amount_of_data_scanned(basic_query)
Data processed: 2682.118 GB
Data processed: 2.531 GB
Selecting only the two columns the query actually needs (size and binary) scans roughly a thousand times less data than SELECT *.
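Both snippets rely on a helper called show_amount_of_data_scanned, which this article does not define. A minimal sketch, assuming the google-cloud-bigquery client library, could use BigQuery's dry-run mode, which estimates bytes scanned without actually running the query:
from google.cloud import bigquery

client = bigquery.Client()

def show_amount_of_data_scanned(query):
    # dry_run=True asks BigQuery to estimate the query without executing it
    dry_run_config = bigquery.QueryJobConfig(dry_run=True)
    query_job = client.query(query, job_config=dry_run_config)
    print('Data processed: {} GB'.format(round(query_job.total_bytes_processed / 10**9, 3)))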
2. Read Less Data
Reading less data means, in other words, scanning less data. It is closely related to the first strategy. In the bikeshare example below, each start_station_id corresponds to exactly one start_station_name (and likewise for the end station), so filtering and grouping by the ID columns makes the query scan two extra columns to retrieve the same answer. You can get the same result with fewer columns scanned.
More efficient code:
more_data_query = """
SELECT MIN(start_station_name) AS start_station_name,
MIN(end_station_name) AS end_station_name,
AVG(duration_sec) AS avg_duration_sec
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
WHERE start_station_id != end_station_id
GROUP BY start_station_id, end_station_id
LIMIT 10
"""
show_amount_of_data_scanned(more_data_query)
less_data_query = """
SELECT start_station_name,
end_station_name,
AVG(duration_sec) AS avg_duration_sec
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
WHERE start_station_name != end_station_name
GROUP BY start_station_name, end_station_name
LIMIT 10
"""
show_amount_of_data_scanned(less_data_query)
In the more_data_query example, you are scanning FIVE columns:
start_station_name
end_station_name
duration_sec
start_station_id
end_station_id
In the less_data_query example, you are scanning THREE columns:
start_station_name
end_station_name
duration_sec
(Note that more_data_query needs MIN(start_station_name) only because SQL requires non-grouped columns to be aggregated; since each station ID maps to exactly one name, MIN simply returns that name.)
Data processed: 0.076 GB
Data processed: 0.06 GB
The two queries return the same result, but the second scans about 20% less data by skipping the ID columns.
3. Avoid N:N JOINs
First, let’s have a look at different types of x-to-x JOINs
1:1 JOIN
Most of the JOINs you write day to day are 1:1 JOINs. In this case, each row in each table has at most one match in the other table.
N:1 JOIN
Another type of JOIN is an N:1 JOIN. Here, each row in one table matches potentially many rows in the other table.
N:N JOIN
Finally, an N:N JOIN is one where a group of rows in one table can match a group of rows in the other table. Note that in general, all other things equal, this type of JOIN produces a table with many more rows than either of the two (original) tables that are being JOINed.
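To see why the row count explodes, here is a hypothetical sketch (the tables a and b and their columns are made up purely for illustration):
# Hypothetical illustration of an N:N JOIN blow-up.
# If 100 rows in `a` and 1,000 rows in `b` share the same key,
# this JOIN produces 100 * 1,000 = 100,000 rows for that key alone.
n_to_n_query = """
SELECT a.key, a.value AS value_a, b.value AS value_b
FROM a
INNER JOIN b
ON a.key = b.key
"""
The fix, as the next example shows, is to shrink each side with filtering and aggregation before the JOIN happens.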
Now, let’s dive into actual examples.
Inefficient way:
big_join_query = """
SELECT repo,
COUNT(DISTINCT c.committer.name) as num_committers,
COUNT(DISTINCT f.id) AS num_files
FROM `bigquery-public-data.github_repos.commits` AS c,
UNNEST(c.repo_name) AS repo
INNER JOIN `bigquery-public-data.github_repos.files` AS f
ON f.repo_name = repo
WHERE f.repo_name IN ( 'tensorflow/tensorflow', 'facebook/react', 'twbs/bootstrap', 'apple/swift', 'Microsoft/vscode', 'torvalds/linux')
GROUP BY repo
ORDER BY repo
"""
show_time_to_run(big_join_query)
Time to run: 13.028 seconds
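Like show_amount_of_data_scanned, the helper show_time_to_run is not defined in this article. A minimal sketch, assuming the same BigQuery client as above and disabling the query cache so timings are comparable, might be:
from time import time

def show_time_to_run(query):
    # Disable the cache so we measure real execution time, not a cached result
    time_config = bigquery.QueryJobConfig(use_query_cache=False)
    start = time()
    client.query(query, job_config=time_config).result()
    end = time()
    print('Time to run: {} seconds'.format(round(end - start, 3)))
More efficient way: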
small_join_query = """
WITH commits AS
(
SELECT COUNT(DISTINCT committer.name) AS num_committers, repo
FROM `bigquery-public-data.github_repos.commits`,
UNNEST(repo_name) as repo
WHERE repo IN ( 'tensorflow/tensorflow', 'facebook/react', 'twbs/bootstrap', 'apple/swift', 'Microsoft/vscode', 'torvalds/linux')
GROUP BY repo
),
files AS
(
SELECT COUNT(DISTINCT id) AS num_files, repo_name as repo
FROM `bigquery-public-data.github_repos.files`
WHERE repo_name IN ( 'tensorflow/tensorflow', 'facebook/react', 'twbs/bootstrap', 'apple/swift', 'Microsoft/vscode', 'torvalds/linux')
GROUP BY repo
)
SELECT commits.repo, commits.num_committers, files.num_files
FROM commits
INNER JOIN files
ON commits.repo = files.repo
ORDER BY repo
"""
show_time_to_run(small_join_query)
Time to run: 4.413 seconds
There is a huge gap in efficiency: an 8.6-second difference!
The first approach JOINs first, then filters.
The efficient second approach filters and aggregates first, then JOINs, which means you join two tables that have already been shrunk down. On the other hand, if you join first, you have two tables full of data, and your query scans tons of rows to find matching keys.