#3 Creating integrated client table + Basic Analysis

Once I had cleaned the raw files (visit_history_clean, client_info_raw, and invoice_raw), my next goal was to combine them into a single, master-level view of every customer—no matter which dataset they appreaed in.
Each dataset offered part of the picture:
visit_history
: client ids, full names, emails, phone numbers, genders, birthdays, and addressesinvoice
: client ids, full namesclient_info
: client ids, first names, last names, emails, phone numbers
But since not every customer existed in all three datasets, I needed a way to preserve every unique client_id across the system.
UNION all client_ids
WITH all_clients AS (
SELECT client_id FROM visit_history_raw
UNION
SELECT client_id FROM client_info_raw
UNION
SELECT client_id FROM invoice_raw
)
This gave me a CTE(all_clients) that captured every unique client_id across all sources
LEFT JOIN from the master list
Then I left-joined each cleaned dataset onto the base, using COALESCE() to combine fields where needed:
SELECT DISTINCT
a.client_id,
COALESCE(v.client_name, CONCAT(c.firstname, ' ', c.lastname), i.client_name) AS client_name,
v.gender,
COALESCE(v.email, c.email) AS email,
COALESCE(.phone, c.cellphone) AS phone,
v.address,
v.city,
v.state,
v.zipcode,
v.birthday
FROM all_clients a
LEFT JOIN visit_history_clean v ON a.client_id = v.client_id
LEFT JOIN client_info_raw c ON a.client_id = c.client_id
LEFT JOIN invoice_raw i ON a.client_id = i.client_id;
This gave me a clean, de-duplicated, all-in-one client_info_clean table with
Full name
Phone, Email
Demographic information - gender, birthday, Address(state, city, zipcode, address)
Auditing Column Completeness
Before diving too deep, I wanted to sanity check the data, to understand how complete each column is.
I used this query to get a quick summary:
WITH stats AS (
SELECT
COUNT(*) AS total_records,
COUNT(gender) AS gender_count,
COUNT(email) AS email_count,
COUNT(phone) AS phone_count,
COUNT(birthday) AS birthday_count,
COUNT(address) AS address_count,
COUNT(city) AS city_count,
COUNT(state) AS state_count,
COUNT(zipcode) AS zipcode_count
FROM client_info_clean
)
SELECT
unnest(ARRAY[
'gender_count', 'email_count', 'phone_count', 'birthday_count',
'address_count', 'city_count', 'state_count', 'zipcode_count', 'total_records'
]) AS column_name,
unnest(ARRAY[
gender_count, email_count, phone_count, birthday_count,
address_count, city_count, state_count, zipcode_count, total_records
]) AS count,
ROUND(
unnest(ARRAY[
gender_count, email_count, phone_count, birthday_count,
address_count, city_count, state_count, zipcode_count, total_records
]) * 100.0 / total_records, 1
) AS percentage
FROM stats;
This gave me a quick snapshot of which fields were more complete and which had missing values.
Basic Demographic Analysis
Once the table was built, I immediately ran basic demographic queries to get a feel for the population.
Gender Breakdown
SELECT gender, count(*) as count
FROM client_info_clean
GROUP BY gender
ORDER BY gender
TOP 10 Cities by Customer Count
SELECT city, count(*) as count
FROM client_info_clean
WHERE city <> ''
GROUP BY city
ORDER BY count DESC
LIMIT 10;
Age Grouping using Birthday
I calculated age buckets using the AGE() and EXTRACT() functions.
SELECT
CASE
WHEN EXTRACT(YEAR FROM AGE(birthday)) BETWEEN 10 AND 19 THEN '10s'
WHEN EXTRACT(YEAR FROM AGE(birthday)) BETWEEN 20 AND 29 THEN '20s'
WHEN EXTRACT(YEAR FROM AGE(birthday)) BETWEEN 30 AND 39 THEN '30s'
WHEN EXTRACT(YEAR FROM AGE(birthday)) BETWEEN 40 AND 49 THEN '40s'
WHEN EXTRACT(YEAR FROM AGE(birthday)) BETWEEN 50 AND 59 THEN '50s'
WHEN EXTRACT(YEAR FROM AGE(birthday)) >= 60 THEN '60+'
ELSE 'Unknown'
END AS age_group,
COUNT(*) AS count
FROM client_info_clean
WHERE birthday IS NOT NULL
GROUP BY age_group
ORDER BY age_group;
What I learned today
From these simple stats, I learned that
The gender ratio was surprisingly skewed.
Most customers came from a small number of cities in Southern California, which is < 30min away from the store by car.
The age distribution was concentrated in the 30s-40s range.
These insights helped guide the next step: segmentation, mapping, and targeting strategies.
I finally have an integrated client information data in my hand, so as next step, I will anonymize other tables and start to analyze!
Resources
SQL Clause - COALESCE
SQL Clause - unnest
Subscribe to my newsletter
Read articles from kyuwon kang directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
