Once I had cleaned the raw files (visit_history_clean, client_info_raw, and invoice_raw), my next goal was to combine them into a single, master-level view of every customer—no matter which dataset they appreaed in.

Each dataset offered part of the picture:

visit_history: client ids, full names, emails, phone numbers, genders, birthdays, and addresses
invoice: client ids, full names
client_info: client ids, first names, last names, emails, phone numbers

But since not every customer existed in all three datasets, I needed a way to preserve every unique client_id across the system.

UNION all client_ids

WITH all_clients AS (
    SELECT client_id FROM visit_history_raw
    UNION
    SELECT client_id FROM client_info_raw
    UNION
    SELECT client_id FROM invoice_raw
)

This gave me a CTE(all_clients) that captured every unique client_id across all sources

LEFT JOIN from the master list

Then I left-joined each cleaned dataset onto the base, using COALESCE() to combine fields where needed:

SELECT DISTINCT
    a.client_id,
    COALESCE(v.client_name, CONCAT(c.firstname, ' ', c.lastname), i.client_name) AS client_name,
    v.gender,
    COALESCE(v.email, c.email) AS email,
    COALESCE(.phone, c.cellphone) AS phone,
    v.address,
    v.city,
    v.state,
    v.zipcode,
    v.birthday
FROM all_clients a
LEFT JOIN visit_history_clean v ON a.client_id = v.client_id
LEFT JOIN client_info_raw c ON a.client_id = c.client_id
LEFT JOIN invoice_raw i ON a.client_id = i.client_id;

This gave me a clean, de-duplicated, all-in-one client_info_clean table with

Full name
Phone, Email
Demographic information - gender, birthday, Address(state, city, zipcode, address)

Auditing Column Completeness

Before diving too deep, I wanted to sanity check the data, to understand how complete each column is.

I used this query to get a quick summary:

WITH stats AS (
  SELECT 
    COUNT(*) AS total_records,
    COUNT(gender) AS gender_count,
    COUNT(email) AS email_count,
    COUNT(phone) AS phone_count,
    COUNT(birthday) AS birthday_count,
    COUNT(address) AS address_count,
    COUNT(city) AS city_count,
    COUNT(state) AS state_count,
    COUNT(zipcode) AS zipcode_count
  FROM client_info_clean
)

SELECT 
  unnest(ARRAY[
    'gender_count', 'email_count', 'phone_count', 'birthday_count', 
    'address_count', 'city_count', 'state_count', 'zipcode_count', 'total_records'
  ]) AS column_name,
  unnest(ARRAY[
    gender_count, email_count, phone_count, birthday_count,
    address_count, city_count, state_count, zipcode_count, total_records
  ]) AS count,
  ROUND(
    unnest(ARRAY[
      gender_count, email_count, phone_count, birthday_count,
      address_count, city_count, state_count, zipcode_count, total_records
    ]) * 100.0 / total_records, 1
  ) AS percentage
FROM stats;

This gave me a quick snapshot of which fields were more complete and which had missing values.

Basic Demographic Analysis

Once the table was built, I immediately ran basic demographic queries to get a feel for the population.

Gender Breakdown

SELECT gender, count(*) as count
FROM client_info_clean
GROUP BY gender
ORDER BY gender

TOP 10 Cities by Customer Count

SELECT city, count(*) as count
FROM client_info_clean
WHERE city <> ''
GROUP BY city
ORDER BY count DESC
LIMIT 10;

Age Grouping using Birthday

I calculated age buckets using the AGE() and EXTRACT() functions.

SELECT 
  CASE 
    WHEN EXTRACT(YEAR FROM AGE(birthday)) BETWEEN 10 AND 19 THEN '10s'
    WHEN EXTRACT(YEAR FROM AGE(birthday)) BETWEEN 20 AND 29 THEN '20s'
    WHEN EXTRACT(YEAR FROM AGE(birthday)) BETWEEN 30 AND 39 THEN '30s'
    WHEN EXTRACT(YEAR FROM AGE(birthday)) BETWEEN 40 AND 49 THEN '40s'
    WHEN EXTRACT(YEAR FROM AGE(birthday)) BETWEEN 50 AND 59 THEN '50s'
    WHEN EXTRACT(YEAR FROM AGE(birthday)) >= 60 THEN '60+'
    ELSE 'Unknown'
  END AS age_group,
  COUNT(*) AS count
FROM client_info_clean
WHERE birthday IS NOT NULL
GROUP BY age_group
ORDER BY age_group;

What I learned today

From these simple stats, I learned that

The gender ratio was surprisingly skewed.
Most customers came from a small number of cities in Southern California, which is < 30min away from the store by car.
The age distribution was concentrated in the 30s-40s range.

These insights helped guide the next step: segmentation, mapping, and targeting strategies.

I finally have an integrated client information data in my hand, so as next step, I will anonymize other tables and start to analyze!

Resources

View SQL code on Github
SQL Clause - COALESCE
SQL Clause - unnest

#3 Creating integrated client table + Basic Analysis

UNION all client_ids

LEFT JOIN from the master list

Auditing Column Completeness

Basic Demographic Analysis

What I learned today

Subscribe to my newsletter

kyuwon kang

kyuwon kang