Introduction to Exploratory Data Analysis (EDA) in SQL

Hello everyone! I’m Dhyuthidhar Saraswathula. If you’re new here, I write blogs on Data Science and Computer Science topics.

So, buckle up! Today, we’re diving into an essential data science concept: Exploratory Data Analysis (EDA). In SQL, EDA means using queries to explore, summarize, and understand data. It's the first critical step in any data project, giving us insights and clarity before deeper analysis or modeling.


Why Do We Need EDA?

EDA helps us answer several important questions about our data:

  • Data Structure: What’s the structure of our data?

  • Data Quality: Are there any missing values or outliers?

  • Value Distribution: What’s the distribution of values in each column?

  • Relationships: What correlations or relationships exist between variables?

Fun Fact: The term "Exploratory Data Analysis" was popularized by John Tukey in his 1977 book Exploratory Data Analysis, where he emphasized the importance of visualizations like boxplots and scatterplots to understand data.


Why Is EDA Growing So Fast?

A few reasons drive the rapid growth of EDA:

  1. Technological Advancements: Computing tools have evolved, making data analysis more accessible.

  2. Big Data: We now have vast amounts of data across fields that require effective analysis.

  3. Cross-Disciplinary Demand: EDA is valuable in finance, healthcare, marketing, and beyond.


Data Sources in SQL

Most SQL data comes in structured tables. In real-world scenarios, however, data often starts unstructured—like sensor measurements, images, or text—and needs to be organized into a structured format to perform SQL-based analysis.


Types of Data in SQL

Understanding data types is crucial as they impact the analyses and visualizations we can perform. In SQL, we usually work with:

  • Numerical Data

    • Continuous: Values that can take any number within a range (e.g., salary, age).

    • Discrete: Countable values (e.g., number of orders).

  • Categorical Data

    • Binary: Data with two categories (e.g., 0/1 for yes/no).

    • Ordinal: Data with a specific order (e.g., ratings like 1-5).

Knowing these types allows us to choose the right SQL functions and analysis techniques.


Basic SQL EDA Techniques

Let's explore some SQL functions for EDA:

  1. COUNT()
    Use COUNT() to understand the quantity of data. If we want to count the number of students in the Computer Science department, we’d write:

     SELECT COUNT(*) AS Count_of_Students
     FROM Students
     WHERE Department = 'Computer Science';
    
  2. SUM()
    Use SUM() to get the total of a numeric column. To find the total revenue in a Transactions table:

     SELECT SUM(Amount) AS Total_Revenue
     FROM Transactions;
    
  3. AVG()
    Calculate averages with AVG(). To find the average age of students in the Computer Science department:

     SELECT AVG(Age) AS Average_Age
     FROM Students
     WHERE Department = 'Computer Science';
    
  4. MIN() and MAX()
    These functions return the smallest and largest values. To find the youngest and oldest students in the Computer Science department:

     SELECT MIN(Age) AS Youngest_Age, MAX(Age) AS Oldest_Age
     FROM Students
     WHERE Department = 'Computer Science';
    

Grouping Data with GROUP BY

GROUP BY allows us to aggregate data by specific groups. To get the number of students in each department:

SELECT Department, COUNT(*) AS Student_Count
FROM Students
GROUP BY Department;

Filtering Aggregated Data with HAVING

The HAVING clause filters data after grouping. To find departments with more than 50 students:

SELECT Department, COUNT(*) AS Student_Count
FROM Students
GROUP BY Department
HAVING COUNT(*) > 50;

Note: WHERE filters rows before grouping, while HAVING filters grouped results.


Summary of EDA Insights

Using SQL, we can quickly gain critical insights like:

  • Total Records: Using COUNT()

  • Summing Values: Using SUM()

  • Averages: Using AVG()

  • Ranges: Using MIN() and MAX()

These insights lay the foundation for deeper analysis and help us understand our data’s characteristics before applying complex models.


Final Thoughts

That wraps up our introduction to EDA in SQL! EDA in SQL is a powerful way to explore structured data and gain essential insights. In my next blog, I’ll dive into using advanced SQL functions and joins for even richer data exploration. Stay tuned!

0
Subscribe to my newsletter

Read articles from S.S.S DHYUTHIDHAR directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

S.S.S DHYUTHIDHAR
S.S.S DHYUTHIDHAR

I am a student. I am enthusiastic about learning new things.