Day 15 - Understanding Your Data by Asking Questions
When working with data, the first step to making sense of it is to ask the right questions. These questions guide your exploration and help you gather insights before diving into any analysis or modeling.
How Big is the Data?
Check the number of rows and columns with df.shape to understand the dataset's scale. Note that shape is an attribute, not a method, so it takes no parentheses.
What Does the Data Look Like?
Preview the first few rows with df.head() to get a sense of the columns and values.
What is the Data Type of Columns?
Use df.dtypes (also an attribute) to identify each column's data type and confirm it is suitable for the analysis.
Are There Any Missing Values?
Count the missing values in each column with df.isnull().sum() and decide how to handle them.
How Does the Data Look Mathematically?
Use df.describe() to get summary statistics for the numeric columns, such as the mean, the median (reported as the 50% row), and the standard deviation.
Are There Duplicate Values?
Count duplicate rows with df.duplicated().sum() and remove them if necessary.
How Are the Columns Correlated?
Check the relationships between numeric columns using df.corr() and visualize the correlations to guide feature selection.
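The questions about missing and duplicate values both end with a decision about what to do next. As a minimal sketch of one possible cleanup, assuming a small made-up DataFrame (the columns name, age, and city and all of their values are hypothetical, invented only for illustration), you could drop exact duplicate rows and then fill the remaining gaps. Filling with the median or a placeholder is just one strategy; the right choice depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "name": ["Asha", "Bikram", "Bikram", "Chandra"],
    "age": [25.0, 30.0, 30.0, np.nan],
    "city": ["Kathmandu", "Pokhara", "Pokhara", None],
})

# Count the problems before changing anything
missing_per_column = df.isnull().sum()
duplicate_rows = df.duplicated().sum()

# One possible strategy: drop exact duplicate rows, then fill the gaps
cleaned = df.drop_duplicates().copy()
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned["city"] = cleaned["city"].fillna("Unknown")

print("Missing values per column:")
print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print(cleaned)
```

Counting first and cleaning second keeps a record of how much data was affected, which is useful when you later explain your results.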
Example: Data Exploration with Pandas
import pandas as pd
# Example: Assume we have a CSV file named 'data.csv'
df = pd.read_csv('data.csv')
# 1. How Big is the Data?
print("Data Shape (Rows, Columns):", df.shape)
# 2. What Does the Data Look Like? (View the first 5 rows)
print("\nFirst 5 Rows of the Data:")
print(df.head())
# 3. What is the Data Type of Columns?
print("\nData Types of Columns:")
print(df.dtypes)
# 4. Are There Any Missing Values?
print("\nMissing Values in Each Column:")
print(df.isnull().sum())
# 5. How Does the Data Look Mathematically? (Summary Statistics)
print("\nSummary Statistics of Numerical Columns:")
print(df.describe())
# 6. Are There Duplicate Values?
print("\nNumber of Duplicate Rows:")
print(df.duplicated().sum())
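# 7. How Are the Columns Correlated?
print("\nCorrelation Matrix:")
# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.0+
print(df.corr(numeric_only=True))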
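To make the correlation step more concrete, here is a rough sketch of how a correlation matrix might guide feature selection. It assumes pandas 1.5 or newer (for the numeric_only parameter) and uses a made-up DataFrame; the column names student, hours_studied, sleep_hours, and score are hypothetical, with score standing in for whatever target you care about.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "student": ["a", "b", "c", "d", "e"],   # text column, ignored by corr
    "hours_studied": [1, 2, 3, 4, 5],
    "sleep_hours": [8, 6, 7, 5, 7],
    "score": [52, 58, 65, 70, 78],
})

# Correlation matrix over the numeric columns only
corr_matrix = df.corr(numeric_only=True)

# Rank the other numeric columns by absolute correlation with the target
ranking = (
    corr_matrix["score"]
    .drop("score")
    .abs()
    .sort_values(ascending=False)
)

print(corr_matrix)
print("\nColumns ranked by absolute correlation with 'score':")
print(ranking)
```

A high correlation only hints that a column may be a useful feature. It measures linear relationships and says nothing about causation, so treat the ranking as a starting point rather than a final answer.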
Written by Nischal Baidar