Visualising Data with Seaborn - A Tutorial

Table of contents
- 1. Initial Setup and Data Loading
- 2. Visualising Relationships
- 2.1 Scatter Plot (sns.lmplot): Visualising Linear Relationships
- 2.2 Regression Plot (sns.regplot): More Flexible Linear Regression
- 2.3 Joint Plot (sns.jointplot): Bivariate Distribution
- 2.4 Heatmap (sns.heatmap): Visualising Correlations and Matrices
- 2.5 Pair Grid (sns.PairGrid): Visualising Pairwise Relationships
- 2.6 Swarm Plot (sns.swarmplot): Showing Individual Observations
- 3. Visualising Trends
- 3.1 Line Plot (sns.lineplot): Trends Over a Continuous Variable
- 4. Visualising Distributions
- 4.1 Histogram (sns.histplot): Understanding Data Distribution
- 4.2 Box Plot (sns.boxplot): Summarizing Data Distribution
- 4.3 Violin Plot (sns.violinplot): Detailed Distribution with Density
- 4.4 Boxen Plot (Letter-value Plot - sns.boxenplot): Enhanced Distribution Detail
- 4.5 Facet Grid (sns.FacetGrid): Creating Multi-Panel Plots for Distribution
- 4.6 Count Plot (sns.countplot): Visualising Categorical Counts
- 4.7 Count Plot with Hue (sns.countplot with hue): Categorical Distribution with Sub-Categories

Seaborn is a powerful Python library built on Matplotlib that helps you create beautiful and informative statistical graphics. This guide will walk you through various Seaborn plot types using the California Housing dataset.
Since it's not always easy to decide how to best tell the story behind your data, we've broken the chart types into three broad categories to help with this:
Trends: A pattern of change over time or over a continuous variable.
- sns.lineplot: Line charts show trends over time, allowing multiple lines for group comparisons.
Relationship: Understanding connections between variables.
- sns.barplot: Bar charts compare quantities across different groups.
- sns.heatmap: Heatmaps reveal colour-coded patterns in numerical tables.
- sns.scatterplot: Scatter plots show relationships between two continuous variables, with optional colour-coding for a third categorical variable.
- sns.regplot: Adds a regression line to scatter plots, making linear relationships clearer.
- sns.lmplot: Useful for drawing multiple regression lines when groups are colour-coded in a scatter plot.
- sns.swarmplot: Categorical scatter plots show the relationship between a continuous variable and a categorical variable by preventing point overlap.
Distribution: Visualising the possible values a variable can take and their likelihood, often across categories.
- sns.histplot: Histograms show the distribution of a single numerical variable.
- sns.kdeplot: KDE plots (or 2D KDE plots) display a smooth, estimated distribution for one or two numerical variables.
- sns.jointplot: Combines a bivariate plot (for example a scatter, hexbin, or 2D KDE plot) with marginal plots of each individual variable.
- sns.boxplot: Summarises the distribution of a numerical variable, showing median, quartiles, and outliers.
- sns.violinplot: Shows the density distribution of a numerical variable, along with its median and quartiles.
- sns.boxenplot: Provides more detailed quantile information than a box plot, especially for large datasets.
- sns.countplot: Displays the count of observations for each category in a categorical variable, with optional sub-categorization using hue.
- sns.FacetGrid: Allows creating multiple plots (e.g., histograms or KDEs) across different subsets of your data to compare distributions.
1. Initial Setup and Data Loading
First, let's get everything set up by loading the necessary libraries and the California Housing dataset.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
# Fetch the California Housing dataset
housing_data = fetch_california_housing()
# Create a Pandas DataFrame
housing_df = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)
housing_df['target'] = housing_data.target
# Display the first few rows of the DataFrame
print(housing_df.head())
# Set the Seaborn style for better aesthetics
sns.set(style='ticks', palette='Set2')
Cheat Sheet: Initial Setup
Code | Description |
import pandas as pd | Imports Pandas for data manipulation. |
import seaborn as sns | Imports Seaborn for plotting. |
import matplotlib.pyplot as plt | Imports Matplotlib for plot customization. |
from sklearn.datasets import fetch_california_housing | Imports the California Housing dataset. |
housing_df = pd.DataFrame(...) | Converts the dataset into a Pandas DataFrame. |
sns.set(style='ticks', palette='Set2') | Sets the visual style of Seaborn plots. style can be 'white', 'dark', 'whitegrid', 'darkgrid', 'ticks'. palette offers various color schemes. |
Here's a look at the first few rows of the housing_df DataFrame:
MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
2. Visualising Relationships
These plots are designed to show how two or more variables interact with each other, helping to identify correlations, patterns, and dependencies.
2.1 Scatter Plot (sns.lmplot): Visualising Linear Relationships
Scatter plots are great for seeing how two numerical variables relate to each other. Seaborn's lmplot can even add a regression line to show the trend and a confidence interval around it.
def scatterPlot():
    # Create the scatter plot with a regression line
    sns.lmplot(x="AveRooms", y="AveBedrms", data=housing_df)
    # Remove excess chart lines and ticks for a cleaner look
    sns.despine()
    plt.title('Average Rooms vs. Average Bedrooms')
    plt.xlabel('Average Rooms per Household')
    plt.ylabel('Average Bedrooms per Household')
    plt.show()
# To run this plot, uncomment the line below:
# scatterPlot()
What it means: This plot shows the connection between the average number of rooms and bedrooms in a household. You'd expect to see a strong positive correlation, meaning more rooms generally lead to more bedrooms. The line helps highlight this trend, and the shaded area represents the confidence interval for the regression estimate.
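To make the regression line a little less of a black box, here is a minimal sketch of an ordinary least-squares fit in plain Python. It is not Seaborn's implementation, and it leaves out the confidence band entirely; the fit_line helper and the sample numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values, not taken from the California Housing dataset
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
bedrooms = [0.9, 1.0, 1.1, 1.2, 1.3]

slope, intercept = fit_line(rooms, bedrooms)
print(f"bedrooms = {slope:.2f} * rooms + {intercept:.2f}")
```

A positive slope is what "more rooms generally go with more bedrooms" looks like numerically.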
Here's an example of what such a scatter plot might look like:
Cheat Sheet: Scatter Plot (lmplot)
Code | Description |
sns.lmplot(x="col1", y="col2", data=df) | Creates a scatter plot with an optional regression line. x and y specify the columns for the axes. |
sns.despine() | Removes the top and right borders from the plot, making it look cleaner. |
plt.title('Title') | Sets the title of your plot. |
plt.xlabel('Label') | Sets the label for the x-axis. |
plt.ylabel('Label') | Sets the label for the y-axis. |
plt.show() | Displays the plot. |
2.2 Regression Plot (sns.regplot): More Flexible Linear Regression
Similar to lmplot, regplot also plots data and a linear regression model fit. However, regplot is an axes-level function, so it draws onto a single Matplotlib Axes and is easier to combine with other plots.
def regPlot():
    sns.regplot(x="MedInc", y="target", data=housing_df, scatter_kws={'alpha':0.3})
    sns.despine()
    plt.title('Median Income vs. House Value (Regression Plot)')
    plt.xlabel('Median Income')
    plt.ylabel('Median House Value')
    plt.show()
# To run this plot, uncomment the line below:
# regPlot()
What it means: This plot directly shows the linear relationship between median income (MedInc) and median house value (target). The line indicates the best-fit linear regression, and the shaded area is the confidence interval. We can clearly see a positive correlation, meaning higher incomes are generally associated with higher house values. The scatter_kws argument makes the points partly transparent (alpha=0.3), which helps visualise areas of high data density.
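Why does transparency reveal density? Under standard alpha compositing, overlapping translucent markers of the same colour build up opacity. Here is a rough sketch of that arithmetic; the stacked_opacity helper is made up for illustration and ignores marker edges and other rendering details.

```python
def stacked_opacity(alpha, layers):
    """Effective opacity after drawing `layers` same-colour markers on top of each other.

    Each marker lets a fraction (1 - alpha) of what is underneath show through,
    so the background that remains visible after k markers is (1 - alpha) ** k.
    """
    return 1 - (1 - alpha) ** layers


for k in (1, 2, 3, 5, 10):
    print(f"{k:>2} overlapping points -> opacity {stacked_opacity(0.3, k):.3f}")
```

With alpha=0.3 a lone point stays faint, while a pile of ten is almost solid, which is why dense regions stand out.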
Here's an example of what such a regression plot might look like:
Cheat Sheet: Regression Plot (regplot)
Code | Description |
sns.regplot(x="col1", y="col2", data=df, ...) | Plots data and a linear regression model fit. scatter_kws can customize scatter points. |
sns.despine() | Removes the top and right borders from the plot, making it look cleaner. |
plt.title('Title') | Sets the title of your plot. |
plt.xlabel('Label') | Sets the label for the x-axis. |
plt.ylabel('Label') | Sets the label for the y-axis. |
plt.show() | Displays the plot. |
2.3 Joint Plot (sns.jointplot): Bivariate Distribution
Joint plots are fantastic for seeing the relationship between two variables, and they also show their individual distributions along the edges.
def jointplot_distribution():
    sns.jointplot(data=housing_df, x='MedInc', y='Population', kind="hex")
    plt.suptitle('Median Income vs. Population (Hexbin)', y=1.02) # Adjust suptitle position
    sns.despine()
    plt.show()
# To run this plot, uncomment the line below:
# jointplot_distribution()
What it means: This joint plot visualises the relationship between median income and population. The kind="hex" option creates hexagonal bins, where the colour intensity tells you how many data points fall into each hexagon. This is particularly useful for datasets with many overlapping points. The marginal histograms on the sides show the individual distributions of median income and population, providing a comprehensive view of both variables and their interaction.
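The idea behind the bins can be sketched without any plotting at all. The snippet below counts points per cell, using rectangular cells rather than hexagons to keep the arithmetic simple (so it is a stand-in for hexbinning, not the same algorithm). The bin_counts_2d helper and the data are hypothetical.

```python
import math
from collections import Counter


def bin_counts_2d(points, x_width, y_width):
    """Count how many (x, y) points fall into each rectangular cell."""
    counts = Counter()
    for x, y in points:
        # A cell is identified by its column and row index on the grid
        cell = (math.floor(x / x_width), math.floor(y / y_width))
        counts[cell] += 1
    return counts


# Hypothetical (median income, population) pairs
points = [(2.1, 300), (2.4, 450), (2.8, 120), (5.5, 1500), (5.9, 1800), (2.2, 480)]
counts = bin_counts_2d(points, x_width=1.0, y_width=500)

for cell, n in sorted(counts.items()):
    print(cell, n)
```

In the plot, each count is mapped to a colour, so the cell holding four points would be drawn darker than the one holding two.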
Here's an example of what such a joint plot might look like:
Cheat Sheet: Joint Plot
Code | Description |
sns.jointplot(data=df, x='col1', y='col2', kind="type") | Creates a joint plot. kind can be "scatter", "kde", "hist", "reg", "resid", "hex". |
plt.suptitle('Title', y=pos) | Sets a main title for the entire figure, which is useful for joint plots since plt.title applies to just one part. |
2.4 Heatmap (sns.heatmap): Visualising Correlations and Matrices
Heatmaps are excellent for displaying matrices of data where colour intensity represents values. They're often used for correlation matrices or to show relationships between two categorical variables and a continuous one.
def heatmap():
    # Create income and age categories for the heatmap
    housing_df['IncomeCategory'] = pd.cut(housing_df['MedInc'], bins=8, labels=[
        'Very Low', 'Low', 'Low-Med', 'Medium', 'Med-High', 'High', 'Very High', 'Wealthy'])
    housing_df['AgeCategory'] = pd.cut(housing_df['HouseAge'], bins=10, labels=[
        '0-5', '6-10', '11-15', '16-20', '21-25', '26-30', '31-35', '36-40', '41-45', '46-50'])
    # Create a pivot table to aggregate the 'target' (house value) by categories
    heatmap_data = housing_df.pivot_table(
        values='target',
        index='IncomeCategory',
        columns='AgeCategory',
        aggfunc='mean'
    )
    plt.figure(figsize=(10, 6))
    sns.heatmap(heatmap_data, annot=True, cmap='YlOrRd', fmt='.1f')
    plt.title('Average House Value by Income and House Age Category')
    plt.xlabel('House Age Category')
    plt.ylabel('Income Category')
    sns.despine()
    plt.show()
# To run this plot, uncomment the line below:
# heatmap()
What it means: This heatmap shows the average house value (target) based on categories of median income and house age. The color intensity (and annotations) indicate the average house value for each combination. This helps identify which combinations of income and house age tend to have higher or lower property values, making it easy to spot trends like "wealthy households in older homes" potentially correlating with high property values.
Here's an example of what such a heatmap might look like:
Cheat Sheet: Heatmap
Code | Description |
pd.pivot_table(df, values, index, columns, aggfunc) | Creates a pivot table. values is the column to aggregate, index and columns define the rows and columns of the new table, aggfunc is the aggregation function (e.g., 'mean'). |
sns.heatmap(data, annot, cmap, fmt) | Creates a heatmap. annot=True displays the values on the heatmap, cmap sets the color map, fmt formats the annotation text. |
plt.figure(figsize=(width, height)) | Creates a new figure with a specified size. |
2.5 Pair Grid (sns.PairGrid): Visualising Pairwise Relationships
Pair grids (and the simpler pairplot) are fantastic for seeing relationships between multiple variables in your dataset. They create a grid of scatter plots for every pair of variables and show histograms or KDEs for individual variables.
def pairgrid():
    g = sns.PairGrid(housing_df[['AveRooms', 'AveBedrms', 'Population','MedInc']])
    g.map_upper(sns.scatterplot) # Scatter plots on the upper part of the grid
    g.map_lower(sns.scatterplot) # Scatter plots on the lower part of the grid
    g.map_diag(sns.histplot, kde=True) # Histograms with KDE on the diagonal
    plt.suptitle('Pairwise Relationships of Key Housing Variables', y=1.02) # Add a main title
    sns.despine()
    plt.show()
# To run this plot, uncomment the line below:
# pairgrid()
What it means: This pair grid shows you all the relationships between 'AveRooms', 'AveBedrms', 'Population', and 'MedInc'. The top-right and bottom-left sections show scatter plots, while the diagonal displays histograms (with KDE) for each variable. This gives a comprehensive overview of how these features relate and are distributed, helping to quickly identify potential linear or non-linear relationships and common value ranges for each variable.
Here's an example of what such a pair grid might look like:
Cheat Sheet: Pair Grid
Code | Description |
sns.PairGrid(df[columns]) | Sets up a PairGrid with a specific subset of columns. |
g.map(plot_function) | Applies a plot function to all cells in the grid. |
g.map_upper(plot_function) | Applies a plot function to the cells in the upper triangle of the grid. |
g.map_lower(plot_function) | Applies a plot function to the cells in the lower triangle of the grid. |
g.map_diag(plot_function) | Applies a plot function to the cells on the diagonal (where a variable is plotted against itself). |
2.6 Swarm Plot (sns.swarmplot): Showing Individual Observations
Swarm plots are similar to strip plots, but they adjust the points along the categorical axis so that they do not overlap. This gives a better representation of the distribution of values, especially when there are many data points.
def swarmplot_distribution():
    # First, ensure 'AgeCategory' is created if not already
    if 'AgeCategory' not in housing_df.columns:
        housing_df['AgeCategory'] = pd.cut(housing_df['HouseAge'],
                                           bins=[0, 10, 20, 30, 40, 50, 100],
                                           labels=['0-10', '11-20', '21-30', '31-40', '41-50', '50+'])
    # Reduced marker size to limit overlap warnings for dense data
    sns.swarmplot(x="AgeCategory", y="MedInc", data=housing_df, size=3) # 'size' controls marker size
    plt.title('Median Income Distribution by House Age Category')
    plt.xlabel('House Age Category')
    plt.ylabel('Median Income')
    sns.despine()
    plt.show()
# To run this plot, uncomment the line below:
# swarmplot_distribution()
What it means: This swarm plot displays the individual median income values for each house age category. Unlike a box plot or violin plot, which summarize the distribution, the swarm plot shows every single data point, with points "swarming" around the areas of higher density. This allows you to see the precise spread of income values within each age group and identify any clusters or gaps. It's particularly useful when you want to show the exact values of each observation in relation to a categorical variable. Note: For very large datasets, swarmplot can be slow and might still struggle with point placement, leading to warnings. In such cases, consider using sns.stripplot (which allows overlap) or sns.violinplot for a density-based summary.
Here's an example of what such a swarm plot might look like:
Cheat Sheet: Swarm Plot
Code | Description |
sns.swarmplot(x="categorical_col", y="numerical_col", data=df) | Creates a swarm plot showing individual data points, adjusted to avoid overlap. x is the categorical variable, y is the numerical variable. |
plt.title('Title') | Sets the title of your plot. |
plt.xlabel('Label') | Sets the label for the x-axis. |
plt.ylabel('Label') | Sets the label for the y-axis. |
sns.despine() | Removes the top and right borders from the plot. |
plt.show() | Displays the plot. |
3. Visualising Trends
These plots are best suited for showing changes or patterns over a continuous variable, typically time, but can also represent other sequential or ordered variables.
3.1 Line Plot (sns.lineplot): Trends Over a Continuous Variable
Line plots are ideal for showing how something changes or trends over a continuous variable. Here, we'll use it to see how median income changes with house age.
def lineplot():
    # Average median income for each distinct house age
    age_values = housing_df.groupby('HouseAge')['MedInc'].mean().reset_index()
    sns.lineplot(data=age_values, x='HouseAge', y='MedInc')
    plt.title('Average Median Income by House Age')
    plt.xlabel('House Age')
    plt.ylabel('Average Median Income')
    sns.despine()
    plt.show()
# To run this plot, uncomment the line below:
# lineplot()
What it means: This line plot illustrates the average median income as house age increases. It can reveal if there's a particular age range of houses associated with higher or lower average incomes, or if the income generally increases or decreases with house age. This helps identify trends in the economic profiles of different age segments of housing.
Here's an example of what such a line plot might look like:
Cheat Sheet: Line Plot
Code | Description |
df.groupby('col1')['col2'].mean().reset_index() | Calculates the average of col2 for each unique value in col1 and resets the index to make col1 a regular column again. |
sns.lineplot(data=df, x='col1', y='col2') | Creates a line plot. |
4. Visualising Distributions
These plots help to understand the spread, central tendency, and shape of a single variable, or the distribution of a numerical variable across different categories.
4.1 Histogram (sns.histplot): Understanding Data Distribution
Histograms show how a single numerical variable is distributed. Seaborn's histplot can also include a Kernel Density Estimate (KDE), which is a smoothed curve showing the probability density.
def histogram_distribution():
    sns.histplot(housing_df.MedInc, bins=100, kde=True) # Using histplot for newer seaborn
    plt.title('Distribution of Median Income')
    plt.xlabel('Median Income')
    plt.ylabel('Count')
    sns.despine()
    plt.show()
# To run this plot, uncomment the line below:
# histogram_distribution()
What it means: This histogram shows how median income is spread out in the California Housing dataset. The shape of the bars and the KDE curve can tell you if incomes are mostly clustered, skewed (leaning to one side), or have multiple peaks. For MedInc, it typically reveals a right-skewed distribution, indicating that most areas have lower to moderate median incomes, with a long tail extending to higher incomes.
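A quick numerical companion to "right-skewed": when a few large values stretch the tail to the right, they pull the mean above the median. The sketch below uses made-up income values, and the mean-versus-median comparison is only a rule of thumb, not a formal skewness test.

```python
import statistics

# Hypothetical median-income values with a long right tail
incomes = [1.8, 2.2, 2.5, 2.9, 3.1, 3.4, 3.8, 4.5, 7.0, 12.0]

mean = statistics.mean(incomes)
median = statistics.median(incomes)

print(f"mean   = {mean:.2f}")
print(f"median = {median:.2f}")
# The two largest values drag the mean up, while the median barely notices them
print("right-skewed hint:", mean > median)
```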
Here's an example of what such a histogram might look like:
Cheat Sheet: Histogram
Code | Description |
sns.histplot(data, bins, kde) | Creates a histogram. bins controls the number of bars, and kde=True adds the smoothed density curve. |
4.2 Box Plot (sns.boxplot): Summarizing Data Distribution
Box plots are great for summarizing the distribution of numerical data. They clearly show the median, quartiles, and potential outliers.
def boxplot_distribution():
    sns.boxplot(x=housing_df.MedInc) # Use x= for a horizontal box plot
    plt.title('Box Plot of Median Income')
    plt.xlabel('Median Income')
    sns.despine()
    plt.show()
# To run this plot, uncomment the line below:
# boxplot_distribution()
What it means: The box plot for median income shows you the middle 50% of the data (the box), the median (the line inside the box), and the spread of the rest of the data (the "whiskers"). Any points outside the whiskers are considered outliers. For MedInc, this plot typically highlights the range where most median incomes fall and points out any exceptionally high (or low) income districts.
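The numbers behind a box plot are easy to compute yourself. This sketch assumes the common convention that whiskers reach at most 1.5 times the interquartile range beyond the box; the box_stats helper and the data are hypothetical, and the quartile interpolation may differ slightly from what the plotting library uses.

```python
import statistics


def box_stats(values, whis=1.5):
    """Quartiles, whisker fences, and outliers for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box: the middle 50% of the data
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "low_fence": low_fence, "high_fence": high_fence,
        "outliers": outliers,
    }


# Hypothetical values with one obvious outlier
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
for name, value in stats.items():
    print(name, value)
```

The whiskers are drawn to the most extreme data points that still lie inside the fences, and anything beyond them is plotted as an individual point.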
Here's an example of what such a box plot might look like:
Cheat Sheet: Box Plot
Code | Description |
sns.boxplot(x=data) | Creates a horizontal box plot. Use y=data for a vertical one. |
4.3 Violin Plot (sns.violinplot): Detailed Distribution with Density
Violin plots combine the best parts of box plots and kernel density plots. They show the overall shape (density) of the data's distribution, plus its median and quartiles.
def violinplot_distribution():
    sns.violinplot(y=housing_df.HouseAge, orient="v") # Ensure orient="v" for vertical
    plt.title('Violin Plot of House Age')
    plt.ylabel('House Age')
    sns.despine()
    plt.show()
# To run this plot, uncomment the line below:
# violinplot_distribution()
What it means: This violin plot of house age gives you a more detailed look at its distribution than a simple box plot. Wider parts of the "violin" mean more data points are clustered there, while narrower parts mean fewer. The small box inside shows the median and quartiles. This helps to visualise the density of house ages and identify if there are multiple peaks or a smooth distribution.
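The "width" of the violin comes from a kernel density estimate. As a rough sketch of the idea, the code below places a small Gaussian bump on every observation and averages them. The bandwidth is fixed by hand here, which is an assumption for the example (plotting libraries normally choose it automatically), and gaussian_kde is a made-up helper, not a library function.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density at x from the samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centred on each sample
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical house ages with two clusters
ages = [5, 7, 8, 30, 32, 33, 35, 36, 50]
density = gaussian_kde(ages, bandwidth=3.0)

for x in (7, 20, 33, 50):
    print(f"age {x:>2}: density {density(x):.4f}")
```

The density is high around 33, where five observations sit close together, and nearly zero around 20, where there are none; a violin would be wide at the first and pinched at the second.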
Here's an example of what such a violin plot might look like:
Cheat Sheet: Violin Plot
Code | Description |
sns.violinplot(y=data) | Creates a vertical violin plot. Use x=data for a horizontal one. orient="v" explicitly sets vertical orientation. |
4.4 Boxen Plot (Letter-value Plot - sns.boxenplot): Enhanced Distribution Detail
Boxen plots, also known as letter-value plots, are an enhancement of the traditional box plot. They are designed to provide more detailed information about the shape of a distribution, especially for larger datasets, by showing more quantiles (the "letter values") in the tails. This allows for a richer understanding of the distribution's extremities than a standard box plot provides.
def boxenplot_distribution():
    # First, ensure 'AgeCategory' is created if not already
    if 'AgeCategory' not in housing_df.columns:
        housing_df['AgeCategory'] = pd.cut(housing_df['HouseAge'],
                                           bins=[0, 10, 20, 30, 40, 50, 100],
                                           labels=['0-10', '11-20', '21-30', '31-40', '41-50', '50+'])
    sns.boxenplot(x="AgeCategory", y="MedInc", data=housing_df)
    plt.title('Median Income Distribution by House Age Category (Boxen Plot)')
    plt.xlabel('House Age Category')
    plt.ylabel('Median Income')
    sns.despine()
    plt.show()
# To run this plot, uncomment the line below:
# boxenplot_distribution()
What it means: This boxen plot provides a refined view of the median income distribution across different house age categories. While similar to a box plot, its nested boxes represent a larger number of quantiles, giving a more granular insight into the density and spread of data, particularly in the tails of the distribution. This helps in understanding the subtle differences in income spread for various house ages and identifying potential skewness more clearly.
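To get a feel for the nested boxes, here is an approximate sketch of letter values: the median, then the quartiles, then the eighths, the sixteenths, and so on, each pair reaching further into the tails. It uses plain interpolated quantiles, and the number of levels is fixed by hand, both of which are simplifications (Seaborn decides how many levels to draw from the data). The letter_values helper is hypothetical.

```python
import statistics


def letter_values(values, depth=4):
    """Return [(tail_fraction, lower, upper)] for fractions 1/2, 1/4, 1/8, ..."""
    result = []
    for k in range(1, depth + 1):
        n = 2 ** k
        cuts = statistics.quantiles(values, n=n, method="inclusive")
        # The first and last cut points leave a fraction 1/n in each tail
        result.append((1 / n, cuts[0], cuts[-1]))
    return result


data = list(range(1, 18))  # hypothetical data: the integers 1 to 17
for fraction, lower, upper in letter_values(data):
    print(f"tail fraction {fraction:<7} box from {lower} to {upper}")
```

Each successive pair would be drawn as a narrower box outside the previous one, so the tails get their own boxes instead of being reduced to whiskers.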
Here's an example of what such a boxen plot might look like:
Cheat Sheet: Boxen Plot
Code | Description |
sns.boxenplot(x="categorical_col", y="numerical_col", data=df) | Creates a boxen plot showing enhanced quantile information for a numerical variable across categories. x is the categorical variable, y is the numerical variable. |
plt.title('Title') | Sets the title of your plot. |
plt.xlabel('Label') | Sets the label for the x-axis. |
plt.ylabel('Label') | Sets the label for the y-axis. |
sns.despine() | Removes the top and right borders from the plot. |
plt.show() | Displays the plot. |
4.5 Facet Grid (sns.FacetGrid): Creating Multi-Panel Plots for Distribution
Facet grids allow you to visualise the distribution of a variable or the relationship between multiple variables across different subsets of your dataset. While also useful for relationships, they are powerful for breaking down distributions.
def facetgrid():
    # Create age groups for faceting
    housing_df['AgeGroup'] = pd.cut(housing_df['HouseAge'], bins=4, labels=[
        'New (0-13)', 'Moderate (13-26)', 'Old (26-39)', 'Very Old (39-52)'])
    # Create a FacetGrid with rows based on 'AgeGroup'
    g = sns.FacetGrid(housing_df, row="AgeGroup", height=3, aspect=2, margin_titles=True)
    g.map(plt.hist, "MedInc", bins=50, alpha=0.6, color="teal") # Using hist to show distribution across facets
    g.set_axis_labels("Median Income", "Count")
    g.set_titles(row_template='House Age: {row_name}') # Set individual row titles
    plt.suptitle('Median Income Distribution by House Age Group (Facet Grid)', y=1.02)
    plt.tight_layout() # Adjust layout to prevent labels overlapping
    plt.show()
# To run this plot, uncomment the line below:
# facetgrid()
What it means: This facet grid displays histograms of 'Median Income' for different house age groups. Each row represents a different age group, allowing for a quick comparison of the income distribution across these groups. This helps to see if the income spread changes significantly based on the age of the property, revealing distinct demographic or economic characteristics for each age bracket.
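A note on where the age-group boundaries come from: passing an integer number of bins to pd.cut splits the observed range into equal-width intervals. The sketch below reproduces that arithmetic, assuming HouseAge runs from 1 to 52 (an assumption for the example) and ignoring the small adjustment pandas makes to the lowest edge so the minimum is included. equal_width_edges is a made-up helper.

```python
def equal_width_edges(lo, hi, bins):
    """Edges of `bins` equal-width intervals covering [lo, hi]."""
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins + 1)]


# Assumed range of HouseAge: 1 to 52 years
edges = equal_width_edges(1, 52, 4)
print(edges)

# Pair up neighbouring edges to see each interval
for left, right in zip(edges, edges[1:]):
    print(f"{left} to {right}")
```

So the labels in the code ('New (0-13)', 'Moderate (13-26)', and so on) are rounded descriptions of the intervals rather than exact cut points.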
Here's an example of what such a facet grid might look like:
Cheat Sheet: Facet Grid (for Distribution)
Code | Description |
sns.FacetGrid(data, row, col, hue, height, aspect) | Initializes a FacetGrid. row and col define the variables to create rows and columns of subplots. hue can color plot elements. height and aspect control subplot size. |
g.map(plot_function, *args, **kwargs) | Applies a plotting function (e.g., plt.hist, sns.kdeplot) to each facet to show distributions. |
g.set_axis_labels("X Label", "Y Label") | Sets labels for the x and y axes of all subplots. |
g.set_titles(row_template='{row_name}') | Sets titles for the individual facets. |
plt.tight_layout() | Adjusts subplot params for a tight layout. |
4.6 Count Plot (sns.countplot): Visualising Categorical Counts
Count plots display the number of observations in each category using bars. It is essentially a histogram for a categorical variable. This is useful for quickly seeing the frequency of each unique value in a categorical column.
def countplot_distribution():
    # First, ensure 'AgeCategory' is created if not already
    if 'AgeCategory' not in housing_df.columns:
        housing_df['AgeCategory'] = pd.cut(housing_df['HouseAge'],
                                           bins=[0, 10, 20, 30, 40, 50, 100],
                                           labels=['0-10', '11-20', '21-30', '31-40', '41-50', '50+'])
    sns.countplot(x="AgeCategory", data=housing_df)
    plt.title('Count of Houses by Age Category')
    plt.xlabel('House Age Category')
    plt.ylabel('Count')
    sns.despine()
    plt.show()
# To run this plot, uncomment the line below:
# countplot_distribution()
What it means: This count plot visualises the number of houses falling into each defined age category. It provides a direct count for each group, helping to understand the distribution of house ages within the dataset. For instance, you can quickly see which age ranges of houses are most (or least) common in the California Housing data.
Here's an example of what such a count plot might look like:
Cheat Sheet: Count Plot
Code | Description |
sns.countplot(x="categorical_col", data=df) | Creates a count plot showing the number of observations in each category. x is the categorical variable. |
plt.title('Title') | Sets the title of your plot. |
plt.xlabel('Label') | Sets the label for the x-axis. |
plt.ylabel('Label') | Sets the label for the y-axis. |
sns.despine() | Removes the top and right borders from the plot. |
plt.show() | Displays the plot. |
4.7 Count Plot with Hue (sns.countplot with hue): Categorical Distribution with Sub-Categories
The Count Plot with Hue extends the basic count plot by adding another categorical variable to further subdivide the bars. This allows you to compare the counts of sub-categories within each main category, providing a richer understanding of the data's composition.
def countplot_with_hue():
    # Ensure 'AgeCategory' and 'IncomeCategory' are created if not already
    if 'AgeCategory' not in housing_df.columns:
        housing_df['AgeCategory'] = pd.cut(housing_df['HouseAge'],
                                           bins=[0, 10, 20, 30, 40, 50, 100],
                                           labels=['0-10', '11-20', '21-30', '31-40', '41-50', '50+'])
    if 'IncomeCategory' not in housing_df.columns:
        housing_df['IncomeCategory'] = pd.cut(housing_df['MedInc'], bins=3, labels=['Low Income', 'Medium Income', 'High Income']) # Simplified for example
    sns.countplot(x="AgeCategory", hue="IncomeCategory", data=housing_df, palette="viridis")
    plt.title('Count of Houses by Age & Income Category')
    plt.xlabel('House Age Category')
    plt.ylabel('Count')
    plt.legend(title='Income Level')
    sns.despine()
    plt.show()
# To run this plot, uncomment the line below:
# countplot_with_hue()
What it means: This count plot with hue visualises the number of houses within each AgeCategory, further broken down by their IncomeCategory. You can see, for example, how many "Low Income" households reside in "0-10" year old houses versus "Medium Income" households. This helps in understanding the joint distribution of two categorical variables and identifying which income levels are more prevalent in different age groups of housing.
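Under the hood, a count plot with hue is just a tally of category pairs. A minimal sketch with invented records (the category names mirror the ones above, but the rows are not from the dataset):

```python
from collections import Counter

# Hypothetical (age category, income category) pairs, one per district
houses = [
    ("0-10", "Low Income"), ("0-10", "Low Income"), ("0-10", "Medium Income"),
    ("11-20", "Low Income"), ("11-20", "Medium Income"), ("11-20", "Medium Income"),
    ("11-20", "High Income"),
]

# Each distinct pair becomes one bar; its height is the count
pair_counts = Counter(houses)

for (age, income), n in sorted(pair_counts.items()):
    print(f"{age:<6} {income:<14} {n}")
```

Bars sharing the same first element are grouped together on the x-axis, and the second element decides the colour.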
Here's an example of what such a count plot with hue might look like:
Cheat Sheet: Count Plot with Hue
Code | Description |
sns.countplot(x="categorical_col1", hue="categorical_col2", data=df, palette="color_map") | Creates a count plot with bars split by a second categorical variable (hue). palette sets the color scheme for the hue categories. |
plt.title('Title') | Sets the title of your plot. |
plt.xlabel('Label') | Sets the label for the x-axis. |
plt.ylabel('Label') | Sets the label for the y-axis. |
plt.legend(title='Legend Title') | Displays and titles the legend, which is crucial when hue is used. |
sns.despine() | Removes the top and right borders from the plot. |
plt.show() | Displays the plot. |
This tutorial provides a comprehensive overview of how to use Seaborn for data visualisation, categorised by the type of insights you want to gain from your data.
Written by
