Importing Libraries and Read Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read the Data and get the first rows of the DataFrame

data = pd.read_csv(‘files/weight-height.csv’)

Output

Count

The count is descriptive statistics and counts observations.
The count is the most used in statistics and has high importance to evaluate findings.
Example: Making conclusion on childhood weights and the study only had 12 childing (observations). Is that trustworthy?
The count says something about the quality of the study.
Count groups to see the significance of the results.

data.groupby(‘Gender’).count()

Output

Mean

Return the mean of the values over the requested axis.

data.groupby(‘Gender’).mean()

Output

Standard Deviation

Standard deviation is a measure of how dispersed (spread) the data is in relation to the mean.
A low standard deviation means data is close to the mean.
A high standard deviation means data is spread out.

data.groupby(‘Gender’).std()

Output

In the above diagram, we can observe if we take 3 samples of males(multiple times), 2 of them are likely to be within one standard deviation of the mean.

Describe

Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

data.describe()

Output

Correlation

Correlation can be said to measure the linear relationship between two variables measured.
A non-linear relationship could cancel out positive and negative correlations which might end up with zero(no relationship between variables).
Measure the relationship between two variables and ranges from -1 to 1

data.corr()

Correlation Matrix

A correlation matrix is simply a table showing the correlation coefficients between variables.

data.groupby('Gender').corr()

Using a heat map could be more helpful.

import seaborn as sns
axis_corr = sns.heatmap(
correlation,
vmin=-1, vmax=1, center=0,
cmap=sns.diverging_palette(50, 500, n=500),
square=True)

Scatter Plots

SCATTER PLOT is a great tool for identifying correlation.

data.plot.scatter(x=’Height’, y=’Weight’, alpha=.1)

Box Plots

Box plots are a great way to visualize descriptive statistics.
Notice that Q1: 25%, Q2: 50%, and Q3: 75%.
Box plots also help us to identify the Outliers.
Make a box-and-whisker plot from DataFrame columns, optionally grouped by - some other columns.

data[‘Weight’].plot.box(vert=False)

data[‘Height’].plot.box(vert=False)

We can observe a few outliers in the above BoxPlots.(Outliers should be removed for a proper model to be built.)

data.boxplot(column=[‘Height’, ‘Weight’])

BoxPlot grouped by gender

data.boxplot(column=[‘Height’, ‘Weight’], by=’Gender’)

Why Statistics?

Statistics presents information in an easy way.
Gives you an understanding of the data you are working on.
In this blog, we will work on a dataset to understand a few concepts of Statistics.
The dataset is a list of Genders, Heights, and Weights.

Statistics Concepts in Data-Science

Table of contents

Importing Libraries and Read Data

Output

Count

Output

Mean

Output

Standard Deviation

Output

Describe

Output

Correlation

Correlation Matrix

Scatter Plots

Box Plots

Why Statistics?

Statistics presents information in an easy way.

Gives you an understanding of the data you are working on.

In this blog, we will work on a dataset to understand a few concepts of Statistics.

The dataset is a list of Genders, Heights, and Weights.

Subscribe to my newsletter

Aman Singanamala

Aman Singanamala