Introduction to Statistics: The Language of Data


Hi everyone! Welcome to the very first post of Data Science Diaries. I’m super excited to start this journey with a topic that’s the foundation of everything we’ll explore together — Statistics.
Whether you're aiming to become a data scientist, machine learning engineer, or just someone who wants to understand the world through data, statistics is where it all begins.
📜 A Brief History of Statistics
To understand statistics today, it's helpful to know where it came from.
The word statistics comes from the Latin word status, meaning “state.” Originally, it referred to government data: things like population size, taxes, land records, and military resources. Record-keeping of this kind is very old. Ancient Egypt kept census and harvest records to manage labor, taxes, and agriculture along the Nile, and the Babylonians kept detailed astronomical records, which are often described as early forms of statistical data.
But the modern era of statistics took shape between the 18th and early 20th centuries, through the work of scientists and mathematicians like:
Pierre-Simon Laplace, who worked on probability theory,
Carl Friedrich Gauss, who applied the normal distribution (yes, the famous "bell curve") to measurement errors,
and Ronald Fisher, who revolutionized experimental design and hypothesis testing in the early 20th century.
Today, statistics powers almost everything, from scientific discoveries to Netflix recommendations, and from business forecasting to diagnosing diseases with AI.
📈 What Is Statistics?
Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions.
Let’s simplify that:
Imagine you have a huge pile of raw data — numbers, measurements, or responses. Statistics is what helps you make sense of all that and turn it into something useful.
Think of it like cooking:
Data = raw ingredients
Statistics = the recipe that helps turn those ingredients into a delicious dish
Statistics answers questions like:
What's happening in the data?
What patterns or trends exist?
Can I predict something based on this data?
Is the result real or just due to random chance?
🤖 Why Is Statistics So Important in Data Science and Machine Learning?
You might be wondering: "Wait, isn’t machine learning all about algorithms and models? Why do I need statistics?"
The truth is — machine learning is built on top of statistics. Without statistical understanding, you're just running code blindly. Here’s why statistics is absolutely essential in the world of AI and data:
🔹 1. Understanding Data Before Modeling
Before you can train any machine learning model, you need to explore and understand your data:
What’s the average age of your customers?
Are there outliers in income or performance?
Is the data normally distributed?
These questions are answered through descriptive statistics.
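Here is a minimal sketch of what that first look at the data might involve, using only Python's built-in statistics module. The income values are made up for illustration, and the 1.5 × IQR rule used to flag outliers is one common rule of thumb, not the only option.

```python
import statistics

# Hypothetical annual incomes in thousands; the last value is deliberately extreme.
incomes = [32, 35, 38, 40, 41, 43, 45, 47, 50, 250]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# Quartiles split the sorted data into four parts.
# The interquartile range (IQR) is the spread of the middle half of the data.
q1, _, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1

# Rule of thumb: flag values more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in incomes if x < lower_fence or x > upper_fence]

print(f"mean={mean_income}, median={median_income}")
print(f"outliers={outliers}")
```

In this toy dataset the single extreme income pulls the mean well above the median, which is exactly the kind of thing you want to notice before training a model.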
🔹 2. Evaluating Model Performance
Every time you check accuracy, precision, recall, F1-score, or AUC-ROC, you're using statistical metrics.
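To make that concrete, here is a small sketch that computes four of those metrics by hand from counts of correct and incorrect predictions. The labels and predictions are invented for illustration. In practice you would likely use a library, but the arithmetic underneath is the same.

```python
# Hypothetical true labels and model predictions for a binary classifier (1 = positive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))

# Count the four possible outcomes of a binary prediction.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that were correct
precision = tp / (tp + fp)          # of predicted positives, how many were right
recall = tp / (tp + fn)             # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```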
🔹 3. Making Predictions and Decisions
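A central goal of inferential statistics is to infer something about a population using a sample, and machine learning models do something very similar: they learn from training data and are expected to generalize to data they have never seen.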
🔹 4. Reducing Uncertainty
Hypothesis testing, confidence intervals, and probability distributions help you decide whether what you see in the data is meaningful — or just random.
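One way to get a feel for "meaningful or just random" is a permutation test, sketched below with made-up numbers. This is just one simple kind of hypothesis test, picked here because it needs nothing beyond the standard library. The idea: if the group labels did not matter, shuffling them should produce differences as large as the observed one fairly often.

```python
import random
import statistics

# Hypothetical page-load times in seconds for two versions of a site.
group_a = [12.1, 11.8, 12.5, 12.9, 11.6, 12.3, 12.7, 12.0]
group_b = [11.2, 11.0, 11.7, 10.9, 11.4, 11.1, 11.5, 10.8]

observed_diff = statistics.mean(group_a) - statistics.mean(group_b)

rng = random.Random(0)  # fixed seed so the run is repeatable
pooled = group_a + group_b
n_a = len(group_a)
n_permutations = 5000

# Shuffle the pooled values, split them into two fake groups, and count how
# often the fake difference is at least as large as the observed one.
at_least_as_extreme = 0
for _ in range(n_permutations):
    rng.shuffle(pooled)
    diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
    if abs(diff) >= abs(observed_diff):
        at_least_as_extreme += 1

# Adding 1 to both counts keeps the estimated p-value from being exactly zero.
p_value = (at_least_as_extreme + 1) / (n_permutations + 1)

print(f"observed difference = {observed_diff:.3f} s, p-value = {p_value:.4f}")
```

A small p-value here means that random shuffling rarely reproduces a gap this large, so chance alone is an unlikely explanation for the difference.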
So in short: if data science is the brain, then statistics is the logic that powers it.
🧮 Two Main Branches of Statistics
Now let’s get into the two main branches of statistics. Most of what you’ll do in this field falls into one of these two categories:
1. Descriptive Statistics
This is all about summarizing and visualizing data. You're not making predictions here — just describing what’s in front of you.
Key Concepts:
Mean (average)
Median (middle value)
Mode (most frequent value)
Range, variance, standard deviation
Data visualization — bar charts, histograms, pie charts, box plots
Example:
Imagine you surveyed 100 students about their exam scores. Using descriptive statistics, you might find:
Average score = 78
Highest score = 98
Most common score = 85
This helps you understand what’s going on in your dataset.
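Here is how a summary like that might look in Python, as a rough sketch using the built-in statistics module. The scores below are a small made-up list, not the 100-student survey from the example, so the numbers will differ.

```python
import statistics

# Hypothetical exam scores for a small class.
scores = [62, 70, 74, 78, 78, 81, 85, 85, 85, 90, 98]

summary = {
    "mean": statistics.mean(scores),        # average score
    "median": statistics.median(scores),    # middle value of the sorted scores
    "mode": statistics.mode(scores),        # most frequent score
    "range": max(scores) - min(scores),     # highest minus lowest
    "std_dev": statistics.stdev(scores),    # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```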
2. Inferential Statistics
This is where the magic of generalization happens. You use a small sample of data to make predictions or decisions about a larger population.
Key Concepts:
Sampling and estimation
Confidence intervals
Hypothesis testing (like t-tests and chi-square tests)
Regression analysis
p-values and significance levels
Example:
Let’s say you survey a random sample of 100 voters and find that 60% support a candidate. Using inferential statistics, you might estimate that about 60% of all voters in the city support that candidate, give or take roughly 10 percentage points at a 95% confidence level.
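The voter example can be worked through in a few lines. This sketch uses the normal approximation for a proportion, which is one common textbook method, and it assumes the 100 voters were a simple random sample.

```python
import math
from statistics import NormalDist

n = 100          # voters surveyed
p_hat = 0.60     # share of the sample supporting the candidate
confidence = 0.95

# z-score that leaves 2.5% in each tail of the standard normal (about 1.96).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Standard error of a sample proportion, then the margin of error around it.
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = z * standard_error

ci_low = p_hat - margin_of_error
ci_high = p_hat + margin_of_error

print(f"95% CI: {ci_low:.3f} to {ci_high:.3f}")  # 95% CI: 0.504 to 0.696
```

With only 100 respondents the interval is wide, running from about 50% to about 70%. A larger sample would narrow it.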
In short:
Descriptive = "What do I see?"
Inferential = "What can I conclude or predict?"
📘 Final Thoughts
Statistics is not just about numbers — it's about thinking critically with data. Whether you’re analyzing business trends, building a machine learning model, or just trying to understand the world better, statistics gives you the tools to ask the right questions and make smarter decisions.
🔍 What’s Next?
In the next post, we’ll dive deeper into Inferential Statistics, where we’ll explore:
What are confidence intervals?
How does hypothesis testing work?
What do p-values actually mean?
What’s the difference between parametric and non-parametric tests?
And how to apply all of this using Python code.
Stay tuned — this is where statistics really starts to power machine learning and data science!
Written by Suyog Timalsina