Probability and Statistics for AI


Math for AI: Probability and Statistics with a Computer Science Perspective (Using Python)


Introduction

After exploring Linear Algebra for AI, the next logical step in building a strong foundation for Artificial Intelligence is understanding Probability and Statistics. These two branches of mathematics are essential for reasoning under uncertainty, which is at the heart of most AI applications.

In real-world problem solving we often start with qualitative ideas, but once we express them as numbers (quantitative data) we can begin to apply statistics. In short, statistics is the science of making decisions based on data.

Why Probability & Statistics Matter in Computer Science

In pure mathematics we usually deal with a handful of values (say, 10 to 20), which is manageable by hand. In computer science, and especially in AI, we work with thousands to millions of data points. This scale creates challenges that make statistics and probability crucial for:

  • Data Science (large-scale statistics)

  • Machine Learning (glorified probability)

  • Algorithm analysis, cryptography, and system modeling

They help us model and analyze uncertainty, a core concern in most computing problems.


Getting Started with Python for Stats & Probability

We’ll use Python as it's general-purpose and well-supported for scientific computing.

We’ll also use a virtual environment (via Conda) to isolate packages and ensure compatibility across systems. Because the environment's package list can be exported and recreated elsewhere, the setup is easy to reproduce on another machine, loosely similar to moving a virtual machine.

Before setting up the environment, it's helpful to explore an open-source GitHub repository that provides foundational code and material:

git clone https://github.com/recluze/stats-prob-cs

This repository includes course notebooks and examples for deeper understanding. Browsing the code will help you get comfortable with the basics.

Note: I'm using Arch Linux. Some commands may vary slightly on other distros. For best results, use Linux or macOS over Windows.

Install Python:

sudo pacman -Syu python

Set up a virtual environment:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda create -n stat-env python=3.11
conda activate stat-env

Install required libraries:

conda install jupyterlab pandas matplotlib
  • jupyterlab: Web-based IDE for notebooks.

  • pandas: For structured data manipulation.

  • matplotlib: For plotting graphs and visualizations.
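Once the installs finish, a quick sanity check (not part of the course material; the file name is just a suggestion) confirms the packages import correctly before launching `jupyter lab`:

```python
# check_env.py - run inside the stat-env environment
import numpy as np        # installed as a pandas dependency
import pandas as pd
import matplotlib

print("numpy     :", np.__version__)
print("pandas    :", pd.__version__)
print("matplotlib:", matplotlib.__version__)

# Build a tiny DataFrame and summarize it to check everything end to end.
df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) ** 2})
print(df.describe())
```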


Understanding Data Types

In this course, we deal with two types of data:

  1. Artificial Data – Collected through surveys (manual input).

  2. Organic Data – Generated by systems and logs (e.g., website user activity).

Organic data is especially relevant in computing as it reflects real-world system behavior.

In machine learning, models commonly assume the data is i.i.d. (independent and identically distributed): every sample is drawn from the same distribution and does not depend on the other samples.
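As a rough illustration (the distribution and sample size below are arbitrary choices, not taken from the course material), drawing i.i.d. samples with NumPy looks like this:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# 1,000 i.i.d. samples: every draw comes from the same Normal(0, 1)
# distribution and is independent of all the other draws.
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

print("sample mean:", samples.mean())  # close to 0
print("sample std :", samples.std())   # close to 1
```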


One-Hot Encoding

When comparing categorical variables (e.g., male vs. female), one-hot vectors are used to represent them numerically without implying order or magnitude:

Example:

male   → [1, 0]
female → [0, 1]

This avoids implying an order or magnitude, such as "male is greater than female", which can sneak in when plain scalar codes are used.

If we represent unordered categorical data as scalars instead of arrays, we distort the structure and meaning of that data.
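A minimal sketch of one-hot encoding with pandas (the column and category names are just illustrative):

```python
import pandas as pd

# A small categorical column; the values carry no order or magnitude.
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# get_dummies expands the column into one binary column per category.
one_hot = pd.get_dummies(df, columns=["gender"], dtype=int)
print(one_hot)
#    gender_female  gender_male
# 0              0            1
# 1              1            0
# 2              1            0
# 3              0            1
```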


Foundations of Probability

Probability is axiomatic: it rests on a small set of agreed-upon rules that let us model uncertainty consistently. It's not about proving what's true, but about agreeing on how to reason about unknowns. The standard (Kolmogorov) axioms say that probabilities are never negative, the whole sample space has probability 1, and the probabilities of mutually exclusive events add up.
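A quick simulation (a toy example, not from the repository) shows empirical frequencies behaving according to those rules:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die

# Empirical probability of each face: all non-negative, and they sum to 1.
probs = {face: float(np.mean(rolls == face)) for face in range(1, 7)}
print(probs)
print("total:", sum(probs.values()))  # 1.0

# Mutually exclusive events add: P(roll is 1 or 2) == P(1) + P(2)
p_1_or_2 = np.mean((rolls == 1) | (rolls == 2))
print(p_1_or_2, probs[1] + probs[2])
```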


Topics and Their Computer Science, Hardware & Software Perspectives

Probability and statistics touch almost every aspect of computing. Below is a brief overview of key topics in this domain, described from three perspectives: computer science use cases, hardware impact, and software support.

These concepts help build intelligent systems that make decisions, process uncertainty, and handle large-scale data.

| Topic | Description | Computer Science Perspective | Hardware Perspective | Software Perspective |
| --- | --- | --- | --- | --- |
| Data Types | Different forms of data: structured, unstructured | Organize and clean datasets | Efficient memory access | pandas, NumPy |
| One-Hot Vectors | Binary representation for categories | Preprocessing for ML models | Stored efficiently in RAM | Used in embeddings, TensorFlow |
| Histograms & Visualizations | Graphical summary of data distribution | Detect patterns and anomalies | GPU acceleration for big data | matplotlib, seaborn |
| Central Tendency | Mean, Median, Mode | Summarize large datasets | Requires optimized storage | numpy.mean(), pandas functions |
| Variance & Standard Deviation | Measure of data spread | Feature scaling and normalization | Computed using parallel float ops | scikit-learn, NumPy |
| Entropy & Information | Measure of uncertainty | Decision trees, NLP models | GPU-based entropy calculation | sklearn, scipy.stats |
| Probability & Events | Likelihood of an event | Foundation of model prediction | RNG hardware/simulation | Python's random, numpy.random |
| Conditional Probability | Probability given a condition | Bayesian models, filters | Efficient table lookups | scikit-learn, PyMC3 |
| Bayes' Rule | Update belief with evidence | Spam filters, medical diagnosis | Uses logarithmic transformations | Naive Bayes classifiers |
| Random Variables | Variables with probabilistic outcomes | Modeling random processes | Supported by dedicated chips | NumPy, PyMC3, TensorFlow |
| Distributions (PMF/PDF) | Probability models for data | Classification, noise estimation | SIMD instructions for speed | scipy.stats, torch.distributions |
| Joint & Marginal Probability | Modeling multivariate scenarios | NLP, image classification | Parallel computation across cores | pandas, statsmodels |
| Expected Values | Long-run average outcomes | Policy evaluation in RL | Summation using vector units | NumPy, torch.mean |
| KL Divergence | Measure of distribution difference | Regularization, GANs | Log-sum-exp ops on GPU | PyTorch, TensorFlow |
| Bayesian Inference | Probabilistic belief update | Uncertainty-aware learning | Monte Carlo simulations | PyMC3, NumPyro, Edward |
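To make a few of these rows concrete, here is a small sketch (the data values are made up purely for illustration) computing central tendency, spread, and entropy with NumPy and pandas:

```python
import numpy as np
import pandas as pd

# Toy dataset, made-up values for illustration only.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Central tendency
print("mean  :", np.mean(data))              # 5.0
print("median:", np.median(data))            # 4.5
print("mode  :", pd.Series(data).mode()[0])  # 4

# Spread
print("variance:", np.var(data))             # 4.0
print("std dev :", np.std(data))             # 2.0

# Entropy of a discrete distribution, H = -sum(p * log2(p)).
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                             # ignore zero-probability outcomes
    return -np.sum(p * np.log2(p))

print("fair coin  :", entropy([0.5, 0.5]))   # 1.0 bit
print("biased coin:", entropy([0.9, 0.1]))   # ~0.47 bits
```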

Summary

  • Probability and statistics are crucial for making decisions under uncertainty, which is common in all computing systems.

  • They form the mathematical backbone of machine learning and data science.

  • From a hardware view, modern CPUs/GPUs accelerate statistical computation.

  • On the software side, Python libraries like NumPy, Pandas, SciPy, and PyTorch simplify implementation.

Whether you're analyzing logs, training models, or building robotics systems on boards like the Jetson Nano, understanding probability and statistics is essential to designing intelligent, efficient, and scalable solutions.

P.S.: If you spot any mistakes, feel free to point them out; we're all here to learn together! 😊

Haris
FAST-NUCES
BS Computer Science | Class of 2027

🔗 Portfolio: zenvila.github.io

🔗 GitHub: github.com/Zenvila

🔗 LinkedIn: linkedin.com/in/haris-shahzad-7b8746291
🔬 Member: COLAB (Research Lab)
