From Ignorance to Evidence


Imagine I hand you a coin and ask: “What is the probability of getting heads when you flip this coin?” Most people would say \(0.5\): half heads, half tails. Now imagine you flip this coin \(10\) times and observe \(6\) heads and \(4\) tails. Would you still say the probability is \(0.5\)? Probably not. Now the probability of getting heads seems to be \(0.6\). But wasn’t it a fair coin?
Why this change? Why do we start with \(0.5\), and then move to \(0.6\)? And more deeply, what does it even mean to assign a “probability“ to something as simple as a coin flip?
In this blog, we’ll explore this question through both a mathematical and a philosophical lens. Along the way, we’ll uncover powerful tools like maximum entropy and maximum likelihood, and meet the two main schools of statistical thought: Bayesian and Frequentist inference. Let’s begin with our first assumption.
Part 1: Ignorance and Fairness
When we say a coin has \(50\%\) chance of landing heads, what are we really saying?
We are saying that we do not know anything special about this coin. It has two sides, and we have no reason to believe it favors one over the other. This idea is ancient. It goes back to Laplace’s Principle of Insufficient Reason: if we have no reason to favor one outcome over another, we assign them equal probabilities. In our case:
\[\text{Probability of heads = Probability of tails = 0.5}\]
This is a purely logical assignment, based on symmetry and ignorance. We don’t need any experiments, just the knowledge that the coin has two sides. But can we make this more formal?
Part 2: What is Entropy and Why Does It Matter?
Entropy, in information theory, measures uncertainty. The more uncertain we are, the higher the entropy. For a coin flip (a binary outcome), the entropy is defined as:
\[H(p) = -p \log_2p - (1-p)\log_2(1-p)\]
This function reaches its maximum value when \(p=0.5\). That means: if we want to represent maximum ignorance, we should choose \(p=0.5\), because that leads to the most uncertain, or least committed, probability distribution.
Think of it this way: assigning \(0.5\) means “I know nothing; I treat heads and tails as equally likely,” while assigning \(0\) or \(1\) means “I am absolutely certain the coin will land one way.” Entropy gives us a principled way to quantify ignorance. So far so good. But what happens when we actually start flipping the coin?
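A few lines of Python make this concrete; here is a minimal sketch of the binary entropy function defined above, showing that it peaks at \(p = 0.5\):

```python
import math

def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli(p) outcome."""
    if p in (0.0, 1.0):
        return 0.0  # total certainty carries zero uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0 bit: maximum ignorance
print(binary_entropy(0.9))  # ~0.469 bits: much less uncertain
```

Any value other than \(0.5\) smuggles in a commitment we have no evidence for, and the entropy number drops accordingly.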
Part 3: Evidence Enters
Let us say we flip the coin \(10\) times and see \(6\) heads and \(4\) tails. Should we change our belief? Our initial belief \((0.5)\) came from ignorance. Now we have data, and data should change beliefs. This is where the maximum likelihood estimate (MLE) comes in. The idea of MLE is simple: choose the probability that makes the observed data most likely.
For a coin, we model the number of heads using a binomial distribution. The likelihood of seeing \(6\) heads in \(10\) flips, assuming the true probability of heads is \(p\), is:
\[L(p) = \binom{10}{6} \cdot p^6 \cdot (1-p)^4\]
To find the best value of \(p\), we maximize this function. Taking the logarithm and setting its derivative to zero, \(\frac{6}{p} - \frac{4}{1-p} = 0\), the value of \(p\) that makes this expression largest is:
\[\hat{p} = \frac{6}{10} = 0.6\]
This is the maximum likelihood estimate: the value of \(p\) that best explains the data.
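As a sanity check, a short Python snippet (using the flip counts above) confirms that the binomial likelihood peaks at \(0.6\):

```python
from math import comb

def likelihood(p: float, heads: int = 6, flips: int = 10) -> float:
    """Binomial likelihood of observing `heads` in `flips` given bias p."""
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)

# A brute-force grid search agrees with the closed form p_hat = heads / flips.
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # 0.6
```

The grid search is deliberately naive: it exists only to show that no other value of \(p\) explains the data better than \(6/10\).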
A Physical Analogy: Wobbly Coin
Suppose this coin is slightly uneven, a little heavier on one side. If you had no idea about this defect, you would start with \(0.5\). But after flipping it several times, you notice it lands on heads more often. You begin to suspect it is biased. That is what MLE is doing: it says, “Forget your assumptions and look at what is actually happening.”
So now we have a tension:
Entropy says: \(0.5\) is the most unbiased guess when you know nothing.
Likelihood says: \(0.6\) best explains what you have seen.
What should you believe?
Part 4: Philosophical Fork
This is where philosophy steps in. There are two major camps in statistics, each with its own answer.
The Frequentist View
Frequentists say that probability is about frequency in the long run. So, they treat the true value of \(p\) as fixed but unknown. Our job is to estimate it. In this view, the maximum likelihood estimate is the best estimate after seeing the observations.
If you flip the coin more times, you will get a more accurate estimate. Frequentists don’t assign probabilities to the parameter \(p\) itself: the true bias either is \(0.6\) or it isn’t.
Pros:
Simple and intuitive.
Great for large data and repeated trials.
Cons:
Does not easily incorporate prior knowledge.
Cannot say things like “there is a \(70\%\) chance that the coin’s bias is between \(0.5\) and \(0.7\).“
The Bayesian View
Bayesians think differently. They say probability is about belief. You can assign probabilities to hypotheses, like “the coin is fair“. Bayesians start with a prior, a distribution over \(p\). Often, they choose a uniform prior (all values of \(p\) are equally likely from \(0\) to \(1\)), which is consistent with the maximum entropy idea. Then they use Bayes’ Rule to update this belief after seeing data.
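In symbols: with a \(\mathrm{Beta}(a, b)\) prior and \(h\) heads in \(n\) flips, Bayes’ Rule gives

\[P(p \mid \text{data}) \;\propto\; \underbrace{p^{h} (1-p)^{n-h}}_{\text{likelihood}} \cdot \underbrace{p^{a-1} (1-p)^{b-1}}_{\text{prior}} \;=\; p^{h+a-1} (1-p)^{n-h+b-1},\]

which is again a Beta distribution, now with parameters \((a+h,\; b+n-h)\). This is why the uniform prior (which is \(\mathrm{Beta}(1,1)\)) and a handful of coin flips combine so neatly.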
If we start with a uniform prior (\(\mathrm{Beta}(1, 1)\)) and observe \(6\) heads and \(4\) tails, the posterior belief becomes a \(\mathrm{Beta}(7, 5)\) distribution.
This posterior has a:
Mean of \(7/12 \approx 0.583\)
Mode of \(0.6\)
So now we believe that “The coin’s bias is probably around \(0.58\), but we are uncertain.“
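The mean and mode quoted above follow from the closed-form formulas for a Beta distribution, \(\text{mean} = \alpha/(\alpha+\beta)\) and \(\text{mode} = (\alpha-1)/(\alpha+\beta-2)\); a quick Python check:

```python
# Posterior after a uniform Beta(1, 1) prior and 6 heads / 4 tails:
# Beta(alpha = 1 + 6, beta = 1 + 4) = Beta(7, 5).
alpha, beta = 7, 5

posterior_mean = alpha / (alpha + beta)            # 7/12
posterior_mode = (alpha - 1) / (alpha + beta - 2)  # 6/10

print(round(posterior_mean, 3))  # 0.583
print(posterior_mode)            # 0.6
```

Notice that the posterior mode recovers the MLE of \(0.6\), while the mean is pulled slightly toward the prior’s \(0.5\): the uniform prior acts like two imaginary flips, one head and one tail.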
Pros:
Naturally incorporates prior knowledge and updates it.
Gives full distribution, and not just a point estimate.
Cons:
Requires choosing a prior.
Sometimes harder to compute.
Part 5: A Continuum, Not a Conflict
It might seem like Frequentists and Bayesians are always fighting. But they are really emphasizing different parts of the same story.
Maximum Entropy gives you the starting point when you know nothing.
Maximum Likelihood tells you what the data are saying.
Bayesian inference shows you how to connect the two.
Here is a beautiful way to think about it:
Entropy is what you do when you have no data.
Likelihood is what you do when you have data but no prior beliefs.
Bayesian inference is what you do when you have both.
They are tools on a spectrum of knowledge — from total ignorance to data-driven uncertainty.
Real-World Implications
Let’s see how this idea shows up in the physical world:
Before testing a drug, we might assume it works \(50\%\) of the time (pure uncertainty). After a clinical trial, we may update this to \(70\%\). Should we keep using the old belief? Or trust the new evidence?
Suppose a new weather model says there’s a \(60\%\) chance of rain tomorrow. But we know from past years it’s usually \(50\%\). Do we believe the model, or our prior knowledge?
A machine is expected to produce defective items only \(1\%\) of the time. But your first \(20\) samples show \(3\) defects. Is the machine broken, or is this just random variation?
In all these cases, we move from prior belief (possibly based on entropy or symmetry) to updated belief (based on observed data), using either likelihood or Bayesian updating.
Conclusion
Let’s revisit our original question.
“Why do we say the probability is 0.5 at first, and 0.6 after some flips?”
Because we learn.
At the beginning, all we had was symmetry. We assigned \(0.5\) using maximum entropy which is the most honest representation of ignorance.
After seeing data (\(6\) heads in \(10\) tosses), we allowed the evidence to guide us to a new estimate of \(0.6\), using maximum likelihood. Or slightly less, using Bayesian updating.
Both methods are part of a deeper truth:
Probability is not just about randomness.
It is about what we know — and how we update what we know.
So next time you flip a coin, or make any uncertain decision, think of yourself walking a path from entropy (ignorance), through likelihood (evidence), towards understanding (inference). And that perhaps, is the most honest form of probability we can aspire to.
Written by

Kunal Kumar Sahoo
I am a CS undergraduate with a passion for applied mathematics. I like to explore the avenues of Artificial Intelligence and Robotics, and I try to solve real-world problems with these tools.