Histograms - The Bucket Truth

So here I am — writing a blog that probably no one cares about, after watching a 3-minute YouTube video on histograms.
Histograms
Ok.
What are these things?
Suppose we are measuring the weights of people on a military base (not in PUBG, btw), and the general asks you to make a visual representation of the measurements (cuz... you know, he can't read).
Of course,
if there are 5 people, cool: we can cross-mark the weights of all the soldiers on a number line.
But if there are 500, you might need a damn telescope, not a graph, because you'll have to make those cross-marks tiny to fit them all. If you refuse to do that and keep your marks large, they overlap.
And that is exactly when the histogram says: there is a reason for my existence.
Please don't bore yourselves to death (or do) by reading all of that. *Here goes,*
you first decide the width of each bar (the bucket, or bin) in your histogram.
In a histogram, each bar covers a range of values, and you don't give a damn where exactly your data point falls inside it.
Example: there are 40 soldiers on the military base, and suppose we want the bucket width to be 5.
[91.9, 70.3, 75.3, 79.1, 67.1, 75.0, 75.0, 57.5, 85.2, 81.0, 68.7, 73.3, 80.1, 72.4, 72.6, 60.5, 80.5, 76.2, 77.7, 59.7, 91.5, 76.5, 71.1, 95.3, 74.5, 60.5, 70.9, 52.1, 85.5, 70.8, 67.6, 85.7, 58.5, 80.4, 54.4, 68.4, 63.0, 89.6, 92.7, 71.7]
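Here is a minimal sketch of that bucketing in plain Python, in case you want to see the bars without drawing anything. The helper name `bucket_counts` and the choice to start each bucket at a multiple of 5 (so 50-55, 55-60, ...) are my assumptions, not the only way to do it.

```python
from collections import Counter

# The 40 weights from the example above.
weights = [91.9, 70.3, 75.3, 79.1, 67.1, 75.0, 75.0, 57.5, 85.2, 81.0,
           68.7, 73.3, 80.1, 72.4, 72.6, 60.5, 80.5, 76.2, 77.7, 59.7,
           91.5, 76.5, 71.1, 95.3, 74.5, 60.5, 70.9, 52.1, 85.5, 70.8,
           67.6, 85.7, 58.5, 80.4, 54.4, 68.4, 63.0, 89.6, 92.7, 71.7]

BUCKET_WIDTH = 5  # same width as in the example


def bucket_counts(values, width):
    """Count how many values land in each bucket [start, start + width).

    Hypothetical helper: bucket starts are multiples of `width`.
    """
    counts = Counter(int(v // width) * width for v in values)
    return dict(sorted(counts.items()))


counts = bucket_counts(weights, BUCKET_WIDTH)

# Poor man's histogram: one '#' per soldier in the bucket.
for start, n in counts.items():
    print(f"{start:>3}-{start + BUCKET_WIDTH:<3} {'#' * n}")
```

The tallest bar lands on the 70-75 bucket and the bars shrink on both sides of it.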
Now if we show the histogram of these weights to our General Bates or whoever, he can see that his soldiers' weights are roughly normally distributed.
Ok, now the choice of width matters.
Too Narrow - Too much noise
Too Wide - Too much loss of signal
Think of it this way,
Histogram bin width is like camera zoom:
Zoom in too much — and everything’s noise.
Zoom out too far — and you miss all the detail.
We have to find that sweet spot somewhere in the middle, where the signal is high and the noise is acceptable.
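To see the zoom effect on the same 40 weights, here is a rough sketch that buckets them at three widths. The widths 1, 5 and 20 are just values I picked for illustration, and `bucket_counts` is a hypothetical helper, not a library function.

```python
from collections import Counter

weights = [91.9, 70.3, 75.3, 79.1, 67.1, 75.0, 75.0, 57.5, 85.2, 81.0,
           68.7, 73.3, 80.1, 72.4, 72.6, 60.5, 80.5, 76.2, 77.7, 59.7,
           91.5, 76.5, 71.1, 95.3, 74.5, 60.5, 70.9, 52.1, 85.5, 70.8,
           67.6, 85.7, 58.5, 80.4, 54.4, 68.4, 63.0, 89.6, 92.7, 71.7]


def bucket_counts(values, width):
    """Count values per bucket [start, start + width), starts at multiples of width."""
    counts = Counter(int(v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Too narrow, about right, too wide (illustrative choices only).
for width in (1, 5, 20):
    counts = bucket_counts(weights, width)
    print(f"width {width:>2}: {len(counts)} non-empty buckets, "
          f"tallest bar = {max(counts.values())}")
```

With width 20, everything collapses into three fat bars and most of the shape is gone. With width 1, you get lots of tiny bars that mostly reflect which soldier happened to step on the scale.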
Quick decode:
Signal - the part of the data that represents something useful to the model (good stuff)
Noise - everything other than signal (garbage)
Like this, we can use histograms to our convenience.
Histograms are more than pretty bars
They can also tell us how our data is distributed: normally, exponentially, and so on.
And for God's sake, histograms don't just show frequency per bucket.
Done properly, they show frequency density, so the bars won't lie when the bucket width changes.
Frequency density is the frequency divided by the bucket width.
So yeah: trust the area, not the height.
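A small sketch of why the area is the thing to trust, using the same weights but deliberately unequal buckets. The bucket edges and the helper name `frequency_density` are made up for illustration.

```python
from bisect import bisect_right

weights = [91.9, 70.3, 75.3, 79.1, 67.1, 75.0, 75.0, 57.5, 85.2, 81.0,
           68.7, 73.3, 80.1, 72.4, 72.6, 60.5, 80.5, 76.2, 77.7, 59.7,
           91.5, 76.5, 71.1, 95.3, 74.5, 60.5, 70.9, 52.1, 85.5, 70.8,
           67.6, 85.7, 58.5, 80.4, 54.4, 68.4, 63.0, 89.6, 92.7, 71.7]

# Hypothetical, deliberately unequal bucket edges.
edges = [50, 60, 70, 75, 80, 100]


def frequency_density(values, edges):
    """Return (counts, widths, densities) for half-open buckets [lo, hi)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1  # index of the bucket containing v
        if 0 <= i < len(counts):        # values outside the edges are skipped
            counts[i] += 1
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    densities = [c / w for c, w in zip(counts, widths)]
    return counts, widths, densities


counts, widths, densities = frequency_density(weights, edges)
for lo, hi, c, d in zip(edges, edges[1:], counts, densities):
    print(f"{lo:>3}-{hi:<3} count={c:>2}  density={d:.2f}  area={d * (hi - lo):.0f}")
```

The 80-100 bucket holds the most soldiers, but it is also four times wider than the 70-75 bucket, so its bar is drawn much shorter. Height times width gives back the count in every bucket, which is the whole point.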
TL;DR;
Go read the whole thing
Final Note :
Thanos told me to write this blog in the shittiest way possible to save the universe. If you don’t like it — blame the purple guy.
Written by Chandra prakash Dereddy
