Bits Don't Lie: Datatypes in Modern LLMs

Jay Gala

Let’s talk about the different datatypes used in modern LLMs like GPT, LLaMA and the like. The most common ones you might have heard of: FP32, BFloat16, Float16, INT8, etc. These are all standard data types available in PyTorch and other popular deep learning frameworks. But what exactly is an FP32 datatype, and when should you use it (and when not)?

When asked, most people I know (myself included until very recently) would say that FP32 is a 32-bit representation of a numerical value, and that the difference between it and BF16 is that the latter takes half the memory to store these values and is better for training and inference in LLMs. But why is it better, or at least just as good? What tradeoffs does it make? Let’s find out.

Bits and Bytes. The Boring Stuff.

First let’s understand how floating point numbers are represented in bits and bytes. Floating-point numbers follow the IEEE 754 standard and are structured using three core components:

  • Sign (S): 1 bit to represent positive (0) or negative (1)

  • Exponent (E): Determines the scale or range (scientific notation)

  • Mantissa (M): Holds the significant digits (precision)

The value of a floating-point number is given by the formula:

$$(-1)^S \times 1.M \times 2^{(E - \text{bias})}$$
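Here’s a tiny Python sketch of that formula. The `decode_fp32` helper and its hard-coded FP32 field layout (8 exponent bits, 23 mantissa bits, bias 127) are just for illustration:

```python
def decode_fp32(sign: int, exponent: int, mantissa_bits: str) -> float:
    # Evaluate (-1)^S * 1.M * 2^(E - bias) for FP32 fields
    bias = 127
    # 1.M = the implicit leading 1 plus the fractional mantissa bits
    mantissa = 1 + sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(mantissa_bits))
    return (-1) ** sign * mantissa * 2 ** (exponent - bias)

# The -6.5 example worked out below: S=1, E=129, M=101 followed by 20 zeros
print(decode_fp32(1, 129, "101" + "0" * 20))  # -6.5
```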

As an example, let’s convert the number -6.5 to FP32. FP32 by definition has 32 bits, so we need to see how -6.5 gets represented in that 32-bit format.

Step 1: Determine S

Since the value is negative, S will be 1.

Step 2: Determine E

Now, FP32 uses a bias of 127.


Detour: What Is Bias?

In IEEE 754 floating-point numbers, the exponent is stored using a biased representation. The bias is a fixed value that gets added to the actual exponent to ensure that the exponent can be stored as a non-negative unsigned integer.

Why? Computers are optimized for storing and comparing unsigned binary numbers, but exponents in math can be negative (e.g., 2⁻³ = 0.125).

To handle this, we use a bias so that the exponent field:

  • Can store both positive and negative exponents

  • Uses only positive integers in hardware
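For example, with FP32’s bias of 127, an actual exponent of −3 would be stored as:

$$E_{\text{stored}} = -3 + 127 = 124$$

so the hardware only ever sees a non-negative integer in the exponent field.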


Coming back to step 2.

Convert 6.5 (ignoring the sign, since it was handled in step 1) to binary → 110.1, then rewrite it in scientific notation (i.e. make it 1.xxxxx × 2^n) → 110.1 = 1.101 × 2², so you get the exponent as 2 (i.e. the n from 2^n).

So basically, the exponent is the power n needed to write the binary form in scientific notation with base 2. Finally, E = 2 + bias = 2 + 127 = 129, which converted to binary → 10000001

Step 3: Determine the Mantissa

Take the digits of the scientific notation after the binary point → 101 (ignoring the 1 before the point, because it is always 1.xxx in this notation) and then pad with zeros until you have 23 bits.

So the final representation of -6.5 in FP32 would be:

$$\underbrace{1}_{\text{S}}\;\underbrace{10000001}_{\text{E}}\;\underbrace{10100000000000000000000}_{\text{M}}$$
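You can sanity-check this bit pattern with Python’s built-in `struct` module:

```python
import struct

# Reinterpret the 4 bytes of a big-endian float32 as an unsigned integer
bits = struct.unpack(">I", struct.pack(">f", -6.5))[0]

print(f"{bits:032b}")                # 11000000110100000000000000000000
print(bits >> 31)                    # sign     -> 1
print(f"{(bits >> 23) & 0xFF:08b}")  # exponent -> 10000001 (129)
print(f"{bits & 0x7FFFFF:023b}")     # mantissa -> 10100000000000000000000
```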


Recap: FP32

So to summarize, a FP32 number looks like this:

  • 1 bit for Sign

  • 8 bits for Exponent (with bias of 127)

  • 23 bits for Mantissa (fractional part only — the leading 1 is implicit)

And with those 32 bits, we get a number format that can represent a huge range of values, from about ±3.4 × 10³⁸ down to around ±1.2 × 10⁻³⁸, with roughly 7 decimal digits of precision.

But here's the problem: this is kind of overkill for training neural networks.

Most of the time, we don’t need 7 digits of precision. We're doing approximate matrix multiplications, not quantum physics. And that’s where alternative datatypes like Float16 and BF16 come in.


Enter Float16 and BFloat16

To reduce memory usage and improve training speed, newer hardware (like NVIDIA GPUs and Google TPUs) supports 16-bit floating-point types. The idea is simple: chop FP32 in half and make things faster.

But there are two popular ways of doing this — FP16 (also called Float16 or Half) and BF16 (BFloat16) — and they’re very different under the hood. Let’s compare:

| Type | Total Bits | Exponent Bits | Mantissa Bits | Bias | Approx. Range | Approx. Precision |
| --- | --- | --- | --- | --- | --- | --- |
| FP32 | 32 | 8 | 23 | 127 | ~1e±38 | ~7 digits |
| FP16 | 16 | 5 | 10 | 15 | ~1e±5 | ~3.3 digits |
| BF16 | 16 | 8 | 7 | 127 | ~1e±38 | ~2.3 digits |

As you can see, BF16 keeps the same 8-bit exponent and bias as FP32, meaning it has the same range, but sacrifices precision with a smaller mantissa. FP16, on the other hand, shrinks both the exponent and the mantissa, so you lose both range and precision.
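If you want to check these numbers yourself, PyTorch exposes them through `torch.finfo`:

```python
import torch

# Print the largest value, smallest normal value, and machine epsilon
# (a proxy for precision) of each floating-point type
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "smallest normal:", info.tiny, "eps:", info.eps)
```

Note how FP16’s max is only 65504, while FP32 and BF16 both go up to roughly 3.4 × 10³⁸.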

Why BFloat16 Became the Hero

So why does BFloat16 (BF16) matter so much in training large models like GPT or LLaMA?

Fun Fact: The B in BF16 stands for Brain 🧠 since it was created by a team at Google Brain 🤯

FP16 reduces memory and bandwidth, which is great, but its smaller exponent range (~1e±5) makes it risky during training. If your gradients explode or vanish outside that narrow range, you’re in trouble.

BF16 fixes this by keeping the same exponent width and bias as FP32 (8 bits, bias 127), meaning it can represent very large and very small numbers just like FP32. This makes it much more robust during training, especially in early stages when the scale of numbers can vary widely.

You give up precision (BF16 has only 7 bits of mantissa, vs. 23 in FP32), but that’s usually okay. Neural networks don’t need pinpoint decimal accuracy; they need stability and throughput.
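A small PyTorch demo of both sides of this tradeoff (the values in the comments are approximate):

```python
import torch

x = torch.tensor(70000.0)     # well outside FP16's ~1e±5 range
print(x.to(torch.float16))    # inf     -> FP16 overflows
print(x.to(torch.bfloat16))   # ~70144  -> BF16 keeps the scale, just coarsely

y = torch.tensor(1.001)
print(y.to(torch.bfloat16))   # 1.0     -> BF16 drops the fine precision
print(y.to(torch.float16))    # ~1.0010 -> FP16 keeps it
```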

In short:

  • BF16 = FP32’s range + reduced precision + 2x smaller

  • FP16 = reduced range + reduced precision + 2x smaller

  • FP32 = full range and precision + slow and big

This tradeoff is why BF16 is now standard on Google TPUs and widely adopted on NVIDIA GPUs too.


INT8 and INT4: When Accuracy Isn’t Everything

Now let’s talk about inference — where you’re no longer training the model, just running it to generate predictions. Here, speed and memory efficiency are king, and we can go even further with quantization.

Quantization is the process of mapping floating-point values to lower-bit integer formats, like:

  • INT8: 8-bit signed integer (values from -128 to +127)

  • INT4: 4-bit signed integer (values from -8 to +7)

This sounds crude, and it is, but it works surprisingly well. The trick is to apply quantization with a scale and zero-point (full topics on their own; I’ll explain them in a future blog) that map float values to integer values and back:

$$\text{quantized} = \text{round}\left( \frac{x - \text{zeropoint}}{\text{scale}} \right)$$

$$x \approx \text{scale} \times \text{quantized} + \text{zeropoint}$$
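Here’s a minimal NumPy sketch of that mapping, following the same convention as the formulas above. The min/max choice of scale and zero-point here is just one simple scheme, not what GPTQ or AWQ actually do:

```python
import numpy as np

def quantize(x, scale, zero_point):
    q = np.round((x - zero_point) / scale)
    return np.clip(q, -128, 127).astype(np.int8)   # clamp to the INT8 range

def dequantize(q, scale, zero_point):
    return scale * q.astype(np.float32) + zero_point

weights = np.array([-0.51, 0.02, 0.37, 1.20], dtype=np.float32)
scale = (weights.max() - weights.min()) / 255      # map the full range onto 256 steps
zero_point = weights.min() + 128 * scale           # so the minimum lands on -128

q = quantize(weights, scale, zero_point)
print(q)                                 # e.g. [-128  -49    3  127]
print(dequantize(q, scale, zero_point))  # values close to the original weights
```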

During inference, models use fast integer arithmetic and only convert back to floats when absolutely needed. With proper calibration (techniques like GPTQ or AWQ, which deserve a separate blog), quantized LLMs can retain 95–99% of their original accuracy with a significantly smaller memory footprint.


When to Use What?

| Phase | Preferred Datatypes | Why? |
| --- | --- | --- |
| Training | BF16, Mixed FP16/FP32 | Speed + Stability |
| Inference | INT8, INT4 | Speed + Memory Efficiency |
| Debugging | FP32 | High Precision, Easy to Inspect |
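To make that table concrete, here’s a minimal PyTorch sketch of the training and inference modes (assuming a CUDA GPU; the `Linear` layer and squared-output loss are just stand-ins, not a real LLM setup):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())
x = torch.randn(8, 1024, device="cuda")

# Training: run the forward pass in BF16 via autocast while the
# master weights (and optimizer state) stay in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()

# Inference: cast the whole model down once and keep it that way.
model = model.to(torch.bfloat16).eval()
with torch.no_grad():
    out = model(x.to(torch.bfloat16))
```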

In the world of LLMs, datatypes aren’t just a hardware detail but a design choice that shapes everything from training time to serving latency. By understanding how sign, exponent, mantissa, and bias all play together, we get a much clearer picture of the tradeoffs made under the hood.

Next time you run model.to(torch.bfloat16), you’ll know exactly what’s happening.
