Floating-Point Formats For You!

Martin Licht
15 min read

Introduction

Most programming languages have at least one data type that represents integral numbers within a specific range. For example, Java has a built-in data type int that holds the integers -2147483648 through 2147483647, corresponding to 4 bytes. The C programming language has several integer types for signed and unsigned integers. For example, the unsigned long type can hold at least the integers 0 through 4294967295 (corresponding to at least 4 bytes), and the signed short type holds at least the integers -32767 through 32767 (corresponding to at least 2 bytes).

Closer to the machine level, calculating with integers is among the fundamental capabilities of any CPU. Any CPU's instruction set includes commands to perform basic arithmetic with integers, offering different variants depending on the size and signedness of the integers. For example, the x86 architecture includes instructions to multiply two signed 4-byte integers or add two unsigned 2-byte integers.

However, not all numerical values are integers.

Science and engineering handle real numbers that are not integers. One famous example of such a number is the circle number pi, whose first few digits are 3.141592653589793...

How do we represent, say, irrational numbers such as pi or fractional numbers such as 0.0000456 or 12345.6789 on a computer? Different ideas have been attempted since the invention of digital computers. Whilst number representations such as fixed-point or binary-coded decimal (BCD) were in use in earlier times, they have been largely supplanted by floating-point number formats over the last fifty years. The industry has adopted floating-point numbers as standard practice in hardware and software, especially to accommodate the computational needs of scientists and engineers. On the one hand, floating-point numbers are supported as data types in most programming languages. On the other hand, most modern CPUs support instructions for numbers encoded in floating-point formats. These calculations are typically handled by the processor's floating-point unit (FPU).

From scientific notation to floating-point numbers

Most real numbers cannot be represented on computers since computers can only store a finite number of bits. In fact, one can mathematically prove that there are real numbers whose digit sequences cannot be generated by any computer program. At best, the vast majority of numbers in our calculations can only be approximated on a computer. The floating-point number representation stands out among the many possibilities for representing (or approximating) non-integer numbers. Its origins predate the advent of digital computers: it is directly inspired by the scientific notation in science and engineering.

Given the central role of scientific notation, the floating-point number representation should be primarily understood as a number format for computational methods in science and engineering. Other number representations (e.g., fixed-point formats) may be more suitable for different domains, such as computational finance or machine learning.

Let us illustrate the scientific notation with an example from physics. The fine structure constant, which is a fundamental constant of particle physics, has a measured approximate value of

f = 0.0072973525 = 7.2973525 x 10^(-3)

The scientific notation rescales numbers by factoring out some power of ten (positive or negative), leaving a decimal number with only one leading digit. We call that decimal number the mantissa or significand. The power of ten determines the exponent. In this example, 7.2973525 is the mantissa and -3 is the exponent.

The scientific notation has two main advantages:

  • We can quickly determine the order of magnitude of a number. Scientific calculations can involve numbers across several orders of magnitude.

  • We can easily control the precision. When operating within a particular order of magnitude, we generally only need to know a value up to a certain precision, that is, up to a relative error. The more fractional digits we keep in our mantissa, the more precision we maintain. Practical calculations discard excess precision and only maintain a certain number of fractional digits in the mantissa.

For example, the numerical value of the Avogadro constant in physics and chemistry is roughly equal to

A = 6.0221 x 10^(23)

With four fractional digits in the mantissa, we read that the relative error between the written value and the exact value is at most 10^(-4), meaning that the written value lies within a hundredth of a percent of the actual value (which is 6.02214076 x 10^(23)). Increasing the length of the mantissa reduces the relative error of our number representation, provided our measured data warrant that accuracy. Evidently, CPU manufacturers and domain experts need to balance a trade-off between precision and practical effort.

In summary, up to a certain precision determined by the length of the mantissa, the scientific notation rewrites any real number by splitting it into a mantissa and some power of ten. More concisely, we can use the exponential notation or E notation:

f = 7.2973525e-3
A = 6.0221e23
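
In C, for instance, the printf conversion %e prints a floating-point value in precisely this E notation. A minimal sketch:

    #include <stdio.h>

    int main(void) {
        double f = 0.0072973525; /* fine-structure constant, approximately */
        double A = 6.0221e23;    /* Avogadro constant, approximately       */
        printf("%.7e\n", f);     /* prints 7.2973525e-03                   */
        printf("%.4e\n", A);     /* prints 6.0221e+23                      */
        return 0;
    }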

The scientific notation has evolved over centuries in science and engineering. It represents numbers in compact form and enables scientists to balance computational effort against practically relevant precision. We can represent numbers over numerous orders of magnitude, from the very large to the very small. With that in mind, the scientific notation is the natural candidate for implementing fractional calculations on a computer.

This has led to the name floating-point number: given any number, we can choose a suitable exponent such that the decimal point is shifted right behind the leading digit. The term floating-point reflects how the decimal point "floats" to the desired position after the leading digit, always dynamically adjusted with the exponent.

Binary floating-point numbers

The scientific notation is not restricted to the decimal system: any radix system supports it. On computers, it seems most natural to work with the binary system. For example, the fine-structure constant and the Avogadro constant above read roughly as follows in binary:

f = 1.110111100001111010100001001010101101110101111 x 2^(-8)
A = 1.111111100001 x 2^(78)

Our floating-point formats are conceptually based on this binary scientific notation. The key idea is this:

Floating-point numbers implement scientific notation in binary on computers.

Here are a few things to keep in mind:

  • Any floating-point format will have M bits to represent a binary mantissa and N bits to represent the binary exponent. The numbers of bits M and N are fixed for each particular floating-point format. Effectively, there are only finitely many possible mantissae, and the exponent is constrained to a finite number of possible values.

  • The leading non-fractional digit of any binary mantissa is always 1. In order to save memory, many floating-point formats store only the fractional digits of the mantissa explicitly in memory. This is known as the hidden-bit convention, as we will discuss soon.

  • An issue we have swept under the rug so far is how to represent the number zero. This is not trivial. An engineer doing computations by hand would simply write down 0, or something like 0.00000 x 10^0. However, we cannot represent zero in a binary floating-point format that adheres to the hidden-bit convention, which always implicitly assumes a leading digit 1. If we were to explicitly save the first non-fractional digit, then this would cost us an additional bit that virtually never carries any interesting information. Furthermore, we would have numerous representations of zero with different exponents: 0.000 x 2^3, 0.000 x 2^0, or 0.000 x 2^(-7). In summary, storing the explicit leading digit comes with many disadvantages. Instead, the mainstream formats reserve specific bit patterns for exceptional cases.

  • The base used in the scientific notation and floating-point formats is also called the radix. We focus on binary floating-point numbers with radix 2. Of course, we could theoretically use any radix: the radix 10 representation is commonly used when humans calculate by hand. CPUs natively supporting floating-point calculations in some decimal encoding have appeared throughout the years.

Binary floating-point numbers, more details

IEEE 754 is a technical standard that defines several floating-point number formats. The most important of these formats are natively supported by CPU instruction sets across different hardware vendors. More specifically, the floating-point number formats represent numbers as follows:

  • 1 bit for the sign, the sign bit
  • M bits for the mantissa, the mantissa bits
  • N bits for the exponent, the exponent bits

Each format is characterized by how many bytes any floating-point number occupies in memory and how many bits are allocated for the mantissa and the exponent. Each such floating-point number format represents a trade-off between:

  • the exponent range, which determines the possible orders of magnitudes
  • the mantissa size, which governs the precision
  • the total size, which impacts processing time and memory footprint

Some key clarifications are in order here:

  • Most formats use an implicit leading bit. Then the M bits of the mantissa only represent the fractional part of the mantissa (in binary expansion), while the leading digit 1 is not stored explicitly. This is also known as the hidden-bit convention.

  • The exponent is not stored as an N-bit signed integer. This might be counterintuitive. Instead, we add a bias to the exponent and store the result, which is non-negative, as an N-bit unsigned integer. The bias is typically 2^(N-1)-1.

  • The sign bit is stored at the very beginning. Next come the exponent bits, which are stored before the mantissa bits. This has some advantages in particular situations. For example, with that convention, non-negative floating-point numbers sort in the same order as their bit patterns interpreted as unsigned integers.

  • When the exponent bits are set to either the largest or the smallest possible value, alternative interpretations are applied to the entire bit sequence. The exponent bits store an unsigned integer from 0 to 2^N-1. Subject to the typical exponent bias, the nominal exponent range runs from -2^(N-1)+1 up to (2^N-1) - (2^(N-1)-1) = 2^(N-1). Excluding the extremal exponents, which are reserved for encoding exceptional cases, the effective exponent range runs from the lowest effective value -2^(N-1)+2 up to the highest effective value 2^(N-1)-1.
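
To make this layout concrete, here is a small C sketch that extracts the sign, exponent, and mantissa bits of a single-precision number (M = 23, N = 8), assuming that float is the IEEE 754 binary32 format on the target platform:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float x = -6.25f;
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);          /* reinterpret the float's bits   */

        unsigned sign     = bits >> 31;          /* 1 sign bit                     */
        unsigned exponent = (bits >> 23) & 0xFF; /* 8 exponent bits, biased by 127 */
        unsigned mantissa = bits & 0x7FFFFF;     /* 23 fractional mantissa bits    */

        printf("sign = %u, exponent = %d, mantissa bits = 0x%06X\n",
               sign, (int)exponent - 127, mantissa);
        /* -6.25 = -1.5625 x 2^2, so this prints:
           sign = 1, exponent = 2, mantissa bits = 0x480000 */
        return 0;
    }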

Numerous floating-point formats are in use. Different CPU architectures and programming languages support different such formats. They generally differ by how many bits they afford for the mantissa and the exponent.

The most important floating-point number formats are IEEE 754 single precision and IEEE 754 double precision. Let us review them in more detail.

IEEE 754 single precision occupies 32 bits of memory. This corresponds to the type float in most implementations of the C programming language. A single-precision floating-point number reserves 23 bits for the mantissa and 8 bits for the exponent.

The exponent occupies 8 bits, representing unsigned integers from 0 to 255. Accordingly, the exponent bias in single precision is 127 = 2^7 - 1, and the nominal exponent range goes from -127 to 128. However, the smallest and largest exponents are reserved for representing exceptional values. The actual exponent range, therefore, only goes from -126 to 127.

The mantissa bits encode an unsigned integer within the range of 0 through 8388607, representing the fractional binary digits. If we divide that by 8388608 = 2^23 and add 1, then we recover the true mantissa of the number.

IEEE 754 double precision occupies 64 bits of memory and corresponds to the type double in most implementations of the C programming language. A double-precision floating-point number reserves 52 bits for the mantissa and 11 bits for the exponent.

The exponent occupies 11 bits, representing unsigned integers from 0 to 2047 = 2^11 - 1. Accordingly, the exponent bias in double-precision is 1023 = 2^10-1, and the nominal exponent range goes from -1023 to 1024. Again, the smallest and largest exponents are reserved for representing exceptional values, so the actual exponent range only goes from -1022 to 1023.

The mantissa bits encode an unsigned integer within the range of 0 through 2^52-1, representing the fractional binary digits. If we divide that by 2^52 and add 1, then we recover the true mantissa of the number.
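
As a sketch of this decoding, the following C fragment reconstructs the value of a normalized double-precision number from its bit fields, assuming double is the IEEE 754 binary64 format:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    /* Decode a normalized binary64 value from its raw bits. */
    double decode_binary64(uint64_t bits) {
        double sign     = (bits >> 63) ? -1.0 : 1.0;
        int exponent    = (int)((bits >> 52) & 0x7FF) - 1023;      /* remove the bias  */
        uint64_t frac   = bits & 0xFFFFFFFFFFFFFULL;               /* 52 mantissa bits */
        double mantissa = 1.0 + (double)frac / 4503599627370496.0; /* frac/2^52 + 1    */
        return sign * ldexp(mantissa, exponent);                   /* mantissa * 2^exp */
    }

    int main(void) {
        double x = -123.456;
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        printf("%f\n", decode_binary64(bits)); /* prints -123.456000 */
        return 0;
    }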

Common floating-point formats

Common folklore says that the single-precision type (float) is processed faster by floating-point units than the double-precision type (double) whilst only taking half the size. However, though single-precision is adequate for many computations, scientific applications may require double-precision calculations. Several other floating-point formats are summarized in the following table:

Name                               Total bits   Mantissa bits   Exponent bits   Bias     Exponent range
IEEE 754 Half (binary16)           16           10              5               15       −14 to +15
IEEE 754 Single (binary32)         32           23              8               127      −126 to +127
IEEE 754 Double (binary64)         64           52              11              1023     −1022 to +1023
IEEE 754 Quadruple (binary128)     128          112             15              16383    −16382 to +16383
IEEE 754 Octuple (binary256)       256          236             19              262143   −262142 to +262143
bfloat16 (Brain Float)             16           7               8               127      −126 to +127
x86 Extended Precision (80-bit)    80           64 (explicit)   15              16383    −16382 to +16383

The 80-bit format is technically the odd one out here: it does not follow the hidden-bit convention but stores the leading digit explicitly, so its 64 mantissa bits comprise 63 fractional bits plus an explicit leading bit. It is a legacy format stemming from the early days of Intel floating-point hardware.

Single and double precision are natively supported in mainstream CPUs. By contrast, the half-, quadruple-, and octuple-precision types lack widespread hardware support and are emulated via software at best. Half-precision types are more relevant for GPU computing and might be accessed indirectly via libraries that control GPU-side calculations. Extended-precision types beyond double have found use in specialized scientific applications.

When the first C compiler was introduced in 1972, the data types float and double were already part of the language. The terminology suggests that single precision was considered the standard floating-point type back then. With ANSI C in 1989, double precision became the default type for floating-point literals, echoing that double precision is the norm nowadays. Floating-point formats besides single and double precision are not natively supported in C but may be provided via compiler extensions (e.g., _Float16 or _Float128). The C type long double is implementation-defined: in the Windows ecosystem, it is equivalent to double; contingent on hardware support, it corresponds to the 80-bit format on x86 and to the 128-bit format on RISC-V processors.
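
To see what a given toolchain provides, one can simply print the sizes of the three standard C types; a minimal sketch (the long double result varies by platform, as discussed above):

    #include <stdio.h>

    int main(void) {
        printf("float:       %zu bytes\n", sizeof(float));       /* typically 4  */
        printf("double:      %zu bytes\n", sizeof(double));      /* typically 8  */
        printf("long double: %zu bytes\n", sizeof(long double)); /* 8, 12, or 16 */
        return 0;
    }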

To get a feel for these floating-point formats, we take a look at approximate decimal precisions and ranges for some IEEE 754 formats:

Name                             Decimal digits in mantissa   Max value     Min positive value (subnormal)
IEEE 754 Half (binary16)         3.31                         65504         5.96e-8
IEEE 754 Single (binary32)       7.22                         3.40e38       1.40e-45
IEEE 754 Double (binary64)       15.95                        1.80e308      4.94e-324
IEEE 754 Quadruple (binary128)   34.02                        1.19e4932     6.48e-4966
IEEE 754 Octuple (binary256)     71.34                        1.61e78913    2.25e-78984
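
These figures can be cross-checked against the constants that C exposes in <float.h>; for example (FLT_TRUE_MIN and DBL_TRUE_MIN, the smallest positive subnormal values, require C11):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        printf("float:  max %e, min subnormal %e\n", FLT_MAX, FLT_TRUE_MIN);
        /* prints: float:  max 3.402823e+38, min subnormal 1.401298e-45 */
        printf("double: max %e, min subnormal %e\n", DBL_MAX, DBL_TRUE_MIN);
        /* prints: double: max 1.797693e+308, min subnormal 4.940656e-324 */
        return 0;
    }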

Normalized floating-point numbers and special cases

When the exponent bits assume one of the extremal values of the exponent range, either the highest or the lowest possible exponent, then the bits of the floating-point number are not interpreted in the usual way. Instead, that situation encodes special values.

First, we need the notion of a normalized floating-point number. A normalized floating-point number, in a format with N exponent bits, is simply a number whose stored exponent is at least 1 and at most 2^N-2, that is, anything except the largest and the smallest possible values. Floating-point numbers whose bit patterns satisfy that condition, the normalized floating-point numbers, are interpreted in the standard manner.

Floating-point numbers whose exponent assumes one of the two extremal values are not normalized. The IEEE 754 standard reserves special interpretations for such floating-point numbers, where the bits now take on a different meaning.

Infinity

If all exponent bits are set to 1 and all mantissa bits are set to 0, then the floating-point number is interpreted as an infinity. This is either +inf if the sign bit is not set or -inf if the sign bit is set.

These infinities are introduced to handle overflows: when the result of a floating-point computation becomes so large in magnitude that it cannot be represented by a normalized floating-point number, then it is stored as a positive or negative infinity.
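
A small C illustration of overflow producing an infinity:

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void) {
        float x = FLT_MAX;        /* largest finite single-precision value  */
        float y = x * 2.0f;       /* too large for the format: becomes +inf */
        printf("%f\n", y);        /* prints inf                             */
        printf("%d\n", isinf(y)); /* nonzero: y is an infinity              */
        return 0;
    }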

Not-a-number

If all exponent bits are set to 1 and not all mantissa bits are set to 0, then the floating-point number is interpreted as not-a-number, or NaN. This encodes an invalid computational result: e.g., the division of infinities results in NaN, and the square root of a negative number yields NaN. For now, we shall not be further concerned with the technical intricacies of not-a-number.
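
And a small illustration of operations yielding NaN, including its hallmark property of comparing unequal to itself:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double a = sqrt(-1.0);          /* invalid operation: yields NaN */
        double b = INFINITY / INFINITY; /* division of infinities: NaN   */
        printf("%f %f\n", a, b);        /* prints nan (or -nan) twice    */
        printf("%d %d\n", isnan(a), isnan(b)); /* both nonzero           */
        printf("%d\n", a == a);         /* 0: NaN is unequal to itself   */
        return 0;
    }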

Zero

The number zero cannot be represented as a normalized floating-point number: we remember that any normalized floating-point number must have 1 as the leading digit of its mantissa, following the hidden bit convention. This obviously excludes the number zero.

Hence, zero is treated as a special case in the floating-point format. By convention, a zero is a floating-point number where both the exponent bits and the mantissa bits are all zero. The sign bit can be either zero or one, which is why we have two possible zeroes: either the positive zero or the negative zero.
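
The two zeroes compare equal but remain distinguishable, as this C snippet shows:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double pz = 0.0, nz = -0.0;
        printf("%d\n", pz == nz);        /* 1: the two zeroes compare equal  */
        printf("%d %d\n", signbit(pz) != 0, signbit(nz) != 0); /* 0 and 1    */
        printf("%f %f\n", 1.0 / pz, 1.0 / nz); /* inf -inf: the sign matters */
        return 0;
    }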

These two zeroes are examples of denormalized numbers, which we shall discuss now.

Denormalized numbers

We recall once again that normalized floating-point number formats generally adhere to the hidden bit convention: the leading digit in their mantissa is implicitly 1 and not stored explicitly in memory. Only the fractional part of the mantissa is stored explicitly. But in the special case when the exponent bits are all zero, the mantissa and exponent bits are interpreted differently, and we call such floating-point numbers denormalized or subnormal.

Denormalized numbers represent values extremely close to zero and are implemented as follows. First, the exponent is now interpreted as -2^(N-1)+2. If we were to follow the usual convention of normalized numbers, then the exponent would be the negative bias -2^(N-1)+1 instead. Second, the mantissa is now interpreted with a leading 0 instead of a leading 1. Lastly, the sign bit is interpreted as usual.

Effectively, we interpret the mantissa bits as an unsigned integer x with M bits, ranging from 0 to 2^M-1, and the denormalized floating-point number represents x/2^M * 2^(-2^(N-1)+2). This is strictly less than the smallest normalized floating-point number, which is 2^(-2^(N-1)+2).
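
For single precision (M = 23, N = 8), the smallest normalized number is 2^(-126) and the smallest subnormal is 2^(-23) x 2^(-126) = 2^(-149). A short C sketch of the gradual underflow below FLT_MIN (FLT_TRUE_MIN requires C11):

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void) {
        printf("%e\n", FLT_MIN);      /* smallest normalized: 1.175494e-38 */
        printf("%e\n", FLT_TRUE_MIN); /* smallest subnormal:  1.401298e-45 */

        /* Halving below FLT_MIN yields subnormals rather than zero. */
        float x = FLT_MIN / 2.0f;
        printf("%e, subnormal: %d\n", x, fpclassify(x) == FP_SUBNORMAL);
        return 0;
    }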

What is the purpose of denormalized numbers?

  • Representation of the number zero, as mentioned above.

  • Handling of underflows: If the computation result is too small to be represented by a normalized number, then it is stored as a denormalized number. Denormalized numbers are designed to enable a gradual transition to zero when computational results become very small, avoiding an abrupt flush to zero.

However, computation with denormalized numbers may incur a performance penalty. The reason is that floating-point units are designed and optimized to handle normalized floating-point numbers, which are generally considered the standard case; denormalized numbers are expected to be the rare exception. Many FPUs do not support subnormal numbers in hardware but emulate them via microcode or software, which saves transistors on the chip. Even if the hardware natively supports subnormal calculations, they usually come with a performance penalty.

This is why many CPUs support optional switches that guarantee that subnormal numbers are automatically treated as zero in computations or flushed to zero immediately, thus avoiding subnormal calculations.
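
On x86 with SSE, for instance, these switches are the flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits of the MXCSR control register, settable via intrinsics. A sketch assuming an x86 target and the Intel intrinsic headers:

    #include <stdio.h>
    #include <float.h>
    #include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE     */
    #include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */

    int main(void) {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         /* flush tiny results to zero */
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); /* treat tiny inputs as zero  */

        volatile float x = FLT_MIN;  /* volatile: prevent compile-time folding */
        volatile float y = x / 2.0f; /* would be subnormal; flushed to zero    */
        printf("%e\n", y);           /* prints 0.000000e+00                    */
        return 0;
    }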

Conclusion

This article has explored the very basics of floating-point numbers and their bit representations. Future articles will address select topics in more depth.

