Floating point numbers#

We’ve seen that we can represent decimals using floats, and also that these floats can sometimes have strange behaviour.

0.1 + 0.2

0.30000000000000004

It’s important to understand what is going on here. A floating point number is one that is (approximately) represented in a format similar to “scientific notation”, but where the number of significant figures and the base of the exponent are fixed. For example, we might fix the number of significant figures at \(16\) and the base as \(10\), and represent two numbers of very different magnitudes:

\[\begin{split} \sqrt{2} \approx 1.414213562373095 \times 10^{0}, \\ e^{35} \approx 1.586013452313431 \times 10^{15}. \end{split}\]

The significant digits are called the significand or mantissa, and the exponent is conveniently called the exponent.

Note that the error in the second approximation will be much larger in absolute value. The term “floating point” refers to how the exponent moves the decimal point across the significant figures.
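As a quick check, Python's format specification can print a float in this kind of decimal scientific notation. The sketch below asks for 16 significant figures; the helper name to_scientific is purely illustrative, and math.exp stands in for np.exp.

```python
import math

def to_scientific(x, sig_figs=16):
    """Format x in decimal scientific notation with sig_figs significant figures."""
    # The 'e' format takes the number of digits after the point,
    # which is one fewer than the number of significant figures.
    return f"{x:.{sig_figs - 1}e}"

print(to_scientific(math.sqrt(2)))   # 1.414213562373095e+00
print(to_scientific(math.exp(35)))   # compare with the second approximation above
```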

Computers typically use a floating-point system to represent non-integer real numbers. The system used by Python is a little different to the representation above. It assumes that the point lies after the last significant digit, rather than after the first as above. It also uses base 2 (binary), with 53 significant binary digits (bits) and 11 bits for the exponent. Only 52 of the significant bits are stored explicitly, since the leading bit of a normalised binary number is always 1; together with one bit for the sign, this gives a total of 64 bits (8 bytes). This system is called a double-precision float. The two numbers above would be represented as follows:

\[\begin{split} \sqrt{2} \approx 6369051672525773 \times 2^{-52}, \\ e^{35} \approx 6344053809253723 \times 2^{-2}. \end{split}\]

Here we have given the significands and exponents in base ten for convenience, but they would be stored in binary.

Since \(2^{53} \approx 10^{16}\), we roughly get 16 significant decimal digits in a double-precision float. There are also single-precision floats, which take up 4 bytes (24 significant bits, of which 23 are stored explicitly because the leading bit is implied, plus 8 exponent bits and 1 sign bit). This translates into around 7 significant decimal digits.
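The integer significand and exponent of a double-precision float can be recovered with the standard library. The following is a minimal sketch; the function name significand_and_exponent is an assumption made for illustration, and it is only intended for ordinary (finite, nonzero, normal) floats.

```python
import math

def significand_and_exponent(x):
    """Return integers (m, e) with x == m * 2**e and 2**52 <= |m| < 2**53.

    Intended for finite, nonzero, normal floats only.
    """
    frac, exp = math.frexp(x)            # x == frac * 2**exp with 0.5 <= |frac| < 1
    return int(frac * 2**53), exp - 53   # scaling by a power of 2 is exact

m, e = significand_and_exponent(math.sqrt(2))
print(m, e)                               # 6369051672525773 -52
print(math.ldexp(m, e) == math.sqrt(2))   # True
```

Calling the same function on np.exp(35) should reproduce the second significand and exponent shown above.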

Precision and the machine epsilon#

Since there are a fixed number of significant digits, there are often issues when adding together numbers of different magnitudes. Consider the following:

import numpy as np

np.exp(35) + 0.1 == np.exp(35)

True

Since the exponent for \(e^{35}\) is large, the fixed 53 significant bits cannot show the difference between \(e^{35}\) and \(e^{35} + 0.1\).

A very important example comes from considering numbers just slightly larger than 1.

# 1e-14 is shorthand for 10**(-14).
# Test if Python can distinguish between 1 + 1e-14 and 1
1 + 1e-14 == 1

False

1 + 1e-15 == 1

False

1 + 1e-16 == 1

True

Python cannot distinguish between \(1\) and \(1 + 10^{-16}\); they are represented by the same float. This value of \(10^{-16}\) is a good approximation for \(2^{-53}\), which is the “true” largest value \(\varepsilon\) such that Python cannot distinguish between \(1\) and \(1 + \varepsilon\). This value \(\varepsilon\) is called the machine epsilon, and it bounds the relative error incurred when a real number is rounded to the nearest float. Be aware that some references and libraries (for example NumPy's np.finfo(float).eps) instead use the name for \(2^{-52} = 2\varepsilon\), the gap between \(1\) and the next float above it.

It is important to remember that the machine epsilon is a relative error. The gaps between adjacent floats grow as the exponent increases and shrink as it decreases; for floats between \(1\) and \(2\) the gap is \(2^{-52}\), which is twice the machine epsilon. The machine epsilon is not the smallest representable number: see the section on overflow and underflow.
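To see how the gaps scale, the sketch below uses math.ulp, which returns the spacing between a float and the next one of larger magnitude. It assumes Python 3.9 or later (where math.ulp was added); the helper relative_gap is a name invented for this example.

```python
import math
import sys

def relative_gap(x):
    """Spacing between x and the next float of larger magnitude, divided by |x|."""
    return math.ulp(x) / abs(x)

print(math.ulp(1.0) == 2**-52)            # True: the gap just above 1
print(sys.float_info.epsilon == 2**-52)   # True: Python's own "epsilon" is this gap

for x in [1.0, 1000.0, 1e15, 1e-10]:
    # The absolute gap changes by many orders of magnitude,
    # but the relative gap always lies between 2**-53 and 2**-52.
    print(x, math.ulp(x), relative_gap(x))
```

The absolute spacing varies enormously across these values, while the relative spacing stays within a factor of two of \(2^{-53}\).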

We saw above that \(e^{35}\) and \(e^{35} + 0.1\) also could not be distinguished. We can use the machine epsilon to get a rough estimate for the largest value \(\delta\) such that \(e^{35} + \delta\) is indistinguishable from \(e^{35}\) as follows:

delta = np.exp(35) * 2**-53

delta

0.17608286521236036

np.exp(35) + delta == np.exp(35)

False

np.exp(35) + 0.5 * delta == np.exp(35)

True
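The estimate is only rough because the result of an addition is rounded to the nearest float, so the real threshold is half the actual gap between floats near \(e^{35}\). A small sketch, again assuming Python 3.9 or later for math.ulp and using math.exp in place of np.exp:

```python
import math

x = math.exp(35)
half_gap = math.ulp(x) / 2   # floats near x are 0.25 apart, so this is 0.125

print(half_gap)        # 0.125
print(x + 0.12 == x)   # True: 0.12 is less than half the gap, so the sum rounds back to x
print(x + 0.13 == x)   # False: 0.13 is more than half the gap, so the sum rounds up
```

A perturbation of exactly half the gap is a tie, which is resolved by rounding to the neighbour whose significand is even.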

Binary representations#

There is another issue that can crop up with floats: the fact that they use a binary representation means that some simple decimals cannot be represented exactly. For example, the number \(0.1\) is a nice decimal fraction, but cannot be represented as a finite binary fraction. This can cause some strange effects:

a = 0.1
3 * a

0.30000000000000004

3 * a == 0.3

False

The issue here is that a will be the closest representable float to \(0.1\), and 3 * a is then not necessarily the closest float to the true value \(0.3\). You can find out the representation that Python is using:

(0.1).as_integer_ratio()

(3602879701896397, 36028797018963968)

np.log2(36028797018963968)

55.0

This means that \(0.1\) is being represented as \(\frac{3602879701896397}{2^{55}}\).
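The standard library's fractions and decimal modules can convert a float to its exact stored value, which makes the size of the representation error visible. A short sketch (the variable name stored is just for illustration):

```python
from decimal import Decimal
from fractions import Fraction

stored = Fraction(0.1)   # exact value of the float closest to 0.1

print(stored == Fraction(3602879701896397, 2**55))   # True
print(stored - Fraction(1, 10))                      # 1/180143985094819840
print(Decimal(0.1))   # 0.1000000000000000055511151231257827021181583404541015625
```

The stored value is slightly larger than \(0.1\), and this small excess is what gets multiplied up in 3 * a.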

Comparing floats#

Given the issues above, it is often not a good idea to directly compare floats x and y using x == y. Instead, consider testing their absolute difference: abs(x - y) <= err for some small tolerance err, chosen to suit the magnitude of x and y.
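Because floating point error is relative, a fixed absolute tolerance can be too strict for large numbers. One option is the standard library's math.isclose, which uses a relative tolerance by default. The example values below are chosen only for illustration.

```python
import math

x = 0.1 + 0.2
y = 0.3

print(x == y)                # False
print(abs(x - y) <= 1e-12)   # True: absolute tolerance
print(math.isclose(x, y))    # True: relative tolerance (rel_tol=1e-09 by default)

# For large numbers a fixed absolute tolerance is far too strict.
big = 1e20
print(abs((big + 1e5) - big) <= 1e-12)   # False
print(math.isclose(big + 1e5, big))      # True
```

Note that a purely relative test never succeeds when comparing a nonzero number against zero, so math.isclose also accepts an abs_tol argument for that case.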

Overflow and underflow#

As well as the limitations discussed above, caused by the number of significant bits, there are limitations caused by the fixed number of bits available for the exponent. Since we have 11 bits available for the exponent, and these must cover negative as well as positive exponents, the largest power of two that can be represented is \(2^{2^{10} - 1} = 2^{1023}\).

2.0 ** (2 ** 10 - 1)

8.98846567431158e+307

2.0 ** (2 ** 10)

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 2.0 ** (2 ** 10)

OverflowError: (34, 'Result too large')

An OverflowError occurs when the result of a calculation is too large to fit in a float.

A similar issue, called underflow, can occur when the exponent gets too small, though here we don’t get an error: the result is silently rounded to zero.

2**-1074 

5e-324

2**-1075

0.0
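The limits of the float type do not need to be found by trial and error: Python exposes them through sys.float_info. A brief sketch (the variable name smallest is illustrative):

```python
import sys

print(sys.float_info.max)                 # 1.7976931348623157e+308
print(sys.float_info.min)                 # 2.2250738585072014e-308
print(sys.float_info.min == 2.0**-1022)   # True

# Below sys.float_info.min, floats lose significant bits gradually
# until they reach the smallest positive value, 2**-1074.
smallest = 2.0**-1074
print(smallest)       # 5e-324
print(smallest / 2)   # 0.0
```

Here sys.float_info.min is the smallest positive float that still has the full 53 significant bits; the values between it and 2**-1074 are stored with reduced precision.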

Infinity and NaN#

If we directly create a float which is too large, Python will treat it like infinity.

# 2.3 * 10**310 is too large to store as a float, so Python treats it as infinity
2.3e310

inf

The other special value is nan, standing for “not a number”, which can arise if your calculations take a strange turn like multiplying infinity by 0.

# infinity times 0 is not a number
2.3e310 * 0

nan

# infinity minus infinity is not a number
2.3e310 - 4.5e350

nan
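These special values can also be created and tested for explicitly. The sketch below uses only the standard library, with variable names chosen for illustration. It also shows that float multiplication which overflows quietly gives inf rather than raising an OverflowError, and that nan is not equal to anything, including itself.

```python
import math

inf = float("inf")   # math.inf is the same value
nan = inf - inf

print(1e308 * 10)        # inf: overflow in multiplication gives infinity, not an error
print(nan == nan)        # False: nan compares unequal to everything, even itself
print(math.isinf(inf))   # True
print(math.isnan(nan))   # True: the reliable way to test for nan
```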

MT3510 Introduction to Mathematical Computing
