Floating Point Number Representation


We have seen how to represent integers - positive and negative - in binary. Unfortunately, in the real world we must deal with numbers which are not integers, numbers such as 1/2, 3.1415926, 2.71828, etc.

We can construct binary fractions using a binary "point", in the same way as we do with decimal:

	1 0 1 . 1 0 1
	| | |   | | |
	| | |   | | 2^-3 (1/8)
	| | |   | 2^-2   (1/4)
	| | |   2^-1     (1/2)
	| | 2^0          (1)
	| 2^1            (2)
	2^2              (4)

And so the above number represents 4 + 1 + 1/2 + 1/8 = 5 5/8 (5.625)
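The place-value sum above is easy to mechanise. Here is a small Python sketch (the function name `binary_fraction` is mine, not from any standard library) that evaluates a binary numeral containing a binary point:

```python
def binary_fraction(s):
    """Value of a binary numeral with a binary point, e.g. "101.101".
    Digits left of the point weigh ..., 2^2, 2^1, 2^0; digits to the
    right weigh 2^-1, 2^-2, 2^-3, ..."""
    whole, _, frac = s.partition(".")
    value = int(whole, 2) if whole else 0       # integer part in base 2
    for i, bit in enumerate(frac, start=1):
        value += int(bit) * 2 ** -i             # 1/2, 1/4, 1/8, ...
    return value

print(binary_fraction("101.101"))   # 4 + 1 + 1/2 + 1/8 = 5.625
```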

There are several problems with this, unfortunately, the most obvious being that very large and very small numbers need an impractically long string of digits.

Scientists have faced a similar problem for decades when writing down calculations. For example, looking quickly, are the following two numbers the same: 0.0000000000000000000000024 and 0.000000000000000000000000024? How about 5300000000000000000000 and 53000000000000000000000?

In order to avoid this problem (and save writing) scientists have developed the so-called scientific notation for writing down both very large and very small numbers. The general format is:

[S] M e E

Where S is an optional sign, M is the mantissa and E is the exponent.

The number written as

-1.25e20

signifies -1.25 x 10^20 or -125000000000000000000

Note that the exponent can also be signed, to represent extremely small numbers:

-1.25e-20

signifies -1.25 x 10^-20 or -0.0000000000000000000125

Floating-point representation is the binary equivalent of scientific notation. In floating-point the number is divided into two parts: the exponent and the mantissa (aka significand).

Because these numbers are being represented inside a computer, inevitably there will be limits to each of the field lengths within the storage used. In the case of the IEEE 754 standard (see below), a single-precision number uses 1 bit for the mantissa's sign, 8 for the exponent and 23 for the mantissa's magnitude.

In order to maximise the precision (number of significant digits) of numbers stored they are normalised; this means that the exponent is adjusted so that there are no significant bits before the binary point.

In decimal the equivalent would be to have no significant digits before the decimal point. In other words 1.25e10 normalised would be 0.125e11; 634259876e4 normalised would be 0.634259876e13
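That normalisation step can be sketched in Python, using the document's decimal convention of shifting the mantissa into the range [0.1, 1). The name `normalise` is illustrative, and real hardware of course works on binary fields rather than by repeated division:

```python
def normalise(mantissa, exponent):
    """Shift the mantissa into [0.1, 1), adjusting the exponent to
    compensate, so that no significant digits sit before the point."""
    if mantissa == 0:
        return 0.0, 0
    sign = -1 if mantissa < 0 else 1
    m = abs(mantissa)
    while m >= 1:          # too big: move the point left
        m /= 10
        exponent += 1
    while m < 0.1:         # too small: move the point right
        m *= 10
        exponent -= 1
    return sign * m, exponent

print(normalise(1.25, 10))   # (0.125, 11), as in the text
```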

Multiplying and dividing FP numbers is simple; addition and subtraction pose problems.

For example multiplying 3e7 by 4e-2 is simple: the rule (algorithm) says multiply the mantissas and add the exponents, so the result is:

12e5 (3x4 = 12, 7 + -2 = 5).

Or, using the normalised versions:

0.3e8 x 0.4e-1 = 0.12e7 (which is the normalised version of 12e5)

For division we simply divide the mantissas and subtract the exponents.
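Representing each number as a (mantissa, exponent) pair, both rules can be sketched directly in Python. The names `fp_mul` and `fp_div` are illustrative, and no rounding or re-normalisation is modelled:

```python
def fp_mul(a, b):
    """Multiply (mantissa, exponent) pairs: multiply mantissas, add exponents."""
    (ma, ea), (mb, eb) = a, b
    return ma * mb, ea + eb

def fp_div(a, b):
    """Divide (mantissa, exponent) pairs: divide mantissas, subtract exponents."""
    (ma, ea), (mb, eb) = a, b
    return ma / mb, ea - eb

print(fp_mul((3, 7), (4, -2)))   # (12, 5), i.e. 12e5
print(fp_div((12, 5), (4, -2)))  # (3.0, 7), i.e. 3e7 back again
```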

But how can we add 0.3e3 and 0.25e-1?

Firstly we must arrange matters so that both numbers have the same exponent, a process usually called denormalisation. This can be time-consuming, but it is necessary.

In this case, suppose that the numbers are denormalised to 3e2 and 0.00025e2, the sum is then easily calculated as 3.00025e2, which normalises to 0.300025e3.
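The same pair representation shows the extra work addition needs. This sketch (assumed decimal, with unlimited mantissa precision, unlike real hardware) denormalises the smaller operand before adding; `fp_add` is an illustrative name:

```python
def fp_add(a, b):
    """Add (mantissa, exponent) pairs by denormalising the operand with
    the smaller exponent so that both share the larger exponent."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:                      # make a the larger-exponent operand
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb *= 10 ** (eb - ea)            # shift b's mantissa down to match
    return ma + mb, ea

m, e = fp_add((0.3, 3), (0.25, -1))
print(m, e)   # approximately 0.300025, 3
```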


Floating point numbers arrived at in calculations are always approximations. Historically, floating-point representations differed between manufacturers and between different machines from the same manufacturer, had differing accuracies (for example, the IBM Series 360's floating point was, in some cases, less accurate than that of the IBM 7090, an earlier machine) and even used different bases (10, 16, 2).

Programmers had to learn (and must still learn some of) strange and counter-intuitive procedures in order to work successfully with floating point.

Floating Point Arithmetic is NOT associative

In arithmetic we rely on the fact that (x + y) + z = x + (y + z). In floating point arithmetic, it ain't necessarily so.

To see why imagine a decimal floating point scheme with the following limitations: 6 digits of mantissa, 3 digits of exponent. Now imagine the following addition:

1.0e20 + 1.0 + -1.0e20 (10^20 + 1 - 10^20)

Let us assume addition from left to right; in our representation the numbers will be normalised as 0.1e21, 0.1e1, -0.1e21.

When we try to denormalise the first two to use the same exponent we have a problem: 10000000000000000000e1 and 0.1e1 have the same exponent, and so do 0.1e21 and 0.000000000000000000001e21, but in both cases one of the two mantissas is far too large for our representation and the smaller number will inevitably be truncated to zero, so:

1.0e20 + 1.0 = 1.0e20, hence (1.0e20 + 1.0) + -1.0e20 = 0.0 and not 1.0 as expected.

In this case our representation prevents doing this calculation accurately (I told you everything was approximate).
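The effect is easy to reproduce with ordinary IEEE 754 doubles in Python (about 16 significant decimal digits rather than the toy scheme's 6):

```python
# 1.0 is far below 1e20's precision, so it is absorbed entirely:
print((1.0e20 + 1.0) + -1.0e20)    # 0.0, not the mathematically correct 1.0

# Grouping alone can change a result, i.e. addition is not associative:
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)               # False
```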

There are other occasions when we can get the right answer but only if we add the numbers in the correct order; the rule is: when adding (or subtracting) floating point numbers add from the smallest to the largest.

To see why, consider this example:

0.1e3 + 0.1e-4 + ...

If we add these two numbers, we shall get 0.1e3 (for the same reasons as above, we can't denormalise both numbers accurately). But, if these are just the first two terms of a long series, and the next 1000 numbers are all 0.1e-4, then these 1000 items add up to 0.1e-1 and we can add that accurately to 0.1e3: 0.1e3 + 0.00001e3 = 0.10001e3.
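Doubles in Python show the same phenomenon: added one at a time to a large value, ten thousand tiny terms vanish entirely, but summed first (smallest to largest) their contribution survives:

```python
# Each 1e-16 is below half a unit in the last place of 1.0,
# so every individual addition rounds straight back to 1.0:
total = 1.0
for _ in range(10_000):
    total += 1e-16
print(total)                            # 1.0 -- all the small terms were lost

# Summing the small terms first preserves their combined contribution:
print(sum([1e-16] * 10_000) + 1.0)      # slightly greater than 1.0
```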

The introduction of the IEEE 754 standard (and its successor IEEE 854, which is virtually identical) has at least guaranteed that almost every computer in use today is compatible in the floating point area (although some major exceptions are still being phased out: IBM mainframes and Cray supercomputers, to name two).

Testing for equality

Testing two floating point numbers to see if they are equal is asking for trouble.

Are the following two numbers equal? 1.000000000000000000 and 1.000000000000000000000000000001? Obviously they are not equal mathematically speaking, but if each is the result of a calculation (necessarily approximate, to rub the point in) shouldn't we perhaps consider them equal?

Usually in floating point arithmetic we'll check two values to see if they are close enough (obviously the meaning will change for different algorithms). As in

if abs(x - y) < delta

where delta (traditional term) is a small number representing how close is close enough.
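A minimal Python version of the test (the name `close_enough` and its absolute-only tolerance are illustrative; the standard library's math.isclose also supports a relative tolerance, which is usually what large values need):

```python
import math

def close_enough(x, y, delta=1e-9):
    """Treat x and y as equal when they differ by less than delta."""
    return abs(x - y) < delta

print(0.1 + 0.2 == 0.3)               # False: the two sides differ in the last bit
print(close_enough(0.1 + 0.2, 0.3))   # True
print(math.isclose(0.1 + 0.2, 0.3))   # True: the library equivalent
```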


Related to the above is the danger of rounding. The basic rule is simple: don't round until you absolutely must.

To see the dangers, imagine we are rounding to 1 decimal place after each addition and we want to add the following numbers: 1.23, 2.32, 6.45. The true sum, to one decimal place, should be 10.0 (these numbers are not chosen at random, folks).

But, if we round to one decimal place after each addition:

1.23 + 2.32 = 3.55, which rounds up to 3.6

3.60 + 6.45 = 10.05, which rounds up to 10.1

A 1% inaccuracy in just two additions!
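Python's decimal module makes this easy to check exactly (`round1` is an illustrative helper; ROUND_HALF_UP matches the "rounds up" behaviour used above):

```python
from decimal import Decimal, ROUND_HALF_UP

def round1(x):
    """Round a Decimal to one decimal place, halves going up."""
    return x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

a, b, c = Decimal("1.23"), Decimal("2.32"), Decimal("6.45")

print(round1(a + b + c))           # 10.0: round once, at the end
print(round1(round1(a + b) + c))   # 10.1: rounding after every addition drifts
```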

Historical Oddities

In addition to learning the rules above, historically suspect implementations (all manufacturers were guilty at one time or another) meant that programmers had to be wary of things like:

	if x == 0.0
	   z = 1.0
	else
	   z = y / x

In one case (the CDC6600) the comparison was done by the CPU's add unit and the division (obviously) by the divide unit. Unfortunately the two units tested different numbers of bits to decide whether a number was (effectively) zero. This meant that for very small x the test for equal to zero would fail (the adder didn't consider the number zero), but the divide unit would consider x to be zero and would cause a divide by zero hardware error and the program would crash.

The "fix" was to program

	if (1.0 * x) == 0.0
	   z = 1.0
	else
	   z = y / x

Unfortunately, this was not portable and would fail on other hardware.

For similar reasons, on the original IBM S/360 the test:

	if (1.0 * x) == x

would also fail. And many programmers became accustomed (on several machines) to writing:

	y = (0.5 - x) + 0.5

instead of

	y = 1.0 - x

which would give the wrong answer.

IEEE 754 Standard

	bit:  0 | 1 -------- 8 | 9 ------------------ 31
	      s |   exponent   | mantissa/significand

IEEE 754 floating-point: single precision (32-bit) format

	bit:  0 | 1 -------- 11 | 12 ----------------- 63
	      s |   exponent    | mantissa/significand

IEEE 754 floating-point: double precision (64-bit) format

If we use S for the sign, E for the exponent and M for the mantissa (significand), then the number stored is equivalent to:

(-1)^S x M x 2^E

The IEEE standard has been in use since ~1980 (although it was not ratified until 1985) and uses some at-first-sight strange optimisations to fit as much as possible into as small a space as possible.

For example, when we normalise a binary number, as it says above, the exponent is adjusted so that there are no significant bits before the binary point.

This means that the first bit after the binary point will always be a 1 (it wouldn't be significant if it was a zero), so it is not stored! This increases single-precision accuracy to 24 bits.

Furthermore, the exponent is stored in what is called biased format: whatever the exponent field contains, you must subtract 127 (1023 in double precision) to get the true exponent (so 1 => -126, 254 => 127). The reason for this is to make sorting floating point numbers easier - and we have just seen why one might want to sort them.
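Both the hidden bit and the bias are visible if you pull a single-precision value apart with Python's struct module (`single_fields` is an illustrative name; the fraction field holds the mantissa bits left after the implicit leading 1 is dropped):

```python
import struct

def single_fields(x):
    """Split a value's IEEE 754 single-precision encoding into its
    sign bit, raw (biased) exponent field, true exponent and fraction."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # the raw 32 bits
    sign = bits >> 31
    exp_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exp_field, exp_field - 127, fraction    # bias is 127

print(single_fields(1.0))    # (0, 127, 0, 0): stored exponent 127 means 2^0
print(single_fields(-2.0))   # (1, 128, 1, 0)
```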

There are also special patterns which don't represent normal numbers. The full IEEE 754 layout is given below:

	Single precision                  Double precision
	Exponent    Significand           Exponent    Significand
	(8 bits)    (mantissa) (23 bits)  (11 bits)   (mantissa) (52 bits)  Represents
	0           0                     0           0                     0
	0           nonzero               0           nonzero               +/- denormalised number
	1-254       anything              1-2046      anything              +/- normalised floating point number
	255         0                     2047        0                     +/- infinity
	255         nonzero               2047        nonzero               NaN (Not a Number)

IEEE 754/854 Floating Point layout

A denormalised number is a way of allowing very small values (which don't have a 1 immediately after the binary point) and is used in specialised operations.

The two representations for + and -infinity mean that a division by zero can be dealt with without having to cause a run-time hardware error. NaN values result from attempts to divide zero by zero, or subtract infinity from itself.
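Python exposes these special values directly (note that Python itself traps 1.0 / 0.0 with an exception, but overflow and arithmetic on infinities show the underlying IEEE behaviour):

```python
import math

# Overflowing past the largest double gives infinity, not a hardware error:
print(1e308 * 10)                 # inf
print(-1e308 * 10)                # -inf

# Indeterminate operations produce NaN, which is not equal even to itself:
nan = math.inf - math.inf
print(math.isnan(nan))            # True
print(nan == nan)                 # False
```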

The Bottom Line

Dealing with floating point is tricky: if you don't have to, don't.