PRODUCT : Borland C++ NUMBER : 1027
VERSION : 3.x
OS : DOS
DATE : October 19, 1993 PAGE : 1/10
TITLE : An Overview of Floating Point Numbers.
Understanding How an IBM-PC Stores a Number
*******************************************
Binary Notation
===============
Binary notation uses 2 as the number base. A base ten system
uses the digits 0 through 9, while a binary system uses only 0
and 1. In base ten, the number 123 is represented as one 100,
two 10's and three 1's, each place being a power of ten.
Dissecting the number in base ten from right to left:
digit   place value     running total
  3     10**0 =   1           3
  2     10**1 =  10          23
  1     10**2 = 100         123
The same number (123) represented in binary would look like
this:
1111011
which doesn't really look like 123. But dissecting it from right
to left, it breaks down like this:
digit   place value     running total
  1     2**0 =  1             1
  1     2**1 =  2             3
  0     2**2 =  4             3
  1     2**3 =  8            11
  1     2**4 = 16            27
  1     2**5 = 32            59
  1     2**6 = 64           123
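The breakdown above can be sketched in C++. This is only an
illustration: the names to_binary and from_binary are made up
for this example and are not part of any Borland library.

```cpp
#include <string>

// Hypothetical helper: render a non-negative value as a binary
// string by repeatedly dividing by 2 and collecting remainders.
std::string to_binary(unsigned value) {
    if (value == 0) return "0";
    std::string digits;
    while (value > 0) {
        // The remainder is the next digit, working right to left.
        digits.insert(digits.begin(), char('0' + value % 2));
        value /= 2;
    }
    return digits;
}

// Hypothetical helper: the reverse direction. Digits are read left
// to right, doubling the running total before adding each one.
unsigned from_binary(const std::string& digits) {
    unsigned total = 0;
    for (char c : digits) total = total * 2 + (c - '0');
    return total;
}
```

Calling to_binary(123) yields the string 1111011, and
from_binary("1111011") gives back 123.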
Two's Complement
================
Binary notation is perfect for describing positive numbers
and zero. But when we want to allow for negative numbers, an
additional mechanism is needed to indicate the sign of the
number. The easiest way to do this is to use the leftmost bit to
indicate the sign of a number. For example, using 8 bits:
decimal binary
4 0000 0100
-4 1000 0100
127 0111 1111
-127 1111 1111
However, a problem arises when doing binary subtraction.
For example, subtract 1 from 0:
decimal binary
0 0000 0000
1 0000 0001
=========
-127 1111 1111
The subtraction produces the bit pattern 1111 1111, which the
sign-bit scheme reads as -127 rather than -1. So, if we want
signed arithmetic to work, we need a system in which 1111 1111
represents -1. That system is called two's complement. To
illustrate further, subtract 1 from the above result:
decimal binary
-1 1111 1111
1 0000 0001
=========
-2 1111 1110
To see if it works, try it on 3 + (-2):
decimal binary
3 0000 0011
-2 1111 1110
=========
1 0000 0001
To convert a positive number to its two's complement negative:
decimal binary
5 0000 0101
step 1: reverse the 0's and 1's
decimal binary
5' 1111 1010
step 2: add 1
decimal binary
5' 1111 1010
1 0000 0001
=========
-5 1111 1011
Now try this number by adding 7 + (-5):
decimal binary
7 0000 0111
-5 1111 1011
=========
2 0000 0010
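The two conversion steps can be written out as a short sketch.
The helper names are hypothetical, and std::uint8_t is assumed
here as a stand-in for the 8-bit values used in the examples.

```cpp
#include <cstdint>

// Hypothetical helper: negate an 8-bit value using the two steps
// shown above.
std::uint8_t twos_complement_negate(std::uint8_t value) {
    // Step 1: reverse the 0's and 1's.
    std::uint8_t inverted = static_cast<std::uint8_t>(~value);
    // Step 2: add 1 (any carry out of bit 7 is discarded).
    return static_cast<std::uint8_t>(inverted + 1);
}

// Hypothetical helper: 8-bit addition that discards the carry out
// of the high bit, as the worked examples do.
std::uint8_t add8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}
```

Negating 5 gives 0xFB (1111 1011), and adding that to 7 with
add8 gives 2, matching the table above.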
Data Formats
============
               bits   mantissa   exponent   sign
character        8        7          0        1
integer         16       15          0        1
long integer    32       31          0        1
float           32       23          8        1
double          64       52         11        1
long double     80       64         15        1
Integers
--------
There are essentially three integer formats supported by the
numeric processor: char, integer, and long integer. The format
of each is the same except for the length, so only the range
differs. Each is a flat mantissa (or magnitude) with the high bit
used for the sign, and two's complement is used for negative
numbers. A character ranges from -2**7 to 2**7 - 1 (-128 to
127), an integer from -2**15 to 2**15 - 1 (-32,768 to 32,767),
and a long integer from -2**31 to 2**31 - 1 (-2,147,483,648 to
2,147,483,647).
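These ranges can be checked with a small sketch. It assumes the
fixed-width types std::int8_t, std::int16_t and std::int32_t as
stand-ins for the 8, 16 and 32 bit formats in the table; Range
and range_of are names invented for the example.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical holder for the two ends of an integer range.
struct Range { long long lowest; long long highest; };

// Hypothetical helper: report the smallest and largest value that
// the signed integer type T can hold.
template <typename T>
Range range_of() {
    return { std::numeric_limits<T>::min(),
             std::numeric_limits<T>::max() };
}
```

For example, range_of<std::int8_t>() reports -128 and 127. The
negative end reaches one further than the positive end because
zero takes up one of the non-negative bit patterns.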
Real Numbers
------------
There are three forms of real number representation supported
by the numeric processor: float, double, and long double. Each
of these three types has three components: mantissa, exponent,
and sign.
The mantissa is stored in a form called a "normalized
mantissa". This means that the leftmost bit of the mantissa is
ASSUMED to be a one, and the IEEE format exploits this. For
example:
4 + 1 + 1/4 + 1/8
in the binary form would look like:
101.011
The "normalized" form is obtained by adjusting the exponent until
the binary point is to the right of the most significant one:
1.01011 * 2**2
and the upper one IS NOT stored, except in the case of a long
double. By not storing the most significant one, one extra bit
of precision is gained.
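The normalization step can be sketched with the standard library
function std::frexp, which splits a value into a fraction in the
range [0.5, 1) and a power of two. The struct and function names
are hypothetical, and the sketch assumes a positive, finite,
nonzero input.

```cpp
#include <cmath>

// Hypothetical holder for a normalized value.
struct Normalized { double significand; int exponent; };

// Hypothetical helper: rewrite a positive, finite value as
// significand * 2**exponent with 1 <= significand < 2.
Normalized normalize(double value) {
    int exp = 0;
    // frexp returns frac with value == frac * 2**exp and
    // 0.5 <= frac < 1.
    double frac = std::frexp(value, &exp);
    // Move the binary point one place right so that the leading
    // one sits to the left of it.
    return { frac * 2.0, exp - 1 };
}
```

For 5.375 (binary 101.011) this gives a significand of 1.34375
(binary 1.01011) and an exponent of 2.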
The exponent is stored in a form called a "biased" exponent.
The exponent field specifies the power of 2 by which the mantissa
must be multiplied to obtain the value of the floating-point
number. In order to accommodate negative exponents, the exponent
field contains the sum of the actual exponent and a positive
constant called the "bias". This bias ensures that the exponent
field is always a non-negative integer. The actual "bias" for
floats is 127, doubles is 1023, and for long doubles is 16383.
Using a float for example, suppose the exponent field contained
132:
132 - 127 = 5
So in this scenario, the power by which the mantissa must be
multiplied is 2**5. If the exponent field contained 122:
122 - 127 = -5
so the mantissa must be multiplied by 2**-5 to obtain the correct
value. The 8087 chip reserves the highest and lowest exponent
field values for special cases such as zero, denormals,
infinities and NaNs, so for a float the largest usable exponent
is 127 and the lowest is -126.
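As a rough sketch of the bias at work, the following builds a
float directly from its stored fields. It assumes that float is
the 32-bit format from the Data Formats table; the function name
is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: assemble a float from its three stored
// fields. Assumes a 32-bit float laid out as 1 sign bit, 8 exponent
// bits and 23 mantissa bits.
float float_from_fields(std::uint32_t sign,
                        std::uint32_t exponent_field,
                        std::uint32_t fraction) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch expects a 32-bit float");
    std::uint32_t bits = (sign << 31)
                       | (exponent_field << 23)
                       | (fraction & 0x7FFFFFu);
    float value;
    // Copy the bit pattern into the float without converting it.
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

With a mantissa field of zero the implied leading one makes the
mantissa exactly 1.0, so an exponent field of 132 yields 2**5
(32.0) and an exponent field of 122 yields 2**-5 (0.03125).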
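The sign field occupies the high order bit, just as in integers.
If the bit is 0, the number is positive; if it is 1, the number
is negative. Unlike integers, real numbers do not use two's
complement: flipping the sign bit alone negates the value.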
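To tie the three components together, here is a sketch that
pulls the sign, exponent and mantissa fields out of a float. It
assumes a 32-bit float with the layout from the Data Formats
table; FloatFields and decompose are hypothetical names.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical holder for the three stored fields of a float.
struct FloatFields {
    std::uint32_t sign;            // 1 bit
    std::uint32_t exponent_field;  // 8 bits, still biased by 127
    std::uint32_t fraction;        // 23 bits, leading one not stored
};

// Hypothetical helper: split a float into its stored fields.
FloatFields decompose(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch expects a 32-bit float");
    std::uint32_t bits;
    // Copy the bit pattern out of the float without converting it.
    std::memcpy(&bits, &value, sizeof bits);
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}
```

For -5.375 (binary -1.01011 * 2**2) the sign is 1, the exponent
field is 129 (2 + 127), and the mantissa field is 0x2C0000, which
is the bits 01011 followed by zeros. For +5.375 only the sign
changes.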
Significant Digit Precision Accuracy
------------------------------------
So how accurate (how many decimal digits of precision) are
floats, doubles and long doubles? In <float.h> these constants
are defined:
Mantissa Digits:
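#define FLT_MANT_DIG 24
#define DBL_MANT_DIG 53
#define LDBL_MANT_DIG 64
If you take 2 and raise it to each of these powers (approximate):
2 ** FLT_MANT_DIG = 1.7 * 10**7
2 ** DBL_MANT_DIG = 9.0 * 10**15
2 ** LDBL_MANT_DIG = 1.8 * 10**19
and thus the decimal precision of each of the real types can be
approximated (a float falls one digit short of the power of ten
shown, because not every 7 digit decimal survives a round trip
through 24 bits):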
#define FLT_DIG 6
#define DBL_DIG 15
#define LDBL_DIG 19
Exponent Range:
#define FLT_MAX_10_EXP +38
#define FLT_MIN_10_EXP -37
#define DBL_MAX_10_EXP +308
#define DBL_MIN_10_EXP -307
#define LDBL_MAX_10_EXP +4932
#define LDBL_MIN_10_EXP -4931
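The link between mantissa bits and decimal digits can be sketched
with the formula the C standard gives for binary formats,
floor((p - 1) * log10(2)), where p is the number of mantissa bits
counting the implied leading one. The function name is
hypothetical, and the check against FLT_DIG and DBL_DIG assumes
the compiler's <cfloat> describes IEEE float and double.

```cpp
#include <cfloat>
#include <cmath>

// Hypothetical helper: the number of decimal digits that are
// guaranteed to survive a round trip through a binary mantissa of
// mantissa_bits bits (counting the implied leading one).
int guaranteed_decimal_digits(int mantissa_bits) {
    return static_cast<int>(
        std::floor((mantissa_bits - 1) * std::log10(2.0)));
}
```

For float and double this reproduces the 6 and 15 digits listed
under FLT_DIG and DBL_DIG.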
8087 Status Word format
-----------------------
Error conditions or exceptions sometimes arise during the
execution of floating-point operations, the most common one
being division by zero. However, other exceptions are possible.
These conditions are held in the 8087 Status Word, and the
Borland C/C++ compilers define the following exception
conditions:
#define SW_INVALID 0x0001 /* Invalid operation */
This exception occurs when no other recovery action is
possible, and is the most serious error. If an invalid operation
exception occurs within an operation, the operation returns a
NaN, which stands for Not a Number. A NaN is a value whose
exponent field contains all 1's and whose mantissa has anything
other than 0's in it.
#define SW_DENORMAL 0x0002 /* Denormalized operand */
A denormalized operand exception occurs when an operand is a
denormal, a number in which precision has been sacrificed in
order to increase range. The 8087 tries to prevent the inferior
precision of denormals from corrupting the precision of the rest
of the computation by providing a warning. A denormal usually
arises as the result of an operation that underflowed while the
underflow exception was masked in the control word.
#define SW_ZERODIVIDE 0x0004 /* Zero divide */
This exception occurs whenever an attempt to divide by either
+0 or -0 is made.
#define SW_OVERFLOW 0x0008 /* Overflow */
This exception occurs whenever an attempt is made to
represent a number which is too big to represent in IEEE format.
#define SW_UNDERFLOW 0x0010 /* Underflow */
This exception occurs whenever an attempt is made to
represent a number which is too small to represent in the IEEE
format.
#define SW_INEXACT 0x0020 /* Precision (Inexact result)*/
This exception occurs when a number cannot be exactly
represented, and thus will be approximated. If this exception is
masked, then the rounding control of the Control Word will be
used.
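The SW_ constants above are specific to Borland's headers. As a
portable sketch of the same idea, the standard <cfenv> header
exposes comparable flags (FE_INVALID, FE_DIVBYZERO, FE_OVERFLOW,
FE_UNDERFLOW and FE_INEXACT). Treating these as counterparts of
the SW_ bits is an assumption made for this example, and standard
C++ has no flag for the denormal condition. The names
DivisionReport and checked_divide are hypothetical.

```cpp
#include <cfenv>
#include <cmath>

// Hypothetical holder for a quotient and the flags it raised.
struct DivisionReport { double quotient; int flags; };

// Hypothetical helper: divide, then report which standard
// exception flags the division raised.
DivisionReport checked_divide(double numerator, double denominator) {
    // volatile keeps the compiler from doing the division at
    // compile time, where no run-time flags would be raised.
    volatile double n = numerator;
    volatile double d = denominator;
    std::feclearexcept(FE_ALL_EXCEPT);   // start with a clean slate
    volatile double q = n / d;
    int flags = std::fetestexcept(FE_ALL_EXCEPT);
    return { q, flags };
}
```

Dividing 0.0 by 0.0 raises FE_INVALID and returns a NaN, 1.0 by
0.0 raises FE_DIVBYZERO, 1.0 by 3.0 raises FE_INEXACT, and 1.0 by
2.0 raises nothing because 0.5 is exactly representable.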
8087 Control Word
-----------------
The Control Word controls the actions taken when an exception
is generated. If the mask bit for a particular exception is 0,
the program will suspend operation when that exception occurs. If
the mask bit is 1, the corresponding exception is masked and
exception values are produced. The Borland C/C++ compilers
address all six exception mask bits of the control word with the
mask 0x3F.
#define MCW_EM 0x003f /* interrupt Exception Masks*/
When a masked exception is encountered, appropriate corrective
action is taken to fix up the result.
#define EM_INVALID 0x0001 /* invalid */
#define EM_DENORMAL 0x0002 /* denormal */
#define EM_ZERODIVIDE 0x0004 /* zero divide */
#define EM_OVERFLOW 0x0008 /* overflow */
#define EM_UNDERFLOW 0x0010 /* underflow */
#define EM_INEXACT 0x0020 /* inexact(precision)*/
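How these mask bits combine can be modelled with plain integer
arithmetic. The sketch below does not touch the 8087 at all: it
only shows the bit manipulation, using values copied from the
definitions above. All of the names are hypothetical, chosen so
they do not clash with the real macros.

```cpp
// Values copied from the definitions above.
const unsigned kExceptionMasks = 0x003f;  // same value as MCW_EM
const unsigned kZeroDivide     = 0x0004;  // same value as EM_ZERODIVIDE
const unsigned kOverflow       = 0x0008;  // same value as EM_OVERFLOW

// Hypothetical model of a control word update: the bits selected
// by mask are taken from new_bits, all other bits keep their old
// value.
unsigned update_control_word(unsigned old_word, unsigned new_bits,
                             unsigned mask) {
    return (old_word & ~mask) | (new_bits & mask);
}

// Hypothetical helper: is the given exception masked (bit set to
// 1) in this control word?
bool is_masked(unsigned control_word, unsigned exception_bit) {
    return (control_word & exception_bit) != 0;
}
```

Starting from a word with all six exceptions masked (0x3f),
clearing only the zero divide bit leaves 0x3b: zero divide is now
unmasked while overflow and the others stay masked.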
The Control Word contains three additional bit fields for
correcting exceptions:
#define MCW_IC 0x1000 /* Infinity Control */
#define IC_PROJECTIVE 0x0000 /* projective */
#define IC_AFFINE 0x1000 /* affine */
The first is the infinity control bit. If the bit is 0,
infinity is treated as projective when conditions such as
division by zero occur. This projective form is the default, and
the most conservative. If the bit is 1, infinity is treated as
affine, which is more liberal, but the programmer should analyze
how introducing infinity into a calculation could affect the
program.
#define MCW_RC 0x0c00 /* Rounding Control */
#define RC_CHOP 0x0c00 /* chop */
#define RC_UP 0x0800 /* up */
#define RC_DOWN 0x0400 /* down */
#define RC_NEAR 0x0000 /* near */
The next 2-bit field is the rounding control field. The
rounding of an inexact result occurs according to this field,
which is typically set to RC_NEAR.
#define MCW_PC 0x0300 /* Precision Control */
#define PC_24 0x0000 /* 24 bits */
#define PC_53 0x0200 /* 53 bits */
#define PC_64 0x0300 /* 64 bits */
The purpose of this bit field is to cause the 8087 to round
all numbers to something less than extended precision before
placing them in numeric registers. However, the C language
requires that all intermediate results be stored as long doubles,
so for C/C++ the precision control is set to PC_64. This field
is available for compatibility with programs coming from other
operating systems.
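The effect of the rounding control field described earlier can be
tried out portably with the standard <cfenv> rounding modes. The
pairing assumed here (RC_NEAR with FE_TONEAREST, RC_UP with
FE_UPWARD, RC_DOWN with FE_DOWNWARD, RC_CHOP with FE_TOWARDZERO)
is a guess based on the names and is not taken from Borland's
documentation. The function name is hypothetical.

```cpp
#include <cfenv>
#include <cmath>

// Hypothetical helper: round value to a whole number under the
// given <cfenv> rounding mode, then restore the previous mode.
double round_with_mode(double value, int mode) {
    const int saved = std::fegetround();
    std::fesetround(mode);
    // volatile keeps the rounding at run time, after the mode has
    // actually been changed.
    volatile double v = value;
    volatile double rounded = std::nearbyint(v);
    std::fesetround(saved);
    return rounded;
}
```

Rounding 2.5 gives 2.0 under FE_TONEAREST (ties go to the even
neighbour) and 3.0 under FE_UPWARD. Rounding -2.5 gives -3.0
under FE_DOWNWARD and -2.0 under FE_TOWARDZERO.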
Round-Off Problems
==================
One of the problems with floating point numbers is round-off.
Round off errors occur when attempting to represent certain
numbers in any number base. For example, 1/3 is not exactly
representable in base ten, while 1/10th is easily representable.
But since we're dealing with computers, we are specifically
interested in base two. As opposed to base ten, 1/10th is not
exactly representable in base two. For example, the fractional
portions of base two are:
1/2 1/4 1/8 1/16 1/32 1/64 1/128 1/256 1/512
The numbers 1/2, 1/4, 1/8, all powers of two, are exactly
representable in a computer, as is any finite sum of them. But
1/10 cannot be written as a finite sum of such fractions, so it
is not exactly representable using binary notation. So
internally the computer has to decide which fractional binary
portions to add together to sum close to 1/10. For example:
1/2  1/4  1/8  1/16  1/32  1/64  1/128  1/256  1/512
 0    0    0    1     1     1      0      0      0
this adds up to 0.109375, which is close to 0.1000 but would
more naturally be rounded to 0.11, so the computer's internal
algorithm must try to find another combination of binary
fractions which comes closer to 0.1000. When its internal
algorithm is satisfied, it will have a number which is CLOSE to
1/10th but not EXACT. This inexactness is known as ROUND-OFF
error.
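One simple way to pick the binary fractions is sketched below. It
is only an illustration of the idea and not necessarily what the
hardware or the compiler does: it walks through 1/2, 1/4, 1/8 and
so on, keeping each fraction that does not push the sum past the
target. The names are hypothetical, and the target is assumed to
lie between 0 and 1.

```cpp
#include <string>

// Hypothetical holder for the chosen bits and the sum they reach.
struct BinaryFraction { std::string bits; double value; };

// Hypothetical helper: greedily choose among 1/2, 1/4, 1/8, ...
// (bit_count terms in all), keeping each term that does not push
// the running sum above target. Assumes 0 <= target < 1.
BinaryFraction approximate(double target, int bit_count) {
    BinaryFraction result{ "", 0.0 };
    double term = 0.5;
    for (int i = 0; i < bit_count; ++i) {
        if (result.value + term <= target) {
            result.value += term;   // keep this fraction
            result.bits += '1';
        } else {
            result.bits += '0';     // skip it, it would overshoot
        }
        term /= 2.0;                // move on to the next fraction
    }
    return result;
}
```

With the nine fractions listed above, approximate(0.1, 9) selects
1/16, 1/32, 1/256 and 1/512 (bits 000110011), which sum to
0.099609375. Adding more bits brings the sum closer to 0.1, but no
finite number of bits reaches it exactly.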
Floating Point Round Off Error
------------------------------
Round off error is especially noticeable in the smallest
floating point data type available: the float. The float data
type is four bytes in length, and uses these bytes to hold the
mantissa, exponent, and sign of the number. The following
program demonstrates that round off error with floating point
numbers occurs even with simple assignments:
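#include <iostream>
#include <iomanip>
using namespace std;

int main()
{
    float number = 123.45;
    // Ask for more digits than a float can hold, so that the
    // stored value is shown rather than a rounded display.
    cout << setprecision(10) << number << endl;
    return 0;
}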
The round off error can be significant when doing multiple or
iterative calculations, as the following program illustrates:
#include <iostream>
#include <iomanip>
using namespace std;

int main()
{
    float anumber = 1.693 / 10.0;
    float original = 1000000.00;
    int i, j;

    // Multiply ten times ...
    for (i=0; i<10; i++)
    {
        original = original * anumber;
    }
    // ... then divide ten times, which on paper undoes the
    // multiplications exactly.
    for (j=0; j<10; j++)
    {
        original = original / anumber;
    }
    // Show enough digits for any drift from 1000000 to be visible.
    cout << setprecision(10) << original << endl;
    return 0;
}
At the end of ten multiplications and divisions, the original
number is off by 0.1875. Increasing the size from a 4 byte to an
8 byte real improves things somewhat, as the following code
illustrates:
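#include <iostream>
#include <iomanip>
using namespace std;

int main()
{
    double anumber = 1.693 / 10.0;
    double original = 1000000.00;
    int i, j;

    // Multiply ten times ...
    for (i=0; i<10; i++)
    {
        original = original * anumber;
    }
    // ... then divide ten times, which on paper undoes the
    // multiplications exactly.
    for (j=0; j<10; j++)
    {
        original = original / anumber;
    }
    // Show enough digits for any drift from 1000000 to be visible.
    cout << setprecision(17) << original << endl;
    return 0;
}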
The difference between the original and the calculated value is
only 0.0625 with doubles. The exact figures depend on the
compiler and on the precision used for intermediate results.