PRODUCT  :  Borland C++                           NUMBER  :  1027
VERSION  :  3.x
     OS  :  DOS
   DATE  :  October 19, 1993                        PAGE  :  1/10

  TITLE  :  An Overview of Floating Point Numbers.


Understanding How an IBM-PC Stores a Number
*******************************************

Binary Notation
===============
Binary notation uses 2 as the number base. A base ten system uses
the digits 0 through 9, while binary uses only the digits 0 and 1.
In base ten, the number 123 is represented as one 100, two 10's
and three 1's, each place being a power of ten. Dissecting the
number in base ten from right to left:

     digit     place value       Total
       3       10**0 =   1           3
       2       10**1 =  10          23
       1       10**2 = 100         123

The same number (123) represented in binary looks like this:

     1111011

which doesn't really look like 123. But dissecting it from right
to left, it breaks down like this:

     digit     place value       Total
       1       2**0 =  1             1
       1       2**1 =  2             3
       0       2**2 =  4             3
       1       2**3 =  8            11
       1       2**4 = 16            27
       1       2**5 = 32            59
       1       2**6 = 64           123

Two's Complement
================
Binary notation is perfect for describing positive numbers and
zero. But when we want to allow for negative numbers, an
additional mechanism is needed to indicate the sign of the
number. The easiest way to do this is to use the leftmost bit to
indicate the sign of a number. For example, using 8 bits:

     decimal       binary
         4        0000 0100
        -4        1000 0100
       127        0111 1111
      -127        1111 1111

This introduces a problem when doing binary subtraction, however.
For example, subtract 1 from 0:

     decimal       binary
         0        0000 0000
     -   1        0000 0001
                  =========
      -127        1111 1111

The correct answer is -1, but under the sign-bit scheme the
resulting pattern 1111 1111 reads as -127. So, if we want signed
numbers that subtract correctly, we need a system in which the
pattern 1111 1111 is equivalent to -1. That system is called
two's complement.

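The two readings of 1111 1111 can be checked with a short sketch. It is written in standard C++ (not Borland C++ 3.x), and the helper names bits8, sub8, as_sign_magnitude and as_twos_complement are hypothetical names chosen for illustration only.

```cpp
#include <cstdint>
#include <string>

// Format an 8-bit pattern as two groups of four bits, e.g. 123 -> "0111 1011".
std::string bits8(std::uint8_t v) {
    std::string s;
    for (int i = 7; i >= 0; --i) {
        s += ((v >> i) & 1) ? '1' : '0';
        if (i == 4) s += ' ';
    }
    return s;
}

// 8-bit binary subtraction; the borrow out of the top bit is discarded.
std::uint8_t sub8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a - b);
}

// Read a pattern as sign bit plus 7-bit magnitude (the "easiest way").
int as_sign_magnitude(std::uint8_t v) {
    int magnitude = v & 0x7F;
    return (v & 0x80) ? -magnitude : magnitude;
}

// Read the same pattern as two's complement: the top bit weighs -128.
int as_twos_complement(std::uint8_t v) {
    return (v & 0x80) ? static_cast<int>(v) - 256 : static_cast<int>(v);
}
```

Here sub8(0, 1) produces the pattern 1111 1111, which as_sign_magnitude reads as -127 and as_twos_complement reads as -1.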
To further exemplify, subtract a positive 1 from the result of
0 - 1, which is -1 (1111 1111):

     decimal       binary
        -1        1111 1111
     -   1        0000 0001
                  =========
        -2        1111 1110

To see if it works, try it on 3 + (-2). The carry out of the
leftmost bit is discarded:

     decimal       binary
         3        0000 0011
     +  -2        1111 1110
                  =========
         1        0000 0001

To convert a positive number to its negative in two's complement:

     decimal       binary
         5        0000 0101

step 1: reverse the 0's and 1's

     decimal       binary
         5'       1111 1010

step 2: add 1

     decimal       binary
         5'       1111 1010
     +   1        0000 0001
                  =========
        -5        1111 1011

Now try this number by adding 7 + (-5):

     decimal       binary
         7        0000 0111
     +  -5        1111 1011
                  =========
         2        0000 0010

Data Formats
============
                    bits    mantissa    exponent    sign
     character        8         7           0         1
     integer         16        15           0         1
     long integer    32        31           0         1
     float           32        23           8         1
     double          64        52          11         1
     long double     80        64          15         1

Integers
--------
There are essentially three integer formats supported by the
numeric processor: char, integer, and long integer. The format of
each is the same except for the length, thus only the range is
different for each. The format is a flat mantissa (or magnitude)
and the high bit is used for the sign. Two's complement is used
for negative numbers. The range for a character is -2**7 to
2**7 - 1 (-128 to 127), for an integer -2**15 to 2**15 - 1
(-32,768 to 32,767), and for a long integer -2**31 to 2**31 - 1
(-2,147,483,648 to 2,147,483,647).

Real Numbers
------------
There are three forms of real number representation supported by
the numeric processor: float, double, and long double. Each of
these three types has three components: mantissa, exponent, and
sign.

The mantissa is stored in a form called a "normalized mantissa".
This means that the leftmost bit of the mantissa is ASSUMED to be
a one, and the IEEE format exploits this.
For example:

     4 + 1 + 1/4 + 1/8

in binary form would look like:

     101.011

The "normalized" form is obtained by adjusting the exponent until
the binary point is just to the right of the most significant
one:

     1.01011 * 2**2

and the upper one IS NOT stored, except in the case of a long
double. By not storing the most significant one, one more bit of
precision is obtained from the same number of stored bits.

The exponent is stored in a form called a "biased" exponent. The
exponent field specifies the power of 2 by which the mantissa
must be multiplied to obtain the value of the floating-point
number. In order to accommodate negative exponents, the exponent
field contains the sum of the actual exponent and a positive
constant called the "bias". This bias ensures that the exponent
field is never negative. The actual "bias" for floats is 127,
for doubles 1023, and for long doubles 16383. Using a float as an
example, suppose the exponent field contained 132:

     132 - 127 = 5

So in this scenario, the power by which the mantissa must be
multiplied is 2**5. If the exponent field contained 122:

     122 - 127 = -5

so the mantissa must be multiplied by 2**-5 to obtain the correct
value.

The 8087 chip reserves the highest and lowest exponent field
values for handling errors and special values, so for a float the
largest actual exponent is 127 and the lowest is -126.

The sign field is the high order bit, as in integers. If the high
order bit is 0, the number is positive; if 1, the number is
negative. Unlike integers, the mantissa is not stored in two's
complement; only the sign bit differs between a number and its
negative.

Significant Digit Precision Accuracy
------------------------------------
So how accurate (how many decimal digits of precision) are
floats, doubles and long doubles?
In <float.h> these constants are defined:

Mantissa Digits (including the implied leading one for float and
double):

     #define FLT_MANT_DIG     24
     #define DBL_MANT_DIG     53
     #define LDBL_MANT_DIG    64

If you take 2 and raise it to each of these powers (approximate):

     2 ** FLT_MANT_DIG   =   1.7 * 10**7
     2 ** DBL_MANT_DIG   =   9.0 * 10**15
     2 ** LDBL_MANT_DIG  =   1.8 * 10**19

and thus the number of decimal digits each of the real types can
be relied on to hold can be approximated:

     #define FLT_DIG          6
     #define DBL_DIG         15
     #define LDBL_DIG        19

Exponent Range:

     #define FLT_MAX_10_EXP     +38
     #define FLT_MIN_10_EXP     -37
     #define DBL_MAX_10_EXP    +308
     #define DBL_MIN_10_EXP    -307
     #define LDBL_MAX_10_EXP  +4932
     #define LDBL_MIN_10_EXP  -4931

8087 Status Word format
-----------------------
Error conditions or exceptions sometimes arise during the
execution of floating-point operations. The most common one is
Division by Zero, but other exceptions are possible. These
conditions are held in the 8087 Status Word, and the Borland
C/C++ compilers define the following exception conditions:

     #define SW_INVALID      0x0001  /* Invalid operation */

This exception occurs when no other recovery action is possible,
and is the most serious error. If an invalid operation exception
occurs within an operation, the operation returns a NaN, which
stands for Not a Number. A NaN is a value whose exponent field
contains all 1's and whose mantissa has anything other than all
0's in it.

     #define SW_DENORMAL     0x0002  /* Denormalized operand */

A denormalized operand exception occurs when precision is
sacrificed in order to increase range. The 8087 tries to prevent
the inferior precision of denormals from corrupting the precision
of the rest of the computation by providing a warning. A denormal
usually occurs as a result of masking a particular exception
(typically underflow), so that a result too small to normalize is
kept instead of being reported as an error.

     #define SW_ZERODIVIDE   0x0004  /* Zero divide */

This exception occurs whenever an attempt is made to divide by
either +0 or -0.

     #define SW_OVERFLOW     0x0008  /* Overflow */

This exception occurs whenever an attempt is made to represent a
number whose magnitude is too large to represent in the IEEE
format.

     #define SW_UNDERFLOW    0x0010  /* Underflow */

This exception occurs whenever an attempt is made to represent a
number which is too small (too close to zero) to represent in the
IEEE format.

     #define SW_INEXACT      0x0020  /* Precision (Inexact result) */

This exception occurs when a number cannot be exactly represented,
and thus must be approximated. If this exception is masked, the
result is rounded according to the rounding control field of the
Control Word.

8087 Control Word
-----------------
The Control Word controls the actions taken when an exception is
generated. If the mask bit for a particular exception is 0, the
exception is unmasked and the program will suspend operation. If
the mask bit is 1, the corresponding exception is masked and a
default exception value is produced instead. The Borland C/C++
compilers select all of the exception mask bits in the control
word with the mask 0x3F:

     #define MCW_EM          0x003f  /* interrupt Exception Masks */

When an exception is encountered and that exception is masked,
appropriate corrective action is taken to fix up the result.

     #define EM_INVALID      0x0001  /* invalid */
     #define EM_DENORMAL     0x0002  /* denormal */
     #define EM_ZERODIVIDE   0x0004  /* zero divide */
     #define EM_OVERFLOW     0x0008  /* overflow */
     #define EM_UNDERFLOW    0x0010  /* underflow */
     #define EM_INEXACT      0x0020  /* inexact (precision) */

The Control Word contains three additional bit fields that govern
how results are corrected:

     #define MCW_IC          0x1000  /* Infinity Control */
     #define IC_PROJECTIVE   0x0000  /* projective */
     #define IC_AFFINE       0x1000  /* affine */

The first is the infinity control bit. If the bit is 0, a
projective infinity is produced when conditions such as division
by zero occur. This projective form is the default, and the most
conservative.
If the bit is 1, affine infinity is used. The affine mode is more
liberal, but the programmer should analyze how introducing
infinity into a calculation could affect the program.

     #define MCW_RC          0x0c00  /* Rounding Control */
     #define RC_CHOP         0x0c00  /* chop */
     #define RC_UP           0x0800  /* up */
     #define RC_DOWN         0x0400  /* down */
     #define RC_NEAR         0x0000  /* near */

The next 2-bit field is the rounding control field. An inexact
result is rounded according to this field, which is typically set
to RC_NEAR.

     #define MCW_PC          0x0300  /* Precision Control */
     #define PC_24           0x0000  /* 24 bits */
     #define PC_53           0x0200  /* 53 bits */
     #define PC_64           0x0300  /* 64 bits */

The purpose of this bit field is to cause the 8087 to round all
numbers to something less than extended precision before placing
them in numeric registers. However, Borland C/C++ keeps
intermediate results in long double precision, so for C/C++ the
precision control is set to PC_64. This field is available for
compatibility with programs coming from other operating systems.

Round-Off Problems
==================
One of the problems with floating point numbers is round-off.
Round-off errors occur in any number base, because every base has
numbers it cannot represent exactly. For example, 1/3 is not
exactly representable in base ten, while 1/10th is easily
representable. But since we're dealing with computers, we are
working specifically in base two, and in base two 1/10th is not
exactly representable. The fractional place values of base two
are:

     1/2  1/4  1/8  1/16  1/32  1/64  1/128  1/256  1/512

The numbers 1/2, 1/4, 1/8, all powers of two, are exactly
representable in a computer. But 1/10 is not a power of two (it
lies between 1/8 and 1/16), and no finite sum of these fractions
adds up to exactly 1/10, so it is not exactly representable using
binary notation.

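As a rough check of that claim, the binary digits of a fraction can be generated by repeated doubling. The sketch below is standard C++ with exact integer arithmetic; the function name binary_fraction is a hypothetical name used only for this illustration.

```cpp
#include <string>

// Return the first `digits` binary digits after the binary point of
// num/den, assuming 0 <= num < den and den small enough that num*2
// does not overflow. Each step doubles the remainder; a result of
// at least den means the next place value (1/2, 1/4, ...) is used.
std::string binary_fraction(unsigned num, unsigned den, int digits) {
    std::string out;
    for (int i = 0; i < digits; ++i) {
        num *= 2;
        if (num >= den) {
            out += '1';
            num -= den;
        } else {
            out += '0';
        }
    }
    return out;
}
```

binary_fraction(1, 8, 6) gives 001000, which ends after the 1/8 place. binary_fraction(1, 10, 12) gives 000110011001: the group 0011 keeps repeating, so no finite number of binary places holds 1/10 exactly.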
Internally, the computer has to decide which binary fractions to
add together to come close to 1/10. For example:

     1/2  1/4  1/8  1/16  1/32  1/64  1/128  1/256  1/512
      0    0    0    1     1     1      0      0      0

this adds up to:

     0.109375

which is close to 0.1000 but would round to 0.11 rather than
0.10, so the computer's internal algorithm must try to find
another combination of binary fractions which comes closer to
0.1000, such as 1/16 + 1/32 + 1/256 + 1/512 = 0.0996. When its
internal algorithm is satisfied, it will have a number which is
CLOSE to 1/10th but not EXACT. This inexactness is known as
ROUND-OFF error.

Floating Point Round Off Error
------------------------------
Round-off error is especially noticeable in the smallest floating
point data type available: the float. The float data type is four
bytes in length, and uses these bytes to hold the mantissa,
exponent, and sign of the number. The following program
demonstrates that round-off error with floating point numbers
occurs even with simple assignments:

     #include <iostream.h>
     #include <iomanip.h>

     int main()
     {
          float number = 123.45;

          /* The default precision of 6 digits would print 123.45, */
          /* so ask for more digits to expose the stored value.    */
          cout << setprecision(10) << number << endl;
          return 0;
     }

The round-off error can be significant when doing multiple or
iterative calculations, as the following program illustrates:

     #include <iostream.h>

     int main()
     {
          float anumber = 1.693 / 10.0;
          float original = 1000000.00;
          int i, j;

          /* multiply ten times ... */
          for (i=0; i<10; i++)
          {
               original = original * anumber;
          }

          /* ... then divide ten times, which should undo it */
          for (j=0; j<10; j++)
          {
               original = original / anumber;
          }

          cout << original << endl;
          return 0;
     }

At the end of ten multiplications and ten divisions, the
calculated number is off from the original by 0.1875.

Increasing the size from a 4 byte to an 8 byte real improves
things somewhat, as the following code illustrates:

     #include <iostream.h>

     int main()
     {
          double anumber = 1.693 / 10.0;
          double original = 1000000.00;
          int i, j;

          /* multiply ten times ... */
          for (i=0; i<10; i++)
          {
               original = original * anumber;
          }

          /* ... then divide ten times, which should undo it */
          for (j=0; j<10; j++)
          {
               original = original / anumber;
          }

          cout << original << endl;
          return 0;
     }

With doubles, the calculated value differs from the original by
only 0.0625.
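
To see where the error in a value such as 123.45 comes from, the sign, biased exponent and mantissa fields described under Real Numbers can be pulled apart directly. The following is a minimal sketch in standard C++ (not Borland C++ 3.x), assuming a 32-bit IEEE float; the names FloatFields, decode and true_exponent are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Assumption: float is the 32-bit IEEE format described in this document.
static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "sketch assumes a 32-bit IEEE float");

struct FloatFields {
    unsigned sign;             // 1 bit: 0 positive, 1 negative
    unsigned biased_exponent;  // 8 bits: actual exponent plus 127
    std::uint32_t mantissa;    // 23 stored bits; the leading one is implied
};

// Copy the float's bytes into an integer and mask out each field.
FloatFields decode(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    FloatFields out;
    out.sign = bits >> 31;
    out.biased_exponent = (bits >> 23) & 0xFF;
    out.mantissa = bits & 0x7FFFFF;
    return out;
}

// Remove the float bias of 127 to recover the power of two.
int true_exponent(const FloatFields& ff) {
    return static_cast<int>(ff.biased_exponent) - 127;
}
```

For 5.375 (4 + 1 + 1/4 + 1/8, or 1.01011 * 2**2) decode returns sign 0, biased exponent 129, and a mantissa whose top five bits are 01011 with the rest zero. For 123.45 the biased exponent is 133 (2**6), and the 23 mantissa bits can only hold the nearest available value, which is slightly below 123.45.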