PRODUCT  :  Borland C++                           NUMBER  :  1027
  VERSION  :  3.x
       OS  :  DOS
     DATE  :  October 19, 1993                        PAGE  :  1/10

    TITLE  :  An Overview of Floating Point Numbers.





  Understanding How an IBM-PC Stores a Number
  *******************************************


  Binary Notation
  ===============

      Of zeros and ones.  Binary notation uses 2 as the number
  base.  A base ten system uses the digits 0 through 9, while
  binary uses only the digits 0 and 1.  In base ten, the number 123
  is represented as one 100, two 10's and three 1's, each position
  being a power of ten.  Dissecting the number in base ten from
  right to left:

                                    Total
         3       10**0    =   1       3
         2       10**1    =  10      23
         1       10**2    = 100     123

  The same number (123) represented in binary would look like
  this:
                          1111011

  which doesn't really look like 123.  But dissecting it from
  right to left, it breaks down like this:
                                     Total
          1       2**0    =   1        1
          1       2**1    =   2        3
          0       2**2    =   4        3
          1       2**3    =   8       11
          1       2**4    =  16       27
          1       2**5    =  32       59
          1       2**6    =  64      123
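
      The breakdown above can be sketched in code.  This is a
  minimal illustration in standard C++, not anything taken from the
  Borland libraries; the function names to_binary and from_binary
  are hypothetical, chosen only for this example.

```cpp
#include <string>

// Build the binary digits of 'value' by repeatedly taking the
// remainder modulo 2 (the rightmost digit) and dividing by 2.
std::string to_binary(unsigned value)
{
    if (value == 0) return "0";
    std::string digits;
    while (value > 0) {
        digits.insert(digits.begin(), static_cast<char>('0' + value % 2));
        value /= 2;
    }
    return digits;
}

// Sum the place values from right to left: 2**0, 2**1, 2**2, ...
unsigned from_binary(const std::string& digits)
{
    unsigned total = 0;
    unsigned weight = 1;
    for (std::string::size_type i = digits.size(); i > 0; --i) {
        if (digits[i - 1] == '1') total += weight;
        weight *= 2;
    }
    return total;
}
```

  Calling to_binary(123) yields "1111011", and
  from_binary("1111011") yields 123, matching the table.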

  Two's Complement
  ================

      Binary notation is perfect for describing positive numbers
  and zero.  But when we want to allow for negative numbers, an
  additional mechanism is needed to indicate the sign of the
  number.  The easiest way to do this is to use the leftmost bit to
  indicate the sign of a number.  For example, using 8 bits:

      decimal     binary
         4        0000 0100
        -4        1000 0100
       127        0111 1111
      -127        1111 1111

      A problem is introduced, however, when doing binary
  subtraction with this scheme.  For example, subtract 1 from 0:

      decimal     binary
         0        0000 0000
         1        0000 0001
                  =========
      -127        1111 1111

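      The bit pattern 1111 1111 reads as -127 under the sign-bit
  scheme, but the arithmetic wants it to mean -1.  So, if we want
  to use signed number representation, we need a system in which
  1111 1111 is equivalent to -1; that system is called two's
  complement.  To further exemplify, subtract a positive 1 from the
  above result: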

      decimal     binary
        -1        1111 1111
         1        0000 0001
                  =========
        -2        1111 1110

  To see if it works, try it on 3 + (-2):

      decimal     binary
         3        0000 0011
        -2        1111 1110
                  =========
         1        0000 0001
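
      The two readings of the same bit pattern can be compared
  with a short sketch.  It assumes a C++11 compiler for <cstdint>
  (Borland C++ 3.x predates that header), and the helper names
  are hypothetical.

```cpp
#include <cstdint>

// 8-bit subtraction and addition; any borrow or carry out of
// bit 7 is simply discarded, as in the tables above.
std::uint8_t sub8(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a - b);
}

std::uint8_t add8(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}

// Read a pattern as sign bit plus 7-bit magnitude.
int as_sign_magnitude(std::uint8_t bits)
{
    int magnitude = bits & 0x7F;
    return (bits & 0x80) ? -magnitude : magnitude;
}

// Read the same pattern as two's complement.
int as_twos_complement(std::uint8_t bits)
{
    return bits < 128 ? bits : static_cast<int>(bits) - 256;
}
```

  Here sub8(0, 1) produces 1111 1111, which as_sign_magnitude
  reads as -127 and as_twos_complement reads as -1.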

  To convert a positive number to its two's complement negative:

      decimal     binary
         5        0000 0101

  step 1:  reverse the 0's and 1's

      decimal     binary
         5'       1111 1010

  step 2:  add 1

      decimal     binary
         5'       1111 1010
         1        0000 0001
                  =========
        -5        1111 1011

  Now try this number by adding 7 + (-5):

      decimal     binary
         7        0000 0111
        -5        1111 1011
                  =========
         2        0000 0010
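
      The two steps translate directly into code.  The sketch
  below assumes a C++11 compiler for <cstdint>; negate8 and add8
  are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Two's complement negation of an 8-bit pattern.
std::uint8_t negate8(std::uint8_t bits)
{
    // step 1: reverse the 0's and 1's
    std::uint8_t inverted = static_cast<std::uint8_t>(~bits);
    // step 2: add 1 (a carry out of bit 7 is discarded)
    return static_cast<std::uint8_t>(inverted + 1);
}

// Add two 8-bit patterns, discarding any carry out of bit 7.
std::uint8_t add8(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}
```

  negate8(0x05) gives 0xFB (1111 1011), and
  add8(0x07, negate8(0x05)) gives 0x02, as in the 7 + (-5) table.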

  Data Formats
  ============
                      bits    mantissa    exponent    sign
      character          8           7           0       1
      integer           16          15           0       1
      long integer      32          31           0       1
      float             32          23           8       1
      double            64          52          11       1
      long double       80          64          15       1

  Integers
  --------
      There are essentially three integer formats supported by the
  numeric processor: char, integer, and long integer.  The format
  of each is the same except for the length, thus only the range is
  different for each.  The format is a flat mantissa (or magnitude)
  and the high bit is used for the sign.  Two's complement is used
  for negative numbers.  A character holds 2**8 values (-128 to
  127), an integer 2**16 values (-32,768 to 32,767), and a long
  integer 2**32 values (-2,147,483,648 to 2,147,483,647).
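
      The ranges follow from the bit counts alone.  As a rough
  sketch (signed_range and Range are hypothetical names, and long
  long assumes a C++11 compiler):

```cpp
#include <cassert>

struct Range {
    long long min;
    long long max;
};

// Range of a two's complement integer that is 'bits' wide:
// one bit is the sign, so half the patterns are negative.
Range signed_range(int bits)
{
    assert(bits >= 1 && bits <= 63);
    long long half = 1LL << (bits - 1);   // 2**(bits-1)
    Range r = { -half, half - 1 };
    return r;
}
```

  signed_range(8), signed_range(16) and signed_range(32) give the
  limits of the char, integer and long integer formats.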

  Real Numbers
  ------------
      There are three forms of real number representation supported
  by the numeric processor:  float, double, and long double.  Each
  of these three types has three components:  mantissa, exponent,
  and sign.
      The mantissa is stored in a form called a "normalized
  mantissa".  This means that the leftmost bit of the mantissa is
  ASSUMED to be a one, and the IEEE format exploits this.  For
  example:

      4 + 1 + 1/4 + 1/8

  in the binary form would look like:

      101.011

  The "normalized" form is obtained by adjusting the exponent until
  the binary point is to the right of the most significant one:

      1.01011 * 2**2

  and the upper one IS NOT stored, except in the case of a long
  double.  By not storing the most significant one, one more bit of
  precision is obtained from the same number of stored bits.
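
      Normalization can be tried out with the standard library
  function frexp, which splits a value into a fraction in the
  range 0.5 to 1 and a power of two.  The sketch below shifts that
  result by one place to get the 1.xxx form used here; the
  normalize function and Normalized struct are hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct Normalized {
    double mantissa;   // 1.0 <= |mantissa| < 2.0
    int exponent;      // value == mantissa * 2**exponent
};

// Only meaningful for finite, nonzero values.
Normalized normalize(double value)
{
    assert(value != 0.0);
    int exp = 0;
    double frac = std::frexp(value, &exp);   // 0.5 <= |frac| < 1.0
    Normalized n = { frac * 2.0, exp - 1 };
    return n;
}
```

  For 4 + 1 + 1/4 + 1/8 = 5.375, normalize returns a mantissa of
  1.34375 (binary 1.01011) and an exponent of 2.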
      The exponent is stored in a form called a "biased" exponent.
  The exponent field specifies the power of 2 by which the mantissa
  must be multiplied to obtain the value of the floating-point
  number.  In order to accommodate negative exponents, the exponent
  field contains the sum of the actual exponent and a positive
  constant called the "bias".  This bias ensures that the exponent
  field is never negative.  The actual "bias" for
  floats is 127, doubles is 1023, and for long doubles is 16383.
  Using a float for example, suppose the exponent field contained
  132:

      132 - 127 = 5

  So in this scenario, the power by which the mantissa must be
  multiplied is 2**5.  If the exponent field contained 122:

      122 - 127 = -5

  so the mantissa must be multiplied by 2**-5 to obtain the correct
  value.  The 8087 chip reserves the highest and lowest exponent
  field values for special cases (zero, denormals, infinities and
  NaNs), so for a float the largest exponent is 127 and the
  lowest -126.
      The sign field is the high order bit, as in integers.  If it
  is 0, the number is positive; if 1, the number is negative.
  Unlike integers, the mantissa is stored as a magnitude rather
  than in two's complement.
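
      The three fields of a float can be pulled apart with a few
  shifts and masks.  This is a sketch under the assumption that
  float is the 32-bit IEEE format described above and that a C++11
  compiler is available; decode and FloatFields are hypothetical
  names.

```cpp
#include <cstdint>
#include <cstring>

struct FloatFields {
    unsigned sign;             // 0 = positive, 1 = negative
    unsigned biased_exponent;  // raw 8-bit exponent field
    int exponent;              // biased_exponent - 127
    std::uint32_t mantissa;    // 23 stored bits, leading one omitted
};

FloatFields decode(float value)
{
    static_assert(sizeof(float) == 4, "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // copy the raw bits

    FloatFields f;
    f.sign = bits >> 31;
    f.biased_exponent = (bits >> 23) & 0xFF;
    f.exponent = static_cast<int>(f.biased_exponent) - 127;
    f.mantissa = bits & 0x7FFFFF;
    return f;
}
```

  decode(-32.0f) reports a sign of 1, an exponent field of 132
  (132 - 127 = 5) and a stored mantissa of 0, since -32 is
  -1.0 * 2**5.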

  Significant Digit Precision Accuracy
  ------------------------------------

      So how accurate (how many decimal digits of precision) are
  floats, doubles and long doubles?  In <float.h> these constants
  are defined:

      Mantissa Digits:
          #define FLT_MANT_DIG        24
          #define DBL_MANT_DIG        53
          #define LDBL_MANT_DIG       64

  For floats and doubles these counts include the implied leading
  one.  If you take 2 and raise it to each of these powers:
  (approximate)

          2 **  FLT_MANT_DIG = 1.7e7
          2 **  DBL_MANT_DIG = 9.0e15
          2 ** LDBL_MANT_DIG = 1.8e19

  and thus the precision for each of the real types can be
  approximated:

      #define FLT_DIG              6
      #define DBL_DIG             15
      #define LDBL_DIG            19

  Exponent Range:

      #define FLT_MAX_10_EXP        +38
      #define FLT_MIN_10_EXP        -37
      #define DBL_MAX_10_EXP       +308
      #define DBL_MIN_10_EXP       -307
      #define LDBL_MAX_10_EXP     +4932
      #define LDBL_MIN_10_EXP     -4931
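
      The limit on float precision is easy to observe.  A float
  has 23 stored mantissa bits plus the implied leading one, so
  whole numbers are exact only up to 2**24 = 16,777,216.  The
  following sketch (fits_in_float is a hypothetical helper) checks
  whether a whole number survives being stored in a float.

```cpp
// True if 'value' is unchanged after a round trip through a float.
bool fits_in_float(long value)
{
    // volatile forces a real 32-bit store, so the check is not
    // carried out in a wider register.
    volatile float stored = static_cast<float>(value);
    return static_cast<long>(stored) == value;
}
```

  fits_in_float(16777216L) is true, but fits_in_float(16777217L)
  is false: the eighth decimal digit can no longer be trusted.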

  8087 Status Word format
  -----------------------

      Error conditions or exceptions sometimes arise during the
  execution of floating-point operations.  The most common one
  being Division by Zero.  However other exceptions are possible.
  These conditions are held in the 8087 Status Word, and the
  Borland C/C++ compilers define the following exception
  conditions:

  #define SW_INVALID      0x0001  /* Invalid operation        */

      This exception occurs when no other recovery action is
  possible, and is the most serious error.  If an invalid operation
  exception occurs within an operation, the operation returns a
  NaN, which stands for Not a Number.  A NaN is encoded as an
  exponent field of all 1's with a mantissa containing anything
  other than all 0's.

  #define SW_DENORMAL     0x0002  /* Denormalized operand     */

      A denormalized operand exception occurs when an operand is a
  denormal, a very small number in which precision has been
  sacrificed in order to increase range.  The 8087 tries to prevent
  the inferior precision of denormals from corrupting the precision
  of the rest of the computation by providing a warning.  A
  denormal usually occurs as the result of an underflow whose
  exception was masked.

  #define SW_ZERODIVIDE   0x0004  /* Zero divide              */

      This exception occurs whenever an attempt to divide by either
  +0 or -0 is made.

  #define SW_OVERFLOW     0x0008  /* Overflow                 */

      This exception occurs whenever an attempt is made to
  represent a number which is too big to represent in IEEE format.

  #define SW_UNDERFLOW    0x0010  /* Underflow                */

      This exception occurs whenever an attempt is made to
  represent a nonzero number whose magnitude is too small to
  represent in the IEEE format.

  #define SW_INEXACT      0x0020  /* Precision (Inexact result)*/

      This exception occurs when a number cannot be exactly
  represented, and thus will be approximated.  If this exception is
  masked, then the rounding control of the Control Word will be
  used.
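
      Because each condition is a separate bit, a status word
  value can be tested with bitwise AND.  The sketch below is
  plain C++ that only mirrors the bit values listed above under
  different names (kSwInvalid and so on); describe_status is a
  hypothetical helper, and obtaining the status word itself, for
  example through the run-time library, is outside the sketch.

```cpp
#include <string>

// Bit values mirror the SW_* constants listed above.
const unsigned kSwInvalid    = 0x0001;
const unsigned kSwDenormal   = 0x0002;
const unsigned kSwZeroDivide = 0x0004;
const unsigned kSwOverflow   = 0x0008;
const unsigned kSwUnderflow  = 0x0010;
const unsigned kSwInexact    = 0x0020;

// List the exception flags that are set in a status word value.
std::string describe_status(unsigned status)
{
    struct Flag { unsigned bit; const char* name; };
    static const Flag flags[] = {
        { kSwInvalid,    "invalid" },
        { kSwDenormal,   "denormal" },
        { kSwZeroDivide, "zero divide" },
        { kSwOverflow,   "overflow" },
        { kSwUnderflow,  "underflow" },
        { kSwInexact,    "inexact" }
    };
    std::string text;
    for (unsigned i = 0; i < sizeof flags / sizeof flags[0]; ++i) {
        if (status & flags[i].bit) {
            if (!text.empty()) text += ", ";
            text += flags[i].name;
        }
    }
    return text.empty() ? std::string("none") : text;
}
```

  For instance, describe_status(0x0024) returns
  "zero divide, inexact".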

  8087 Control Word
  -----------------

      The Control Word controls the actions taken when an exception
  is generated.  If the mask bit for a particular exception is 0,
  the exception is unmasked: an interrupt is generated and the
  program suspends normal operation.  If the mask bit is 1, the
  corresponding exception is masked and exception values are
  produced.  The Borland C/C++ headers define MCW_EM (0x3F) to
  select all six exception mask bits in the control word.

      #define MCW_EM        0x003f  /* interrupt Exception Masks*/

  When a masked exception is encountered, appropriate corrective
  action is taken to fix up the result.

      #define     EM_INVALID      0x0001  /*   invalid        */
      #define     EM_DENORMAL     0x0002  /*   denormal       */
      #define     EM_ZERODIVIDE   0x0004  /*   zero divide    */
      #define     EM_OVERFLOW     0x0008  /*   overflow       */
      #define     EM_UNDERFLOW    0x0010  /*   underflow      */
      #define     EM_INEXACT      0x0020  /*   inexact(precision)*/

  The Control Word contains three additional bit fields for
  correcting exceptions:

      #define MCW_IC              0x1000  /* Infinity Control */
      #define     IC_PROJECTIVE   0x0000  /*   projective     */
      #define     IC_AFFINE       0x1000  /*   affine         */

      The first is the infinity control bit.  If the bit is 0,
  projective infinity (a single, unsigned infinity) is used when
  conditions such as division by zero occur.  The projective form
  is the default, and the most conservative.  If the bit is 1,
  affine infinity (separate positive and negative infinities) is
  used; this mode is more liberal, but the programmer should
  analyze how introducing infinity into a calculation could affect
  the program.

      #define MCW_RC          0x0c00  /* Rounding Control     */
      #define     RC_CHOP     0x0c00  /*   chop               */
      #define     RC_UP       0x0800  /*   up                 */
      #define     RC_DOWN     0x0400  /*   down               */
      #define     RC_NEAR     0x0000  /*   near               */

      The next 2-bit field is the rounding control field.  An
  inexact result is rounded according to this field, which is
  typically set to RC_NEAR.

      #define MCW_PC          0x0300  /* Precision Control    */
      #define     PC_24       0x0000  /*    24 bits           */
      #define     PC_53       0x0200  /*    53 bits           */
      #define     PC_64       0x0300  /*    64 bits           */

      The purpose of this bit field is to cause the 8087 to round
  all results to something less than extended precision before
  placing them in numeric registers.  However, C/C++ programs here
  keep intermediate results as long doubles, so for C/C++ the
  precision control is set to PC_64.  This field is available for
  compatibility with programs coming from other operating systems.
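
      Each field of the control word is changed by clearing the
  bits selected by its MCW_ mask and then setting the new value.
  The sketch below shows only that bit arithmetic in plain C++;
  update_control is a hypothetical helper and the k-prefixed
  constants simply mirror the values listed above.  It does not
  touch the coprocessor.

```cpp
// Values mirror the MCW_*, RC_* and PC_* constants listed above.
const unsigned kMcwRc  = 0x0c00;   // rounding control field
const unsigned kRcChop = 0x0c00;
const unsigned kRcNear = 0x0000;
const unsigned kMcwPc  = 0x0300;   // precision control field
const unsigned kPc53   = 0x0200;
const unsigned kPc64   = 0x0300;

// Replace only the bits selected by 'mask' with those in 'new_bits'.
unsigned update_control(unsigned current, unsigned new_bits, unsigned mask)
{
    return (current & ~mask) | (new_bits & mask);
}
```

  Starting from a control word of 0x033F (all exceptions masked,
  PC_64, RC_NEAR), update_control(0x033F, kRcChop, kMcwRc) gives
  0x0F3F: only the rounding field changes.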

  Round-Off Problems
  ==================

      One of the problems with floating point numbers is round-off.
  Round-off errors occur in any number base when a number cannot
  be written with a finite number of digits in that base.  For
  example, 1/3 is not exactly representable in base ten, while
  1/10th is easily representable.  But computers work in base two,
  and unlike base ten, 1/10th is not exactly representable in base
  two.  For example, the fractional place values of base two are:

  1/2    1/4    1/8    1/16   1/32   1/64   1/128   1/256   1/512

  The numbers 1/2, 1/4, 1/8, all powers of two, are exactly
  representable in a computer, as is any finite sum of them.  But
  1/10 is not such a sum: its denominator contains the factor 5,
  so its binary expansion repeats forever.  So internally the
  computer has to decide which fractional binary portions to add
  together to sum close to 1/10.  For example:

  1/2    1/4    1/8    1/16    1/32   1/64   1/128   1/256   1/512
   0      0      0      1       1      1      0       0       0

  this adds up to:

      0.1093 which is close to 0.1000 but would round to 0.11, so
  the computer's internal algorithm must try to find another
  combination of binary fractions which comes closer to 0.1000.
  When its internal algorithm is satisfied, it will have a number
  which is CLOSE to 1/10th but not EXACT.  This inexactness is
  known as ROUND-OFF error.
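
      One simple way to pick the binary fractions is to walk from
  1/2 downward and keep each fraction that does not push the sum
  past the target.  This is only an illustrative sketch, not the
  method the hardware uses (the 8087 normally rounds to nearest,
  while this version always stays at or below the target); the
  function name approximate is hypothetical.

```cpp
#include <cassert>

// Approximate 'target' (between 0 and 1) using the first 'bits'
// binary fractions: 1/2, 1/4, 1/8, ...
double approximate(double target, int bits)
{
    assert(bits >= 0);
    double sum = 0.0;
    double weight = 0.5;
    for (int i = 0; i < bits; ++i) {
        if (sum + weight <= target) {
            sum += weight;      // keep this fraction
        }
        weight /= 2.0;          // move to the next smaller fraction
    }
    return sum;
}
```

  With the nine fractions shown above, approximate(0.1, 9) keeps
  1/16, 1/32, 1/256 and 1/512 for a sum of 0.099609375: close to
  1/10th, but not exact.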

  Floating Point Round Off Error
  ------------------------------

      Round-off error is especially noticeable in the smallest
  floating point data type available:  the float.  The float data
  type is four bytes in length, and uses these bytes to hold the
  mantissa, exponent, and sign of the number.  The following
  program demonstrates that round-off error with floating point
  numbers occurs even with simple assignments:

      #include <iostream.h>

      int main()
      {
          float number = 123.45;  // nearest float is not exactly 123.45
          cout.precision(10);     // the default of 6 digits hides the error
          cout << number << endl;
          return 0;
      }

  The round off error can be significant when doing multiple or
  iterative calculations, as the following program illustrates:

      #include <iostream.h>

      int main()
      {
          float anumber = 1.693 / 10.0;
          float original = 1000000.00;
          int i, j;

          // multiply ten times ...
          for (i=0; i<10; i++)
          {
              original = original * anumber;
          }

          // ... then undo it with ten divisions
          for (j=0; j<10; j++)
          {
              original = original / anumber;
          }
          cout.precision(12);     // show digits past the default 6
          cout << original << endl;
          return 0;
      }

  After ten multiplications and ten divisions, the result is off
  from the original number by 0.1875.  Increasing the size from a
  4 byte to an 8 byte real improves things, as the following code
  illustrates:

      #include <iostream.h>

      int main()
      {
          double anumber = 1.693 / 10.0;
          double original = 1000000.00;
          int i, j;

          for (i=0; i<10; i++)
          {
              original = original * anumber;
          }

          for (j=0; j<10; j++)
          {
              original = original / anumber;
          }
          cout.precision(12);     // show digits past the default 6
          cout << original << endl;
          return 0;
      }

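  With doubles the calculated value is much closer to the original.
  Near 1,000,000, adjacent floats are 0.0625 apart, whereas
  adjacent doubles are roughly 1.2e-10 apart, so any remaining
  error is correspondingly smaller.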
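
      The two programs differ only in the data type, so the
  comparison can be folded into one template.  This is a sketch
  in standard C++ rather than the Borland 3.x dialect used above,
  and round_trip_error is a hypothetical name.  The exact error
  reported depends on the compiler and on how the floating point
  hardware holds intermediate results.

```cpp
#include <cmath>

// Multiply 'start' by 'factor' 'steps' times, divide it back the
// same number of times, and return how far the result has drifted.
template <typename Real>
Real round_trip_error(Real start, Real factor, int steps)
{
    Real value = start;
    for (int i = 0; i < steps; ++i) {
        value = value * factor;
    }
    for (int j = 0; j < steps; ++j) {
        value = value / factor;
    }
    return std::fabs(value - start);
}
```

  round_trip_error<float>(1000000.0f, 0.1693f, 10) and
  round_trip_error<double>(1000000.0, 0.1693, 10) can then be
  printed side by side to compare the two types.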