Thursday, July 28, 2011

Floating Point Numbers

What is a Floating Point Number: There are two ways a fractional number can be represented in binary form - fixed point and floating point. In a fixed point number, the radix point sits at a fixed, predefined position. For example, in a 32 bit fixed point number, the radix point may be assumed to be after the 16th bit. So the integer value held in the 32 bits needs to be divided by 2^16 = 65536 to get the fractional number it represents.
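
To make the fixed point idea concrete, here is a minimal Python sketch of the 16.16 layout described above. The names FRACTION_BITS, to_fixed and from_fixed are made up for illustration, and rounding to the nearest integer is an assumption, not something the format itself dictates.

```python
FRACTION_BITS = 16            # assumed position of the radix point
SCALE = 1 << FRACTION_BITS    # 2**16 = 65536

def to_fixed(value):
    """Encode a real number as a 16.16 fixed point integer (rounded)."""
    return round(value * SCALE)

def from_fixed(raw):
    """Decode a 16.16 fixed point integer back to a real number."""
    return raw / SCALE

raw = to_fixed(10.5)
print(raw, hex(raw))      # 688128 0xa8000
print(from_fixed(raw))    # 10.5
```

The smallest step such a number can take is always 1/65536, no matter how large or small the value is.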

A floating point number on the other hand has a completely different representation. It gives up some absolute precision in order to achieve the following.
  1. A much broader range of numbers that it can represent
  2. The same number of significant digits (relative precision), irrespective of the magnitude of the value
  3. The ability to represent special values like infinity and not-a-number

Bit Patterns of a Floating Point Number: The two most commonly used IEEE 754 floating point formats are 32 bit single precision and 64 bit double precision.

Mantissa and Exponent: Floating point numbers are written in scientific notation. Suppose we need to represent the number 10.5. In binary (as in mathematics and not in computers), this value is 1010.1. In scientific notation, this will be (1.0101 X 10^11)bin. Note that 10 is actually 2 in decimal and 11 is 3, so this reads as 1.3125 X 2^3 in decimal.

Now 1.0101 is the mantissa and 11 is the exponent.

The above number could also be represented as (10.101 X 10^10)bin. But, in normalized form, only the most significant digit comes before the radix point. In binary, there is only one possible value for the most significant digit of a nonzero number, i.e. 1. Hence, it is not required that we store it. In IEEE floating point number bit patterns, only the part of the mantissa that is after the radix point is stored. This part is called the fraction.
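
The normalization of 10.5 can be checked with a short Python sketch. It leans on the standard math.frexp function, which returns a mantissa in the range [0.5, 1), so the helper below (normalize, a name chosen only for this example) rescales it to the [1, 2) convention used in this post. It is meant for nonzero finite values only.

```python
import math

def normalize(value):
    """Return (mantissa, exponent) with 1 <= |mantissa| < 2
    and value == mantissa * 2**exponent (nonzero finite input)."""
    m, e = math.frexp(value)    # frexp gives 0.5 <= |m| < 1
    return m * 2, e - 1

mantissa, exponent = normalize(10.5)
print(mantissa, exponent)       # 1.3125 3
```

Here 1.3125 is the decimal value of binary 1.0101, and 3 is the decimal value of binary 11.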

The exponent needs to be able to take both positive and negative values. To cater for that, the exponent is stored with a bias. If the exponent has n bits, this bias is 2^(n-1) - 1. So, a stored value of 2^(n-1) - 1 in the exponent field actually means that the exponent is zero. The following table shows the number of bits in each field of the IEEE floating point formats, along with the bias.

                  Sign Bit   Exponent   Fraction   Bias
Single Precision  1          8          23         127
Double Precision  1          11         52         1023
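
As an illustration of the single precision layout in the table, the following Python sketch splits a value into its three fields using the standard struct module. The function name decode_float32 and the constant names are hypothetical, chosen just for this example.

```python
import struct

EXPONENT_BITS = 8
FRACTION_BITS = 23
BIAS = (1 << (EXPONENT_BITS - 1)) - 1    # 2**(n-1) - 1 = 127

def decode_float32(value):
    """Return the (sign, stored exponent, fraction) fields of a float32."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    exponent = (bits >> FRACTION_BITS) & 0xFF
    fraction = bits & ((1 << FRACTION_BITS) - 1)
    return sign, exponent, fraction

sign, exponent, fraction = decode_float32(10.5)
print(sign, exponent, exponent - BIAS)   # 0 130 3
print(format(fraction, "023b"))          # 01010000000000000000000
```

The stored exponent 130 minus the bias 127 gives the true exponent 3, and the fraction bits 0101... are the digits after the radix point of 1.0101.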

The sign bit represents the sign of the number. A value of 0 means positive, a value of 1 means negative. Note that unlike integer values, 2's complement is not used: flipping the sign bit alone negates the number.

Overflow and Underflow: Overflow happens when the value that needs to be represented is so large that it cannot be accommodated in the given datatype. In the case of floating point numbers, it means the number needs an exponent larger than the exponent field can hold. If the sign bit is 0, it's a positive overflow, and if the sign bit is 1, it's a negative overflow.

Underflow is a little trickier. It happens when the absolute value of the number to be represented is so small that it cannot be represented by the datatype. In floating point numbers, it happens when the required exponent is more negative than the smallest exponent the format can hold. Again, underflow can be positive or negative just like overflow.
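
Both effects can be observed from Python, assuming (as is the case on practically all current platforms) that Python floats are IEEE 754 double precision values with the default rounding mode. This is only a sketch. Other languages and settings may raise an error or trap instead of silently returning these results.

```python
import math
import sys

largest = sys.float_info.max    # largest finite double
print(largest * 2)              # inf   (positive overflow)
print(-largest * 2)             # -inf  (negative overflow)

smallest = 5e-324               # smallest positive double
print(smallest / 2)             # 0.0   (positive underflow)
print(-smallest / 2)            # -0.0  (negative underflow)
```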

Special Values: Floating point numbers support a number of special values. I will talk about them one by one.

Zero: Yes, that's right! Zero is a special value. Remember that the digit of the mantissa before the radix point is not stored. It is always assumed to be 1. But, that way, zero cannot be represented. Thus zero is represented by a special bit pattern: all the bits in the fraction as well as the exponent are set to 0. The sign bit is still free, so its value creates two distinct zeros - positive zero and negative zero. However, these two zeros compare as equal.
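
A small Python sketch can show the two zeros. It assumes Python floats are IEEE 754 doubles, and bits64 is a helper name invented for this example.

```python
import math
import struct

def bits64(value):
    """Return the 64 bit pattern of a double as an integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]

print(format(bits64(0.0), "016x"))    # 0000000000000000
print(format(bits64(-0.0), "016x"))   # 8000000000000000
print(0.0 == -0.0)                    # True
print(math.copysign(1.0, -0.0))       # -1.0
```

The two bit patterns differ only in the sign bit, the comparison still reports them as equal, and math.copysign is one way to tell them apart.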

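Denormalized: A number is denormalized if all its exponent bits are zero. In this case the digit before the radix point is assumed to be zero instead of 1, and the exponent is taken to be the smallest normal exponent, i.e. 1 - bias. This lets the format represent values smaller than the smallest normalized number. It also means that zero is a special case of denormalized numbers.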
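
As a rough check of this rule for single precision, the Python sketch below builds values directly from bit patterns. The helper names float32_from_bits and denormal_value are hypothetical, and the formula in denormal_value is simply "0.fraction X 2^(1 - bias)" written out.

```python
import struct

def float32_from_bits(bits):
    """Interpret a 32 bit integer as a single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def denormal_value(fraction):
    """Value of a positive float32 with exponent bits all zero."""
    return (fraction / 2 ** 23) * 2.0 ** (1 - 127)

smallest_denormal = float32_from_bits(0x00000001)   # exponent 0, fraction 1
smallest_normal = float32_from_bits(0x00800000)     # exponent 1, fraction 0

print(smallest_denormal == denormal_value(1))   # True
print(smallest_denormal == 2.0 ** -149)         # True
print(smallest_normal == 2.0 ** -126)           # True
print(denormal_value(0))                        # 0.0
```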

Infinity: Infinity is represented by setting all bits in the exponent to 1 and all bits in the fraction to 0. Depending on the sign bit, the infinity is either positive or negative.
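
The single precision bit patterns of the two infinities can be printed with a few lines of Python. float32_bits is a helper name made up for this sketch.

```python
import math
import struct

def float32_bits(value):
    """Return the 32 bit pattern of a single precision value."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

print(format(float32_bits(math.inf), "08x"))    # 7f800000
print(format(float32_bits(-math.inf), "08x"))   # ff800000
```

In 7f800000 the sign bit is 0, the eight exponent bits are all 1 and the 23 fraction bits are all 0. ff800000 differs only in the sign bit.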

Not a Number: Not-a-number means a value that is not defined. For example, if we divide 0 by 0, the result will be not-a-number. A value is a NaN if its exponent bits are all 1 and at least one fraction bit is 1. NaNs (Not-a-Numbers) are of two types, QNaN (Quiet NaN) and SNaN (Signaling NaN). A QNaN passes through arithmetic operations (resulting in more QNaNs), but an SNaN generates an error if any operation is done with it. If the most significant bit of the fraction is set, it's a QNaN, otherwise it's an SNaN.
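
The rules for the special exponent value can be put together in a small Python classifier that works on single precision bit patterns. This is only a sketch: classify_float32_bits is a hypothetical name, and the quiet/signaling test assumes the convention described above (most significant fraction bit set means quiet), which some older hardware reverses.

```python
import math

def classify_float32_bits(bits):
    """Classify a 32 bit pattern by its exponent and fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent != 0xFF:
        return "finite"
    if fraction == 0:
        return "infinity"
    quiet = (fraction >> 22) & 1    # most significant fraction bit
    return "quiet NaN" if quiet else "signaling NaN"

print(classify_float32_bits(0x7FC00000))   # quiet NaN
print(classify_float32_bits(0x7F800001))   # signaling NaN
print(classify_float32_bits(0x7F800000))   # infinity

nan = math.inf - math.inf                  # another undefined operation
print(math.isnan(nan), nan == nan)         # True False
```

Note that Python itself raises ZeroDivisionError for 0.0 / 0.0 rather than returning a NaN, which is why the last lines use infinity minus infinity instead. A NaN never compares equal to anything, not even to itself.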


Orion Caspar said...

Nice article! truly helpful. but it needs some literal editing. I mean, it should be written in such a way, so that, a layman should grow interest in this domain.
