Suppose you have bought two things: one cost 7.19 euros and the other 1.18 euros. So the total price is 8.37 euros, isn't it?

In Java:

double a = 7.19;
double b = 1.18;
double total = a + b;
System.out.println("total = " + total);

In Oberon (Component Pascal, BlackBox):

a, b, total: REAL;

a := 7.19; b := 1.18;
total := a + b;
StdLog.String("total = "); StdLog.Real(total); StdLog.Ln;

The answer in Java and Oberon is the same:

8.370000000000001

This arithmetic anomaly is disturbing. How can it be? Even the simple calculator on my mobile phone computes this sum correctly!

Of course, this problem arises because there are finite decimal numbers that can only be represented in binary by an infinite sequence of bits. For instance, if you convert the decimal number 2.35 into binary, you obtain:

10.0101100110011001100110... (the 4 bits 0110 repeat forever)

But you cannot store infinitely many bits in a computer. If you keep only the first 53 bits (truncating the rest), then the decimal number actually represented is (computed by hand with the Windows calculator):

2.3499999999999996447286321199499

but not 2.35
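In Java you can inspect the exact value that a double actually stores: the BigDecimal(double) constructor exposes the full decimal expansion of the binary value. Note that IEEE 754 rounds to the nearest representable value rather than truncating, so here the stored value lands slightly above 2.35:

```java
import java.math.BigDecimal;

public class ExactValue {
    public static void main(String[] args) {
        // The BigDecimal(double) constructor takes the double's exact binary
        // value, so it reveals what the literal 2.35 really becomes.
        System.out.println(new BigDecimal(2.35));
        // prints 2.350000000000000088817841970012523233890533447265625
    }
}
```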

In the article "Where's your point? Tricks and traps with floating point and decimal numbers" (http://www.ibm.com/developerworks/java/library/j-jtp0114/) by Brian Goetz, I read the following paragraph:

Don't use floating point numbers for exact values.

Some non-integral values, like dollars-and-cents decimals, require exactness.

Floating point numbers are not exact, and manipulating them will result in rounding errors.

As a result, it is a bad idea to use floating point to try to represent exact quantities like monetary amounts.

Using floating point for dollars-and-cents calculations is a recipe for disaster.

Floating point numbers are best reserved for values such as measurements, whose values are fundamentally inexact to begin with.

Brian Goetz advises using the class BigDecimal in Java for numbers that represent money. Note that this class uses decimal --and not binary-- arithmetic.
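Here is a minimal sketch of that advice applied to the opening example. The BigDecimal values are constructed from strings, because the BigDecimal(double) constructor would inherit the binary conversion error:

```java
import java.math.BigDecimal;

public class MoneySum {
    public static void main(String[] args) {
        // The String constructor keeps the exact decimal values.
        BigDecimal a = new BigDecimal("7.19");
        BigDecimal b = new BigDecimal("1.18");
        System.out.println("total = " + a.add(b)); // prints: total = 8.37
    }
}
```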

I wonder if it would not be helpful to have in Oberon a new primitive data type called DECIMAL(p,s), similar to the data type offered by some relational databases. "p" stands for precision, and "s" for scale. For instance, DECIMAL(10,2) means "a signed number with 10 digits, two of them fractional; therefore its values range from -99999999.99 to +99999999.99". All operations on variables of this type would be performed in decimal fixed-point, rather than binary floating-point. Since hardware typically provides only binary floating-point support, the decimal arithmetic would have to be implemented by the compiler.
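No such type exists in either language today, but as a rough illustration of what the compiler would generate, DECIMAL(10,2) amounts to ordinary integer arithmetic on a value counting hundredths (the class name below is hypothetical, purely for illustration):

```java
// Hypothetical sketch: DECIMAL(10,2) emulated with a long counting hundredths.
public class DecimalSketch {
    public static void main(String[] args) {
        long a = 719;       // 7.19 stored as 719 hundredths
        long b = 118;       // 1.18 stored as 118 hundredths
        long total = a + b; // exact integer addition: 837 hundredths
        System.out.println("total = " + total / 100 + "."
                + String.format("%02d", total % 100));
        // prints: total = 8.37
    }
}
```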

If the idea of a new primitive data type does not seem good, you can think of a utility data type similar to the BigDecimal class in Java. Perhaps Oberon already has such a type; I do not know.

----------

Changing the topic: I now leave the convenience of decimal fixed-point arithmetic for variables holding money, and focus on binary floating-point arithmetic itself.

As Brian Goetz says, floating-point arithmetic is not exact, due to limited precision and round-off, but it is fine for variables holding values from a sensor, because those values are approximations to begin with and were never originally decimal, as money is.

If you have a decimal constant in your program, do not expect that value to be converted exactly into binary.

Taking an example from "Comparing floating point numbers" (http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm) by Bruce Dawson, these are some consecutive floating-point values using 32 bits:

...
1.99999976
1.99999988
2.00000000
2.00000024
2.00000048
...

Between 2.00000000 and 2.00000024 there are infinitely many real numbers, but each of them is mapped to one of only two floating-point numbers: either 2.00000000 or 2.00000024.

  2.00000000       2.00000024
...----+----------------+----...
        <------||------>

Real numbers are one thing (an ideal; there are infinitely many of them), and floating-point numbers are another (they are concrete; under a given representation --32 or 64 bits-- only a limited number of them exist).

With 32 bits it is easy to imagine that the infinitely many real numbers between 2.00000012 and 2.00000023 will be replaced by the floating-point number 2.00000024, and the real numbers between 2.00000001 and 2.00000011 will be replaced by the floating-point number 2.00000000.

So "2.00000014 = 2.00000023" is true, because both are replaced by 2.00000024.

But "2.00000011 = 2.00000013" is false, because it amounts to evaluating "2.00000000 = 2.00000024".

Note the inconsistency: 2.00000011 and 2.00000013 are considered different, yet they are actually "more equal" than 2.00000014 and 2.00000023 are.
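Both mappings are easy to check in Java with float literals:

```java
public class FloatRounding {
    public static void main(String[] args) {
        // Both literals round to the same float, 2.00000024:
        System.out.println(2.00000014f == 2.00000023f); // true
        // These round to 2.00000000 and 2.00000024 respectively:
        System.out.println(2.00000011f == 2.00000013f); // false
    }
}
```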

If your program contains the constant 2.00000015 (32 bits), then the number the computer actually works with is 2.00000024. Put simply, the computer cannot know about 2.00000015; that number is beyond its representation capability (using 32 bits), so it has to use 2.00000024 instead.

And even this may vary: having the constant permanently stored in a fixed memory cell --which happens when you declare a constant-- is not the same as using a bare literal (not previously declared) inside an expression. In the latter case, the value of the transient constant may first materialize in a CPU register, which typically has more bits of precision, so that value will be closer to 2.00000015 than 2.00000024 is.

The same happens with 64 bits, but the floating-point numbers are more precise.

These seem to be the different floating-point numbers under 64 bits:

...
2.0000000000000000
2.0000000000000004
2.0000000000000010
2.0000000000000013
2.0000000000000018
2.0000000000000020
2.0000000000000027
2.0000000000000030
...

And because of that, the real numbers 2.0000000000000012 and 2.0000000000000015 are mapped to the same floating-point number, 2.0000000000000013, so they appear to be equal!! (you can try it)
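Trying it in Java takes one line:

```java
public class DoubleRounding {
    public static void main(String[] args) {
        // Both literals round to the same 64-bit double:
        System.out.println(2.0000000000000012 == 2.0000000000000015); // true
    }
}
```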

As you can see, floating-point representation is full of surprises:

- Sometimes you find that two expressions known to be equal are different (7.19 + 1.18 # 8.37)
- Other times you find the opposite (2.0000000000000012 = 2.0000000000000015)

Bruce Dawson says that you cannot happily use the equality operator with floating-point numbers. For example, you can compute "(a+b)+c" on one hand, and "a+(b+c)" on the other. Although the results should be the same, due to round-off they easily will not be.

What is needed here is a relaxed replacement for the equality operator, one which takes into account the discrete (not continuous) distribution of the floating-point numbers over the real axis. Instead of asking "are these numbers equal?" you should ask "are they roughly equal?"

Traditionally, people have used a (very small) epsilon value: if the difference between the two numbers under comparison is less than a given epsilon, they are considered equal.

But Bruce Dawson proposes a different strategy: take the 32 (or 64) bits that represent a number in the IEEE-754 format, and consider those bits as an integer number. Then simply subtract the integer representations of the two floating-point numbers. Because the IEEE-754 standard has been intentionally designed this way, the difference tells you how many representable floating-point values there are between them (plus one). This difference is called ULPs (Units in the Last Place), and it is a more convenient value than the usual epsilon for measuring the proximity of two floating-point numbers.

Caution: when subtracting the integer values, special care has to be taken with the bit patterns representing special numbers (subnormals, zeroes, infinities, NaNs). Details are in Bruce Dawson's article.

For instance, with 32 bits the integer value derived from the bits that represent the floating-point number 2.00000000 is 1073741824. The next greater representable floating-point numbers are 2.00000024, 2.00000048, ..., which have the integer values 1073741825, 1073741826, ... respectively. (Later in Bruce Dawson's article you can see that it is more convenient to take the two's-complement integer value, rather than the plain sign-and-magnitude interpretation.)

Bruce Dawson suggests a function named ALMOST_EQUAL(a, b, n), defined as follows:

Given "a" and "b", two floating-point numbers of the same precision in the IEEE-754 format, and "n", an integer value greater than or equal to zero, ALMOST_EQUAL(a, b, n) returns true if and only if the absolute value of the difference between the integers derived from interpreting the bits of "a" and "b" as integer values is less than or equal to "n".

Another way to define the function: it returns true when there are no more than "n-1" representable floating-point numbers between "a" and "b", excluding "a" and "b" themselves; if "n" is zero, then "a" and "b" have to be exactly the same number for the function to return true.

Therefore, when you specify n=1, you are testing whether "a" is exactly equal to "b", or "a" and "b" are two consecutive representable values; the order of "a" and "b" does not matter.

"n" is the looseness or laxity of the comparison that you are willing to accept in order to consider two numbers equal. The greater "n" is, the less strict you are.

As an aside, this relaxed equality operator is commutative but not transitive: "ALMOST_EQUAL(x, y, n) AND ALMOST_EQUAL(y, z, n)" does not imply ALMOST_EQUAL(x, z, n); it does, however, imply ALMOST_EQUAL(x, z, n+n).

Example: the 32-bit floating-point numbers 2.00000000 and 2.00000048 are separated by only one other floating-point number: 2.00000024. So the difference between their integer values is 2. Therefore:

ALMOST_EQUAL(2.00000000, 2.00000048, 0) is false
ALMOST_EQUAL(2.00000000, 2.00000048, 1) is false
ALMOST_EQUAL(2.00000000, 2.00000048, 2) is true
ALMOST_EQUAL(2.00000000, 2.00000048, 3) is true
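A minimal Java sketch of the technique, using Float.floatToIntBits. The method names mirror the terminology above; the sign-bit remapping is the two's-complement style adjustment Bruce Dawson mentions, and for brevity the sketch ignores the NaN caution:

```java
public class UlpsCompare {
    // Map the IEEE-754 bit pattern to a lexicographically ordered integer,
    // so that adjacent floats always differ by 1 (negative floats need the
    // two's-complement style adjustment).
    static int ordered(float f) {
        int bits = Float.floatToIntBits(f);
        return bits < 0 ? Integer.MIN_VALUE - bits : bits;
    }

    static boolean almostEqual(float a, float b, int n) {
        // Widen to long so the subtraction cannot overflow.
        long diff = Math.abs((long) ordered(a) - (long) ordered(b));
        return diff <= n;
    }

    public static void main(String[] args) {
        System.out.println(Float.floatToIntBits(2.00000000f));       // 1073741824
        System.out.println(almostEqual(2.00000000f, 2.00000048f, 1)); // false
        System.out.println(almostEqual(2.00000000f, 2.00000048f, 2)); // true
    }
}
```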

Instead of ALMOST_EQUAL, I find it easier to understand a function named DISTANCE such that:

DISTANCE(a, b) <= n <==> ALMOST_EQUAL(a, b, n) = true

This function DISTANCE(a, b) could be provided by the compiler or by a utility module, and may be useful when you want to test whether two floating-point numbers are "the same", meaning that the two numbers are close "enough" to each other. When you use this function, you are compensating for the round-off of floating-point arithmetic.
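Sketched for 64-bit values with Double.doubleToLongBits, and applied to the opening example (the name DISTANCE follows the text; this is just an assumption of how such a utility could look, again ignoring the special bit patterns):

```java
public class Distance {
    // Map the IEEE-754 bit pattern to a lexicographically ordered long.
    static long ordered(double d) {
        long bits = Double.doubleToLongBits(d);
        return bits < 0 ? Long.MIN_VALUE - bits : bits;
    }

    static long distance(double a, double b) {
        return Math.abs(ordered(a) - ordered(b));
    }

    public static void main(String[] args) {
        // 7.19 + 1.18 prints as 8.370000000000001, yet it is only a
        // representable value or two away from the literal 8.37.
        System.out.println(distance(7.19 + 1.18, 8.37));
    }
}
```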

Summarizing:

1.- If you are dealing with exact monetary quantities (and which monetary quantities are not exact?), it would be desirable to have a data type DECIMAL(p,s) with decimal fixed-point arithmetic.

2.- If the quantities you are working with come from inherently inexact sources such as sensors, then, fine, work with floating-point numbers. But be very aware of the territory you are walking on. Floating-point representation is "as is". In particular, do not expect that a fractional decimal number (a constant in your program, or a value from user input or from a database) will be translated exactly into binary; most of the time it will not; instead, it will be replaced by the nearest floating-point number available in the representation system used by the machine. To compare two floating-point numbers for equality, a function named DISTANCE(a, b) could prove useful.