root/trunk/docsrc/d-floating-point.dd

Revision 2040, 29.8 kB (checked in by walter, 1 year ago)

typography

Line 
1 Ddoc
2
3 $(D_S $(TITLE),
4
5 $(H2 Introduction)
6
7 $(P $(I by Don Clugston))
8
9 $(P Computers were originally conceived as devices for performing mathematics. The earliest computers spent most of their time solving equations. Although the engineering and scientific community now forms only a miniscule part of the computing world, there is a fantastic legacy from those former times: almost all computers now feature superb hardware for performing mathematical calculations accurately and extremely quickly. Sadly, most programming languages make it difficult for programmers to take full advantage of this hardware. An even bigger problem is the lack of documentation; even for many mathematical programmers, aspects of floating-point arithmetic remain shrouded in mystery.
10 )
11
12 $(P As a systems programming language, the D programming language attempts to remove all barriers between the programmer and the compiler, and between the programmer and the machine. This philosophy is particularly evident in support for floating-point arithmetic. A personal anecdote may illustrate the importance of having an accurate understanding of the hardware.
13 )
14
15 $(P My first floating-point nightmare occurred in a C++ program which hung once in every hundred runs or so. I eventually traced the problem to a while loop which occasionally failed to terminate. The essence of the code is shown in Listing 1.
16 )
17
18 ------
19 double q[8];
20 ...
21 int x = 0;
22 while (x < 8) {
23   if ( q[x] >= 0 ) return true;
24   if ( q[x] < 0 ) ++x;
25 }
26 return false;
27 ------
28
29 $(P Initially, I was completely baffled as to how this harmless-looking loop could fail. But eventually, I discovered that q had not been initialized properly; q[7] contained random garbage. Occasionally, that garbage had every bit set, which mean that q[7] was a Not-a-Number (NaN), a special code which indicates that the value of the variable is nonsense. NaNs were not mentioned in the compiler's documentation - the only information I could find about them was in Intel's assembly instruction set documentation! Any comparison involving a NaN is false, so q[7] was neither >= 0, nor < 0, killing my program. Until that unwelcome discovery, I'd been unaware that NaNs even existed. I had lived in a fool's paradise, mistakenly believing that every floating point number was either positive, negative, or zero.
30 )
31
32 $(P My experience would have been quite different in D. The "strange" features of floating point have a higher visibility in the language, improving the education of numerical programmers.
33 Uninitialized floating point numbers are initialized to NaN by the compiler, so the problematic loop would fail every time, not intermittently.
34 Numerical programmers in D will generally execute their programs with the 'invalid' floating point exception enabled. Under those circumstances, as soon as the program accessed the uninitialized variable, a hardware exception would occur, summoning the debugger.
35 Easy access to the "strange" features of floating point results in better educated programmers, reduced confusion, faster debugging, better testing, and hopefully, more reliable and correct numerical programs.
36 This article will provide a brief overview of the support for floating point in the D programming language.
37 )
38
39 $(H2 Demystifying Floating-Point)
40
41 $(P D guarantees that all built-in floating-point types conform to IEEE 754 arithmetic, making behaviour entirely predictable (note that this is $(I not) the same as producing identical results on all platforms). IEEE 754-2008 is the latest revision of the IEEE 754 Standard for Floating-Point Arithmetic. D is progressing towards full compliance with 754-2008.)
42
43 $(P The IEEE standard floating point types currently supported by D are $(D float) and $(D double). Additionally, D supports the $(D real) type, which is either 'IEEE 80-bit extended' if supported by the CPU; otherwise, it is the same as $(D double). In the future, the new types from 754-2008 will be added: $(D quadruple), $(D decimal64), and $(D decimal128).)
44
45 $(P The characteristics of these types are easily accessible in the language via $(I properties). For example, $(D float.max) is the maximum value which can be stored in a float; $(D float.mant_dig) is the number of digits (bits) stored in the mantissa.)
46
47 $(P To make sense of mathematics in D, it's necessary to have a basic understanding of IEEE floating-point arithmetic. Fundamentally, it is a mapping of the infinitely many real numbers onto a small number of bytes. Only 4000 million distinct numbers are representable as an IEEE 32-bit float. Even with such a pathetically small representation space, IEEE floating point arithmetic does a remarkably good job of maintaining the illusion that mathematical real numbers are being used; but it's important to understand when the illusion breaks down.)
48
49 $(P Most problems arise from the distribution of these representable numbers. The IEEE number line is quite different to the mathematical number line.)
50
51 ---
52
53      +     +-----------+------------+    ..   +    ..    +----------+----------+     +       #
54 -infinity -float.max  -1  -float.min_normal   0   float.min_normal  1  float.max infinity  NaN
55
56 ---
57
58 $(P Notice that half of the IEEE number line lies between -1 and +1. There are 1000 million representable floats between 0 and 0.5, but only 8 million between 0.5 and 1. This has important implications for accuracy: the effective precision is incredibly high near zero. Several examples will be presented where numbers in the range -1 to +1 are treated seperately to take advantage of this.)
59
60 $(P Notice also the special numbers: $(PLUSMNINF); the so-called "subnormals" between $(PLUSMN)float.min_normal and 0, which are represented at reduced precision; the fact that there are TWO zeros, +0 and -0, and finally "NaN"("Not-a-Number"), the nonsense value, which caused so much grief in Listing 1.)
61
62 $(P Why does NaN exist? It serves a valuable role: it $(I eradicates undefined behaviour) from floating-point arithmetic. This makes floating-point completely predictable. Unlike the $(D int) type, where 3/0 invokes a hardware division by zero trap handler, possibly ending your program, the floating-point division 3.0/0.0 results in $(INFIN). Numeric overflow (eg, $(D real.max*2)) also creates $(INFIN). Depending on the application, $(INFIN) may be a perfectly valid result; more typically, it indicates an error. Nonsensical operations, such as $(D 0.0 / 0.0), result in NaN; but $(I your program does not lose control). At first glance, infinity and NaN may appear unnecessary -- why not just make it an error, just as in the integer case? After all, it is easy to avoid division by zero, simply by checking for a zero denominator before every division. The real difficulty comes from overflow: it is extremely difficult to determine in advance whether an overflow will occur in a multiplication.)
63
64 $(P Subnormals are necessary to prevent certain anomalies, and preserve important relationships such as: "x - y == 0 if and only if x == y".)
65
66 $(P Since $(INFIN) can be produced by overflow, both +$(INFIN) and -$(INFIN) are required. Both +0 and -0 are required in order to preserve identities such as: if $(D x>0), then $(D 1/(1/x) > 0). In almost all other cases, however, there is no difference between +0 and -0.)
67
68 $(P It's worth noting that these $(SINGLEQUOTE special values) are usually not very efficient. On x86 machines, for example, a multiplication involving a NaN, an infinity, or a subnormal can be twenty to fifty times slower than an operation on normal numbers. If your numerical code is unexpectedly slow, it's possible that you are inadvertently creating many of these special values. Enabling floating-point exception trapping, described later, is a quick way to confirm this.)
69
70 $(P One of the biggest factor obscuring what the machine is doing is in the conversion between binary and decimal. You can eliminate this by using the $(D "%a") format when displaying results. This is an invaluable debugging tool, and an enormously helpful aid when developing floating-point algorithms. The $(D 0x1.23Ap+6) hexadecimal floating-point format can also be used in source code for ensuring that your input data is $(I exactly) what you intended.)
71
72 $(H2 The Quantized Nature of Floating-Point)
73
74 $(P The fact that the possible values are limited gives access to some operations which are not possible on mathematical real numbers. Given a number x,
75 $(D nextUp(x)) gives the next representable number which is greater than x.
76 $(D nextDown(x)) gives the next representable number which is less than x.
77 )
78
79 $(P Numerical analysts often describe errors in terms of "units in the last place"(ulp), a surprisingly subtle term which is often used rather carelessly. [footnote:
80 The most formal definition is found in [J.-M. Muller, "On the definition of ulp(x)",INRIA Technical Report 5504 (2005).]: If $(D x) is a real number that lies between two finite consecutive floating-point numbers a and b of type F, without being equal to one of them, then ulp(x)=abs(b-a); otherwise ulp(x) = $(D x*F.epsilon). Moreover, ulp(NaN) is NaN, and ulp($(PLUSMN)F.infinity) = $(PLUSMN)$(D F.max*F.epsilon).]
81 I prefer a far simpler definition: The difference in ulps between two numbers x and y is is the number of times which you need to call nextUp() or nextDown() to move from x to y. [Footnote: This will not be an integer if either x or y is a real number, rather than a floating point number.]
82 The D library function $(D feqrel(x, y)) gives the number of bits which are equal between x and y; it is an easy way to check for loss-of-precision problems.
83 )
84
85 $(P The quantized nature of floating point has some interesting consequences.)
86
87 $(UL
88 $(LI ANY mathematical range [a,b$(RPAREN), $(LPAREN)a,b], or (a,b) can be converted into a range
89 or the form [a,b]. (The converse does not apply: there is no (a,b)
90 equivalent to [-$(INFIN), $(INFIN)]).)
91 $(LI A naive binary chop doesn't work correctly. The fact that there are hundreds or thousands of times as many representable numbers between 0 and 1, as there are between 1 and 2, is problematic for divide-and-conquer algorithms. A naive binary chop would divide the interval [0 .. 2] into [0 .. 1] and [1 .. 2]. Unfortunately, this is not a true binary chop, because the interval [0 .. 1] contains more than 99% of the representable numbers from the original interval!)
92 )
93
94 $(H2 Condition number)
95
96 $(P Using nextUp, it's easy to approximately calculate the condition number.)
97
98 ---
99 real x = 0x1.1p13L;
100 real u = nextUp(x);
101
102 int bitslost = feqrel(x, u) - feqrel(exp(x), exp(u));
103 ---
104
105 $(P This shows that at these huge values of x, a one-bit error in x destroys 12 bits of accuracy in exp(x)!
106 The error has increased by roughly 6000 units in the last place. The condition number is thus 6000 at this value of x.
107 )
108
109 $(H2 The semantics of float, double, and real)
110
111 $(P For the x86 machines which dominate the market, floating point has traditionally been performed on a descendent of the 8087 math coprocessor. These "x87" floating point units were the first to implement IEEE754 arithmetic. The SSE2 instruction set is an alternative for x86-64 processors, but x87 remains the only portable option for floating point 32-bit x86 machines (no 32-bit AMD processors support SSE2).)
112
113 $(P The x87 is unusual compared to most other floating-point units. It _only_ supports 80-bit operands, henceforth termed "real80". All $(D double) and $(D float) operands are first converted to 80-bit, all arithmetic operations are performed at 80-bit precision, and the results are reduced to 64-bit or 32-bit precision if required. This means that the results can be significantly more accurate than on a machine which supports at most 64 bit operations. However, it also poses challenges for writing portable code.
114 (Footnote: The x87 allows you to reduce the mantissa length to be the same as '$(D double) or $(D float), but it retains the real80 exponent, which means different results are obtained with subnormal numbers. To precisely emulate $(D double) arithmetic slows down floating point code by an order of magnitude).
115 )
116
117 $(P Apart from the x87 family, the Motorola 68K (but not ColdFire) and Itanium processors are the only ones which support 80-bit floating point.)
118
119 $(P A similar issue relates to the FMA (fused multiply and accumulate) instruction, which is available on an increasing number of processors, including PowerPC, Itanium, Sparc, and Cell. On such processors, when evaluating expressions such as $(D x*y + z), the $(D x*y) is performed at twice the normal precison. Some calculations which would otherwise cause a total loss of precision, are instead calculated exactly.
120 The challenge for a high-level systems programming language is to create an abstraction which provides predictable behaviour on all platforms, but which nonetheless makes good use of the available hardware.
121 )
122
123 $(P D's approach to this situation arises from the following observations:)
124
125 $(OL
126 $(LI It is extremely costly performance-wise to ensure identical behaviour on all processors. In particular, it is crippling for the x87.)
127 $(LI Very many programs will only run on a particular processor. It would be unfortunate to deny the use of more accurate hardware, for the sake of portability which would never be required.)
128 $(LI The requirements for portability and for high precision are never required simultaneously. If $(D double) precision is inadequate, increasing the precision on only some processors doesn't help.)
129 $(LI The language should not be tied to particular features of specific processors. )
130 )
131
132 $(P A key design goal is: it should be possible to write code such that, regardless of the processor which is used, the accuracy is never worse than would be obtained on a system which only supports the $(D double) type.)
133
134 $(P (Footnote: $(D real) is close to $(SINGLEQUOTE indigenous) in the Borneo proposal for the Java programming language[Ref Borneo]).)
135
136 $(P Consider evaluating $(D x*y + z*w), where $(D x, y, z) and $(D w) are double.)
137
138 $(OL
139 $(LI double r1 = x * y + z * w;)
140 $(LI double a  = x * y; double r2 = a + z * w;)
141 $(LI real   b  = x * y; double r3 = b + z * w;)
142 )
143
144 $(P Note that during optimisation, (2) and (3) may be transformed into (1), but this is implementation-dependent.
145 Case (2) is particularly problematic, because it introduces an additional rounding.
146 )
147
148 $(P On a "simple" CPU, r1==r2==r3. We will call this value r0.
149 On PowerPC, r2==r3, but r1 may be more accurate than the others, since it enables use of FMA.
150 On x86, r1==r3, which may be more accurate than r0, though not as much as for the PowerPC case.
151 r2, however, may be LESS accurate than r0.
152 )
153
154 $(P By using $(D real) for intermediate values, we are guaranteed that our results are never worse than for a simple CPU which only supports $(D double).)
155
156 $(H2 Properties of the Built-in Types)
157
158 $(P The fundamental floating-point properties are $(D epsilon), $(D min_normal) and $(D max). The six integral properties are simply the log2 or log10 of these three.)
159
160 $(TABLE1
161 $(TR $(TH &nbsp;) $(TH float) $(TH double) $(TH real80) $(TH quadruple) $(TH decimal64) $(TH decimal128))
162 $(TR $(TD epsilon) $(TD 0x1p-23) $(TD 0x1p-52) $(TD 0x1p-63) $(TD 0x1p-112) $(TD 1e-16 (1p-54)) $(TD 1e-34 (1p-113)))
163 $(TR $(TD [min_normal) $(TD 0x1p-126) $(TD 0x1p-1022) $(TD 0x1p-16382) $(TD 0x1p-16382) $(TD 1e-383) $(TD 1e-6143))
164 $(TR $(TD ..max$(RPAREN)) $(TD 0x1p+128) $(TD 0x1p+1024) $(TD 0x1p+16384) $(TD 0x1p+16384) $(TD 1e+385) $(TD 1e+6145))
165 $(TR <td colspan=7>binary properties</td>)
166 $(TR $(TD mant_dig) $(TD 24) $(TD 53) $(TD 64) $(TD 113) $(TD 53) $(TD 112))
167 $(TR $(TD min_exp) $(TD -125) $(TD -1021) $(TD -16381) $(TD -16381) $(TD &nbsp;) $(TD &nbsp;))
168 $(TR $(TD max_exp) $(TD +128) $(TD +1024) $(TD +16384) $(TD +16384) $(TD &nbsp;) $(TD &nbsp;))
169 $(TR <td colspan=7>decimal properties</td>)
170 $(TR $(TD dig) $(TD 6) $(TD 15) $(TD 18) $(TD 33) $(TD 16) $(TD 34))
171 $(TR $(TD min_10_exp) $(TD -37) $(TD -307) $(TD -4932) $(TD -4932) $(TD -382) $(TD -6142))
172 $(TR $(TD max_10_exp) $(TD +38) $(TD +308) $(TD +4932) $(TD +4932) $(TD 385) $(TD +6145))
173 )
174
175 $(P When writing code which should adapt to different CPUs at compile time, use $(D static if) with the $(D mant_dig) property. For example, $(D static if (real.mant_dig==64)) is true if 80-bit reals are available.
176 For binary types, the $(D dig) property gives only the $(I minimum) number of valid decimal digits. To ensure that that every representable number has a unique decimal representation, two additional digits are required. Similarly, for decimal numbers, $(D mant_dig) is a lower bound on the number of valid binary digits.
177 )
178
179 $(H2 Useful relations for a floating point type $(D F), where $(D x) and $(D y) are of type $(D F))
180
181 $(UL
182 $(LI The smallest representable number is $(D F.min_normal * F.epsilon))
183 $(LI Any integer between 0 and $(D (1/F.epsilon)) can be stored in F without loss of precision.
184   $(D 1/F.epsilon) is always a exact power of the base.)
185 $(LI If a number $(D x) is subnormal, $(D x*(1/F.epsilon)) is normal, and
186   $(D exponent(x) = exponent(x*(1/F.epsilon)) - (mant_dig-1)).)
187 $(LI $(D x>0) if and only if $(D 1/(1/x) > 0); $(D x<0) if and only if $(D 1/(1/x) < 0).)
188 $(LI If $(D x-y==0), then $(D x==y  && isFinite(x) && isFinite(y)). Note that if $(D x==y==infinity), then $(D isNaN(x-y)).)
189 $(LI $(D F.max * F.min_normal = 4.0) for binary types, $(D 10.0) for decimal types.)
190 )
191
192 $(H3  Addition and subtraction)
193
194 $(UL 
195 $(LI Some loss of precision occurs with x$(PLUSMN)y if exponent(x)!=exponent(y). The number of digits of precision which are lost is abs(exponent(x)-exponent(y)).)
196 $(LI x$(PLUSMN)y has total loss of precision, if and only if
197    (1)  $(D abs(x * F.epsilon) > abs(y)), in which case x+y == x, x-y == x
198 or (2)  $(D abs(y * F.epsilon) > abs(x)), in which case x+y == y, x-y == -y)
199 $(LI Addition is commutative: $(D a + b == b + a).)
200 $(LI Subtraction is not quite commutative: $(D a - b == -(b - a)), but produce +0 and -0 if a==b.)
201 $(LI Addition is not associative at all.)
202 )
203
204 $(H3 Multiplication and division)
205    
206 $(UL 
207 $(LI Multiplication and division are $(I always) at risk of overflow or underflow.
208   For any $(D abs(x) > F.epsilon), there is at least one finite $(D y) such that $(D x/y) will overflow to $(INFIN).
209   For any $(D abs(x) < F.epsilon), there is at least one finite $(D y) such that $(D x/y) will underflow to zero.
210   For any $(D abs(x) > 1), there is at least one finite $(D y) such that $(D x*y) will overflow to $(INFIN).
211   For any $(D abs(x) < 1), there is at least one finite $(D y) such that $(D x*y) will underflow to zero.
212 )
213 $(LI $(D x*x) will overflow if $(D abs(x)>sqrt(F.max)), and underflow to zero if $(D abs(x) < sqrt(F.min_normal*F.epsilon))  )
214 $(LI Multiplication is commutative. $(D a * b == b * a)).
215 $(LI Multiplication is not associative in general: $(D a*(b*c) != (a*b)*c), because (1) there is a risk of overflow or underflow and (2) $(D b*c) may be an exact calculation, so that $(D a*(b*c)) contains only one round-off error, whereas $(D (a*b)*c) contains two. The roundoff errors may therefore accumulate at the rate of just under 1 ulp per multiplication.)
216 $(LI However, a limited form of associativity is possible if the type used for intermediate results is larger than any of the operands (which happens on x87 and Itanium machines). If $(D R) is the intermediate type, and $(D F) is the type being multiplied, up to $(D min(R.max_exp/F.max_exp, R.epsilon/F.epsilon)) values of type $(D F) can be multiplied together in any order without influencing the result. For example, if $(D R) is $(D double), multiplication of 8 floats $(D f1*f2*f3*f4*f5*f6*f7*f8) is completely associative. On x87, 130 floats can be safely multiplied together in any order, and 16 doubles can similarly be multiplied together safely.
217 Strict distributivity does not hold even under these circumstances, as it may destroy the sign of -0.)
218 $(LI The distributive law almost never holds. For example, $(D 4*x + 6*x != 10*x) if $(D x==nextDown(1.5)). $(D a*x + b*x == (a+b)*x) for all $(D x) only if the operations $(D a*x, b*x), and $(D (a+b)) are all exact operations, which is true only if $(D a) and $(D b) are exact powers of 2. Even then, if $(D a==-b) and $(D x==-0), then $(D a*x+b*x==0.0, (a+b)*x==-0.0).)
219 $(LI Performing a division by multiplication by the reciprocal returns a result which (in round-to-nearest mode) is at most 1.5 ulps from the correctly rounded result. For almost any denominator, the rounding is incorrect (>0.5ulps) for 27% of numerators. [Ref: N. Brisebarre, J-M Muller, and S.K. Raina, "Accelerating Correctly Rounded Floating-Point Division when the Divisor Is Known in Advance", IEEE Trans. on Computers, Vol 53, pp 1069-1072 (2004)].)
220 )
221
222
223 $(H3  Powers and logarithms)
224
225 $(UL 
226 $(LI $(D F.mant_dig = -log2(F.epsilon)) for binary types;)
227 $(LI  $(D F.dig = -log10(F.epsilon)) for decimal types.)
228 $(LI $(D F.max =  exp2(F.max_exp*(1-F.epsilon))) for binary types;)
229 $(LI $(D F.max = exp10(F.max_10_exp*(1-F.epsilon))) for decimal types.)
230 $(LI For any positive finite $(D x), $(D F.min_exp - F.mant_dig <= log2(x) < F.max_exp) for binary types,
231                              $(D F.min_10_exp - F.dig <= log10(x) < F.max_10_exp)  for decimal types)
232 $(LI $(D exp2(x) == 0) if $(D x < F.min_exp - F.mant_dig), $(D exp2(x) == infinity) if $(D x >= F.max_exp))
233 )
234
235 $(H2 NaN payloads)
236
237 $(P According to the IEEE 754 standard, a $(SINGLEQUOTE payload) can be stored in the mantissa of a NaN. This payload can contain information about how or why it was generated. Historically, almost no programming languages have ever made use of this potentially powerful feature. In D, this payload consists of a positive integer.)
238
239 $(UL 
240 $(LI $(D real NaN(ulong payload)) -- create a NaN with a "payload", where the payload is a $(D ulong).)
241 $(LI $(D ulong getNaNPayload(real x)) -- returns the integer payload. Note that if narrowing conversions have occured, the high-order bits may have changed.)
242 )
243
244 $(P $(I Never) store a pointer as an integer payload inside a NaN. The garbage collector will not be able to find it!)
245
246 $(H2 NCEG comparison operations)
247
248 $(P As well as the usual $(D <), $(D >), $(D <=), and $(D >=) comparison operators, D also supports the "NCEG" operators. Most of them are the direct negation of the ordinary operators. Additionally, $(D <>), $(D <>=), $(D !<>), and $(D !<>=) are provided. These 8 new operators are different from the normal operators only when a NaN is involved, so for the most part they are quite obscure. They are useful mainly in eliminating the possibility of NaN before beginning a calculation.
249 The most useful relationships are probably:
250 $(UL
251 $(LI $(D x  <>= y)   is the same as $(D !isNaN(x) && !isNaN(y)), (except that signalling NaNs may be triggered by <>=).)
252 $(LI $(D x !<>= y)   is the same as $(D isNaN(x) ||  isNaN(y)).)
253 )
254 If $(D y) is any compile-time constant (eg 0), these reduce to $(D !isNaN(x) and isNaN(x)).
255 Note that $(D x==x) is the same as $(D !isNaN(x)), and $(D x!=x) is the same as $(D isNaN(x)).
256 $(D abs(x) !< x.infinity)  is the same as $(D isNaN(x) || isInfinity(x))
257 The above relationships are useful primarily because they can be used in compile time functions.
258 Very few uses are known for the remaining NCEG operators.
259 )
260
261 $(H2 The IEEE Rounding Modes)
262
263 $(P The rounding mode is controlled within a scope. Rounding mode will be restored to its previous state at the end of that scope.
264 Four rounding modes can be set. The default mode, $(I Round to nearest), is the most statistically accurate, but the least intuitive. In the event of tie, the result is rounded to an even number.
265 )
266
267 $(TABLE1
268 $(TR $(TH Rounding mode) $(TH rndint(4.5)) $(TH rndint(5.5)) $(TH rndint(-4.5)) $(TH Notes))
269 $(TR $(TD Round to nearest) $(TD 4) $(TD 6) $(TD -4) $(TD Ties round to an even number))
270 $(TR $(TD Round down) $(TD 4) $(TD 5) $(TD -5) $(TD &nbsp;))
271 $(TR $(TD Round up) $(TD 5) $(TD 6) $(TD -4) $(TD &nbsp;))
272 $(TR $(TD Round to zero) $(TD 4) $(TD 5) $(TD -4) $(TD &nbsp;))
273 )
274
275 $(P There are very few reasons for changing the rounding mode.
276 The round-up and round-down modes were created specifically to allow fast implementations of interval arithmetic; they are crucial to certain libraries, but rarely used elsewhere.
277 The round-to-zero mode is used for casting floating-point numbers to integers. Since mode switching is slow, especially on Intel machines, it may be useful to switch to round-to-zero mode, in order to exactly duplicate the behaviour of $(D cast(int)) in an inner loop.
278 )
279
280 $(P The only other commonly cited reason for changing the rounding mode is as a simple check for numerical stability: if the calculation produces wildly different results when the rounding mode is changed, it's a clear sign that it is suffering from round-off errors. )
281
282 $(H2 The IEEE Exception Status Flags)
283
284 $(P All IEEE-compiliant processors include special status bits that indicate when "weird" things have happened that programs might want to know about. For example, $(D ieeeFlags.divideByZero) tells if any infinities have been created by dividing by zero. They are 'sticky' bits: once they have been set, they remain set until explicitly cleared. By only checking this once at the end of a calculation, it may be possible to avoid comparing thousands of comparisions that are almost never going to fail.)
285
286 $(P Here's a list of the weird things that can be detected:)
287
288 $(DL
289 $(DT invalid) $(DD This is set if any NaN's have been generated. This can happen with $(INFIN) - $(INFIN), $(INFIN) * 0, 0 * $(INFIN), 0/0, $(INFIN)/$(INFIN), $(INFIN)%$(INFIN), or $(D x%0), for any number $(D x). Several other operations, such as sqrt(-1), can also generate a NaN. The $(I invalid) condition is also set when a 'signalling NaN' is accessed, indicating use of an uninitialized variable. This almost always indicates a programming error.)
290
291 $(DT overflow) $(DD Set if $(INFIN) was generated by adding or multiplying two numbers that were so large that the sum was greater than $(D real.max). This almost always indicates that the result is incorrect; and corrective action needs to be taken.)
292
293 $(DT divisionByZero) $(DD Set if $(PLUSMNINF) was generated by dividing by zero. This usually indicates a programming error, but not always; some types of calculations return correct results even when division by zero occurs.
294 (For example, $(D 1/(1+ 1/x) == 0) if $(D x == 0)). Note that division by a tiny, almost-zero number also produces an infinite result, but sets the overflow flag rather than the divisionByZero flag.
295 )
296
297 $(DT underflow) $(DD This happens if two numbers are subtracted or divided and are so tiny that the result lost precision because it was subnormal. Extreme underflow produces a zero result. Underflow almost never creates problems, and can usually be ignored.)
298
299 $(DT inexact) $(DD This indicates that rounding has occurred. Almost all floating point operations set this flag! It was apparently included in the hardware to support some arcane tricks used in the pioneering days of numerical analysis. It can always be ignored.)
300 )
301
302 $(P Floating-point traps can be enabled for any of the categories listed above. When enabled, a hardware exception will be generated.
303 This can be an invaluable debugging aid.
304 A more advanced usage, not yet supported on any platform(!) is to provide a nested function to be used as a hardware exception handler. This is most useful for the overflow and underflow exceptions.
305 )
306
307 $(H2 Floating point and $(SINGLEQUOTE pure nothrow))
308
309 $(P Every floating point operation, even the most trivial, is affected by the floating-point rounding state, and writes to the sticky flags. The status flags and control state are thus 'hidden variables', potentially affecting every $(D pure) function; and if the floating point traps are enabled, any floating point operation can generate a hardware exception.
310 D provides a facility for the floating-point control mode and exception flags to be usable in limited circumstances even when $(D pure) and $(D nothrow) functions are called.
311 )
312
313 $(P [TODO: I've made two proposals, but I haven't persauded Walter yet!].)
314
315 $(H2 Conclusion)
316
317 $(P Although D is a general-purpose programming language and supports many high-level concepts, it gives direct and convenient access to almost all features of modern floating-point hardware. This makes it an excellent language for development of robust, high-performance numerical code. It is also a language which encourages a deep understanding of the machine, making it fertile ground for innovation and for developing new algorithms.)
318
319
320
321 $(H2 References and Further Reading)
322
323 $(OL
324 $(LI
325 $(LINK2 http://docs.sun.com/source/806-3568/ncg_goldberg.html,
326 "What Every Computer Scientist Should Know About Floating-Point Arithmetic")
327 )
328 $(LI
329 $(LINK2 http://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html,
330 "An Interview with the Old Man of Floating-Point: Reminiscences elicited from William Kahan by Charles Severance")
331 )
332 $(LI
333 N. Brisebarre, J-M Muller, and S.K. Raina, "Accelerating Correctly Rounded Floating-Point Division when the Divisor Is Known in Advance", IEEE Trans. on Computers, Vol 53, pp 1069-1072 (2004).
334 )
335 $(LI
336 $(LINK2 http://www.sonic.net/~jddarcy/Borneo/,
337 "The Borneo language")
338 )
339 )
340
341 )
342
343 Macros:
344       TABLE = <table border=1 cellpadding=4 cellspacing=0>
345               $0</table>
346       CAPTION = <caption>$0</caption>
347       SVH = $(TR $(TH $1) $(TH $2))
348       SV  = $(TR $(TD $1) $(TD $2))
349       NAN = $(RED NAN)
350       SUP = <span style="vertical-align:super;font-size:smaller">$0</span>
351       POWER = $1<sup>$2</sup>
352       SUB = $1<sub>$2</sub>
353       BIGSUM = $(BIG &Sigma; <sup>$2</sup><sub>$(SMALL $1)</sub>)
354       CHOOSE = $(BIG &#40;) <sup>$(SMALL $1)</sup><sub>$(SMALL $2)</sub> $(BIG &#41;)
355       PLUSMN = &plusmn;
356       INFIN = &infin;
357       PLUSMNINF = &plusmn;&infin;
358       PI = &pi;
359       LT = &lt;
360       GT = &gt;
361       SQRT = &radix;
362       HALF = &frac12;
363       D = <font face=Courier><b>$0</b></font>
364       D = <span class="d_inlinecode">$0</span>
365     H2=<h2>$0</h2>
366     H3=<h3>$0</h3>
367     TITLE=Real Close to the Machine: Floating Point in D
368     WIKI=FloatingPointInD
Note: See TracBrowser for help on using the browser.