root/trunk/docsrc/float.dd

Revision 1278, 7.3 kB (checked in by walter, 3 years ago)

Don's suggestion for restricting rounding modes

  • Property svn:eol-style set to native
Line 
1 Ddoc
2
3 $(SPEC_S Floating Point,
4
5 <h3>Floating Point Intermediate Values</h3>
6
7     $(P On many computers, greater
8     precision operations do not take any longer than lesser
9     precision operations, so it makes numerical sense to use
10     the greatest precision available for internal temporaries.
11     The philosophy is not to dumb down the language to the lowest
12     common hardware denominator, but to enable the exploitation
13     of the best capabilities of target hardware.
14     )
15
16     $(P For floating point operations and expression intermediate values,
17     a greater precision can be used than the type of the
18     expression.
19     Only the minimum precision is set by the types of the
20     operands, not the maximum. $(B Implementation Note:) On Intel
21     x86 machines, for example,
22     it is expected (but not required) that the intermediate
23     calculations be done to the full 80 bits of precision
24     implemented by the hardware.
25     )
26
27     $(P It's possible that, due to greater use of temporaries and
28     common subexpressions, optimized code may produce a more
29     accurate answer than unoptimized code.
30     )
31
32     $(P Algorithms should be written to work based on the minimum
33     precision of the calculation. They should not degrade or
34     fail if the actual precision is greater. Float or double types,
35     as opposed to the real (extended) type, should only be used for:
36     )
37
38     $(UL
39         $(LI reducing memory consumption for large arrays)
40         $(LI gaining speed when it matters more than accuracy)
41         $(LI data and function argument compatibility with C)
42     )
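
    $(P As a minimal sketch of the first case in the list above, the
    hypothetical function below keeps its data in $(B float) to save
    memory but accumulates in $(B real), so the result can only gain
    accuracy when the hardware offers more precision. The name sumAsReal
    is an assumption made for illustration and is not part of any library.)

```d
import std.stdio;

/// Sums float data using a real accumulator (hypothetical helper).
real sumAsReal(const(float)[] values)
{
    real total = 0;
    foreach (v; values)
        total += v;   // each float is widened to real before adding
    return total;
}

void main()
{
    auto data = new float[](1000);
    data[] = 0.1f;                       // compact storage as float
    writefln("%.10g", sumAsReal(data));  // accumulated in real
}
```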
43
44 <h3>Floating Point Constant Folding</h3>
45
46     $(P Regardless of the type of the operands, floating point
47     constant folding is done in $(B real) or greater precision.
48     It is always done following IEEE 754 rules and round-to-nearest
49     is used.)
50
51     $(P Floating point constants are internally represented in
52     the implementation in at least $(B real) precision, regardless
53     of the constant's type. The extra precision is available for
54     constant folding. Committing to the precision of the result is
55     done as late as possible in the compilation process. For example:)
56
57 ---
58 const float f = 0.2f;
59 writefln(f - 0.2);
60 ---
61     $(P will print 0. A non-const static variable's value cannot be
62     propagated at compile time, so:)
63
64 ---
65 static float f = 0.2f;
66 writefln(f - 0.2);
67 ---
68     $(P will print 2.98023e-09. Hex floating point constants can also
69     be used when specific floating point bit patterns are needed that
70     are unaffected by rounding. To find the hex value of 0.2f:)
71
72 ---
73 import std.stdio;
74
75 void main()
76 {
77     writefln("%a", 0.2f);
78 }
79 ---
80     $(P which is 0x1.99999ap-3. Using the hex constant:)
81
82 ---
83 const float f = 0x1.99999ap-3f;
84 writefln(f - 0.2);
85 ---
86
87     $(P prints 2.98023e-09.)
88
89     $(P Different compiler settings, optimization settings,
90     and inlining settings can affect opportunities for constant
91     folding, therefore the results of floating point calculations may differ
92     depending on those settings.)
93
94 <h3>Complex and Imaginary types</h3>
95
96     $(P In existing languages, there is an astonishing amount of effort expended in trying to jam a
97     complex type onto existing type definition facilities: templates, structs, operator
98     overloading, etc., and it all usually ultimately fails. It fails because the semantics of
99     complex operations can be subtle, and it fails because the compiler doesn't know what the
100     programmer is trying to do, and so cannot optimize the semantic implementation.
101     )
102
103     $(P This is all done to avoid adding a new type. Adding a new type means that the compiler
104     can make all the semantics of complex work "right". The programmer then can rely on a
105     correct (or at least fixable) implementation of complex.
106     )
107
108     $(P Coming with the baggage of a complex type is the need for an imaginary type. An
109     imaginary type eliminates some subtle semantic issues, and improves performance by not
110     having to perform extra operations on the implied 0 real part.
111     )
112
113     $(P Imaginary literals have an i suffix:
114     )
115
116 ------
117 ireal j = 1.3i;
118 ------
119
120     $(P There is no particular complex literal syntax; just add a real and an
121     imaginary value:
122     )
123
124 ------
125 cdouble cd = 3.6 + 4i;
126 creal c = 4.5 + 2i;
127 ------
128
129     $(P Complex, real and imaginary numbers have two properties:
130     )
131
132 <pre>
133 .re get real part (0 for imaginary numbers)
134 .im get imaginary part as a real (0 for real numbers)
135 </pre>
136
137     $(P For example:
138     )
139
140 <pre>
141 cd.re       is 3.6 double
142 cd.im       is 4 double
143 c.re        is 4.5 real
144 c.im        is 2 real
145 j.im        is 1.3 real
146 j.re        is 0 real
147 </pre>
148
149 <h3>Rounding Control</h3>
150
151     $(P IEEE 754 floating point arithmetic includes the ability to set 4
152     different rounding modes.
153     These are accessible via the functions in std.c.fenv.
154     )
155
156 $(V2
157     $(P If the floating-point rounding mode is changed within a function,
158     it must be restored before the function exits. If this rule is violated
159     (for example, by the use of inline asm), the rounding mode used for
160     subsequent calculations is undefined.
161     )
162 )
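
    $(P One way to follow the restore rule is sketched below. It assumes
    the C99 bindings in core.stdc.fenv, the current druntime counterpart
    of std.c.fenv, and a hypothetical helper named withRounding. Whether
    a given calculation actually observes the changed mode can depend on
    the compiler and its optimization settings.)

```d
import core.stdc.fenv;

/// Evaluates dg under the given rounding mode, then restores the
/// mode that was active on entry (hypothetical helper).
real withRounding(int mode, scope real delegate() dg)
{
    immutable saved = fegetround();
    fesetround(mode);
    scope (exit) fesetround(saved);   // runs on normal return and on throw
    return dg();
}

void main()
{
    immutable before = fegetround();
    real a = 1.0L, b = 3.0L;
    immutable up = withRounding(FE_UPWARD, () => a / b);
    immutable down = withRounding(FE_DOWNWARD, () => a / b);
    assert(down <= up);               // rounding down never exceeds rounding up
    assert(fegetround() == before);   // the caller's mode is back in place
}
```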
163
164 <h3>Exception Flags</h3>
165
166     $(P IEEE 754 floating point arithmetic can set several flags based on what
167     happened with a
168     computation:)
169
170     $(TABLE
171     $(TR $(TD FE_INVALID))
172     $(TR $(TD FE_DENORMAL))
173     $(TR $(TD FE_DIVBYZERO))
174     $(TR $(TD FE_OVERFLOW))
175     $(TR $(TD FE_UNDERFLOW))
176     $(TR $(TD FE_INEXACT))
177     )
178
179     $(P These flags can be set/reset via the functions in
180     $(LINK2 phobos/std_c_fenv.html, std.c.fenv).)
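
    $(P A minimal sketch of clearing and testing a flag follows. It assumes
    the C99 bindings in core.stdc.fenv, and divideChecked is a hypothetical
    helper. Because a compiler may fold or reorder floating point
    operations, the reported flag should be treated as illustrative rather
    than guaranteed.)

```d
import core.stdc.fenv;
import std.stdio;

/// Divides num by den, stores the result in quotient, and reports
/// whether FE_DIVBYZERO was raised (hypothetical helper).
bool divideChecked(double num, double den, out double quotient)
{
    feclearexcept(FE_ALL_EXCEPT);             // start from a clean set of flags
    quotient = num / den;
    return fetestexcept(FE_DIVBYZERO) != 0;   // nonzero means the flag is set
}

void main()
{
    double q;
    writeln(divideChecked(1.0, 2.0, q), " ", q);
    writeln(divideChecked(1.0, 0.0, q), " ", q);
}
```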
181
182 <h3>Floating Point Comparisons</h3>
183
184     $(P In addition to the usual &lt; &lt;= &gt; &gt;= == != comparison
185     operators, D adds more that are
186     specific to floating point. These are
187     !&lt;&gt;=
188     &lt;&gt;
189     &lt;&gt;=
190     !&lt;=
191     !&lt;
192     !&gt;=
193     !&gt;
194     !&lt;&gt;
195     and match the semantics for the
196     NCEG extensions to C.
197     See $(LINK2 expression.html#floating_point_comparisons, Floating point comparisons).
198     )
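
    $(P The results of a few of these operators can be sketched with the
    usual comparisons and isNaN from std.math. The function names below
    are hypothetical and chosen only for illustration. The sketch covers
    the boolean result only, not any difference in raised exceptions.)

```d
import std.math : isNaN;

/// Same result as `a !<>= b`: true when the operands are unordered.
bool isUnordered(double a, double b) { return isNaN(a) || isNaN(b); }

/// Same result as `a <>= b`: true when the operands are ordered.
bool isOrdered(double a, double b) { return !isUnordered(a, b); }

/// Same result as `a <> b`: true when a is less than or greater than b.
bool isLessOrGreater(double a, double b) { return a < b || a > b; }

/// Same result as `a !<= b`: true when unordered or greater.
bool isNotLessOrEqual(double a, double b) { return !(a <= b); }

void main()
{
    immutable nan = double.nan;
    assert(isUnordered(nan, 1.0) && !isOrdered(nan, 1.0));
    assert(!isLessOrGreater(nan, 1.0) && isLessOrGreater(1.0, 2.0));
    assert(isNotLessOrEqual(nan, 1.0) && isNotLessOrEqual(2.0, 1.0));
    assert(!isNotLessOrEqual(1.0, 1.0));
}
```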
199
200 <h3><a name="floating-point-transformations">Floating Point Transformations</a></h3>
201
202     $(P An implementation may perform transformations on
203     floating point computations in order to reduce their strength,
204     i.e. their runtime computation time.
205     Because floating point math does not precisely follow mathematical
206     rules, some transformations are not valid, even though some
207     other programming languages still allow them.
208     )
209
210     $(P The following transformations of floating point expressions
211     are not allowed because under IEEE rules they could produce
212     different results.
213     )
214
215     $(TABLE1
216     <caption>Disallowed Floating Point Transformations</caption>
217     $(TR
218     $(TH transformation) $(TH comments)
219     )
220     $(TR
221     $(TD $(I x) + 0 &rarr; $(I x)) $(TD not valid if $(I x) is -0)
222     )
223     $(TR
224     $(TD $(I x) - 0 &rarr; $(I x)) $(TD not valid if $(I x) is &plusmn;0 and rounding is towards -&infin;)
225     )
226     $(TR
227     $(TD -$(I x) &harr; 0 - $(I x)) $(TD not valid if $(I x) is +0)
228     )
229     $(TR
230     $(TD $(I x) - $(I x) &rarr; 0) $(TD not valid if $(I x) is NaN or &plusmn;&infin;)
231     )
232     $(TR
233     $(TD $(I x) - $(I y) &harr; -($(I y) - $(I x))) $(TD not valid because 1-1 = +0 whereas -(1-1) = -0)
234     )
235     $(TR
236     $(TD $(I x) * 0 &rarr; 0) $(TD not valid if $(I x) is NaN or &plusmn;&infin;)
237     )
238 $(COMMENT
239     $(TR
240     $(TD $(I x) * 1 &rarr; $(I x)) $(TD not valid if $(I x) is a signaling NaN)
241     )
242 )
243     $(TR
244     $(TD $(I x) / $(I c) &harr; $(I x) * (1/$(I c))) $(TD valid only if (1/$(I c)) yields an exact result)
245     )
246     $(TR
247     $(TD $(I x) != $(I x) &rarr; false) $(TD not valid if $(I x) is a NaN)
248     )
249     $(TR
250     $(TD $(I x) == $(I x) &rarr; true) $(TD not valid if $(I x) is a NaN)
251     )
252     $(TR
253     $(TD $(I x) !$(I op) $(I y) &harr; !($(I x) $(I op) $(I y))) $(TD not valid if $(I x) or $(I y) is a NaN)
254     )
255     )
256
257     $(P Of course, transformations that would alter side effects are also
258     invalid.)
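
    $(P Several rows of the table can be checked at run time. The sketch
    below assumes the default round-to-nearest mode and uses signbit and
    isNaN from std.math. The function names are hypothetical.)

```d
import std.math : isNaN, signbit;

/// True when rewriting x + 0 as x would change the sign of the result.
bool addZeroChangesSign(double x)
{
    return (signbit(x + 0.0) != 0) != (signbit(x) != 0);
}

/// True when rewriting x - x as 0 would be wrong for this x.
bool subSelfIsNotZero(double x)
{
    return isNaN(x - x);
}

/// True when rewriting x * 0 as 0 would be wrong for this x.
bool mulZeroIsNotZero(double x)
{
    return isNaN(x * 0.0);
}

void main()
{
    assert(addZeroChangesSign(-0.0));          // (-0) + (+0) is +0
    assert(!addZeroChangesSign(1.5));
    assert(subSelfIsNotZero(double.infinity)); // infinity - infinity is NaN
    assert(subSelfIsNotZero(double.nan));
    assert(mulZeroIsNotZero(double.infinity)); // infinity * 0 is NaN
    assert(!mulZeroIsNotZero(2.0));
    double nan = double.nan;
    assert(nan != nan);                        // so x != x cannot become false
}
```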
259
260 )
261
262 Macros:
263     TITLE=Floating Point
264     WIKI=Float