Ddoc

$(SPEC_S Floating Point,

<h3>Floating Point Intermediate Values</h3>

$(P On many computers, greater
precision operations do not take any longer than lesser
precision operations, so it makes numerical sense to use
the greatest precision available for internal temporaries.
The philosophy is not to dumb down the language to the lowest
common hardware denominator, but to enable the exploitation
of the best capabilities of target hardware.
)

$(P For floating point operations and expression intermediate values,
a greater precision can be used than the type of the
expression.
Only the minimum precision is set by the types of the
operands, not the maximum. $(B Implementation Note:) On Intel
x86 machines, for example,
it is expected (but not required) that the intermediate
calculations be done to the full 80 bits of precision
implemented by the hardware.
)

$(P It's possible that, due to greater use of temporaries and
common subexpressions, optimized code may produce a more
accurate answer than unoptimized code.
)

$(P Algorithms should be written to work based on the minimum
precision of the calculation. They should not degrade or
fail if the actual precision is greater. The float and double types,
as opposed to the real (extended) type, should only be used for:
)

$(UL
$(LI reducing memory consumption for large arrays)
$(LI gaining speed when speed is more important than accuracy)
$(LI maintaining data and function argument compatibility with C)
)
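
$(P As a minimal sketch of the first case, the following keeps the data
in a float array to save memory but accumulates in real.
The function name sumWide is hypothetical and chosen only for this illustration.
)

---
/// Sums float data using a real accumulator, so storage stays compact
/// while the arithmetic uses the widest precision available.
real sumWide(const(float)[] values)
{
    real total = 0.0L;
    foreach (v; values)
        total += v; // each float is widened to real before the addition
    return total;
}

unittest
{
    // These inputs are exact binary fractions, so the sum is exact.
    assert(sumWide([0.5f, 0.25f, 0.125f]) == 0.875L);
}
---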

<h3>Floating Point Constant Folding</h3>

$(P Regardless of the type of the operands, floating point
constant folding is done in $(B real) or greater precision.
It always follows IEEE 754 rules and uses round-to-nearest.)

$(P Floating point constants are internally represented in
the implementation in at least $(B real) precision, regardless
of the constant's type. The extra precision is available for
constant folding. Committing to the precision of the result is
done as late as possible in the compilation process. For example:)

---
const float f = 0.2f;
writefln(f - 0.2);
---
$(P will print 0. A non-const static variable's value cannot be
propagated at compile time, so:)

---
static float f = 0.2f;
writefln(f - 0.2);
---
$(P will print 2.98023e-09. Hex floating point constants can also
be used when specific floating point bit patterns are needed that
are unaffected by rounding. To find the hex value of 0.2f:)

---
import std.stdio;

void main()
{
    writefln("%a", 0.2f);
}
---
$(P which is 0x1.99999ap-3. Using the hex constant:)

---
const float f = 0x1.99999ap-3f;
writefln(f - 0.2);
---

$(P prints 2.98023e-09.)
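
$(P For reference, a hex floating point literal spells out the mantissa in
hexadecimal and the exponent as a power of two, so its value does not depend on
decimal rounding. A small sketch using values that are exactly representable
follows; the constant names are illustrative only.
)

---
enum double three  = 0x1.8p1;  // 1.5 * 2^1
enum double eighth = 0x1p-3;   // 1.0 * 2^-3

// Both values are exact in binary, so these hold at any precision.
static assert(three == 3.0);
static assert(eighth == 0.125);
---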

$(P Different compiler settings, optimization settings,
and inlining settings can affect opportunities for constant
folding; therefore, the results of floating point calculations may differ
depending on those settings.)

<h3>Complex and Imaginary types</h3>

$(P In existing languages, there is an astonishing amount of effort expended in trying to jam a
complex type onto existing type definition facilities: templates, structs, operator
overloading, etc., and it usually fails in the end. It fails because the semantics of
complex operations can be subtle, and it fails because the compiler doesn't know what the
programmer is trying to do, and so cannot optimize the semantic implementation.
)

$(P This is all done to avoid adding a new type. Adding a new type means that the compiler
can make all the semantics of complex work "right". The programmer then can rely on a
correct (or at least fixable) implementation of complex.
)

$(P Coming with the baggage of a complex type is the need for an imaginary type. An
imaginary type eliminates some subtle semantic issues, and improves performance by not
having to perform extra operations on the implied 0 real part.
)

$(P Imaginary literals have an i suffix:
)

------
ireal j = 1.3i;
------

$(P There is no particular complex literal syntax; just add a real number and
an imaginary number:
)

------
cdouble cd = 3.6 + 4i;
creal c = 4.5 + 2i;
------

$(P Complex, real and imaginary numbers have two properties:
)

<pre>
.re     get real part (0 for imaginary numbers)
.im     get imaginary part as a real (0 for real numbers)
</pre>

$(P For example:
)

<pre>
cd.re   is 3.6 double
cd.im   is 4 double
c.re    is 4.5 real
c.im    is 2 real
j.im    is 1.3 real
j.re    is 0 real
</pre>

<h3>Rounding Control</h3>

$(P IEEE 754 floating point arithmetic includes the ability to set four
different rounding modes.
These are accessible via the functions in std.c.fenv.
)

$(V2
$(P If the floating-point rounding mode is changed within a function,
it must be restored before the function exits. If this rule is violated
(for example, by the use of inline asm), the rounding mode used for
subsequent calculations is undefined.
)
)
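
$(P One way to follow the restore rule is a scope guard. The sketch below
assumes a current D runtime, where the C bindings formerly in std.c.fenv are
available as core.stdc.fenv; the helper name divideRoundingUp is hypothetical.
)

---
import core.stdc.fenv;

/// Divides a by b with the rounding mode set towards +infinity,
/// restoring the caller's rounding mode before returning.
double divideRoundingUp(double a, double b)
{
    immutable saved = fegetround();
    scope (exit) fesetround(saved); // runs on every exit path
    fesetround(FE_UPWARD);
    return a / b;
}

unittest
{
    immutable before = fegetround();
    auto q = divideRoundingUp(1.0, 3.0);
    assert(fegetround() == before); // the mode was restored
}
---

$(P The guard is registered before the mode is changed, so the saved mode
is restored even if a later statement throws.
)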

<h3>Exception Flags</h3>

$(P IEEE 754 floating point arithmetic can set several flags based on what
happened with a
computation:)

$(TABLE
$(TR $(TD FE_INVALID))
$(TR $(TD FE_DENORMAL))
$(TR $(TD FE_DIVBYZERO))
$(TR $(TD FE_OVERFLOW))
$(TR $(TD FE_UNDERFLOW))
$(TR $(TD FE_INEXACT))
)

$(P These flags can be set/reset via the functions in
$(LINK2 phobos/std_c_fenv.html, std.c.fenv).)
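
$(P As a hedged sketch of setting, testing and resetting a flag, the following
uses the C99 functions feclearexcept, feraiseexcept and fetestexcept. It assumes
a current D runtime, where these bindings live in core.stdc.fenv, and that
floating point traps are left at their default masked state. The function
name is hypothetical.
)

---
import core.stdc.fenv;

/// Clears all flags, raises FE_DIVBYZERO explicitly, and reports
/// whether the flag is then visible.
bool raiseAndTestDivByZero()
{
    feclearexcept(FE_ALL_EXCEPT);
    feraiseexcept(FE_DIVBYZERO);
    immutable raised = fetestexcept(FE_DIVBYZERO) != 0;
    feclearexcept(FE_ALL_EXCEPT); // leave the flags clean for the caller
    return raised;
}

unittest
{
    assert(raiseAndTestDivByZero());
    assert(fetestexcept(FE_DIVBYZERO) == 0); // the flag was reset afterwards
}
---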

<h3>Floating Point Comparisons</h3>

$(P In addition to the usual < <= > >= == != comparison
operators, D adds more that are
specific to floating point. These are
!<>=
<>
<>=
!<=
!<
!>=
!>
!<>
and match the semantics for the
NCEG extensions to C.
See $(LINK2 expression.html#floating_point_comparisons, Floating point comparisons).
)
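
$(P What distinguishes these operators is how they treat NaN operands.
As a sketch, three of them are written below with ordinary comparisons and
std.math.isNaN, which may be useful where a compiler does not accept the NCEG
operator tokens. The function names are hypothetical, and the sketch does not
model which operators raise FE_INVALID.
)

---
import std.math : isNaN;

/// x !<>= y : true if either operand is NaN, i.e. the pair is unordered.
bool unordered(double x, double y) { return isNaN(x) || isNaN(y); }

/// x <> y : true if x is less than or greater than y.
bool lessOrGreater(double x, double y) { return x < y || x > y; }

/// x !<> y : true if the operands are unordered or equal.
bool unorderedOrEqual(double x, double y) { return !(x < y || x > y); }

unittest
{
    assert(unordered(double.nan, 1.0));
    assert(!unordered(1.0, 2.0));
    assert(lessOrGreater(1.0, 2.0));
    assert(!lessOrGreater(double.nan, 2.0));
    assert(unorderedOrEqual(double.nan, 2.0));
    assert(unorderedOrEqual(2.0, 2.0));
}
---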

<h3><a name="floating-point-transformations">Floating Point Transformations</a></h3>

$(P An implementation may perform transformations on
floating point computations in order to reduce their strength,
i.e. their runtime computation time.
Because floating point math does not precisely follow mathematical
rules, some transformations are not valid, even though some
other programming languages still allow them.
)

$(P The following transformations of floating point expressions
are not allowed because under IEEE rules they could produce
different results.
)

$(TABLE1
<caption>Disallowed Floating Point Transformations</caption>
$(TR
$(TH transformation) $(TH comments)
)
$(TR
$(TD $(I x) + 0 → $(I x)) $(TD not valid if $(I x) is -0)
)
$(TR
$(TD $(I x) - 0 → $(I x)) $(TD not valid if $(I x) is ±0 and rounding is towards -∞)
)
$(TR
$(TD -$(I x) ↔ 0 - $(I x)) $(TD not valid if $(I x) is +0)
)
$(TR
$(TD $(I x) - $(I x) → 0) $(TD not valid if $(I x) is NaN or ±∞)
)
$(TR
$(TD $(I x) - $(I y) ↔ -($(I y) - $(I x))) $(TD not valid because (1-1=+0) whereas -(1-1)=-0)
)
$(TR
$(TD $(I x) * 0 → 0) $(TD not valid if $(I x) is NaN or ±∞)
)
$(COMMENT
$(TR
$(TD $(I x) * 1 → $(I x)) $(TD not valid if $(I x) is a signaling NaN)
)
)
$(TR
$(TD $(I x) / $(I c) ↔ $(I x) * (1/$(I c))) $(TD valid only if (1/$(I c)) yields an exact result)
)
$(TR
$(TD $(I x) != $(I x) → false) $(TD not valid if $(I x) is a NaN)
)
$(TR
$(TD $(I x) == $(I x) → true) $(TD not valid if $(I x) is a NaN)
)
$(TR
$(TD $(I x) !$(I op) $(I y) ↔ !($(I x) $(I op) $(I y))) $(TD not valid if $(I x) or $(I y) is a NaN)
)
)
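
$(P A few of the table entries can be checked directly. The sketch below,
which assumes the default round-to-nearest mode and uses hypothetical helper
names, shows inputs for which the rewritten form would give a different answer.
)

---
import std.math : isNaN, signbit;

double addZero(double x)  { return x + 0.0; }
double subSelf(double x)  { return x - x; }
double mulZero(double x)  { return x * 0.0; }
bool equalsSelf(double x) { return x == x; }

unittest
{
    assert(signbit(-0.0) != 0);               // -0 has its sign bit set
    assert(signbit(addZero(-0.0)) == 0);      // -0 + 0 is +0, so x + 0 is not x
    assert(isNaN(subSelf(double.infinity)));  // infinity - infinity is NaN, not 0
    assert(isNaN(mulZero(double.infinity)));  // infinity * 0 is NaN, not 0
    assert(!equalsSelf(double.nan));          // NaN == NaN is false
}
---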

$(P Of course, transformations that would alter side effects are also
invalid.)

)

Macros:
	TITLE=Floating Point
	WIKI=Float