Changeset 166
- Timestamp: 12/21/07 03:17:57 (8 months ago)
- Files:
- trunk/blade/BladeDemo.d (modified) (2 diffs)
- trunk/blade/BladeSimplify.d (modified) (1 diff)
- trunk/blade/CodegenX86.d (modified) (8 diffs)
- trunk/blade/PostfixX86.d (modified) (1 diff)
trunk/blade/BladeDemo.d
r165 r166
  38   38
  39   39    mixin(vectorize("q+= q*2.01"));
  40       - mixin(vectorize(" r-=another[0]"));
       40  + mixin(vectorize("another[0]+=r-=another[0]+another[0]"));
  41   41
  42   42    // All of the next four are equivalent
  43   43    mixin(vectorize("a+=6*another[1,0..$]"));
  44       - /*
       44  + mixin(vectorize("a+=6*(another[1,0..$]+another[1,0..$])"));
       45  +
  45   46    mixin(vectorize("a+=6*another[1][0..$]"));
  46   47    mixin(vectorize("a+=6*another[1]"));
   …    …
  48   49    // I don't think I'll support this syntax long-term.
  49   50    mixin(vectorize("a+=6*another[1,[0,$]]"));
  50       - */
       51  +
  51   52    // Strided vector
  52   53    mixin(vectorize("another[0..$,1]=6*a[0..2]"));
trunk/blade/BladeSimplify.d
r165 r166
  63   63    else {
  64   64    char [] expr2 = removeDuplicates(tree);
  65       - // Check for rank errors
       65  + // Check for rank errors
  66   66    int wholerank = exprRank(expr2, ranks);
  67   67    if (wholerank<0)
trunk/blade/CodegenX86.d
r165 r166
  15   15    * FEATURES:
  16   16    * - Supports any mix of vector addition, subtraction, dot product, and multiplication
  17       - * by a scalar .
       17  + * by a scalar, with strided vector access.
  18   18    * - Generates either x87, SSE, or SSE2 asm code.
  19   19    * - If x87 code is generated, 80-bit precision is used whenever possible.
   …    …
  23   23    * None of these support dot product, or matrix operations.
  24   24    * X87:
  25       - * The x87 code generation targets early Pentiums, which are now irrelevant.
  26       - * It needs to be updated to PM/Core2 (this will significantly simplify it).
  27   25    * - Not optimal for the case of multiple real vectors (they could share a counter).
  28   26    * - Not optimal for the case where all vectors are 80-bit (two counters are used, but only is required).
   …    …
  49   47    private:
  50   48
  51       - // num chars before we get a comma.
  52       - int paramLength(char [] s)
  53       - {
  54       - for (int i=0; i<s.length; ++i) {
  55       - if (s[i]==',') return i;
  56       - }
  57       - assert(0);
  58       - }
  59       -
  60   49    // --------------
  61   50    // Ranklist functions
   …    …
 278  267    multiple accumulators, in order to break dependency chains.
 279  268
 280       - The key optimisation rules are:
      269  + The key optimisation rules for DAXPY loops are:
 281  270    1. keep the loop overhead to one clock cycle if possible.
 282  271    2. (FMUL latency) don't use the result of a multiply immediately
 283       - 3. (FST latency) don't save a value to memory immediately after it's calculated.
 284       - 4. (AGI stall) don't use the counter variable immediately after it's modified.
 285  272    Techniques to address these are:
 286  273    1. Use EAX as a counter and index variable, which begins negative and counts UP to zero.
      274  + Combine counters for all packed doubles and floats into this single counter.
 287  275    2. The latency of fmul is avoided by swapping fadd/fsub with fmul whenever possible.
 288       - 3. The latency of fstp is avoided by calculating a result in one iteration,
 289       - but not storing it to memory until the subsequent iteration.
 290       - 4. (NOT YET IMPLEMENTED): first operation in the loop should be loading a scalar (for a multiply),
 291       - if possible, otherwise load an 80-bit vector, if possible.
 292  276
 293  277    The generated code is of the form:
   …    …
 295  279    load scalars onto FPU stack
 296  280    load vector pointers into EAX, EBX, ...
 297       - calculate result[0] into ST(0)
 298       - goto L2
 299  281    L1:
 300       - calculate result[i+1] into ST(0)
 301       - swap so that result[i] is in ST(0)
 302       - L2:
 303       - store result[i]
 304       - increment pointers, goto L1 if i<n-1
 305       - store result[n-1]
      282  + calculate result into ST(0)
      283  + increment pointers
      284  + goto L1 if not done
 306  285    pop scalars off FPU stack
 307  286    ----
   …    …
 397  376    // the final storage instruction, because of the FST latency).
 398  377    char [] mainbody = "";
 399       - // char [] firstbody = "";
 400       - // char [] storage = "";
 401  378
 402  379    // We need to keep track of how many things are on the FPU stack.
   …    …
 423  400    ++done;
 424  401    numOnStack++;
      402  + } else if (operations[done]==',') {
      403  + mainbody ~= " " ~ opToX87[operations[done+1]] ~ " ST, ST(0); // dup " ~ operations[done+1] ~ \n;
      404  + done+=2;
 425  405    } else if (ranklist[operations[done]-'A']=='1') {
 426  406    // An operation will be performed between the stack top and a vector.
   …    …
 429  409    char [] comment = "; // " ~ operations[done..done+2] ~ \n;
 430  410    if (operations[done+1]=='=') {
 431       - next = " fstp " ~ indexedVector(typelist, ranklist, stridelist, operations[$-2] ) ~ comment;
      411  + // If it's the last operation, pop it from the stack; otherwise,
      412  + // it chains.
      413  + next = ((done+2 == operations.length)? " fstp " : " fst ")
      414  + ~ indexedVector(typelist, ranklist, stridelist, operations[$-2] ) ~ comment;
 432  415    } else if (typelist[operations[done]-'A']=="real") {
 433  416    // 80-bit vectors must be loaded onto the FPU stack first
trunk/blade/PostfixX86.d
r159 r166
  65   65    return second ~ first ~ "=";
  66   66    }
       67  + if (second == first) return first ~ "," ~ op;
  68   68
  69   69    // x87 OPTIMISATION #1
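The PostfixX86.d and CodegenX86.d changes appear to work together. When both operands of an operator are identical, the postfix string carries the operand once, followed by "," and the operator; the x87 generator then answers a "," by combining the stack top with itself. The D sketch below illustrates that idea only. The names combinePostfix and dupInstruction, the fallback operand order, and the '+'/'*' to fadd/fmul mapping are assumptions made for illustration and are not taken from the repository.

```d
import std.stdio;

// Hypothetical helper (not from the changeset): join two postfix operand
// strings with a binary operator. Identical operands are encoded once,
// followed by ',' and the operator, so a later stage can reuse the value
// already on the FPU stack instead of loading it twice.
string combinePostfix(string first, string second, char op)
{
    if (second == first)
        return first ~ "," ~ op;
    // Assumed fallback order; the real function applies further rules.
    return first ~ second ~ op;
}

// Hypothetical helper: the instruction emitted for ",op", which combines
// ST(0) with itself. Only '+' and '*' are covered in this sketch.
string dupInstruction(char op)
{
    switch (op) {
        case '+': return "fadd ST, ST(0);";
        case '*': return "fmul ST, ST(0);";
        default:  assert(0, "operator not covered by this sketch");
    }
}

void main()
{
    writeln(combinePostfix("A", "A", '+')); // A,+
    writeln(combinePostfix("A", "B", '+')); // AB+
    writeln(dupInstruction('+'));           // fadd ST, ST(0);
}
```

Under these assumptions, an expression such as another[0]+another[0] from BladeDemo.d would load the vector element once and double it with a single fadd, rather than reading memory twice.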
