Download Reference Manual
The Developer's Library for D
About Wiki Forums Source Search Contact

Ticket #1831 (closed defect: fixed)

Opened 15 years ago

Last modified 15 years ago

Buggy runtime functions for float array operations (with patch)

Reported by: eriatarka Assigned to: h3r3tic
Priority: major Milestone: 0.99.9
Component: Tango Version: 0.99.8 Sean
Keywords: Cc:

Description

The following code fails with the current runtime unless the array happens to be aligned to a paragraph:

void main()
{
    float[64] array;
    int i = 42;

    auto slice = array[];
    slice[] *= 2f;

    assert(i == 42);
}

The "slice[] *= 2f" runtime function _arrayExpSliceMulass_f has a bug which causes it to overwrite memory following the array. The functions for +=, -= and /= suffer from the same problem.

I am attaching a patch which fixes these problems. In the process of creating it, I took the liberty of using string mixins to reduce much of the code duplication in arrayfloat.d, bringing it down from 2300 to 1400 lines.

All the unit tests still seem to run, so I'm fairly confident I at least haven't made the situation worse.

Attachments

tango-arrayfloat-rewrite.patch (59.3 kB) - added by eriatarka on 01/08/10 21:08:44.
patch for arrayfloat.d

Change History

01/08/10 21:08:44 changed by eriatarka

  • attachment tango-arrayfloat-rewrite.patch added.

patch for arrayfloat.d

01/10/10 11:45:43 changed by larsivi

Although I understand the temptation, it is a bit unfortunate that you refactored at the same time as you fixed the bugs - this makes it difficult to review the actual bug fix.

01/10/10 14:17:22 changed by eriatarka

Ah yes, I can see where you're coming from. Sorry about that. It's just the organizational overhead of getting a patch accepted which led me to do that. Here is the original code as arrayfloat.d stands now:

518 	            // align pointer
519 	            auto n = cast(T*)((cast(uint)aptr + 15) & ~15);
520 	            while (aptr < n)
521 	                *aptr++ += value;
522 	            n = cast(T*)((cast(uint)aend) & ~15);
523 	            if (aptr < n)
524 	
525 	            // Aligned case
526 	            asm
527 	            {
528 	                mov ESI, aptr;
529 	                mov EDI, n;
530 	                movss XMM4, value;
531 	                shufps XMM4, XMM4, 0;
532 	
533 	                align 8;
534 	            startsseloopa:
535 	                movaps XMM0, [ESI];
536 	                movaps XMM1, [ESI+16];
537 	                movaps XMM2, [ESI+32];
538 	                movaps XMM3, [ESI+48];
539 	                add ESI, 64;
540 	                addps XMM0, XMM4;
541 	                addps XMM1, XMM4;
542 	                addps XMM2, XMM4;
543 	                addps XMM3, XMM4;
544 	                movaps [ESI+ 0-64], XMM0;
545 	                movaps [ESI+16-64], XMM1;
546 	                movaps [ESI+32-64], XMM2;
547 	                movaps [ESI+48-64], XMM3;
548 	                cmp ESI, EDI;
549 	                jb startsseloopa;
550 	
551 	                mov aptr, ESI;
552 	            }

Here is my fixed version:

            auto aabeg = cast(T*)((cast(uint)aptr + 15) & ~15); // beginning of paragraph-aligned slice of a
            auto aaend = cast(T*)((cast(uint)aend) & ~15);      // end of paragraph-aligned slice of a

            int numAligned = cast(int)(aaend - aabeg);          // how many floats are in the aligned slice?

            // are there at least 16 floats in the paragraph-aligned slice?
            // otherwise we can't do anything with SSE.
            if (numAligned >= 16)
            {
                aaend = aabeg + (numAligned & ~15);     // make sure the slice is actually a multiple of 16 floats long

                // process values up to aligned slice one by one
                while (aptr < aabeg)
                    *aptr++ += value;

                // process aligned slice with fast SSE operations
                asm
                {
                    mov ESI, aabeg;
                    mov EDI, aaend;
                    movss XMM4, value;
                    shufps XMM4, XMM4, 0;

                    align 8;
                startsseloopa:
                    movaps XMM0, [ESI];
                    movaps XMM1, [ESI+16];
                    movaps XMM2, [ESI+32];
                    movaps XMM3, [ESI+48];
                    add ESI, 64;
                    addps XMM0, XMM4;
                    addps XMM1, XMM4;
                    addps XMM2, XMM4;
                    addps XMM3, XMM4;
                    movaps [ESI+ 0-64], XMM0;
                    movaps [ESI+16-64], XMM1;
                    movaps [ESI+32-64], XMM2;
                    movaps [ESI+48-64], XMM3;
                    cmp ESI, EDI;
                    jb startsseloopa;
                }
                aptr = aaend;
            }

The original code is subtly wrong in that, even if n points only 16 bytes after aptr, still the full SSE loop processing 16 floats, i.e. 64 bytes, is executed.

Hope this helps.

01/12/10 04:12:39 changed by kris

  • owner changed from kris to larsivi.

would you mind owning this one, larsivi?

01/12/10 18:22:50 changed by larsivi

  • owner changed from larsivi to h3r3tic.

I've asked Tom to check it out :)

01/16/10 12:03:28 changed by h3r3tic

Aw, who do I bounce this ticket off to now? ;)

Looks good to me. Unless there are any objections, I can fold it in tomorrow (going to sleep now :P). I was concerned that the usage of string mixins would cause DMD to emit string literals into the object file's data sections. It not only doesn't do that, the object file after this patch is actually smaller due to the reuse of _arrayExpSliceMulass_f in _arrayExpSliceDivass_f. Overall it's a nice patch, thanks :)

<rant>

By the way, I was curious why the object file for the original code contained a reference to the '_memset32' symbol. Turns out that DMD is quite bad at codegen even with -release -inline -O. The following code:

        T[2] w;
        w[0] = w[1] = value;

was being turned into this madness of asm (the commented out lines are parts of surrounding statements' codegen):

        push    2                                       ; 00BD _ 6A, 02
//      lea     eax, [ecx+eax*4]                        ; 00BF _ 8D. 04 81
//      mov     dword ptr [ebp-10H], eax                ; 00C2 _ 89. 45, F0
        lea     eax, [ebp-8H]                           ; 00C5 _ 8D. 45, F8
        push    dword ptr [?_001]                       ; 00C8 _ FF. 35, 00000000(segrel)
        push    eax                                     ; 00CE _ 50
        call    __memset32                              ; 00CF _ E8, 00000000(rel)
        fld     dword ptr [ebp+10H]                     ; 00D4 _ D9. 45, 10
        fst     dword ptr [ebp-4H]                      ; 00D7 _ D9. 55, FC
        fstp    dword ptr [ebp-8H]                      ; 00DA _ D9. 5D, F8
        add     esp, 12                                 ; 00DD _ 83. C4, 0C

The offending code was in _arraySliceExpMinSliceAssign_f, but you changed it to

        ulong w = *cast(uint *) &value;
        ulong v = w | (w << 32L);

which, although not using the memset32 call, is quite insane on its own

//      lea     eax, [ecx+eax*4]                        ; 00BD _ 8D. 04 81
        xor     ecx, ecx                                ; 00C0 _ 31. C9
//      mov     dword ptr [ebp-58H], eax                ; 00C2 _ 89. 45, A8
        mov     eax, dword ptr [ebp+10H]                ; 00C5 _ 8B. 45, 10
        mov     dword ptr [ebp-4CH], ecx                ; 00C8 _ 89. 4D, B4
        mov     edx, dword ptr [ebp-4CH]                ; 00CB _ 8B. 55, B4
        mov     dword ptr [ebp-50H], eax                ; 00CE _ 89. 45, B0
        mov     eax, dword ptr [ebp-50H]                ; 00D1 _ 8B. 45, B0
        mov     edx, eax                                ; 00D4 _ 89. C2
        or      edx, dword ptr [ebp-4CH]                ; 00D6 _ 0B. 55, B4
        xor     eax, eax                                ; 00D9 _ 31. C0
        or      eax, dword ptr [ebp-50H]                ; 00DB _ 0B. 45, B0
        mov     dword ptr [ebp-48H], eax                ; 00DE _ 89. 45, B8
        mov     dword ptr [ebp-44H], edx                ; 00E1 _ 89. 55, BC

Looks like DMD is also bad at longs :P

Using something like this:

        ulong v = void;
        (cast(uint*)&v)[0] = (cast(uint*)&v)[1] = *cast(uint*)&value;

Yields much saner asm:

        mov     eax, dword ptr [ebp+10H]                ; 00BB _ 8B. 45, 10
        mov     dword ptr [ebp-3CH], eax                ; 00BE _ 89. 45, C4
        mov     dword ptr [ebp-40H], eax                ; 00C1 _ 89. 45, C0

</rant>

01/18/10 02:12:38 changed by h3r3tic

  • status changed from new to closed.
  • resolution set to fixed.

(In [5301]) eriatarka fixes #1831 - thanks!

01/18/10 09:47:06 changed by eriatarka

Thanks for folding it in!

Those codegen issue are rather nasty. Maybe it would be worth filing a ticket to DMD about this, just to keep track of such things.

This changed code with the ulongs isn't by me BTW, there were just various ways to do this floating around in the source file, and I had to choose one when unifying. Good catch about the saner version.