Download Reference Manual
The Developer's Library for D
About Wiki Forums Source Search Contact

Ticket #571 (closed defect: fixed)

Opened 7 years ago

Last modified 7 years ago

Tango sort has poor performance, may even crash on specially crafted data

Reported by: Deewiant Assigned to: sean
Priority: major Milestone: 0.99.5
Component: Core Functionality Version: 0.99 RC3 Xammy
Keywords: Cc:

Description

I've been benchmarking various sorting algorithms against the built-in sort property and tango.core.Array.sort. Benchmark output follows:

Using arrays of lengths [15,1000,100000,5000000] containing numbers in the range -1000-1000.

Note that odd lengths will be increased by one for the 'median-of-3 killer' sequences used to test quicksort. This means that if a length is odd, one of the arrays will actually have a length one greater than the rest.

Generating 16 arrays................ done.

Each of the 5 algorithms will run 5 times over 4 arrays of each length, for a total of 80 passes for each algorithm, for a total of 400 passes globally.

Heap sort.... .... .... .... ....
     15:     0     0     0 |     0     0     0 |     0     0     0 |     0     0     0
   1000:     1     1     1 |     1     1     1 |     1     1     1 |     1     1     1
 100000:   277   272   285 |   241   232   259 |   232   227   238 |   255   246   269
5000000: 38341 38303 38393 | 17230 17203 17245 | 15873 15853 15891 | 19273 19230 19372

Built-in sort (UNFAIR: DOESN'T CALL FUNCTION FOR COMPARISON).... .... .... .... ....
     15:     0     0     0 |     0     0     0 |     0     0     0 |     0     0     0
   1000:     1     1     1 |     1     1     1 |     0     0     0 |     1     1     1
 100000:   255   251   268 |   160   159   161 |   154   154   155 |   347   333   358
5000000: 16492 16439 16552 | 11571 11526 11666 | 11280 11224 11376 | 27067 26990 27180

Tango sort.... .... .... .... ....
     15:     0     0     0 |        0     0     0 |        0     0     0 | XXX
   1000:     1     1     1 |       20    20    20 |       24    24    25 | XXX
 100000:   194   179   233 |     6372  6347  6384 |     6761  6719  6800 | XXX
5000000: 11201  9917 12065 | 330943 330626 331187 | 351592 351474 351658 | XXX

Introspective sort.... .... .... .... ....
     15:     0     0     0 |     0     0     0 |     0     0     0 |     0     0     0
   1000:     1     1     1 |     0     0     0 |     0     0     0 |     1     1     1
 100000:   190   181   209 |    81    72   115 |    84    70   113 |   244   197   294
5000000: 10088  9916 10251 |  6261  6044  6572 |  5937  5625  6313 | 16955 16393 17550

Merge sort.... .... .... .... ....
     15:     0     0     0 |     0     0     0 |     0     0     0 |     0     0     0
   1000:     1     1     1 |     2     2     2 |     0     0     0 |     0     0     0
 100000:   256   245   281 |   158   151   172 |   113   106   128 |   133   132   136
5000000: 18696 18442 18771 |  9050  9014  9087 |  9348  9331  9358 | 11141 11028 11423

The numbers on the left represent array lengths. For each length, there are four sets of data. In each set, the first number is the average, the second the minimum, and the third the maximum time taken, measured using StopWatch?.

The first set of data is for randomized data in the shown range, generated with Random. The second set is for a reverse-sorted array. The third set is for a sorted array. The fourth set are "median-of-3 killer" sequences. Tango's sort actually crashes due to stack overflow for the bigger arrays here, and hence I've manually corrected the data above to say "XXX".

For information on introspective sort and the "median-of-3 killers" see this well-written paper.

I suggest that Tango switch to introsort for its core sorting algorithm, as it has the best general-case performance, being on par with the current sort with the randomized data and besting it by a large margin with the other data sets. Note that though merge sort is faster for the median-of-3 killers, it uses O(n) extra storage space.

I also suggest that, to improve performance in the general case, a templated version of sort which does not take a predicate be written. In my benchmarking I've noticed that run times can even halve when "cmp(a, b)" is rewritten to "a < b". It would be simple to, for instance, have sort take as a template argument a boolean which dictates whether to sort in ascending or descending order. (Note that Tango's current sort as well as introsort outperform the built-in sort, even though they prevent the compiler from inlining comparisons. Allowing inlining would give them a far greater advantage.)

For now I'm only attaching the benchmark code, which includes all of the sort algorithms mentioned above, as well as some others. If asked, I can write versions that are more ready to go in Tango, templated as mentioned above.

Attachments

sort.d (13.9 kB) - added by Deewiant on 08/14/07 19:51:06.
Sort benchmarking module
perf_sort.d (3.4 kB) - added by sean on 01/10/08 02:13:40.

Change History

08/14/07 19:51:06 changed by Deewiant

  • attachment sort.d added.

Sort benchmarking module

08/15/07 09:55:56 changed by Deewiant

Here are timings for an introsort with inlined comparisons:

     15:     0     0     0 |     0     0     0 |     0     0     0 |     0     0     0
   1000:     1     1     1 |     0     0     0 |     0     0     0 |     0     0     0
 100000:   136   129   146 |    34    32    43 |    40    30    62 |   121   117   133
5000000:  6655  6632  6695 |  2924  2903  2964 |  2763  2744  2785 | 10970 10951 10991

Over twice as fast for the sorted and reverse-sorted data, and approximately a 35% improvement for the random data and the median-of-3 killers.

08/15/07 16:19:00 changed by sean

  • status changed from new to assigned.

Thanks for running these tests. I'll read up on introsort a bit.

For what it's worth, the default predicate is a struct, which inlines the comparison in most of the algorithms in Array. Not all of them, though, and I'm not entirely sure why. I'll have to check the sort routine and see whether the comparison is inlined (you suggest it might not be). In my original tests, making opCall static caused it to be inlined in some functions and not in others, and the same for making it non-static. Because of this, I almost wonder if there is something slightly broken with the inlining mechanism.

The tests I ran (sorting all words in the King James Bible for a word count app) had the Tango sort run at the same speed as the built-in sorting routine, and degrading to an insertion sort for small lists had no effect on performance so I left that bit out to save complexity. I suspect that adding it will help with the poorly performing cases, but likely not enough to bring it in line with introsort. I may give it a try anyway, however. I do think it is important for a sorting routine's performance not to degrade when the cost of performing a comparison is high, so I'll likely run introsort through the paces with the word count test as well.

08/15/07 16:43:04 changed by Deewiant

I ran obj2asm on tango-core-Array.obj compiled with -O -inline -release, and there were 4 "call" instructions to the IsLess? struct's opCall. At first, I actually did think it would be inlined, exactly because it is a struct, but I checked to be sure, and nope, it isn't.

It actually seemed to be a little faster to provide a function pointer which just does "a < b". (At least for ints, that is.)

I'd say the one biggest problem with the current sort, regardless of the very poor performance in some cases, is that crafted "median-of-3 killer" data can cause the sort to crash the whole app due to stack overflow. This is a security issue for programs which need to sort data which has been input by the user.

08/16/07 20:46:39 changed by sean

Definitely. I suspect that adding an insertion sort for small lists will correct this case, but that's just a guess. I'm still not sure I like the amount of stack space the current method requires if a stack overflow is even possible.

01/10/08 02:13:40 changed by sean

  • attachment perf_sort.d added.

01/10/08 02:25:34 changed by sean

I've gone ahead and applied the logic of introSort to the current Tango sort routine. This proved to be a bit more effective than just using introSort as-is, because the introSort implementation provided actually performs worse than the current Tango sort routine on certain data. This is either data that is expensive to compare or data containing a large number of duplicates or both--I didn't test thoroughly enough to be sure. Testing for this was done using the attached perf_sort program, which is basically just a modified version of Walter's word count program. I used multiple instances of the King James Bible (a 5mb text file) as the test set. Here are the results for 1-5 instances of the file with the built-in sort routine, the new Tango routine, and introSort:

>perf_sort a.txt
Tango: 0.767433
introSort: 0.973873
Phobos: 0.595833

>perf_sort a.txt b.txt
Tango: 1.712127
introSort: 2.283815
Phobos: 1.392051

>perf_sort a.txt b.txt c.txt
Tango: 2.608806
introSort: 3.337712
Phobos: 2.291393

>perf_sort a.txt b.txt c.txt d.txt
Tango: 3.633293
introSort: 5.164842
Phobos: 3.213683

>perf_sort a.txt b.txt c.txt d.txt e.txt
Tango: 4.452213
introSort: 6.124935
Phobos: 4.167550

What I think is important about this is that introSort scales worse than the others as the data size increases, and I worried what would happen when sorting truly massive arrays. Here are the results of these same three routines using the attached sort program:

Built-in sort.... .... .... .... ....
     15:     0     0     0 |     0     0     0 |     0     0     0 |     0     0     0
   1000:     1     1     1 |     0     0     0 |     0     0     0 |     1     1     1
 100000:   217   214   221 |   114   114   115 |   111   111   112 |   238   233   249
5000000: 13168 13151 13193 |  7537  7526  7550 |  7275  7268  7280 | 18210 18150 18375

Tango sort.... .... .... .... ....
     15:     0     0     0 |     0     0     0 |     0     0     0 |     0     0     0
   1000:     1     1     1 |     2     2     2 |     0     0     0 |     1     1     1
 100000:   205   202   208 |    93    92    96 |    62    53    69 |   243   232   256
5000000:  9241  9191  9394 |  4669  4658  4689 |  2838  2757  2912 | 14597 14537 14662

Introspective sort.... .... .... .... ....
     15:     0     0     0 |     0     0     0 |     0     0     0 |     0     0     0
   1000:     1     1     1 |     0     0     0 |     0     0     0 |     0     0     0
 100000:   167   159   188 |    51    50    51 |    51    50    54 |   154   152   158
5000000:  7969  7948  8013 |  3778  3772  3795 |  3795  3785  3805 | 11840 11823 11860

As you can see, the Tango routine now performs on par with introSort in each test, and actually beats it in one of the four. Some tuning could probably speed things up even more (looping instead of using recursion for both branches, like introSort does, for example), but I like that the changes from introSort are purely additive so I don't have to re-test changes to existing code. The commit is forthcoming.

01/10/08 02:29:17 changed by sean

  • status changed from assigned to closed.
  • resolution set to fixed.

(In [3081]) Added some heuristics to the sort routine to address performance issues. It now matches introSort for the "median-of-3 killer" test and matches the built-in sort for the word count test, each of which were previously the respective winners in those tests. This closes #571.

01/10/08 07:26:49 changed by kris

why is the Tango sort slower that the phobos sort?

01/22/08 16:25:53 changed by sean

Interestingly, the built-in sort routine (what you call the phobos sort) uses the same basic algorithm as the Tango sort: a three-element variant of quicksort which falls back on insertion sort for small arrays. It also lacks some of the new optimizations from introsort, which are the reason it's significantly slower than the Tango sort on certain data sets. However, the built-in sort doesn't use recursion, but instead manages its "stack" manually, which I think is the reason it's so fast on average. If you look at the perf_sort test above, you can see that the Tango sort has a small fixed overhead above the built-in sort, and I think that's the cost of pushing full stack frames for function calls instead of this manual stuff. So I suppose we could further tune the Tango sort to do the same thing... it would just mean a lot of messy changes. If you're interested, the built-in sort is defined here:

http://www.dsource.org/projects/tango/browser/trunk/lib/compiler/dmd/qsort.d

01/22/08 23:16:48 changed by larsivi

  • milestone changed from 1.0 to 0.99.5.

01/24/08 02:36:01 changed by sean

(In [3119]) Added an optimization to insertionSort, a part of the Tango sort routine. This refs #571