Re-engineering a qsort function (part 5)
Comparison with some other sorts
musl Smoothsort implementation
I mentioned in part 2 that the musl version of qsort
is an implementation of E.W. Dijkstra’s smoothsort. Dijkstra say “It is of order N log N in the worst case, but of order N in the best case, with a smooth transition between the two. (Hence its name.)” It is a fairly complex algorithm; I have not tried to understand it yet. I commend Valentin Ochs for implementing it, but I find it to be too slow to be preferred over other versions. Here are my results:
Tot.rank Swaps Compares Time Ratio Implementation
1. 318 70596435 344791050 5.151 s 1.000 qs22j
2. 320 70596435 344791050 5.210 s 1.012 qs22k
3. 424 77901675 366759345 5.431 s 1.054 qs22ck
4. 458 82285275 371270505 5.480 s 1.064 qs22bk
5. 556 86727810 408541395 6.085 s 1.181 qs22c
6. 577 90954705 413041065 6.120 s 1.188 qs22b
7. 599 80264325 405888780 6.094 s 1.183 qs22i
8. 660 79030245 412410825 6.209 s 1.205 qs22h
9. 826 79030245 412410825 6.321 s 1.227 illumos
10. 831 0 435643355 6.573 s 1.276 bentmcil
11. 863 79030245 412410825 6.409 s 1.244 qs22f
12. 1017 117202900 435643355 6.749 s 1.310 bentley_mcilroy
13. 1051 0 552277515 8.368 s 1.624 system
14. 1059 88213925 435643355 6.693 s 1.299 bentley_mcilroy2
15. 1241 0 1050856410 25.186 s 4.889 musl
And in the “izabera” tests:
Tot.rank Swaps Compares Time Ratio Implementation
1. 268 50088615 337302450 4.878 s 1.000 qs22ck
2. 274 42158775 301589505 4.506 s 0.924 qs22j
3. 275 42158775 301589505 4.493 s 0.921 qs22k
4. 284 54052335 337394955 4.917 s 1.008 qs22bk
5. 326 55421475 352864440 5.155 s 1.057 qs22b
6. 327 0 287123550 4.505 s 0.924 system
7. 341 52360275 356185275 5.207 s 1.067 qs22c
8. 389 42159225 350948055 5.264 s 1.079 qs22i
9. 433 42131190 369796395 5.545 s 1.137 qs22h
10. 539 0 374882415 5.630 s 1.154 bentmcil
11. 556 42131190 369796395 5.634 s 1.155 qs22f
12. 564 42131190 369796395 5.599 s 1.148 illumos
13. 626 49647700 374882415 5.696 s 1.168 bentley_mcilroy2
14. 630 57704070 374882415 5.742 s 1.177 bentley_mcilroy
15. 648 0 619016520 14.773 s 3.028 musl
The smoothsort is claimed to be much better on nearly-sorted data. The “izabera” tests of Isabella Bosia contain two tests of “one percent” disordered data. The first is labeled “one_percent” and is generated by filling the array with ascending integers and then replacing one percent of them with random numbers. Here is a sample result:
Testing 100000 int elements one_percent (izaberra):
916975 6704220 21.734 ms 1.000 qs22j
1285215 7482825 22.231 ms 1.023 qs22bk
1208420 7456070 22.750 ms 1.047 qs22ck
916975 6704220 23.578 ms 1.085 qs22k
0 6628075 23.818 ms 1.096 system
1246350 7766630 28.176 ms 1.296 qs22c
1309845 7775170 29.408 ms 1.353 qs22b
910335 7768230 30.094 ms 1.385 qs22h
916975 7723695 30.267 ms 1.393 qs22i
910335 7768230 30.880 ms 1.421 illumos
910335 7768230 33.015 ms 1.519 qs22f
0 9057205 41.078 ms 1.890 bentmcil
1376760 9057205 41.965 ms 1.931 bentley_mcilroy
1215205 9057205 42.533 ms 1.957 bentley_mcilroy2
0 16622915 103.899 ms 4.780 musl
The other way is labeled “1% the nsz way”, which Ms. Bosia told me is named for musl contributor Szabolcs Nagy. This test starts by filling the array with ascending integers and then randomly swapping 1 out of 200 pairs of elements, so that 1 out of 100 elements will be permuted. Sample result:
Testing 100000 int elements 1% the nsz way (izaberra):
21990 5542065 11.430 ms 1.000 qs22ck
22220 5542205 12.022 ms 1.052 qs22bk
0 1285450 13.103 ms 1.146 musl
12220 5533135 13.434 ms 1.175 qs22j
12220 5533135 13.989 ms 1.224 qs22k
100070 7117190 19.584 ms 1.713 qs22b
170950 7398520 20.424 ms 1.787 qs22c
0 6248200 22.975 ms 2.010 system
12220 7486910 25.168 ms 2.202 qs22i
12150 7499480 25.804 ms 2.257 qs22f
12150 7499480 26.071 ms 2.281 qs22h
268015 7640370 27.890 ms 2.440 bentley_mcilroy
182730 7640370 27.890 ms 2.440 bentley_mcilroy2
0 7640370 27.891 ms 2.440 bentmcil
12150 7499480 28.510 ms 2.494 illumos
Here musl does much better. I don’t know why one way of creating “one percent” disorder (random values replaced) has a different result than another (random values permuted).
But I added another test, modeled on “1% the nsz way”:
Testing 100000 int elements 5% the nsz way (izaberra):
121520 6720975 16.322 ms 1.000 qs22ck
126860 6727660 17.038 ms 1.044 qs22bk
73740 6680010 19.001 ms 1.164 qs22j
73740 6680010 19.420 ms 1.190 qs22k
224190 7364715 20.785 ms 1.273 qs22c
183330 7203915 20.949 ms 1.283 qs22b
0 2343505 20.953 ms 1.284 musl
73740 7494280 25.897 ms 1.587 qs22i
73350 7501535 26.104 ms 1.599 qs22h
73350 7501535 27.264 ms 1.670 qs22f
0 6825135 27.939 ms 1.712 system
73350 7501535 28.918 ms 1.772 illumos
0 7710185 29.722 ms 1.821 bentmcil
261750 7710185 29.726 ms 1.821 bentley_mcilroy2
356755 7710185 30.047 ms 1.841 bentley_mcilroy
Here, with 5% of the elements permuted, musl is rapidly falling behind other sorts. When I have about half the elements randomly replaced, I get:
Testing 100000 int elements 50% sorted (izaberra):
2033520 7343990 38.587 ms 1.000 qs22bk
1862680 7366740 40.228 ms 1.043 qs22ck
0 7685520 42.298 ms 1.096 system
1890640 8344800 49.058 ms 1.271 qs22c
1750390 7805100 49.348 ms 1.279 qs22j
2058400 8374840 49.921 ms 1.294 qs22b
1751370 8209210 49.981 ms 1.295 qs22h
1750390 7805100 50.161 ms 1.300 qs22k
1751370 8209210 51.946 ms 1.346 illumos
1750390 8245560 53.334 ms 1.382 qs22i
1751370 8209210 58.796 ms 1.524 qs22f
0 8904400 59.894 ms 1.552 bentmcil
2113120 8904400 62.068 ms 1.608 bentley_mcilroy
1974380 8904400 64.630 ms 1.675 bentley_mcilroy2
0 21785295 146.554 ms 3.798 musl
I conclude that musl qsort
, though a tour de force of programming, is not as good as some simpler sorts, even for the cases for which it is to excel.
Yaroslavskiy’s dual-pivot quicksort
Wikipedia says ” A version of dual-pivot quicksort developed by Yaroslavskiy in 2009[11] turned out to be fast enough to warrant implementation in Java 7”. I obtained his paper and implemented qs22ydpq
from it as closely as possible. I used the same swapping strategy as in qs22b.c
and several other sorts. I believe it is a fair version for comparison. Here are some results:
Tot.rank Swaps Compares Time Ratio Implementation
1. 454 1295628260 6360932820 97.789 s 1.000 qs22j
2. 469 1295628260 6360932820 98.138 s 1.004 qs22k
3. 547 1474008500 6888529840 103.094 s 1.054 qs22bk
4. 549 1408957420 6901692000 103.493 s 1.058 qs22ck
5. 624 1616259440 7927606400 117.223 s 1.199 qs22b
6. 746 1565906240 7960573720 118.507 s 1.212 qs22c
7. 794 1459080660 7891557400 118.740 s 1.214 qs22i
8. 844 1448442400 7917118940 119.305 s 1.220 qs22h
9. 1165 1448442400 7917118940 123.061 s 1.258 qs22f
10. 1191 1448442400 7917118940 122.440 s 1.252 illumos
11. 1277 0 8221568620 152.017 s 1.555 bentmcil
12. 1357 1585231370 8221568620 126.020 s 1.289 bentley_mcilroy2
13. 1405 2057439810 8221568620 145.122 s 1.484 bentley_mcilroy
14. 1416 0 11193777940 173.147 s 1.771 system
15. 1562 2746796010 12037545955 192.830 s 1.972 qs22ydpq
It fared a little better in the “izabera” test suite:
Tot.rank Swaps Compares Time Ratio Implementation
1. 363 845862200 5800329540 84.921 s 1.000 qs22b
2. 378 664027320 4882687260 74.632 s 0.879 qs22j
3. 381 664027320 4882687260 74.755 s 0.880 qs22k
4. 381 770930420 5482457360 80.996 s 0.954 qs22ck
5. 394 823141500 5562942420 81.607 s 0.961 qs22bk
6. 441 795927280 5749257820 85.437 s 1.006 qs22c
7. 458 0 4657468760 77.295 s 0.910 system
8. 474 664035420 5713784460 86.189 s 1.015 qs22i
9. 502 663370440 5824862140 87.394 s 1.029 qs22h
10. 703 663370440 5824862140 89.520 s 1.054 illumos
11. 715 663370440 5824862140 90.119 s 1.061 qs22f
12. 816 0 6128745270 100.821 s 1.187 bentmcil
13. 826 777632180 6128745270 93.493 s 1.101 bentley_mcilroy2
14. 868 898924590 6128745270 99.248 s 1.169 bentley_mcilroy
15. 940 2007239550 7892935950 135.267 s 1.593 qs22ydpq
I know that dual-pivot quicksort has done better than some other traditional quicksorts in some tests, but I cannot get such a result myself.
quadsort
I also compared my implementations with Igor van den Hoven’s quadsort
, which is not strictly a replacement for qsort
, because (as far as I can tell) it cannot handle elements of arbitrary size, and also because it generally requires dynamic allocation of additional storage. (I did notice that it apparenly uses local storage if the malloc()
fails.) It has the advantage of being stable, and it is fast for the kinds of data it can sort. Here are some results:
Tot.rank Swaps Compares Time Ratio Implementation
1. 362 70596435 344791050 5.145 s 1.000 qs22k
2. 388 70596435 344791050 5.128 s 0.997 qs22j
3. 453 77901675 366759345 5.328 s 1.035 qs22ck
4. 484 82285275 371270505 5.400 s 1.050 qs22bk
5. 507 0 357547185 5.067 s 0.985 quadsort
6. 589 86727810 408541395 5.973 s 1.161 qs22c
7. 613 90954705 413041065 6.036 s 1.173 qs22b
8. 675 79030245 412410825 6.110 s 1.187 qs22h
9. 680 80264325 405888780 5.997 s 1.165 qs22i
10. 887 0 435643355 6.477 s 1.259 bentmcil
11. 910 79030245 412410825 6.281 s 1.221 illumos
12. 930 79030245 412410825 6.319 s 1.228 qs22f
13. 1091 117202900 435643355 6.605 s 1.284 bentley_mcilroy
14. 1101 0 552277515 8.200 s 1.594 system
15. 1130 88213925 435643355 6.603 s 1.283 bentley_mcilroy2
And on the “izabera” suite:
Tot.rank Swaps Compares Time Ratio Implementation
1. 204 0 170809545 2.701 s 1.000 quadsort
2. 294 54052335 337394955 4.875 s 1.805 qs22bk
3. 297 50088615 337302450 4.878 s 1.806 qs22ck
4. 307 42158775 301589505 4.491 s 1.662 qs22k
5. 312 42158775 301589505 4.516 s 1.672 qs22j
6. 346 55421475 352864440 5.137 s 1.902 qs22b
7. 369 52360275 356185275 5.231 s 1.936 qs22c
8. 372 0 287123550 4.494 s 1.663 system
9. 434 42159225 350948055 5.288 s 1.957 qs22i
10. 438 42131190 369796395 5.503 s 2.037 qs22h
11. 566 0 374882415 5.712 s 2.114 bentmcil
12. 588 42131190 369796395 5.677 s 2.101 qs22f
13. 597 42131190 369796395 5.637 s 2.086 illumos
14. 670 57704070 374882415 5.839 s 2.161 bentley_mcilroy
15. 686 49647700 374882415 5.825 s 2.156 bentley_mcilroy2
Note that here I could only test on int
, double
, and pointers to strings; I had to omit the test of sorting an array of struct
. In the “izabera” tests, it definitely performed the best, much faster and only about half as many compares as other sorts. In the Bentley-McIlroy “certification” tests, it was also fastest, but only by a small margin over qs22j
and qs22k
, and did more compares than the latter two. I would say that it may be the preferred sort for appropriate data, but it is not a replacement for a general qsort
implementation.
Some other sorts
I also implemented heapsort three different ways, as qs22heap1.c
, qs22heap2.c
, and qs22heap3.c
. And I cleaned up an old Shell sort as qs22ss.c
and added qs22ssb.c
, which uses the Ciura increments extended with Tokuda’s increments (roughly increasing by 2.25).
From other sources, I included
- sortix (qsort.c and qsort_r.c), which is a heapsort
- two implementations from Plan9 (the faster, plan9x
, is similar to Bentley-McIlroy)
- klibc
, which is a combsort (based on bubble sort; similar to Shell sort but slower?)
- two versions from Isabella Bosia’s qsortbench: izabera_mini
is a Shell sort and izabera
is Introsort, which switches to heapsort if the stack gets too deep
- mccaughan
, which comes from Gareth McCaughan (thanks Gareth)
- rg91mod
, which is a cleaned-up version of my original qsort
from the C SNIPPETS collection.
Results from the Bentley-McIlroy “certification” tests:
Tot.rank Swaps Compares Time Ratio Implementation
1. 475 70596435 344791050 5.118 s 1.000 qs22j
2. 589 70596435 344791050 5.163 s 1.009 qs22k
3. 800 77901675 366759345 5.429 s 1.061 qs22ck
4. 814 82285275 371270505 5.438 s 1.062 qs22bk
5. 926 80264325 405888780 5.964 s 1.165 qs22i
6. 1026 90954705 413041065 6.049 s 1.182 qs22b
7. 1120 86727810 408541395 6.057 s 1.183 qs22c
8. 1144 83471010 381329220 5.783 s 1.130 reactos_pre_shell
9. 1233 139797720 413041065 6.067 s 1.185 qs22a
10. 1253 80972410 379594140 5.628 s 1.100 bentley_mcilroy_pre
11. 1274 79030245 412410825 6.167 s 1.205 qs22h
12. 1394 83433570 381337545 5.842 s 1.141 bentley_mcilroy_pre_shell
13. 1463 0 381329220 5.991 s 1.171 freebsd_pre_shell
14. 1561 81369090 401815305 6.094 s 1.191 reactos_pre
15. 1604 0 381329220 6.139 s 1.199 bionic_pre_shell
16. 1641 79030245 412410825 6.312 s 1.233 illumos
17. 1658 0 443195895 6.615 s 1.292 netbsd
18. 1717 0 401815305 6.296 s 1.230 freebsd_pre
19. 1720 119535810 404955795 6.317 s 1.234 reactos_shell
20. 1765 0 401815305 6.246 s 1.220 bionic_pre
21. 1767 79030245 412410825 6.347 s 1.240 qs22f
22. 1834 0 443195895 6.646 s 1.298 openbsd
23. 1840 119511645 404966685 6.315 s 1.234 bentley_mcilroy_shell
24. 1844 0 435643355 6.534 s 1.277 bentmcil
25. 2107 0 443195895 6.756 s 1.320 plan9x
26. 2109 125018610 435137595 6.609 s 1.291 izabera
27. 2221 117202900 435643355 6.641 s 1.297 bentley_mcilroy
28. 2309 0 404955795 6.627 s 1.295 freebsd_shell
29. 2323 88213925 435643355 6.641 s 1.297 bentley_mcilroy2
30. 2336 0 404955795 6.732 s 1.315 bionic_shell
31. 2404 116393205 443195895 6.824 s 1.333 reactos_nopt
32. 2533 148089265 608166795 8.449 s 1.651 qs22ydpq
33. 2619 0 443195895 7.076 s 1.382 bionic_nopt
34. 2660 0 552277515 8.240 s 1.610 system
35. 2716 0 443195895 7.110 s 1.389 freebsd_nopt
36. 2942 0 818425530 11.234 s 2.195 mccaughan
37. 3304 246164475 823153905 11.682 s 2.282 rg91mod
38. 3428 0 733947885 12.662 s 2.474 uclibcng
39. 3513 0 646548858 10.947 s 2.139 dietlibc
40. 3545 0 733947885 12.844 s 2.509 uboot
41. 3646 0 895761870 13.755 s 2.687 plan9
42. 3673 0 733947885 13.080 s 2.555 qs22ss
43. 3711 188541285 765917280 13.316 s 2.601 qs22ssb
44. 3806 0 1050856410 24.915 s 4.867 musl
45. 3826 0 790258260 13.790 s 2.694 izabera_mini
46. 3979 0 1116928185 19.585 s 3.826 qs22heap3
47. 4059 0 1116928185 19.950 s 3.898 qs22heap2
48. 4082 0 1116928185 19.981 s 3.904 qs22heap1
49. 4133 0 2826931590 36.694 s 7.169 klibc
50. 4303 0 1667686365 30.445 s 5.948 sortix
And with the “izabera” tests:
Tot.rank Swaps Compares Time Ratio Implementation
1. 524 54052335 337394955 4.666 s 1.000 qs22bk
2. 538 42158775 301589505 4.346 s 0.931 qs22j
3. 579 42158775 301589505 4.332 s 0.928 qs22k
4. 609 55421475 352864440 4.953 s 1.061 qs22b
5. 637 50088615 337302450 4.784 s 1.025 qs22ck
6. 664 62668725 352864440 4.928 s 1.056 qs22a
7. 727 52360275 356185275 5.006 s 1.073 qs22c
8. 756 42159225 350948055 5.048 s 1.082 qs22i
9. 782 0 287123550 4.314 s 0.925 system
10. 938 66062175 341839455 4.978 s 1.067 izabera
11. 998 54595995 360513570 5.300 s 1.136 reactos_pre_shell
12. 1015 42131190 369796395 5.341 s 1.145 qs22h
13. 1018 49320085 341022800 4.824 s 1.034 bentley_mcilroy_pre
14. 1029 0 428830110 5.915 s 1.267 mccaughan
15. 1057 0 379594800 5.435 s 1.165 netbsd
16. 1116 49838775 369120480 5.350 s 1.146 reactos_pre
17. 1133 0 360513570 5.399 s 1.157 freebsd_pre_shell
18. 1147 54598605 360504990 5.332 s 1.143 bentley_mcilroy_pre_shell
19. 1180 0 369120480 5.491 s 1.177 freebsd_pre
20. 1199 0 369120480 5.477 s 1.174 bionic_pre
21. 1213 63545220 345832935 5.230 s 1.121 bentley_mcilroy_shell
22. 1217 0 379594800 5.511 s 1.181 openbsd
23. 1222 63540480 345836550 5.259 s 1.127 reactos_shell
24. 1257 0 360513570 5.500 s 1.179 bionic_pre_shell
25. 1266 0 374882415 5.426 s 1.163 bentmcil
26. 1285 42131190 369796395 5.452 s 1.168 illumos
27. 1345 42131190 369796395 5.472 s 1.173 qs22f
28. 1374 0 379594800 5.609 s 1.202 plan9x
29. 1424 0 345836550 5.392 s 1.155 bionic_shell
30. 1431 0 345836550 5.402 s 1.158 freebsd_shell
31. 1480 57704070 374882415 5.496 s 1.178 bentley_mcilroy
32. 1513 57062130 379594800 5.646 s 1.210 reactos_nopt
33. 1536 0 379594800 5.722 s 1.226 bionic_nopt
34. 1555 127341220 477259515 6.562 s 1.406 qs22ydpq
35. 1561 87352530 609372225 8.158 s 1.748 rg91mod
36. 1564 49647700 374882415 5.539 s 1.187 bentley_mcilroy2
37. 1640 0 379594800 5.771 s 1.237 freebsd_nopt
38. 1652 0 451693395 6.770 s 1.451 plan9
39. 1671 0 464056080 8.670 s 1.858 uclibcng
40. 1731 0 464056080 8.772 s 1.880 uboot
41. 1802 0 464056080 8.918 s 1.911 qs22ss
42. 1948 136362075 445998675 8.309 s 1.781 qs22ssb
43. 1992 0 619016520 14.317 s 3.068 musl
44. 2002 0 453806985 8.580 s 1.839 izabera_mini
45. 2214 0 534551857 8.478 s 1.817 dietlibc
46. 2389 0 724325025 12.949 s 2.775 qs22heap3
47. 2410 0 5416476885 73.914 s 15.838 klibc
48. 2449 0 724325025 13.222 s 2.833 qs22heap1
49. 2452 0 724325025 13.130 s 2.814 qs22heap2
50. 2606 0 1145668620 21.112 s 4.524 sortix
The last word from me on qsort? Not sure…
A more recent development, pdqsort (pattern-defeating quicksort) I believe first surfaced in 2017. The paper has some ideas that may benefit a standard C qsort
, but some may be more useful in settings that permit an inlined comparison routine.
I may work on these some more in the future, but I am done for now. Other projects beckon.