Re-engineering a qsort function (part 2)

(Continued from part 1.)

I was starting to feel pretty good about my qsort implementation. I also downloaded a number of qsort implementations from various sources. Several were from libc and related implementations, including NetBSD, FreeBSD, OpenBSD, dietlibc, u-boot, μClibc and uclibc-ng, Newlib, picolibc, klibc, Bionic, and musl.

Several of these were derived from EASF by way of 4.4BSD-Lite. The u-boot, μClibc and uclibc-ng versions are actually slightly modified versions of a Shell sort I wrote over 30 years ago. The musl version is an implementation of E.W. Dijkstra’s smoothsort, and klibc claims to be combsort, which I believe is a diminishing-increment sort (as in Shell sort) variant bubble sort.

My qs22b and qs22c performed well against all of them. I found that Newlib, picolibc, FreeBSD, Bionic and Reactos all go quadratic on certain inputs (primarily Bentley-McIlroy’s “reverse front half” patterns), so I omitted them from tests here. The musl and klibc and the Shell sorts performed poorly enough that I omitted them from most subsequent tests.

I got results like this:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 204    115637080    544721860   8.111 s   1.000 qs22c
 257    121272940    550721420   8.186 s   1.009 qs22b
 599    118093240    585420940   9.042 s   1.115 bentley_mcilroy2
 601            0    590927860  10.485 s   1.293 openbsd
 640            0    585420940  10.646 s   1.313 bentmcil
 654            0    590927860  10.586 s   1.305 netbsd
 693    158006340    585420940  10.742 s   1.324 bentley_mcilroy
 981            0    978597180  17.215 s   2.122 uclibcng
1087            0    862289174  16.191 s   1.996 dietlibc
1146            0    978597180  17.902 s   2.207 uboot
1151            0   1401141880  32.548 s   4.013 musl
1347            0   3769242120  48.261 s   5.950 klibc

More about the test program

Then I found Isabella Bosia’s qsortbench. She collected several qsort implementations and wrote a nice benchmark to compare their performance against various data patterns. I liked the output format and the tests themselves, so I reimplemented the tests, plus a few more into my test program. I added a swap count to her output format. Her system ranks the results of each test over all the sorts, and then totals the rank for each sort in the summary at the end. I also added a summary of total compares. The ranking technique causes the final rankings to sometimes be different in order than the times or total compares for each sort implementation.

The EASF paper contained an outline of a “certification” program that was to be used more as a torture test than as a benchmark. It generates data in a variety of patterns known to give trouble to some quicksort implementations. The original used C int and double datatypes. I implemented my own version and added sorting of pointers and structs.

For sorting pointers, I represented the original integer values as strings and used pointers to the strings. I also artificially increased the effort of the comparison function by reversing one of the strings twice, to make comparisons more expensive. For structs, I use a 230 byte struct and store the string representation of the integer into the last 20 bytes. I padded the struct to make swaps more expensive.

Except for Bosia’s own implementations, I did not use the sorts in her set. I obtained all the sorts shown here from primary sources. Also, unless otherwise indicated, these tests are using the EASF data patterns, 10,000 elements repeated 5 times, data types int, double, pointers, and structs.

I had not seen the Illumos system qsort before I found Bosia’s qsortbench. I got a copy from the OpenIndiana site. My qs22b and qs22c did well against all the sorts found so far, but the version from the Illumos system did better on swaps:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 217    115637080    544721860   8.019 s   1.000 qs22c
 256    121272940    550721420   8.081 s   1.008 qs22b
 427    105373660    549881100   8.424 s   1.051 illumos
 608            0    590927860  10.254 s   1.279 openbsd
 673    118093240    585420940   8.914 s   1.112 bentley_mcilroy2
 688            0    590927860  10.344 s   1.290 netbsd
 710            0    585420940  10.530 s   1.313 bentmcil
 747    158006340    585420940  10.578 s   1.319 bentley_mcilroy
1074            0    862289174  15.948 s   1.989 dietlibc

Trying to get better

My versions ran a little faster, but I wondered how the Illumos version did so much better on swaps. I didn’t want to look at the code to avoid copying theirs. I did grep for swap routines so I could insert swap counting, and saw that they use separate functions for different element sizes and set up pointers to the functions as needed.

I noticed that there were copious comments, so I grepped the comments out of the Illumos code and read what the author (I believe that is Saso Kiselkov) says about how the code works. Instead of swapping the pivot to a special location, he leaves it in place. The partitioning is done in a way that avoids swapping some elements that compare equal to the pivot, by sometimes changing the pivot pointer to refer to a different but equal-key element. This is rather hard to explain, but the code may make it clearer. I went through several attempts to get my function to match the Illumos code on compares and swaps, without referring to the Illumos code except to note how the pivot selection code and thresholds differed from EASF.

I ultimately got a function qs22f that matched the Illumos sort on compares and swaps, but it’s not quite as fast as qs22b and qs22c, probably due to its more complex partitioning:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 227    115637080    544721860   8.029 s   1.000 qs22c
 287    121272940    550721420   8.131 s   1.013 qs22b
 372    105373660    549881100   8.338 s   1.038 qs22f
 436    105373660    549881100   8.429 s   1.050 illumos
 658            0    585420940  10.512 s   1.309 bentmcil
 679    118093240    585420940   8.922 s   1.111 bentley_mcilroy2
 701    158006340    585420940  10.584 s   1.318 bentley_mcilroy

I modified qs22f to use pointers to swap functions instead of the EASF-style conditional swap code, to get qs22h:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 268    115637080    544721860   8.185 s   1.000 qs22c
 312    121272940    550721420   8.270 s   1.010 qs22b
 382    105373660    549881100   8.401 s   1.026 qs22h
 440    105373660    549881100   8.446 s   1.032 qs22f
 540    105373660    549881100   8.547 s   1.044 illumos
 782    118093240    585420940   9.123 s   1.115 bentley_mcilroy2
 797            0    585420940  10.775 s   1.316 bentmcil
 799    158006340    585420940  10.757 s   1.314 bentley_mcilroy

This runs a little faster than qs22f or the illumos version, with the same number of compares and swaps. But then I tweaked the pivot selection code to get qs22i. I also added tests to skip recursing or iterating on subfiles of less than two elements:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 338    115637080    544721860   8.081 s   1.000 qs22c
 390    107019100    541185040   8.157 s   1.009 qs22i
 391    121272940    550721420   8.161 s   1.010 qs22b
 424    105373660    549881100   8.277 s   1.024 qs22h
 538    105373660    549881100   8.402 s   1.040 qs22f
 629    105373660    549881100   8.549 s   1.058 illumos
 869            0    585420940  10.585 s   1.310 bentmcil
 887    118093240    585420940   9.023 s   1.117 bentley_mcilroy2
 934    158006340    585420940  10.671 s   1.321 bentley_mcilroy

This shows qs22i takes about .984 as many compares as illumos and about 1.016 times as many swaps. The tradeoff looks better with the “izabera” tests (test_sorts -z -r 5 100000), with about .95 as many compares and barely (1.00067 times) more swaps:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 176     73895300    470485920   6.918 s   1.000 qs22b
 202     69813700    474913700   6.956 s   1.005 qs22c
 227     56212300    467930740   7.064 s   1.021 qs22i
 295     56174920    493061860   7.374 s   1.066 qs22h
 366     56174920    493061860   7.470 s   1.080 qs22f
 418     56174920    493061860   7.534 s   1.089 illumos
 508     66704230    503202250   7.717 s   1.116 bentley_mcilroy2
 513            0    503202250   8.189 s   1.184 bentmcil
 535     78515800    503202250   8.263 s   1.194 bentley_mcilroy

But note that despite these improvements, qs22b and qs22c still ran faster.

More work on my version is covered in part 4. Before getting to that, I want to digress a bit on the BSD legacy of EASF in part 3.

(See also: part 1, part 5)