Re-engineering a qsort function (part 3)

The qsort optimization that wasn’t

In 1993, Jon Bentley and Doug McIlroy published a paper Engineering a Sort Function (also here) (Software–Practice and Experience, vol. 23 no. 11, Nov. 1993, pp 1249-1265). Since then, most C stdlib qsort functions have been based on the code and/or the ideas in that paper.

The qsort in the BSD variants and descendants have followed the code in that paper. But someone in 1994 attempted to “improve” on Bentley and McIlroy, with potentially disastrous results.

The NetBSD code prior to 1994, in NetBSD qsort 1.1 (copied from 386BSD), contained a qsort that attempts to follow the quicksort discussion in Donald Knuth’s The Art of Computer Programming, volume 3, Sorting and Searching (1973, first edition). I believe the author would have been better off following the advice in the later paper by Robert Sedgewick, Implementing Quicksort Programs (Communications of the ACM, vol 21 no 10, Oct. 1978 pp 847–857). Sedgewick was Knuth’s PhD student and had done his thesis on quicksort.

The comments in the 386BSD code imply to me that some of Knuth was misunderstood. For example: “Knuth, Vol. 3, page 116, Algorithm Q, step b, argues that a single pass of straight insertion sort after partitioning is complete is better than sorting each small partition as it is created.”

Oddly, Knuth never suggests a single pass of insertion sort instead of sorting each small partition; that was suggested later by Sedgewick. Knuth step b says “Subfiles of M or fewer elements are sorted by straight insertion…” and details doing that in step Q8, page 117.

A later comment says “Insertion sort has the same worst case as most simple sorts (O N^2). It gets used here because it is (O N) in the case of sorted data.” It is true that it is O(N²) worst case and O(N) if the data is sorted, but that is not why it is used with quicksort. It is used because it is faster for small subsections than quicksort due to its lower “bookkeeping” overhead, per Knuth on page 116.

The 386BSD routine departed from Knuth in an important respect: its partitioning code “jumped over” equal elements instead of stopping and swapping them. Bentley and McIlroy noted: “Shopping around for a better qsort, we found that a qsort written at Berkeley in 1983 would consume quadratic time on arrays that contain a few elements repeated many times–in particular arrays of random zeros and ones.” And later: “The Berkeley qsort takes quadratic time on random zeros and ones precisely because it scans over equal elements…”

The author also did something else not found in Knuth, however. There is this comment:

	 * Quicksort behaves badly in the presence of data which is already
	 * sorted (see Knuth, Vol. 3, page 119) going from O N lg N to O N^2.
	 * To avoid this worst case behavior, if a re-partitioning occurs
	 * without swapping any elements, it is not further partitioned and
	 * is insert sorted.  This wins big with almost sorted data sets and
	 * only loses if the data set is very strangely partitioned.  A fix
	 * for those data sets would be to return prematurely if the insertion
	 * sort routine is forced to make an excessive number of swaps, and
	 * continue the partitioning.

(Knuth does not say this on page 119; he does say on page 122 “If the original file is already in order … this situation … makes quicksort anything but quick; its running time becomes proportional to N² instead of N log N.” This refers to a version that chooses the first element as the pivot. But Knuth describes ways of fixing the situation, including choosing the pivot at random or choosing the median of first, middle, and last elements. And the 386BSD routine does use the median-of-3 method.)

So the code includes a flag didswap which is set if the partitioning pass does any swapping, and then uses an insertion sort on the partition if didswap is false. This “optimization” loses very big if the data is “very strangely partitioned”. This is probably the cause of this issue called out in Bentley and McIlroy: “…several other problems with the Berkeley code. (For instance, a mod-4 ‘sawtooth’ with its front half reversed ran a factor of 8 slower than expected.)” Also, the comment blithely suggests “A fix for those data sets would be to return prematurely if the insertion sort routine is forced to make an excessive number of swaps” without indicating how that might be done without too much code or computation.

The Bentley-McIlroy paper reported that “In any event, the new code represents enough of an improvement to have been adopted in our own lab and by Berkeley.” But someone at Berkeley failed to read and understand the paper. The first FreeBSD qsort (1994-05-27) based on the Bentley-McIlroy paper, (also in NetBSD revision 1.1.1.2 (1994-06-16) and OpenBSD revision 1.1 (1995-10-18)) was copied from 4.4BSD-Lite. It has a variable swap_cnt that serves the same purpose as didswap in the earlier qsort. It’s misnamed, because it’s a 0/1 flag, not a counter. Then there is:

	if (swap_cnt == 0) {  /* Switch to insertion sort */
		for (pm = a + es; pm < (char *) a + n * es; pm += es)
			for (pl = pm; pl > (char *) a && cmp(pl - es, pl) > 0;
			     pl -= es)
				swap(pl, pl - es);
		return;
	}

This causes exactly the problem with the sawtooth pattern mentioned above.

This problem was noticed a long time ago. See, e.g., this thread by postgresql developers. Apparently the “very strangely partitioned” data happened in the real world:

qsort, once again
From:Tom Lane
Date:16 March 2006, 18:37:58

I was just looking at the behavior of src/port/qsort.c on the test case
that Jerry Sievers was complaining about in pgsql-admin this morning.
I found out what the real weak spot is: it's got nothing directly to do
with good or bad pivots, it's this code right here:
   if (swap_cnt == 0)   {                            /* Switch to insertion sort */
   [... code shown above -- rdg ...]
   }
In other words, if qsort hits a subfile for which the chosen pivot is a
perfect pivot (no swaps are necessary), it switches to insertion sort.
Which is O(N^2).  In Jerry's test case this happens on a subfile of
736357 elements, and you can say goodnight to that process ....

This swap_cnt code was in place in NetBSD until qsort revision 1.20, 2009-06-01, about 15 years later. (Commit message: “qsort: remove the “switch to insertion sort” optimization because it causes catastrophic performance for certain inputs.”)

It remained in OpenBSD until qsort revision 1.12, 2014-06-12, about 20 years later. (Commit message: “Disable the “switch to insertion sort” optimization to avoid quadratic behavior for certain inputs. From NetBSD.”

As of this writing, it remains in FreeBSD and in several other libc distributions, e.g. Newlib, picolibc, bionic, reactos.

The postgresql team apparently decided to use their own qsort exclusively. It includes a different optimization: at the beginning of each partitioning operation, it checks to see if the data is already in order, and skips sorting that subfile if it is already sorted. (Some members of the group are skeptical of this check.)

It also occurred to me that perhaps a Shell sort in place of the if (swap_cnt == 0) insertion sort would improve performance.

I took several versions that include the swap_cnt code and removed that code, then tried adding a pre-sorted test, a Shell sort in place of the if (swap_cnt == 0) insertion sort, and both the pre-sorted test and the Shell sort. These versions are named nnn_nopt, nnn_pre, nnn_shell, and nnn_pre_shell for the no-swap-count-optimization, pre-sort-check, Shell sort, and pre-sort-plus-Shellsort versions, where nnn is the name of the original version (freebsd, reactos, bionic). And I added bentley_mcilroy_pre, bentley_mcilroy_shell, and bentley_mcilroy_pre_shell versions as well.

I added a pre-check for sorted data to qs22i to get qs22j, and used that in the following tests.

Here are some results (10000 elements, 5 reps):

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 377     94128580    459721400   6.806 s   1.000 qs22j
 705    107019100    541185040   8.011 s   1.177 qs22i
 817    111294680    508438960   7.933 s   1.165 reactos_pre_shell
 918    111244760    508450060   7.935 s   1.166 bentley_mcilroy_pre_shell
 955            0    508438960   8.064 s   1.185 freebsd_pre_shell
1003            0    535753740   8.258 s   1.213 freebsd_pre
1010    109557610    515009880   7.992 s   1.174 bentley_mcilroy_pre
1053            0    508438960   8.206 s   1.206 bionic_pre_shell
1124    108492120    535753740   8.295 s   1.219 reactos_pre
1305    159348860    539955580  10.402 s   1.528 bentley_mcilroy_shell
1326            0    535753740   8.377 s   1.231 bionic_pre
1380    159381080    539941060  10.103 s   1.484 reactos_shell
1487    118093240    585420940   8.908 s   1.309 bentley_mcilroy2
1536            0    539941060  10.134 s   1.489 freebsd_shell
1545            0    585420940  11.149 s   1.638 bentmcil
1630            0    539941060  10.196 s   1.498 bionic_shell
1680    155190940    590927860  10.575 s   1.554 reactos_nopt
1727    158006340    585420940  10.597 s   1.557 bentley_mcilroy
1754            0    590927860  10.648 s   1.564 freebsd_nopt
1867            0    590927860  11.162 s   1.640 bionic_nopt

And here are results from the “izabera” (Isabella Bosia) tests:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 299     56211700    402119340   5.971 s   1.000 qs22j
 441     56212300    467930740   6.952 s   1.164 qs22i
 591            0    492160640   7.356 s   1.232 freebsd_pre
 594     72794660    480684760   7.396 s   1.239 reactos_pre_shell
 654            0    480684760   7.455 s   1.249 freebsd_pre_shell
 685     72798140    480673320   7.437 s   1.246 bentley_mcilroy_pre_shell
 698     66451700    492160640   7.412 s   1.241 reactos_pre
 706     67278940    465405160   6.992 s   1.171 bentley_mcilroy_pre
 729            0    480684760   7.541 s   1.263 bionic_pre_shell
 771     84726960    461110580   7.793 s   1.305 bentley_mcilroy_shell
 783            0    492160640   7.502 s   1.256 bionic_pre
 786     84720640    461115400   7.700 s   1.290 reactos_shell
 840            0    461115400   7.791 s   1.305 freebsd_shell
 870     76082840    506126400   8.012 s   1.342 reactos_nopt
 900            0    503202250   8.304 s   1.391 bentmcil
 906            0    506126400   8.145 s   1.364 bionic_nopt
 925            0    506126400   8.078 s   1.353 freebsd_nopt
 934            0    461115400   7.878 s   1.319 bionic_shell
 942     66704230    503202250   7.788 s   1.304 bentley_mcilroy2
1065     78515800    503202250   8.271 s   1.385 bentley_mcilroy

Does the swap_cnt code improve the performance if we avoid the cases where it goes quadratic? Here are results including the original versions of bionic, freebsd, newlib, nlopt, picolibc, and reactos:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 345    812425500   3995362140  62.996 s   1.000 qs22j
 579    934827860   4255667160  70.232 s   1.115 reactos_pre_shell
 675    934744100   4255449220  70.785 s   1.124 bentley_mcilroy_pre_shell
 707            0   4255667160  72.269 s   1.147 bionic_pre_shell
 750    942426500   5280619020  80.768 s   1.282 qs22i
 836            0   4462256520  85.932 s   1.364 picolibc
 874            0   4255667160  73.264 s   1.163 freebsd_pre_shell
 875    921501620   4550322480  73.574 s   1.168 reactos_pre
 895            0   4462256520  86.235 s   1.369 newlib
 923            0   4550322480  75.254 s   1.195 freebsd_pre
 952            0   4550322480  75.365 s   1.196 bionic_pre
 957    925527200   4324550240  71.875 s   1.141 bentley_mcilroy_pre
1018   1671482800   4462256520  86.726 s   1.377 reactos
1097            0   4462256520  86.923 s   1.380 netbsd_1_4
1268   1024111730   5490627380  85.876 s   1.363 bentley_mcilroy2
1281            0   4462256520  90.780 s   1.441 bionic
1328   1354974780   5048315340  95.209 s   1.511 reactos_shell
1380   1354903560   5047593740  94.987 s   1.508 bentley_mcilroy_shell
1413            0   4462256520  91.197 s   1.448 freebsd
1431            0   4462256520  95.693 s   1.519 nlopt
1446            0   5490627380 104.098 s   1.652 bentmcil
1477   1334927640   5490627380  98.820 s   1.569 bentley_mcilroy
1507            0   5532021360  99.464 s   1.579 freebsd_nopt
1511   1318873580   5532021360  99.074 s   1.573 reactos_nopt
1530            0   5048315340  97.048 s   1.541 freebsd_shell
1573            0   5532021360 100.056 s   1.588 bionic_nopt
1612            0   5048315340  97.321 s   1.545 bionic_shell

In most cases, adding only the Shell sort resulted in somewhat slower overall performance, but adding the pre-sort check and the Shell sort give better performance than the swap_cnt versions.

But including the cases that cause quadratic performance give clearly bad results for the swap_cnt sorts (ordered here by time rather than finishing rank):

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 578     41547940    200606648   3.256 s   1.000 qs22j
 929     49860980    219465656   3.699 s   1.136 reactos_pre_shell
 973     49843156    219434364   3.717 s   1.141 bentley_mcilroy_pre_shell
1192            0    219465656   3.801 s   1.167 bionic_pre_shell
1346     47973632    222390056   3.874 s   1.190 bentley_mcilroy_pre
1308     47641164    232805696   3.885 s   1.193 reactos_pre
1098     47050340    246166168   3.914 s   1.202 qs22i
1370            0    232805696   4.024 s   1.236 freebsd_pre
1331            0    219465656   4.035 s   1.239 freebsd_pre_shell
1440            0    232805696   4.092 s   1.257 bionic_pre
1738     51969562    263591438   4.345 s   1.334 bentley_mcilroy2
1766     70217484    240903484   4.722 s   1.450 reactos_shell
1847     70204772    240834664   4.805 s   1.475 bentley_mcilroy_shell
2153            0    240903484   4.877 s   1.498 bionic_shell
2128            0    240903484   4.940 s   1.517 freebsd_shell
2079     67699272    266621708   5.046 s   1.550 reactos_nopt
2021     68827308    263591438   5.065 s   1.555 bentley_mcilroy
2257            0    266621708   5.099 s   1.566 bionic_nopt
2181            0    266621708   5.099 s   1.566 freebsd_nopt
2002            0    263591438   5.383 s   1.653 bentmcil
1835            0   1624564520  31.208 s   9.583 netbsd_1_4
1650            0   1624564520  31.601 s   9.704 picolibc
1639            0   1624564520  32.435 s   9.960 newlib
1724   1495494748   1624564520  32.919 s  10.108 reactos
2194            0   1624564520  37.001 s  11.362 bionic
2311            0   1624564520  37.474 s  11.507 freebsd
2270            0   1624564520  38.115 s  11.703 nlopt

Not shown here are the Bosia tests, but they show similar results.

I conclude that if you want to use a qsort derived from Bentley and McIlroy, you should at least replace the swap_cnt == 0 insertion sort with a Shell sort, and probably add the check for already-sorted data.

Another change made by BSD from the Bentley-McIlroy paper was to replace the second of the two recursive calls with goto, with the comment “Iterate rather than recurse to save stack space”. This will save little or no space unless the partitioning is significantly uneven. But it also does not prevent a possible O(N²) stack depth usage. (Also, the goto goes to a point just before SWAPINIT(a, es);, so the SWAPINIT() macro gets called redundantly many times.) Only in 2017 did the major BSD-derived versions change to always recursing on the smaller partition, which eliminates the quadratic stack space problem. (Bentley and McIlroy explicitly decided not to deal with the possibility.)

One other change made by FreeBSD seems odd to me: Just recently (January 2022) they made a change “prevent undefined behavior,” with the following comments:

Mark Milliard has detected a case of undefined behavior with the LLVM
UBSAN. The mandoc program called qsort with a==NULL and n==0, which is
allowed by the POSIX standard. The qsort() in FreeBSD did not attempt
to perform any accesses using the passed pointer for n==0, but it did
add an offset to the pointer value, which is undefined behavior in
case of a NULL pointer. This operation has no adverse effects on any
achitecture supported by FreeBSD, but could be caught in more strict
environments.

After some discussion in the freebsd-current mail list, it was
concluded that the case of a==NULL and n!=0 should still be caught by
UBSAN (or cause a program abort due to an illegal access) in order to
not hide errors in programs incorrectly invoking qsort().

Only the the case of a==NULL and n==0 should be fixed to not perform
the undefined operation on a NULL pointer.

I cannot see how this “is allowed by the POSIX standard,” which says “The qsort() function shall sort an array of nel objects, the initial element of which is pointed to by base.” Nowhere does it say the base pointer can be NULL; it must point to “an array of nel objects.” This is clearly a bug in mandoc. There are many ways to call qsort with invalid arguments that can cause various sorts of trouble.

On the matter of avoiding undefined behavior, many of the BSD variants still use the Bentley-McIlroy “trick” of ((char *)a - (char *)0) to get a usable “address” from the base pointer. This is undefined behavior; both pointers in a pointer subtraction must point within the same array (or one past the end). I think these should be updated to use ((uintptr_t)(void *)a) if uintptr_t is available (it’s an optional <stdint.h> type in C99 and C11).

(See also: part 1, part 2, part 4, part 5)