Aliasing concerns

While looking at commit logs for various BSD qsort versions, I noticed this:

*   libc qsort(3): stop aliasing.   Konstantin Belousov 2018-06-10  1   -44/+17

Which led to some details:

libc qsort(3): stop aliasing.
Qsort swap code aliases the sorted array elements to ints and longs in
order to do swap by machine words.  Unfortunately this breaks with the
full code optimization, e.g. LTO.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83201 which seems to
reference code directly copied from libc/stdlib/qsort.c.

Which in turn led to this, at the gcc development site:

505.mcf_f produces incorrect output when built with both LTO/FDO. Using either
option separately is fine. GCC trunk r255207 was used. Following are options
used.

[...]

Likely invalid.  spec_qsort is full of alias violations.  We sort

typedef struct basket
{
    arc_t *a;
    cost_t cost;
    cost_t abs_cost;
    LONG number;
} BASKET;

and spec_qsort does stuff like

[...]

#define swap(a, b)                              \
        if (swaptype_long == 0) {               \
                long t = *(long *)(a);          \
                *(long *)(a) = *(long *)(b);    \
                *(long *)(b) = t;               \
        } else if (swaptype_int == 0) {         \
                int t = *(int *)(a);            \
                *(int *)(a) = *(int *)(b);      \
                *(int *)(b) = t;                \
        } else                                  \
                swapfunc((char *)a, (char *)b, es, swaptype_long, swaptype_int)

eh... (no, swapfunc isn't any better)

[...]

I just looked at the source seeing *(long *) ptr and bells went off.

So the GCC team found that a certain benchmark produced anomalous resuts when both LTO (Link Time Optimization) and FDO (Feedback-Directed Optimization) were used, and attributed it to aliasing issues. The rules in sec. 6.5 of the C99 and C11 standards appear to say that accessing the elements being sorted as long, int, or sequences of long or int violate aliasing rules, at least if the actual data being sorted are not of exactly those types.

As a result, the FreeBSD team removed all the special swapping code (derived from the Bentley-McIlroy EASF paper) from their qsort implementation, leaving only code that swaps elements a byte at a time.

I have made qs22k.c by modifying qs22j.c to use memcpy() to do all the swapping. I also made modified versions qs22bk.c and qs22ck.c from qs22b.c and qs22c.c, and ran some more tests. I also included the system sort, which is apparently from glibc. Here are some results:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 1.  430   1295628260   6360932820  97.304 s   1.000 qs22j
 2.  447   1295628260   6360932820  97.666 s   1.004 qs22k
 3.  553   1474008500   6888529840 102.664 s   1.055 qs22bk
 4.  574   1408957420   6901692000 103.303 s   1.062 qs22ck
 5.  592   1616259440   7927606400 116.397 s   1.196 qs22b
 6.  760   1565906240   7960573720 118.266 s   1.215 qs22c
 7.  776   1459080660   7891557400 118.475 s   1.218 qs22i
 8.  827   1448442400   7917118940 118.951 s   1.222 qs22h
 9. 1157   1448442400   7917118940 121.839 s   1.252 illumos
10. 1177   1448442400   7917118940 122.455 s   1.258 qs22f
11. 1262            0   8221568620 150.855 s   1.550 bentmcil
12. 1321   1585231370   8221568620 125.109 s   1.286 bentley_mcilroy2
13. 1342            0  11193777940 173.175 s   1.780 system
14. 1382   2057439810   8221568620 144.101 s   1.481 bentley_mcilroy

And the corresponding “izabera” (I. Bosia) tests:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 1.  372    845862200   5800329540  84.991 s   1.000 qs22b
 2.  373    664027320   4882687260  74.581 s   0.878 qs22j
 3.  379    770930420   5482457360  81.157 s   0.955 qs22ck
 4.  388    664027320   4882687260  74.881 s   0.881 qs22k
 5.  389    823141500   5562942420  81.829 s   0.963 qs22bk
 6.  440            0   4657468760  77.355 s   0.910 system
 7.  446    795927280   5749257820  85.320 s   1.004 qs22c
 8.  451    664035420   5713784460  86.051 s   1.012 qs22i
 9.  495    663370440   5824862140  87.337 s   1.028 qs22h
10.  675    663370440   5824862140  89.244 s   1.050 illumos
11.  720    663370440   5824862140  90.345 s   1.063 qs22f
12.  789    777632180   6128745270  93.485 s   1.100 bentley_mcilroy2
13.  790            0   6128745270 100.835 s   1.186 bentmcil
14.  853    898924590   6128745270  99.273 s   1.168 bentley_mcilroy

So we see that the memcpy versions are not as fast as the pointers-to-swap-functions versions, but still perform respectably. The memcpy here is intrinsic in GCC on Intel. I suspect it is much slower if it is an external function. But if you need to be concerned about possible aliasing problems, it may be a good option.

On the other hand, though it is undefined according to the standard, I have to wonder if the actual aliasing problem exposed by the GCC benchmark described above was actually triggered by the swapping function being called with pointers to the same location. Recall that bentley_mcilroy and my qs22a both do call for swapping an element with itself, and bentley_mcilroy2 and my own later versions avoid this.

I would welcome some way of testing if GCC can be made to display an actual aliasing problem on qs22b or later versions.

(See also: part 1, part 2, part 3, part 5)