Notice that this is on random longs. Of course branch misprediction and memory bandwidth are going to crush you for a branchy sort... you'll be wrong about half the time, and the comparisons and swapping are trivial! Real-world data isn't random, so I'd expect branch predictors to do much better at recognizing patterns. And your sorting isn't going to be as simple as sorting integers according to their values all that often. It's a neat exercise, but absent more realistic benchmarking to the contrary, I wouldn't assume I'll get a faster program by just substituting even a typical quicksort with this...
Sorting is the case study but not the insight. I think the cool/surprising part of the article is that doing more work can be faster... Even more memory work (which per conventional wisdom is not trivial!)
>> Sorting is the case study but not the insight. I think the cool/surprising part of the article is that doing more work can be faster
Yeah, I figured heap sort should be the standard because it has a worst case O(n*log(n)) complexity. But it does tend to do more swaps (a smarter implementation can avoid many writes), and the memory access is very cache-unfriendly. In practice it's just not used.
My comment wasn't inherently about sorting either. Branch-free in general is faster in the worst cases, i.e. if the branch is difficult to predict. Except most real-world branches are predictable... that's why we have branch predictors. So in general you should expect branch-free to be slower, unless you have reason to assume your branches are actually close to random.
> unless you have reason to assume your branches are actually close to random.
I mean, to the degree that you need to sort the data at all, it contains informational entropy.
Many "presents as sorted" data structures trade space for time by keeping track of the internal sortedness of chunks of the data between sorting passes. Heck, any insert-optimized data structure (B+ trees; LevelDB's sorted-string-tables; N-ary tries generally) gets most of its performance gains by being able to guarantee, at node-split time, that the children of any particular node are sorted relative to one another.
So, in many cases, by the very fact that you don't already have metadata attached to your data asserting that it's sorted, it's highly likely to have no branch-predictability.
(Couple that to the fact that load-balancing in OLTP distributed systems favors writes to evenly-keyspace-distributed partitioning keys, ala Dynamo, and you'll frequently find that your inputs to an index-like data structure can be guaranteed random.)
I'm struggling to follow your comment. I said real-world data have patterns and aren't completely random, so you can in general expect branch predictors to help (unless, obviously, your data is actually known to be random like in the example). You rebutted with "because load balancers evenly spread their key space, you will frequently find that your inputs can be guaranteed random", as if that somehow contradicts the point I was making...? It's also clearly not true that it's "highly likely" your data has "no branch-predictability" just because "you don't already have metadata attached to it asserting it's sorted"... right? That's not only both obviously not the case on its own, but also, why would you even sort data that's already guaranteed to be sorted? I can't make sense of this, so I feel like we must be speaking past each other somehow? But I'm confused how.
The insight I might have not made clear, is that if you have “this node and its children are already sorted” metadata, then it’s cheaper to check whether a given (fine-grained) input chunk is already sorted—and if so, to directly construct a node from it—than it is to feed it blindly to the sorting algorithm.
If such a pre-check is in play, then the sorting algorithm itself will basically never have the sort of “runs” of sorted values that make some algorithms cheaper. Those “runs” were already plucked out and turned into nodes!
Branches also can't be vectorized, so going branch free can lead to a substantial speedup if you can use vector instructions (you might be doing twice the work, but your vector instruction may let you do it 4-16 times faster).
Indeed, AVX512 instructions can sort 16 int32 or well-behaved float values optimally, with no branches. It is a good final pass to any other algorithm.
There is a 28-step 16-input network there. It admits a direct translation to AVX-512 instructions.
What is usually considered minimal differs from what is best for AVX512. AVX512 is happy to do 16 comparisons at each stage; its efficiency is limited only by the number of stages. Thus, a network that did more comparisons in fewer stages would be faster.
Thanks. It's kind of sad that there's no single instruction for horizontally sorting in vectors of integers. Horizontal merge/reduce would also be cool
(Author here.) That is incorrect. The difficult case, and the real benchmark, is unpredictable data. The low-entropy cases will be about as fast with both partitioning schemes.
This is only a small part of the benchmarks I've run, because I wanted to drive one point home within a limited space. I've run many tests on various data types and shapes, and Lomuto does better than Hoare on most. (E.g. its improvement on double is even larger than on integer types.)
I'm confused, what exactly is incorrect? I said if we don't have more realistic benchmarks then I wouldn't just assume it's faster. Now you're saying you did in fact do more realistic benchmarks, and you found lower entropy ones to exhibit the same performance (which is not faster). They all seem consistent with each other?
I also said that sorting is often more complicated than just comparing integers by their values, which you didn't really address except to mention doubles, which missed the point I was making (I was trying to say sorting often needs e.g. satellite information).
In any case though, if you've done other benchmarks, it'd be nice if you could post them on GitHub or something. Maybe I'm missing something.
Lots of people replaced their sorting algorithms with Timsort because runs of already-sorted data are empirically common (at least in some workloads). I've seen it myself, in use cases where sort keys change slightly from iteration to iteration in some larger algorithm, and sorted items need to be dynamically kept in order. It no doubt also happens when grabbing data from multiple places, each of which is theoretically ordered arbitrarily but is in practice ordered predictably.
So yeah, the author picked the use cases where this algorithm shows itself in its best light. But that's not bad either, right? Data is often not well-ordered -- if you're counting occurrences of words and want to get an ordered list (and never mind that you wouldn't use a comparison-based sort for that), you wouldn't expect much order. Likewise when sorting any keys/values from a decent hash table.
One worry that I would have applies specifically to C++ and D: sometimes values are big, and copies are expensive. But I guess users can always choose to sort pointers like other languages do, if the standard library uses a copy-heavy, branch-light algorithm.
I think one example of extremely unsorted data is random data in Monte Carlo simulations, in particular if you need to compute statistical summaries that need the data to be sorted - e.g. percentiles. So you may end up repeatedly sorting million-element random arrays many times a second.
Then again, maybe there are algorithms letting you compute percentiles without sorting?
> Then again, maybe there are algorithms letting you compute percentiles without sorting?
I think you may be able to do so probabilistically, which is often good enough? Additionally, one observation about percentiles is that you don't need to sort all of the data; you can just hold on to the top N results in something like a heap.
Extracting percentile information doesn't require sorting. Even if you're too lazy to implement anything custom, a few invocations of std::nth_element will likely outperform sorting: https://en.cppreference.com/w/cpp/algorithm/nth_element
Yeah, you can histogram the values if you just need an approximate answer (or if you have few enough discrete values, you can just have a bin per value). Once you have an approximate answer, if you need a more precise answer, you can go through the data again and only consider values within the right bin (which hopefully will be a lot fewer, if you've chosen your bins wisely) and then use something like std::nth_element.