Lomuto's Comeback
November 21, 2020 5:40 AM

Yet a very practical--and very surprising--argument does exist, and is the punchline of this article: implemented in a branch-free manner, Lomuto partition is a lot faster than Hoare partition on random data. Given that quicksort spends most of its time partitioning, it follows that we are looking at a hefty improvement of quicksort (yes, I am talking about industrial strength implementations for C++ and D) by replacing its partitioning algorithm with one that literally does more work. posted by smcg (25 comments total) 25 users marked this as a favorite
 
I never thought I’d start the day with D fanfic yet here we are.
posted by geoff. at 6:26 AM on November 21 [5 favorites]


This really hammers home what furious machines modern processors are: multiple pipelines of instructions, tasked with operations predicted ahead of time by a scheduler. If you can trade decisions for rote actions, it can greatly benefit.
posted by nickggully at 6:40 AM on November 21 [2 favorites]


In fact it doesn't actually do more work. It specifies more work.

What's doing more work is modern CPUs, which treat every conditional branch as an opportunity to double their workload (briefly) by executing instructions along both possible post-branch paths until the branch condition has had time to get evaluated, then throwing away the results from the pathway that the condition was supposed to avoid.

This is intended to avoid wasted time due to pipeline stalls when the hardware's inbuilt branch prediction heuristics fail, but if there are very few instructions between conditional branches it can actually tie up enough execution units to waste about a stall's worth of processing time anyway.

The way this particular implementation of Lomuto partitioning is coded removes a conditional branch from inside the inner loop, meaning that the CPU doesn't have to double up in this way. And given that this particular inner loop is so very short, avoiding even a brief doubling-up on every trip around the loop pays off bigtime in freed-up execution unit capacity - substantially more than enough to offset the increased amount of work specified by Lomuto's partitioning algorithm compared to Hoare's.

Fascinating result! Has me wondering whether a similar kind of inner-branch elimination could be devised for Hoare's algorithm.
posted by flabdablet at 6:41 AM on November 21 [17 favorites]
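For concreteness, here is a minimal sketch of the branch-elimination idea flabdablet describes. This is not the article's exact code (the function and variable names here are my own); it shows the core trick: move elements unconditionally on every trip around the loop, and let the comparison result merely advance the write index.

```cpp
#include <cassert>
#include <cstddef>

// Branch-free Lomuto partition sketch: pivot is the last element.
// The element move happens unconditionally each iteration; the comparison
// only bumps the write index i, so compilers can emit setcc-style code
// with no conditional jump inside the loop body.
std::size_t partition_branchfree(int* a, std::size_t n) {
    int pivot = a[n - 1];
    std::size_t i = 0;                 // everything before i is < pivot
    for (std::size_t j = 0; j + 1 < n; ++j) {
        int x = a[j];
        a[j] = a[i];                   // unconditional swap of a[i], a[j]
        a[i] = x;
        i += (x < pivot);             // bool -> 0/1; typically setl, no jump
    }
    int tmp = a[i];                    // park the pivot in its final slot
    a[i] = a[n - 1];
    a[n - 1] = tmp;
    return i;
}
```

When `x >= pivot` the swap is a harmless shuffle within the "greater or equal" region, so correctness is unaffected; the payoff is that the loop body is the same instruction sequence every iteration.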


How many different languages out there are called “D”?
posted by mhoye at 8:32 AM on November 21 [2 favorites]


How many different languages out there are called “D”?

My favorite D language is called Rust!
posted by geoff. at 8:51 AM on November 21 [3 favorites]


Has me wondering whether a similar kind of inner-branch elimination could be devised for Hoare's algorithm.

A comment on the article gives a reference for just such an improvement, and links to a GitHub project that explains it pretty clearly!

A similar idea can be used to eliminate a branch while searching a binary tree, if the pointers to the left and right subtrees are stored in a 2-element array that’s indexed by the result of a comparison.
posted by mubba at 8:56 AM on November 21 [5 favorites]
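The binary-tree variant mubba mentions is easy to sketch. This is a hypothetical node layout, not code from the linked project: the two subtree pointers sit in a 2-element array, and the comparison result indexes it, so picking left vs. right involves no if/else. (The loop's exit test still branches; only the direction choice is branch-free.)

```cpp
#include <cassert>

// Hypothetical node layout for branch-free descent:
// child[0] = left subtree, child[1] = right subtree.
struct Node {
    int key;
    Node* child[2];
};

Node* find(Node* n, int key) {
    while (n != nullptr && n->key != key)
        n = n->child[key > n->key];   // bool -> 0/1 selects the subtree
    return n;
}
```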


How many different languages out there are called “D”?

Just the one, that I know of. As it’s about twenty years old and has a smallish but vigorous community, I would be surprised if someone in the position of naming another new programming language had not heard of it.
posted by musicinmybrain at 8:58 AM on November 21 [4 favorites]


If this yields a faster qsort, why hasn't it been implemented in the STL?
posted by They sucked his brains out! at 10:06 AM on November 21


Probably some combination of factors...

The observation may be pretty recent. And since it's dependent on branching trade-offs, it sounds like it would be worse in (say) single threaded microcontroller environments, where the trade-off might matter a bit more.
posted by kaibutsu at 10:57 AM on November 21


flabdablet: "if there are very few instructions between conditional branches it can actually tie up enough execution units to waste about a stall's worth of processing time anyway."

Does this suggest that the branchless Lomuto algorithm is probably only viable for arrays of primitives like long or int, where the actual comparison is performed by a single cmp instruction? Would this approach lose its advantage when sorting larger structures (possibly needing dereferencing) or with more complex comparison logic, like strings? And if so, does that mean it would end up worse than Hoare for those cases after all, due to the extra swaps?

That's not to minimize this... sorting arrays of primitives is valuable, and I'm amazed that there's still blood to be squeezed out of that stone after all these decades. This was a great article, thanks smcg!
posted by Riki tiki at 10:59 AM on November 21


Strings are basically a number comparison.
posted by They sucked his brains out! at 12:55 PM on November 21


This is fascinating, and makes a whole lot of sense given how absurdly fast CPUs are for best-case in-cache branchless processing, but mostly it makes me want to hear about more algorithms that sound like fencing moves after the fashion of The Princess Bride, i.e. "You are excellent! I see you have studied Hoare's Partition."
"Yes, of course. But surely you knew I would counter with Lomuto's Comeback." etc.
posted by Jon Mitchell at 1:20 PM on November 21 [6 favorites]


They sucked his brains out!: Strings are basically a number comparison.

Oh how I wish that were true. But even if you only cared about ASCII and were fine with its particular ordering of case and punctuation, sorting those strings still requires multiple numeric comparisons, probably a variable number of them (meaning you presumably have at least one branching failure in each comparison loop, when you reach the end of the shorter string).

Also, text comparison for real-world purposes is far more complicated than ASCII numeric ordering, sadly (and that spec is essentially about sorting arrays of codepoints, so unless you're using UTF-32 you also have a decoding step to take into account).
posted by Riki tiki at 1:27 PM on November 21 [6 favorites]
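Even the bare ASCII best case Riki tiki concedes illustrates the problem. A strcmp-style helper (sketched here, name my own) loops a data-dependent number of times, with an exit branch the predictor cannot resolve ahead of time; nothing like a single integer cmp:

```cpp
#include <cassert>

// Byte-wise ASCII comparison: negative/zero/positive like strcmp.
// The loop runs a data-dependent number of iterations and exits on a
// branch taken either where the strings first differ or where the
// shorter string ends.
int ascii_compare(const char* a, const char* b) {
    while (*a != '\0' && *a == *b) { ++a; ++b; }
    return (unsigned char)*a - (unsigned char)*b;
}
```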


smaller = -int(x < pivot);

andralex claims he's eliminating conditionals, but this line is hiding a conditional branch.

In C-like languages, a conditional expression treated like a number -- such as int(x < pivot) -- is supposed to have the value 1 if it's true and 0 if it's false. So down at the basic assembly-language level you need to compare the two values (a CMP instruction) and conditionally branch to code that either sets the accumulator to 0x00000001 or 0x00000000.
posted by Harvey Kilobit at 2:18 PM on November 21


So down at the basic assembly-language level you need to compare the two values (a CMP instruction) and conditionally branch to code that either sets the accumulator to 0x00000001 or 0x00000000.

There was a YouTube video I cannot for the life of me find that was long-ish and had a guy (UK? Australia?) explaining how to take conditionals out of C++ code, then going into the assembly to show that the compiler didn't muck it up. The video was meant to show that avoiding branch prediction at the processor level is a huge time saver at the expense of some awkwardly written code.

In any case, yeah, the point was that a lot of the time you're hiding conditionals, and the only way to know is to look at the compiled assembly.
posted by geoff. at 2:25 PM on November 21


Yeah, I just converted lomuto_partition_branchfree() to C, compiled it (gcc -O3 -c), and disassembled it (objdump -d), and there are indeed some CMPs and conditional jumps in there.
posted by smcameron at 2:30 PM on November 21 [2 favorites]


@geoff. I can’t respond with disassembled code, but the original assertion was about limiting conditional branches. A CMP (which affects the Carry flag) followed by an ADC (which includes the Carry flag) with an immediate zero (or -1, depending on the polarity) on the Intel x86 architecture can process the conditional without a branch. I’m sure the resulting code depends on optimization flags during compiling.
posted by dozenal at 2:58 PM on November 21 [2 favorites]


If you hover over that -int(x < pivot) line in the link to the gcc compilation from the article, it highlights these corresponding assembly instructions:
xor esi, esi
cmp r11, r8
setl sil
mov r10d, esi
// and esi, 1
neg r10d
(that and instruction is a reordered part of the following two lines of C++ code)

The setl is conditional, but it's not a jump so it's not obviously creating a branching situation. Based on that and the description in the "explanatio longa, codex brevis est" part of the article, it seems to be taking advantage of the processor's ability to intrinsically do what seems to us like a conditional numeric conversion.

Please correct me if I'm wrong, it's been a while since I've been looking at things this close to the metal (hence my wondering about more high-level structures like strings).
posted by Riki tiki at 3:04 PM on November 21
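Riki tiki's reading matches what the C++ language guarantees: a bool converts to exactly 1 or 0, which x86 can materialize with setl rather than a jump, and negating it produces an all-ones/all-zeros mask. A tiny illustration (the helper name is my own, standing in for the article's inlined expression):

```cpp
#include <cassert>

// Returns -1 when x < pivot, 0 otherwise. The bool -> long conversion
// yields exactly 1 or 0 (materialized with setl on x86, no jump), and
// negation turns that into an all-ones/all-zeros bit mask.
long smaller_mask(long x, long pivot) {
    return -long(x < pivot);
}
```

The mask form is what makes a later offset usable branch-free: `smaller_mask(x, pivot) & delta` keeps `delta` when the element is smaller and zeroes it otherwise, with a plain bitwise and.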


If it doesn't branch (i.e. change the instruction pointer and go off executing some other code), then it's fast. Mess with the compare flags and such all you want for free.
posted by ryanrs at 4:13 PM on November 21 [1 favorite]


The setl is conditional, but it's not a jump so it's not obviously creating a branching situation.

Even on instruction sets without conditional instructions like setl it's often possible to get similar behaviour using an add with carry or rotate instruction after a comparison.

Superoptimization
posted by flabdablet at 7:54 PM on November 21 [1 favorite]


Would this approach lose its advantage when sorting larger structures (possibly needing dereferencing) or with more complex comparison logic, like strings?

Modern processors have countless design tradeoffs between power consumption and execution speed. I would expect that the only definitive way to answer this question would involve conducting performance tests with the particular processor at hand, preferably under a range of different system loads.
posted by flabdablet at 12:39 AM on November 22


Andrei Alexandrescu is an extremely smart guy who revolutionized template metaprogramming in C++ and that in turn led to many very important improvements in the language itself.

He’s a guy who likes to probe into the dark and obscure corners of things to find hidden gems. This is one of them. But at the same time it’s a “toy problem” in the sense that it does not scale in any practical way. Usually when you sort, you sort, say, employee records by department or hire date or something. So the records that need to be swapped are significantly larger than the sort key itself. Swapping might even be non-trivial. Rest assured that he is well aware of this.

The reason to use quicksort over mergesort is that it uses no auxiliary memory. If you could throw O(n) memory at it, (tweeted) mergesort or “timsort” is a clear winner because they can run O(n log(runs(n))) consistently, have no O(n^2) pathological cases, can be stable, and sort “nearly sorted” data in nearly linear time. But for the generic case you can’t allocate O(n) memory like it’s not a thing.

The technique is not at all suitable for the STL because of that, but is a really good example of how we need to start thinking about the cache and speculative execution in our core algorithms. And this is a really nice example of that, suitable for study and to pique people’s interest in the issues. Myself, I think a productive avenue to explore is making hash maps more cache and branch-predict aware. People are definitely on that problem.

Anyway, it’s fascinating and very instructive but not in itself, at this time, practical for general sorting tasks. But very worthy of study and extensions of the idea may well pay off.
posted by sjswitzer at 5:17 PM on November 22 [3 favorites]


I should add that if you find this sort of thing interesting you should absolutely look up YouTube videos of his conference presentations. They are all top notch: entertaining, informative, and likely to change how you think about things you’ve “known” for decades.
posted by sjswitzer at 5:21 PM on November 22 [1 favorite]


^^ “tweeted” should have been “tweaked.” Autocorrect, sigh.

Anyway the typo is an opportunity to footnote that the tweak I meant is to batch the initial merges by preexisting runs, exploiting latent order in many data sets. Sorting an already sorted list should be O(n); you deserve that! Low level optimizations are something we need to focus on now because computers are very different than they were when the classic algorithm texts were written, but in day to day programming you’re probably still better off thinking about the high-level algorithmic issues. Library authors, though, have a lot of room to innovate.
posted by sjswitzer at 5:34 PM on November 22
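The "latent order" tweak sjswitzer describes hinges on how many preexisting runs the input has. A minimal sketch of the run count (function name my own): a natural mergesort merges runs pairwise, so it makes roughly log2(runs) passes, and already-sorted input, being a single run, finishes in one O(n) scan.

```cpp
#include <cstddef>
#include <vector>

// Count maximal non-decreasing runs. A natural mergesort does about
// log2(runs) merge passes over the data, so fewer runs means less work;
// a fully sorted input is one run.
std::size_t count_runs(const std::vector<int>& a) {
    if (a.empty()) return 0;
    std::size_t runs = 1;
    for (std::size_t i = 1; i < a.size(); ++i)
        runs += (a[i] < a[i - 1]);   // each descent starts a new run
    return runs;
}
```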


To put it more succinctly, today’s commodity computers have more L2 cache than mainframes had in total RAM when the algorithm classics were written.
posted by sjswitzer at 5:45 PM on November 22 [1 favorite]



