Fastest Branchless Binary Search
You’re a busy person so I’ll first jump right to it. Here it is, the fastest general (and simple) binary search C++ implementation:
template <class ForwardIt, class T, class Compare>
constexpr ForwardIt sb_lower_bound(
      ForwardIt first, ForwardIt last, const T& value, Compare comp) {
   auto length = last - first;
   while (length > 0) {
      auto rem = length % 2;
      length /= 2;
      if (comp(first[length], value)) {
         first += length + rem;
      }
   }
   return first;
}
Same function interface as std::lower_bound, but 2x faster, and shorter. “Branchless” because the if compiles down to a conditional move instruction rather than a branch/conditional jump. We will explore compiler options, even faster versions, a fully branchless variant, and caveats towards the end of the article. There’s a significant update too. Oh and I’m sorry about just thrusting C++ code on you. Rude, I know. You don’t really need to know C++ to understand this article, just iterators (first and last), basically pointers to elements in an array, though note they can point one past the last array entry. Ignore template, class, constexpr and &. If only there was a clean fast bare-metal language to write all this in.. 1 2
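To make the interface concrete before diving in, here’s a minimal usage sketch (my own example; it assumes the sb_lower_bound definition above is in scope):

#include <cassert>
#include <functional>
#include <vector>

int main() {
   std::vector<int> v = {1, 3, 5, 7, 9};
   // First position where 6 would fit: it points at 7, i.e. index 3
   auto it = sb_lower_bound(v.begin(), v.end(), 6, std::less<int>{});
   assert(it - v.begin() == 3);
}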
Binary search intro
You have a sorted list and want to find where a value would fit. In C++ you’d use std::lower_bound, which returns the first position (iterator) for which the comparison (usually <) fails, or last if the comparison is true for all elements. Let’s write that, so surprise coding interview question! (In C++. Good luck.)

The binary in binary search comes from splitting up the list into two at some middle item and doing a comparison of the middle against the given value. Based on the comparison result we pick which of the two lists to keep looking in. We start with a list of some length = last - first that starts at the given iterator first. We need to have some loop that keeps going until the list is empty, i.e. while (length > 0). Pick the/a middle at length / 2, do a comparison and update the current list, which we’ll do by updating first and length. Here goes:
// std_lower_bound()
auto length = last - first;
while (length > 0) {
   auto half = length / 2;
   if (comp(first[half], value)) {
      first += half + 1;
      length -= half + 1;
   } else {
      length = half;
   }
}
return first;
That was straightforward. What we got was a slightly refactored version of what gcc/libstdc++ uses or what llvm/libc++ uses. Roughly the same speed; it even (sometimes) compiles down to the same assembly in the loop.
Branch prediction
“What’s slow about this?” Not much, but glad you asked, great question. Processors have gotten faster and faster over the years, and part of the reason why is branch prediction. Short explanation: for speed, the CPU attempts to execute instructions in parallel by dividing each instruction execution into multiple stages, say F-D-E-W (fetch, decode, execute, writeback). With a careful design it can make progress on multiple stages of different instructions at the same time (example: instruction 5 in stage W, 6 in E, 7 in D, 8 in F). The complication comes from branches in the code, i.e. conditional jump instructions, where depending on some result the next instruction is either X or Y. The CPU predicts one option, X, and starts running through the stages, eventually including the stages of say X+1 and X+2. When the result becomes available and it turns out it should have been Y, all the work on X, X+1 and X+2 is thrown away. Branch mispredictions are expensive because the CPU could have already made progress on Y, Y+1 and Y+2 instead.
Branch prediction is great, but not for binary search. It’s a search, and you only search for something if you don’t know exactly where it is, otherwise you’d just get it. Which means that if (comp(first[half], value)) is unpredictable in common use of std::lower_bound.
Credit @practicemakespink
Let’s help the processor.
-We:
// bstd_lower_bound()
auto length = last - first;
while (length > 0) {
   auto half = length / 2;
   bool compare = comp(first[half], value);
   // No more ifs
   first = compare ? first + half + 1 : first;
   length = compare ? length - half - 1 : half;
}
return first;
-Clang compiler: “That’s not how this works!”
-We: -mllvm -x86-cmov-converter=false
-Clang compiler: “Yes, boss.”3
The result is 25% faster as it uses 2 conditional move instructions. Not bad. But it turns out that -mllvm -x86-cmov-converter=false, which we’ll shorten to -cmov, speeds up std::lower_bound just as much because clang is smart enough to figure out how to convert the if/else to conditional moves. gcc doesn’t have such an option and generally just doesn’t care about what you want.

Overall, there’s currently no good way to tell either clang4 or gcc5 to use a conditional move in just a certain situation. I’m trying to not make this article about finicky compilers, so let’s move on.
What started this
Why are we even talking about speeding up binary searches anyway? Why am I roping you into this? Cause someone else roped me into this.
Rick and Morty - Meeseeks and Destroy
I saw Malte Skarupke’s translation of “Shar’s algorithm” into a C++ binary search (branchless_lower_bound) and I couldn’t help but think “it’s not optimal”. Malte’s version sometimes compares value against the same element multiple times. So I wondered, what is the optimal ‘branchless’ binary search? That led to sb_lower_bound, which is ~20% faster than Malte’s version, which he in turn tested as 2x faster than gcc’s std::lower_bound.

“What’s an ‘optimal’ binary search anyway?” Good question. I think a binary search is ‘optimal’ if it completes by doing the minimum number of comparisons. This is very useful when you have a (relatively) slow comparison function. Malte noted his version is slower than std::lower_bound for binary searching a large number of strings.

Looking at std::lower_bound, it returns an iterator, which can point to any of the list elements but also one past the end. For a list of size n there are n+1 possible options. Thus for a list of size 2^k - 1 there are 2^k possible options. In this case the optimal number of comparisons is k. Provably so, as being able to distinguish between all 2^k options requires k bits of information, and each comparison gives us 1 bit of information (true vs false). Translating this case into code, we get:
// Really fast but only works when length is a power of 2 minus 1
// When not, it can result in out of bounds accesses, e.g. for
// array [0, 1, 2, 3, 4, 5] and value 6
auto length = last - first;
while (length > 0) {
   length /= 2;
   if (comp(first[length], value)) {
      first += length + 1;
   }
}
return first;
With clang -cmov the loop compiles down to 6 instructions, one of which is cmov. The reason this (and Malte’s code) is so fast is that only updating first depends on the comparison result.
sb_lower_bound
Now let’s look at sb_lower_bound (named after simple branchless). It actually took me longer to stumble upon than faster versions (yet to be presented) as it’s not ‘optimal’.
auto length = last - first;
while (length > 0) {
   auto rem = length % 2;
   length /= 2;
   if (comp(first[length], value)) {
      first += length + rem;
   }
}
return first;
Whenever the list we split has an even number of elements and comp returns true, we don’t skip over enough elements. For n elements, where 2^k <= n < 2^(k+1), sb_lower_bound will always make k+1 comparisons while std::lower_bound will make either k or k+1 comparisons. On average std::lower_bound will make about log2(n+1) comparisons. Overall sb_lower_bound is faster as it has significantly fewer instructions in the loop. The comparison function has to be really slow for the difference between k+1 and log2(n+1) comparisons to matter.
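For a concrete sense of scale (my arithmetic, using the bounds above): with n = 1,000,000 elements we have k = 19, since 2^19 = 524,288 <= 1,000,000 < 2^20 = 1,048,576, so sb_lower_bound always makes 20 comparisons while std::lower_bound averages about log2(1,000,001) ≈ 19.93.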
The second caveat is that, currently, gcc does not emit branchless code for sb_lower_bound regardless of optimization level. It doesn’t emit branchless code for std::lower_bound either, so they end up about as fast. We can try to write some inline assembly to force gcc to use cmov but there’s a tradeoff. The simple way results in more instructions than necessary. The alternative requires writing different assembly code for every possible type of value (int, float, etc.).
// asm_lower_bound, works for x86 only
auto length = last - first;
while (length > 0) {
   auto rem = length % 2;
   length /= 2;
   auto firstplus = first + length + rem;
   // Does a comparison which sets some x86 FLAGS like CF or ZF
   bool compare = comp(first[length], value);
   // Inline assembly doesn't support passing bools straight into FLAGS
   // so we ask the compiler to copy it from FLAGS into a register
   __asm__(
      // Then we test the register, which sets ZF=!compare and CF=0
      // Reference: https://www.felixcloutier.com/x86/test
      "test %[compare],%[compare]\n"
      // cmova copies firstplus into first if ZF=0 and CF=0
      // Reference: https://www.felixcloutier.com/x86/cmovv
      "cmova %[firstplus],%[first]\n"
      : [first] "+r"(first)
      : [firstplus] "r"(firstplus), [compare] "r"(compare)
   );
}
return first;
2x faster than the gcc version of std::lower_bound, but slightly slower than sb_lower_bound with clang -cmov. Presented just for demonstration purposes.
More optimal
I promised an even faster version in the intro. Here it is:
// bb_lower_bound
auto length = last - first;
// while length isn't a power of 2 minus 1
while (length & (length + 1)) {
   auto step = length / 8 * 6 + 1; // MAGIC
   if (comp(first[step], value)) {
      first += step + 1;
      length -= step + 1;
   } else {
      length = step;
      break;
   }
}
while (length != 0) {
   length /= 2;
   if (comp(first[length], value)) {
      first += length + 1;
   }
}
return first;
Let’s quickly go over length & (length + 1). Example: length is 110011, length+1 is 110100, length & (length+1) is 110000. Note how they always share their most significant 1 except for when length+1 carries over all set bits. That case is when length is of the form 11..1, i.e. a power of 2 minus 1, in which case length & (length + 1) will be 0. Slightly adapted from one of the bit twiddling tricks I remembered and there’s a (warning!) rabbit hole full of awesome tricks here.
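To name the property we care about, the test can be wrapped in a tiny helper. This is purely illustrative on my part; it isn’t code from the article’s repo:

// Hypothetical helper: true when length looks like 11..1 in binary,
// i.e. it is a power of 2 minus 1 (0, 1, 3, 7, 15, ...)
constexpr bool is_nice_length(size_t length) {
   return (length & (length + 1)) == 0;
}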
Let us call lengths ‘nice’ if they are powers of 2 minus 1. Earlier we found that for nice lengths we can have optimal searching. The original idea was that for a non-nice length we split the search list not in half but in a way that we’ll quickly end up only searching sub-lists of nice lengths. This idea was not quite fast enough as it meant spending significant time just splitting up the list. The first compromise is that early break, which means the second while loop may (non-optimally) search past its original length; no out of bounds accesses though, as this sub-list isn’t at the end of the full list. This compromise leads to a second compromise, that bit of MAGIC in selecting the step size. To be fast we want a large step, definitely >= length/2, so we more often break out to the faster while loop. But not a step so large that it almost equals length, because then we lose what makes a binary search fast. We’d also prefer that the step is nice or has many 1s in its binary representation, which makes the second while loop more optimal. The many 1s is why the step is forced to be odd. Last but not least, we’d want length - step - 1 to be nice.
I’ve tried quite a few variations. The most bittersweet was auto step = length >> 1; step |= step >> 1; which I thought to be a good fast heuristic that balances the listed compromises and is very close to optimal, but it ultimately ended up slightly slower than MAGIC. Another issue is that the ‘break’ makes it branchy.
One unexplored avenue is exhaustively searching for the fastest (yet still optimal) step for every length and storing that into a precomputed table, then either deriving a good heuristic from said table or using it outright. With sb_lower_bound I’ve reached my good-enough point but you’re welcome to explore further :).
auto length = last - first;
while (length & (length + 1)) {
   auto step = precomputed[length];
   if (comp(first[step], value)) {
      first += step + 1;
      length -= step + 1;
   } else {
      length = step;
   }
}
// ...
More branchless, more better?
Short answer: no. This section can be skipped but here’s the long answer: for n elements where 2^k <= n < 2^(k+1), sb_lower_bound will always do k+1 comparisons. On a 64-bit machine that means at most 64 iterations of the while (length > 0) loop. So it’s possible to write a “fully branchless” version that doesn’t have the length check by using a switch with intentional fall-through.
size_t length = last - first;
size_t rem;
switch (std::bit_width(length)) {
   case 64:
      rem = length % 2;
      length /= 2;
      first += comp(first[length], value) * (length + rem);
   case 63:
      rem = length % 2;
      length /= 2;
      first += comp(first[length], value) * (length + rem);
   // ...
   case 1:
      rem = length % 2;
      length /= 2;
      first += comp(first[length], value) * (length + rem);
}
return first;
If you’re not familiar with switch, think of it as a jump into code. In our case, to the exact place from which there are exactly the right number of comparisons left to do.

“Is it any faster?” No. Modern x86 processors handle loop conditions well as they’re predictable; we’re very likely to remain in the loop. And that’s good, especially because it saves us from writing templates or macros or copy-paste-editing the 64 cases.
Spotlight on Performance
Credit @practicemakespink
Average run time (ns):
| | std::lower_ | branchless_lower_ | asm_lower_ | sb_lower_ | sbm_lower_ | bb_lower_ |
|---|---|---|---|---|---|---|
| gcc | 80.68 | 43.65 | 54.92 | 77.22 | 34.19 | 75.64 |
| clang | 78.66 | 90.97 | 51.83 | 80.06 | 83.95 | 74.44 |
| clang -cmov | 61.30 | 43.43 | 54.32 | 33.24 | 35.54 | 32.73 |
Geometric mean run time (ns):
| | std::lower_ | branchless_lower_ | asm_lower_ | sb_lower_ | sbm_lower_ | bb_lower_ |
|---|---|---|---|---|---|---|
| gcc | 62.44 | 25.55 | 32.35 | 59.67 | 20.62 | 57.93 |
| clang | 61.24 | 65.72 | 30.67 | 63.59 | 66.91 | 58.19 |
| clang -cmov | 39.17 | 25.14 | 31.21 | 19.81 | 20.91 | 21.33 |
Runtimes in line chart form:
“What’s sbm_lower_bound?” It’s basically sb_lower_bound but modified to trick gcc into generating a conditional move. The difference is switching from the if statement to first += comp(first[length], value) * (length + rem). Use with comments and caution as the next version of gcc may undo this optimization.
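For reference, here’s a minimal sketch of what that modified loop looks like, reconstructed from the description above rather than copied from the repo:

// sbm_lower_bound (sketch): sb_lower_bound with the if replaced by a multiply
auto length = last - first;
while (length > 0) {
   auto rem = length % 2;
   length /= 2;
   // comp() returns a bool (0 or 1), so this either advances first or leaves it unchanged
   first += comp(first[length], value) * (length + rem);
}
return first;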
Benchmarking commands showing compiler options:
g++-10 -std=c++20 -Wall -O2 -march=haswell test.cpp -o test && ./test
clang++-10 -std=c++20 -Wall -O2 -march=haswell test.cpp -o test && ./test
clang++-10 -std=c++20 -Wall -O2 -march=haswell -mllvm -x86-cmov-converter=false test.cpp -o test && ./test
-march=native or no -march did not significantly influence the rankings. Benchmarked on an Intel i7 Kaby Lake.
Branch mispredictions
We can look at branch mispredictions/misses using perf:
# clang
perf stat ./test
# [..]
16,599.95 msec task-clock:u # 1.000 CPUs utilized
# [..]
30,309,755,516 instructions:u # 0.53 insn per cycle
6,941,783,502 branches:u # 418.181 M/sec
1,203,569,540 branch-misses:u # 17.34% of all branches
# clang -cmov
perf stat ./test
# [..]
10,982.97 msec task-clock:u # 1.000 CPUs utilized
# [..]
32,603,123,521 instructions:u # 0.90 insn per cycle
4,070,883,093 branches:u # 370.654 M/sec
35,954,999 branch-misses:u # 0.88% of all branches
-cmov removes ~2.9B branches and ~1.2B branch misses, so it removes branches that were mispredicted ~41% of the time. That’s close to the 50% we’d expect for purely unpredictable branches; with perfect randomness and benchmarking, -cmov would result in an even higher improvement.
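Working that ~41% out from the perf numbers above: 6,941,783,502 − 4,070,883,093 ≈ 2.87B fewer branches, 1,203,569,540 − 35,954,999 ≈ 1.17B fewer misses, and 1.17 / 2.87 ≈ 41%.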
Performance with slower comp()
For somewhat realistic scenarios of binary searching with a slower comp() function, I’ve thought of searching through ids, phone numbers, accounts and keywords. I’ve thus settled on testing searching 8-byte strings.
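To give a rough idea of the kind of setup meant here (my own illustrative guess at the shape, not the article’s actual benchmark code; assumes sb_lower_bound from above is in scope):

#include <string>
#include <vector>

int main() {
   // Hypothetical example: sorted 8-character ids; each comp() call is now a
   // std::string comparison, which is slower than comparing ints
   std::vector<std::string> ids = {"00000042", "00001337", "00004242", "13371337"};
   auto it = sb_lower_bound(ids.begin(), ids.end(), std::string("00004242"),
                            [](const std::string& a, const std::string& b) { return a < b; });
   return it - ids.begin();   // 2
}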
Average run time (ns):
| | std::lower_ | branchless_lower_ | sb_lower_ | sbm_lower_ | bb_lower_ |
|---|---|---|---|---|---|
| gcc | 160.01 | 205.24 | 165.66 | 193.96 | 163.91 |
| clang | 157.71 | 178.77 | 162.68 | 166.00 | 157.22 |
| clang -cmov | 156.06 | 193.70 | 164.71 | 181.57 | 157.48 |
In this case std::lower_bound is very slightly but consistently faster than sb_lower_bound. To always get the best performance, it is possible for libraries to use sb_lower_bound whenever directly working on primitive types and std::lower_bound otherwise.
Assembly
No bare-metal code performance optimization is complete without looking at the generated assembly. However, feel free to skip this section.
Time% | instructions
// std::lower_bound, clang -cmov, hottest loop assembly
1.68 │20: mov %rsi,%rcx // rcx = length
3.20 │ shr %rcx // rcx = length / 2
1.08 │ mov %rcx,%rdx // rdx = length / 2
4.42 │ not %rdx // rdx = binary_not(rdx) = -length / 2 - 1
2.85 │ add %rsi,%rdx // rdx = length - length / 2 - 1
63.41 │ vucomiss (%rax,%rcx,4),%xmm0 // Compare first[rcx] with value, sets FLAGS
0.21 │ lea 0x4(%rax,%rcx,4),%rsi // rsi = first + length / 2 + 1
10.14 │ cmova %rsi,%rax // first = compare_res ? rsi : first
4.87 │ cmovbe %rcx,%rdx // rdx = not compare_res ? length / 2 : rdx
0.88 │ mov %rdx,%rsi // rsi = rdx (new length)
0.33 │ test %rdx,%rdx // Set FLAGS based on rdx
3.90 │ ↑ jg 20 // Jump to instruction 20 if rdx not zero
// sb_lower_bound, clang -cmov, hottest loop assembly
4.21 │20: shr %rcx // rcx = length / 2
4.70 │ and $0x1,%esi // esi = length % 2
5.00 │ add %rcx,%rsi // rsi = length / 2 + length % 2
68.86 │ vucomiss (%rax,%rcx,4),%xmm0 // Compare first[rcx] with value, sets FLAGS
0.01 │ lea (%rax,%rsi,4),%rdx // rdx = first + length / 2 + length % 2
11.69 │ cmova %rdx,%rax // first = compare_res ? rdx : first
3.71 │ mov %rcx,%rsi // rsi = length / 2
│ test %rcx,%rcx // Set FLAGS based on rcx
0.02 │ ↑ jne 20 // Jump to instruction 20 if rcx not zero
// branchless_lower_bound, clang -cmov, hottest loop assembly
3.04 │70: lea (%rdi,%rcx,4),%rax // rax = first + length
71.22 │ vucomiss (%rdi,%rcx,4),%xmm0 // Compare first[rcx] with value, sets FLAGS
12.26 │ cmova %rax,%rdi // first = compare_res ? rax : first
1.03 │7d: shr %rcx // length /= 2, but note it also sets FLAGS
1.64 │ ↑ jne 70 // Jump to instruction 70 if length not zero
The branchless_lower_bound assembly is really short and clean. While that’s a good indicator of speed, sb_lower_bound wins out in the performance tests due to low overhead.
Conclusion
If the slowest part of your program involves searching and/or comparisons that a processor would not be able to predict, then try clang with -mllvm -x86-cmov-converter=false (if your processor is x86).

If you’d benefit from a faster binary search, try sb_lower_bound (or for gcc you could also try sbm_lower_bound). I’ve made it open source, MIT license.
If you want more articles like this, follow me @_mhdm on Twitter (I don’t post often). Alternatively, you can use RSS.
Code, including benchmarking, is available at github.com/mh-dm/sb_lower_bound/.
If you have ideas, thoughts, or something to add you can leave a comment here.
Update
Following a fruitful comment by the orlp.net author, sb_lower_bound can be refactored slightly to reduce the number of assembly instructions in the hot loop from 9 to 8.
// sb_lower_bound refactored
auto length = last - first;
while (length > 0) {
   auto half = length / 2;
   if (comp(first[half], value)) {
      // length - half equals half + length % 2
      first += length - half;
   }
   length = half;
}
return first;
With clang -cmov there’s a slight speed up from ~33ns to ~32ns for the average run time.
Prefetching
From the comments there was a recommended method for a further speed-up, namely prefetching. Prefetching pulls particular parts of memory into cache (usually L1/L2) so that when the prefetched data is actually needed the load only incurs L1/L2 cache latency (~4/12 cycles) vs slower cache (L3, ~40 cycles) or memory latency (~200 cycles). Example timings. For this there’s __builtin_prefetch(), supported in both gcc and clang.
Say we’re going to check against the element at length / 2. We could prefetch at length / 4 and length * 3 / 4. Or we could also prefetch at the length / 8 divisions, which is an extra 4 memory locations. For the length / 4 divisions, 1 out of every 2 prefetches will be wasted, doubling cache pressure. If we also do the length / 8 divisions, 5 out of every 6 prefetches will be wasted. There’s overhead in computing the locations and prefetching, overhead that will be significant in the hot loop we worked hard to make short.
Finally, if we’ve already made a few full searches the initial divisions are likely to be in the cache as many elements should fit in modern 256KB+ L2 caches.
When trying various prefetching strategies, none helped for under-256KB arrays, which is about what we expect. Long story short, here’s sb_lower_bound but with prefetching added in for 256KB+:
// sbp_lower_bound
auto length = last - first;
// Sized to roughly fit in L2 cache
constexpr int entries_per_256KB = 256 * 1024 / sizeof(T);
if (length >= entries_per_256KB) {
   constexpr int num_per_cache_line = std::max(64 / int(sizeof(T)), 1);
   while (length >= 3 * num_per_cache_line) {
      auto half = length / 2;
      __builtin_prefetch(&first[half / 2]);
      // length - half equals half + length % 2
      auto first_half1 = first + (length - half);
      __builtin_prefetch(&first_half1[half / 2]);
      first = comp(first[half], value) ? first_half1 : first;
      length = half;
   }
}
while (length > 0) {
   auto half = length / 2;
   auto first_half1 = first + (length - half);
   first = comp(first[half], value) ? first_half1 : first;
   length = half;
}
return first;
Tested in the same way as before, for sizes ranging up to ~4 million entries (or 16MB), there’s a further speed up from ~32ns to ~26ns average run time. At this point I should acknowledge that my original size stopping point happened to be too small. A case of a quick ‘should be larger than L3 cache size’ guess and a mistake in not following up. So let’s go much higher, now to ~128 million entries (or 512MB). We’re way past L3 but still at a reasonable data set size.
Runtimes in line chart form:
There’s some interesting stuff going on.
- Branchless std::lower_bound that’s generated with clang -cmov is slower than the branchy version at massive sizes. Modern cpus follow predicted branches, including loading from memory (basically a prefetch) and speculatively executing on said data (basically a security nightmare).
- sbpm_lower_bound is the prefetching version of sbm_lower_bound, which does a multiply with a boolean to trick gcc into generating branchless code.
- We’re at ~2.3x faster average time comparing std::lower_bound (~161ns) to the prefetched version (~71ns).
Faster?
Drop-in replacement for std::lower_bound
There’s a performance graph bump/weirdness between 1-10 million elements so faster is possible in theory. In practice the prefetching code is getting messy and gaining magic constants. An exciting part for me in writing this post was the (small) probability of contributing back to gcc/libstdc++ and/or llvm/libc++. I’ll stop here as that probability gets even smaller with added complexity.
Breaking std::lower_bound constraints
Comments also noted the interesting Eytzinger Binary Search variant, where the input array is reshaped (into a binary med-“heap”) so that lookups are cache friendly. I could ponder how fast a SIMD optimized K-ary tree could be, but no need to ponder: 7x to 15x faster than std::lower_bound (for a 16-ary tree of ints) as presented by Sergey Slotin at CppCon 2022.
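To make the Eytzinger idea concrete, here’s a minimal sketch of the layout and its lookup loop. This is my reconstruction of the general technique, not code from the linked talk, and it skips the SIMD and K-ary parts:

#include <bit>
#include <cstdio>
#include <vector>

static std::vector<int> sorted_in;   // sorted input
static std::vector<int> eytz;        // 1-based Eytzinger (BFS) layout, eytz[0] unused
static size_t fill_pos = 0;

// In-order traversal assigns sorted values to BFS positions: node k has children 2k and 2k+1
static void build(size_t k) {
   if (k < eytz.size()) {
      build(2 * k);
      eytz[k] = sorted_in[fill_pos++];
      build(2 * k + 1);
   }
}

// Returns the Eytzinger index of the first element >= x, or 0 if every element is smaller
static size_t eytzinger_lower_bound(int x) {
   size_t k = 1;
   while (k < eytz.size()) {
      k = 2 * k + (eytz[k] < x);          // descend left/right without an if
   }
   return k >> (std::countr_one(k) + 1);  // undo the trailing "went right" steps
}

int main() {
   sorted_in = {1, 2, 3, 4, 5, 6, 7};
   eytz.assign(sorted_in.size() + 1, 0);
   build(1);
   std::printf("%d\n", eytz[eytzinger_lower_bound(5)]);   // prints 5
}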
Footnotes
1. BUT RUST.. Rust is fast but do you really want my first post to be about unsafe code? ↩︎
2. BUT ZIG.. There’s no lower_bound()-like binary search in Zig, just this. ↩︎
3. There’s always a relevant XKCD. ↩︎
4. With clang we could do __builtin_unpredictable(comp(first[half], value)) but it does nothing (tested v10-13). ↩︎
5. gcc has __builtin_expect_with_probability(cond, 0, 0.5) but it does nothing (tested v10). ↩︎