# Fastest Branchless Binary Search

You’re a busy person so I’ll first jump right to it. Here it is, the fastest general (and simple) binary search C++ implementation:

```cpp
template <class ForwardIt, class T, class Compare>
constexpr ForwardIt sb_lower_bound(
        ForwardIt first, ForwardIt last, const T& value, Compare comp) {
    auto length = last - first;
    while (length > 0) {
        auto rem = length % 2;
        length /= 2;
        if (comp(first[length], value)) {
            first += length + rem;
        }
    }
    return first;
}
```

Same function interface as `std::lower_bound`, but **2x** faster, and shorter. “Branchless” because the `if` compiles down to a conditional move instruction rather than a branch/conditional jump. We will explore compiler options, even faster versions, fully branchless code, and caveats towards the end of the article. There’s a significant **update** too. Oh, and I’m sorry about just thrusting C++ code on you. Rude, I know. You don’t really need to know C++ to understand this article, just iterators (`first` and `last`): basically pointers to elements in an array, though note they can point one past the last array entry. Ignore `template`, `class`, `constexpr` and `&`. If only there was a clean fast bare-metal language to write all this in.. ^{1} ^{2}
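If you want to try it before reading on, here is a minimal usage sketch (the convenience overload and the example data are mine, added for illustration):

```cpp
#include <cassert>
#include <functional>
#include <vector>

template <class ForwardIt, class T, class Compare>
constexpr ForwardIt sb_lower_bound(
        ForwardIt first, ForwardIt last, const T& value, Compare comp) {
    auto length = last - first;
    while (length > 0) {
        auto rem = length % 2;
        length /= 2;
        if (comp(first[length], value)) {
            first += length + rem;
        }
    }
    return first;
}

// Convenience overload defaulting to operator<, mirroring std::lower_bound.
template <class ForwardIt, class T>
constexpr ForwardIt sb_lower_bound(ForwardIt first, ForwardIt last,
                                   const T& value) {
    return sb_lower_bound(first, last, value, std::less<>{});
}
```

It is called exactly like `std::lower_bound`: `sb_lower_bound(v.begin(), v.end(), 5)` returns an iterator to the first element not less than 5.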

## Binary search intro

You have a sorted list and want to find where a `value` would fit. In C++ you’d use `std::lower_bound`, which returns the first position (iterator) for which the comparison (usually `<`) fails, or `last` if the comparison is true for all elements. Let’s write that, so surprise coding interview question! (In C++. Good luck.)
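To make that contract concrete, a quick sketch of what `std::lower_bound` returns (the helper name and example data are mine):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the first element for which (element < value) fails.
std::size_t first_not_less(const std::vector<int>& v, int value) {
    return std::lower_bound(v.begin(), v.end(), value) - v.begin();
}
```

For `{1, 3, 3, 5}` and value 3 this returns index 1 (the first 3); for value 6 it returns index 4, i.e. `last`.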

The `binary` in binary search comes from splitting up the list into two at some middle item and doing a comparison of the middle against the given `value`. Based on the comparison result we pick which of the two lists to keep looking in. We start with a list of some `length = last - first` that starts at the given iterator `first`. We need some loop that keeps going until the list is empty, i.e. `while (length > 0)`. Pick the/a middle at `length / 2`, do a comparison, and update the current list, which we’ll do by updating `first` and `length`. Here goes:

```cpp
// std_lower_bound()
auto length = last - first;
while (length > 0) {
    auto half = length / 2;
    if (comp(first[half], value)) {
        first += half + 1;
        length -= half + 1;
    } else {
        length = half;
    }
}
return first;
```

That was straightforward. What we got was a slightly refactored version of what gcc/libstdc++ uses or what llvm/libc++ uses. Roughly the same speed, and it even (sometimes) compiles down to the same assembly in the loop.

## Branch prediction

“What’s slow about this?” Not much, but glad you asked, great question. Processors have gotten faster and faster over the years, and part of the reason why is *branch prediction*. Short explanation: for speed, the CPU attempts to execute instructions in parallel by dividing each instruction’s execution into multiple stages, say F-D-E-W (fetch, decode, execute, writeback). With a careful design it can make progress on multiple stages of different instructions at the same time (example: instruction 8 in stage F, 7 in D, 6 in E, 5 in W). The complication comes from branches in the code, i.e. conditional jump instructions, where depending on some result the next instruction is either X or Y. The CPU *predicts* one option, X, and starts running it through the stages, eventually including the stages of say X+1 and X+2. When the result becomes available and it turns out it should have been Y, all the work on X, X+1 and X+2 is thrown away. Branch mispredictions are expensive because the CPU could have already made progress on Y, Y+1 and Y+2 instead.

Branch prediction is great, but not for binary search. It’s a search, and you only search for something if you don’t know exactly where it is, otherwise you’d just get it. Which means that `if (comp(first[half], value))`

is unpredictable in common use of `std::lower_bound`

.

(Image credit: @practicemakespink)

Let’s help the processor.

-We:

```cpp
// bstd_lower_bound()
auto length = last - first;
while (length > 0) {
    auto half = length / 2;
    bool compare = comp(first[half], value);
    // No more ifs
    first = compare ? first + half + 1 : first;
    length = compare ? length - half - 1 : half;
}
return first;
```

-Clang compiler: “That’s not how this works!”

-We: `-mllvm -x86-cmov-converter=false`

-Clang compiler: “Yes, boss.”^{3}

The result is 25% faster as it uses 2 conditional move instructions. Not bad. But it turns out that `-mllvm -x86-cmov-converter=false`, which we’ll shorten to `-cmov`, speeds up `std::lower_bound` just as much, because clang is smart enough to figure out how to convert the `if`/`else` to conditional moves. gcc doesn’t have such an option and generally just doesn’t care about what you want.

Overall, there’s currently no good way to tell either clang^{4} or gcc^{5} to use a conditional move in just a certain situation. I’m trying to not make this article about finicky compilers, so let’s move on.
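For completeness, the ternary version above as a self-contained function (a sketch; whether the ternaries actually become `cmov`s depends on the compiler and the flags discussed above):

```cpp
#include <cassert>
#include <functional>
#include <vector>

template <class ForwardIt, class T, class Compare>
constexpr ForwardIt bstd_lower_bound(
        ForwardIt first, ForwardIt last, const T& value, Compare comp) {
    auto length = last - first;
    while (length > 0) {
        auto half = length / 2;
        bool compare = comp(first[half], value);
        // No more ifs
        first = compare ? first + half + 1 : first;
        length = compare ? length - half - 1 : half;
    }
    return first;
}
```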

## What started this

Why are we even talking about speeding up binary searches anyway? Why am I roping you into this? Cause someone else roped me into this.

(Image: Rick and Morty, “Meeseeks and Destroy”)

I saw Malte Skarupke’s translation of “Shar’s algorithm” into a C++ binary search (`branchless_lower_bound`) and I couldn’t help but think “it’s not optimal”. Malte’s version sometimes compares `value` against the same element multiple times. So I wondered, what is the optimal ‘branchless’ binary search? That led to `sb_lower_bound`, which is ~20% faster than Malte’s version, which he in turn measured as 2x faster than gcc’s `std::lower_bound`.

“What’s an ‘optimal’ binary search anyway?” Good question. I think a binary search is ‘optimal’ if it completes by doing the minimum number of comparisons. This is very useful when you have a (relatively) slow comparison function. Malte noted his version is slower than `std::lower_bound` for binary searching a large number of strings.

Looking at `std::lower_bound`, it returns an iterator, which can point to any of the list elements but also one past the end. For a list of size `n` there are `n+1` possible options. Thus for a list of size `2^k - 1` there are `2^k` possible options. In this case the optimal number of comparisons is `k`. Provably so, as being able to distinguish between all `2^k` options requires `k` bits of information, and each comparison gives us 1 bit of information (true vs false). Translating this case into code, we get:

```cpp
// Really fast but only works when length is a power of 2 minus 1
// When not, it can result in out of bounds accesses like for
// array [0, 1, 2, 3, 4, 5] and value 4
auto length = last - first;
while (length > 0) {
    length /= 2;
    if (comp(first[length], value)) {
        first += length + 1;
    }
}
return first;
```

With clang `-cmov` the loop compiles down to 6 instructions, one of which is `cmov`. The reason this (and Malte’s code) is so fast is that only updating `first` depends on the comparison result.
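A quick sanity check of that loop wrapped into a function and run on a length-7 (2^3 - 1) array; the wrapper name is mine:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Only valid when (last - first) is a power of 2 minus 1.
template <class ForwardIt, class T, class Compare>
constexpr ForwardIt pow2m1_lower_bound(
        ForwardIt first, ForwardIt last, const T& value, Compare comp) {
    auto length = last - first;
    while (length > 0) {
        length /= 2;
        if (comp(first[length], value)) {
            first += length + 1;
        }
    }
    return first;
}
```

On a 7-element array it does exactly 3 comparisons (lengths 7 → 3 → 1 → 0), matching the `k` bound above.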

## `sb_lower_bound`

Now let’s look at `sb_lower_bound` (named after *simple branchless*). It actually took me longer to stumble upon than faster versions (yet to be presented) as it’s not ‘optimal’.

```cpp
auto length = last - first;
while (length > 0) {
    auto rem = length % 2;
    length /= 2;
    if (comp(first[length], value)) {
        first += length + rem;
    }
}
return first;
```

Every time we split the list, if the number of elements happens to be even and comp returns true, then we don’t skip over enough elements. For `n` elements, where `2^k <= n < 2^(k+1)`, `sb_lower_bound` will always make `k+1` comparisons, while `std::lower_bound` will make either `k` or `k+1` comparisons. On average `std::lower_bound` will make about `log2(n+1)` comparisons. Overall `sb_lower_bound` is faster as it has significantly fewer instructions in the loop. The comparison function has to be really slow for the difference between `k+1` and `log2(n+1)` comparisons to matter.
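You can observe the `k+1` comparison count directly with a counting comparator; a sketch (helper names are mine; a 6-element array means k=2, so every search should do exactly 3 comparisons):

```cpp
#include <cassert>
#include <vector>

template <class ForwardIt, class T, class Compare>
constexpr ForwardIt sb_lower_bound(
        ForwardIt first, ForwardIt last, const T& value, Compare comp) {
    auto length = last - first;
    while (length > 0) {
        auto rem = length % 2;
        length /= 2;
        if (comp(first[length], value)) {
            first += length + rem;
        }
    }
    return first;
}

// Counts how many comparisons one sb_lower_bound call makes.
inline int count_comparisons(const std::vector<int>& v, int value) {
    int count = 0;
    auto counting_less = [&count](int a, int b) { ++count; return a < b; };
    sb_lower_bound(v.begin(), v.end(), value, counting_less);
    return count;
}
```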

Second caveat is that currently gcc does not emit branchless code for `sb_lower_bound` regardless of optimization level. It doesn’t emit branchless code for `std::lower_bound` either, so they end up about as fast. We can try writing some inline assembly to force gcc to use `cmov`, but there’s a tradeoff. The simple way results in more instructions than necessary. The alternative requires writing different assembly code for every possible type of `value` (int, float, etc.).

```cpp
// asm_lower_bound, works for x86 only
auto length = last - first;
while (length > 0) {
    auto rem = length % 2;
    length /= 2;
    auto firstplus = first + length + rem;
    // Does a comparison which sets some x86 FLAGS like CF or ZF
    bool compare = comp(first[length], value);
    // Inline assembly doesn't support passing bools straight into FLAGS
    // so we ask the compiler to copy it from FLAGS into a register
    __asm__(
        // Then we test the register, which sets ZF=!compare and CF=0
        // Reference: https://www.felixcloutier.com/x86/test
        "test %[compare],%[compare]\n"
        // cmova copies firstplus into first if ZF=0 and CF=0
        // Reference: https://www.felixcloutier.com/x86/cmovcc
        "cmova %[firstplus],%[first]\n"
        : [first] "+r"(first)
        : [firstplus] "r"(firstplus), [compare] "r"(compare)
    );
}
return first;
```

2x faster than the gcc version of `std::lower_bound`, but slightly slower than `sb_lower_bound` with clang `-cmov`. Presented just for demonstration purposes.

## More optimal

I promised an even faster version in the intro. Here it is:

```cpp
// bb_lower_bound
auto length = last - first;
// while length isn't a power of 2 minus 1
while (length & (length + 1)) {
    auto step = length / 8 * 6 + 1; // MAGIC
    if (comp(first[step], value)) {
        first += step + 1;
        length -= step + 1;
    } else {
        length = step;
        break;
    }
}
while (length != 0) {
    length /= 2;
    if (comp(first[length], value)) {
        first += length + 1;
    }
}
return first;
```

Let’s quickly go over `length & (length + 1)`. Example: `length` is `110011`, `length+1` is `110100`, `length & (length+1)` is `110000`. Note how they always share their most significant `1`, except for when `length+1` carries over all set bits. That case is when `length` is of the form `11..1`, i.e. a power of 2 minus 1, in which case `length & (length + 1)` will be 0. Slightly adapted from one of the bit twiddling tricks I remembered, and there’s a (warning!) rabbit hole full of awesome tricks here.
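A sketch verifying the trick (the helper name is mine):

```cpp
#include <cassert>
#include <cstdint>

// True iff length is of the form 11..1 in binary, i.e. a power of 2 minus 1.
// (0 also qualifies: it is 2^0 - 1.)
constexpr bool is_pow2_minus_1(std::uint64_t length) {
    return (length & (length + 1)) == 0;
}
```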

Let us call lengths ‘nice’ if they are powers of 2 minus 1. Earlier we found that for nice lengths we can have optimal searching. The original idea was that for non-nice lengths we split the search list not in half but in a way that we’ll quickly end up only searching sub-lists of nice lengths. This idea was not quite fast enough, as it meant spending significant time just splitting up the list. The first compromise is that early `break`, which means the second `while` loop may (non-optimally) search past its original length; no out of bounds accesses though, as this sub-list isn’t at the end of the full list. This compromise leads to a second compromise, that bit of MAGIC in selecting the `step` size. To be fast we want a large `step`, definitely `>= length/2`, so we more often break out to the faster `while` loop. But not so large a `step` that it almost equals `length`, because then we lose what makes a binary search fast. We’d also prefer that the `step` is nice or has many `1`s in its binary representation, which makes the second `while` loop more optimal. The many `1`s is why the `step` is forced to be odd. Last but not least, we’d want `length - step - 1` to be nice.

I’ve tried quite a few variations. The most bittersweet was `auto step = length >> 1; step |= step >> 1;`, which I thought to be a good fast heuristic that balances the listed compromises. It’s very close to optimal, but ultimately ended up slightly slower than MAGIC. Another issue is that the `break` makes it branchy.

One unexplored avenue is exhaustively searching for the fastest (yet still optimal) `step` for every length and storing that into a precomputed table, then either deriving a good heuristic from said table or using it outright. With `sb_lower_bound` I’ve reached my good-enough point, but you’re welcome to explore further :).

```cpp
auto length = last - first;
while (length & (length + 1)) {
    auto step = precomputed[length];
    if (comp(first[step], value)) {
        first += step + 1;
        length -= step + 1;
    } else {
        length = step;
    }
}
// ...
```

## More branchless, more better?

Short answer: no. This section can be skipped, but here’s the long answer: for `n` elements where `2^k <= n < 2^(k+1)`, `sb_lower_bound` will always do `k+1` comparisons. On a 64-bit machine that means at most 64 iterations of the `while (length > 0)` loop. So it’s possible to write a “fully branchless” version that doesn’t have the `length` check by using a `switch` with intentional fall-through.

```cpp
size_t length = last - first;
size_t rem;
switch (std::bit_width(length)) {
    case 64:
        rem = length % 2;
        length /= 2;
        first += comp(first[length], value) * (length + rem);
    case 63:
        rem = length % 2;
        length /= 2;
        first += comp(first[length], value) * (length + rem);
    // ...
    case 1:
        rem = length % 2;
        length /= 2;
        first += comp(first[length], value) * (length + rem);
}
return first;
```

If you’re not familiar with `switch`, think of it as a jump into code, in our case to the exact place from which there are exactly the right number of comparisons left to do.

“Is it any faster?” No. Modern x86 processors handle loop conditions well as they’re predictable; we’re very likely to remain in the loop. And that’s good, especially because it saves us from writing templates or macros or copy-paste-editing the 64 cases.

## Spotlight on Performance

(Image credit: @practicemakespink)

Average run time (ns):

| | `std::lower_` | `branchless_lower_` | `asm_lower_` | `sb_lower_` | `sbm_lower_` | `bb_lower_` |
|---|---|---|---|---|---|---|
| gcc | 80.68 | 43.65 | 54.92 | 77.22 | 34.19 | 75.64 |
| clang | 78.66 | 90.97 | 51.83 | 80.06 | 83.95 | 74.44 |
| clang -cmov | 61.30 | 43.43 | 54.32 | 33.24 | 35.54 | 32.73 |

Geometric mean run time (ns):

| | `std::lower_` | `branchless_lower_` | `asm_lower_` | `sb_lower_` | `sbm_lower_` | `bb_lower_` |
|---|---|---|---|---|---|---|
| gcc | 62.44 | 25.55 | 32.35 | 59.67 | 20.62 | 57.93 |
| clang | 61.24 | 65.72 | 30.67 | 63.59 | 66.91 | 58.19 |
| clang -cmov | 39.17 | 25.14 | 31.21 | 19.81 | 20.91 | 21.33 |

Runtimes in line chart form:

“What’s `sbm_lower_bound`?” It’s basically `sb_lower_bound` but modified to trick gcc into generating a conditional move. The difference is switching from the `if` statement to `first += comp(first[length], value) * (length + rem)`. Use with comments and caution, as the next version of gcc may undo this optimization.
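Spelled out as a full function, that one-line change looks like this (a sketch based on the description above; whether gcc actually emits `cmov` for it should be re-verified per compiler version):

```cpp
#include <cassert>
#include <functional>
#include <vector>

template <class ForwardIt, class T, class Compare>
constexpr ForwardIt sbm_lower_bound(
        ForwardIt first, ForwardIt last, const T& value, Compare comp) {
    auto length = last - first;
    while (length > 0) {
        auto rem = length % 2;
        length /= 2;
        // Multiply by the bool instead of branching on it.
        first += comp(first[length], value) * (length + rem);
    }
    return first;
}
```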

Benchmarking commands showing compiler options:

```shell
g++-10 -std=c++20 -Wall -O2 -march=haswell test.cpp -o test && ./test
clang++-10 -std=c++20 -Wall -O2 -march=haswell test.cpp -o test && ./test
clang++-10 -std=c++20 -Wall -O2 -march=haswell -mllvm -x86-cmov-converter=false test.cpp -o test && ./test
```

`-march=native` or no `-march` did not significantly influence the rankings. Benchmarked on an Intel i7 (Kaby Lake).

### Branch mispredictions

We can look at branch mispredictions/misses using `perf`:

```shell
# clang
perf stat ./test
# [..]
16,599.95 msec task-clock:u # 1.000 CPUs utilized
# [..]
30,309,755,516 instructions:u # 0.53 insn per cycle
6,941,783,502 branches:u # 418.181 M/sec
1,203,569,540 branch-misses:u # 17.34% of all branches
# clang -cmov
perf stat ./test
# [..]
10,982.97 msec task-clock:u # 1.000 CPUs utilized
# [..]
32,603,123,521 instructions:u # 0.90 insn per cycle
4,070,883,093 branches:u # 370.654 M/sec
35,954,999 branch-misses:u # 0.88% of all branches
```

`-cmov` removes ~2.9B branches and ~1.2B branch misses, so it removes branches that were mispredicted ~41% of the time. That’s close to the 50% we’d expect for purely unpredictable branches; with perfect randomness and benchmarking, `-cmov` would show an even higher improvement.

### Performance with slower `comp()`

For somewhat realistic scenarios of binary searching with a slower `comp()` function, I’ve thought of searching through ids, phone numbers, accounts and keywords. I’ve thus settled on testing searching 8-byte strings.

Average run time (ns):

| | `std::lower_` | `branchless_lower_` | `sb_lower_` | `sbm_lower_` | `bb_lower_` |
|---|---|---|---|---|---|
| gcc | 160.01 | 205.24 | 165.66 | 193.96 | 163.91 |
| clang | 157.71 | 178.77 | 162.68 | 166.00 | 157.22 |
| clang -cmov | 156.06 | 193.70 | 164.71 | 181.57 | 157.48 |

In this case `std::lower_bound` is very slightly but consistently faster than `sb_lower_bound`. To always get the best performance, libraries could use `sb_lower_bound` whenever directly working on primitive types and `std::lower_bound` otherwise.

## Assembly

No bare-metal code performance optimization is complete without looking at the generated assembly. However, feel free to skip this section.

```asm
Time% | instructions
// std::lower_bound, clang -cmov, hottest loop assembly
1.68 │20: mov %rsi,%rcx // rcx = length
3.20 │ shr %rcx // rcx = length / 2
1.08 │ mov %rcx,%rdx // rdx = length / 2
4.42 │ not %rdx // rdx = binary_not(rdx) = -length / 2 - 1
2.85 │ add %rsi,%rdx // rdx = length - length / 2 - 1
63.41 │ vucomiss (%rax,%rcx,4),%xmm0 // Compare first[rcx] with value, sets FLAGS
0.21 │ lea 0x4(%rax,%rcx,4),%rsi // rsi = first + length / 2 + 1
10.14 │ cmova %rsi,%rax // first = compare_res ? first : rsi
4.87 │ cmovbe %rcx,%rdx // rdx = not compare_res ? length / 2 : rdx
0.88 │ mov %rdx,%rsi // rsi = rdx (new length)
0.33 │ test %rdx,%rdx // Set FLAGS based on rdx
3.90 │ ↑ jg 20 // Jump to instruction 20 if rdx not zero
// sb_lower_bound, clang -cmov, hottest loop assembly
4.21 │20: shr %rcx // rcx = length / 2
4.70 │ and $0x1,%esi // esi = length % 2
5.00 │ add %rcx,%rsi // rsi = length / 2 + length % 2
68.86 │ vucomiss (%rax,%rcx,4),%xmm0 // Compare first[rcx] with value, sets FLAGS
0.01 │ lea (%rax,%rsi,4),%rdx // rdx = first + length / 2 + length % 2
11.69 │ cmova %rdx,%rax // first = compare_res ? first : rdx
3.71 │ mov %rcx,%rsi // rsi = length / 2
│ test %rcx,%rcx // Set FLAGS based on rcx
0.02 │ ↑ jne 20 // Jump to instruction 20 if rcx not zero
// branchless_lower_bound, clang -cmov, hottest loop assembly
3.04 │70: lea (%rdi,%rcx,4),%rax // rax = first + length
71.22 │ vucomiss (%rdi,%rcx,4),%xmm0 // Compare first[rcx] with value, sets FLAGS
12.26 │ cmova %rax,%rdi // first = compare_res ? first : rax
1.03 │7d: shr %rcx // length /= 2, but note it also sets FLAGS
1.64 │ ↑ jne 70 // Jump to instruction 70 if length not zero
```

The `branchless_lower_bound` assembly is really short and clean. While that’s a good indicator of speed, `sb_lower_bound` wins out in the performance tests due to low overhead.

## Conclusion

If the slowest part of your program involves searching and/or comparisons that a processor would not be able to predict, then try clang with `-mllvm -x86-cmov-converter=false` (if your processor is x86).

If you’d benefit from a faster binary search, try `sb_lower_bound` (or for gcc you could also try `sbm_lower_bound`). I’ve made it open source, MIT license.

If you want more articles like this, follow me @_mhdm on Twitter (I don’t post often). Alternatively, you can use RSS.

Code, including benchmarking, is available at github.com/mh-dm/sb_lower_bound/.

If you have ideas, thoughts, or something to add you can leave a comment here.

## Update

Following a fruitful comment by the orlp.net author, `sb_lower_bound` can be refactored slightly to reduce the number of assembly instructions in the hot loop from 9 to 8.

```cpp
// sb_lower_bound refactored
auto length = last - first;
while (length > 0) {
    auto half = length / 2;
    if (comp(first[half], value)) {
        // length - half equals half + length % 2
        first += length - half;
    }
    length = half;
}
return first;
```

With `clang -cmov` there’s a slight speed up from ~33ns to ~32ns for the average run time.
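A quick check that the refactored loop still matches `std::lower_bound` (the wrapper name is mine):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

template <class ForwardIt, class T, class Compare>
constexpr ForwardIt sb_lower_bound_v2(
        ForwardIt first, ForwardIt last, const T& value, Compare comp) {
    auto length = last - first;
    while (length > 0) {
        auto half = length / 2;
        if (comp(first[half], value)) {
            // length - half equals half + length % 2
            first += length - half;
        }
        length = half;
    }
    return first;
}
```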

### Prefetching

From the comments there was a recommended method for a further speed-up: prefetching. Prefetching pulls particular parts of memory into cache (usually L1/L2) so that when prefetched data is actually needed, the load only incurs L1/L2 cache latency (~4/12 cycles) vs slower cache (L3, ~40 cycles) or memory latency (~200 cycles). Example timings. For this there’s `__builtin_prefetch()`, supported in both gcc and clang.

Say we’re going to check against the element at `length / 2`. We could prefetch at `length / 4` and `length * 3 / 4`. Or we could also prefetch at `length / 8` divisions, which is an extra 4 memory locations. For `length / 4` divisions, 1 out of every 2 prefetches will be wasted, doubling cache pressure. If we also prefetch at `length / 8` divisions, 5 out of every 6 prefetches will be wasted. There’s overhead in computing the locations and prefetching, overhead that will be significant in the hot loop we worked hard to make short.

Finally, if we’ve already made a few full searches the initial divisions are likely to be in the cache as many elements should fit in modern 256KB+ L2 caches.

When trying various prefetching strategies, none helped for under 256KB arrays, which is about what we expect. Long story short, here’s `sb_lower_bound` but with prefetching added in for 256KB+:

```cpp
// sbp_lower_bound
auto length = last - first;
// Sized to roughly fit in L2 cache
constexpr int entries_per_256KB = 256 * 1024 / sizeof(T);
if (length >= entries_per_256KB) {
    constexpr int num_per_cache_line = std::max(64 / int(sizeof(T)), 1);
    while (length >= 3 * num_per_cache_line) {
        auto half = length / 2;
        __builtin_prefetch(&first[half / 2]);
        // length - half equals half + length % 2
        auto first_half1 = first + (length - half);
        __builtin_prefetch(&first_half1[half / 2]);
        first = comp(first[half], value) ? first_half1 : first;
        length = half;
    }
}
while (length > 0) {
    auto half = length / 2;
    auto first_half1 = first + (length - half);
    first = comp(first[half], value) ? first_half1 : first;
    length = half;
}
return first;
```

Tested in the same way as before, for sizes ranging up to ~4 million entries (or 16MB), there’s a further speed up from ~32ns to ~26ns average run time. At this point I should acknowledge that my original size stopping point happened to be too small. A case of a quick ‘should be larger than L3 cache size’ and a mistake in not following up. So let’s go much higher, now to ~128 million entries (or 512MB). We’re way past L3 but still in reasonable data set size.

Runtimes in line chart form:

There’s some interesting stuff going on.

- Branchless `std::lower_bound` that’s generated with `clang -cmov` is slower than the branchy version at massive sizes. Modern cpus follow predicted branches, including loading from memory (basically a prefetch) and speculatively executing on said data (basically a security nightmare).
- `sbpm_lower_bound` is the prefetching version of `sbm_lower_bound`, which does a multiply with a boolean to trick gcc into generating branchless code.
- We’re at ~2.3x faster average time comparing `std::lower_bound` (~161ns) to the prefetched version (~71ns).

### Faster?

#### Drop-in replacement for `std::lower_bound`

There’s a performance graph bump/weirdness between 1-10 million elements so faster is possible in theory. In practice the prefetching code is getting messy and gaining magic constants. An exciting part for me in writing this post was the (small) probability of contributing back to gcc/libstdc++ and/or llvm/libc++. I’ll stop here as that probability gets even smaller with added complexity.

#### Breaking `std::lower_bound` constraints

Comments also noted the interesting Eytzinger binary search variant, where the input array is reshaped (into a heap-like order) so lookups are cache friendly. I could ponder how fast a SIMD optimized K-ary tree could be, but no need to ponder: 7x to 15x faster than `std::lower_bound` (for a 16-ary tree of ints), as presented by Sergey Slotin at CppCon 2022.
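For the curious, a minimal sketch of the Eytzinger idea (a branchless descent over a 1-indexed heap-order array; this is my own illustrative code, not Slotin’s SIMD version):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorder a sorted array into Eytzinger (BFS/heap) order, 1-indexed.
std::size_t eytzinger_build(const std::vector<int>& sorted,
                            std::vector<int>& tree,
                            std::size_t i = 0, std::size_t k = 1) {
    if (k <= sorted.size()) {
        i = eytzinger_build(sorted, tree, i, 2 * k);
        tree[k] = sorted[i++];
        i = eytzinger_build(sorted, tree, i, 2 * k + 1);
    }
    return i;
}

// Returns the tree index of the first element >= value, or 0 if none.
std::size_t eytzinger_lower_bound(const std::vector<int>& tree, int value) {
    std::size_t k = 1;
    while (k < tree.size()) {
        // Go left on >=, right on <; the comparison feeds straight into k.
        k = 2 * k + (tree[k] < value);
    }
    // Undo the descent steps taken after the answer was passed.
    return k >> __builtin_ffsll((long long)~k);
}
```

Note how the hot loop touches only `k` and one array element, and each level of the implicit tree maps to a predictable memory layout, which is what makes the layout cache (and prefetch) friendly.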

#### Footnotes

1. BUT RUST.. Rust is fast but do you really want my first post to be about unsafe code? ↩︎
2. BUT ZIG.. There’s no `lower_bound()`-like binary search in Zig, just this. ↩︎
3. There’s always a relevant XKCD. ↩︎
4. With clang we could do `__builtin_unpredictable(comp(first[half], value))` but it does nothing (tested v10-13). ↩︎
5. gcc has `__builtin_expect_with_probability(cond, 0, 0.5)` but it does nothing (tested v10). ↩︎
