# State-of-the-Art Exact Binary Vector Search for RAG in 100 lines of Julia

May 16, 2024

## Updates

Thanks for HN users

`mik1998`

and`borodi`

who brought up the popcnt instruction and the count_ones function in Julia which carries this out, I've updated the timings and it's even faster now.

Whether you use Int8, UInt8, or Int64 the performance the same when performing the actual search (on my M1 air). This might change depending on the architecture you're using. The best representation is probably UInt64s. as this will likely give you the most transferrable result across different architectures.

The best search timings come from either 1) Vector of StaticArrays and StaticArray query vector or 2) Matrix of elements as (num_bytes, num_rows) and StaticArray query. So could only make the query array static and get the same performance.

This is an experiment in how quickly precise RAG lookups could be performed with a binary vector space.

## What is RAG?

"Retrieval Augmented Generation" (RAG) is augmentation to LLM content generation. The basic architecture is to call the LLM API with a prompt and query.

But sometimes we would like to programmatically add context to the prompt so that the query can be better answered. This is especially useful if we want to query internal documents, which the LLM should have no knowledge of, or simply improve the Q/A capabilities of the LLM in a specific domain.

RAG is the process of determining what context to add. This is typically done by encoding the query into a vector space and then finding the closest vectors in the database to the query. Each vector represents hundreds or thousands of tokens.

## Binary Vector Space

Why binary vector space?

It turns out the accuracy is very similar to a full 32-bit vector. But we save a lot in terms of server costs, and it makes in-memory retrieval more feasible.

A database that was once 1TB is now ~32GB, which can easily fit in RAM on a much cheaper setup.

Assuming each row is a 1024-dim float32 vector, that's 4096 bytes per row. or ~244 million rows.

For some added context Wikipedia, which is arguably the largest publicly available data to RAG might only be around 15 million rows.

https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

- 6,818,523 articles
- average 668 words per article (4.55B billion words)
- According to OpenAI 750 words is roughly 1000 tokens. So let's say 6B tokens.
- if we use the mixedbread model we can represent 512 tokens in a vector, assuming overlap let’s say 400 tokens new tokens in each vector with 112 being used for overlap

6B / 400 = ~15,000,000 vectors

Chances are internal RAGs are going to be significantly smaller than this.

And so this opens up the question:

"At what point is the dataset too big that exact brute search over binary vectors is not entirely feasible?"

If our dataset is smaller than this, then we can just do the dumb thing in a very small amount of code and be completely fine.

We might not need a fancy vector DB or the newest HSNW library (HNSW is approximate search anyway, we're doing exact).

I'm doing this in Julia because:

- I like it. :-)
- It should be able to get comparable performance for this relative to C, C++ or Rust. So we're not missing out on potential gains.

`julia> versioninfo() Julia Version 1.11.0-beta1 Commit 08e1fc0abb9 (2024-04-10 08:40 UTC) Build Info: Official https://julialang.org/ release Platform Info: OS: macOS (arm64-apple-darwin22.4.0) CPU: 8 × Apple M1 WORD_SIZE: 64 LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1) Threads: 4 default, 0 interactive, 2 GC (on 4 virtual cores) Environment: JULIA_STACKTRACE_MINIMAL = true DYLD_LIBRARY_PATH = /Users/lunaticd/.wasmedge/lib JULIA_EDITOR = nvim`

Ok, let's begin!

## Implementation

Each vector is going to be n bits, where n in the size of the embedding. This will vary depending on the embedder used, there are several proprietary and open source variations. I'll be assuming we're using the mixedbread model: `mixedbread-ai/mxbai-embed-large-v1`

This model returns a 1024 element float32 vector. Through binarization, this is converted into a 1024 bit vector, which is then represented as a 128 element int8 vector or 128 bytes. Additionally, this can be further reduced to 64 bytes, which is our final representation.

A 512 bit vector represented as 64 int8 or uint8 elements.

`julia> x = rand(Int8, 64) 64-element Vector{Int8}: 26 123 70 -2 ⋮`

Keep in mind that although the representation is in bytes we're operating at the bit level.

Comparing two bit vectors will require using the Hamming distance. Luckily the definition is very straightforward, each time a bit in the vectors doesn't match add 1 to the distance. So two bit vectors of length 512 that are entirely different will have a hamming distance of 512, and if they are the same a distance of 0.

An easy to do this is to turn the integer representation into a bitstring and then compute the distance on that.

`julia> bitstring(Int8(125)) "01111101" julia> bitstring(Int8(33)) "00100001" julia> function hamming_distance(s1::AbstractString, s2::AbstractString)::Int s = 0 for (c1, c2) in zip(s1, s2) if c1 != c2 s += 1 end end s end julia> hamming_distance(bitstring(Int8(125)), bitstring(Int8(33))) # should be 4 4`

Julia has a lovely benchmarking package called `Chairmarks`

. Let's see how fast this implementation is.

`julia> using Chairmarks julia> @be hamming_distance(bitstring(Int8(125)), bitstring(Int8(33))) Benchmark: 6053 samples with 317 evaluations min 36.672 ns (4 allocs: 128 bytes) median 38.514 ns (4 allocs: 128 bytes) mean 46.525 ns (4 allocs: 128 bytes, 0.13% gc time) max 8.737 μs (4 allocs: 128 bytes, 98.30% gc time)`

This is honestly pretty good. For comparison let's do a simple addition:

`julia> @be 125 + 33 Benchmark: 7274 samples with 8775 evaluations min 1.239 ns median 1.339 ns mean 1.453 ns max 6.990 ns`

We're not going to get lower than this for an operation. The nanosecond is basically the atomic measurement unit for time when working with CPUs.

However, this does not mean each operation has to take ~1ns. There are various optimizations that hardware can do such that several operations are done each clock cycle. Each core of a CPU has a number of arithmetic units which are used to process elementary operations (addition, subtraction, xor, or, and, etc), the arithmetic units are fed by what is known as pipelining. The point is there is more than 1 arithmetic unit, so we can conceivably do several additons or bit operations in the same clock cycle as long as they do not depend on each other. Let's rewrite `hamming_distance`

using only elementary operations so that we might take advantage of this.

Let's consider comparing two bits:

- 0, 0 -> 0
- 1, 0 -> 1
- 0, 1 -> 1
- 1, 1 -> 0

This is the XOR operation. We want to XOR all the bits and keep track of which ones are 1, and return the sum.

In order to do this over all the bits we can shift the integer by the number of bits, for an 8-bit number we would perform the shift 7 times.

Here's what this looks like:

`@inline function hamming_distance(x1::T, x2::T)::Int where {T<:Union{Int8,UInt8}} r = x1 ⊻ x2 # xor # how many xored bits are 1 c = 0 for i in 0:7 c += (r >> i) & 1 end return Int(c) end # benchmark again julia> @be hamming_distance(Int8(33), Int8(125)) Benchmark: 4797 samples with 8966 evaluations min 2.175 ns median 2.189 ns mean 2.205 ns max 5.279 ns`

This is huge improvement, almost as fast as doing an addition operation!

There's a builtin function in Julia `count_ones`

lowers to a machine instruction popcnt that counts all the ones so we can just use that instead doing the shifts!

`@inline function hamming_distance(x1::T, x2::T)::Int where {T<:Union{Int8,UInt8}} return Int(count_ones(x1 ⊻ x2)) end`

Now it's the same as doing an addition operation!

`julia> @be hamming_distance(Int8(33), Int8(125)) Benchmark: 5800 samples with 13341 evaluations min 1.243 ns median 1.249 ns mean 1.251 ns max 3.760 ns`

Now we need to do this for an vector:

`@inline function hamming_distance1(x1::AbstractArray{T}, x2::AbstractArray{T})::Int where {T<:Union{Int8,UInt8}} s = 0 for i in 1:length(x1) s += hamming_distance(x1[i], x2[i]) end s end`

We expect this to take around 128ns (~2 * 64).

`julia> @be hamming_distance1(q1, q2) Benchmark: 4469 samples with 256 evaluations min 69.500 ns median 75.035 ns mean 80.240 ns max 189.941 ns`

There are a few more things we can do do further optimize this loop, the compiler sometimes does these automatically but we can manually annotate it to be certain.

`@inline function hamming_distance(x1::AbstractArray{T}, x2::AbstractArray{T})::Int where {T<:Union{Int8,UInt8}} s = 0 @inbounds @simd for i in eachindex(x1, x2) s += hamming_distance(x1[i], x2[i]) end s end`

`@inbounds`

removes any boundary checking.`@simd`

tells the compiler to use SIMD instructions if possible.

`julia> @be hamming_distance(q1, q2) Benchmark: 4441 samples with 323 evaluations min 52.759 ns median 56.245 ns mean 61.721 ns max 163.827 ns`

That's a decent improvement. This benchmark is < 20ns on a fresh REPL session. I think it depends current background usage and how well SIMD can be utilized at that time.

`julia> q1 = rand(Int8, 64); julia> q2 = rand(Int8, 64); julia> hamming_distance(q1, q2); julia> @be hamming_distance1(q1, q2) Benchmark: 4381 samples with 1113 evaluations min 17.932 ns median 18.381 ns mean 19.253 ns max 54.956 ns`

It's good to know in a perfect environment, such as an isolated machine, you could make really good use of the hardware.

Let's say each comparison takes 50ns, then we can search over 20M rows in a second, or 80M rows if we used 4 cores.

1M rows should take 12ms over 4 cores. And Wikipedia at 15M rows 180ms.

In practice it turns out it's faster than this:

`mutable struct MaxHeap const data::Vector{Pair{Int,Int}} current_idx::Int # add pairs until current_idx > length(data) const k::Int function MaxHeap(k::Int) new(fill((typemax(Int) => -1), k), 1, k) end end function insert!(heap::MaxHeap, value::Pair{Int,Int}) if heap.current_idx <= heap.k heap.data[heap.current_idx] = value heap.current_idx += 1 if heap.current_idx > heap.k makeheap!(heap) end elseif value.first < heap.data[1].first heap.data[1] = value heapify!(heap, 1)`

The heap structure can be ignored, it's just a max heap with a maximum size. Each section of the database being searched has it's own heap and then sort and pick the top k results at the end.

## Benchmarks

1M rows benchmark:

`julia> X1 = [rand(Int8, 64) for _ in 1:(10^6)]; julia> k_closest(X1, q1, 1) 1-element Vector{Pair{Int64, Int64}}: 202 => 76839 julia> @be k_closest(X1, q1, 1) Benchmark: 17 samples with 1 evaluation min 5.571 ms (3 allocs: 112 bytes) median 5.618 ms (3 allocs: 112 bytes) mean 5.853 ms (3 allocs: 112 bytes) max 6.996 ms (3 allocs: 112 bytes) julia> k_closest_parallel(X1, q1, 1) 1-element Vector{Pair{Int64, Int64}}: 202 => 76839 julia> @be k_closest_parallel(X1, q1, 1) Benchmark: 32 samples with 1 evaluation min 3.022 ms (54 allocs: 3.672 KiB)`

Each vector comparison is ~6ns rather than ~50ns.

But wait, There's more! We can do even better using `StaticArrays`

:

`julia> using StaticArrays julia> q1 = SVector{64, Int8}(rand(Int8, 64)); julia> X1 = [SVector{64, Int8}(rand(Int8, 64)) for _ in 1:(10^6)]; julia> @be k_closest(X1, q1, 1) Benchmark: 22 samples with 1 evaluation min 4.049 ms (3 allocs: 112 bytes) median 4.245 ms (3 allocs: 112 bytes) mean 4.453 ms (3 allocs: 112 bytes) max 6.010 ms (3 allocs: 112 bytes) julia> @be k_closest_parallel(X1, q1, 1) Benchmark: 86 samples with 1 evaluation min 1.120 ms (54 allocs: 3.672 KiB) median 1.133 ms (54 allocs: 3.672 KiB) mean 1.144 ms (54 allocs: 3.672 KiB) max 1.311 ms (54 allocs: 3.672 KiB)`

Now each vector comparison is taking ~4ns! Also notice that the parallel implementation looks to be better utilizing the cores available, improvement the runtime by 4x, since we have 4 cores, instead of just 2x.

However, we don't want just closest vector, but the "k" closest ones, so let's set "k" to something more realistic:

`julia> @be k_closest(X1, q1, 20) Benchmark: 22 samples with 1 evaluation min 4.033 ms (5 allocs: 832 bytes) median 4.267 ms (5 allocs: 832 bytes) mean 4.511 ms (5 allocs: 832 bytes) max 5.910 ms (5 allocs: 832 bytes) julia> @be k_closest_parallel(X1, q1, 20) Benchmark: 85 samples with 1 evaluation min 1.126 ms (56 allocs: 7.828 KiB) median 1.137 ms (56 allocs: 7.828 KiB) mean 1.142 ms (56 allocs: 7.828 KiB) max 1.194 ms (56 allocs: 7.828 KiB) julia> @be k_closest(X1, q1, 100) Benchmark: 23 samples with 1 evaluation min 4.108 ms (5 allocs: 3.281 KiB) median 4.270 ms (5 allocs: 3.281 KiB) mean 4.344 ms (5 allocs: 3.281 KiB) max 5.472 ms (5 allocs: 3.281 KiB)`

Similar timings since we're using a heap to order the entries.

Just for fun let's search over 15M rows (Wikipedia).

`julia> X1 = [SVector{64, Int8}(rand(Int8, 64)) for _ in 1:(15 * 10^6)]; julia> @be k_closest_parallel(X1, q1, 10) Benchmark: 6 samples with 1 evaluation min 16.534 ms (56 allocs: 5.625 KiB) median 16.637 ms (56 allocs: 5.625 KiB) mean 16.628 ms (56 allocs: 5.625 KiB) max 16.762 ms (56 allocs: 5.625 KiB) julia> @be k_closest_parallel(X1, q1, 50) Benchmark: 6 samples with 1 evaluation min 16.506 ms (58 allocs: 14.062 KiB) median 16.672 ms (58 allocs: 14.062 KiB) mean 16.669 ms (58 allocs: 14.062 KiB) max 16.777 ms (58 allocs: 14.062 KiB)`

And over 100,000 rows:

`julia> X1 = [SVector{64, Int8}(rand(Int8, 64)) for _ in 1:(10^5)]; julia> @be k_closest_parallel(X1, q1, 30) Benchmark: 679 samples with 1 evaluation min 120.250 μs (56 allocs: 10.000 KiB) median 131.792 μs (56 allocs: 10.000 KiB) mean 137.709 μs (56 allocs: 10.000 KiB) max 463.625 μs (56 allocs: 10.000 KiB)`

Under 1ms!

`usearch`

usearch appears to be state of the art for in-memory vector similarity searches.

`In [8]: from usearch.index import search, MetricKind ...: import numpy as np In [12]: X = np.random.randint(-128, 128, size=(10**6, 64), dtype=np.int8) In [13]: q = np.random.randint(-128, 128, 64, dtype=np.int8) In [14]: %timeit search(X, q, 1, MetricKind.Hamming, exact=True) 62.4 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) In [15]: %timeit search(X, q, 1, MetricKind.Hamming, exact=True, threads=4) 32.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)`

The Julia implementation is ~16x faster single threaded and ~30x faster when using 4 cores.

Maybe there's a way to make `usearch`

faster, and in all fairness exact search is likely not the priority to get as fast a possible.

That being said this is around 100 lines of Julia code and we're not doing anything fancy aside from arguably using StaticArrays.

## Conclusion

With ~100 lines of Julia code we're able to achieve state of the art results in exact search similarity search for binary vector spaces.

I don't know about you but I think this is pretty damn awesome!

## Extra

I realized I never answered the original question (my bad).

"At what point is the dataset too big that exact brute search over binary vectors is not entirely feasible?"

Well it depends, how many RPS do you need to handle? What is the upper limit on how long the query should take?

Right now were at ~1-1.3ms for 1M rows. 10M rows would be ~10-13ms.

`julia> X1 = [SVector{64, Int8}(rand(Int8, 64)) for _ in 1:(20 * 10^6)]; julia> @be k_closest_parallel(X1, q1, 100) Benchmark: 5 samples with 1 evaluation min 22.198 ms (58 allocs: 23.906 KiB) median 22.232 ms (58 allocs: 23.906 KiB) mean 22.247 ms (58 allocs: 23.906 KiB) max 22.330 ms (58 allocs: 23.906 KiB)`

Even at 20M rows this query is unlikely to be the bottleneck of the application. 20M rows would be 76GB of the original encoded data vectors: 1024 element 32-bit vector. That's a big dataset, do you have that dataset? Probably not.