cuckoofilter

Compact Cuckoo filters for Crystal

Cuckoo Filter for Crystal

Русская версия

A Crystal shard implementing compact Cuckoo filters for approximate set-membership queries.

A Cuckoo filter answers the question: "Have I probably seen this value before?" It is similar to a Bloom filter, but it also supports deletion. The tradeoff is that lookups can return false positives: a value that was never inserted may occasionally be reported as present. False negatives are not expected for successfully inserted values unless the value was later deleted.

Features

  • Fixed-size Filter with 8-bit fingerprints.
  • Fixed-size Filter16 with 16-bit fingerprints and a much lower false positive rate.
  • ScalableFilter and ScalableFilter16, which grow by adding more internal filters.
  • Insert, unique insert, lookup, delete, reset, count, encode, and decode operations.
  • No runtime dependencies.
  • Benchmarks for normal and very large workloads.

How It Works

Each inserted value is hashed into:

  • one fingerprint
  • two possible bucket indices

Each bucket stores four fingerprints. A lookup checks both candidate buckets. Insertions first try the two candidate buckets directly; if both are full, the filter performs bounded Cuckoo relocation.

The implementation uses:

  • BUCKET_SIZE = 4
  • MAX_CUCKOO_COUNT = 500
  • FNV-1a 64-bit hashing by default
  • non-zero fingerprints, so 0 can represent an empty slot

Choosing a Filter

Use Filter when memory is more important and an approximate false positive rate around 3% is acceptable.

Use Filter16 when false positives are expensive. It uses two bytes per fingerprint instead of one, so encoded filter data and fingerprint storage are roughly twice as large, but false positives drop dramatically.

Approximate false positive rates:

Filter Fingerprint Approximate false positive rate Cost
Filter 8-bit ~3.1% smallest
Filter16 16-bit ~0.012% theoretical about 2x fingerprint storage

In the 30,000,000-key benchmark on this machine:

Filter Miss lookups False positives Observed rate
Filter 30,000,000 984,407 ~3.28%
Filter16 30,000,000 1,653 ~0.0055%

Installation

Add the shard to your shard.yml:

dependencies:
  cuckoofilter:
    github: creadone/cuckoofilter

Then install dependencies:

shards install

Basic Usage

require "cuckoofilter"

filter = CuckooFilter::Filter.new(1_000)

filter.insert("alpha")        # => true
filter.lookup("alpha")        # => true
filter.lookup("missing")      # => usually false, but false positives are possible
filter.count                  # => 1_u64

filter.delete("alpha")        # => true
filter.lookup("alpha")        # => false
filter.count                  # => 0_u64

Strings and byte slices are supported:

filter.insert("hello")
filter.insert(Bytes[1, 2, 3, 4])

Unique Inserts

insert allows duplicate fingerprints and can increase the count when the same value is inserted more than once.

Use insert_unique when you want to skip values that already appear to be present:

filter = CuckooFilter::Filter.new(1_000)

filter.insert_unique("alpha") # => true
filter.insert_unique("alpha") # => false
filter.count                  # => 1_u64

Because this is an approximate data structure, insert_unique can also reject a value that was not inserted before if lookup returns a false positive.

Lower False Positives with Filter16

Filter16 has the same API as Filter, but stores 16-bit fingerprints:

filter = CuckooFilter::Filter16.new(1_000)

filter.insert("alpha")
filter.lookup("alpha")   # => true
filter.lookup("missing") # => much less likely to be a false positive

Use it when the additional memory is worth the lower false positive rate.

Scalable Filters

A fixed-size filter can become too full. When insertion pressure is high, use a scalable filter:

filter = CuckooFilter::ScalableFilter.new

100_000.times do |index|
  filter.insert("item-#{index}")
end

filter.lookup("item-42") # => true
filter.count            # => 100_000_u64

For lower false positives, use the 16-bit scalable variant:

filter = CuckooFilter::ScalableFilter16.new
filter.insert("value")
filter.lookup("value") # => true

The default scalable configuration is:

  • initial capacity: 10_000
  • load factor: 0.9
  • growth: roughly doubles the next filter capacity

You can customize it:

filter = CuckooFilter::ScalableFilter.new(
  load_factor: 0.8_f32,
  capacity: 50_000,
  scale_factor: ->(current_bucket_count : UInt64) { current_bucket_count * 8 }
)

Encoding and Decoding

Filters can be serialized to bytes:

filter = CuckooFilter::Filter.new(1_000)
filter.insert("alpha")

encoded = filter.encode
decoded = CuckooFilter::Filter.decode(encoded)

decoded.lookup("alpha") # => true

Filter16 uses a different encoding because each fingerprint is two bytes:

encoded = filter16.encode
decoded = CuckooFilter::Filter16.decode(encoded)

Scalable filters can also be encoded:

encoded = scalable.encode
decoded = CuckooFilter::ScalableFilter.decode(encoded)

Use the matching decode method for the filter type:

  • CuckooFilter::Filter.decode
  • CuckooFilter::Filter16.decode
  • CuckooFilter::ScalableFilter.decode
  • CuckooFilter::ScalableFilter16.decode

Top-Level Helpers

The shard also exposes convenience constructors:

CuckooFilter.new_filter(1_000)
CuckooFilter.decode(bytes)

CuckooFilter.new_filter16(1_000)
CuckooFilter.decode16(bytes)

CuckooFilter.new_scalable_filter
CuckooFilter.decode_scalable_filter(bytes)

CuckooFilter.new_scalable_filter16
CuckooFilter.decode_scalable_filter16(bytes)

There is also a short alias:

filter = Cuckoo::Filter.new(1_000)

Benchmarks

Run the regular benchmark:

CRYSTAL_CACHE_DIR=.crystal-cache crystal run bench/cuckoofilter_bench.cr --release

Adjust the benchmark size:

BENCH_KEYS=100000 BENCH_LOOKUPS=500000 BENCH_CODEC=5000 crystal run bench/cuckoofilter_bench.cr --release

For very large key counts, use the streaming benchmark. It generates compact binary keys on the fly and avoids prebuilding large arrays of strings:

BENCH_KEYS=30000000 crystal run bench/large_filter_bench.cr --release

Compare the 16-bit variant:

BENCH_KEYS=30000000 BENCH_FINGERPRINT_BITS=16 crystal run bench/large_filter_bench.cr --release

Example results from this machine:

keys=30000000 capacity=30000000 fingerprint_bits=8
insert                         30000000     6333727.70     4.737s     30000000
lookup hit                     30000000    11687808.70     2.567s     30000000
lookup miss                    30000000    13551078.33     2.214s       984407
keys=30000000 capacity=30000000 fingerprint_bits=16
insert                         30000000     6350104.88     4.724s     30000000
lookup hit                     30000000    11350244.47     2.643s     30000000
lookup miss                    30000000    13707325.79     2.189s         1653

Benchmark numbers depend on hardware, Crystal version, compiler flags, and current system load.

Development

Run the test suite:

crystal spec

Check shard dependencies:

shards check

Format code:

crystal tool format

Notes and Limitations

  • This is an approximate data structure: false positives are possible.
  • delete removes a matching fingerprint, not the original value.
  • insert_unique depends on lookup, so it can be affected by false positives.
  • Encoded bytes are specific to the selected filter type.

License

MIT

Repository

cuckoofilter

Owner
Statistic
  • 0
  • 0
  • 0
  • 0
  • 0
  • about 1 hour ago
  • May 12, 2026
License

MIT License

Links
Synced at

Tue, 12 May 2026 09:47:18 GMT

Languages