lexis-minhash v0.4.2

A fast, efficient clustering engine for text documents using MinHash and Locality-Sensitive Hashing (LSH) algorithms

Lexis MinHash


Note: CI enforces a minimum coverage threshold (80%). If coverage falls below the threshold the test job will fail.

Lexis MinHash is a locality-sensitive hashing (LSH) library for detecting similar text documents using the MinHash technique. It uses rolling hash + multiply-shift for O(n) performance.

For advanced usage patterns and client-side recommendations, see API.md.

Features

  • O(n) MinHash Signatures: Rolling hash + multiply-shift, no intermediate string allocations
  • Signature Similarity: Fast approximate Jaccard similarity estimation
  • Weighted MinHash: Optional TF-IDF weights for frequency-biased sampling
  • Weighted Overlap Coefficient: Similarity measure for weighted document representations
  • Overlap Coefficient: Measure set similarity using |A ∩ B| / min(|A|, |B|)
  • Locality-Sensitive Hashing (LSH): Efficient candidate retrieval using banding
  • LSH Index: In-memory index with linear probing for cache-efficient storage
  • Thread-Safe: Mutex-protected configuration
  • Runtime Configuration: Adjust signature size, band count, shingle size at runtime
  • Reproducible Hashes: Optional seed for consistent signatures across restarts
  • Signature Struct: Convenient serialization with to_blob/from_blob

What's New in v0.4.2

  • Memory safety fix: Fixed a critical memory safety bug in Engine.configure when using the seed parameter, which could previously cause undefined behavior due to dangling pointers.
  • Fixed unused @rows instance variable in LSHIndex.
  • Various CI improvements and bug fixes.

See the full release notes at https://github.com/kritoke/lexis-minhash/releases/tag/v0.4.2

Upgrade Notes

  • If you previously used non-divisible signature_size/num_bands combos, update your configuration to satisfy signature_size % num_bands == 0. Engine.configure will now raise ArgumentError for invalid combinations instead of silently truncating rows per band.
  • When using weighted MinHash at scale, prehash your weight map once and use the hashed-weighted API to reduce allocations:
hashed = LexisMinhash::Engine.prehash_weights(base_weights)
sig = LexisMinhash::Engine.compute_signature(text, hashed)

This repo also includes examples/prehash_example.cr demonstrating the pattern.

Installation

Add the dependency to your shard.yml:

dependencies:
  lexis-minhash:
    github: kritoke/lexis-minhash
    version: ~> 0.4.2

Then run shards install.

Usage

Basic Usage

require "lexis-minhash"

# Generate signatures directly from strings
sig1 = LexisMinhash::Engine.compute_signature("Document 1 text here")
sig2 = LexisMinhash::Engine.compute_signature("Document 2 text here")

# Calculate similarity (0.0 to 1.0)
similarity = LexisMinhash::Engine.similarity(sig1, sig2)
puts "Similarity: #{similarity}"

# Or use Signature struct for convenient API
sig = LexisMinhash::Signature.compute("Document text")
similarity = sig.similarity(other_sig)

# Serialize to blob for database storage
bytes = sig.to_blob

# Deserialize from blob
sig2 = LexisMinhash::Signature.from_blob(bytes)

# Generate signatures with optional TF-IDF weights
weights = {
  "important" => 2.5_f64,
  "rareword"  => 3.0_f64,
}
sig_weighted = LexisMinhash::Engine.compute_signature("Important rareword document", weights)

# Reproducible hashes (same seed = same signatures every run)
LexisMinhash::Engine.configure(seed: 12345)

# Generate LSH bands for candidate detection
bands = LexisMinhash::Engine.generate_bands(sig1)
# bands is Array({Int32, UInt64}) with {band_index, band_hash} tuples
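
The banding step itself is simple to picture: the signature is split into num_bands contiguous groups of rows, and each group is hashed to a single value; documents that share any band hash become candidates. A minimal, language-agnostic sketch of that idea in Python (not this library's implementation; the band hash here is just Python's tuple hash):

```python
def generate_bands(signature, num_bands):
    """Split a MinHash signature into bands and hash each band.

    Returns a list of (band_index, band_hash) tuples, mirroring the
    shape described above. Assumes len(signature) % num_bands == 0.
    """
    rows = len(signature) // num_bands
    bands = []
    for b in range(num_bands):
        band = tuple(signature[b * rows:(b + 1) * rows])  # rows_per_band values
        bands.append((b, hash(band)))
    return bands

sig = list(range(100))           # stand-in for a 100-value signature
bands = generate_bands(sig, 20)  # 20 bands of 5 rows each
print(len(bands))                # 20
```

Identical signatures always produce identical band hashes, so near-duplicate documents collide in at least one band with high probability.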

Overlap Coefficient

The overlap coefficient measures set similarity as |A ∩ B| / min(|A|, |B|). Unlike Jaccard similarity (|A ∩ B| / |A ∪ B|), it scores containment highly: when one set is a subset of the other, the coefficient is 1.0, which makes it better suited to detecting partial overlaps.

# Works with sorted UInt32 or UInt64 slices
a = Slice.new(5) { |i| (i * 2).to_u32 }  # [0, 2, 4, 6, 8]
b = Slice.new(3) { |i| (i * 2 + 2).to_u32 }  # [2, 4, 6]

# Intersection: [2, 4, 6] = 3 elements
# min(5, 3) = 3
# Overlap coefficient = 3/3 = 1.0
coefficient = LexisMinhash::Engine.overlap_coefficient(a, b)
puts "Overlap: #{coefficient}"  # => 1.0
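
Because both inputs are sorted, the intersection can be counted in a single linear merge pass. An illustrative Python sketch of that computation (the library operates on sorted UInt32/UInt64 slices; this just shows the math):

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|) for two sorted sequences of unique values."""
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:        # element in both sets
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:       # advance the pointer behind
            i += 1
        else:
            j += 1
    smaller = min(len(a), len(b))
    return common / smaller if smaller else 0.0

print(overlap_coefficient([0, 2, 4, 6, 8], [2, 4, 6]))  # 1.0
```

The merge visits each element once, so the cost is O(|A| + |B|) with no hashing or set allocation.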

Weighted Overlap Coefficient

The weighted overlap coefficient measures similarity between weighted document representations (e.g., TF-IDF vectors). It computes the sum of minimum weights for intersecting terms, normalized by the smaller total weight.

doc_a = {
  "machine" => 0.8_f64,
  "learning" => 0.9_f64,
  "data" => 0.5_f64,
}

doc_b = {
  "machine" => 0.8_f64,
  "learning" => 0.6_f64,
  "model" => 0.7_f64,
}

similarity = LexisMinhash::Similarity.weighted_overlap(doc_a, doc_b)
puts "Weighted overlap: #{similarity}"  # => ~0.667 (1.4 / min(2.2, 2.1))
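
The arithmetic can be checked with a short sketch (a hypothetical stand-in for the library's computation, assuming min-weight intersection normalized by the smaller total weight, as described above):

```python
def weighted_overlap(a, b):
    """Sum of per-term minimum weights over shared keys,
    normalized by the smaller total weight."""
    shared = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    smaller_total = min(sum(a.values()), sum(b.values()))
    return shared / smaller_total if smaller_total else 0.0

doc_a = {"machine": 0.8, "learning": 0.9, "data": 0.5}
doc_b = {"machine": 0.8, "learning": 0.6, "model": 0.7}

# shared = min(0.8, 0.8) + min(0.9, 0.6) = 1.4
# totals = 2.2 and 2.1, so result = 1.4 / 2.1
print(round(weighted_overlap(doc_a, doc_b), 3))  # 0.667
```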

Using LSHIndex

# Create index with expected document count for capacity planning
index = LexisMinhash::LSHIndex.new(bands: 20, expected_docs: 1000)

# Add documents with Int32 IDs
index.add(1, "Technology company announces revolutionary product")
index.add(2, "Technology company announces revolutionary update")

# Add documents with TF-IDF weights
weights = {"revolutionary" => 2.5_f64, "product" => 1.8_f64}
index.add_with_weights(3, "Technology company announces revolutionary product", weights)

# Query for similar documents
candidates = index.query("Technology company announces")
# candidates is Set(Int32) of doc IDs

# Query with weights
candidates_weighted = index.query_with_weights("Technology company announces", weights)

# Query with similarity scores
scored = index.query_with_scores("Technology company announces")
scored.each do |doc_id, score|
  puts "#{doc_id}: #{score}"
end

# Find all similar pairs above threshold
pairs = index.find_similar_pairs(threshold: 0.75)

# Monitor table utilization
puts index.load_factors  # Array(Float64) per band

# Storage operations
bytes = LexisMinhash::Engine.signature_to_bytes(sig)
restored = LexisMinhash::Engine.bytes_to_signature(bytes)

Custom Document Types (Backward Compatibility)

# Implement Document interface for custom types
struct MyDocument
  include LexisMinhash::Document

  getter text : String

  def initialize(@text : String)
  end
end

doc = MyDocument.new("Custom document text")
sig = LexisMinhash::Engine.compute_signature(doc)

Configuration

LexisMinhash::Engine.configure(
  signature_size: 100,  # Number of hash functions
  num_bands: 20,        # Number of bands for LSH
  shingle_size: 5,      # Character shingle size
  min_words: 4          # Minimum words to produce signature (below = zeros)
)

Default values:

  • signature_size: 100
  • num_bands: 20
  • rows_per_band: 5 (calculated as signature_size / num_bands)
  • shingle_size: 5
  • min_words: 4 (texts with fewer words return zero signature)

Minimum Words Threshold

Texts with fewer than min_words return a zero signature:

LexisMinhash::Engine.compute_signature("Short")        # => [0, 0, 0, ...] (1 word)
LexisMinhash::Engine.compute_signature("Hello world")  # => [0, 0, 0, ...] (2 words)
LexisMinhash::Engine.compute_signature("Bitcoin price surge continues")  # => [...] (4 words, produces signature)

This prevents clustering meaningless short headlines. Adjust based on your use case:

LexisMinhash::Engine.configure(min_words: 6)  # Stricter filtering

Functional Configuration API

For explicit, deterministic, or thread-safe configuration, use Engine::Config:

# Generate a config with optional seed for deterministic hashing
cfg = LexisMinhash::Engine.generate_config(
  signature_size: 50,
  num_bands: 10,
  shingle_size: 4,
  seed: 12345  # Omit for random coefficients
)

# Use the config explicitly (pure function, no global state)
sig = LexisMinhash::Engine.compute_signature_with_config(cfg, "Your text here")

Benefits of Engine::Config:

  • Deterministic: Pass a seed for reproducible signatures across runs
  • Thread-safe: No global state; each thread can use its own config
  • Testable: Easy to create known configurations for tests

The runtime Engine.configure method updates a global default config used by convenience methods like Engine.compute_signature.

Performance

Using rolling hash + multiply-shift:

Engine.compute_signature              22.02µs   9.43kB/op
Engine.compute_signature (weighted)   98.36µs   74.8kB/op
Engine.compute_signature_slice        21.54µs   7.09kB/op  (recommended)
Similarity.weighted_overlap           73.85ns   160B/op

Note: Default methods return Array(UInt32) for backward compatibility (~3% slower, 2x memory). Use compute_signature_slice and bytes_to_signature_slice for maximum performance.

Weighted signature computation is ~4.5x slower due to string key lookups in the weights hash. Use compute_signature_slice for better performance when using weights.
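
As a rough illustration of the technique named above (a generic sketch, not this library's exact hash functions or parameters): a rolling polynomial hash produces one base hash per character shingle in O(n), and each signature slot applies a multiply-shift style permutation to those hashes, keeping the minimum per slot.

```python
import random

MASK64 = (1 << 64) - 1

def shingle_hashes(text, k=5, base=257, mod=(1 << 61) - 1):
    """O(n) rolling polynomial hash of every k-character shingle."""
    t = text.lower()
    if len(t) < k:
        return []
    top = pow(base, k - 1, mod)  # weight of the outgoing character
    h = 0
    for ch in t[:k]:
        h = (h * base + ord(ch)) % mod
    out = [h]
    for i in range(k, len(t)):
        # slide the window: drop t[i - k], append t[i]
        h = ((h - ord(t[i - k]) * top) * base + ord(t[i])) % mod
        out.append(h)
    return out

def minhash(hashes, num_perm=32, seed=1):
    """Multiply-shift style MinHash over precomputed shingle hashes."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(1, 1 << 64) | 1 for _ in range(num_perm)]  # odd multipliers
    return [min(((a * h) & MASK64) >> 32 for h in hashes) for a in coeffs]

def est(s1, s2):
    """Estimated Jaccard similarity: fraction of matching slots."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

a = minhash(shingle_hashes("the quick brown fox jumps over the lazy dog"))
b = minhash(shingle_hashes("the quick brown fox jumps over the lazy cat"))
c = minhash(shingle_hashes("an entirely different headline about markets"))

print(est(a, b) > est(a, c))  # near-duplicates agree on more slots
```

The key point is that the shingle string itself is never materialized: only its rolling hash flows into the per-slot permutations, which is why the fast paths avoid per-shingle String allocations.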

Caching Weights for Better Performance

If you're computing signatures repeatedly for the same documents, pre-compute a shingle-to-weight lookup to avoid repeated string operations:

# Pre-compute all shingle weights for a document once
def precompute_shingle_weights(text : String, base_weights : Hash(String, Float64)) : Hash(String, Float64)
  cache = Hash(String, Float64).new(1.0_f64)
  normalized = text.downcase
  
  # Assumes the default shingle_size of 5 (a text of n chars has n - 4 shingles)
  (normalized.size - 4).times do |i|
    shingle = normalized[i...i + 5]
    cache[shingle] = base_weights[shingle]? || 1.0_f64
  end
  
  cache
end

# Then reuse the cached weights
weights = precompute_shingle_weights(document_text, tfidf_scores)
sig1 = LexisMinhash::Engine.compute_signature(document_text, weights)
sig2 = LexisMinhash::Engine.compute_signature(document_text, weights)

This moves the string allocation overhead to initialization time rather than signature computation.

Prehashing Weights (Recommended)

When you use the same weights map for many documents, pre-hash the shingle keys once and use the hashed-weighted API to avoid allocating shingle Strings per document:

# Original weights keyed by shingle string
base_weights = {"quick" => 2.0_f64, "brown" => 2.0_f64}

# Prehash the weights once (computes rolling-hash -> weight)
hashed = LexisMinhash::Engine.prehash_weights(base_weights)

# Compute signatures using the hashed-weighted path (no per-shingle String allocations)
sig = LexisMinhash::Engine.compute_signature("The quick brown fox jumps", hashed)

# Alternatively use the convenience method that prehashes and computes in one call
sig2 = LexisMinhash::Engine.compute_signature_with_prehashed_weights("The quick brown fox jumps", base_weights)

Use the hashed path when processing many documents with the same weights map — it typically reduces allocations and improves throughput.

LSH Parameter Tuning

For 100 hashes targeting 0.75 similarity:

Bands (b)  Rows (r)  Threshold  Prob @ Target
10         10        0.794      43.99%
20         5         0.549      99.56%
25         4         0.447      99.99%
Default (20 bands, 5 rows) gives 99.56% detection at 0.75 similarity.
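
These numbers follow from the standard banding formulas: the similarity threshold (where detection probability crosses roughly 50%) is approximately (1/b)^(1/r), and the probability of detecting a pair with similarity s is 1 - (1 - s^r)^b. A quick check in Python:

```python
def lsh_threshold(b, r):
    """Approximate similarity at which detection probability is ~50%."""
    return (1.0 / b) ** (1.0 / r)

def detection_probability(s, b, r):
    """P(at least one of b bands of r rows matches fully) at similarity s."""
    return 1.0 - (1.0 - s ** r) ** b

for b, r in [(10, 10), (20, 5), (25, 4)]:
    # prints threshold and detection probability at s = 0.75,
    # matching the table above
    print(b, r, round(lsh_threshold(b, r), 3),
          round(detection_probability(0.75, b, r), 4))
```

More bands with fewer rows lowers the threshold (more candidates, more false positives); fewer bands with more rows raises it (fewer candidates, more false negatives).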

Important: Randomized Hashing

MinHash uses randomized hash functions. This means:

  • Absolute similarity values vary between runs - The same two documents may get slightly different similarity scores each time the program starts
  • Use relative comparisons in tests - Prefer similarity(a, b) > similarity(a, c) over similarity(a, b) > 0.5
  • Deterministic within a run - Once initialized, the same text always produces the same signature

# Good for tests - relative comparison
sim_similar = Engine.similarity(sig1, sig2)
sim_different = Engine.similarity(sig1, sig3)
sim_similar.should be > sim_different

# Brittle in tests - absolute threshold
Engine.similarity(sig1, sig2).should be > 0.3  # May fail due to randomization

Weighted LSH Queries

When using weighted MinHash with LSHIndex, note that:

  • Documents added with add_with_weights produce different signatures than add
  • Query with the same weights to find matches: use query_with_weights for weighted-added documents
  • Mixing weighted and unweighted adds/queries may not find matches

# Consistent approach: use weights for both add and query
weights = {"important" => 2.0_f64, "term" => 1.5_f64}
index.add_with_weights(1, "Important document term", weights)
candidates = index.query_with_weights("Important term", weights)  # Finds doc 1

Run benchmarks: crystal run benchmark/benchmark.cr --release

Examples

There is a small examples folder demonstrating common workflows including prehashing weights to improve weighted MinHash performance.

Run a single example:

# Run the prehash example in release mode for realistic timings
crystal run examples/prehash_example.cr --release

Or use the helper script (intended for CI) to run all examples:

./scripts/run_examples.sh

API Reference

Engine Methods

Method Description
compute_signature(text : String) Generate MinHash signature → Array(UInt32)
compute_signature(doc : Document) Generate from Document interface → Array(UInt32)
compute_signature_slice(text) Generate signature → Slice(UInt32) (faster)
similarity(sig1, sig2) Compare two signatures (0.0 to 1.0)
overlap_coefficient(a, b) Overlap coefficient: |A ∩ B| / min(|A|, |B|)
generate_bands(signature) Generate LSH bands → Array({Int32, UInt64})
detection_probability(similarity) Probability of detecting items at given similarity
signature_to_bytes(signature) Convert signature to bytes for storage
bytes_to_signature(bytes) Convert bytes → Array(UInt32)
bytes_to_signature_slice(bytes) Convert bytes → Slice(UInt32) (faster)

LSHIndex Methods

Method Description
add(doc_id : Int32, text : String) Add document to index
add_with_signature(doc_id, signature) Add with pre-computed signature
query(text : String) Find candidate doc IDs
query_by_signature(signature) Query with pre-computed signature
query_with_scores(text) Query with similarity scores
find_similar_pairs(threshold) Find all similar pairs
get_signature(doc_id) Get stored signature
load_factors Table utilization per band
size Number of indexed documents
clear Clear all data

Development

# Run tests
crystal spec

# Run linter
bin/ameba

# Run benchmarks
crystal run benchmark/benchmark.cr --release

Docker-based CI

This project also includes a Docker-friendly CI job pattern (used successfully in my other projects) to run tests inside a reproducible Crystal image and avoid runtime bootstrapping issues on the runner. Example job snippet for GitHub Actions:

docker-test:
  name: Test in Crystal Docker image
  runs-on: ubuntu-latest
  container:
    image: 84codes/crystal:1.18.2-ubuntu-22.04
  steps:
    - uses: actions/checkout@v4
    - name: Install system dependencies
      run: |
        apt-get update
        apt-get install -y --no-install-recommends \
          libmagic-dev libssl-dev libxml2-dev pkg-config ca-certificates curl \
          && rm -rf /var/lib/apt/lists/*
    - name: Install shards
      run: shards install
    - name: Run ameba linter
      run: bin/ameba
    - name: Run specs
      run: crystal spec

Use this job alongside the existing runner-based jobs to get both fast native checks and a reproducible container execution.

Contributing

  1. Fork it (https://github.com/kritoke/lexis-minhash/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

License

MIT
