lexis-minhash v0.4.2

A fast, efficient clustering engine for text documents using MinHash and Locality-Sensitive Hashing (LSH) algorithms

Lexis MinHash


Note: CI enforces a minimum coverage threshold (80%). If coverage falls below the threshold the test job will fail.

Lexis MinHash is a locality-sensitive hashing (LSH) library for detecting similar text documents using the MinHash technique. It uses rolling hash + multiply-shift for O(n) performance.

For advanced usage patterns and client-side recommendations, see API.md.

Features

  • O(n) MinHash Signatures: Rolling hash + multiply-shift, no intermediate string allocations
  • Signature Similarity: Fast approximate Jaccard similarity estimation
  • Weighted MinHash: Optional TF-IDF weights for frequency-biased sampling
  • Weighted Overlap Coefficient: Similarity measure for weighted document representations
  • Overlap Coefficient: Measure set similarity using |A ∩ B| / min(|A|, |B|)
  • Locality-Sensitive Hashing (LSH): Efficient candidate retrieval using banding
  • LSH Index: In-memory index with linear probing for cache-efficient storage
  • Thread-Safe: Mutex-protected configuration
  • Runtime Configuration: Adjust signature size, band count, shingle size at runtime
  • Reproducible Hashes: Optional seed for consistent signatures across restarts
  • Signature Struct: Convenient serialization with to_blob/from_blob

What's New in v0.4.2

  • Memory safety fix: Fixed a critical memory safety bug in Engine.configure when using the seed parameter, which could previously cause undefined behavior due to dangling pointers.
  • Fixed unused @rows instance variable in LSHIndex.
  • Various CI improvements and bug fixes.

See the full release notes at https://github.com/kritoke/lexis-minhash/releases/tag/v0.4.2

Upgrade Notes

  • If you previously used non-divisible signature_size/num_bands combos, update your configuration to satisfy signature_size % num_bands == 0. Engine.configure will now raise ArgumentError for invalid combinations instead of silently truncating rows per band.
  • When using weighted MinHash at scale, prehash your weight map once and use the hashed-weighted API to reduce allocations:
hashed = LexisMinhash::Engine.prehash_weights(base_weights)
sig = LexisMinhash::Engine.compute_signature(text, hashed)

This repo also includes examples/prehash_example.cr demonstrating the pattern.

Installation

Add the dependency to your shard.yml:

dependencies:
  lexis-minhash:
    github: kritoke/lexis-minhash
    version: ~> 0.4.2

Then run shards install.

Usage

Basic Usage

require "lexis-minhash"

# Generate signatures directly from strings
sig1 = LexisMinhash::Engine.compute_signature("Document 1 text here")
sig2 = LexisMinhash::Engine.compute_signature("Document 2 text here")

# Calculate similarity (0.0 to 1.0)
similarity = LexisMinhash::Engine.similarity(sig1, sig2)
puts "Similarity: #{similarity}"

# Or use Signature struct for convenient API
sig = LexisMinhash::Signature.compute("Document text")
similarity = sig.similarity(other_sig)

# Serialize to blob for database storage
bytes = sig.to_blob

# Deserialize from blob
sig2 = LexisMinhash::Signature.from_blob(bytes)

# Generate signatures with optional TF-IDF weights
weights = {
  "important" => 2.5_f64,
  "rareword"  => 3.0_f64,
}
sig_weighted = LexisMinhash::Engine.compute_signature("Important rareword document", weights)

# Reproducible hashes (same seed = same signatures every run)
LexisMinhash::Engine.configure(seed: 12345)

# Generate LSH bands for candidate detection
bands = LexisMinhash::Engine.generate_bands(sig1)
# bands is Array({Int32, UInt64}) with {band_index, band_hash} tuples
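
The banding step itself is simple to picture: the signature is split into num_bands contiguous groups of rows, and each group is hashed to a single value; documents that share any band hash become candidates. A minimal, language-agnostic sketch of that idea in Python (not this library's implementation; the band hash here is just Python's tuple hash):

```python
def generate_bands(signature, num_bands):
    """Split a MinHash signature into bands and hash each band.

    Returns a list of (band_index, band_hash) tuples, mirroring the
    shape described above. Assumes len(signature) % num_bands == 0.
    """
    rows = len(signature) // num_bands
    bands = []
    for b in range(num_bands):
        band = tuple(signature[b * rows:(b + 1) * rows])  # rows_per_band values
        bands.append((b, hash(band)))
    return bands

sig = list(range(100))           # stand-in for a 100-value signature
bands = generate_bands(sig, 20)  # 20 bands of 5 rows each
print(len(bands))                # 20
```

Identical signatures always produce identical band hashes, so near-duplicate documents collide in at least one band with high probability.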

Overlap Coefficient

The overlap coefficient measures set similarity as |A ∩ B| / min(|A|, |B|). Unlike Jaccard similarity (|A ∩ B| / |A ∪ B|), it scores containment highly: when one set is a subset of the other, the coefficient is 1.0, which makes it better suited to detecting partial overlaps.

# Works with sorted UInt32 or UInt64 slices
a = Slice.new(5) { |i| (i * 2).to_u32 }  # [0, 2, 4, 6, 8]
b = Slice.new(3) { |i| (i * 2 + 2).to_u32 }  # [2, 4, 6]

# Intersection: [2, 4, 6] = 3 elements
# min(5, 3) = 3
# Overlap coefficient = 3/3 = 1.0
coefficient = LexisMinhash::Engine.overlap_coefficient(a, b)
puts "Overlap: #{coefficient}"  # => 1.0
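
Because both inputs are sorted, the intersection can be counted in a single linear merge pass. An illustrative Python sketch of that computation (the library operates on sorted UInt32/UInt64 slices; this just shows the math):

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|) for two sorted sequences of unique values."""
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:        # element in both sets
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:       # advance the pointer behind
            i += 1
        else:
            j += 1
    smaller = min(len(a), len(b))
    return common / smaller if smaller else 0.0

print(overlap_coefficient([0, 2, 4, 6, 8], [2, 4, 6]))  # 1.0
```

The merge visits each element once, so the cost is O(|A| + |B|) with no hashing or set allocation.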

Weighted Overlap Coefficient

The weighted overlap coefficient measures similarity between weighted document representations (e.g., TF-IDF vectors). It computes the sum of minimum weights for intersecting terms, normalized by the smaller total weight.

doc_a = {
  "machine" => 0.8_f64,
  "learning" => 0.9_f64,
  "data" => 0.5_f64,
}

doc_b = {
  "machine" => 0.8_f64,
  "learning" => 0.6_f64,
  "model" => 0.7_f64,
}

similarity = LexisMinhash::Similarity.weighted_overlap(doc_a, doc_b)
puts "Weighted overlap: #{similarity}"  # => ~0.667 (1.4 / min(2.2, 2.1))
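
The arithmetic can be checked with a short sketch (a hypothetical stand-in for the library's computation, assuming min-weight intersection normalized by the smaller total weight, as described above):

```python
def weighted_overlap(a, b):
    """Sum of per-term minimum weights over shared keys,
    normalized by the smaller total weight."""
    shared = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    smaller_total = min(sum(a.values()), sum(b.values()))
    return shared / smaller_total if smaller_total else 0.0

doc_a = {"machine": 0.8, "learning": 0.9, "data": 0.5}
doc_b = {"machine": 0.8, "learning": 0.6, "model": 0.7}

# shared = min(0.8, 0.8) + min(0.9, 0.6) = 1.4
# totals = 2.2 and 2.1, so result = 1.4 / 2.1
print(round(weighted_overlap(doc_a, doc_b), 3))  # 0.667
```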

Using LSHIndex

# Create index with expected document count for capacity planning
index = LexisMinhash::LSHIndex.new(bands: 20, expected_docs: 1000)

# Add documents with Int32 IDs
index.add(1, "Technology company announces revolutionary product")
index.add(2, "Technology company announces revolutionary update")

# Add documents with TF-IDF weights
weights = {"revolutionary" => 2.5_f64, "product" => 1.8_f64}
index.add_with_weights(3, "Technology company announces revolutionary product", weights)

# Query for similar documents
candidates = index.query("Technology company announces")
# candidates is Set(Int32) of doc IDs

# Query with weights
candidates_weighted = index.query_with_weights("Technology company announces", weights)

# Query with similarity scores
scored = index.query_with_scores("Technology company announces")
scored.each do |doc_id, score|
  puts "#{doc_id}: #{score}"
end

# Find all similar pairs above threshold
pairs = index.find_similar_pairs(threshold: 0.75)

# Monitor table utilization
puts index.load_factors  # Array(Float64) per band

# Storage operations
bytes = LexisMinhash::Engine.signature_to_bytes(sig)
restored = LexisMinhash::Engine.bytes_to_signature(bytes)

Custom Document Types (Backward Compatibility)

# Implement Document interface for custom types
struct MyDocument
  include LexisMinhash::Document

  getter text : String

  def initialize(@text : String)
  end
end

doc = MyDocument.new("Custom document text")
sig = LexisMinhash::Engine.compute_signature(doc)

Configuration

LexisMinhash::Engine.configure(
  signature_size: 100,  # Number of hash functions
  num_bands: 20,        # Number of bands for LSH
  shingle_size: 5,      # Character shingle size
  min_words: 4          # Minimum words to produce signature (below = zeros)
)

Default values:

  • signature_size: 100
  • num_bands: 20
  • rows_per_band: 5 (calculated as signature_size / num_bands)
  • shingle_size: 5
  • min_words: 4 (texts with fewer words return zero signature)

Minimum Words Threshold

Texts with fewer than min_words return a zero signature:

LexisMinhash::Engine.compute_signature("Short")        # => [0, 0, 0, ...] (1 word)
LexisMinhash::Engine.compute_signature("Hello world")  # => [0, 0, 0, ...] (2 words)
LexisMinhash::Engine.compute_signature("Bitcoin price surge continues")  # => [...] (4 words, produces signature)

This prevents clustering meaningless short headlines. Adjust based on your use case:

LexisMinhash::Engine.configure(min_words: 6)  # Stricter filtering

Functional Configuration API

For explicit, deterministic, or thread-safe configuration, use Engine::Config:

# Generate a config with optional seed for deterministic hashing
cfg = LexisMinhash::Engine.generate_config(
  signature_size: 50,
  num_bands: 10,
  shingle_size: 4,
  seed: 12345  # Omit for random coefficients
)

# Use the config explicitly (pure function, no global state)
sig = LexisMinhash::Engine.compute_signature_with_config(cfg, "Your text here")

Benefits of Engine::Config:

  • Deterministic: Pass a seed for reproducible signatures across runs
  • Thread-safe: No global state; each thread can use its own config
  • Testable: Easy to create known configurations for tests

The runtime Engine.configure method updates a global default config used by convenience methods like Engine.compute_signature.

Performance

Using rolling hash + multiply-shift:

Engine.compute_signature              22.02µs   9.43kB/op
Engine.compute_signature (weighted)   98.36µs   74.8kB/op
Engine.compute_signature_slice        21.54µs   7.09kB/op  (recommended)
Similarity.weighted_overlap           73.85ns   160B/op

Note: Default methods return Array(UInt32) for backward compatibility (~3% slower, 2x memory). Use compute_signature_slice and bytes_to_signature_slice for maximum performance.

Weighted signature computation is ~4.5x slower due to string key lookups in the weights hash. Use compute_signature_slice for better performance when using weights.
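
As a rough illustration of the technique named above (a generic sketch, not this library's exact hash functions or parameters): a rolling polynomial hash produces one base hash per character shingle in O(n), and each signature slot applies a multiply-shift style permutation to those hashes, keeping the minimum per slot.

```python
import random

MASK64 = (1 << 64) - 1

def shingle_hashes(text, k=5, base=257, mod=(1 << 61) - 1):
    """O(n) rolling polynomial hash of every k-character shingle."""
    t = text.lower()
    if len(t) < k:
        return []
    top = pow(base, k - 1, mod)  # weight of the outgoing character
    h = 0
    for ch in t[:k]:
        h = (h * base + ord(ch)) % mod
    out = [h]
    for i in range(k, len(t)):
        # slide the window: drop t[i - k], append t[i]
        h = ((h - ord(t[i - k]) * top) * base + ord(t[i])) % mod
        out.append(h)
    return out

def minhash(hashes, num_perm=32, seed=1):
    """Multiply-shift style MinHash over precomputed shingle hashes."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(1, 1 << 64) | 1 for _ in range(num_perm)]  # odd multipliers
    return [min(((a * h) & MASK64) >> 32 for h in hashes) for a in coeffs]

def est(s1, s2):
    """Estimated Jaccard similarity: fraction of matching slots."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

a = minhash(shingle_hashes("the quick brown fox jumps over the lazy dog"))
b = minhash(shingle_hashes("the quick brown fox jumps over the lazy cat"))
c = minhash(shingle_hashes("an entirely different headline about markets"))

print(est(a, b) > est(a, c))  # near-duplicates agree on more slots
```

The key point is that the shingle string itself is never materialized: only its rolling hash flows into the per-slot permutations, which is why the fast paths avoid per-shingle String allocations.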

Caching Weights for Better Performance

If you're computing signatures repeatedly for the same documents, pre-compute a shingle-to-weight lookup to avoid repeated string operations:

# Pre-compute all shingle weights for a document once
def precompute_shingle_weights(text : String, base_weights : Hash(String, Float64)) : Hash(String, Float64)
  cache = Hash(String, Float64).new(1.0_f64)
  normalized = text.downcase
  
  # Assumes the default shingle_size of 5 (a text of n chars has n - 4 shingles)
  (normalized.size - 4).times do |i|
    shingle = normalized[i...i + 5]
    cache[shingle] = base_weights[shingle]? || 1.0_f64
  end
  
  cache
end

# Then reuse the cached weights
weights = precompute_shingle_weights(document_text, tfidf_scores)
sig1 = LexisMinhash::Engine.compute_signature(document_text, weights)
sig2 = LexisMinhash::Engine.compute_signature(document_text, weights)

This moves the string allocation overhead to initialization time rather than signature computation.

Prehashing Weights (Recommended)

When you use the same weights map for many documents, pre-hash the shingle keys once and use the hashed-weighted API to avoid allocating shingle Strings per document:

# Original weights keyed by shingle string
base_weights = {"quick" => 2.0_f64, "brown" => 2.0_f64}

# Prehash the weights once (computes rolling-hash -> weight)
hashed = LexisMinhash::Engine.prehash_weights(base_weights)

# Compute signatures using the hashed-weighted path (no per-shingle String allocations)
sig = LexisMinhash::Engine.compute_signature("The quick brown fox jumps", hashed)

# Alternatively use the convenience method that prehashes and computes in one call
sig2 = LexisMinhash::Engine.compute_signature_with_prehashed_weights("The quick brown fox jumps", base_weights)

Use the hashed path when processing many documents with the same weights map — it typically reduces allocations and improves throughput.

LSH Parameter Tuning

For 100 hashes targeting 0.75 similarity:

Bands (b)  Rows (r)  Threshold  Prob @ Target
10         10        0.794      43.99%
20         5         0.549      99.56%
25         4         0.447      99.99%
Default (20 bands, 5 rows) gives 99.56% detection at 0.75 similarity.
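
These numbers follow from the standard banding formulas: the similarity threshold (where detection probability crosses roughly 50%) is approximately (1/b)^(1/r), and the probability of detecting a pair with similarity s is 1 - (1 - s^r)^b. A quick check in Python:

```python
def lsh_threshold(b, r):
    """Approximate similarity at which detection probability is ~50%."""
    return (1.0 / b) ** (1.0 / r)

def detection_probability(s, b, r):
    """P(at least one of b bands of r rows matches fully) at similarity s."""
    return 1.0 - (1.0 - s ** r) ** b

for b, r in [(10, 10), (20, 5), (25, 4)]:
    # prints threshold and detection probability at s = 0.75,
    # matching the table above
    print(b, r, round(lsh_threshold(b, r), 3),
          round(detection_probability(0.75, b, r), 4))
```

More bands with fewer rows lowers the threshold (more candidates, more false positives); fewer bands with more rows raises it (fewer candidates, more false negatives).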

Important: Randomized Hashing

MinHash uses randomized hash functions. This means:

  • Absolute similarity values vary between runs - The same two documents may get slightly different similarity scores each time the program starts
  • Use relative comparisons in tests - Prefer similarity(a, b) > similarity(a, c) over similarity(a, b) > 0.5
  • Deterministic within a run - Once initialized, the same text always produces the same signature

# Good for tests - relative comparison
sim_similar = Engine.similarity(sig1, sig2)
sim_different = Engine.similarity(sig1, sig3)
sim_similar.should be > sim_different

# Brittle in tests - absolute threshold
Engine.similarity(sig1, sig2).should be > 0.3  # May fail due to randomization

Weighted LSH Queries

When using weighted MinHash with LSHIndex, note that:

  • Documents added with add_with_weights produce different signatures than add
  • Query with the same weights to find matches: use query_with_weights for weighted-added documents
  • Mixing weighted and unweighted adds/queries may not find matches

# Consistent approach: use weights for both add and query
weights = {"important" => 2.0_f64, "term" => 1.5_f64}
index.add_with_weights(1, "Important document term", weights)
candidates = index.query_with_weights("Important term", weights)  # Finds doc 1

Run benchmarks: crystal run benchmark/benchmark.cr --release

Examples

There is a small examples folder demonstrating common workflows including prehashing weights to improve weighted MinHash performance.

Run a single example:

# Run the prehash example in release mode for realistic timings
crystal run examples/prehash_example.cr --release

Or use the helper script (intended for CI) to run all examples:

./scripts/run_examples.sh

API Reference

Engine Methods

Method Description
compute_signature(text : String) Generate MinHash signature → Array(UInt32)
compute_signature(doc : Document) Generate from Document interface → Array(UInt32)
compute_signature_slice(text) Generate signature → Slice(UInt32) (faster)
similarity(sig1, sig2) Compare two signatures (0.0 to 1.0)
overlap_coefficient(a, b) Overlap coefficient: |A ∩ B| / min(|A|, |B|)
generate_bands(signature) Generate LSH bands → Array({Int32, UInt64})
detection_probability(similarity) Probability of detecting items at given similarity
signature_to_bytes(signature) Convert signature to bytes for storage
bytes_to_signature(bytes) Convert bytes → Array(UInt32)
bytes_to_signature_slice(bytes) Convert bytes → Slice(UInt32) (faster)

LSHIndex Methods

Method Description
add(doc_id : Int32, text : String) Add document to index
add_with_signature(doc_id, signature) Add with pre-computed signature
query(text : String) Find candidate doc IDs
query_by_signature(signature) Query with pre-computed signature
query_with_scores(text) Query with similarity scores
find_similar_pairs(threshold) Find all similar pairs
get_signature(doc_id) Get stored signature
load_factors Table utilization per band
size Number of indexed documents
clear Clear all data

Development

# Run tests
crystal spec

# Run linter
bin/ameba

# Run benchmarks
crystal run benchmark/benchmark.cr --release

Docker-based CI

This project also includes a Docker-friendly CI job pattern (used successfully in my other projects) to run tests inside a reproducible Crystal image and avoid runtime bootstrapping issues on the runner. Example job snippet for GitHub Actions:

docker-test:
  name: Test in Crystal Docker image
  runs-on: ubuntu-latest
  container:
    image: 84codes/crystal:1.18.2-ubuntu-22.04
  steps:
    - uses: actions/checkout@v4
    - name: Install system dependencies
      run: |
        apt-get update
        apt-get install -y --no-install-recommends \
          libmagic-dev libssl-dev libxml2-dev pkg-config ca-certificates curl \
          && rm -rf /var/lib/apt/lists/*
    - name: Install shards
      run: shards install
    - name: Run ameba linter
      run: bin/ameba
    - name: Run specs
      run: crystal spec

Use this job alongside the existing runner-based jobs to get both fast native checks and a reproducible container execution.

Contributing

  1. Fork it (https://github.com/kritoke/lexis-minhash/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

License

MIT
