bm25
bm25
Crystal port of Michael-JB/bm25, a BM25 library for sparse text embeddings, scoring, and lightweight in-memory keyword search.
This shard exposes three layers:
- Embedder: turns text into sparse BM25-weighted embeddings.
- Scorer: scores query embeddings against indexed document embeddings.
- SearchEngine: stores documents and searches them through the embedder and scorer.
Documentation
- Architecture: modules, data flow, tokenizer behavior, builders, and parity notes
- Development: setup, upstream source of truth, commands, and inventory checks
- Coding Guidelines: Crystal patterns used by this port
- Testing: spec layout, upstream fixtures, snapshots, and parity gates
- PR Workflow: review expectations for parity changes
Features
- Sparse BM25 embeddings with configurable
k1,b, andavgdl - Direct scoring API for callers that manage documents themselves
- In-memory search engine with
upsert,remove,get,each, andsearch - Default tokenizer with unicode normalization, punctuation splitting, stopword removal, and stemming
- Fixed language modes for English and German parity coverage, plus
Bm25.detect UInt32andUInt64fxhash-compatible token embedders- Custom tokenizer and token embedder extension points
- Upstream Rust fixture and snapshot parity specs
Installation
Add this shard to shard.yml:
dependencies:
bm25:
github: dsisnero/bm25
Then install dependencies:
shards install
BM25 Parameters
BM25 scores query relevance using term frequency, inverse document frequency, and document length normalization.
avgdlis the average meaningful token count for the corpus. Fit it from a known corpus when possible.bcontrols document length normalization.0.75is the default;0disables length normalization.k1controls the effect of repeated terms.1.2is the default.
When you do not have the full corpus available, use with_avgdl with a reasonable estimate.
Embed
Fit an embedder to a corpus and embed text:
require "bm25"
corpus = [
"The sky blushed pink as the sun dipped below the horizon.",
"Apples, oranges, papayas, and more papayas.",
"She found a forgotten letter tucked inside an old book.",
"A single drop of rain fell, followed by a thousand more.",
]
embedder = Bm25::EmbedderBuilder(UInt32, Bm25::DefaultTokenizer)
.with_fit_to_corpus(Bm25.fixed(Bm25::Language::English), corpus, Bm25::U32Embedder.new)
.build
embedding = embedder.embed(corpus[1])
embedding.indices # => Array(UInt32)
embedding.values # => Array(Float32)
Build from explicit BM25 parameters:
embedder = Bm25::EmbedderBuilder(UInt32, Bm25::DefaultTokenizer)
.with_avgdl(7.0_f32, Bm25::U32Embedder.new)
.b(0.75_f32)
.k1(1.2_f32)
.build
Language Mode
Use a fixed tokenizer language when the corpus language is known:
embedder = Bm25::EmbedderBuilder(UInt32, Bm25::DefaultTokenizer)
.with_avgdl(256.0_f32, Bm25::U32Embedder.new)
.language_mode(Bm25.fixed(Bm25::Language::German))
.build
Use detection when input can be English or German:
tokenizer = Bm25::DefaultTokenizer.new(Bm25.detect)
tokens = tokenizer.tokenize("Slices of pizza")
Bm25.detect is implemented locally with an English/German stopword heuristic to satisfy current upstream parity tests. It is not a general replacement for upstream Rust's optional whichlang integration.
Tokenizer
The default tokenizer normalizes unicode, lowercases text, splits on word-like boundaries, removes stopwords, and stems tokens.
tokenizer = Bm25::DefaultTokenizer.builder
.language_mode(Bm25::Language::English)
.normalization(true)
.stopwords(true)
.stemming(true)
.build
tokenizer.tokenize("Slices of 🍕") # => ["slice", "pizza"]
Provide a custom tokenizer by subclassing Bm25::Tokenizer:
class SplitOnTTokenizer < Bm25::Tokenizer
def tokenize(input_text : String) : Array(String)
input_text.split("T").reject(&.empty?)
end
end
embedder = Bm25::EmbedderBuilder(UInt32, SplitOnTTokenizer)
.with_avgdl(1.0_f32, Bm25::U32Embedder.new)
.tokenizer(SplitOnTTokenizer.new)
.build
embedder.embed("CupTofTtea").indices
Embedding Space
The embedding index type is the D type parameter. Use the built-in fxhash-compatible embedders for UInt32 or UInt64:
u32_embedder = Bm25::EmbedderBuilder(UInt32, Bm25::DefaultTokenizer)
.with_avgdl(2.0_f32, Bm25::U32Embedder.new)
.build
u64_embedder = Bm25::EmbedderBuilder(UInt64, Bm25::DefaultTokenizer)
.with_avgdl(2.0_f32, Bm25::U64Embedder.new)
.build
Use a custom embedding space by implementing Bm25::TokenEmbedder(D):
struct ConstantEmbedder
include Bm25::TokenEmbedder(String)
def embed(token : String) : String
"constant"
end
end
embedder = Bm25::EmbedderBuilder(String, Bm25::DefaultTokenizer)
.with_avgdl(2.0_f32, ConstantEmbedder.new)
.build
Score
Use Scorer directly when you want BM25 scoring but own document storage yourself:
corpus = [
"The sky blushed pink as the sun dipped below the horizon.",
"She found a forgotten letter tucked inside an old book.",
"Apples, oranges, pink grapefruits, and more pink grapefruits.",
"A single drop of rain fell, followed by a thousand more.",
]
embedder = Bm25::EmbedderBuilder(UInt32, Bm25::DefaultTokenizer)
.with_fit_to_corpus(Bm25::Language::English, corpus, Bm25::U32Embedder.new)
.build
scorer = Bm25::Scorer(UInt32, UInt32).new
corpus.each_with_index do |document, index|
scorer.upsert(index.to_u32, embedder.embed(document))
end
query_embedding = embedder.embed("pink")
matches = scorer.matches(query_embedding)
score = scorer.score(0_u32, query_embedding)
Search
Use SearchEngine for document storage plus search:
corpus = [
"The rabbit munched the orange carrot.",
"The snake hugged the green lizard.",
"The hedgehog impaled the orange orange.",
"The squirrel buried the brown nut.",
]
engine = Bm25::SearchEngineBuilder(UInt32, UInt32, Bm25::DefaultTokenizer)
.with_corpus(Bm25::Language::English, corpus, Bm25::U32Embedder.new)
.build
results = engine.search("orange", 3)
Build with custom document IDs:
documents = [
Bm25::Document(String).new("Guacamole", "avocado, lime juice, salt, onion, tomatoes, coriander."),
Bm25::Document(String).new("Hummus", "chickpeas, tahini, olive oil, garlic, lemon juice, salt."),
]
engine = Bm25::SearchEngineBuilder(String, UInt32, Bm25::DefaultTokenizer)
.with_documents(Bm25::Language::English, documents, Bm25::U32Embedder.new)
.build
Mutate an existing engine:
engine = Bm25::SearchEngineBuilder(UInt32, UInt32, Bm25::DefaultTokenizer)
.with_avgdl(10.0_f32, Bm25::U32Embedder.new)
.language_mode(Bm25::Language::English)
.build
document = Bm25::Document(UInt32).new(42_u32, "A breeze carried jasmine through the open window.")
engine.upsert(document)
engine.get(42_u32)
engine.remove(42_u32)
Changing the corpus after build time changes the true average document length. If many documents are added or removed, rebuild or choose an avgdl that reflects the expected corpus.
Development
Run the main gates:
crystal tool format --check src spec
ameba src spec
crystal spec
Run the ported upstream search benchmark:
crystal run benchmarks/search.cr
More details live in Development, Testing, and Architecture.
License
bm25
- 0
- 0
- 0
- 1
- 0
- 20 days ago
- April 25, 2026
MIT License
Wed, 06 May 2026 17:01:28 GMT