Nucleoc - Fuzzy Matcher for Crystal

Nucleoc is a Crystal port of the nucleo fuzzy matcher from Rust. It provides high-performance fuzzy matching algorithms for text search and filtering.

Status

✅ Production Ready - This is a complete port of the Rust nucleo library with full fuzzy matching functionality implemented and tested.

Features

Exact string matching
Case-sensitive and case-insensitive matching
Configurable scoring parameters
Fuzzy matching (greedy and optimal algorithms)
Substring matching
Prefix/Postfix matching
Pattern parsing
Unicode normalization
High-performance optimizations

Installation

Add the dependency to your shard.yml:

dependencies:
  nucleoc:
    github: dsisnero/nucleoc

Run shards install

Tutorial: Complete Guide to Using Nucleoc

1. Basic Usage

Simple Fuzzy Matching

require "nucleoc"

# Create a matcher with default configuration
matcher = Nucleoc::Matcher.new

# Fuzzy match with score
if score = matcher.fuzzy_match("hello world", "hlo")
  puts "Match found! Score: #{score}"
end

# Fuzzy match with indices (character positions)
indices = [] of UInt32
if score = matcher.fuzzy_indices("hello world", "hlo", indices)
  puts "Score: #{score}, Indices: #{indices}"  # => [0, 2, 3]
end

2. Matching Algorithms

Exact Matching

matcher = Nucleoc::Matcher.new

# Returns score if strings match exactly
score = matcher.exact_match("hello", "hello")  # => 140
score = matcher.exact_match("Hello", "hello")  # => 140 (case-insensitive by default)

# With indices
indices = [] of UInt32
score = matcher.exact_indices("crystal", "crystal", indices)
# score = 140, indices = [0, 1, 2, 3, 4, 5, 6]

Substring Matching

matcher = Nucleoc::Matcher.new

# Find needle as contiguous substring
score = matcher.substring_match("hello world", "world")  # => 96
score = matcher.substring_match("hello world", "lo wo")   # => 96

# With indices
indices = [] of UInt32
score = matcher.substring_indices("hello world", "world", indices)
# score = 96, indices = [6, 7, 8, 9, 10]

Prefix/Postfix Matching

matcher = Nucleoc::Matcher.new

# Prefix: needle must match start of haystack
score = matcher.prefix_match("hello world", "hello")  # => 140
score = matcher.prefix_match("  hello world", "hello") # => 140 (ignores leading whitespace)

# Postfix: needle must match end of haystack
score = matcher.postfix_match("hello world", "world")  # => 96
score = matcher.postfix_match("hello world  ", "world") # => 96 (ignores trailing whitespace)

Greedy Fuzzy Matching

matcher = Nucleoc::Matcher.new

# Greedy algorithm (faster but may not find optimal score)
score = matcher.fuzzy_match_greedy("hello world", "hlo")  # => 140

3. Configuration

Custom Configuration

# Default configuration (case-insensitive, normalized)
config = Nucleoc::Config::DEFAULT

# Custom configuration
config = Nucleoc::Config.new(
  ignore_case: false,      # Case-sensitive matching
  normalize: true,         # Unicode normalization
  prefer_prefix: false,    # Don't give bonus to matches near start
  delimiter_chars: "/,:;|", # Characters that act as word boundaries
  bonus_boundary_white: Nucleoc::BONUS_BOUNDARY + 2_u16,
  bonus_boundary_delimiter: Nucleoc::BONUS_BOUNDARY + 1_u16,
  initial_char_class: Nucleoc::CharClass::Whitespace
)

matcher = Nucleoc::Matcher.new(config)

File Path Configuration

# Optimized for matching file paths
config = Nucleoc::Config::DEFAULT.match_paths
matcher = Nucleoc::Matcher.new(config)

# On Unix: delimiter_chars = "/"
# On Windows: delimiter_chars = "/\\"

4. Pattern Parsing

Basic Patterns

# Parse a pattern with multiple atoms (space-separated)
pattern = Nucleoc::Pattern.parse("hello world")
# Matches "hello" AND "world" (both must match)

# Parse with specific case handling
pattern = Nucleoc::Pattern.parse("Hello World",
  case_matching: Nucleoc::CaseMatching::Smart,
  normalization: Nucleoc::Normalization::Smart
)

Pattern Syntax

# ^ = prefix match
pattern = Nucleoc::Pattern.parse("^hello")  # Must start with "hello"

# ' = substring match
pattern = Nucleoc::Pattern.parse("'world")  # Must contain "world" as substring

# $ = postfix match (or exact if combined with ^)
pattern = Nucleoc::Pattern.parse("world$")  # Must end with "world"
pattern = Nucleoc::Pattern.parse("^hello$") # Must be exactly "hello"

# ! = negative match
pattern = Nucleoc::Pattern.parse("!error")  # Must NOT contain "error"

# Escaping special characters
pattern = Nucleoc::Pattern.parse("\\!hello")  # Literal "!hello"
pattern = Nucleoc::Pattern.parse("\\^start")  # Literal "^start"
pattern = Nucleoc::Pattern.parse("\\'quote")  # Literal "'quote"

Using Patterns with Matcher

matcher = Nucleoc::Matcher.new
pattern = Nucleoc::Pattern.parse("hello world")

# Match pattern against haystack
score = pattern.match(matcher, "hello beautiful world")
# score = combined score of "hello" and "world" matches

# With indices
indices = [] of Array(UInt32)
score = pattern.match(matcher, "hello beautiful world", indices)
# indices = [[0, 1, 2, 3, 4], [16, 17, 18, 19, 20]]

5. Advanced Features

Parallel Matching

# Match multiple haystacks against single needle in parallel
haystacks = ["hello", "world", "foo", "bar"]
needle = "lo"

# Returns array of scores in same order as input
scores = Nucleoc.parallel_fuzzy_match(haystacks, needle)
# scores = [score_for_hello, score_for_world, nil, nil]

# With indices
results = Nucleoc.parallel_fuzzy_indices(haystacks, needle)
# results = [{score, indices}, {score, indices}, nil, nil]

# Force a strategy (:sequential, :fiber, :spawn, :fiber_pool, :cml_pool, :pool, :auto)
scores = Nucleoc.parallel_fuzzy_match(haystacks, needle, strategy: :spawn)
# :pool is an alias for :cml_pool; :auto picks based on batch size.

Notes:

CRYSTAL_WORKERS=1 tends to favor sequential/fiber paths; pools rarely help.
For CRYSTAL_WORKERS=2, spawn/fiber often wins at mid-sized batches.
MultiPattern score_parallel is usually slower than score at typical sizes.

Custom Worker Pool

# Create worker pools with custom size
cml_pool = Nucleoc::CMLWorkerPool.new(4)
fiber_pool = Nucleoc::FiberWorkerPool.new(4)

# Batch matching
scores, indices = cml_pool.match_many(haystacks, needle, compute_indices: true)
scores, indices = fiber_pool.match_many(haystacks, needle, compute_indices: true)

Direct API Functions

# Static convenience methods
score = Nucleoc.fuzzy_match("hello world", "hlo")
score = Nucleoc.substring_match("hello world", "world")
score = Nucleoc.prefix_match("hello world", "hello")
score = Nucleoc.postfix_match("hello world", "world")

# With indices
result = Nucleoc.fuzzy_match_indices("hello world", "hlo")
# result = {score, indices}

6. Nucleo API (Advanced)

Managing Collections

# Create a Nucleo instance for managing collections
nucleo = Nucleoc.new_matcher

# Optionally cap results for faster snapshot builds
nucleo = Nucleoc.new_matcher(max_results: 100)

# Add items
nucleo.add("hello")
nucleo.add_all(["world", "foo", "bar"])

# Update pattern
nucleo.pattern = "lo"  # Sets pattern to "lo"

# Schedule snapshot recompute (async in this Crystal port)
status = nucleo.tick(0)
puts "changed=#{status.changed?} running=#{status.running?}"

# Get matches
snapshot = nucleo.match
snapshot.items.each do |result|
  puts "#{result.item}: #{result.score}"
end

# Clear items
nucleo.clear

Notes:

tick schedules background matching and reports whether a run is still in progress.
Use parallel_fuzzy_match or CMLWorkerPool for parallel batch matching.

Incremental Updates with Injector

nucleo = Nucleoc.new_matcher

# Get injector for batch operations
injector = nucleo.injector

# Add items through injector
injector.inject(0, "hello")
injector.extend(["world", "foo", "bar"])

# Injector automatically unregisters when done

7. Multi-Column Matching (MultiPattern)

matcher = Nucleoc::Matcher.new
pattern = Nucleoc::MultiPattern.new(2)
pattern.reparse(0, "foo")
pattern.reparse(1, "bar")

haystacks = ["foo.txt", "bar.log"]
score = pattern.score(haystacks, matcher)
puts score

8. UI Tick Loop Usage

The high-level Nucleo API is designed to be called from your UI loop. Each UI tick updates the pattern, calls tick, and reads the latest snapshot. In this Crystal port, tick schedules background matching and returns quickly.

config = Nucleoc::Config.new
nucleo = Nucleoc::Nucleo(Int32).new(config, -> { nil }, 1, 1)
injector = nucleo.injector
injector.extend(["alpha", "beta", "gamma", "delta"])

loop do
  # Replace this with your input handling
  query = "ga"
  nucleo.pattern = query

  status = nucleo.tick(0)
  if status.changed?
    snapshot = nucleo.match
    snapshot.items.each do |match|
      puts "#{match.item}: #{match.score}"
    end
  end

  break
end

Debouncing Redraws

tick is non-blocking in this port, so avoid scheduling new runs more frequently than your UI needs. A common approach is to debounce redraws to ~16ms (60 FPS).

last_tick = Time.monotonic
loop do
  now = Time.monotonic
  if now - last_tick >= 16.milliseconds
    last_tick = now
    nucleo.pattern = "query"
    status = nucleo.tick(0)
    puts "changed=#{status.changed?}"
  end

  break
end

Streaming Updates

Inject items over time and call tick regularly to keep results up to date.

injector = nucleo.injector
items = ["alpha", "beta", "gamma", "delta", "epsilon"]

items.each do |item|
  injector.inject(0, item)
  nucleo.tick(0)
end

9. Debugging and Logging

# Enable debug logging
Log.setup(:debug)

matcher = Nucleoc::Matcher.new
score = matcher.fuzzy_match("hello world", "hlo")
# Logs include matrix layout, scoring steps, and reconstruction

# Or set environment variable
# LOG_LEVEL=DEBUG crystal run your_script.cr

10. Performance Tips

Reuse Matchers: Create matcher once and reuse for multiple matches
Use Appropriate Algorithm: Choose exact/substring when possible instead of fuzzy
Batch Operations: Use parallel_fuzzy_match for multiple haystacks
Pre-compile Patterns: Parse patterns once and reuse
Configure Delimiters: Set appropriate delimiter_chars for your use case

11. Common Use Cases

File Search

config = Nucleoc::Config::DEFAULT.match_paths
matcher = Nucleoc::Matcher.new(config)

# Match file paths
files = ["src/nucleoc/api.cr", "spec/nucleoc_spec.cr", "README.md"]
pattern = Nucleoc::Pattern.parse("nucleoc cr")

files.each do |file|
  if score = pattern.match(matcher, file)
    puts "#{file}: #{score}"
  end
end

Autocomplete

config = Nucleoc::Config.new(prefer_prefix: true)
matcher = Nucleoc::Matcher.new(config)

options = ["create", "read", "update", "delete", "config"]
query = "cr"

options.each do |option|
  if score = matcher.fuzzy_match(option, query)
    puts "#{option}: #{score}"
  end
end
# "create" gets bonus for starting with "cr"

Filtering Lists

matcher = Nucleoc::Matcher.new
items = ["apple", "banana", "cherry", "date", "elderberry"]
pattern = Nucleoc::Pattern.parse("!e a")  # No 'e', contains 'a'

filtered = items.select do |item|
  pattern.match(matcher, item)
end
# filtered = ["banana", "date"]

Configuration Reference

Config Struct Fields

Field	Type	Default	Description
`delimiter_chars`	`String`	`"/,:;	"`
`bonus_boundary_white`	`UInt16`	`BONUS_BOUNDARY + 2`	Bonus for boundary after whitespace
`bonus_boundary_delimiter`	`UInt16`	`BONUS_BOUNDARY + 1`	Bonus for boundary after delimiter
`initial_char_class`	`CharClass`	`CharClass::Whitespace`	Class for start of string
`normalize?`	`Bool`	`true`	Enable Unicode normalization
`ignore_case?`	`Bool`	`true`	Case-insensitive matching
`prefer_prefix?`	`Bool`	`false`	Give bonus to matches near start

Character Classes

Nucleoc uses character classification for scoring bonuses:

enum CharClass
  Whitespace  # Space, tab, newline
  Delimiter   # Characters in delimiter_chars
  NonWord     # Symbols like @, #, $
  Number      # 0-9
  Lower       # a-z
  Upper       # A-Z
end

Scoring Constants

SCORE_MATCH = 16                 # Base score for each match
PENALTY_GAP_START = 3           # Penalty for starting a gap
PENALTY_GAP_EXTENSION = 1       # Penalty for extending a gap
BONUS_BOUNDARY = 8              # Bonus for word boundary
BONUS_CONSECUTIVE = 4           # Bonus for consecutive matches
BONUS_FIRST_CHAR_MULTIPLIER = 2 # Multiplier for first character bonus

Bonus Calculation

Bonuses are awarded for:

Word boundaries (after whitespace/delimiter)
CamelCase transitions (lower → upper)
Number boundaries (non-number → number)
Consecutive matches

Debugging

Nucleoc uses Crystal's Log system. Set LOG_LEVEL=DEBUG to see detailed matcher traces, including matrix layout, scoring, and reconstruction steps:

LOG_LEVEL=DEBUG crystal spec

Development Status

This project is a complete port of the Rust nucleo library. The implementation includes:

Complete matching algorithms - Fuzzy (greedy and optimal), exact, substring, prefix/postfix
Pattern parsing - Full pattern syntax with operators and escaping
Unicode support - Full Unicode normalization and character classification
Performance optimizations - Compressed matrix representation, prefiltering, efficient scoring
Configuration - Flexible scoring parameters and matching options

Feature Parity with Rust Nucleo

✅ Core matching algorithms - All algorithms from Rust implementation
✅ Scoring system - Exact scoring constants and bonus calculations
✅ Unicode handling - Full Unicode normalization and case folding
✅ Pattern parsing - Complete pattern syntax with operators
✅ Core data structures - BoxcarVector, worker pool, parallel sort
✅ Test coverage - 125/125 tests passing with exact behavior matching

Feature Status

Feature	Status	Notes
Core Matching Algorithms	✅ Complete	Fuzzy (greedy/optimal), exact, substring, prefix/postfix
Pattern Parsing	✅ Complete	Full syntax with operators and escaping
Unicode Support	✅ Complete	Normalization and case folding
Configuration System	✅ Complete	Custom scoring, delimiters, case handling
Boxcar Data Structure	✅ Complete	Lock-free vector with snapshots (`src/nucleoc/boxcar.cr`)
Worker Pool	✅ Complete	Thread pool for concurrent matching (`src/nucleoc/worker_pool.cr`)
CML Worker Pool	✅ Complete	Concurrent ML-based agent system (`src/nucleoc/worker_pool_cml.cr`)
Parallel Sorting	🔄 In Progress	Parallel quicksort with cancellation (`src/nucleoc/par_sort.cr`) - bug fixes needed (issue nucleoc-8f6)
MultiPattern	🔄 Planned	Incremental pattern updates (issue nucleoc-efu)
Advanced CML Patterns	🔄 Planned	choose, wrap, with_nack, guard (issue nucleoc-fep)
Parallel Matcher	🔄 Planned	Intra-task parallelism like Rayon's par_iter (issue nucleoc-aa2)
Concurrency Tests	🔄 Planned	Comprehensive race condition testing (issue nucleoc-2gq)
Error Handling & Recovery	🔄 Planned	Supervisor patterns, circuit breaker (issue nucleoc-bsy)

Legend: ✅ = Implemented, 🔄 = In Progress/Planned, ❌ = Not Started

Development

Prerequisites

Crystal 1.18.2 or later
Git

Setup

git clone https://github.com/dsisnero/nucleoc.git
cd nucleoc
shards install

Examples

crystal run examples/basic.cr
crystal run examples/nucleo_worker.cr
crystal run examples/multi_pattern.cr
crystal run examples/worker_pool.cr

Running Tests

crystal spec

Code Quality

# Format code
crystal tool format src/ spec/

# Run linter
ameba

Benchmarks

Benchmark harnesses live under bench/. Run with --release for meaningful results:

CRYSTAL_CACHE_DIR=.crystal-cache crystal run bench/src/main.cr --release -- all

To target specific benchmarks or tune the dataset:

BENCH_DATASET=20000 BENCH_CORES=1,2,4 crystal run bench/src/main.cr --release -- worker_pool

See PERFORMANCE.md for how to capture results and compare against Rust nucleo benchmarks.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'feat: add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Contributors

Dominic Sisneros - creator and maintainer

See the full list of contributors who participated in this project.

Acknowledgments

Based on the nucleo Rust library by Pascal Kuthe and the Helix editor team
Inspired by fzf and skim fuzzy matching algorithms
Uses CML for concurrent ML patterns

API Quick Reference

Core Classes

Class	Purpose	Key Methods
`Matcher`	Main matching engine	`fuzzy_match`, `exact_match`, `substring_match`, `prefix_match`, `postfix_match`
`Pattern`	Parsed query pattern	`parse`, `match`
`Config`	Matching configuration	`new`, `match_paths`, `bonus_for`
`WorkerPool`	Parallel matching	`match_many`
`Nucleo`	Collection manager	`add`, `clear`, `match`, `injector`

Static Convenience Methods

Nucleoc.fuzzy_match(haystack, needle) → UInt16?
Nucleoc.substring_match(haystack, needle) → UInt16?
Nucleoc.prefix_match(haystack, needle) → UInt16?
Nucleoc.postfix_match(haystack, needle) → UInt16?
Nucleoc.parallel_fuzzy_match(haystacks, needle) → Array(UInt16?)
Nucleoc.parallel_fuzzy_indices(haystacks, needle) → Array(Tuple(UInt16, Array(UInt32))?)

Pattern Syntax Cheat Sheet

Syntax	Meaning	Example
`text`	Fuzzy match	`"hello"` matches `"hlo"`
`'text`	Substring match	`"'world"` matches `"hello world"`
`^text`	Prefix match	`"^hello"` matches `"hello world"`
`text$`	Postfix match	`"world$"` matches `"hello world"`
`^text$`	Exact match	`"^hello$"` matches only `"hello"`
`!text`	Negative match	`"!error"` excludes `"error.log"`
`a b`	AND (space)	`"hello world"` matches both
`\`	Escape	`"\!hello"` matches literal `"!hello"`

Parallel Performance

Nucleoc includes parallel implementations for bulk matching operations. However, for true parallel speedup across multiple CPU cores, you need to compile with Crystal's multithreading preview flag:

# Compile with multithreading enabled for parallel speedup
crystal build -Dpreview_mt your_app.cr

# Or run directly
crystal run -Dpreview_mt your_app.cr

Performance Characteristics

Without -Dpreview_mt: Fibers run on a single OS thread. Parallel operations still provide ~1.4x speedup from algorithmic optimizations.
With -Dpreview_mt: Fibers can run on multiple OS threads. Expect 2.5-4x speedup for large datasets (10k+ items).

Parallel APIs

# Parallel fuzzy match (returns scores only)
scores = Nucleoc.parallel_fuzzy_match(items, "pattern")

# Parallel fuzzy match with indices
results = Nucleoc.parallel_fuzzy_indices(items, "pattern")

# Parallel top-k match (optimized for limited results)
top_results = Nucleoc.parallel_top_k_match(items, "pattern", k=100)

The library automatically detects if multithreading is enabled and adjusts its parallel strategy accordingly. Use strategy: :auto (default) for automatic selection, or explicitly specify :sequential, :spawn, :fiber, or :fiber_pool.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Repository

nucleoc

Owner

dsisnero

Statistic

0
0
0
2
2
16 days ago
December 18, 2025

License

MIT License

Links

Synced at

Sun, 01 Mar 2026 08:08:50 GMT

Languages

Crystal 83.5% Makefile 10.11% Rust 5.18% Shell 1.21%