# Text::Splitter

*crystal-text-splitter v0.2.0*
Intelligent text chunking for RAG (Retrieval-Augmented Generation) and LLM applications in Crystal.
Text::Splitter provides flexible, production-tested text chunking with configurable overlap to preserve context between chunks. Perfect for building semantic search, RAG pipelines, and LLM applications.
## Features

- 🎯 **Character-based splitting** - Split by character count with sentence awareness
- 📝 **Word-based splitting** - Split by word count for more semantic chunking
- 🔗 **Configurable overlap** - Preserve context between chunks for better retrieval
- 🔄 **Iterator API** - Memory-efficient streaming with lazy evaluation
- 🛡️ **Edge case handling** - Handles long sentences, empty text, and boundary conditions
- ⚡ **Zero dependencies** - Pure Crystal implementation, no external dependencies
- 🚀 **Production-tested** - Battle-tested in production RAG systems
- ⚡ **High performance** - Processes 1MB in ~7ms using only ~18MB of memory
## Installation

1. Add the dependency to your `shard.yml`:

   ```yaml
   dependencies:
     text-splitter:
       github: wevote-project/crystal-text-splitter
       version: ~> 0.1.0
   ```

2. Run `shards install`
## Usage

### Character-Based Splitting (Default)
Ideal for fixed-size chunks where character count matters:
```crystal
require "text-splitter"

# Create a splitter with character-based chunking
splitter = Text::Splitter.new(
  chunk_size: 1000,  # Max 1000 characters per chunk
  chunk_overlap: 200 # 200-character overlap between chunks
)

text = File.read("long_document.txt")
chunks = splitter.split_text(text)

chunks.each_with_index do |chunk, i|
  puts "Chunk #{i + 1}: #{chunk.size} characters"
  puts chunk
  puts "-" * 50
end
```
### Word-Based Splitting
Better for semantic chunking and natural language processing:
```crystal
require "text-splitter"

# Create a splitter with word-based chunking
splitter = Text::Splitter.new(
  chunk_size: 280,   # Max 280 words per chunk
  chunk_overlap: 50, # 50-word overlap between chunks
  mode: :words
)

text = "Your long document text here..."
chunks = splitter.split_text(text)

# Process chunks for embedding generation
# (generate_embedding and store_in_vector_db are placeholders for your own code)
chunks.each do |chunk|
  embedding = generate_embedding(chunk)
  store_in_vector_db(chunk, embedding)
end
```
### Memory-Efficient Iterator API
For processing large documents without loading all chunks into memory:
```crystal
require "text-splitter"

splitter = Text::Splitter.new(chunk_size: 1000, chunk_overlap: 200)
text = File.read("long_document.txt")

# Method 1: Block syntax (most efficient - no array allocation)
splitter.each_chunk(text) do |chunk|
  # Process each chunk immediately
  embedding = generate_embedding(chunk)
  store_in_db(embedding)
end

# Method 2: Iterator with lazy evaluation
splitter.each_chunk(text)
  .first(10) # Only process the first 10 chunks
  .each { |chunk| process(chunk) }

# Method 3: Transform without materializing all chunks
large_chunks = splitter.each_chunk(text)
  .select { |c| c.size > 500 }
  .map { |c| c.upcase }
  .to_a
```
**Performance:** Processing 1MB of text uses only ~18MB of memory with iterators vs ~42MB with arrays.
### RAG Pipeline Example
Typical usage in a Retrieval-Augmented Generation system:
```crystal
require "text-splitter"

class DocumentProcessor
  def initialize
    @splitter = Text::Splitter.new(
      chunk_size: 500,
      chunk_overlap: 100,
      mode: :words
    )
  end

  def process_document(doc : String, metadata : Hash(Symbol, String))
    # Split the document into chunks
    chunks = @splitter.split_text(doc)

    chunks.map_with_index do |chunk, index|
      {
        text:     chunk,
        metadata: metadata.merge({
          :chunk_index  => index,
          :total_chunks => chunks.size,
        }),
      }
    end
  end
end

# Usage
processor = DocumentProcessor.new
bill_text = File.read("bill_text.txt")

chunks = processor.process_document(
  bill_text,
  {:bill_id => "HB-123", :title => "Example Bill"}
)

# Store in a vector database
# (OpenAI and VectorDB stand in for your own embedding and storage clients)
chunks.each do |chunk|
  embedding = OpenAI.embed(chunk[:text])
  VectorDB.store(chunk[:text], embedding, chunk[:metadata])
end
```

Note the `{:key => value}` hash literals: in Crystal, `{key: value}` creates a `NamedTuple`, which cannot be passed where a `Hash` is expected or merged into one.
## API Reference

### `Text::Splitter.new`

Creates a new text splitter instance.

**Parameters:**

- `chunk_size` (`Int32`, required) - Maximum size of each chunk (characters or words, depending on mode)
- `chunk_overlap` (`Int32`, required) - Overlap between chunks for context preservation
- `mode` (`Symbol`, optional) - Splitting mode: `:characters` (default) or `:words`

**Raises:**

- `ArgumentError` if `chunk_size` is not positive
- `ArgumentError` if `chunk_overlap` is negative
- `ArgumentError` if `chunk_overlap >= chunk_size`
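A minimal sketch of the validation rules above, wrapped in `begin`/`rescue` so it runs through all three invalid cases (assumes the shard is installed):

```crystal
require "text-splitter"

# Valid: overlap is strictly smaller than chunk_size
splitter = Text::Splitter.new(chunk_size: 100, chunk_overlap: 20)

# Each invalid configuration below raises ArgumentError
[
  {0, 0},     # chunk_size not positive
  {100, -1},  # chunk_overlap negative
  {100, 100}, # chunk_overlap >= chunk_size
].each do |(size, overlap)|
  begin
    Text::Splitter.new(chunk_size: size, chunk_overlap: overlap)
  rescue ex : ArgumentError
    puts "rejected chunk_size=#{size}, chunk_overlap=#{overlap}: #{ex.message}"
  end
end
```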
### `#split_text(text : String) : Array(String)`

Splits the input text into chunks based on the configured mode.

**Parameters:**

- `text` (`String`) - The text to split into chunks

**Returns:**

- `Array(String)` - Array of text chunks (an empty array if the input is empty or whitespace-only)
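To illustrate the return contract (a sketch; exact chunk boundaries depend on the splitter's sentence awareness):

```crystal
require "text-splitter"

splitter = Text::Splitter.new(chunk_size: 50, chunk_overlap: 10)

# Empty or whitespace-only input yields an empty array
splitter.split_text("").empty?       # => true
splitter.split_text("   \n  ").empty? # => true

# Non-empty input yields an Array(String)
chunks = splitter.split_text("One sentence. Another sentence. A third one.")
chunks.is_a?(Array(String)) # => true
```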
## Why Overlap Matters
Overlap between chunks is crucial for RAG systems to maintain context:
```crystal
# Without overlap
chunks = ["The bill was introduced in 2024.", "It aims to reduce emissions."]
# ❌ Lost context: What bill? What aims?

# With a 50-character overlap
splitter = Text::Splitter.new(chunk_size: 100, chunk_overlap: 50)
chunks = splitter.split_text("The bill was introduced in 2024. It aims to reduce emissions.")
# ✅ Chunks:
#   "The bill was introduced in 2024."
#   "The bill was introduced in 2024. It aims to reduce emissions."
# Context preserved!
```
## Chunking Strategies

### When to Use Character-Based Splitting
- Fixed embedding model limits (e.g., 512 tokens)
- Uniform chunk sizes required
- Processing code or structured data
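For the fixed-token-limit case, a character budget can be derived from the model's limit. The ~4 characters per token figure below is a common heuristic for English text, not a property of this shard, and the 20% overlap ratio is likewise an illustrative assumption:

```crystal
require "text-splitter"

token_limit = 512   # embedding model's input cap
chars_per_token = 4 # heuristic; varies by tokenizer and language

chunk_size = token_limit * chars_per_token # 2048 characters
chunk_overlap = chunk_size // 5            # ~20% overlap (409 characters)

splitter = Text::Splitter.new(chunk_size: chunk_size, chunk_overlap: chunk_overlap)
```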
### When to Use Word-Based Splitting
- Natural language documents
- Semantic search applications
- Better alignment with human understanding
## Performance
Text::Splitter is highly optimized for production use:
- **Fast**: Processes 1MB of text in ~7ms (147 ops/sec)
- **Memory efficient**: ~18MB of memory per operation with the iterator API
- **Streaming capable**: Process chunks one at a time without materializing the whole result
- **Type-safe**: Crystal's compile-time type checking catches type errors before runtime
### Benchmark Results (1MB text, release build)
| Metric | Value |
|---|---|
| Throughput | 147 ops/sec |
| Latency | 6.79ms per 1MB |
| Memory | 17.9MB per operation |
| Chunks generated | 1,249 chunks |
## Comparison with Other Solutions
| Feature | Text::Splitter | LangChain (Python) | Manual String.split |
|---|---|---|---|
| Sentence-aware | ✅ | ✅ | ❌ |
| Configurable overlap | ✅ | ✅ | ❌ |
| Word/char modes | ✅ | ✅ | ❌ |
| Iterator API | ✅ | ❌ | ❌ |
| Zero dependencies | ✅ | ❌ | ✅ |
| Type-safe | ✅ | ❌ | ✅ |
| Edge case handling | ✅ | ✅ | ❌ |
| Performance | 7ms/MB | ~100ms/MB | N/A |
## Real-World Usage
Text::Splitter is production-tested in:
- Bills RAG System: Processing legislative documents for semantic search
- Document chunking for embedding generation (OpenAI, local models)
- Building vector databases with proper context preservation
- RAG pipelines for question-answering systems
## Contributing

1. Fork it (https://github.com/wevote-project/text-splitter/fork)
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request
## Development

```bash
# Install dependencies
shards install

# Run tests
crystal spec

# Run linter
bin/ameba

# Format code
crystal tool format
```
## Contributors
- Antarr Byrd - creator and maintainer
- Osama Saeed - creator and maintainer
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Related Projects
- LangChain - Python framework for LLM applications
- llama_index - Data framework for LLM applications
- Pinecone - Vector database for similarity search
## Acknowledgments
Inspired by text splitting patterns from LangChain and best practices from the RAG/LLM community. Built with ❤️ in Crystal for high-performance text processing.