tokens
tokens
A Crystal port of huggingface/tokenizers — provides implementations of today's most used tokenizers in pure Crystal.
Upstream pinned ref: 3992692 (main, 2025-04-24)
Features
Models (4 families)
| Model | Training | Serialization | Description |
|---|---|---|---|
| BPE | BpeTrainer |
JSON, vocab+merges files | Byte-Pair Encoding (GPT-2, RoBERTa) |
| WordPiece | WordPieceTrainer |
JSON, vocab file | Greedy longest-match (BERT) |
| WordLevel | WordLevelTrainer |
JSON | Simple word-to-id mapping |
| Unigram | UnigramTrainer (EM) |
JSON | Viterbi lattice + n-best sampling |
Pipeline Components (33 types)
| Stage | Types | Wrapper |
|---|---|---|
| Normalizer (10) | NFC, NFD, NFKC, NFKD, Nmt, Lowercase, Strip, StripAccents, Replace, Prepend, BertNormalizer, ByteLevel, Precompiled, Sequence | NormalizerWrapper |
| PreTokenizer (11) | Whitespace, WhitespaceSplit, ByteLevel, Metaspace, Digits, Punctuation, Split, Delimiter, FixedLength, BertPreTokenizer, UnicodeScripts, Sequence | PreTokenizerWrapper |
| PostProcessor (4) | BertProcessing, RobertaProcessing, TemplateProcessing, SequenceProcessor | PostProcessorWrapper |
| Decoder (8) | BPEDecoder, ByteLevel, ByteFallback, CTC, Fuse, Strip, WordPiece, Metaspace, Sequence | DecoderWrapper |
Utilities
- from_pretrained — download tokenizers from HuggingFace Hub
- train_from_files — train models from text files
- encode_batch — batch encoding with padding
- LinesWithEnding — line reading preserving
\n/\r - ProgressBar — training progress reporting
- Parallelism — parallelism configuration API
Serialization
- JSON round-trip for all pipeline components (compatible with upstream format)
- Tagged/untagged JSON dispatch
- Cross-model tokenizer save/load
Installation
Add to your shard.yml:
dependencies:
tokens:
github: dsisnero/tokens
Then:
shards install
Quick Example
require "tokens"
# Build a BPE tokenizer from scratch
model = Tokens::Models::BPE::BpeBuilder.new
.unk_token("[UNK]")
.build
tokenizer = Tokens::TokenizerImpl.new(model)
.with_normalizer(Tokens::Normalizers::NFC.new)
.with_pre_tokenizer(Tokens::PreTokenizers::ByteLevel.default)
.with_post_processor(Tokens::PostProcessors::BertProcessing.default)
.with_decoder(Tokens::Decoders::BPEDecoder.default)
# Train from files
trainer = Tokens::Models::BPE::BpeTrainerBuilder.new
.show_progress(false)
.vocab_size(1000)
.build
tokenizer.train_from_files(trainer, ["data/small.txt"])
# Encode
encoding = tokenizer.encode("Hello there!", add_special_tokens: true)
encoding.tokens # => ["[CLS]", "Hello", "there", "!", "[SEP]"]
# Decode
tokenizer.decode(encoding.ids) # => "Hello there !"
# Serialize
File.write("tokenizer.json", tokenizer.to_json)
loaded = Tokens::TokenizerImpl.from_json(File.read("tokenizer.json"))
# Download from HuggingFace Hub
tokenizer = Tokens::TokenizerImpl.from_pretrained("bert-base-cased")
Pipeline Details
Normalizers
# Unicode normalization
tokenizer.with_normalizer(Tokens::Normalizers::NFC.new)
# BERT-style normalization (NFD + Lowercase + StripAccents)
seq = Tokens::Normalizers::Sequence.new([
Tokens::Normalizers::NFD.new,
Tokens::Normalizers::Lowercase.new,
Tokens::Normalizers::StripAccents.new,
])
tokenizer.with_normalizer(seq)
# Byte-level normalizer (GPT-2 style)
tokenizer.with_normalizer(Tokens::Normalizers::ByteLevel.new)
Pre-tokenizers
# Byte-level pre-tokenization (GPT-2 style)
tokenizer.with_pre_tokenizer(Tokens::PreTokenizers::ByteLevel.default)
# Whitespace splitting
tokenizer.with_pre_tokenizer(Tokens::PreTokenizers::Whitespace.new)
# Combined: Whitespace + split individual digits
tokenizer.with_pre_tokenizer(Tokens::PreTokenizers::Sequence.new([
Tokens::PreTokenizers::Whitespace.new,
Tokens::PreTokenizers::Digits.new(true),
]))
Post-processors
# BERT-style [CLS] ... [SEP]
tokenizer.with_post_processor(Tokens::PostProcessors::BertProcessing.default)
# RoBERTa-style <s> ... </s> with offset trimming
tokenizer.with_post_processor(Tokens::PostProcessors::RobertaProcessing.default)
# Template-based (fully customizable)
tokenizer.with_post_processor(
Tokens::PostProcessors::TemplateProcessing.build(
Tokens::PostProcessors::ProcTemplate.parse("[CLS] $A [SEP]"),
Tokens::PostProcessors::ProcTemplate.parse("[CLS] $A [SEP] $B:1 [SEP]:1"),
Tokens::PostProcessors::TokensMap.from_tuples([
{"[CLS]", 1_u32},
{"[SEP]", 2_u32},
]),
)
)
Decoders
# BPE decoder
tokenizer.with_decoder(Tokens::Decoders::BPEDecoder.default)
# Byte-level decoder (doubles as PostProcessor)
tokenizer.with_decoder(Tokens::PreTokenizers::ByteLevel.default)
# WordPiece decoder (strips ## prefixes)
tokenizer.with_decoder(Tokens::Decoders::WordPiece.default)
Training
# BPE training
trainer = Tokens::Models::BPE::BpeTrainerBuilder.new
.vocab_size(30000)
.min_frequency(2_u64)
.special_tokens([
Tokens::AddedToken.new("[UNK]", true),
Tokens::AddedToken.new("[CLS]", true),
Tokens::AddedToken.new("[SEP]", true),
])
.build
tokenizer.train_from_files(trainer, ["data/corpus.txt"])
# WordPiece training (delegates to BPE)
trainer = Tokens::Models::WordPieceTrainer.new(show_progress: false)
tokenizer.train_from_files(trainer, ["data/corpus.txt"])
# Unigram EM training
trainer = Tokens::Models::Unigram::UnigramTrainer.new(
show_progress: false,
unk_token: "<UNK>",
)
sentences = [{"word1", 5_u64}, {"word2", 3_u64}]
trainer.do_train(sentences, model)
Truncation & Padding
# Truncate sequences longer than 512 tokens
tokenizer.with_truncation(Tokens::TruncationParams.new(
max_length: 512_u64,
strategy: Tokens::TruncationStrategy::LongestFirst,
))
# Pad batches to match longest sequence
tokenizer.with_padding(Tokens::PaddingParams.new(
pad_id: 0_u32,
pad_token: "[PAD]",
))
Documentation
- Architecture — codebase structure, pipeline design, core types
- Quicktour — walkthrough: building, training, encoding, decoding, pipeline components
- Reference — all models, normalizers, pre-tokenizers, post-processors, decoders with examples
- Development — setup, quality gates, parity checks
- Testing — test categories, test data, running tests
- Parity Tracking — inventory manifests, status vocabulary, parity checks
- Coding Guidelines — porting conventions
- PR Workflow — pull request process
- Porting Plan — feature checklist and completion status
Development
make install # Install dependencies
make download-data # Download test model files
make format # Format Crystal code
make lint # Run Ameba linter
make test # Run specs (299 examples)
make test-all # download-data + crystal spec
Parity
| Metric | Status |
|---|---|
| Specs | 299 examples, 0 failures, 0 pending |
| Tests ported | 241/242 (1 partial — Unigram esaxx) |
| Source API tracked | 757 items, 0 missing |
| Adversarial verification | PASSED |
Upstream
This is a behavior-faithful port of huggingface/tokenizers. The upstream Rust implementation is vendored at vendor/tokenizers/ (pinned at 3992692) and serves as the source of truth for all porting decisions.
Contributing
- Fork it (https://github.com/dsisnero/tokens/fork)
- Create your feature branch (
git checkout -b my-new-feature) - Commit your changes (
git commit -am 'Add some feature') - Push to the branch (
git push origin my-new-feature) - Create a new Pull Request
Contributors
- Dominic Sisneros — creator and maintainer
Repository
tokens
Owner
Statistic
- 0
- 0
- 0
- 0
- 1
- about 1 month ago
- May 9, 2026
License
MIT License
Links
Synced at
Wed, 13 May 2026 16:29:22 GMT
Languages