tokens

A Crystal port of huggingface/tokenizers.

Pinned upstream ref: 3992692 (main, 2025-04-24)

Provides implementations of today's most used tokenizers in pure Crystal, ported from the upstream Rust crate at vendor/tokenizers/tokenizers/.

What is a Tokenizer

A tokenizer works as a pipeline: raw text goes in and an Encoding comes out. Four stages run in order during encoding, and a fifth, the Decoder, maps token IDs back to text:

| Stage | Role | Crystal module |
| --- | --- | --- |
| Normalizer | Unicode normalization, lowercasing, stripping | `Tokens::Normalizer` |
| PreTokenizer | Split text into initial word-level chunks | `Tokens::PreTokenizer` |
| Model | Tokenize chunks into sub-word IDs | `Tokens::Model` (BPE) |
| PostProcessor | Add special tokens ([CLS], [SEP]) | `Tokens::PostProcessor` |
| Decoder | Convert token IDs back to text | `Tokens::Decoder` |
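
To make the stages concrete, here is a toy version of the pipeline written with only the Crystal standard library. It is a sketch for intuition, not the shard's API: every name in it (`VOCAB`, `normalize`, `pre_tokenize`, `model`, `post_process`, `decode`) is hypothetical, and a whole-word vocabulary lookup stands in for a real BPE model.

```crystal
# Toy five-stage pipeline. Nothing here comes from the tokens shard.
VOCAB       = {"[CLS]" => 0, "[SEP]" => 1, "[UNK]" => 2, "hello" => 3, "there" => 4, "!" => 5}
ID_TO_TOKEN = VOCAB.invert

# 1. Normalizer: Unicode NFC normalization, then lowercasing.
def normalize(text : String) : String
  text.unicode_normalize(:nfc).downcase
end

# 2. Pre-tokenizer: split into runs of word characters or single punctuation marks.
def pre_tokenize(text : String) : Array(String)
  text.scan(/\w+|[^\w\s]/).map(&.[0])
end

# 3. Model: map each chunk to an ID (whole-word lookup stands in for BPE).
def model(chunks : Array(String)) : Array(Int32)
  chunks.map { |chunk| VOCAB.fetch(chunk, VOCAB["[UNK]"]) }
end

# 4. Post-processor: wrap the sequence in special tokens.
def post_process(ids : Array(Int32)) : Array(Int32)
  [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
end

# 5. Decoder: map IDs back to text, skipping the special tokens.
def decode(ids : Array(Int32)) : String
  special = {VOCAB["[CLS]"], VOCAB["[SEP]"]}
  ids.reject { |id| special.includes?(id) }.map { |id| ID_TO_TOKEN[id] }.join(" ")
end

ids = post_process(model(pre_tokenize(normalize("Hello there!"))))
puts ids.inspect # => [0, 3, 4, 5, 1]
puts decode(ids) # => hello there !
```

Note that decoding does not restore the original casing or spacing: normalization and pre-tokenization discard information, which is why a real tokenizer tracks alignments separately.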

Features

BPE model — train, save/load, encode, decode
Normalizers — NFC, NFD, NFKC, NFKD, Lowercase, Strip, StripAccents, Replace, Prepend, BertNormalizer, ByteLevel, Sequence
Pre-tokenizers — Whitespace, ByteLevel, Metaspace, Digits, Punctuation, Split, Delimiter, FixedLength, BertPreTokenizer, UnicodeScripts, Sequence
Post-processors — BertProcessing, RobertaProcessing, TemplateProcessing, ByteLevel, Sequence
Decoders — BPE, ByteLevel, ByteFallback, CTC, Fuse, Strip, WordPiece, Metaspace, Sequence
Serialization — JSON round-trip for all pipeline components (compatible with upstream format)
Alignment tracking — map tokens back to original character offsets
Truncation & padding — with direction and strategy control
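
As a rough illustration of what direction control means for truncation and padding, the sketch below operates on a plain array of token IDs. The `Direction` enum and the `truncate` and `pad` helpers are hypothetical names chosen for this example and are not the shard's API.

```crystal
# Hypothetical helpers, not part of the tokens shard.
enum Direction
  Left
  Right
end

# Keep at most max_length IDs, dropping the excess from the chosen side.
def truncate(ids : Array(Int32), max_length : Int32, direction : Direction = Direction::Right) : Array(Int32)
  return ids if ids.size <= max_length
  direction.right? ? ids.first(max_length) : ids.last(max_length)
end

# Add pad_id on the chosen side until the sequence reaches the target length.
def pad(ids : Array(Int32), length : Int32, pad_id : Int32 = 0, direction : Direction = Direction::Right) : Array(Int32)
  missing = length - ids.size
  return ids if missing <= 0
  padding = Array.new(missing, pad_id)
  direction.right? ? ids + padding : padding + ids
end

ids = [101, 7, 8, 9, 102]
p truncate(ids, 3)                        # => [101, 7, 8]
p truncate(ids, 3, Direction::Left)       # => [8, 9, 102]
p pad(ids, 7)                             # => [101, 7, 8, 9, 102, 0, 0]
p pad(ids, 7, direction: Direction::Left) # => [0, 0, 101, 7, 8, 9, 102]
```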

Installation

Add to your shard.yml:

dependencies:
  tokens:
    github: dsisnero/tokens

Then:

shards install

Quick example

require "tokens"

# Build a tokenizer programmatically from individual pipeline components
tokenizer = Tokens::TokenizerImpl.new(Tokens::Models::BPE.default)
  .with_normalizer(Tokens::NormalizerWrapper.from(Tokens::Normalizers::NFC.new))
  .with_pre_tokenizer(Tokens::PreTokenizerWrapper.from(Tokens::PreTokenizers::ByteLevel.default))
  .with_post_processor(Tokens::PostProcessorWrapper.from(Tokens::PostProcessors::BertProcessing.default))
  .with_decoder(Tokens::DecoderWrapper.from(Tokens::Decoders::BPEDecoder.default))

# Encode text
encoding = tokenizer.encode("Hello there!", add_special_tokens: true)
encoding.tokens # => ["[CLS]", "Hello", "there", "!", "[SEP]"]

# Decode back
tokenizer.decode(encoding.ids) # => "Hello there !"

Usage

require "tokens"

# Create a tokenizer with a BPE model
bpe = Tokens::Models::BPE.from_files("vocab.json", "merges.txt")
tokenizer = Tokens::Tokenizer.new(bpe)

# Encode
encoding = tokenizer.encode("Hello world!")
puts encoding.tokens   # => ["Hello", "Ġworld", "!"]
puts encoding.ids      # => [15496, 995, 0] (assuming the GPT-2 vocabulary)

# Encode a pair
encoding = tokenizer.encode({"Hello", "world"})
puts encoding.type_ids # => [0, 1]

# Decode
text = tokenizer.decode(encoding.ids)
puts text # => "Hello world"
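
For intuition about what a BPE model does with the rules in merges.txt, here is a simplified, standard-library-only sketch of the merge loop. The `bpe` function and the tiny `ranks` table are hypothetical; a production implementation is considerably more optimized, but the principle of always applying the best-ranked adjacent merge first is the same.

```crystal
# Simplified BPE merge loop: repeatedly merge the adjacent pair with the
# lowest rank (earliest merge rule) until no known pair remains.
def bpe(word : String, ranks : Hash(Tuple(String, String), Int32)) : Array(String)
  symbols = word.chars.map(&.to_s)
  loop do
    best_rank = Int32::MAX
    best_index = -1
    (0...symbols.size - 1).each do |i|
      rank = ranks.fetch({symbols[i], symbols[i + 1]}, Int32::MAX)
      if rank < best_rank
        best_rank = rank
        best_index = i
      end
    end
    break if best_index < 0
    # Replace the two symbols at best_index with their concatenation.
    merged = symbols[best_index] + symbols[best_index + 1]
    symbols[best_index, 2] = [merged]
  end
  symbols
end

# Made-up merge rules; lower rank means the rule is applied earlier.
ranks = {
  {"l", "l"}   => 0,
  {"h", "e"}   => 1,
  {"he", "ll"} => 2,
}

p bpe("hello", ranks) # => ["hell", "o"]
```

Each resulting symbol would then be looked up in the vocabulary to obtain its ID.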

JSON serialization

All pipeline components serialize to/from the upstream JSON format:

# Serialize a normalizer
normalizer = Tokens::NormalizerWrapper.from(Tokens::Normalizers::NFC.new)
normalizer.to_json # => {"type":"NFC"}

# Deserialize
copy = Tokens::NormalizerWrapper.from_json(%({"type":"NFC"}))
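
The JSON shown above carries a "type" field that identifies the concrete component. The following sketch shows one way such type-tagged dispatch can be written with Crystal's standard `json` module. The `ToyNormalizer` hierarchy and its `dump` and `load` methods are hypothetical and only illustrate the idea; they say nothing about how the shard implements its wrappers.

```crystal
require "json"

# Hypothetical classes for illustration; not part of the tokens shard.
abstract class ToyNormalizer
  abstract def type_name : String
  abstract def normalize(text : String) : String

  # Serialize to the type-tagged shape, e.g. {"type":"NFC"}.
  def dump : String
    {"type" => type_name}.to_json
  end

  # Choose the concrete class from the "type" field.
  def self.load(json : String) : ToyNormalizer
    kind = JSON.parse(json)["type"].as_s
    case kind
    when "NFC"       then ToyNFC.new
    when "Lowercase" then ToyLowercase.new
    else
      raise ArgumentError.new("unknown normalizer type: #{kind}")
    end
  end
end

class ToyNFC < ToyNormalizer
  def type_name : String
    "NFC"
  end

  def normalize(text : String) : String
    text.unicode_normalize(:nfc)
  end
end

class ToyLowercase < ToyNormalizer
  def type_name : String
    "Lowercase"
  end

  def normalize(text : String) : String
    text.downcase
  end
end

copy = ToyNormalizer.load(%({"type":"Lowercase"}))
puts copy.normalize("Hello") # => hello
puts copy.dump               # => {"type":"Lowercase"}
```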

Pipeline details

Normalizers

# Unicode normalization
tokenizer.with_normalizer(Tokens::Normalizers::NFC.new)

# Sequence of normalizers
seq = Tokens::Normalizers::Sequence.new([
  Tokens::NormalizerWrapper.from(Tokens::Normalizers::Strip.new(true, true)),
  Tokens::NormalizerWrapper.from(Tokens::Normalizers::NFC.new),
])
tokenizer.with_normalizer(seq)

Pre-tokenizers

# Byte-level pre-tokenization (GPT-2 style)
tokenizer.with_pre_tokenizer(Tokens::PreTokenizers::ByteLevel.default)

# Whitespace splitting
tokenizer.with_pre_tokenizer(Tokens::PreTokenizers::Whitespace.new)

Post-processors

# BERT-style [CLS] ... [SEP]
tokenizer.with_post_processor(Tokens::PostProcessors::BertProcessing.default)

# RoBERTa-style <s> ... </s> with offset trimming
tokenizer.with_post_processor(Tokens::PostProcessors::RobertaProcessing.default)

# Template-based (fully customizable)
template = Tokens::PostProcessors::TemplateProcessing.build(
  single: Tokens::PostProcessors::ProcTemplate.parse("[CLS] $0 [SEP]"),
  pair: Tokens::PostProcessors::ProcTemplate.parse("[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1"),
  special_tokens: Tokens::PostProcessors::TokensMap.from_tuples([
    {"[CLS]", 1_u32},
    {"[SEP]", 0_u32},
  ])
)
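
To show how a template string such as the pair template above turns into tokens and type IDs, here is a deliberately simplified expander. The `expand` function is hypothetical: it handles only `$A`, `$B`, and literal special tokens, each with an optional `:N` type-ID suffix, and it assumes the type ID is 0 when the suffix is omitted.

```crystal
# Expand a simplified template into {token, type_id} pairs.
# Hypothetical helper, not the shard's TemplateProcessing implementation.
def expand(template : String, a : Array(String), b : Array(String)) : Array(Tuple(String, Int32))
  result = [] of {String, Int32}
  template.split(' ').each do |piece|
    # "[SEP]:1" splits into name "[SEP]" and suffix "1"; "$A" has an empty suffix.
    name, _, suffix = piece.partition(':')
    type_id = suffix.empty? ? 0 : suffix.to_i
    case name
    when "$A" then a.each { |tok| result << {tok, type_id} }
    when "$B" then b.each { |tok| result << {tok, type_id} }
    else           result << {name, type_id}
    end
  end
  result
end

pairs = expand("[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1", ["Hello"], ["world"])
p pairs.map(&.[0]) # => ["[CLS]", "Hello", "[SEP]", "world", "[SEP]"]
p pairs.map(&.[1]) # => [0, 0, 0, 1, 1]
```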

Decoders

# BPE decoder
tokenizer.with_decoder(Tokens::Decoders::BPEDecoder.default)

# Byte-level decoder (as in upstream, the ByteLevel type from the pre-tokenizers module also acts as a decoder)
tokenizer.with_decoder(Tokens::PreTokenizers::ByteLevel.default)

Documentation

Architecture — codebase structure and design
Development — setup and quality gates
Testing — testing strategy
Coding Guidelines — porting conventions
PR Workflow — pull request process

Development

make install    # Install dependencies
make format     # Format Crystal code
make lint       # Run Ameba linter
make test       # Run specs

Upstream

This is a behavior-faithful port of huggingface/tokenizers. The upstream Rust implementation is vendored at vendor/tokenizers/ (pinned at 3992692) and serves as the source of truth for all porting decisions.

Contributing

1. Fork it (https://github.com/dsisnero/tokens/fork)
2. Create your feature branch (git checkout -b my-new-feature)
3. Commit your changes (git commit -am 'Add some feature')
4. Push to the branch (git push origin my-new-feature)
5. Create a new Pull Request

Contributors

Dominic Sisneros — creator and maintainer

License

MIT License
