models
Cadmium Models
Pre-trained machine learning models for the Cadmium NLP library.
Overview
This repository contains pre-trained models that can be easily loaded and used with Cadmium classifiers and other NLP components. Models are stored in efficient binary format (MessagePack) for fast loading and minimal storage.
Model Formats
Each model may include:
.model- Binary MessagePack format (recommended, fastest loading).model.json- JSON format (human-readable, fallback)metadata.yml- Model information, training data stats, accuracy metrics
Usage
Loading a Model
require "cadmium_classifier"
# Load from binary format (fastest)
bytes = File.read("path/to/model.model")
classifier = Cadmium::Classifier::Bayes.from_msgpack(bytes)
# Or load from JSON
classifier = Cadmium::Classifier::Bayes.from_json(File.read("path/to/model.model.json"))
# Use the classifier
result = classifier.classify("Your text here")
Available Models
Sentiment Analysis
- sentiment_twitter - English sentiment classification (positive/negative)
- Trained on: Sentiment140 Twitter dataset (50K samples, 40K for training)
- Accuracy: 75.05%
- Precision: 79.04% (positive), 71.04% (negative)
- Recall: 73.29% (positive), 77.12% (negative)
- F1 Score: 76.06% (positive), 73.95% (negative)
- Categories: 2 (positive, negative)
- Vocabulary: 59,982 words
- Training time: 1.12 seconds
- Source: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip
Example Usage
require "cadmium_classifier"
# Load the model
bytes = File.read("models/sentiment/sentiment_twitter.model")
classifier = Cadmium::Classifier::Bayes.from_msgpack(bytes)
# Classify text
result = classifier.classify("I love this new feature!")
# => {"positive" => 96.67, "negative" => 3.33}
# Get just the top category
category = classifier.classify_category("This is terrible!")
# => "negative"
Creating Models
Quick Start
The simplest way to train a new model:
# Prepare your training data (tab-separated: text<TAB>category)
cat > data/training_data.txt << EOF
I love this product! positive
This is terrible. negative
Amazing experience! positive
Worst service ever negative
EOF
# Train the model
crystal run src/cadmium_models.cr train data/training_data.txt my_model
# Move generated files to models directory
mv my_model.model my_model.model.json metadata.yml models/<category>/
Training Data Format
Training data must be tab-separated with the text content first and the category last:
Text content here category1
Another text sample category2
Multi-word text here category1
Important: Use tabs (\t) to separate the text from the category, not commas or spaces.
Training CLI
The CLI provides a simple interface for training models:
crystal run src/cadmium_models.cr train <data_file> <model_name>
Features:
- Automatic 80/20 train/test split
- Computes accuracy, precision, recall, F1 scores
- Generates confusion matrix
- Auto-creates metadata.yml with all metrics
- Exports both MessagePack and JSON formats
Example:
crystal run src/cadmium_models.cr train data/spam_data.txt spam_detector
Output:
๐ Starting model training...
๐ Total samples: 10000
๐ Training samples: 8000
๐งช Test samples: 2000
โณ Training...
โ
Training completed in 0.45 seconds
๐ Accuracy: 98.2%
๐ฆ Generated files:
- spam_detector.model (245 KB)
- spam_detector.model.json (312 KB)
- metadata.yml
Testing Your Model
After training, verify your model works correctly:
crystal run src/cadmium_models.cr test models/sentiment/sentiment_twitter.model
This will load the model and run sample predictions to verify it's working.
Model File Organization
Organize trained models by category:
models/
โโโ sentiment/
โ โโโ sentiment_twitter.model
โ โโโ sentiment_twitter.model.json
โ โโโ metadata.yml
โโโ spam/
โ โโโ email_spam.model
โ โโโ email_spam.model.json
โ โโโ metadata.yml
โโโ language/
โโโ lang_detector.model
โโโ lang_detector.model.json
โโโ metadata.yml
Best Practices
- Data Quality - Use clean, labeled data. Remove duplicates and obvious errors.
- Dataset Size - More data is generally better, but 10K-50K samples is often sufficient for good results.
- Balanced Classes - Try to have roughly equal samples per category for best accuracy.
- Test Split - Always reserve 20-30% of data for testing to validate performance.
- Metadata - Keep metadata.yml accurate and complete for model discoverability.
Contributing
See CONTRIBUTING.md for guidelines on:
- Submitting models for inclusion
- Model versioning and release process
- Code quality standards
License
Models are released under the MIT license unless otherwise noted in individual model metadata.
Links
models
- 1
- 0
- 0
- 0
- 2
- 5 days ago
- January 5, 2026
MIT License
Fri, 09 Jan 2026 19:44:02 GMT