# whisper-cry
Crystal bindings for whisper.cpp, providing local speech-to-text transcription using OpenAI's Whisper models. Version tracks whisper.cpp releases (currently v1.8.3).
## Installation

1. Add the dependency to your `shard.yml`:

   ```yaml
   dependencies:
     whisper-cry:
       github: robacarp/whisper-cry
   ```

2. Run `shards install`.

3. Build the native libraries:

   ```sh
   cd lib/whisper-cry && make
   ```

   This clones whisper.cpp v1.8.3, builds it as a static library, and copies the `.a` files into `vendor/lib/`. Requires `cmake` and a C++ compiler. See the whisper.cpp build documentation for platform-specific details and options.

4. Download a Whisper model (e.g. the base English model):

   ```sh
   curl -L -o ggml-base.en.bin https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin
   ```

   See the whisper.cpp models directory for all available models.

5. Optimize the model for your hardware (optional but recommended). The whisper.cpp project documents scripts and tooling for quantizing and otherwise optimizing models for different hardware.
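As one illustration of the optimization step, whisper.cpp ships a `quantize` tool that shrinks a model at some cost in accuracy. The binary location and the `q5_0` quantization type below are examples only — check the whisper.cpp quantization docs for the paths and types your build actually provides:

```sh
# Run from a whisper.cpp checkout after building; the binary path may
# differ depending on the whisper.cpp version and build system.
./build/bin/quantize ggml-base.en.bin ggml-base.en-q5_0.bin q5_0
```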
## Usage

```crystal
require "whisper-cry"

whisper = Whisper.new("/path/to/ggml-base.en.bin")
segments = whisper.transcribe_file("audio.wav")

segments.each do |segment|
  puts "#{segment.start_timestamp} --> #{segment.end_timestamp}"
  puts segment.text
end

whisper.close
```

Audio files must be 16-bit PCM WAV, mono, 16 kHz. Convert with ffmpeg:

```sh
ffmpeg -i input.mp3 -ar 16000 -ac 1 -f wav output.wav
```
## API

### `Whisper.new(model_path, use_gpu = false)`

Loads a GGML-format model file and initializes the inference context. Pass `use_gpu: true` to enable Metal acceleration on macOS. Raises `Whisper::Error` if the model file is missing or fails to load.
### `#transcribe_file(path, language = "en", n_threads = 4, translate = false)`

Transcribes a WAV file and returns an `Array(Whisper::Segment)`. The file must be 16-bit signed PCM, mono, 16 kHz.
### `#transcribe(samples, language = "en", n_threads = 4, translate = false)`

Transcribes pre-loaded `Float32` audio samples (normalized to [-1.0, 1.0], mono, 16 kHz). Useful when you already have audio data in memory.

Options:

- `language`: BCP-47 code (e.g. `"en"`, `"es"`), or `nil` for auto-detection
- `n_threads`: CPU threads for inference
- `translate`: when `true`, translates to English regardless of source language
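If you capture raw audio yourself, 16-bit PCM has to be normalized before calling `#transcribe`. A minimal sketch of that conversion — `pcm_to_f32` is a hypothetical helper for illustration, not part of the shard:

```crystal
# Convert signed 16-bit PCM samples to Float32 in [-1.0, 1.0],
# the range #transcribe expects.
def pcm_to_f32(pcm : Slice(Int16)) : Array(Float32)
  pcm.map { |s| s / 32768.0_f32 }
end

samples = pcm_to_f32(Slice[0_i16, Int16::MAX, Int16::MIN])
```

The resulting array can be passed straight to `whisper.transcribe(samples)`, provided the source audio was mono at 16 kHz.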
### `#close`

Frees the underlying whisper context. Safe to call multiple times. Also called automatically by `#finalize`.
### `#version`, `#model_type`, `#multilingual?`, `#system_info`

Query the whisper.cpp version string, the loaded model type (e.g. `"base"`), multilingual support, and available CPU features.
### `Whisper::Segment`

Each segment represents a span of recognized speech:

| Method | Returns |
|---|---|
| `#text` | Transcribed text |
| `#start_ms` / `#end_ms` | Timing in milliseconds |
| `#start_seconds` / `#end_seconds` | Timing in seconds |
| `#duration_ms` | Segment duration in milliseconds |
| `#start_timestamp` / `#end_timestamp` | Formatted as `"HH:MM:SS.mmm"` |
| `#no_speech_probability` | `Float32` (0.0–1.0); higher means likely not speech |
| `#speaker_turn_next` | `true` if the next segment is a different speaker |
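The millisecond accessors make it straightforward to emit subtitle formats. A sketch of a SubRip (SRT) cue writer — `srt_time` and `srt_cue` are hypothetical helpers, and note that SRT separates milliseconds with a comma, unlike the dot used by `#start_timestamp`:

```crystal
# Format a millisecond offset as an SRT timestamp: "HH:MM:SS,mmm".
def srt_time(ms : Int32) : String
  "%02d:%02d:%02d,%03d" % {ms // 3_600_000, (ms % 3_600_000) // 60_000,
                           (ms % 60_000) // 1_000, ms % 1_000}
end

# Build one numbered SRT cue from timings and transcribed text.
def srt_cue(index : Int32, start_ms : Int32, end_ms : Int32, text : String) : String
  "#{index}\n#{srt_time(start_ms)} --> #{srt_time(end_ms)}\n#{text}\n"
end

cue = srt_cue(1, 0, 3_723_456, "Hello world")
```

With real output this would be called as `srt_cue(i + 1, segment.start_ms, segment.end_ms, segment.text)` inside an `each_with_index` loop over the transcription result.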
## Development

Run the tests:

```sh
crystal spec
```

Tests cover `Segment` formatting/conversion, WAV file parsing and validation, and `Whisper` initialization error handling. No model file is needed to run the test suite.
## License
MIT License