Llamero
A Crystal library for interacting with AI/LLM providers with automatic failover and structured output support.
Supported Providers
| Provider | Features | Best For |
|---|---|---|
| OpenAI | Chat, Structured Output, Streaming, Embeddings, Vision | General purpose, GPT-4o |
| Anthropic | Chat, Structured Output, Streaming, Vision | Claude models, long context |
| Groq | Chat, Structured Output, Streaming, Vision | Ultra-fast inference |
| OpenRouter | All features (model-dependent) | Access to 400+ models |
Native Apple/MLX Track
Llamero ships an Apple-first native runtime for local inference from Crystal applications: keep an MLX-backed base model resident on Apple Silicon, stream chat responses through Crystal, parse structured JSON into Crystal objects, and hot-swap LoRA adapters without reloading the base model.
runtime = Llamero::Native::MLXRuntime.new(
  model_id: "mlx-community/gemma-4-e2b-it-4bit"
)
session = runtime.start_session
session.load_model
session.chat_stream([Llamero::Message.user("Hello!")]) do |chunk|
  print chunk
end

# Hot-swap a LoRA adapter while the base model stays resident
runtime.adapters.register("sql", Path["adapters/sql"])
session.activate_adapters(
  Llamero::Native::AdapterStack.additive([Llamero::Native::AdapterSlot.new("sql")])
)

# Or train your own adapter on the resident model (QLoRA on 4-bit models),
# from a golden dataset of prompt/completion pairs - no Python required
dataset = Llamero::Native::TrainingDataset.new(system_prompt: "You are an LX-900 expert.")
dataset.add("What injectors does the LX-900 use?", "BR-7741 injectors at 2,150 PSI.")
session.train_adapter("lx900-manual", dataset) do |progress|
  puts "iter #{progress.iteration}: loss=#{progress.loss}"
end
session.activate_adapters(
  Llamero::Native::AdapterStack.additive([Llamero::Native::AdapterSlot.new("lx900-manual")])
)
The runtime talks to a small Swift bridge (native/llamero-mlx) built on mlx-swift-lm and loaded at runtime via dlopen. Apps that run without a built bridge automatically fall back to a deterministic mock bridge, so specs and non-Apple development keep working. Build the real bridge with:
cd native/llamero-mlx && ./build.sh
crystal run examples/native_smoke_test.cr # real on-device inference
Audio (experimental)
The native track also ships an on-device speech runtime: speech-to-text with NVIDIA Parakeet and text-to-speech with Kokoro, running through a second Swift bridge (native/llamero-audio) built on FluidAudio - CoreML on the Neural Engine, so transcription and synthesis never compete with the MLX LLM for the GPU. Models download lazily on first use.
audio = Llamero::Native::AudioRuntime.new # Parakeet v3 + Kokoro defaults
result = audio.transcribe(Path["meeting.wav"])
result.text # full transcript
result.segments # word-level [{text, start_ms, end_ms}]
spoken = audio.speak("I found three problems in that file.", voice: "af_heart")
spoken.path # wav file, ready to play
Streaming speech-to-text turns the same runtime into a live dictation engine: push 16 kHz mono Float32 samples from your capture layer and llamero streams text back. You get partial hypotheses while a phrase is being spoken, and one completed utterance per detected end of utterance (Parakeet EOU 120M, confirmed after a configurable silence debounce):
stream = audio.start_stream # chunk_ms: 160, eou_debounce_ms: 1280
stream.on_partial { |text| print "\r#{text}" } # live ghost text
stream.on_utterance { |utterance| handle(utterance.text) } # completed phrases
while samples = capture.next_chunk # Slice(Float32), 16kHz mono
  stream.push(samples)
end
result = stream.finish # flushes + returns the full session transcript
result.text # everything said
result.segments # one {text, start_ms, end_ms} per utterance
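If your capture layer produces signed 16-bit PCM rather than Float32, the samples need converting before they are pushed. The sketch below uses only the standard library and is not part of Llamero: pcm16_to_f32 is a hypothetical helper, and scaling to the range [-1.0, 1.0) is the usual convention for float audio, assumed here rather than confirmed by the library. It also works out how many samples a 160 ms chunk holds at 16 kHz.

```crystal
SAMPLE_RATE = 16_000 # Hz, the rate the stream expects
CHUNK_MS    =    160 # the default chunk_ms shown in the comment above

# Samples in one chunk: 16_000 samples/s * 0.160 s = 2_560.
SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000

# Hypothetical helper (not a Llamero method): converts signed 16-bit PCM
# to Float32 samples scaled into [-1.0, 1.0).
def pcm16_to_f32(pcm : Slice(Int16)) : Slice(Float32)
  Slice(Float32).new(pcm.size) { |i| pcm[i].to_f32 / 32_768_f32 }
end

# One chunk of silence standing in for real microphone input.
pcm = Slice(Int16).new(SAMPLES_PER_CHUNK, 0_i16)
samples = pcm16_to_f32(pcm)
puts samples.size # 2560
```

A slice produced this way has the element type the loop above pushes into the stream.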
If the audio bridge has not been built, the runtime falls back to a deterministic mock in the same way as the MLX bridge (gate real-audio code on audio.real_bridge?). Build and verify with:
cd native/llamero-audio && ./build.sh
crystal run examples/native_audio_test.cr -- /path/to/speech.wav # file STT + TTS (verified on-device)
crystal run examples/native_dictation_test.cr -- /path/to/speech.wav # streaming STT
Status: file transcription and TTS are verified on-device; streaming STT is implemented and spec-covered, pending on-device verification (see the multimodal roadmap below). PCM-streaming TTS is a planned follow-up.
Design docs:
- Native MLX roadmap
- Native MLX architecture
- Multimodal roadmap (vision, speech-to-text, text-to-speech)
- Llamero v2 roadmap
Documentation for AI Coding Agents
Llamero ships its documentation in forms coding assistants can actually use, so even small models can build with the library:
- Skills (.claude/skills/): task recipes for cloud-providers, local-inference, and adapter-training, written as complete programs with error→fix tables. With the Ashard fork of shards, shards install copies them into your project as .claude/skills/llamero--<name>/.
- CLAUDE.md and AGENTS.md: the condensed API contract for any agent harness.
- A golden training dataset (training_data/llamero_api_qa.jsonl): the API as prompt/completion pairs. Train a local model its own llamero adapter with examples/train_llamero_docs_adapter.cr - the library teaching a model to use the library:
dataset = Llamero::Native::TrainingDataset.from_pairs_jsonl(
  "lib/llamero/training_data/llamero_api_qa.jsonl"
)
session.train_adapter("llamero-docs", dataset, config)
Installation
Add the dependency to your shard.yml:
dependencies:
  llamero:
    github: crimson-knight/llamero
Then run:
shards install
Quick Start
Define Your AI Client
require "llamero"
# Create your application's AI client with failover
class MyAIClient < Llamero::Client
  def initialize
    super(
      primary: :openai,
      fallbacks: [:anthropic, :groq]
    )
  end
end
client = MyAIClient.new
Basic Chat
response = client.chat([
  Llamero::Message.user("What is the capital of France?"),
])
puts response.content
# => "The capital of France is Paris."
puts "Provider: #{response.provider_used}"
# => "Provider: openai"
Structured Output
Define a response schema using BaseGrammar:
class PersonInfo < Llamero::BaseGrammar
  property name : String = ""
  property age : Int32 = 0
  property occupation : String = ""
end

response = client.chat_structured(
  [Llamero::Message.user("Generate a random person's info")],
  PersonInfo
)
person = response.parsed.not_nil!
puts "Name: #{person.name}, Age: #{person.age}"
Streaming
client.chat_stream([
  Llamero::Message.user("Tell me a short story"),
]) do |chunk|
  print chunk
end
Configuration
Environment Variables
Set API keys as environment variables:
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."
export OPENROUTER_API_KEY="sk-or-..."
Configuration File
Create .llamero/config.yml in your project directory:
providers:
  openai:
    api_key: "sk-..."
    organization: "org-..." # optional
  anthropic:
    api_key: "sk-ant-..."
  groq:
    api_key: "gsk_..."
  openrouter:
    api_key: "sk-or-..."
defaults:
  provider: openai
  model: gpt-4o
  temperature: 0.7
  max_tokens: 4096
Priority order: Explicit constructor values > Environment variables > Config file > Defaults
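The priority order can be read as "the first source that has a value wins". The sketch below shows that rule with the standard library only; resolve_setting and the DEMO_ variable names are hypothetical and are not Llamero's actual configuration loader.

```crystal
# Hypothetical helper, not part of Llamero's API: returns the first value
# that is set, checking sources from highest to lowest priority.
def resolve_setting(explicit : String?, env_key : String, file_value : String?, default : String) : String
  explicit || ENV[env_key]? || file_value || default
end

# DEMO_MODEL is a made-up variable name used only for this illustration.
ENV["DEMO_MODEL"] = "model-from-env"

# An explicit constructor value beats everything else.
puts resolve_setting("model-from-constructor", "DEMO_MODEL", "model-from-file", "model-default")
# Without an explicit value, the environment variable wins.
puts resolve_setting(nil, "DEMO_MODEL", "model-from-file", "model-default")
# With no environment variable set, the config file value is used.
puts resolve_setting(nil, "DEMO_UNSET_VARIABLE", "model-from-file", "model-default")
# With nothing set anywhere, the built-in default applies.
puts resolve_setting(nil, "DEMO_UNSET_VARIABLE", nil, "model-default")
```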
Provider Failover
The unified Client automatically handles failover:
class ResilientClient < Llamero::Client
  def initialize
    super(
      primary: :openai,
      fallbacks: [:anthropic, :groq],
      retry_config: Llamero::RetryConfig.new(
        max_retries: 3,
        base_delay: 1.second
      )
    )

    # Optional: Monitor failovers
    on_fallback do |from, to, error|
      Log.warn { "Failing over from #{from} to #{to}: #{error.message}" }
    end

    on_retry do |provider, attempt, error|
      Log.info { "Retry #{attempt} for #{provider}" }
    end
  end
end
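Conceptually, failover means trying the primary provider first and moving down the fallback list until one call succeeds. The sketch below shows that idea in plain Crystal. It is deliberately simplified (no retries, every error triggers failover) and with_failover is a hypothetical name, not how Llamero::Client is implemented.

```crystal
# Hypothetical helper: yields each provider in order and returns the first
# successful result together with the provider that produced it.
def with_failover(providers : Array(Symbol), & : Symbol -> String) : Tuple(Symbol, String)
  last_error = nil
  providers.each do |provider|
    begin
      result = yield provider
      return {provider, result}
    rescue ex
      last_error = ex # remember the failure and try the next provider
    end
  end
  raise(last_error || Exception.new("no providers configured"))
end

# Simulate an outage on the primary provider.
provider, text = with_failover([:openai, :anthropic, :groq]) do |name|
  raise "simulated outage" if name == :openai
  "hello from #{name}"
end

puts "#{provider}: #{text}" # anthropic: hello from anthropic
```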
Retry Behavior
| Error Type | Behavior |
|---|---|
| Rate Limit (429) | Retry with exponential backoff |
| Server Error (5xx) | Retry with backoff |
| Auth Error (401/403) | Immediate failover (no retry) |
| Quota Exceeded (402) | Immediate failover |
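The table maps naturally onto a small classification function. The sketch below is one way to express it. FailureAction and action_for are hypothetical names, and treating every other status as Unclassified is an assumption, since the table does not say what happens to errors it does not list.

```crystal
# Hypothetical names; Llamero's internal error handling may differ.
enum FailureAction
  RetryWithBackoff  # try the same provider again after a delay
  ImmediateFailover # skip retries and move to the next provider
  Unclassified      # not covered by the table (assumption in this sketch)
end

def action_for(status : Int32) : FailureAction
  case status
  when 429, 500..599
    FailureAction::RetryWithBackoff
  when 401, 402, 403
    FailureAction::ImmediateFailover
  else
    FailureAction::Unclassified
  end
end

puts action_for(429) # RetryWithBackoff
puts action_for(503) # RetryWithBackoff
puts action_for(401) # ImmediateFailover
puts action_for(402) # ImmediateFailover
```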
Direct Provider Access
For advanced use cases, access provider clients directly:
# OpenAI
client = Llamero::OpenAIClient.new
response = client.chat([Llamero::Message.user("Hello!")])
# Anthropic
client = Llamero::AnthropicClient.new
response = client.chat([Llamero::Message.user("Hello!")])
# With custom settings
client = Llamero::OpenAIClient.new(
  api_key: "sk-...",
  default_model: "gpt-4o-mini",
  timeout: 5.minutes
)
API Reference
Message
Llamero::Message.system("You are a helpful assistant")
Llamero::Message.user("Hello!")
Llamero::Message.assistant("Hi there!")
Llamero::Message.tool(content, tool_call_id, name)
ChatResponse
response.content # String - the response text
response.model # String - model used
response.usage # Usage - token counts
response.finish_reason # String - why generation stopped
response.parsed # T? - parsed structured output
response.provider_used # Symbol - which provider was used
response.attempts # Int32 - total attempt count
BaseGrammar
Inherit from BaseGrammar to define structured response schemas:
class Analysis < Llamero::BaseGrammar
  property sentiment : String = ""
  property confidence : Float32 = 0.0
  property keywords : Array(String) = [] of String
end
# Get JSON Schema for the grammar
schema = Analysis.to_json_schema
RetryConfig
# Default configuration
Llamero::RetryConfig.new
# Aggressive retries
Llamero::RetryConfig.aggressive
# Conservative (fewer retries)
Llamero::RetryConfig.conservative
# No retries
Llamero::RetryConfig.no_retry
# Custom
Llamero::RetryConfig.new(
  max_retries: 5,
  base_delay: 500.milliseconds,
  max_delay: 30.seconds,
  exponential_base: 2.0,
  jitter: 0.1
)
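To see how these parameters could interact, the sketch below computes a capped exponential backoff schedule. The formula (base_delay multiplied by exponential_base raised to the attempt number, capped at max_delay) is the conventional one and is assumed here, not taken from Llamero's source. backoff_delay is a hypothetical helper, and jitter is left out so the output is deterministic.

```crystal
# Hypothetical helper: delay before retry number `attempt` (0-based),
# using the conventional capped exponential formula. Jitter is omitted.
def backoff_delay(attempt : Int32, base_delay : Time::Span, max_delay : Time::Span, exponential_base : Float64) : Time::Span
  raw = base_delay * (exponential_base ** attempt)
  raw > max_delay ? max_delay : raw
end

(0..6).each do |attempt|
  delay = backoff_delay(attempt, 500.milliseconds, 30.seconds, 2.0)
  puts "attempt #{attempt}: #{delay.total_milliseconds.to_i} ms"
end
```

Under these assumptions, the custom values above give delays of 500 ms, 1 s, 2 s, 4 s, 8 s and 16 s, after which the 30 s cap applies.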
Development
# Run tests
crystal spec
# Type check
crystal build src/llamero.cr --no-codegen
Contributing
Open an issue to discuss features before developing.
Branch naming:
- Bug fixes: issue/1234-description
- Features: feature/1234-description

- Fork it (https://github.com/crimson-knight/llamero/fork)
- Create your feature branch (git checkout -b feature/description)
- Commit your changes (git commit -am 'Add feature')
- Push to the branch (git push origin feature/description)
- Create a Pull Request
Contributors
- Seth Tucker - creator and maintainer
- April 2, 2024
MIT License