Llamero

A Crystal library for interacting with AI/LLM providers with automatic failover and structured output support.

Supported Providers

| Provider | Features | Best For |
|---|---|---|
| OpenAI | Chat, Structured Output, Streaming, Embeddings, Vision | General purpose, GPT-4o |
| Anthropic | Chat, Structured Output, Streaming, Vision | Claude models, long context |
| Groq | Chat, Structured Output, Streaming, Vision | Ultra-fast inference |
| OpenRouter | All features (model-dependent) | Access to 400+ models |

Native Apple/MLX Track

Llamero ships an Apple-first native runtime for local inference from Crystal applications: keep an MLX-backed base model resident on Apple Silicon, stream chat responses through Crystal, parse structured JSON into Crystal objects, and hot-swap LoRA adapters without reloading the base model.

runtime = Llamero::Native::MLXRuntime.new(
  model_id: "mlx-community/gemma-4-e2b-it-4bit"
)

session = runtime.start_session
session.load_model

session.chat_stream([Llamero::Message.user("Hello!")]) do |chunk|
  print chunk
end

# Hot-swap a LoRA adapter while the base model stays resident
runtime.adapters.register("sql", Path["adapters/sql"])
session.activate_adapters(
  Llamero::Native::AdapterStack.additive([Llamero::Native::AdapterSlot.new("sql")])
)

# Or train your own adapter on the resident model (QLoRA on 4-bit models),
# from a golden dataset of prompt/completion pairs - no Python required
dataset = Llamero::Native::TrainingDataset.new(system_prompt: "You are an LX-900 expert.")
dataset.add("What injectors does the LX-900 use?", "BR-7741 injectors at 2,150 PSI.")

session.train_adapter("lx900-manual", dataset) do |progress|
  puts "iter #{progress.iteration}: loss=#{progress.loss}"
end
session.activate_adapters(
  Llamero::Native::AdapterStack.additive([Llamero::Native::AdapterSlot.new("lx900-manual")])
)

The runtime talks to a small Swift bridge (native/llamero-mlx) built on mlx-swift-lm and loaded at runtime via dlopen. If the bridge has not been built, apps automatically fall back to a deterministic mock bridge, so specs and non-Apple development keep working. Build the real bridge with:

cd native/llamero-mlx && ./build.sh
crystal run examples/native_smoke_test.cr   # real on-device inference

Supported models & where they come from

The native track loads MLX-compatible checkpoints straight from the Hugging Face Hub: any repo with a config.json and .safetensors weights whose architecture is supported by the bundled mlx-swift-lm loader. The mlx-community conversions are the recommended source — they are pre-quantized, generally ungated, and tested against the loader. Plain (non-MLX-converted) safetensors repos of supported architectures also work — for example HuggingFaceTB/SmolLM2-135M-Instruct loads and runs fine. GGUF and .bin-only repos are not supported. Local inference requires an Apple Silicon Mac.

Models download on first load to ~/.llamero/models/<org>--<name> (override the root with $LLAMERO_HOME). A .llamero-complete marker file gates cache validity; to force a re-download, delete that model's folder.
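As a rough illustration of the cache layout described above, the sketch below computes a model's cache folder and checks for the marker file. It uses only the Crystal standard library. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that $LLAMERO_HOME replaces the ~/.llamero root with models/ kept beneath it, and that the marker lives inside the model's folder.

```crystal
# Sketch only: mirrors the cache layout described above; not llamero's own code.

# Resolves the cache folder for a model id such as "org/name".
# Assumption: $LLAMERO_HOME replaces the ~/.llamero root and "models/" stays beneath it.
def model_cache_dir(model_id : String, llamero_home : String? = ENV["LLAMERO_HOME"]?) : Path
  root = llamero_home ? Path[llamero_home] : (Path.home / ".llamero")
  root / "models" / model_id.sub("/", "--")
end

# Assumption: the .llamero-complete marker sits inside the model's folder.
def cache_complete?(dir : Path) : Bool
  File.exists?(dir / ".llamero-complete")
end

dir = model_cache_dir("mlx-community/gemma-4-e2b-it-4bit")
puts "#{dir} complete? #{cache_complete?(dir)}"
```

A check like this can be handy in setup scripts that want to warn about a pending download before the first load.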

A model id may also pin a revision — org/name@revision, where the revision is any git sha, tag, or branch on the Hub. Pinned revisions cache separately:

runtime = Llamero::Native::MLXRuntime.new(
  model_id: "mlx-community/gemma-4-e2b-it-4bit@2c3e507453b4f218d05fe3cc97bea5c5a654257e"
)

Why this exists: Hub repos are sometimes re-converted and re-uploaded in place, and a new upload can use a checkpoint layout newer than the bundled loader understands. Pinning a known-good revision keeps you working while the loader catches up. The example above is real: mlx-community/gemma-4-e2b-it-4bit was re-uploaded on 2026-07-06 with a tensor layout the current loader can't read yet, so the examples pin the last-good revision.
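The org/name@revision form can be split with plain string handling. The following is a minimal sketch, not Llamero's parser: ModelRef and parse_model_id are hypothetical names introduced only for illustration.

```crystal
# Hypothetical value type for illustration; not part of llamero's API.
record ModelRef, repo : String, revision : String?

# Splits "org/name@revision" into its parts; revision is nil when no "@" is present.
def parse_model_id(id : String) : ModelRef
  repo, separator, revision = id.partition("@")
  ModelRef.new(repo, separator.empty? ? nil : revision)
end

pinned = parse_model_id("mlx-community/gemma-4-e2b-it-4bit@2c3e507453b4f218d05fe3cc97bea5c5a654257e")
puts pinned.repo
puts pinned.revision
```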

Audio (experimental)

The native track also ships an on-device speech runtime: speech-to-text with NVIDIA Parakeet and text-to-speech with Kokoro. Both run through a second Swift bridge (native/llamero-audio) built on FluidAudio, which uses CoreML on the Neural Engine, so transcription and synthesis never compete with the MLX LLM for the GPU. Models download lazily on first use.

audio = Llamero::Native::AudioRuntime.new   # Parakeet v3 + Kokoro defaults

result = audio.transcribe(Path["meeting.wav"])
result.text       # full transcript
result.segments   # word-level [{text, start_ms, end_ms}]

spoken = audio.speak("I found three problems in that file.", voice: "af_heart")
spoken.path       # wav file, ready to play

Streaming speech-to-text turns the same runtime into a live dictation engine: push 16kHz mono Float32 samples from your capture layer and llamero streams text back — partial hypotheses while a phrase is being spoken, and one completed utterance per detected end of utterance (Parakeet EOU 120M, confirmed after a configurable silence debounce):

stream = audio.start_stream # chunk_ms: 160, eou_debounce_ms: 1280

stream.on_partial { |text| print "\r#{text}" }              # live ghost text
stream.on_utterance { |utterance| handle(utterance.text) }  # completed phrases

while samples = capture.next_chunk # Slice(Float32), 16kHz mono
  stream.push(samples)
end

result = stream.finish # flushes + returns the full session transcript
result.text            # everything said
result.segments        # one {text, start_ms, end_ms} per utterance
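The capture side of the loop above has to deliver 16kHz mono Float32 samples in chunk-sized pieces. The sketch below shows the arithmetic (160 ms at 16 kHz is 2,560 samples) and one way to slice a buffer and convert 16-bit PCM. All helper names are hypothetical, and the [-1.0, 1.0) sample range is an assumption about what the bridge expects, not something the text above states.

```crystal
SAMPLE_RATE = 16_000 # Hz, mono, as described above

# Number of samples in one chunk of `chunk_ms` milliseconds.
def samples_per_chunk(chunk_ms : Int32) : Int32
  SAMPLE_RATE * chunk_ms // 1000
end

# Yields consecutive views into `samples`; the final chunk may be shorter.
def each_chunk(samples : Slice(Float32), chunk_ms : Int32 = 160, &)
  size = samples_per_chunk(chunk_ms)
  offset = 0
  while offset < samples.size
    count = Math.min(size, samples.size - offset)
    yield samples[offset, count]
    offset += count
  end
end

# Converts signed 16-bit PCM to Float32.
# Assumption: the stream expects samples normalized to [-1.0, 1.0).
def pcm16_to_f32(pcm : Slice(Int16)) : Slice(Float32)
  Slice(Float32).new(pcm.size) { |i| pcm[i].to_f32 / 32768_f32 }
end

one_second = Slice(Float32).new(SAMPLE_RATE, 0_f32)
chunks = 0
each_chunk(one_second) { |chunk| chunks += 1 } # a real loop would call stream.push(chunk)
puts "#{samples_per_chunk(160)} samples per chunk, #{chunks} chunks per second of audio"
```

With the defaults quoted above, the 1280 ms debounce corresponds to eight 160 ms chunks of silence.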

If the audio bridge has not been built, the same deterministic mock fallback applies (gate real-audio code on audio.real_bridge?). Build and verify with:

cd native/llamero-audio && ./build.sh
crystal run examples/native_audio_test.cr -- /path/to/speech.wav  # file STT + TTS (verified on-device)
crystal run examples/native_dictation_test.cr -- /path/to/speech.wav  # streaming STT

Status: file transcription and TTS are verified on-device; streaming STT is implemented and spec-covered, pending on-device verification (see the multimodal roadmap below). PCM-streaming TTS is a planned follow-up.

Design docs:

Documentation for AI Coding Agents

Llamero ships its documentation in forms coding assistants can actually use, so even small models can build with the library:

Skills (.claude/skills/): task recipes for cloud-providers, local-inference, and adapter-training, written as complete programs with error→fix tables. With the Ashard fork of shards, shards install copies them into your project as .claude/skills/llamero--<name>/.
CLAUDE.md and AGENTS.md: the condensed API contract for any agent harness.
A golden training dataset (training_data/llamero_api_qa.jsonl): the API as prompt/completion pairs. Train a local model its own llamero adapter with examples/train_llamero_docs_adapter.cr, so the library teaches a model to use the library:

dataset = Llamero::Native::TrainingDataset.from_pairs_jsonl(
  "lib/llamero/training_data/llamero_api_qa.jsonl"
)
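# `session` is a loaded session (see runtime.start_session above);
# `config` is your training configuration, defined elsewhere.
session.train_adapter("llamero-docs", dataset, config)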
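If you want to build a pairs file of your own, JSON Lines is one JSON object per line. The sketch below writes such a file body with the standard library. The key names "prompt" and "completion" are an assumption based on the description above; check training_data/llamero_api_qa.jsonl for the exact schema that from_pairs_jsonl expects. pairs_to_jsonl is a hypothetical helper.

```crystal
require "json"

# Serializes {prompt, completion} pairs as JSON Lines.
# Assumption: the key names match what from_pairs_jsonl reads.
def pairs_to_jsonl(pairs : Array({String, String})) : String
  String.build do |io|
    pairs.each do |(prompt, completion)|
      {prompt: prompt, completion: completion}.to_json(io)
      io << '\n'
    end
  end
end

jsonl = pairs_to_jsonl([
  {"What injectors does the LX-900 use?", "BR-7741 injectors at 2,150 PSI."},
])
print jsonl
```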

Installation

Add the dependency to your shard.yml:

dependencies:
  llamero:
    github: crimson-knight/llamero

Then run:

shards install

Quick Start

Define Your AI Client

require "llamero"

# Create your application's AI client with failover
class MyAIClient < Llamero::Client
  def initialize
    super(
      primary: :openai,
      fallbacks: [:anthropic, :groq]
    )
  end
end

client = MyAIClient.new

Basic Chat

response = client.chat([
  Llamero::Message.user("What is the capital of France?")
])

puts response.content
# => "The capital of France is Paris."

puts "Provider: #{response.provider_used}"
# => "Provider: openai"

Structured Output

Define a response schema using BaseGrammar:

class PersonInfo < Llamero::BaseGrammar
  property name : String = ""
  property age : Int32 = 0
  property occupation : String = ""
end

response = client.chat_structured(
  [Llamero::Message.user("Generate a random person's info")],
  PersonInfo
)

# parsed is nilable (T?); not_nil! raises if it is nil
person = response.parsed.not_nil!
puts "Name: #{person.name}, Age: #{person.age}"
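To illustrate what parsing structured JSON into a Crystal object involves, here is a standalone sketch using the standard library's JSON::Serializable. It is not how BaseGrammar is implemented; PersonPayload and parse_person are hypothetical names. It shows the nil-returning style you might prefer over not_nil! when a model replies with prose.

```crystal
require "json"

# Hypothetical stand-in for a grammar class; llamero's BaseGrammar is not used here.
struct PersonPayload
  include JSON::Serializable

  getter name : String
  getter age : Int32
  getter occupation : String
end

# Returns nil instead of raising when the text is not valid JSON for this shape.
# JSON::SerializableError inherits from JSON::ParseException, so one rescue covers both.
def parse_person(raw : String) : PersonPayload?
  PersonPayload.from_json(raw)
rescue JSON::ParseException
  nil
end

if person = parse_person(%({"name": "Ada", "age": 36, "occupation": "engineer"}))
  puts "Name: #{person.name}, Age: #{person.age}"
end
```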

Streaming

client.chat_stream([
  Llamero::Message.user("Tell me a short story")
]) do |chunk|
  print chunk
end

Configuration

Environment Variables

Set API keys as environment variables:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."
export OPENROUTER_API_KEY="sk-or-..."

Configuration File

Create .llamero/config.yml in your project directory:

providers:
  openai:
    api_key: "sk-..."
    organization: "org-..."  # optional
  anthropic:
    api_key: "sk-ant-..."
  groq:
    api_key: "gsk_..."
  openrouter:
    api_key: "sk-or-..."

defaults:
  provider: openai
  model: gpt-4o
  temperature: 0.7
  max_tokens: 4096

Priority order: Explicit constructor values > Environment variables > Config file > Defaults
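The priority order can be read as "first value that is present wins". A minimal sketch of that rule, assuming empty strings count as unset (an assumption, as are the helper names first_present and resolve_api_key):

```crystal
# Returns the first candidate that is neither nil nor empty.
def first_present(*candidates : String?) : String?
  candidates.find { |value| value && !value.empty? }
end

# Order follows the text above: explicit value, then environment, then config file.
def resolve_api_key(explicit : String?, file_value : String?, env_name : String) : String?
  first_present(explicit, ENV[env_name]?, file_value)
end

key = resolve_api_key(nil, "sk-from-config-file", "OPENAI_API_KEY")
puts key ? "API key resolved" : "no API key configured"
```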

Provider Failover

The unified Client automatically handles failover:

class ResilientClient < Llamero::Client
  def initialize
    super(
      primary: :openai,
      fallbacks: [:anthropic, :groq],
      retry_config: Llamero::RetryConfig.new(
        max_retries: 3,
        base_delay: 1.second
      )
    )

    # Optional: Monitor failovers
    on_fallback do |from, to, error|
      Log.warn { "Failing over from #{from} to #{to}: #{error.message}" }
    end

    on_retry do |provider, attempt, error|
      Log.info { "Retry #{attempt} for #{provider}" }
    end
  end
end
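Conceptually, failover means trying the primary first and moving down the fallback list on failure. The sketch below shows that control flow in isolation with a hypothetical with_failover helper. It is not Llamero::Client's implementation, and it omits retries and callbacks.

```crystal
# Tries each provider in order and returns {provider, reply} from the first success.
# Re-raises the last error if every provider fails.
def with_failover(providers : Array(Symbol), & : Symbol -> String) : {Symbol, String}
  last_error = nil
  providers.each do |provider|
    begin
      reply = yield provider
      return {provider, reply}
    rescue ex
      last_error = ex
    end
  end
  raise last_error || ArgumentError.new("no providers configured")
end

provider, reply = with_failover([:openai, :anthropic, :groq]) do |name|
  raise "simulated outage" if name == :openai
  "hello from #{name}"
end
puts "#{provider}: #{reply}"
```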

Retry Behavior

| Error Type | Behavior |
|---|---|
| Rate Limit (429) | Retry with exponential backoff |
| Server Error (5xx) | Retry with backoff |
| Auth Error (401/403) | Immediate failover (no retry) |
| Quota Exceeded (402) | Immediate failover |
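The table maps naturally onto a status-code classifier. This is a sketch of that mapping, not the library's code: ErrorAction and action_for are hypothetical names, and the Unspecified case covers statuses the table does not mention.

```crystal
enum ErrorAction
  RetryWithBackoff
  FailoverImmediately
  Unspecified # statuses the table above does not cover
end

def action_for(status : Int32) : ErrorAction
  case status
  when 429, 500..599 then ErrorAction::RetryWithBackoff
  when 401, 402, 403 then ErrorAction::FailoverImmediately
  else                    ErrorAction::Unspecified
  end
end

[429, 503, 401, 402].each do |status|
  puts "#{status}: #{action_for(status)}"
end
```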

Direct Provider Access

For advanced use cases, access provider clients directly:

# OpenAI
client = Llamero::OpenAIClient.new
response = client.chat([Llamero::Message.user("Hello!")])

# Anthropic
client = Llamero::AnthropicClient.new
response = client.chat([Llamero::Message.user("Hello!")])

# With custom settings
client = Llamero::OpenAIClient.new(
  api_key: "sk-...",
  default_model: "gpt-4o-mini",
  timeout: 5.minutes
)

API Reference

Message

Llamero::Message.system("You are a helpful assistant")
Llamero::Message.user("Hello!")
Llamero::Message.assistant("Hi there!")
Llamero::Message.tool(content, tool_call_id, name)

ChatResponse

response.content        # String - the response text
response.model          # String - model used
response.usage          # Usage - token counts
response.finish_reason  # String - why generation stopped
response.parsed         # T? - parsed structured output
response.provider_used  # Symbol - which provider was used
response.attempts       # Int32 - total attempt count

BaseGrammar

Inherit from BaseGrammar to define structured response schemas:

class Analysis < Llamero::BaseGrammar
  property sentiment : String = ""
  property confidence : Float32 = 0.0
  property keywords : Array(String) = [] of String
end

# Get JSON Schema for the grammar
schema = Analysis.to_json_schema

RetryConfig

# Default configuration
Llamero::RetryConfig.new

# Aggressive retries
Llamero::RetryConfig.aggressive

# Conservative (fewer retries)
Llamero::RetryConfig.conservative

# No retries
Llamero::RetryConfig.no_retry

# Custom
Llamero::RetryConfig.new(
  max_retries: 5,
  base_delay: 500.milliseconds,
  max_delay: 30.seconds,
  exponential_base: 2.0,
  jitter: 0.1
)
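To make the custom parameters concrete, the sketch below computes a delay the way exponential backoff with jitter is commonly done. It is an assumption about how these fields combine, not Llamero's documented formula: attempts are numbered from 1, the delay is capped at max_delay, and jitter is treated as a plus-or-minus fraction of the capped delay. backoff_delay and its roll parameter are hypothetical.

```crystal
# delay = min(base_delay * exponential_base^(attempt - 1), max_delay), then +/- jitter.
# `roll` is a random number in [0, 1); it is a parameter so the result can be made deterministic.
def backoff_delay(attempt : Int32,
                  base_delay : Time::Span = 500.milliseconds,
                  max_delay : Time::Span = 30.seconds,
                  exponential_base : Float64 = 2.0,
                  jitter : Float64 = 0.1,
                  roll : Float64 = rand) : Time::Span
  raw = base_delay.total_seconds * exponential_base ** (attempt - 1)
  capped = Math.min(raw, max_delay.total_seconds)
  offset = (2 * roll - 1) * capped * jitter
  (capped + offset).seconds
end

(1..8).each do |attempt|
  puts "attempt #{attempt}: #{backoff_delay(attempt, jitter: 0.0).total_seconds}s"
end
```

With jitter disabled and these defaults, the delay doubles from 0.5 s on each attempt until it reaches the 30 s cap.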

Development

# Run tests
crystal spec

# Type check
crystal build src/llamero.cr --no-codegen

Troubleshooting

Common failures when loading or running local models, and what to do about them:

| Symptom | What it means | What to do |
|---|---|---|
| HTTP 404 while listing model files | The model id is typo'd or the repo doesn't exist | Check the spelling; browse https://huggingface.co/mlx-community for the exact id |
| HTTP 401/403 during download | The repo is gated and needs authentication | Set `HF_TOKEN` (or `HUGGING_FACE_HUB_TOKEN`) to a Hugging Face token with access, or use an ungated `mlx-community` conversion |
| "has no .safetensors weights on the Hugging Face Hub" | The repo is GGUF-only, `.bin`-only, or has no weights at all | Use an `mlx-community` conversion of the model, or convert it yourself with `mlx_lm.convert` |
| "Model load failed ... mismatchedSize/keyNotFound ... checkpoint's layout doesn't match the bundled MLX loader" | Version skew between the checkpoint and the loader, often an upstream re-upload of the repo | Update llamero and rebuild the bridge (`native/llamero-mlx/build.sh`), or pin a known-good revision with `model@revision` (see "Supported models" above) |
| stderr warning: "llamero: MLX bridge not found — using MOCK inference" | The native bridge isn't built, so you're getting canned fake output | Run `native/llamero-mlx/build.sh` |
| `StructuredParseError` (model emitted prose instead of JSON) | A model-capability issue, not a load problem | Use a larger / better instruction-following model, or loosen the schema; retries are cheap since the model stays loaded |
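When debugging a 401/403, it can help to confirm which token, if any, your process actually sees. The sketch below looks up the two environment variables named in the table and builds the bearer header the Hugging Face Hub accepts. How Llamero itself reads and sends the token is not shown in this README, so treat the lookup order and the helper names as assumptions.

```crystal
require "http/headers"

# Assumption: HF_TOKEN takes precedence over HUGGING_FACE_HUB_TOKEN.
def hub_token : String?
  ENV["HF_TOKEN"]? || ENV["HUGGING_FACE_HUB_TOKEN"]?
end

# Builds request headers; Authorization is omitted when no token is set.
def hub_headers(token : String? = hub_token) : HTTP::Headers
  headers = HTTP::Headers.new
  headers["Authorization"] = "Bearer #{token}" if token
  headers
end

puts hub_token ? "Hugging Face token found" : "no Hugging Face token set"
```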

Note: Gemma-3 multimodal 4B/12B/27B conversions ship with an incomplete text_config; llamero auto-patches this at download time (restoring num_attention_heads, num_key_value_heads, and head_dim), so these repos load without manual fixes.

Hardware: local inference requires an Apple Silicon Mac running a recent macOS. As a rough guide, budget the checkpoint size plus 1-2GB of overhead in RAM — a 3.4GB 4-bit model peaked around 2.5GB of GPU memory in our tests.

Contributing

Open an issue to discuss features before developing.

Branch naming:

Bug fixes: issue/1234-description
Features: feature/1234-description

Fork it (https://github.com/crimson-knight/llamero/fork)
Create your feature branch (git checkout -b feature/description)
Commit your changes (git commit -am 'Add feature')
Push to the branch (git push origin feature/description)
Create a Pull Request

Contributors

Seth Tucker - creator and maintainer

License

MIT License
