Llamero

A Crystal library for interacting with AI/LLM providers with automatic failover and structured output support.

Supported Providers

| Provider   | Features                                               | Best For                     |
|------------|--------------------------------------------------------|------------------------------|
| OpenAI     | Chat, Structured Output, Streaming, Embeddings, Vision | General purpose, GPT-4o      |
| Anthropic  | Chat, Structured Output, Streaming, Vision             | Claude models, long context  |
| Groq       | Chat, Structured Output, Streaming, Vision             | Ultra-fast inference         |
| OpenRouter | All features (model-dependent)                         | Access to 400+ models        |

Native Apple/MLX Track

Llamero ships an Apple-first native runtime for local inference from Crystal applications: keep an MLX-backed base model resident on Apple Silicon, stream chat responses through Crystal, parse structured JSON into Crystal objects, and hot-swap LoRA adapters without reloading the base model.

runtime = Llamero::Native::MLXRuntime.new(
  model_id: "mlx-community/gemma-4-e2b-it-4bit"
)

session = runtime.start_session
session.load_model

session.chat_stream([Llamero::Message.user("Hello!")]) do |chunk|
  print chunk
end

# Hot-swap a LoRA adapter while the base model stays resident
runtime.adapters.register("sql", Path["adapters/sql"])
session.activate_adapters(
  Llamero::Native::AdapterStack.additive([Llamero::Native::AdapterSlot.new("sql")])
)

# Or train your own adapter on the resident model (QLoRA on 4-bit models),
# from a golden dataset of prompt/completion pairs - no Python required
dataset = Llamero::Native::TrainingDataset.new(system_prompt: "You are an LX-900 expert.")
dataset.add("What injectors does the LX-900 use?", "BR-7741 injectors at 2,150 PSI.")

session.train_adapter("lx900-manual", dataset) do |progress|
  puts "iter #{progress.iteration}: loss=#{progress.loss}"
end
session.activate_adapters(
  Llamero::Native::AdapterStack.additive([Llamero::Native::AdapterSlot.new("lx900-manual")])
)

The runtime talks to a small Swift bridge (native/llamero-mlx) built on mlx-swift-lm and loaded at runtime via dlopen. Apps that have not built the bridge automatically fall back to a deterministic mock bridge, so specs and non-Apple development keep working. Build the real bridge with:

cd native/llamero-mlx && ./build.sh
crystal run examples/native_smoke_test.cr   # real on-device inference

Audio (experimental)

The native track also ships an on-device speech runtime: speech-to-text with NVIDIA Parakeet and text-to-speech with Kokoro, running through a second Swift bridge (native/llamero-audio) built on FluidAudio. The bridge runs on CoreML on the Neural Engine, so transcription and synthesis never compete with the MLX LLM for the GPU. Models download lazily on first use.

audio = Llamero::Native::AudioRuntime.new   # Parakeet v3 + Kokoro defaults

result = audio.transcribe(Path["meeting.wav"])
result.text       # full transcript
result.segments   # word-level [{text, start_ms, end_ms}]

spoken = audio.speak("I found three problems in that file.", voice: "af_heart")
spoken.path       # wav file, ready to play

Streaming speech-to-text turns the same runtime into a live dictation engine. Push 16kHz mono Float32 samples from your capture layer and llamero streams text back: partial hypotheses while a phrase is being spoken, and one completed utterance per detected end of utterance (Parakeet EOU 120M, confirmed after a configurable silence debounce):

stream = audio.start_stream # chunk_ms: 160, eou_debounce_ms: 1280

stream.on_partial { |text| print "\r#{text}" }              # live ghost text
stream.on_utterance { |utterance| handle(utterance.text) }  # completed phrases

while samples = capture.next_chunk # Slice(Float32), 16kHz mono
  stream.push(samples)
end

result = stream.finish # flushes + returns the full session transcript
result.text            # everything said
result.segments        # one {text, start_ms, end_ms} per utterance
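The chunk size implied by the defaults above can be worked out directly: 160 ms of 16kHz mono audio is 2,560 samples, and a 1,280 ms debounce spans eight such chunks. The sketch below is a minimal illustration that does not use llamero; each_chunk is a hypothetical helper, and a buffer of silence stands in for a real capture layer.

```crystal
SAMPLE_RATE_HZ = 16_000
CHUNK_MS       =    160

# 16_000 samples/s * 160 ms / 1000 = 2_560 samples per chunk.
SAMPLES_PER_CHUNK = SAMPLE_RATE_HZ * CHUNK_MS // 1000

# Hypothetical helper: yields consecutive fixed-size views into the buffer.
# The final chunk may be shorter than chunk_size.
def each_chunk(samples : Slice(Float32), chunk_size : Int32, &)
  offset = 0
  while offset < samples.size
    count = Math.min(chunk_size, samples.size - offset)
    yield samples[offset, count]
    offset += count
  end
end

# One second of silence stands in for captured audio.
buffer = Slice(Float32).new(SAMPLE_RATE_HZ, 0.0_f32)

sizes = [] of Int32
each_chunk(buffer, SAMPLES_PER_CHUNK) { |chunk| sizes << chunk.size }

puts SAMPLES_PER_CHUNK # samples in one 160 ms chunk
puts sizes             # six full chunks, then the 640-sample remainder
```

In a real capture loop, each yielded slice would be what gets handed to stream.push.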

If the audio bridge has not been built, the same deterministic mock fallback applies, so gate real-audio code on audio.real_bridge?. Build and verify with:

cd native/llamero-audio && ./build.sh
crystal run examples/native_audio_test.cr -- /path/to/speech.wav  # file STT + TTS (verified on-device)
crystal run examples/native_dictation_test.cr -- /path/to/speech.wav  # streaming STT

Status: file transcription and TTS are verified on-device; streaming STT is implemented and spec-covered, pending on-device verification (see the multimodal roadmap). PCM-streaming TTS is a planned follow-up.

Documentation for AI Coding Agents

Llamero ships its documentation in forms coding assistants can actually use, so even small models can build with the library:

  • Skills (.claude/skills/): task recipes for cloud-providers, local-inference, and adapter-training, written as complete programs with error→fix tables. If you use the Ashard fork of shards, running shards install copies them into your project as .claude/skills/llamero--<name>/.
  • CLAUDE.md and AGENTS.md: the condensed API contract for any agent harness.
  • A golden training dataset (training_data/llamero_api_qa.jsonl): the API as prompt/completion pairs. Train a local model its own llamero adapter with examples/train_llamero_docs_adapter.cr, so the library teaches a model to use the library:

dataset = Llamero::Native::TrainingDataset.from_pairs_jsonl(
  "lib/llamero/training_data/llamero_api_qa.jsonl"
)
# `config` is a training configuration defined elsewhere (not shown here)
session.train_adapter("llamero-docs", dataset, config)

Installation

Add the dependency to your shard.yml:

dependencies:
  llamero:
    github: crimson-knight/llamero

Then run:

shards install

Quick Start

Define Your AI Client

require "llamero"

# Create your application's AI client with failover
class MyAIClient < Llamero::Client
  def initialize
    super(
      primary: :openai,
      fallbacks: [:anthropic, :groq]
    )
  end
end

client = MyAIClient.new

Basic Chat

response = client.chat([
  Llamero::Message.user("What is the capital of France?")
])

puts response.content
# => "The capital of France is Paris."

puts "Provider: #{response.provider_used}"
# => "Provider: openai"

Structured Output

Define a response schema using BaseGrammar:

class PersonInfo < Llamero::BaseGrammar
  property name : String = ""
  property age : Int32 = 0
  property occupation : String = ""
end

response = client.chat_structured(
  [Llamero::Message.user("Generate a random person's info")],
  PersonInfo
)

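# response.parsed is nilable (T?), so handle the nil case rather than calling not_nil!
if person = response.parsed
  puts "Name: #{person.name}, Age: #{person.age}"
else
  puts "Structured output could not be parsed"
end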
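
Conceptually, structured output means turning a model's JSON reply into a typed Crystal object. The sketch below shows only that step, using the standard library's JSON::Serializable. PersonRecord and the sample JSON are hypothetical, and nothing here is confirmed to be how Llamero::BaseGrammar works internally.

```crystal
require "json"

# Hypothetical stand-in for a grammar class; this is NOT Llamero::BaseGrammar.
struct PersonRecord
  include JSON::Serializable

  # Defaults make each key optional in the incoming JSON.
  getter name : String = ""
  getter age : Int32 = 0
  getter occupation : String = ""
end

# A made-up model reply that omits "occupation".
raw = %({"name": "Ada", "age": 36})
person = PersonRecord.from_json(raw)

puts person.name              # value taken from the JSON
puts person.age               # value taken from the JSON
puts person.occupation.empty? # missing key falls back to the default ""
```

Defaults on every property, as in the PersonInfo example above, mean a partially filled reply still produces a usable object.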

Streaming

client.chat_stream([
  Llamero::Message.user("Tell me a short story")
]) do |chunk|
  print chunk
end

Configuration

Environment Variables

Set API keys as environment variables:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."
export OPENROUTER_API_KEY="sk-or-..."

Configuration File

Create .llamero/config.yml in your project directory:

providers:
  openai:
    api_key: "sk-..."
    organization: "org-..."  # optional
  anthropic:
    api_key: "sk-ant-..."
  groq:
    api_key: "gsk_..."
  openrouter:
    api_key: "sk-or-..."

defaults:
  provider: openai
  model: gpt-4o
  temperature: 0.7
  max_tokens: 4096

Priority order: Explicit constructor values > Environment variables > Config file > Defaults
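
The priority order can be read as "first source that supplies a value wins". A minimal sketch of that rule follows. resolve_setting and the DEMO_API_KEY variable are hypothetical, introduced only for illustration, and are not part of llamero.

```crystal
# Hypothetical helper: returns the first candidate that is present and non-empty,
# checking explicit value, environment variable, config-file value, then default.
def resolve_setting(explicit : String?, env_key : String,
                    file_value : String?, default : String?) : String?
  [explicit, ENV[env_key]?, file_value, default].find { |v| v && !v.empty? }
end

ENV["DEMO_API_KEY"] = "from-env"

# Explicit value beats everything else.
with_explicit = resolve_setting("from-arg", "DEMO_API_KEY", "from-file", nil)
# No explicit value: the environment variable beats the config file.
with_env = resolve_setting(nil, "DEMO_API_KEY", "from-file", nil)

ENV.delete("DEMO_API_KEY")

# No explicit value and no environment variable: the config file is used.
with_file = resolve_setting(nil, "DEMO_API_KEY", "from-file", "fallback")
# Nothing set anywhere: the default applies.
with_default = resolve_setting(nil, "DEMO_API_KEY", nil, "fallback")

puts with_explicit, with_env, with_file, with_default
```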

Provider Failover

The unified Client automatically handles failover:

class ResilientClient < Llamero::Client
  def initialize
    super(
      primary: :openai,
      fallbacks: [:anthropic, :groq],
      retry_config: Llamero::RetryConfig.new(
        max_retries: 3,
        base_delay: 1.second
      )
    )

    # Optional: Monitor failovers
    on_fallback do |from, to, error|
      Log.warn { "Failing over from #{from} to #{to}: #{error.message}" }
    end

    on_retry do |provider, attempt, error|
      Log.info { "Retry #{attempt} for #{provider}" }
    end
  end
end

Retry Behavior

| Error Type           | Behavior                       |
|----------------------|--------------------------------|
| Rate Limit (429)     | Retry with exponential backoff |
| Server Error (5xx)   | Retry with backoff             |
| Auth Error (401/403) | Immediate failover (no retry)  |
| Quota Exceeded (402) | Immediate failover             |
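
The table maps HTTP status codes to one of two actions. The sketch below restates it as a small classifier. action_for and the symbol names are hypothetical, and statuses the table does not mention are reported as :not_covered rather than guessed.

```crystal
# Hypothetical classifier mirroring the retry-behavior table.
def action_for(status : Int32) : Symbol
  case status
  when 429, 500..599 then :retry_with_backoff # rate limit or server error
  when 401, 402, 403 then :failover           # auth or quota: skip retries
  else                    :not_covered        # the table says nothing about these
  end
end

[429, 503, 401, 402, 404].each do |status|
  puts "#{status}: #{action_for(status)}"
end
```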

Direct Provider Access

For advanced use cases, access provider clients directly:

# OpenAI
client = Llamero::OpenAIClient.new
response = client.chat([Llamero::Message.user("Hello!")])

# Anthropic
client = Llamero::AnthropicClient.new
response = client.chat([Llamero::Message.user("Hello!")])

# With custom settings
client = Llamero::OpenAIClient.new(
  api_key: "sk-...",
  default_model: "gpt-4o-mini",
  timeout: 5.minutes
)

API Reference

Message

Llamero::Message.system("You are a helpful assistant")
Llamero::Message.user("Hello!")
Llamero::Message.assistant("Hi there!")
Llamero::Message.tool(content, tool_call_id, name)

ChatResponse

response.content        # String - the response text
response.model          # String - model used
response.usage          # Usage - token counts
response.finish_reason  # String - why generation stopped
response.parsed         # T? - parsed structured output
response.provider_used  # Symbol - which provider was used
response.attempts       # Int32 - total attempt count

BaseGrammar

Inherit from BaseGrammar to define structured response schemas:

class Analysis < Llamero::BaseGrammar
  property sentiment : String = ""
  property confidence : Float32 = 0.0
  property keywords : Array(String) = [] of String
end

# Get JSON Schema for the grammar
schema = Analysis.to_json_schema

RetryConfig

# Default configuration
Llamero::RetryConfig.new

# Aggressive retries
Llamero::RetryConfig.aggressive

# Conservative (fewer retries)
Llamero::RetryConfig.conservative

# No retries
Llamero::RetryConfig.no_retry

# Custom
Llamero::RetryConfig.new(
  max_retries: 5,
  base_delay: 500.milliseconds,
  max_delay: 30.seconds,
  exponential_base: 2.0,
  jitter: 0.1
)
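
One common way to turn these parameters into wait times is capped exponential backoff with proportional jitter. The formula below is an assumption for illustration; this README does not confirm it as llamero's actual computation, and backoff_delay is a hypothetical helper.

```crystal
# Hypothetical helper (assumed formula):
#   delay = min(base_delay * exponential_base^(attempt - 1), max_delay)
# then spread by up to +/- jitter (0.1 means +/- 10%).
def backoff_delay(attempt : Int32, base_delay : Time::Span, max_delay : Time::Span,
                  exponential_base : Float64, jitter : Float64) : Time::Span
  raw = base_delay.total_seconds * exponential_base ** (attempt - 1)
  capped = Math.min(raw, max_delay.total_seconds)
  factor = 1.0 + jitter * (rand * 2.0 - 1.0) # rand is in [0, 1)
  (capped * factor).seconds
end

# With jitter disabled the schedule is deterministic:
# it doubles from 0.5 s on each attempt and is capped at 30 s.
(1..7).each do |attempt|
  delay = backoff_delay(attempt, 500.milliseconds, 30.seconds, 2.0, 0.0)
  puts "attempt #{attempt}: #{delay.total_seconds}s"
end
```

Jitter spreads retries from many clients so they do not all hit a recovering provider at the same instant.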

Development

# Run tests
crystal spec

# Type check
crystal build src/llamero.cr --no-codegen

Contributing

Open an issue to discuss features before developing.

Branch naming:

  • Bug fixes: issue/1234-description
  • Features: feature/1234-description

  1. Fork it (https://github.com/crimson-knight/llamero/fork)
  2. Create your feature branch (git checkout -b feature/description)
  3. Commit your changes (git commit -am 'Add feature')
  4. Push to the branch (git push origin feature/description)
  5. Create a Pull Request

License

MIT License