cogni-ml v0.40.0
Cogni-ML
Crystal machine learning library with native Apple Silicon GPU acceleration.
Highlights:
- Native Metal GPU embedding pipeline — 43ms for 260 tokens on M2 Max (2.2x faster than baseline)
- GGUF model loading with Q5_K/Q6_K quantization support
- simdgroup_matrix_multiply_accumulate GEMM kernels
- Compute graph with automatic wave-based barrier optimization
- Autograd engine, NN layers, Adam optimizer
- llama.cpp bindings for any GGUF model
Architecture
src/ml/
  core/      Tensor, Shape, MetalBuffer
  autograd/  Variable, GradFn (backward pass)
  nn/        Linear, LayerNorm, MultiHeadAttention, ViT
  optim/     Adam/AdamW
  llm/       llama.cpp FFI bindings
  gguf/      GGUF reader, tokenizer, dequantization, NomicBertMoE
  metal/     Device, ComputeEncoder, ComputeGraph, GraphEncoder
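To make the "GGUF reader" entry concrete, here is a minimal sketch of the first thing any GGUF reader has to do: validate the magic bytes and read the fixed header fields. It follows the public GGUF layout (version 2 and later, little-endian) and is not taken from this library's source; `GGUFHeader` and `read_gguf_header` are hypothetical names, and the header bytes are synthesized in memory so the sketch runs without a model file.

```crystal
# Fixed-size fields at the start of a GGUF file (v2+, little-endian).
# Hypothetical type, not part of cogni-ml.
record GGUFHeader, version : UInt32, tensor_count : UInt64, metadata_kv_count : UInt64

# Read and validate the header from any IO positioned at the start of the file.
def read_gguf_header(io : IO) : GGUFHeader
  magic = Bytes.new(4)
  io.read_fully(magic)
  raise "not a GGUF file" unless String.new(magic) == "GGUF"

  le = IO::ByteFormat::LittleEndian
  version = io.read_bytes(UInt32, le)
  tensor_count = io.read_bytes(UInt64, le)
  metadata_kv_count = io.read_bytes(UInt64, le)
  GGUFHeader.new(version, tensor_count, metadata_kv_count)
end

# Build a synthetic 24-byte header in memory (made-up counts).
io = IO::Memory.new
io.write("GGUF".to_slice)
io.write_bytes(3_u32, IO::ByteFormat::LittleEndian)
io.write_bytes(2_u64, IO::ByteFormat::LittleEndian)
io.write_bytes(5_u64, IO::ByteFormat::LittleEndian)
io.rewind

header = read_gguf_header(io)
puts "version=#{header.version} tensors=#{header.tensor_count} metadata=#{header.metadata_kv_count}"
```

The metadata key/value pairs and tensor descriptors follow the header; parsing them is out of scope for this sketch.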
GPU Embedding Pipeline
The crown jewel: a fully native Metal compute pipeline for nomic-embed-text-v2-moe BERT embeddings.
require "ml"
require "ml/gguf/nomic_bert"
require "ml/gguf/metal_backend"
require "ml/metal/compute_graph"
ML::Metal::Device.init!
model = ML::GGUF::NomicBertMoE.from_gguf("path/to/model.gguf", ML::GGUF::MetalBackend.new)
embedding = model.embed("Your text here") # → Array(Float32), dim=768
Performance (Apple M2 Max, 38 GPU cores)
| Tokens | Latency |
|---|---|
| 20 | 14ms |
| 94 | 16ms |
| 196 | 33ms |
| 433 | 70ms |
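Latency alone hides how much better the GPU is utilized at longer inputs. As a small worked example, the following sketch converts the table rows into tokens per second; the numbers are copied from the table, and `tokens_per_second` is a helper defined here, not a library call.

```crystal
# Latency figures copied from the table above: {tokens, milliseconds}.
MEASUREMENTS = [{20, 14}, {94, 16}, {196, 33}, {433, 70}]

# Tokens processed per second for one measurement.
def tokens_per_second(tokens : Int32, latency_ms : Int32) : Float64
  tokens * 1000.0 / latency_ms
end

MEASUREMENTS.each do |pair|
  tokens, ms = pair
  puts "#{tokens} tokens: #{tokens_per_second(tokens, ms).round.to_i} tokens/s"
end
```

The 20-token row comes out far below the others, which suggests a fixed per-call cost dominates short inputs.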
What's inside
- simdgroup_matrix GEMM — hardware-accelerated 8x8 matrix tiles for Q5_K/Q6_K dequant+multiply
- Batched expert GEMM — all 8 MoE experts in 1 dispatch (LTP Diamond surgery)
- ComputeGraph — automatic wave scheduling with offset-aware + Block Integrity dependency analysis
- GraphEncoder — drop-in ComputeEncoder replacement that builds the compute graph
- Fused kernels — QKV split+RoPE, gate+softmax+topk, atomic scatter, f32 norm2
- Indirect dispatch — GPU-driven threadgroup counts, zero CPU-GPU sync for MoE routing
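The idea behind wave scheduling can be sketched in a few lines. This is a simplified illustration, not the library's ComputeGraph: it tracks whole buffers by name rather than byte offsets, and `Op`, `conflict?`, and `schedule_waves` are hypothetical names. Each op is placed in the earliest wave that comes after every earlier op it has a data hazard with, so a barrier is needed only between waves rather than between every pair of dispatches.

```crystal
# Hypothetical op description: a name plus the buffers it reads and writes.
record Op, name : String, reads : Set(String), writes : Set(String)

# Two ops conflict if either one writes a buffer the other reads or writes.
def conflict?(a : Op, b : Op) : Bool
  a.writes.intersects?(b.reads) ||
    a.writes.intersects?(b.writes) ||
    a.reads.intersects?(b.writes)
end

# Assign each op to the earliest wave after all earlier conflicting ops.
def schedule_waves(ops : Array(Op)) : Array(Array(String))
  wave_of = [] of Int32
  ops.each_with_index do |op, i|
    wave = 0
    (0...i).each do |j|
      wave = Math.max(wave, wave_of[j] + 1) if conflict?(ops[j], op)
    end
    wave_of << wave
  end

  waves = [] of Array(String)
  ops.each_with_index do |op, i|
    while waves.size <= wave_of[i]
      waves << [] of String
    end
    waves[wave_of[i]] << op.name
  end
  waves
end

# A toy attention-like graph (buffer names are made up).
ops = [
  Op.new("q_proj", Set{"x"}, Set{"q"}),
  Op.new("k_proj", Set{"x"}, Set{"k"}),
  Op.new("v_proj", Set{"x"}, Set{"v"}),
  Op.new("scores", Set{"q", "k"}, Set{"s"}),
  Op.new("context", Set{"s", "v"}, Set{"out"}),
]

waves = schedule_waves(ops)
waves.each_with_index { |names, i| puts "wave #{i}: #{names.join(", ")}" }
```

In this toy graph the three projections share a wave, so five dispatches need two barriers instead of four.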
Supported models
| Model | Format | Status |
|---|---|---|
| nomic-embed-text-v2-moe | GGUF Q5_K_M | Full native Metal pipeline |
| Any BERT-like encoder | GGUF | Via NomicBertMoE (if architecture matches) |
| Llama, Qwen, Mistral, etc. | GGUF | Via llama.cpp bindings |
Installation
# shard.yml
dependencies:
  cogni-ml:
    github: anthropics/cogni-ml # or local path
    version: ~> 0.10.0
Build with Metal GPU
make build # Compiles bridge.mm + links Metal frameworks
make spec # Run tests with GPU
EMBED_MODEL=/path/to/nomic.gguf make profile_nomic # Stage breakdown for native Metal embeddings
EMBED_MODEL=/path/to/nomic.gguf make profile_nomic_layers # Per-layer hotspot breakdown
EMBED_MODEL=/path/to/nomic.gguf make profile_nomic_vs_llama # Head-to-head vs llama.cpp
EMBED_MODEL=/path/to/nomic.gguf make profile_nomic_vs_llama ARGS="--runs=15 --warmup=6" # Override benchmark depth
CPU-only build
crystal build -Dcpu_only your_app.cr
Quick Start
Tensor + Autograd (CPU)
require "ml"
x = ML::Autograd::Variable.rand(2, 3, requires_grad: true, device: ML::Tensor::Device::CPU)
layer = ML::NN::Linear.new(3, 4, device: ML::Tensor::Device::CPU)
out = layer.forward(x)
loss = out.mean
loss.backward
opt = ML::Optim::Adam.new(layer.parameters)
opt.step
opt.zero_grad
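For readers unfamiliar with what an Adam `step` does, here is a minimal scalar sketch of the standard Adam update rule in plain Crystal. It is independent of `ML::Optim::Adam`: the `ScalarAdam` class is hypothetical, and the default hyperparameters are the commonly cited ones, not necessarily the library's.

```crystal
# Standard Adam update for a single scalar parameter (illustrative only).
class ScalarAdam
  getter value : Float64

  def initialize(@value : Float64, @lr : Float64 = 1e-3,
                 @beta1 : Float64 = 0.9, @beta2 : Float64 = 0.999,
                 @eps : Float64 = 1e-8)
    @m = 0.0 # running mean of gradients
    @v = 0.0 # running mean of squared gradients
    @t = 0   # step counter for bias correction
  end

  def step(grad : Float64) : Nil
    @t += 1
    @m = @beta1 * @m + (1.0 - @beta1) * grad
    @v = @beta2 * @v + (1.0 - @beta2) * grad * grad
    m_hat = @m / (1.0 - @beta1 ** @t)
    v_hat = @v / (1.0 - @beta2 ** @t)
    @value -= @lr * m_hat / (Math.sqrt(v_hat) + @eps)
  end
end

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
opt = ScalarAdam.new(0.0, lr: 0.1)
500.times { opt.step(2.0 * (opt.value - 3.0)) }
puts "x after 500 steps: #{opt.value.round(3)}"
```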
LLM Inference (llama.cpp)
require "ml/llm/llama"
ML::LLM.init
model = ML::LLM::Model.new("path/to/model.gguf")
gen = ML::LLM::Generator.new(model)
puts gen.ask("What is Crystal?", max_tokens: 100)
ML::LLM.cleanup
GGUF Embeddings (Metal GPU)
require "ml"
require "ml/gguf/nomic_bert"
require "ml/gguf/metal_backend"
require "ml/metal/compute_graph"
ML::Metal::Device.init!
model = ML::GGUF::NomicBertMoE.from_gguf(
  "nomic-embed-text-v2-moe.Q5_K_M.gguf",
  ML::GGUF::MetalBackend.new
)
# Single embedding
vec = model.embed("Crystal programming language")
puts "dim=#{vec.size}" # 768
# Batch embedding
vecs = model.embed_batch(["Hello", "World", "Crystal"])
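Embeddings are usually compared with cosine similarity. The helper below is a minimal sketch that works on any pair of `Array(Float32)` vectors, which is the type the `embed` example reports; `cosine_similarity` is defined here and is not a library function. Short made-up vectors stand in for real 768-dimensional embeddings so the sketch runs without a model or GPU.

```crystal
# Cosine similarity between two equal-length vectors, accumulated in Float64.
def cosine_similarity(a : Array(Float32), b : Array(Float32)) : Float64
  raise ArgumentError.new("dimension mismatch") unless a.size == b.size

  dot = 0.0
  norm_a = 0.0
  norm_b = 0.0
  a.each_with_index do |x, i|
    xf = x.to_f64
    yf = b[i].to_f64
    dot += xf * yf
    norm_a += xf * xf
    norm_b += yf * yf
  end

  # Treat a zero vector as having no similarity to anything.
  return 0.0 if norm_a == 0.0 || norm_b == 0.0
  dot / Math.sqrt(norm_a * norm_b)
end

# Stand-in vectors; with a loaded model these would come from model.embed.
query = [1.0_f32, 0.0_f32, 1.0_f32]
same_direction = [2.0_f32, 0.0_f32, 2.0_f32]
orthogonal = [0.0_f32, 1.0_f32, 0.0_f32]

puts cosine_similarity(query, same_direction)
puts cosine_similarity(query, orthogonal)
```

The helper normalizes internally, so it does not assume the model returns unit-length vectors.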
Metal Kernels
11 Metal shader files, including:
| Kernel | Purpose |
|---|---|
| gemm_mm.metal | simdgroup_matrix GEMM for Q5_K/Q6_K + batched expert variants |
| gemm_simd.metal | Scalar SIMD GEMM (small batch fallback) |
| attention_matmul.metal | Flash attention with simdgroup_matrix Q*K^T |
| bert_fp16.metal | Fused ops: QKV split+RoPE, gate+softmax+topk, norms, scatter, routing |
| gemm_mm_f16.metal | FP16 GEMM (experimental) |
| nn.metal | General NN ops (linear, layernorm, GELU) |
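As a CPU reference for what the fused gate+softmax+topk step computes, the sketch below applies a numerically stable softmax to router logits for 8 experts and picks the k most probable. It illustrates the math only, not the Metal kernel: `softmax` and `top_k` are helpers defined here, the logits are made up, k = 2 is an assumption, and whether the selected weights are renormalized afterwards is not stated in this README, so the sketch leaves them as-is.

```crystal
# Numerically stable softmax over router logits.
def softmax(logits : Array(Float32)) : Array(Float32)
  peak = logits.max
  exps = logits.map { |v| Math.exp(v - peak) }
  total = exps.sum
  exps.map { |e| e / total }
end

# {expert index, probability} for the k most probable experts, highest first.
def top_k(probs : Array(Float32), k : Int32) : Array(Tuple(Int32, Float32))
  order = (0...probs.size).to_a.sort_by { |i| -probs[i] }
  order.first(k).map { |i| {i, probs[i]} }
end

# Made-up gate logits for one token across 8 experts.
gate_logits = [0.1_f32, 2.0_f32, -1.0_f32, 1.5_f32, 0.0_f32, 0.3_f32, -0.5_f32, 0.2_f32]

probs = softmax(gate_logits)
selected = top_k(probs, 2) # k = 2 is an assumption for illustration
selected.each do |pair|
  expert, weight = pair
  puts "expert #{expert}: weight #{weight.round(3)}"
end
```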
Platform Support
| Platform | GPU | CPU | Status |
|---|---|---|---|
| macOS (Apple Silicon) | Metal | Yes | Primary target |
| macOS (Intel) | Metal | Yes | Supported |
| Linux | - | Yes | CPU only (build with -Dcpu_only) |
Build Flags
| Flag | Effect |
|---|---|
| -Dcpu_only | Disable Metal, pure CPU |
| -Duse_gguf | Enable GGUF model loading (requires llama.cpp for LLM, standalone for embeddings) |
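Flags passed with `-D` are visible to Crystal code at compile time through the `flag?` macro. As a sketch of how an application might branch on `-Dcpu_only`, the snippet below picks a label at compile time; the `BACKEND` constant is hypothetical and not part of the library.

```crystal
# Resolved at compile time: only one branch ends up in the binary.
{% if flag?(:cpu_only) %}
  BACKEND = "cpu"
{% else %}
  BACKEND = "metal"
{% end %}

puts "Selected backend: #{BACKEND}"
```

Building with `crystal build -Dcpu_only your_app.cr` selects the first branch; without the flag the second one is compiled.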
License
MIT