Cogni-ML

Crystal machine learning library with native Apple Silicon GPU acceleration.

Cogni-ML is currently two things:

A general Crystal ML toolkit: tensors, autograd, NN layers, optimizers, GGUF readers, and llama.cpp bindings.
A native Metal inference lab for GGUF models, with production-oriented work on nomic-embed-text-v2-moe embeddings and Qwen 3.5 text generation.

Highlights

Native Metal embedding pipeline for nomic-embed-text-v2-moe.
Native Qwen 3.5 9B GGUF inference path for Apple Silicon Metal.
Q4_K/Q5_K/Q6_K/Q8_0 quantized matmul kernels.
Chunked Qwen 3.5 prefill, decode wave scheduling, prompt-state cache restore, and exact speculative decode harnesses.
ComputeGraph wave scheduling with offset-aware barrier optimization.
Crystal autograd engine, NN layers, and Adam/AdamW optimizers.
llama.cpp FFI bindings for general GGUF model access.

Architecture

src/ml/
  core/         Tensor, Shape, MetalBuffer
  autograd/     Variable, GradFn backward pass
  nn/           Linear, LayerNorm, MultiHeadAttention, ViT
  optim/        Adam/AdamW
  llm/          llama.cpp FFI bindings
  gguf/         GGUF reader, tokenizer, dequantization, Qwen35, NomicBertMoE
  metal/        Device, ComputeEncoder, ComputeGraph, GraphEncoder

Qwen 3.5 Native Metal

The native Qwen path targets Qwen3.5-9B-Q4_K_M.gguf on Apple Silicon. The code supports:

Qwen 3.5 GGUF metadata and tokenizer loading.
Q4_K, Q5_K, Q6_K, and Q8_0 quantized projections.
Full-attention layers with GQA, partial RoPE, KV cache writes, and fused output projection.
DeltaNet/recurrent layers with GPU-resident recurrent state and chunked prefill scan.
Chunked prefill with final-token top1 shortcut.
Decode wave scheduling to reduce command-buffer boundaries.
Exact prompt-state save/restore and longest-prefix prompt cache.
Exact speculative decode harnesses:
- neural draft with Qwen 3.5 0.8B Q8_0,
- n-gram/cache draft for repeated/generated-template text,
- target-verifier chunks with row-batched top1 for larger accepted chunks.

The 9B Q4_K_M path is the primary verified target. Qwen 3.6 27B is a scale-up target, but it should be treated as experimental until local correctness and performance runs are completed.

Model Layout

The developer CLIs default to local LM Studio / llama.cpp-style paths:

~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-9B-GGUF/Qwen3.5-9B-Q4_K_M.gguf
~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q8_0.gguf
~/SrcArchives/AI/llama.cpp/build/bin/llama-tokenize
~/SrcArchives/AI/llama.cpp/build/bin/llama-bench

Most benchmark/probe CLIs also accept --model, --target, --draft, --tokenizer-bin, or environment overrides. bin/qwen35_generate.cr is intentionally a small demo and currently uses its constants at the top of the file.

Build Qwen CLIs

Build the CPU-only GGUF/Qwen metadata smoke on Linux, CUDA hosts, or any environment where Metal is unavailable:

crystal build -Dcpu_only bin/qwen35_gguf_info.cr -o build/qwen35_gguf_info
./build/qwen35_gguf_info --model /path/to/Qwen3.5-9B-Q4_K_M.gguf
./build/qwen35_gguf_info --model /path/to/Qwen3.5-0.8B-Q8_0.gguf --load-weights

This entrypoint intentionally does not run inference. It verifies GGUF parsing, Qwen 3.5/3.6 hparams, tensor inventory, and the structured Qwen35Weights loader without pulling the Metal bridge into a Linux build.

Build the minimal Crystal CUDA Driver API smoke on NVIDIA/Linux hosts:

crystal build bin/cuda_driver_smoke.cr -o build/cuda_driver_smoke
./build/cuda_driver_smoke 4096

The CUDA smoke is a backend boundary probe only: it links libcuda, loads embedded PTX, launches a vector-add kernel, and checks the result.

CUDA probe code uses src/ml/cuda/driver.cr for reusable CUDA context, module/function, launch, copy, synchronize, and device-buffer ownership. This is intentionally small: it owns the raw CUDA Driver API lifecycle and calls, while higher-level layer execution is still probe-local until the CUDA backend split is promoted. It also provides ML::CUDA::ResidentSequenceRunner, a thin lifecycle facade for resident sequence probes with explicit upload_weights, reset_sequence, run_sequence, and read_outputs phases. src/ml/cuda/qwen_recurrent_layer_runner.cr is the first Qwen-specific runner extraction: it owns one recurrent layer's CUDA modules, device buffers, kernel parameters, weight upload, sequence reset, token launch graph, and output readback. QwenRecurrentLayerRunner::Weights.load owns GGUF tensor lookup, tensor-shape/type validation, and raw weight reads for the runner, including recurrent-layer ffn_down tensors stored as either Q4_K or Q6_K. CPU-reference comparison intentionally remains in the probe.

Build the first quantized CUDA correctness probe on NVIDIA/Linux hosts:

crystal build -Dcpu_only bin/cuda_q8_gemv_probe.cr -o build/cuda_q8_gemv_probe
./build/cuda_q8_gemv_probe \
  --model /path/to/Qwen3.5-0.8B-Q8_0.gguf \
  --tensor blk.0.ffn_up.weight \
  --kernel warp4 \
  --reps 100 \
  --warmup 10

cuda_q8_gemv_probe loads a real GGUF Q8_0 tensor, launches a Crystal-driven CUDA Driver API GEMV kernel over the raw GGUF block layout, and compares against the existing CPU QuantMatmul reference. --kernel scalar keeps the first one-thread-per-output-row correctness kernel; the default --kernel warp4 maps four output rows to four warps per thread block and is the current faster probe shape. This is still a standalone backend-boundary probe, not an optimized Qwen CUDA inference path yet. The current full qwen35_generate CLI remains Metal-first.

Build the first Q4_K CUDA correctness probe for Qwen 9B/27B-style target tensors:

crystal build -Dcpu_only bin/cuda_q4k_gemv_probe.cr -o build/cuda_q4k_gemv_probe
./build/cuda_q4k_gemv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --tensor blk.0.attn_gate.weight \
  --kernel warp4 \
  --reps 20 \
  --warmup 3

cuda_q4k_gemv_probe uses the raw GGUF Q4_K block layout (d, dmin, 12-byte packed scales/mins, 128-byte packed nibbles) and checks the CUDA output against the CPU QuantMatmul Q4_K reference. --kernel scalar keeps the first correctness kernel; the default --kernel warp4 maps four output rows to four warps per block and is the current faster probe shape.

Build the Q6_K CUDA correctness/speed probe for Q4_K_M tensors that remain in Q6_K:

crystal build -Dcpu_only bin/cuda_q6k_gemv_probe.cr -o build/cuda_q6k_gemv_probe
./build/cuda_q6k_gemv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --tensor blk.0.ffn_down.weight \
  --kernel warp4 \
  --reps 10 \
  --warmup 2

cuda_q6k_gemv_probe covers the GGUF Q6_K block layout (ql, qh, signed scales, d) used by output/value/down projections in mixed-quant target models. Like the Q4_K/Q8_0 probes, it is a standalone backend primitive check; full CUDA Qwen execution is still a separate backend split.

Build the first GPU-resident FFN sequence probe:

crystal build -Dcpu_only bin/cuda_ffn_sequence_probe.cr -o build/cuda_ffn_sequence_probe
./build/cuda_ffn_sequence_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --reps 10 \
  --warmup 2

cuda_ffn_sequence_probe composes the checked CUDA primitives as Q4_K ffn_gate + Q4_K ffn_up -> SwiGLU -> Q6_K ffn_down while keeping the input, intermediate activations, and output projection input GPU-resident. Only the final hidden vector is copied back for comparison against the CPU QuantMatmul FFN reference.

Build the full-attention input projection bundle probe:

crystal build -Dcpu_only bin/cuda_attn_projection_probe.cr -o build/cuda_attn_projection_probe
./build/cuda_attn_projection_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 3 \
  --reps 10 \
  --warmup 2

cuda_attn_projection_probe runs Q4_K attn_q + Q4_K attn_k + Q6_K attn_v from one GPU-resident hidden vector and copies Q/K/V back only after all projections complete. It targets full-attention layers such as blk.3 in Qwen3.5 9B. The probe now routes through ML::CUDA::QwenFullAttnProjectionRunner, supports --tokens N, and keeps Q/K/V outputs GPU-resident until the final correctness readback. It is the reusable input-projection boundary for future full-attention/KV CUDA work, not a complete full-attention layer runner yet.

Build the full-attention Q/K normalization + RoPE + KV-cache boundary probe:

crystal build -Dcpu_only bin/cuda_full_attn_kv_probe.cr -o build/cuda_full_attn_kv_probe
./build/cuda_full_attn_kv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 3 \
  --tokens 4 \
  --start-pos 2 \
  --max-seq 12

cuda_full_attn_kv_probe now routes through ML::CUDA::QwenFullAttnLayerRunner, a residual-hidden-to-final-hidden wrapper around the projection runner and ML::CUDA::QwenFullAttnKVRunner. The projection runner can apply the initial attn_norm on CUDA from residual hidden states before Q/K/V projection; Q is split into normalized/RoPE'd Q and gate, K is RMSNormed/RoPE'd, K/V rows are appended to a CUDA-resident cache at start_pos, a correctness-first serial CUDA kernel computes GQA scores, softmax, value reduction, and Q-gate multiplication, the resident gated attention output is projected through attn_output.weight, and the layer tail runs residual add, post-attention RMSNorm, FFN gate/up/SwiGLU/down, and final residual. It checks Q, gate, K, gated attention output, projected attention output, final hidden, K-cache, and V-cache against the CPU Qwen reference. This is now a clean one-layer semantics probe with device input/output hooks for mixed-stack composition, but it is not yet an end-to-end Linux decode path: full/recurrent stack scheduling, logits/top1, tokenizer/sampling, restored nonzero prefix KV, and a faster attention kernel remain separate gates.

Build the Q5_K CUDA recurrent-QKV probe:

crystal build -Dcpu_only bin/cuda_q5k_gemv_probe.cr -o build/cuda_q5k_gemv_probe
./build/cuda_q5k_gemv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --tensor blk.0.attn_qkv.weight \
  --reps 10 \
  --warmup 2

cuda_q5k_gemv_probe covers the GGUF Q5_K block layout used by recurrent-layer combined attn_qkv.weight tensors in the current Qwen3.5 9B Q4_K_M file.

Build the recurrent-layer projection bundle probe:

crystal build -Dcpu_only bin/cuda_recurrent_projection_probe.cr -o build/cuda_recurrent_projection_probe
./build/cuda_recurrent_projection_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --reps 10 \
  --warmup 2

cuda_recurrent_projection_probe runs Q5_K attn_qkv + Q4_K attn_gate + Q4_K ssm_alpha + Q4_K ssm_beta from one GPU-resident hidden vector and copies the four outputs back only after all kernels complete. It is the first CUDA recurrent projection-bundle proof; DeltaNet recurrence, convolution, state updates, and ssm_out remain separate work.

Build the synthetic DeltaNet output slice probe:

crystal build -Dcpu_only bin/cuda_deltanet_output_probe.cr -o build/cuda_deltanet_output_probe
./build/cuda_deltanet_output_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --reps 10 \
  --warmup 2

cuda_deltanet_output_probe runs a synthetic CUDA DeltaNet state update, applies post RMSNorm/SiLU gating on GPU, and feeds the result directly into the real Q4_K ssm_out.weight projection. It is a stateful boundary probe, not a full recurrent layer: recurrent conv prep, alpha/beta transforms, residuals, and FFN remain separate work.

Build the recurrent prep/output slice probe:

crystal build -Dcpu_only bin/cuda_recurrent_prep_output_probe.cr -o build/cuda_recurrent_prep_output_probe
./build/cuda_recurrent_prep_output_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --tokens 4 \
  --reps 10 \
  --warmup 2

cuda_recurrent_prep_output_probe now composes one full recurrent layer token slice: input RMSNorm, real recurrent projection bundle (attn_qkv, attn_gate, ssm_alpha, ssm_beta), recurrent conv prep, alpha/beta transforms, DeltaNet, post RMSNorm/SiLU, Q4_K ssm_out, residual add, post-attention RMSNorm, Q4_K FFN gate/up, SwiGLU, Q6_K FFN down, and final residual. --tokens N runs a GPU-resident sequence through persistent conv/SSM state and compares all token outputs plus final recurrent states against the CPU reference. The probe now separates one-time weight upload from per-sequence input/state reset and prints weight_upload_ms; timed cuda_ms_per_token excludes the persistent weight upload. GGUF recurrent-layer tensor loading is routed through QwenRecurrentLayerRunner::Weights.load, so the probe no longer manually passes every raw tensor into the runner constructor. It is still a standalone one-layer probe, not an end-to-end Linux decoder.

Build the recurrent multi-layer stack scaffold:

crystal build -Dcpu_only bin/cuda_recurrent_stack_probe.cr -o build/cuda_recurrent_stack_probe
./build/cuda_recurrent_stack_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layers 0,2,4 \
  --tokens 2

cuda_recurrent_stack_probe chains multiple QwenRecurrentLayerRunner instances and compares the final hidden sequence plus each layer's recurrent conv/SSM state against the CPU reference. The default path hands each recurrent layer's CUDA output buffer directly to the next layer's CUDA input; --host-handoff keeps the older host-copy route as a debug oracle. This is still a recurrent-only scaffold, not an end-to-end Linux decoder.

Build the mixed recurrent/full-attention CUDA stack probe:

crystal build -Dcpu_only bin/cuda_mixed_stack_probe.cr -o build/cuda_mixed_stack_probe
./build/cuda_mixed_stack_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layers 0,1,2,3,4 \
  --tokens 2 \
  --start-pos 2 \
  --max-seq 12

cuda_mixed_stack_probe composes QwenRecurrentLayerRunner and QwenFullAttnLayerRunner in model layer order with device-resident hidden handoff across recurrent/full-attention boundaries, then runs QwenOutputHeadRunner for output RMSNorm, quantized lm-head projection, and resident top1. The layer/head loop is now owned by ML::CUDA::QwenMixedStackRunner, which is the first model-slice decode-state object for CUDA. By default the probe copies back only the CUDA top1 id/value plus hidden/state debug outputs; pass --read-logits to also copy full logits for attribution, and --profile-phases to insert per-layer/head synchronizations and print attribution lines. It compares the final hidden sequence, top1, recurrent conv/SSM states, and full-attention KV cache rows against the CPU reference. This is the first mixed-stack CUDA correctness scaffold through resident top1; it still stops before tokenizer/sampling, repeated full-model decode ownership, and an optimized topK/sampling kernel. The current resident top1 is a simple two-phase partial-scan/reduce kernel: correct, but not yet promoted as a speed-optimized head.

Build the Metal bridge once:

make build/bridge.o

Build the practical generation demo:

crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/qwen35_generate.cr \
  -o build/qwen35_generate

Run greedy generation:

./build/qwen35_generate "The capital of France is" 64

Enable exact n-gram speculative decode for repeated text:

QWEN35_NGRAM_DECODE=1 ./build/qwen35_generate "The capital of France is" 64

Use the conservative automatic decode policy:

QWEN35_DECODE_POLICY=auto ./build/qwen35_generate "The capital of France is" 64

auto is the product-safe proposal-aware profile: it uses exact n-gram/cache proposals only when they are large enough to amortize verification, enables the candidate-shape risk gate, and otherwise falls back to exact target decoding without invoking the neural draft model.

Enable exact neural speculative decode with the Qwen 3.5 0.8B draft:

QWEN35_SPECULATIVE_DECODE=1 \
QWEN35_HEAD_FULL_ROWS_GUARDED=1 \
./build/qwen35_generate "The capital of France is" 64

Enable exact prompt cache:

QWEN35_PROMPT_CACHE=1 \
QWEN35_SESSION_ID=demo \
./build/qwen35_generate "The capital of France is" 64

Useful Qwen environment switches:

Variable	Effect
`QWEN35_PROMPT_CACHE=1`	Enable exact prompt-state cache lookup/save in `qwen35_generate`.
`QWEN35_PROMPT_CACHE_ROOT=/path`	Override prompt-cache artifact root.
`QWEN35_PREPARE_STATE_OFF=1`	Disable eager Metal state-buffer preparation in `qwen35_generate`. By default the CLI prepares KV/DeltaNet buffers before timing prompt ingest.
`QWEN35_DECODE_POLICY=greedy\|ngram\|speculative\|auto`	Explicit decode-mode selector. `auto` chooses the exact fail-closed n-gram path with risk gating; explicit policy overrides legacy mode envs.
`QWEN35_TRACE_STEPS_OFF=1`	Suppress per-token/per-cycle trace lines in `qwen35_generate` while keeping summaries and final output.
`QWEN35_QUIET=1`	Alias for suppressing per-step traces in `qwen35_generate`; useful for cleaner local timing.
`QWEN35_NGRAM_DECODE=1`	Enable exact n-gram speculative decode in `qwen35_generate`.
`QWEN35_NGRAM_GAMMA=32`	Maximum n-gram verifier chunk size.
`QWEN35_NGRAM_MIN=6`	Minimum repeated suffix length before n-gram drafting.
`QWEN35_NGRAM_MAX=8`	Maximum suffix length to search for n-gram drafting.
`QWEN35_NGRAM_MIN_CANDIDATES=N`	Skip n-gram proposals shorter than `N` candidates. In `auto`, the default is `8`; explicit `ngram` keeps the old default `0` unless set.
`QWEN35_NGRAM_STAGE_MIN=N`	Split only n-gram verifier chunks with at least `N` candidates into staged subchunks. In `auto`, the default is `QWEN35_NGRAM_GAMMA + 1`, so the common full chunk is kept intact unless overridden. Explicit `ngram` keeps the old default `0` unless set.
`QWEN35_NGRAM_RISK_GATE=0\|1`	In `auto`, the exact candidate-shape risk gate is enabled by default; set `0` to disable. Explicit `ngram` keeps the old default off unless set to `1`.
`QWEN35_NGRAM_RISK_MIN_SIZE=16`	Candidate size threshold used by the n-gram risk gate. This is independent from `QWEN35_NGRAM_STAGE_MIN`, so staging can be disabled without weakening fail-closed risk checks.
`QWEN35_NGRAM_RECURSIVE_OFF=1`	Disable recursive n-gram extension through scratch history.
`QWEN35_NGRAM_DISABLE_AFTER_REJECT_OFF=1`	Exploration mode: keep trying n-gram chunks after first rejection.
`QWEN35_NGRAM_REPLAY_ON_REJECT=1`	Research/fast-path mode: skip n-gram target-state backups and rebuild the exact target state only after a non-final n-gram reject. Use with the default `auto` risk gate; it can regress badly when a large bad n-gram chunk is forced through verification.
`QWEN35_SPECULATIVE_DECODE=1`	Enable exact neural speculative decode in `qwen35_generate` using the 0.8B draft.
`QWEN35_DRAFT_MODEL=/path`	Override the Qwen 3.5 draft GGUF used by neural speculative decode.
`QWEN35_SPEC_GAMMA=4`	Initial neural draft chunk size in `qwen35_generate`.
`QWEN35_SPEC_MAX_GAMMA=32`	Maximum adaptive neural draft chunk size.
`QWEN35_SPEC_PLAIN_FALLBACK_OFF=1`	Disable target-only fallback after low-gamma speculative rejection. Useful for A/B experiments; default fallback is faster on rejection-heavy prompts.
`QWEN35_SPEC_PLAIN_FALLBACK_GAMMA=2`	Gamma threshold at or below which rejected neural speculative decode falls back to target-only generation.
`QWEN35_SPEC_BOOTSTRAP_GAMMA=N`	Default-off neural speculative jump after a fully accepted initial chunk. Can help 100%-accept runs; may regress prompts that reject after an accepted prefix.
`QWEN35_SPEC_SINGLE_FAST_OFF=1`	Disable the exact gamma=1 accepted-token fast path in neural speculative decode. Mostly useful when target-only fallback is disabled for A/B experiments.
`QWEN35_SPEC_VERIFY=chunk-inplace\|hybrid\|serial`	Choose neural speculative verifier strategy. Default `chunk-inplace` is best for high-accept prompts; `hybrid` can help first-cycle partial-reject prompts.
`QWEN35_SPEC_SKIP_DRAFT_BEFORE_FALLBACK_OFF=1`	Disable the exact optimization that skips draft resync work when a rejection is guaranteed to enter target-only fallback.
`QWEN35_SPEC_SKIP_DRAFT_BACKUP_BEFORE_FALLBACK_OFF=1`	Disable the matching draft-backup skip before fallback-bound speculative chunks.
`QWEN35_HEAD_FULL_ROWS_GUARDED=1`	Experimental speculative-verifier accelerator for large accepted chunks; uses a margin guard and exact fallback for low-margin rows.
`QWEN35_HEAD_FULL_ROWS_MARGIN=0.25`	Margin threshold for the guarded full-row verifier route. Higher is safer but falls back more often.
`QWEN35_FFN_DOWN_ADD_FUSED_OFF=1`	Disable decode-wave FFN-down residual-add fusion for Q4/Q6 target and Q8 draft experiments.
`QWEN35_Q4K_PAIR_H16_MIN_BATCH=64`	Tune the prefill Q4 gate/up shared H16 conversion threshold. The current default enables sharing from pp64 upward after refreshed A/B showed a small exact win.
`QWEN35_REC_PROJ_SHARED_H16_OFF=1`	Disable the exact recurrent prefill projection optimization that shares one H16 input conversion between Q5 qkv and Q4 gate GEMMs.
`QWEN35_PREFILL_CHUNK_OFF=1`	Force older non-chunked prefill path.
`QWEN35_DECODE_WAVE_OFF=1`	Force older non-wave decode path.

Library Integration

The current Qwen API is low-level and intended for native inference experiments:

require "ml/gguf/qwen35_cpu"
require "ml/gguf/qwen35_weights"

model = "/path/to/Qwen3.5-9B-Q4_K_M.gguf"
weights = ML::GGUF::Qwen35Weights.from_gguf(model)
state = ML::GGUF::Qwen35CPU::State.new(weights.hparams, max_seq: 1024)

prompt_ids = [760_i32, 6511_i32, 314_i32, 9338_i32, 13_i32]
next_id, next_logit = ML::GGUF::Qwen35CPU.prefill_tokens_top1(weights, prompt_ids, 0, state)

64.times do |i|
  puts next_id
  next_id, next_logit = ML::GGUF::Qwen35CPU.forward_top1(weights, next_id, prompt_ids.size + i, state)
end

When linking an executable that uses Metal, include the bridge object and Apple frameworks:

crystal build your_app.cr \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++"

For CPU-only builds:

crystal build -Dcpu_only your_app.cr

Benchmark Against llama.cpp

Build the matched benchmark:

crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/benchmark_qwen_vs_llama.cr \
  -o build/benchmark_qwen_vs_llama

Run a normal first-run prefill/decode comparison:

./build/benchmark_qwen_vs_llama \
  --model ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-9B-GGUF/Qwen3.5-9B-Q4_K_M.gguf \
  --llama-bench ~/SrcArchives/AI/llama.cpp/build/bin/llama-bench \
  --prompt=64 \
  --gen=64 \
  --reps=5 \
  --warmup=2

For publishable measurements, wait for a quiet host:

./build/benchmark_qwen_vs_llama \
  --prompt=64 \
  --gen=64 \
  --reps=5 \
  --warmup=2 \
  --wait-quiet-ms=60000 \
  --require-quiet

Additional benchmark modes:

# Fresh State per repetition, but Metal state buffers are prepared before
# the timed prefill. This measures prompt ingest without first-touch buffer
# allocation/zeroing in the timed region.
./build/benchmark_qwen_vs_llama --native-prefill-prepare-state

# State buffers allocated once, then reset between reps.
./build/benchmark_qwen_vs_llama --native-prefill-prealloc

# Exact prompt-cache restore after one seeded native prefill.
./build/benchmark_qwen_vs_llama --native-prefill-cache

Fresh local M2 Max 64GB relaxed-load snapshot after the shared-H16 recurrent projection cleanup, Qwen 3.5 9B Q4_K_M, llama.cpp llama-bench, prompt=64, gen=64, reps=3, warmup=1, flash-attention off:

Mode	cogni-ml	llama.cpp	Gap
First-run prefill	426.70 tok/s p50	455.10 tok/s avg	-6.24%
Fresh state, prepared Metal buffers	449.73 tok/s p50	464.78 tok/s avg	-3.24%
Prefill with preallocated state	448.60 tok/s p50	465.91 tok/s avg	-3.71%
Prompt-cache restore	1350.65 tok/s p50	465.80 tok/s avg	+189.97%
Plain greedy decode, first-run bench	48.67 tok/s p50	46.67 tok/s avg	+4.29%
Plain greedy decode, prepared-state bench	48.59 tok/s p50	46.43 tok/s avg	+4.67%
Plain greedy decode, preallocated bench	48.51 tok/s p50	46.63 tok/s avg	+4.05%
Plain greedy decode, prompt-cache bench	48.52 tok/s p50	46.58 tok/s avg	+4.18%

Notes:

The table is a local engineering snapshot, not a lab-clean public benchmark.
First-run prefill is still behind llama.cpp on this machine. The native wins currently come from state reuse, prompt-cache restore, and exact speculative decode.
--native-prefill-prepare-state uses a fresh State per repetition but calls Qwen35CPU.prepare_state_metal! before timing. This is useful for server-style latency where a session object can be prepared before the prompt arrives.
--native-prefill-cache measures exact restore of a previously computed prompt state; it is not a first-run prefill replacement.
Short decode runs are noisy on a desktop system. The two plain decode rows above are intentionally both shown: treat plain decode as parity-to-faster, not as a stable public margin without a quiet rerun.

Speculative Decode Harnesses

Neural draft harness with Qwen 3.5 0.8B Q8_0:

crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/qwen35_speculative_accept.cr \
  -o build/qwen35_speculative_accept

./build/qwen35_speculative_accept \
  --target ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-9B-GGUF/Qwen3.5-9B-Q4_K_M.gguf \
  --draft ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q8_0.gguf \
  --tokens 64 \
  --ngram \
  "The capital of France is"

Cheap-proposal-only policy for repeated/template spans:

QWEN35_SPEC_NGRAM_MIN_CANDIDATES=8 \
./build/qwen35_speculative_accept \
  --target ~/.cache/lm-studio/models/lmstudio-community/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf \
  --draft ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q8_0.gguf \
  --tokens 32 \
  --ngram \
  --ngram-risk-gate \
  --ngram-target-only \
  "alpha beta gamma alpha beta gamma alpha beta gamma alpha"

Target-only n-gram speculative harness:

crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/qwen35_ngram_speculative.cr \
  -o build/qwen35_ngram_speculative

./build/qwen35_ngram_speculative \
  --tokens 64 \
  --gamma 32 \
  --min-ngram 6 \
  "The capital of France is"

Both harnesses replay/check exact greedy target output by default unless their CLI explicitly says otherwise.

Fresh local speculative smoke, same M2 Max 64GB and Qwen 3.5 9B target:

Mode / prompt	Effective speed	Plain target	Notes
Neural draft, `The capital of France is`	15.38 ms/tok, 65.01 tok/s	21.98 ms/tok, 45.49 tok/s	100% accepted, 64/64 candidates
Neural draft, `def fibonacci(n):`	21.06 ms/tok, 47.48 tok/s	21.71 ms/tok, 46.07 tok/s	falls back after rejection; small but safe win
N-gram + neural, `The capital of France is`	10.10 ms/tok, 98.98 tok/s	21.91 ms/tok, 45.64 tok/s	repeated-text path, 48/48 n-gram candidates accepted
Experimental guarded full-row verifier + neural, `The capital of France is`	14.32 ms/tok, 69.82 tok/s	22.36 ms/tok, 44.73 tok/s	`QWEN35_HEAD_FULL_ROWS_GUARDED=1`, 0 fallback rows in this run
Experimental guarded full-row verifier + n-gram + neural, `The capital of France is`	9.20 ms/tok, 108.64 tok/s	noisy target run	`QWEN35_HEAD_FULL_ROWS_GUARDED=1`, 48/48 n-gram candidates accepted

Speculative decode caveats:

The speculative paths are exact greedy verification paths, not approximate sampling shortcuts.
Neural speculative speed depends on draft acceptance. High-accept prompts are faster; rejection-heavy prompts quickly fall back to plain target decode.
In qwen35_generate, neural speculative decode is useful for longer high-accept generations. In a local 64-token smoke, The capital of France is measured 20.40 ms/tok greedy, 16.61 ms/tok neural speculative, and 15.10 ms/tok neural speculative with guarded full-row verification. A 32-token smoke was slower due fixed draft/verifier overhead.
N-gram speculation is a workload-specialized path for repeated/generated-template text. QWEN35_DECODE_POLICY=auto is the recommended product profile: risk-gated, min_candidates=8, no neural fallback, and exact target-only fallback outside clean cheap-copy spans.
In the research acceptance harness, --ngram-target-only / QWEN35_SPEC_NGRAM_TARGET_ONLY=1 skips neural draft fallback after the cheap n-gram proposal source and uses exact target-only steps instead. On a 27B mixed JSONL gate it beat neural default on 5/7 prompts with paired ratio 0.909x when combined with --ngram-risk-gate, but its average speed was near plain target decode; the real win is on clean repeated spans.
The n-gram risk gate now also catches small-period prefix overruns such as IP/YAML tails. A focused 27B probe kept clean alpha beta gamma repeats at ~2.58x while turning a YAML overrun from 0.80x into fail-closed target-only 1.006x.
The productization smoke after the structured-tail fix kept clean 27B repeats fast (~2.55x over plain), made a YAML-like overrun fail closed (1.003x), and had ngram_target_only_risk beat neural default on the tiny paired suite (0.815x ratio). Treat this as opt-in CLI evidence, not a broad default claim.
QWEN35_NGRAM_REPLAY_ON_REJECT=1 is exact but deliberately opt-in. It removes rollback-copy overhead on high-confidence accepted n-gram chunks; on the local 27B+0.8B repeat8 harness it improved ngram_router16_risk from 30.85 to 30.21 ms/tok. A forced no-risk YAML reject regressed from 71.53 to 95.86 ms/tok, so this is not a broad default.
N-gram verifier chunks temporarily disable guarded full-row verification even if QWEN35_HEAD_FULL_ROWS_GUARDED=1, because partial n-gram rejection exposed a close-row guard failure during adversarial CLI testing.
QWEN35_HEAD_FULL_ROWS_GUARDED=1 is still an experimental research switch. The harness checks final output against plain greedy target output, but the route is not broad-defaulted because it relies on a full-row F16 top1 margin guard.
These numbers are effective decode throughput after prompt prefill; they do not make first-run prefill faster.

Native Metal Embeddings

The embedding path targets nomic-embed-text-v2-moe with a fully native Metal compute pipeline.

require "ml"
require "ml/gguf/nomic_bert"
require "ml/gguf/metal_backend"
require "ml/metal/compute_graph"

ML::Metal::Device.init!
model = ML::GGUF::NomicBertMoE.from_gguf("path/to/model.gguf", ML::GGUF::MetalBackend.new)

embedding = model.embed("Your text here")

Embedding Performance

Apple M2 Max, 38 GPU cores:

Tokens	Latency
20	14 ms
94	16 ms
196	33 ms
433	70 ms

Embedding Pipeline Internals

simdgroup-matrix GEMM for Q5_K/Q6_K dequant+multiply.
Batched expert GEMM for MoE experts.
ComputeGraph wave scheduling with offset-aware dependency analysis.
Fused QKV split/RoPE, gate/softmax/top-k, scatter, and norm kernels.
GPU-driven dispatch where useful.

Supported Models

Model	Format	Status
`Qwen3.5-9B`	GGUF Q4_K_M	Native Metal text generation path, active optimization target.
`Qwen3.5-0.8B`	GGUF Q8_0	Native draft model path for speculative decode harnesses.
`Qwen3.6-27B`	GGUF Q4_K_M target	Planned/experimental scale-up target.
`nomic-embed-text-v2-moe`	GGUF Q5_K_M	Native Metal embedding pipeline.
BERT-like encoders	GGUF	Via `NomicBertMoE` when the architecture matches.
Other Llama/Qwen/Mistral-style models	GGUF	Via llama.cpp bindings.

Installation

# shard.yml
dependencies:
  cogni-ml:
    github: skuznetsov/cogni-ml
    version: ~> 0.40.0

Build And Test

make build
make spec

CPU-only:

make build_cpu
make spec_cpu

llama.cpp helper targets:

make llama
make llama_env

The Makefile searches common local, Homebrew, and system library locations for libllama. Override with LLAMA_DIR, LLAMA_BUILD, or LLAMA_LIB_DIR if needed.

Quick Start

Tensor + Autograd

require "ml"

x = ML::Autograd::Variable.rand(2, 3, requires_grad: true, device: ML::Tensor::Device::CPU)
layer = ML::NN::Linear.new(3, 4, device: ML::Tensor::Device::CPU)

out = layer.forward(x)
loss = out.mean
loss.backward

opt = ML::Optim::Adam.new(layer.parameters)
opt.step
opt.zero_grad

LLM Inference Through llama.cpp

require "ml/llm/llama"

ML::LLM.init
model = ML::LLM::Model.new("path/to/model.gguf")
gen = ML::LLM::Generator.new(model)
puts gen.ask("What is Crystal?", max_tokens: 100)
ML::LLM.cleanup

GGUF Embeddings

require "ml"
require "ml/gguf/nomic_bert"
require "ml/gguf/metal_backend"
require "ml/metal/compute_graph"

ML::Metal::Device.init!
model = ML::GGUF::NomicBertMoE.from_gguf(
  "nomic-embed-text-v2-moe.Q5_K_M.gguf",
  ML::GGUF::MetalBackend.new
)

vec = model.embed("Crystal programming language")
puts "dim=#{vec.size}"

vecs = model.embed_batch(["Hello", "World", "Crystal"])

Metal Kernels

Kernel	Purpose
`gemm_q4k.metal`	Q4_K GEMV/GEMM paths for Qwen.
`gemm_q56k.metal`	Q5_K/Q6_K/Q8_0 GEMV, top1, and helper kernels for Qwen.
`gemm_mm.metal`	simdgroup-matrix GEMM for Q5_K/Q6_K and batched expert variants.
`gemm_simd.metal`	Scalar SIMD GEMM fallback.
`ffn_qwen35.metal`	Qwen FFN, add, RMSNorm, and activation helpers.
`delta_net.metal`	Qwen 3.5 DeltaNet/recurrent kernels.
`fullattn_qwen35.metal`	Qwen full-attention prefill/decode helpers.
`attn_decode_qwen35.metal`	Qwen gated attention decode.
`attention_matmul.metal`	Flash-style attention matrix helpers.
`bert_fp16.metal`	Nomic/BERT fused ops.
`nn.metal`	General NN ops.

Platform Support

Platform	GPU	CPU	Status
macOS Apple Silicon	Metal	Yes	Primary target.
macOS Intel	Metal	Yes	Supported for general Metal paths; Qwen performance focus is Apple Silicon.
Linux	Experimental CUDA probes	Yes	Use `-Dcpu_only` for GGUF/metadata and CUDA probe CLIs; full Qwen generation is still Metal-first.
FreeBSD	No native Metal	Untested CPU-only	Not a primary CI target.

NVIDIA/CUDA support is currently an experimental backend-probe track, not a full decoder. The Qwen native generation path remains Metal-first.

Build Flags

Flag	Effect
`-Dcpu_only`	Disable Metal and build pure CPU paths.
`-Duse_gguf`	Enable GGUF model loading where applicable.

License

MIT

Repository

cogni-ml

Owner

skuznetsov

Statistic

7
0
0
0
0
5 days ago
February 1, 2026

License

Links

Synced at

Wed, 13 May 2026 11:17:49 GMT

Languages

Crystal 85.97% Metal 12.6% Objective-C++ 0.86% Python 0.31% Makefile 0.14% Shell 0.07% C 0.06%

cogni-ml v0.40.0

Cogni-ML

Highlights

Architecture

Qwen 3.5 Native Metal

Model Layout

Build Qwen CLIs

Library Integration

Benchmark Against llama.cpp

Speculative Decode Harnesses

Native Metal Embeddings

Embedding Performance

Embedding Pipeline Internals

Supported Models

Installation

Build And Test

Quick Start

Tensor + Autograd

LLM Inference Through llama.cpp

GGUF Embeddings

Metal Kernels

Platform Support

Build Flags

License