microgpt
MicroGPT / AGPT Research Repository
This repository is the home of the AGPT (Aggregated-Gradient Pretraining) research project and its reference implementation. It also contains the underlying Crystal transformer library used as the experimental substrate, plus supporting work.
What's here
- AGPT paper — the primary research artifact: a gradient-factorization theorem for autoregressive training on prefix tries, a memory-scalable implementation, and empirical results on Shakespeare. This is the main thing to read.
- AGPT training engine — the CUDA implementation validating the paper: radix-compressed trie training with per-subtree KV-cache scoping, bigram partitioning, auto-LR scaling, and frequency-based pruning.
- MicroGPT — a minimal Crystal transformer (from-scratch attention, pluggable Crystal/OpenBLAS/cuBLAS backends) used as the substrate for the AGPT experiments and the window-training baseline. Described in detail in the Design Decisions section below.
- Path-probability convergence paper — an earlier preprint on the statistical convergence of path-frequency distributions across independent corpus splits; supporting work that motivated some of AGPT's trie-based formulation. Results in rnd/results-extended.md.
- Construction kit (src/microgpt/construction_kit.cr, frontend in frontend/) — an experimental graphical UI for designing transformer architectures. Partially implemented and not required to run or evaluate AGPT.
For a short summary of the AGPT research claim and context, see notes/grants/emergent-ventures-pitch.md.
MicroGPT library — Design Decisions
The following sections describe the Crystal transformer library itself, which AGPT uses as its transformer substrate.
Data Chunking
Training chunks come from a sequential walk through the text with a configurable stride. The default stride equals seq_len, giving non-overlapping chunks with full coverage per epoch.
- ~1.1M tokens in tinyshakespeare, vocab size 65 (character-level).
- At seq_len=32, one epoch = ~34k steps.
- No random sampling — every token is seen exactly once per epoch.
- Stride is configurable: set below seq_len for overlapping context at chunk boundaries if desired.
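The chunking scheme above can be sketched in a few lines of Python (an illustrative sketch, not the Crystal implementation; the `chunks` helper is hypothetical):

```python
# Sequential chunking with a configurable stride. With stride == seq_len the
# chunks are non-overlapping and every token is visited exactly once per
# epoch; stride < seq_len gives overlapping context at chunk boundaries.
def chunks(tokens, seq_len, stride=None):
    stride = stride or seq_len  # default: non-overlapping
    for start in range(0, len(tokens) - seq_len + 1, stride):
        yield tokens[start:start + seq_len]

toks = list(range(10))
assert list(chunks(toks, 4)) == [[0, 1, 2, 3], [4, 5, 6, 7]]
assert list(chunks(toks, 4, stride=2))[1] == [2, 3, 4, 5]  # overlapping
```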
Embedding / Unembedding Weights
W_e (token embeddings) and W_unembed (output projection) are independent. Embedding.token_emb is (vocab × d_model), OutputHead.proj is a separate Linear with its own (d_model × vocab) weight matrix. No weight tying.
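The untied shapes can be illustrated with a minimal NumPy sketch (shapes as described above; the variable names are illustrative, not the library's identifiers):

```python
import numpy as np

vocab, d_model = 65, 64
rng = np.random.default_rng(0)
W_e = rng.normal(size=(vocab, d_model))        # token embeddings
W_unembed = rng.normal(size=(d_model, vocab))  # separate output projection

ids = np.array([1, 5, 9])     # three token ids
x = W_e[ids]                  # (3, d_model) embedded tokens
logits = x @ W_unembed        # (3, vocab)
assert logits.shape == (3, vocab)
# No tying: updating W_e never touches W_unembed, and vice versa.
```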
W_o and Heterogeneous Heads
MultiHeadAttention supports heads of different dimensions. Head outputs are concatenated column-wise into a (seq_len × d_model) matrix — head dims must sum to d_model. W_o is a plain (d_model × d_model) linear projection over the full concatenated vector. It has no awareness of head boundaries; it treats the concatenation uniformly.
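A NumPy sketch of the concatenation-and-project step (hypothetical head dimensions, chosen only to show heterogeneity):

```python
import numpy as np

seq_len, d_model = 8, 64
head_dims = [32, 16, 16]          # heterogeneous heads; must sum to d_model
assert sum(head_dims) == d_model

rng = np.random.default_rng(0)
heads = [rng.normal(size=(seq_len, d)) for d in head_dims]
concat = np.concatenate(heads, axis=1)    # (seq_len, d_model)

W_o = rng.normal(size=(d_model, d_model))
out = concat @ W_o   # plain projection; no awareness of head boundaries
assert out.shape == (seq_len, d_model)
```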
Feed-Forward Dimension
d_ff = d_model (1:1 ratio). The FF block is two linear layers: d_model → d_ff → d_model. At d_model=64 this keeps the parameter count low (~61k total) and training fast (~116 steps/sec on OpenBLAS).
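A back-of-envelope parameter count, assuming the 2-layer, d_model=64, vocab=65 configuration used in the examples below (layernorm and bias accounting here is an assumption, so the result is approximate):

```python
# Approximate parameter count for a small transformer with d_ff = d_model.
def approx_params(vocab=65, d_model=64, d_ff=64, n_layers=2):
    emb = vocab * d_model                      # token embeddings
    unemb = d_model * vocab + vocab            # output projection + bias
    per_layer = (
        4 * (d_model * d_model + d_model)      # Q, K, V, W_o with biases
        + (d_model * d_ff + d_ff)              # FF up-projection
        + (d_ff * d_model + d_model)           # FF down-projection
        + 2 * 2 * d_model                      # two layernorms (gain + bias)
    )
    return emb + unemb + n_layers * per_layer

print(approx_params())  # roughly 59k, in the ballpark of the ~61k quoted
```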
Memory Protection
Mat tracks globally allocated bytes against a configurable cap (default 3 GiB) and raises with a detailed message before the limit is exceeded. GC.collect runs every 10 training steps to reclaim intermediate matrices. just run wraps execution with ulimit -v (default 8 GiB) as an OS-level safety net.
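The allocation-cap idea can be sketched as follows (Python sketch; the class and method names are illustrative, not the Crystal library's API):

```python
# Track bytes as matrices are created and fail loudly *before* the cap is
# crossed, rather than letting the process hit the OS limit mid-step.
class AllocTracker:
    def __init__(self, cap_bytes=3 * 1024**3):  # default 3 GiB cap
        self.cap = cap_bytes
        self.allocated = 0

    def request(self, nbytes):
        if self.allocated + nbytes > self.cap:
            raise MemoryError(
                f"allocating {nbytes} B would exceed cap "
                f"({self.allocated}/{self.cap} B in use)")
        self.allocated += nbytes

    def release(self, nbytes):
        self.allocated -= nbytes

t = AllocTracker(cap_bytes=100)
t.request(60)
try:
    t.request(50)   # would exceed the cap: raises before allocating
except MemoryError:
    pass
assert t.allocated == 60   # the failed request changed nothing
```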
Usage
The CLI uses a JSON schema for arguments. All flags use --flag value syntax.
Window Training (standard)
# Quick training run
./bin/microgpt data/input.txt --steps 2000 --d-model 64 --n-layers 2 --seq-len 32
# With held-out validation
./bin/microgpt data/input.txt --steps 5000 --d-model 64 --n-layers 2 --seq-len 32 \
--val-tokens 5000 --val-interval 30
# GPU accelerated
./bin/microgpt data/input.txt --steps 5000 --d-model 64 --n-layers 2 --seq-len 32 \
--backend cublas
# Other options
--seed 42 # reproducible init
--lr 0.0003 # learning rate (default 0.0003)
--no-save # don't save checkpoint
--model path.model # save/load checkpoint path
AGPT Training (trie-walk)
Trains on a prefix trie of the corpus. Shared prefixes are computed once.
# Basic AGPT run
./bin/microgpt data/input.txt --steps 10 --d-model 32 --n-layers 1 --seq-len 16 \
--agpt --agpt-max-starts 2000
# With held-out validation (for comparison with window training)
./bin/microgpt data/input.txt --steps 20 --d-model 32 --n-layers 1 --seq-len 16 \
--agpt --agpt-max-starts 2000 --val-tokens 5000 --val-interval 10
# AGPT-specific options
--agpt # enable AGPT trie-walk mode
--agpt-max-starts 2000 # number of corpus start positions (0 = all)
--agpt-start-offset 0 # deterministic offset for start positions
--agpt-progress 1000 # print trie build progress every N starts
--agpt-save-index path.idx # save built trie index to disk
--agpt-load-index path.idx # load previously saved trie index
--build-only # build/report trie without training
AGPT terminology:
- steps = number of epochs (full trie traversals)
- epoch = one forward pass over all trie nodes + backward + weight updates
- starts = number of corpus positions used to build the trie (more starts = more prefix sharing but slower)
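The prefix-sharing idea behind the trie build can be sketched in Python (a plain character trie for illustration; the actual implementation is radix-compressed and GPU-resident):

```python
# Build a prefix trie from a set of corpus start positions. Windows that
# share a prefix collapse into shared nodes, so a forward pass over trie
# nodes computes each shared prefix once instead of once per window.
def build_trie(text, starts, seq_len):
    root = {}
    nodes = 0
    for s in starts:
        node = root
        for ch in text[s:s + seq_len]:
            if ch not in node:
                node[ch] = {}
                nodes += 1
            node = node[ch]
    return root, nodes

text = "to be or not to be"
root, n = build_trie(text, starts=range(len(text)), seq_len=4)
# The 18 windows contain 66 characters in total, but shared prefixes
# (e.g. "to b" occurs twice) mean far fewer than 66 nodes are created.
print(n)
```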
Comparison Runs
To compare AGPT vs window training fairly, use the same model config, seed, and held-out split:
# Window baseline
./bin/microgpt data/input.txt --steps 5000 --d-model 32 --n-layers 1 --seq-len 16 \
--val-tokens 5000 --val-interval 10 --seed 42 --no-save > window.log 2>&1
# AGPT comparison
./bin/microgpt data/input.txt --steps 20 --d-model 32 --n-layers 1 --seq-len 16 \
--val-tokens 5000 --val-interval 10 --seed 42 --no-save \
--agpt --agpt-max-starts 2000 > agpt.log 2>&1
# Extract held-out CE curves (plot wall-clock vs held_out_ce)
grep '\[val\]' window.log
grep '\[val\]' agpt.log
Backends
- crystal — pure Crystal CPU (default, no dependencies)
- openblas — CPU with OpenBLAS acceleration
- cublas — GPU via cuBLAS + custom CUDA kernels (requires NVIDIA GPU + CUDA toolkit)
Building
just build # debug build (CPU only)
just build-release # release build (CPU only)
just build-cuda # release build with GPU support
Contributors
- Thomas Sawyer — creator and maintainer
License
This repository is released under the PolyForm Noncommercial License 1.0.0 — see LICENSE.
Academic research, personal study, and use by educational or research institutions are permitted and encouraged.
Commercial licensing available
If you want to use this code in a commercial product, service, or for-profit internal workflow, a separate commercial license is required. Terms are flexible and sized to the deployment — startup-friendly arrangements are welcome.
See COMMERCIAL_LICENSE.md for details, or contact transfire@gmail.com directly.