crystal-coder-vault

ccvault is a small Crystal CLI for collecting and deduplicating training data, and for recording its provenance, for the crystal-coder model.

It is designed for a server such as ollama.openbeagle.org where you want:

  • raw files on cheap or rebuildable storage,
  • tiny durable manifests in Git,
  • optional S3 archive sync for raw files you cannot easily recreate,
  • Git commit history as the native datastore for code examples.

The tool intentionally does not put a database in the middle. The registry is a normal Git repository that can be backed up to GitHub.

Storage layout

Default root:

/srv/crystal-coder
  registry/                       # Git repo; durable; push to GitHub
    records/YYYY/MM/DD/*.json     # file, URL, and Git commit records
    snapshots/<name>/*.jsonl      # exact list of content used for a dataset run
  blobs/
    archive/sha256/ab/<hash>      # durable local raw blobs, optionally S3-synced
    cache/sha256/ab/<hash>        # rebuildable local raw blobs; can be deleted
  incoming/                       # upload/drop zone
  work/                           # scratch
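
The content-addressed layout is what makes deduplication cheap: two files with identical bytes map to the same path. The sketch below is not ccvault's implementation. It assumes the `ab` directory is the first two hex characters of the SHA-256 hash, and the helper names `blob_path` and `store_blob` are hypothetical. It relies on `sha256sum` from GNU coreutils.

```shell
# Print the content-addressed path for FILE under ROOT/blobs/CLASS.
# Assumption: the "ab" shard directory is the first two hex chars of the hash.
blob_path() (
  root=$1 class=$2 file=$3
  hash=$(sha256sum < "$file" | cut -d' ' -f1)
  prefix=$(printf '%s' "$hash" | cut -c1-2)
  printf '%s/blobs/%s/sha256/%s/%s\n' "$root" "$class" "$prefix" "$hash"
)

# Copy FILE into the store unless a blob with the same hash already exists.
store_blob() (
  root=$1 class=$2 file=$3
  dest=$(blob_path "$root" "$class" "$file")
  if [ -e "$dest" ]; then
    echo "duplicate: $dest"
  else
    mkdir -p "$(dirname "$dest")"
    cp "$file" "$dest"
    echo "stored: $dest"
  fi
)
```

Calling `store_blob /srv/crystal-coder cache manual.pdf` twice would store the file once and report a duplicate the second time, regardless of the file's name.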

Suggested server policy:

Back up or push:      /srv/crystal-coder/registry
Optional durable:     /srv/crystal-coder/blobs/archive
Do not back up:       /srv/crystal-coder/blobs/cache, incoming, work

If you rely on EBS snapshots, put cache, incoming, and work on a separate volume that is not snapshotted. EBS snapshots operate at the volume/block level, so directory-level exclusion is not a snapshot feature.
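
If you use file-level backups instead of volume snapshots, the same policy can be expressed by naming only the durable directories. This is a minimal sketch with a hypothetical helper, `backup_durable`, that is not part of ccvault. It assumes the default directory names shown in the layout and a `tar` that supports `-z`.

```shell
# Archive only the durable parts of a vault root.
# Usage: backup_durable ROOT /absolute/path/to/out.tar.gz
backup_durable() (
  root=$1 out=$2
  # The registry is always included.
  set -- registry
  # blobs/archive is optional, so add it only when it exists.
  if [ -d "$root/blobs/archive" ]; then
    set -- "$@" blobs/archive
  fi
  # blobs/cache, incoming, and work are never named, so they are left out.
  tar -C "$root" -czf "$out" "$@"
)
```

Pass an absolute output path outside the vault root so the tarball does not end up inside the tree being archived.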

Build

shards build --release
sudo install -m 0755 bin/ccvault /usr/local/bin/ccvault

Initialize on ollama.openbeagle.org

sudo mkdir -p /srv/crystal-coder
sudo chown "$USER:$USER" /srv/crystal-coder

ccvault init \
  --root /srv/crystal-coder \
  --remote git@github.com:embedconsult/crystal-coder-dataset-registry.git

Add PDFs and Markdown

ccvault ingest-dir /srv/crystal-coder/incoming/docs \
  --root /srv/crystal-coder \
  --class cache \
  --project crystal-coder \
  --tags docs,pdf,markdown \
  --ext pdf,md,markdown

Use --class archive for files that should be retained locally and optionally synced to S3. Use --class cache for files that can be gathered again.

Record external references without storing raw blobs

ccvault add-url s3://my-bucket/training-inputs/vendor-manual.pdf \
  --root /srv/crystal-coder \
  --project crystal-coder \
  --tags docs,pdf,external

Track Git commit history

ccvault track-repo /home/git/my-crystal-project \
  --root /srv/crystal-coder \
  --name my-crystal-project \
  --project crystal-coder \
  --tags code,git,crystal \
  --since 2024-01-01

This stores small JSON records that reference commit SHAs and changed paths. It does not store full diffs by default. The source Git repository remains the authoritative datastore for code history.
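
To get a feel for what such records reference, the sketch below lists commit SHAs and changed paths using plain Git commands. It is not how ccvault is implemented, and the tab-separated output is a stand-in for illustration only, not the tool's JSON record format. The helper name `list_commit_paths` is hypothetical.

```shell
# Print "<commit sha><TAB><changed path>" for every commit since a date.
# Usage: list_commit_paths REPO SINCE   (e.g. SINCE=2024-01-01)
list_commit_paths() (
  repo=$1 since=$2
  git -C "$repo" log --since="$since" --format=%H |
  while read -r sha; do
    # --root also reports files added by the very first commit.
    # Merge commits print no paths with these options.
    git -C "$repo" diff-tree --no-commit-id --name-only -r --root "$sha" |
    while IFS= read -r path; do
      printf '%s\t%s\n' "$sha" "$path"
    done
  done
)
```

Because only SHAs and paths are kept, the full content of any change can still be recovered later from the source repository with `git show <sha> -- <path>`.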

Create a durable training snapshot

ccvault snapshot crystal-coder-sft-v0 \
  --root /srv/crystal-coder \
  --project crystal-coder \
  --tags docs,code,git,crystal

The snapshot is a JSONL file committed to the registry. It is the exact content list you used for a training or RAG build.

Push the registry to GitHub

ccvault push --root /srv/crystal-coder

Optional: sync durable archive blobs to S3

ccvault sync-s3 s3://my-bucket/crystal-coder-vault \
  --root /srv/crystal-coder

Only blobs/archive is synced. Cache blobs are intentionally left out.

Verify

ccvault verify --root /srv/crystal-coder

This checks whether locally stored raw blobs still match their recorded SHA-256 hashes.
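
The same kind of check can be approximated by hand. The sketch below is an illustration, not ccvault's code: it assumes each blob's file name is its expected SHA-256 hash, as the storage layout suggests, whereas the real command compares against the hashes recorded in the registry. The helper name `verify_blobs` is hypothetical.

```shell
# Recompute the SHA-256 of every blob and compare it with the file name.
# Prints one "MISMATCH <path>" line per bad blob; exits non-zero if any found.
verify_blobs() (
  root=$1
  status=0
  for blob in "$root"/blobs/*/sha256/*/*; do
    # Skip the unexpanded pattern when no blobs exist.
    [ -f "$blob" ] || continue
    expected=$(basename "$blob")
    actual=$(sha256sum < "$blob" | cut -d' ' -f1)
    if [ "$actual" != "$expected" ]; then
      echo "MISMATCH $blob"
      status=1
    fi
  done
  exit "$status"
)
```

The non-zero exit status makes it easy to run such a check from cron or CI and alert on failure.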

Reclaim cache space

ccvault gc-cache --root /srv/crystal-coder --older-than-days 30

This removes rebuildable cached blobs older than the chosen age. The registry still records what was used.
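
To preview what an age-based cleanup would touch, a `find`-based dry run is a reasonable approximation. This sketch assumes age is judged by file modification time, which may differ from the criterion ccvault actually uses. The helper name `gc_cache` and its `--delete` argument are hypothetical, and `-delete` requires GNU or BSD `find`.

```shell
# List cache blobs last modified more than DAYS days ago.
# Usage: gc_cache ROOT DAYS [--delete]
gc_cache() (
  root=$1 days=$2 mode=${3:-}
  cache="$root/blobs/cache"
  [ -d "$cache" ] || exit 0
  if [ "$mode" = "--delete" ]; then
    # Print each path, then remove it.
    find "$cache" -type f -mtime +"$days" -print -delete
  else
    # Dry run: print only.
    find "$cache" -type f -mtime +"$days" -print
  fi
)
```

Only `blobs/cache` is scanned, so archive blobs and the registry are never candidates for removal.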
