crystal-coder-vault
ccvault is a small Crystal CLI that collects and deduplicates training data for the crystal-coder model and records its provenance.
It is designed for a server such as ollama.openbeagle.org where you want:
- raw files on cheap or rebuildable storage,
- tiny durable manifests in Git,
- optional S3 archive sync for raw files you cannot easily recreate,
- Git commit history as the native datastore for code examples.
The tool intentionally does not put a database in the middle. The registry is a normal Git repository that can be backed up to GitHub.
Storage layout
Default root:
/srv/crystal-coder
  registry/                       # Git repo; durable; push to GitHub
    records/YYYY/MM/DD/*.json     # file, URL, and Git commit records
    snapshots/<name>/*.jsonl      # exact list of content used for a dataset run
  blobs/
    archive/sha256/ab/<hash>      # durable local raw blobs, optionally S3-synced
    cache/sha256/ab/<hash>        # rebuildable local raw blobs; can be deleted
  incoming/                       # upload/drop zone
  work/                           # scratch
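The blobs/ entries above follow a content-addressed pattern. The following is a minimal sketch of how such a path could be derived. It assumes that `ab` stands for the first two hex characters of the file's SHA-256 digest, which the layout suggests but does not state. `blob_path` is a hypothetical helper for illustration, not part of ccvault.

```shell
#!/bin/sh
# Hypothetical helper, not part of ccvault: print the content-addressed
# path a file would map to under <root>/blobs/<class>/sha256/.
# Assumption: the "ab" directory is the first two hex characters of the hash.
blob_path() {
  root=$1    # e.g. /srv/crystal-coder
  class=$2   # archive or cache
  file=$3
  hash=$(sha256sum "$file" | cut -d ' ' -f 1)
  prefix=$(printf '%s' "$hash" | cut -c 1-2)
  printf '%s/blobs/%s/sha256/%s/%s\n' "$root" "$class" "$prefix" "$hash"
}

# Usage: blob_path /srv/crystal-coder cache ./vendor-manual.pdf
```

Because the path depends only on the file's bytes, two files with identical content map to the same path, which is what makes deduplication straightforward in a layout like this.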
Suggested server policy:
Back up or push:   /srv/crystal-coder/registry
Optional durable:  /srv/crystal-coder/blobs/archive
Do not back up:    /srv/crystal-coder/blobs/cache, /srv/crystal-coder/incoming, /srv/crystal-coder/work
If you rely on EBS snapshots, put cache, incoming, and work on a separate volume that is not snapshotted. EBS snapshots operate at the volume/block level, so directory-level exclusion is not a snapshot feature.
Build
shards build --release
sudo install -m 0755 bin/ccvault /usr/local/bin/ccvault
Initialize on ollama.openbeagle.org
sudo mkdir -p /srv/crystal-coder
sudo chown "$USER:$USER" /srv/crystal-coder
ccvault init \
  --root /srv/crystal-coder \
  --remote git@github.com:embedconsult/crystal-coder-dataset-registry.git
Add PDFs and Markdown
ccvault ingest-dir /srv/crystal-coder/incoming/docs \
  --root /srv/crystal-coder \
  --class cache \
  --project crystal-coder \
  --tags docs,pdf,markdown \
  --ext pdf,md,markdown
Use --class archive for files that should be retained locally and optionally synced to S3. Use --class cache for files that can be gathered again.
Record external references without storing raw blobs
ccvault add-url s3://my-bucket/training-inputs/vendor-manual.pdf \
  --root /srv/crystal-coder \
  --project crystal-coder \
  --tags docs,pdf,external
Track Git commit history
ccvault track-repo /home/git/my-crystal-project \
  --root /srv/crystal-coder \
  --name my-crystal-project \
  --project crystal-coder \
  --tags code,git,crystal \
  --since 2024-01-01
This stores small JSON records that reference commit SHAs and changed paths. It does not store full diffs by default. The source Git repository remains the authoritative datastore for code history.
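To see the kind of information such records reference, plain Git can list commit SHAs and changed paths directly. This is a sketch of the underlying idea only. `list_commits` is a hypothetical helper, and nothing here reflects how ccvault actually reads the repository or formats its JSON.

```shell
#!/bin/sh
# Hypothetical helper, not ccvault's implementation: print each commit SHA
# made since a given date, followed by the paths that commit changed.
list_commits() {
  repo=$1
  since=$2   # e.g. 2024-01-01
  # --name-only lists changed paths without diff content;
  # the format string prints one "commit <sha>" line per commit.
  git -C "$repo" log --since="$since" --name-only --pretty=format:'commit %H'
}

# Usage: list_commits /home/git/my-crystal-project 2024-01-01
```

No diff content appears in the output, which matches the idea of keeping only references while the source repository holds the history itself.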
Create a durable training snapshot
ccvault snapshot crystal-coder-sft-v0 \
  --root /srv/crystal-coder \
  --project crystal-coder \
  --tags docs,code,git,crystal
The snapshot is a JSONL file committed to the registry. It is the exact content list you used for a training or RAG build.
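Because a snapshot is one record per line, two snapshots can be compared with standard text tools to see what a later dataset run added. The sketch below assumes that the same content is serialized as byte-identical lines in both files, which this README does not guarantee. `snapshot_added` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the records present in the newer snapshot but
# absent from the older one.
# Assumption: identical content appears as byte-identical JSONL lines.
snapshot_added() {
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  sort "$1" > "$old_sorted"
  sort "$2" > "$new_sorted"
  # comm -13 hides lines unique to the first file and lines common to both,
  # leaving only the lines unique to the second file.
  comm -13 "$old_sorted" "$new_sorted"
  rm -f "$old_sorted" "$new_sorted"
}

# Usage: snapshot_added older.jsonl newer.jsonl
```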
Push the registry to GitHub
ccvault push --root /srv/crystal-coder
Optional: sync durable archive blobs to S3
ccvault sync-s3 s3://my-bucket/crystal-coder-vault \
  --root /srv/crystal-coder
Only blobs/archive is synced. Cache blobs are intentionally left out.
Verify
ccvault verify --root /srv/crystal-coder
This checks whether locally stored raw blobs still match their recorded SHA-256 hashes.
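The same kind of integrity check can be approximated by hand. The sketch below re-hashes every file under a blob directory and compares the result with the file's own name. It assumes the file name is the recorded SHA-256 hash, as the storage layout suggests, and uses that as a stand-in for the hashes kept in the registry. `verify_blobs` is a hypothetical helper, not the ccvault implementation.

```shell
#!/bin/sh
# Hypothetical helper: report blobs whose content no longer matches the
# SHA-256 hash used as their file name.
# Assumption: each blob is stored under a file name equal to its hash.
verify_blobs() {
  find "$1" -type f | while IFS= read -r path; do
    expected=$(basename "$path")
    actual=$(sha256sum "$path" | cut -d ' ' -f 1)
    [ "$expected" = "$actual" ] || printf 'MISMATCH %s\n' "$path"
  done
}

# Usage: verify_blobs /srv/crystal-coder/blobs/archive
```

Empty output means every blob hashed to its own name. Each mismatch is printed on its own line.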
Reclaim cache space
ccvault gc-cache --root /srv/crystal-coder --older-than-days 30
This removes rebuildable cached blobs older than the chosen age. The registry still records what was used.
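To preview which files an age-based cleanup might touch before running it, a read-only listing can help. The sketch below assumes that age is judged by file modification time, which this README does not specify. `list_stale_cache` is a hypothetical helper and deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper: list cached blobs older than a number of days.
# Assumption: "older than" is measured by modification time (mtime).
list_stale_cache() {
  cache_dir=$1
  days=$2
  # find's -mtime +N matches files last modified more than N full 24-hour
  # periods ago; fractional days are discarded before the comparison.
  find "$cache_dir" -type f -mtime +"$days" -print
}

# Usage: list_stale_cache /srv/crystal-coder/blobs/cache 30
```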
- May 9, 2026
Sat, 09 May 2026 13:36:33 GMT