cr_1brc

A Crystal take on the One Billion Row Challenge, done as an iterative optimization exercise: start with a deliberately naive implementation, then improve it one isolated change at a time and measure the effect of each.

The goal here is not to top the leaderboard. It's to rediscover the performance techniques from the ground up, one rung at a time, and keep an honest record of what each change actually bought.

The problem

Read a file of ~1,000,000,000 rows in the format <station>;<temperature>, where the temperature always has exactly one fractional digit:

Hamburg;12.0
Bulawayo;8.9
St. John's;15.2

Compute the min, mean, and max temperature per station and print them sorted by station name:

{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, ...}

The min, mean, and max are each rounded to one fractional digit.
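To make the task concrete, here is a minimal sketch of the naive approach (the m1 idea: `split` on `;`, `Float64` parse, `Hash` aggregate). It is not the repo's src/naive.cr. The names `aggregate` and `one_decimal` are hypothetical, it reads from any `IO` rather than a file path, and it leaves rounding of the mean to `sprintf`, which is an assumption rather than the rounding rule the challenge specifies.

```crystal
# Hypothetical helper: format a float with exactly one fractional digit.
def one_decimal(value : Float64) : String
  sprintf("%.1f", value)
end

# Hypothetical naive aggregation: one String per line, one split per line.
def aggregate(io : IO) : String
  # station name => {min, max, sum, count}
  stats = Hash(String, Tuple(Float64, Float64, Float64, Int32)).new

  io.each_line do |line|
    next if line.empty?
    name, temp_text = line.split(';', 2)
    temp = temp_text.to_f

    if current = stats[name]?
      min, max, sum, count = current
      stats[name] = {Math.min(min, temp), Math.max(max, temp), sum + temp, count + 1}
    else
      stats[name] = {temp, temp, temp, 1}
    end
  end

  # Sort by station name and join into the single-line output format.
  entries = stats.keys.sort.map do |name|
    min, max, sum, count = stats[name]
    "#{name}=#{one_decimal(min)}/#{one_decimal(sum / count)}/#{one_decimal(max)}"
  end
  "{#{entries.join(", ")}}"
end

sample = IO::Memory.new("Hamburg;12.0\nBulawayo;8.9\nHamburg;14.0\n")
puts aggregate(sample)
```

Every row here allocates a line string, two substrings, and a tuple, which is the cost the later rungs are meant to remove.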

Layout

Each optimization step is its own entry file under src/, wired up as a named build target in shard.yml. Every version coexists in the tree, so any two rungs can be benchmarked against each other directly without checking out a different commit.

# shard.yml
targets:
  naive:
    main: src/naive.cr
  int:
    main: src/int.cr
  # ...one target per milestone

Shared, version-independent code (output formatting, the sorted print, the stats merge) lives in src/common.cr and is required by each increment, so each src/*.cr contains only the delta for that rung.
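As an illustration of the kind of code that could live in the shared file, the sketch below shows one possible shape for a per-station stats record and its merge. The `Stats` struct, its fields, and `merge_into` are hypothetical and not taken from src/common.cr; values are assumed to be stored as integer tenths of a degree.

```crystal
# Hypothetical stats record; all temperatures are in tenths of a degree.
struct Stats
  getter min : Int64
  getter max : Int64
  getter sum : Int64
  getter count : Int64

  def initialize(@min : Int64, @max : Int64, @sum : Int64, @count : Int64)
  end

  # Stats for a single observation.
  def self.of(tenths : Int64) : Stats
    new(tenths, tenths, tenths, 1_i64)
  end

  # Combine two partial results for the same station.
  def merge(other : Stats) : Stats
    Stats.new(
      Math.min(min, other.min),
      Math.max(max, other.max),
      sum + other.sum,
      count + other.count
    )
  end
end

# Fold every entry of `source` into `target`, merging stations present in both.
def merge_into(target : Hash(String, Stats), source : Hash(String, Stats)) : Nil
  source.each do |name, stats|
    if existing = target[name]?
      target[name] = existing.merge(stats)
    else
      target[name] = stats
    end
  end
end

a = {"Hamburg" => Stats.of(120_i64)}
b = {"Hamburg" => Stats.of(-35_i64), "Bulawayo" => Stats.of(89_i64)}
merge_into(a, b)
puts a["Hamburg"].min   # smallest value seen for Hamburg, in tenths
puts a["Hamburg"].count # number of observations merged
```

Because the merge is associative, the same function would serve both a single-threaded loop and a later step that combines per-core maps.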

Milestones are tagged in git (m1-naive, m2-int, ...). git log --oneline is the changelog; git checkout <tag> reproduces any rung.

Build & run

# build one target
shards build --release naive

# build all targets
shards build --release

# run against a data file (path is required; no hardcoded default)
./bin/naive measurements_10m.txt

Iterate on the 10M file (sub-second feedback) and only run the 1B file at milestone tags to confirm a speedup holds at scale.

Generating the data

The data files are not committed (they're 130 MB / 13 GB and trivially regenerable). Generate them with the official Java generator:

git clone https://github.com/gunnarmorling/1brc
cd 1brc
./mvnw clean verify                 # needs JDK 21
java -cp target/average-1.0.0-SNAPSHOT.jar \
     dev.morling.onebrc.CreateMeasurements 10000000

The generator writes to ./measurements.txt (the row count is the only argument). Rename each file by size and move it next to this repo:

mv measurements.txt measurements_10m.txt
# ...and a second run with 1000000000 for measurements_1b.txt

Correctness

expected_10m.out is the frozen golden output, originally produced by the reference CalculateAverage_baseline from the 1brc repo on the 10M file. It is tracked (one ~13 KB line). verify.sh runs a built binary and diffs its output against it:

./verify.sh ./bin/naive measurements_10m.txt

Both outputs are a single comma-joined line, so the script splits on commas before diffing to get per-station results. A clean diff means the rung is correct; run it at every milestone before recording a time.

Note on rounding: the original golden file reflects float summation with half-to-even rounding. Once integer/fixed-point parsing lands and rounding is made explicit (half-up, per the spec), a handful of means may legitimately shift by 0.1. That is the implementation becoming more correct, not a regression. Regenerate expected_10m.out from that rung once the deltas are confirmed to be rounding-only, and note it in RESULTS.md.
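A minimal sketch of explicit rounding on integer tenths, assuming "half-up" means ties go toward positive infinity and that the sum is kept as an exact `Int64` in tenths. The names `mean_tenths` and `format_tenths` are hypothetical. The formula is floor(sum/count + 1/2), rewritten as floor((2·sum + count) / (2·count)) so that it stays in integer arithmetic.

```crystal
# Mean in tenths, rounded half-up (ties toward positive infinity).
# Assumes count > 0. Int64#// is floor division, so negative sums work too.
def mean_tenths(sum : Int64, count : Int64) : Int64
  (2_i64 * sum + count) // (2_i64 * count)
end

# Render a value in tenths with exactly one fractional digit.
def format_tenths(tenths : Int64) : String
  sign = tenths < 0 ? "-" : ""
  magnitude = tenths.abs
  "#{sign}#{magnitude // 10}.#{magnitude % 10}"
end

# 1.2 and 1.3 average to 1.25, a tie that rounds up to 1.3.
puts format_tenths(mean_tenths(25_i64, 2_i64))
# -1.2 and -1.3 average to -1.25, a tie that rounds up to -1.2.
puts format_tenths(mean_tenths(-25_i64, 2_i64))
```

Ties like these are the cases where a float-based mean and an integer-based mean can print different last digits.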

Milestones

| #  | Step                        | Idea                                                        | Status  |
| -- | --------------------------- | ----------------------------------------------------------- | ------- |
| m1 | Naive baseline              | `split` on `;`, `Float64` parse, `Hash` aggregate           | ✅ done |
| m2 | Integer temperatures        | Parse to tenths as `Int64`; exact sum, no `to_f`            | ✅ done |
| m3 | Parse in place              | Drop `split`; scan bytes, no per-line substrings            | ✅ done |
| m4 | Reusable read buffer / mmap | Stop minting a `String` per line                            | planned |
| m5 | Custom byte-keyed map       | Open-addressing table keyed on name bytes                   | planned |
| m6 | Parallelize                 | Split input across cores, local maps, merge                 | planned |
| m7 | SWAR / branchless           | Word-at-a-time delimiter scan, branch-free parse (if hot)   | planned |

Order is the expected hot path, not gospel: after m3, profile before picking the next rung rather than assuming.
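To illustrate the m2 and m3 ideas together, here is a minimal sketch that scans a row's bytes for the `;` and parses the temperature straight into integer tenths, without `split` or `to_f`. It is not the repo's code: `parse_tenths` and `split_row` are hypothetical names, and the input is assumed to be well formed (an optional `-`, digits, one `.`, exactly one fractional digit).

```crystal
# Parse bytes such as "-15.2" into tenths (-152) without building a String.
# Assumes well-formed input; no validation is done.
def parse_tenths(bytes : Bytes) : Int64
  negative = false
  value = 0_i64
  bytes.each do |byte|
    if byte == '-'.ord
      negative = true
    elsif byte == '.'.ord
      # Skip the decimal point: the value is accumulated in tenths.
    else
      value = value * 10 + (byte.to_i64 - '0'.ord)
    end
  end
  negative ? -value : value
end

# Split one row at the first ';' and return the name bytes plus the tenths.
# The name is a sub-slice of the row, so no bytes are copied.
def split_row(row : Bytes) : Tuple(Bytes, Int64)
  separator = row.index(';'.ord.to_u8) || raise "missing ';' in row"
  name = row[0, separator]
  temperature = row[separator + 1, row.size - separator - 1]
  {name, parse_tenths(temperature)}
end

name, tenths = split_row("St. John's;-15.2".to_slice)
puts String.new(name) # a String is built here only for display
puts tenths
```

The returned name slice points into the row's own memory, so it is only valid while that buffer is alive, which is worth keeping in mind once the read buffer is reused.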

Results

See RESULTS.md.

Credits

The challenge, data generator, reference baseline, and the techniques being rediscovered here are all from Gunnar Morling's 1brc. Results write-up: 1BRC — The Results Are In!.

Repository

cr_1brc

Owner

Lillevang

License

MIT License

Languages

Crystal 96.33% Shell 3.67%