cr_1brc

cr_1brc

A Crystal take on the One Billion Row Challenge, done as an iterative optimization exercise: start with a deliberately naive implementation, then improve it one isolated change at a time and measure the effect of each.

The goal here is not to top the leaderboard. It's to rediscover the performance techniques from the ground up, one rung at a time, and keep an honest record of what each change actually bought.

The problem

Read a file of ~1,000,000,000 rows in the format <station>;<temperature>, where the temperature always has exactly one fractional digit:

Hamburg;12.0
Bulawayo;8.9
St. John's;15.2

Compute the min, mean, and max temperature per station and print them sorted by station name:

{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, ...}

Mean and the min/max are each rounded to one fractional digit.

Layout

Each optimization step is its own entry file under src/, wired up as a named build target in shard.yml. Every version coexists in the tree, so any two rungs can be benchmarked against each other directly without checking out a different commit.

# shard.yml
targets:
  naive:
    main: src/naive.cr
  int:
    main: src/int.cr
  # ...one target per milestone

Shared, version-independent code (output formatting, the sorted print, the stats merge) lives in src/common.cr and is required by each increment, so each src/*.cr contains only the delta for that rung.

Milestones are tagged in git (m1-naive, m2-int, ...). git log --oneline is the changelog; git checkout <tag> reproduces any rung.

Build & run

# build one target
shards build --release naive

# build all targets
shards build --release

# run against a data file (path is required; no hardcoded default)
./bin/naive measurements_10m.txt

Iterate on the 10M file (sub-second feedback) and only run the 1B file at milestone tags to confirm a speedup holds at scale.

Generating the data

The data files are not committed (they're 130 MB / 13 GB and trivially regenerable). Generate them with the official Java generator:

git clone https://github.com/gunnarmorling/1brc
cd 1brc
./mvnw clean verify                 # needs JDK 21
java -cp target/average-1.0.0-SNAPSHOT.jar \
     dev.morling.onebrc.CreateMeasurements 10000000

The generator writes to ./measurements.txt (the row count is the only argument). Rename per size and move them next to this repo:

mv measurements.txt measurements_10m.txt
# ...and a second run with 1000000000 for measurements_1b.txt

Correctness

expected_10m.out is the frozen golden output, originally produced by the reference CalculateAverage_baseline from the 1brc repo on the 10M file. It is tracked (one ~13 KB line). verify.sh runs a built binary and diffs its output against it:

./verify.sh ./bin/naive measurements_10m.txt

Both outputs are a single comma-joined line, so the script splits on commas before diffing to get per-station results. A clean diff means the rung is correct; run it at every milestone before recording a time.

Note on rounding: the original golden file reflects float summation with half-to-even rounding. Once integer/fixed-point parsing lands and rounding is made explicit (half-up, per the spec), a handful of means may legitimately shift by 0.1. That is the implementation becoming more correct, not a regression — regenerate expected_10m.out from that rung once the deltas are confirmed to be rounding-only, and note it in RESULTS.md.

Milestones

# Step Idea Status
m1 Naive baseline split on ;, Float64 parse, Hash aggregate ✅ done
m2 Integer temperatures Parse to tenths as Int64; exact sum, no to_f ✅ done
m3 Parse in place Drop split; scan bytes, no per-line substrings planned
m4 Reusable read buffer / mmap Stop minting a String per line planned
m5 Custom byte-keyed map Open-addressing table keyed on name bytes planned
m6 Parallelize Split input across cores, local maps, merge planned
m7 SWAR / branchless Word-at-a-time delimiter scan, branch-free parse — if hot planned

Order is the expected hot path, not gospel: after m3, profile before picking the next rung rather than assuming.

Results

See the RESULTS.md file

Credits

The challenge, data generator, reference baseline, and the techniques being rediscovered here are all from Gunnar Morling's 1brc. Results write-up: 1BRC — The Results Are In!.

Repository

cr_1brc

Owner
Statistic
  • 0
  • 0
  • 0
  • 0
  • 0
  • about 1 hour ago
  • June 8, 2026
License

MIT License

Links
Synced at

Mon, 08 Jun 2026 13:37:53 GMT

Languages