grubber

A data retrieval tool for Markdown files. It grubs quickly through a large field of data.

YAML code blocks in Markdown are usually treated as code examples, something to display or syntax-highlight. grubber treats them as structured data records that live next to their context, across an entire directory of files. Think Dataview without Obsidian.

Quick example

Input (project-alpha.md):

---
title: Project Alpha
keywords: [project]
---

# Project Alpha

```yaml
type: project
name: "Project Alpha"
org: "Northwind Corp"
status: active
start: 2025-01-15
end: 2025-06-30
owner: Jane Smith
```

Kickoff completed. First milestone due end of February.

Output (grubber extract examples/):

[
  {
    "title": "Project Alpha",
    "keywords": ["project"],
    "_note_file": "examples/project-alpha.md",
    "type": "project",
    "name": "Project Alpha",
    "org": "Northwind Corp",
    "status": "active",
    "start": "2025-01-15",
    "end": "2025-06-30",
    "owner": "Jane Smith"
  }
]

Frontmatter and YAML block are merged into one flat record. The prose stays in Markdown, untouched. Run this across a directory of 1,000 notes and you get a single JSON array containing every record.

Installation

Ruby

No dependencies beyond stdlib. Requires Ruby 3.1+.

chmod +x grubber
cp grubber /usr/local/bin/

Crystal (optional)

A version written in the Crystal programming language is included. Crystal is a compiled language with a Ruby-like syntax. It produces a standalone binary with zero interpreter startup time, uses a multi-threaded worker pool, and outputs records in deterministic order sorted by filename.

crystal build grubber.cr -o grubber_crystal --release
cp grubber_crystal /usr/local/bin/grubber

Pre-built binary for macOS (Apple Silicon) available under Releases.

Quick start

grubber extract ~/notes
grubber extract ~/notes -f "type=project" --format tsv
grubber extract ~/notes/project.md

Why

Structured data and the context around it usually live in different places: a database for the fields, a wiki or folder for the notes. grubber keeps them together as queryable YAML blocks inside Markdown files. The data stays where you read and write it.

Standard Markdown. Any editor or renderer handles the format correctly. grubber just adds a read layer on top.
Fast enough to skip the database. 1,000 files in under 0.2 s (Ruby) or about 15 ms (Crystal). No index, no daemon, no setup.
Only structure what you query. Put queryable fields in YAML blocks. Everything else, such as addresses, serial numbers, or meeting notes, stays in plain Markdown. If you'd never filter by it, don't put it in a code block.

vs. databases

grubber is not a multi-dimensional database. But it covers many use cases: filtering, sorting, aggregating flat records across thousands of files. For personal data like contracts, contacts, inventory, or projects, that's often enough. And your data stays in plain text files that outlive any software.

vs. Dataview (Obsidian)

	Dataview	grubber
Editor lock-in	Obsidian only	Any editor
Query language	Proprietary DQL	Standard tools (jq, nushell, miller)
Output	In-note rendering	JSON/TSV for any pipeline
Live updates	Yes	No (run on demand)
Extensible	Plugin API	Shell scripts, any language

grubber trades live updates for tool independence. No proprietary query language to learn. If you know jq or nushell, you already know how to query grubber output.

How it works

grubber scans Markdown files for YAML frontmatter and fenced YAML code blocks, merges them into flat records, and outputs JSON or TSV. It does one thing: extract. Beyond the simple field filters (-f), logic such as sorting, aggregating, or complex filtering happens downstream with standard tools.

Multiple YAML blocks per file produce multiple records, each inheriting the frontmatter fields. The YAML block holds queryable data. The prose around it is context that doesn't need to be queried.

Available as a standalone CLI or as a Ruby library for use in other scripts:

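require_relative 'grubber'

# blocks_only: true skips notes that have frontmatter but no YAML block
grubber = DataGrubber::Grubber.new('~/notes', blocks_only: true)
result = grubber.extract
# Print the "name" field of every extracted record
result[:records].each { |r| puts r['name'] }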

Usage

# Extract all records from a directory
grubber extract ~/notes

# Output as TSV
grubber extract ~/notes --format tsv

# Filter records
grubber extract ~/notes -f "type=contract"
grubber extract ~/notes -f "type=contract" -f "end^2025"

# Only YAML blocks (skip frontmatter-only notes)
grubber extract ~/notes --blocks-only

# Only frontmatter (ignore YAML blocks)
grubber extract ~/notes --frontmatter-only

# Write to file
grubber extract ~/notes -o data.json

# Use a config set
grubber extract --set contracts

Filter operators

Operator	Meaning	Example
`=`	equals	`type=contract`
`~`	contains	`name~hosting`
`^`	starts with	`end^2025`
`!`	not equals	`status!archived`

Filters are case-insensitive and work on arrays (matches if any element matches).
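
The operator semantics in the table can be sketched in Ruby as follows. This is an illustration only, not grubber's actual code: `parse_filter` and `matches?` are hypothetical names, and two behaviours are assumptions rather than documented facts: a missing field is treated as an empty string, and `!` on an array is read as "no element equals the value".

```ruby
# Hypothetical sketch of the filter semantics; not grubber's implementation.
OPERATORS = {
  '=' => ->(value, expected) { value == expected },
  '~' => ->(value, expected) { value.include?(expected) },
  '^' => ->(value, expected) { value.start_with?(expected) }
}.freeze

# Split an expression such as "end^2025" into field, operator, and value
# at the first operator character.
def parse_filter(expr)
  match = expr.match(/\A([^=~^!]+)([=~^!])(.*)\z/)
  raise ArgumentError, "invalid filter: #{expr}" unless match

  field, op, expected = match.captures
  [field.strip, op, expected.strip]
end

def matches?(record, expr)
  field, op, expected = parse_filter(expr)
  expected = expected.downcase

  # Scalars and arrays are handled alike; comparison is case-insensitive.
  values = Array(record[field]).map { |v| v.to_s.downcase }
  values = [''] if values.empty? # assumption: missing field acts like ""

  # Assumption: "!" negates "=", so no element may equal the value.
  return values.none? { |v| v == expected } if op == '!'

  values.any? { |v| OPERATORS.fetch(op).call(v, expected) }
end
```

Repeated `-f` flags would then correspond to requiring `matches?` to hold for every expression.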

Piping to other tools

grubber outputs JSON by default, designed for piping. The downstream tool does the thinking:

# jq: contracts expiring in 2025
grubber extract ~/notes -f type=contract | jq '[.[] | select((.end // "") | startswith("2025"))]'

# nushell: sort contacts by last interaction
grubber extract ~/notes -f type=person | from json | sort-by last_contact -r

# miller: TSV processing
grubber extract ~/notes --format tsv | mlr --tsv sort -nr amount
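
Because the output is plain JSON, any language can act as the downstream tool. A minimal Ruby sketch of the jq example above, assuming records use the `type`, `end`, and `name` fields from this README; the helper name `contracts_expiring_in` and the script name `expiring.rb` are made up for illustration:

```ruby
require 'json'

# Hypothetical helper: keep contracts whose "end" date falls in the given year.
# Records without an "end" field are skipped rather than raising.
def contracts_expiring_in(json_text, year)
  JSON.parse(json_text).select do |record|
    record['type'].to_s.casecmp?('contract') &&
      record['end'].to_s.start_with?(year.to_s)
  end
end

# Intended use as a script:
#   grubber extract ~/notes | ruby expiring.rb
# with a final line such as:
#   puts contracts_expiring_in($stdin.read, 2025).map { |r| r['name'] }
```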

Configuration

Optional config file at ~/.config/grubber/config.yaml:

defaults:
  blocks_only: true
  array_fields: [keywords, category]

sets:
  contracts:
    path: ~/notes
    filters: [type=contract]
    blocks_only: true

  people:
    path: ~/notes
    filters: [type=person]

Use sets with grubber extract --set contracts.

Override hierarchy

CLI flags > Config set > Environment variables > Config defaults > Built-in defaults
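
One way to picture this precedence is a layered merge in which each higher-priority source overwrites the one below it. The sketch below is an assumption about how such a hierarchy could be resolved, not grubber's actual code; `resolve_options` and the option keys are hypothetical.

```ruby
# Hypothetical sketch: merge option layers from lowest to highest priority.
# Unset options (nil) in a layer do not override lower layers.
def resolve_options(builtin:, config_defaults:, env:, config_set:, cli:)
  [builtin, config_defaults, env, config_set, cli].reduce({}) do |merged, layer|
    merged.merge(layer.compact)
  end
end

options = resolve_options(
  builtin:         { format: 'json', blocks_only: false },
  config_defaults: { blocks_only: true },
  env:             { path: '/env/notes' },
  config_set:      { path: '/set/notes', filters: ['type=contract'] },
  cli:             { format: 'tsv', path: nil }
)
```

Here `format` comes from the CLI layer, `path` from the config set (the CLI left it unset), and `blocks_only` from the config defaults.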

Environment variables

Variable	Purpose
`GRUBBER_NOTES`	Default notes directory
`GRUBBER_ARRAY_FIELDS`	Fields to normalize to arrays (comma-separated)

Options

-o, --output FILE         Write to file instead of stdout
-s, --set NAME            Load options from config set
    --format FORMAT       json (default) or tsv
-b, --blocks-only         Only extract YAML blocks
-m, --frontmatter-only    Only extract frontmatter
-a, --all                 Extract everything, override config defaults
    --array-fields FIELDS Normalize fields to arrays (splits comma-separated values)
    --mmd                 Also read MultiMarkdown metadata headers
-d, --depth N             Limit directory recursion depth (0 = no subdirectories)
-f, --filter EXPR         Filter records (repeatable)
-h, --help                Show help

--array-fields normalizes string values to arrays. Comma-separated strings like a, b, c are split into ["a", "b", "c"]. YAML arrays are kept as-is. Values that contain commas as part of their meaning (e.g. "Smith, John") should be written as a YAML array instead.
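
The normalization rule can be sketched like this. The function name `normalize_array_fields` is hypothetical, and the handling of missing or non-string scalar values is an assumption; only the string-splitting and keep-arrays behaviour comes from the description above.

```ruby
# Hypothetical sketch of --array-fields normalization.
def normalize_array_fields(record, fields)
  record.to_h do |key, value|
    next [key, value] unless fields.include?(key)

    normalized =
      case value
      when Array  then value                                   # YAML arrays are kept as-is
      when String then value.split(',').map(&:strip).reject(&:empty?)
      when nil    then []                                      # assumption
      else [value]                                             # assumption: wrap other scalars
      end
    [key, normalized]
  end
end

record = {
  'keywords' => 'a, b, c',
  'category' => ['Smith, John'], # written as a YAML array, so the comma survives
  'name'     => 'x, y'           # not listed as an array field, left untouched
}
normalized = normalize_array_fields(record, %w[keywords category])
```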

How to structure your notes

grubber reads two things from a Markdown file: YAML frontmatter and fenced YAML code blocks (```yaml). Everything else is ignored.

Frontmatter holds note-level metadata (title, keywords, created date). These fields are merged into every record from that file.
YAML code blocks hold structured data records. Only ```yaml blocks are read — other fenced blocks are ignored.
Multiple YAML blocks in one note produce multiple records. Each inherits the frontmatter fields.
On field name collision, the YAML block wins over frontmatter.
Notes without YAML blocks are extracted as frontmatter-only records (unless --blocks-only).
A _note_file field is added automatically to every record for traceability.
grubber scans directories recursively. Every .md file is included.
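
The rules above can be condensed into a short Ruby sketch. It is a simplified illustration under stated assumptions, not grubber's implementation: the regular expressions, `load_yaml`, and `extract_records` are hypothetical, the function works on one file's text rather than a directory, and date values stay as Ruby `Date` objects here.

```ruby
require 'yaml'
require 'date'

# Assumed patterns: frontmatter at the very top, fenced blocks tagged "yaml".
FRONTMATTER = /\A---\s*\n(.*?)\n---\s*\n/m
YAML_BLOCK  = /^`{3}yaml\s*\n(.*?)\n`{3}\s*$/m

# Parse YAML text and fall back to an empty hash for anything that is not a mapping.
def load_yaml(text)
  data = YAML.safe_load(text, permitted_classes: [Date])
  data.is_a?(Hash) ? data : {}
end

def extract_records(markdown, path, blocks_only: false)
  frontmatter_match = markdown.match(FRONTMATTER)
  frontmatter = frontmatter_match ? load_yaml(frontmatter_match[1]) : {}

  # Every record carries the frontmatter fields plus its source file.
  base = frontmatter.merge('_note_file' => path)
  blocks = markdown.scan(YAML_BLOCK).flatten.map { |body| load_yaml(body) }

  if blocks.empty?
    # Notes without YAML blocks yield a frontmatter-only record.
    return blocks_only ? [] : [base]
  end

  # One record per block; on a name collision the block value wins.
  blocks.map { |block| base.merge(block) }
end
```

Calling `extract_records(File.read(path), path)` for each `.md` file and concatenating the results would give the flat record list.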

See examples/SCHEMA.md for an example schema.

Design

Extract only. grubber reads and outputs. No transforms, no joins, no computed fields. Complexity belongs in downstream tools.
Valid Markdown. The format doesn't break any renderer. grubber adds a queryable layer on top.
Dates are output as strings (YYYY-MM-DD) for safe JSON serialization.
Schema-agnostic. grubber extracts whatever YAML it finds. Field names and record types are up to you.
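
To illustrate the date rule: Ruby's YAML parser turns an unquoted `2025-01-15` into a `Date` object, so a converter has to turn it back into a string before writing JSON. A minimal sketch, assuming a hypothetical `stringify_dates` helper (grubber's own conversion may differ):

```ruby
require 'yaml'
require 'date'
require 'json'

# Hypothetical helper: replace Date values with YYYY-MM-DD strings, recursively.
def stringify_dates(value)
  case value
  when Date  then value.strftime('%Y-%m-%d')
  when Hash  then value.transform_values { |v| stringify_dates(v) }
  when Array then value.map { |v| stringify_dates(v) }
  else value
  end
end

# Date must be explicitly permitted, otherwise safe_load rejects it.
record = YAML.safe_load("start: 2025-01-15\nend: 2025-06-30", permitted_classes: [Date])
json = JSON.generate(stringify_dates(record))
```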

License

MIT

Repository

grubber

Owner

rhsev


Languages

Crystal 50.05% Ruby 47.83% Makefile 2.12%

grubber v0.7.2
