markout v0.3.0

Convert HTML to clean, readable Markdown in Crystal.

Markout

Convert HTML to clean, readable Markdown in Crystal. Designed for content migration, static site generators, documentation tools, and LLM data preparation pipelines.

Features

  • Fast: 25-40x faster than Python alternatives (markdownify, html2text)
  • Clean output: Strips navigation, scripts, styles, and UI chrome automatically
  • LLM-ready: Produces clean markdown ideal for RAG pipelines and context windows
  • Configurable: ATX/Setext headings, bullet styles, reference links, and more
  • Comprehensive: Tables, nested lists, blockquotes, code blocks, images, links
  • CLI included: Command-line tool for Unix-style pipelines

Installation

As a Library

Add the dependency to your shard.yml:

dependencies:
  markout:
    github: amscotti/markout

Then run:

shards install

CLI Binary

Download pre-built binaries from the releases page or build from source:

shards build --production
# Binary will be at bin/markout

Or use Docker:

docker pull ghcr.io/amscotti/markout:latest

Quick Start

CLI Usage

The markout CLI follows Unix philosophy - read from stdin or file, write to stdout or file:

# Convert from stdin (pipe)
curl -s https://example.com | markout > article.md

# Convert file to stdout
markout input.html

# Convert file to file
markout input.html -o output.md

# Chain with other tools
cat page.html | markout | grep "^#" > headings.txt

CLI Options

Flag Description
-o, --output FILE Output file (default: stdout)
--heading-style=STYLE atx (#) or setext (underlined)
--bullet-char=CHAR -, *, or + (default: -)
--link-style=STYLE inline or referenced
--strip-document Strip HTML document wrapper (default: true)
--no-strip-document Keep document wrapper
-h, --help Show help
-v, --version Show version

CLI Examples

# Fetch HN article and convert
curl -sL https://amplifying.ai/research/claude-code-picks | markout

# Use Setext headings and asterisk bullets
markout --heading-style=setext --bullet-char="*" article.html

Library Usage

require "markout"

html = "<h1>Hello</h1><p>This is <strong>bold</strong> text.</p>"
markdown = Markout.convert(html)
# => "# Hello\n\nThis is **bold** text."

Usage Examples

Basic Conversion

require "markout"

# Simple HTML to Markdown
html = <<-HTML
  <h1>Welcome</h1>
  <p>This is a <strong>test</strong> with a <a href="https://example.com">link</a>.</p>
  <ul>
    <li>Item one</li>
    <li>Item two</li>
  </ul>
HTML

puts Markout.convert(html)
# Output:
# # Welcome
#
# This is a **test** with a [link](https://example.com).
#
# - Item one
# - Item two

With Options

require "markout"

html = "<h1>Title</h1><ul><li>Item</li></ul>"

# Use Setext-style headings and asterisk bullets
options = Markout::Options.new
options.heading_style = Markout::Options::HeadingStyle::Setext
options.bullet_char = '*'

puts Markout.convert(html, options)
# Output:
# Title
# =====
#
# * Item

Reference-Style Links

require "markout"

html = <<-HTML
  <p>Visit <a href="https://example.com">Example</a> and
  <a href="https://test.com">Test</a> for more info.</p>
HTML

options = Markout::Options.new
options.link_style = Markout::Options::LinkStyle::Referenced

puts Markout.convert(html, options)
# Output:
# Visit [Example][1] and [Test][2] for more info.
#
# [1]: https://example.com
# [2]: https://test.com

Processing Web Pages

require "markout"
require "http/client"

# Fetch and convert a web page
response = HTTP::Client.get("https://example.com/article")
markdown = Markout.convert(response.body)

# Navigation, scripts, and styles are automatically stripped
# Only the article content remains

Reusable Converter

require "markout"

# Create a converter instance for multiple documents
options = Markout::Options.new
options.code_fence = "~~~"

converter = Markout::Converter.new(options)

docs = ["<p>Doc 1</p>", "<p>Doc 2</p>", "<p>Doc 3</p>"]
results = docs.map { |html| converter.convert(html) }

Configuration Options

Option Type Default Description
heading_style HeadingStyle ATX ATX (#) or Setext (underline)
bullet_char Char '-' Bullet character for unordered lists
emphasis_char Char '*' Character for italic text
strong_char String "**" Characters for bold text
code_fence String "```" Code fence delimiter
hr_style String "---" Horizontal rule style
link_style LinkStyle Inline Inline or Referenced links
autolinks Bool true Use <url> when link text matches URL
strip_document Bool true Strip leading/trailing whitespace

Performance

Benchmarks against Python's markdownify and html2text (599KB Wikipedia page):

Library Time Output Size
Markout 7.5ms 166KB
html2text 83ms 204KB
markdownify 203ms 210KB

Markout is typically 10-30x faster than Python alternatives and produces 20-30% smaller output by stripping non-content elements. Reproducible scripts can be found in the benchmarks/ directory.

Use Cases

  • RAG Pipelines: Extract clean content from web pages for vector databases
  • Content Migration: Convert HTML documentation to Markdown
  • LLM Context: Maximize useful content in context windows
  • Static Sites: Process HTML for Jekyll, Hugo, or other generators
  • Web Scraping: Clean article extraction from web pages

Development

# Install dependencies
shards install

# Run tests
crystal spec

# Run linter
bin/ameba

Contributing

  1. Fork it (https://github.com/amscotti/markout/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Building from Source

Prerequisites

  • Crystal >= 1.10.0
  • LLVM (for compilation)

Build Commands

# Install dependencies
shards install

# Build CLI binary
shards build --production

# Run tests
crystal spec

# Run linter
bin/ameba

# Format code
crystal tool format

# Generate documentation
crystal docs

Cross-Platform Builds

The project includes GitHub Actions workflows for building release binaries:

  • Linux (amd64, arm64) - statically linked
  • macOS (amd64, arm64)
  • Windows (amd64)

See .github/workflows/release.yml for details.

Docker

Using the Image

# Pull from GitHub Container Registry
docker pull ghcr.io/amscotti/markout:latest

# Convert a file
docker run --rm -v $(pwd):/data ghcr.io/amscotti/markout /data/input.html -o /data/output.md

# Pipe through docker
curl -s https://example.com | docker run --rm -i ghcr.io/amscotti/markout

Building the Image

docker build -t markout .

License

MIT License - see LICENSE for details.

Author

Repository

markout

Owner
Statistic
  • 0
  • 0
  • 0
  • 1
  • 2
  • 2 days ago
  • January 17, 2026
License

MIT License

Links
Synced at

Mon, 09 Mar 2026 00:13:38 GMT

Languages