readability.cr

A standalone Crystal version of Mozilla's Readability library.

readability

A Crystal port of readability-rust (commit 72dd86cf), which is itself a port of Mozilla's Readability.js.

Extracts the main article content from web pages, removing navigation, ads, and clutter.

Installation

Add the dependency to your shard.yml:

dependencies:
  readability:
    github: dsisnero/readability

Then run:

shards install

Library Usage

Basic Article Extraction

require "readability"

html = <<-HTML
  <!DOCTYPE html>
  <html>
  <head>
    <title>Sample Article</title>
    <meta name="author" content="John Doe">
  </head>
  <body>
    <article>
      <h1>Article Title</h1>
      <p>This is the main content of the article.</p>
      <p>More substantial content here.</p>
    </article>
    <aside>Sidebar content to be removed</aside>
  </body>
  </html>
HTML

parser = Readability::Parser.new(html)
if article = parser.parse
  puts "Title: #{article.title}"
  puts "By: #{article.byline}"
  puts "Length: #{article.length} chars"
  puts article.text_content
end

Custom Configuration

options = Readability::ReadabilityOptions.new(
  debug: true,
  char_threshold: 250,
  keep_classes: true
)
parser = Readability::Parser.new(html, options)
article = parser.parse

Multi-Strategy Parsing

parser = Readability::Parser.new(html)
result = parser.parse_with_retry
puts "Quality: #{result.quality_score}"
puts "Strategy: #{result.strategy_used}"     # Normal, Lenient, or Strict
puts "Attempts: #{result.retry_count}"
puts "Paragraphs: #{result.metrics.paragraph_count}"

Readability Assessment

if Readability.is_probably_readerable(html)
  puts "This page has readable content"
end

Base URI for Relative URLs

parser = Readability::Parser.new(html, base_uri: "https://example.com")
article = parser.parse
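
To make concrete what resolving against a base URI means, here is a small illustration using Crystal's standard-library `URI#resolve`. It is only a sketch of the general idea and is not taken from this shard's internals; the base URL and image paths are made-up examples.

```crystal
require "uri"

# Illustration only: how relative references resolve against a base URI.
# The base URL and the paths below are invented for this example.
base = URI.parse("https://example.com/articles/2024/")

# A path-relative reference is appended to the base directory.
puts base.resolve("images/photo.png")

# A root-relative reference replaces the whole path of the base.
puts base.resolve("/static/logo.svg")
```

Note that the trailing slash on the base matters: without it, the last path segment is treated as a file name and is dropped during resolution.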

CLI Usage

The library includes a command-line tool.

# Build
shards build readability_cr

# Process a file
./bin/readability_cr -i article.html

# From stdin
curl -s https://example.com/article | ./bin/readability_cr

# Output as plain text
./bin/readability_cr -i article.html -f text

# Output as JSON (default)
./bin/readability_cr -i article.html -f json > article.json

# Output as cleaned HTML
./bin/readability_cr -i article.html -f html

# Check if page is readable (exit 0 = yes, 1 = no)
./bin/readability_cr -i page.html -c

# Set base URI for resolving relative image URLs
./bin/readability_cr -i article.html -b https://example.com

CLI Options

Usage: readability_cr [options]

    -i, --input=FILE             Input HTML file (default: stdin)
    -o, --output=FILE            Output file (default: stdout)
    -f, --format=FORMAT          Output format: json, text, html
    -b, --base-uri=URI           Base URI for resolving relative URLs
    -d, --debug                  Enable debug output
    -c, --check                  Check if readable (exit 0=yes, 1=no)
        --char-threshold=CHARS   Minimum character threshold (default: 500)
        --keep-classes           Keep CSS classes in output
        --disable-json-ld        Disable JSON-LD parsing
    -h, --help                   Show help
    -v, --version                Show version
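
Because `--check` reports its answer through the exit status, the tool can also be driven from another Crystal program. The following is a minimal sketch, assuming the binary was built at `./bin/readability_cr` as shown above; the helper name `page_readable?` is hypothetical and not part of the shard.

```crystal
# Hypothetical helper: returns true when the check command exits with status 0.
# `bin` defaults to the build path used above; adjust it if yours differs.
def page_readable?(path : String, bin : String = "./bin/readability_cr") : Bool
  # -i selects the input file, -c switches to check mode (exit 0 = readable).
  status = Process.run(bin, ["-i", path, "-c"])
  status.success?
end

# Only call the helper when the binary has actually been built.
if File.exists?("./bin/readability_cr")
  if page_readable?("page.html")
    puts "readable"
  else
    puts "not readable"
  end
end
```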

Core Types

Readability::Article

Field           Type     Description
title           String?  Article title
content         String?  Cleaned HTML content
text_content    String?  Plain text content
length          Int32?   Content length in characters
excerpt         String?  Article excerpt
byline          String?  Author name
dir             String?  Text direction
site_name       String?  Site name
lang            String?  Content language
published_time  String?  Publication date
readerable      Bool?    Whether content was successfully extracted

Readability::ReadabilityOptions

Field                  Type              Default  Description
debug                  Bool              false    Enable debug output
max_elems_to_parse     Int32             0        Max elements to parse (0=unlimited)
nb_top_candidates      Int32             5        Top candidates to consider
char_threshold         Int32             25       Minimum content length
classes_to_preserve    Array(String)     []       CSS classes to keep
keep_classes           Bool              false    Preserve CSS classes in output
disable_json_ld        Bool              false    Skip JSON-LD parsing
link_density_modifier  Float64           1.0      Link density scaling
flags                  ReadabilityFlags  default  Feature flags

Readability::ParseResult

Field          Type           Description
article        Article?       Extracted article
quality_score  Float64        0.0–1.0 quality rating
strategy_used  ParseStrategy  Which strategy was best
retry_count    Int32          Number of attempts
metrics        ParseMetrics   Detailed extraction metrics

Module Structure

src/
  readability.cr             # Core types + Parser class
  readability/
    regexps.cr               # 21 regex patterns for content detection
    utils.cr                 # URL, DOM, text helpers + analyzers
    browser_compatibility.cr # Browser-specific HTML preprocessing
    non_standard_html.cr     # Malformed HTML cleanup
    html_rewriter.cr         # HTML rewriting + article content cleaning
    performance_cache.cr     # Regex/score caching
    cli.cr                   # CLI formatter (JSON/text/HTML output)
  cli.cr                     # CLI entry point

Algorithm

Follows Mozilla's Readability.js algorithm:

  1. Preprocessing — Remove scripts, styles, noscript; convert font tags; unwrap noscript images
  2. Content Discovery — Find <p>, <td>, <pre> elements with substantial text
  3. Scoring — Score ancestors based on text length, commas, images, class/id weights
  4. Link Density — Penalize high link density (navigation, link farms)
  5. Candidate Selection — Pick best candidate, check parent for better aggregation
  6. Content Cleaning — Remove nav/header/footer/sidebar, strip presentational attrs
  7. Output — Return cleaned HTML, plain text, and metadata
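
Steps 3 and 4 can be sketched in a few lines of plain Crystal. This is a simplified illustration, not this shard's implementation: the constants (a base score of 1, one point per comma, up to 3 points for length) are assumptions loosely modelled on Readability.js, and the function names are hypothetical.

```crystal
# Simplified sketch of scoring (step 3) and link density (step 4).
# All constants and names here are illustrative assumptions.

# Base score for one paragraph's text.
def paragraph_score(text : String) : Float64
  score = 1.0                            # every qualifying paragraph starts at 1
  score += text.count(',')               # one point per comma
  score += Math.min(text.size // 100, 3) # up to 3 points for length
  score
end

# Fraction of the text that sits inside links (0.0 = none, 1.0 = all).
def link_density(text_length : Int32, link_text_length : Int32) : Float64
  return 0.0 if text_length == 0
  link_text_length / text_length
end

# Scale a score down in proportion to how link-heavy the node is.
def adjusted_score(score : Float64, density : Float64) : Float64
  score * (1.0 - density)
end

text = "Crystal is compiled, statically typed, and fast."
score = paragraph_score(text)
puts adjusted_score(score, link_density(text.size, 12))
```

A navigation block whose text is almost entirely links ends up with a density close to 1.0, so its adjusted score falls towards zero regardless of how much text it contains.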

Quality Gates

crystal tool format --check src spec
./bin/ameba src spec
crystal spec

License

MIT. Upstream readability-rust is Apache 2.0. Mozilla Readability.js is Apache 2.0.
