readability

A Crystal port of readability-rust (commit 72dd86cf), which is itself a port of Mozilla's Readability.js.

Extracts the main article content from web pages, removing navigation, ads, and clutter.

Installation

Add the dependency to your shard.yml:

dependencies:
  readability:
    github: dsisnero/readability

Then run:

shards install

Library Usage

Basic Article Extraction

require "readability"

html = <<-HTML
  <!DOCTYPE html>
  <html>
  <head>
    <title>Sample Article</title>
    <meta name="author" content="John Doe">
  </head>
  <body>
    <article>
      <h1>Article Title</h1>
      <p>This is the main content of the article.</p>
      <p>More substantial content here.</p>
    </article>
    <aside>Sidebar content to be removed</aside>
  </body>
  </html>
HTML

parser = Readability::Parser.new(html)
if article = parser.parse
  puts "Title: #{article.title}"
  puts "By: #{article.byline}"
  puts "Length: #{article.length} chars"
  puts article.text_content
end

Custom Configuration

options = Readability::ReadabilityOptions.new(
  debug: true,
  char_threshold: 250,
  keep_classes: true
)
parser = Readability::Parser.new(html, options)
article = parser.parse

Multi-Strategy Parsing

parser = Readability::Parser.new(html)
result = parser.parse_with_retry
puts "Quality: #{result.quality_score}"
puts "Strategy: #{result.strategy_used}"     # Normal, Lenient, or Strict
puts "Attempts: #{result.retry_count}"
puts "Paragraphs: #{result.metrics.paragraph_count}"

Readability Assessment

if Readability.is_probably_readerable(html)
  puts "This page has readable content"
end

Base URI for Relative URLs

parser = Readability::Parser.new(html, base_uri: "https://example.com")
article = parser.parse
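
To illustrate what a base URI is for, the sketch below resolves a few references against a page URL using Crystal's standard `URI#resolve`. It is only an approximation of the idea: `absolutize` is a hypothetical helper, not part of this shard, and the shard's own resolution logic may differ in detail.

```crystal
require "uri"

# Hypothetical helper (not part of the shard): resolve a possibly relative
# reference against the page's base URI with the standard library.
def absolutize(base : String, reference : String) : String
  URI.parse(base).resolve(reference).to_s
end

base = "https://example.com/posts/article.html"

puts absolutize(base, "images/photo.png")          # relative to the page's directory
puts absolutize(base, "/static/logo.png")          # relative to the site root
puts absolutize(base, "https://cdn.example.org/x") # already absolute, returned as-is
```

Without a base URI, a relative `src` such as `images/photo.png` cannot be turned into a fetchable address once the article is detached from its original page.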

CLI Usage

The shard also ships a command-line tool, `readability_cr`.

# Build
shards build readability_cr

# Process a file
./bin/readability_cr -i article.html

# From stdin
curl -s https://example.com/article | ./bin/readability_cr

# Output as plain text
./bin/readability_cr -i article.html -f text

# Output as JSON (default)
./bin/readability_cr -i article.html -f json > article.json

# Output as cleaned HTML
./bin/readability_cr -i article.html -f html

# Check if page is readable (exit 0 = yes, 1 = no)
./bin/readability_cr -i page.html -c

# Set base URI for resolving relative image URLs
./bin/readability_cr -i article.html -b https://example.com

CLI Options

Usage: readability_cr [options]

    -i, --input=FILE             Input HTML file (default: stdin)
    -o, --output=FILE            Output file (default: stdout)
    -f, --format=FORMAT          Output format: json, text, html
    -b, --base-uri=URI           Base URI for resolving relative URLs
    -d, --debug                  Enable debug output
    -c, --check                  Check if readable (exit 0=yes, 1=no)
        --char-threshold=CHARS   Minimum character threshold (default: 500)
        --keep-classes           Keep CSS classes in output
        --disable-json-ld        Disable JSON-LD parsing
    -h, --help                   Show help
    -v, --version                Show version
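
Because `-c` reports its answer through the exit status, the check is easy to call from another Crystal program. The following is a minimal sketch, assuming the binary has been built at `./bin/readability_cr`; `readable_file?` is a hypothetical wrapper and not something the shard provides.

```crystal
# Hypothetical wrapper (not part of the shard): runs the CLI in check mode
# and maps its exit status to a Bool. Exit 0 means readable; any non-zero
# status (including 1 = not readable) is reported as false.
# Process.run raises if the executable cannot be found.
def readable_file?(path : String, bin : String = "./bin/readability_cr") : Bool
  status = Process.run(bin, ["-i", path, "-c"])
  status.success?
end

bin = "./bin/readability_cr"
if File.exists?(bin)
  verdict = readable_file?("page.html", bin)
  puts verdict ? "readable" : "not readable"
end
```

Within a Crystal program that already depends on the shard, calling `Readability.is_probably_readerable` directly avoids spawning a process; the wrapper is mainly useful for scripts that only have the binary.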

Core Types

`Readability::Article`

Field	Type	Description
`title`	`String?`	Article title
`content`	`String?`	Cleaned HTML content
`text_content`	`String?`	Plain text content
`length`	`Int32?`	Content length in characters
`excerpt`	`String?`	Article excerpt
`byline`	`String?`	Author name
`dir`	`String?`	Text direction
`site_name`	`String?`	Site name
`lang`	`String?`	Content language
`published_time`	`String?`	Publication date
`readerable`	`Bool?`	Whether content was successfully extracted

`Readability::ReadabilityOptions`

Field	Type	Default	Description
`debug`	`Bool`	`false`	Enable debug output
`max_elems_to_parse`	`Int32`	`0`	Max elements to parse (0=unlimited)
`nb_top_candidates`	`Int32`	`5`	Top candidates to consider
`char_threshold`	`Int32`	`25`	Minimum content length
`classes_to_preserve`	`Array(String)`	`[]`	CSS classes to keep
`keep_classes`	`Bool`	`false`	Preserve CSS classes in output
`disable_json_ld`	`Bool`	`false`	Skip JSON-LD parsing
`link_density_modifier`	`Float64`	`1.0`	Link density scaling
`flags`	`ReadabilityFlags`	default	Feature flags

`Readability::ParseResult`

Field	Type	Description
`article`	`Article?`	Extracted article
`quality_score`	`Float64`	0.0–1.0 quality rating
`strategy_used`	`ParseStrategy`	Which strategy was best
`retry_count`	`Int32`	Number of attempts
`metrics`	`ParseMetrics`	Detailed extraction metrics
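
Since `article` is nilable and `quality_score` is a plain float, callers usually want to check both before using the text. The sketch below uses only the names documented above; `extract_text` and the 0.5 cutoff are assumptions chosen for illustration, not part of the shard or a recommended value.

```crystal
require "readability"

# Hypothetical helper (not part of the shard): returns the plain text only
# when an article was extracted and its quality meets the caller's cutoff.
# The default cutoff of 0.5 is an arbitrary illustrative value.
def extract_text(html : String, min_quality : Float64 = 0.5) : String?
  result = Readability::Parser.new(html).parse_with_retry
  article = result.article
  return nil unless article                        # nothing was extracted
  return nil if result.quality_score < min_quality # extracted, but too weak
  article.text_content
end

html = "<html><body><article><p>Some article text.</p></article></body></html>"
if text = extract_text(html)
  puts text
else
  puts "No usable article found"
end
```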

Module Structure

src/
  readability.cr             # Core types + Parser class
  readability/
    regexps.cr               # 21 regex patterns for content detection
    utils.cr                 # URL, DOM, text helpers + analyzers
    browser_compatibility.cr # Browser-specific HTML preprocessing
    non_standard_html.cr     # Malformed HTML cleanup
    html_rewriter.cr         # HTML rewriting + article content cleaning
    performance_cache.cr     # Regex/score caching
    cli.cr                   # CLI formatter (JSON/text/HTML output)
  cli.cr                     # CLI entry point

Algorithm

Follows Mozilla's Readability.js algorithm:

1. Preprocessing: remove scripts, styles, and noscript; convert font tags; unwrap noscript images
2. Content Discovery: find <p>, <td>, and <pre> elements with substantial text
3. Scoring: score ancestors based on text length, commas, images, and class/id weights
4. Link Density: penalize high link density (navigation, link farms)
5. Candidate Selection: pick the best candidate, then check its parent for better aggregation
6. Content Cleaning: remove nav/header/footer/sidebar and strip presentational attributes
7. Output: return cleaned HTML, plain text, and metadata
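
The scoring and link-density stages can be sketched with Crystal's standard `xml` module. This is a simplified illustration modelled on the heuristics Readability.js is generally described as using (a base point, one point per comma-separated segment, up to three points for length, then scaling by link density). It omits class/id weights and ancestor propagation, all function names are hypothetical, and this shard's internals may differ.

```crystal
require "xml"

# Paragraphs shorter than this are ignored (the cutoff Readability.js uses
# when scoring; assumed here, not taken from this shard).
MIN_PARAGRAPH_CHARS = 25

# Score one block of text: 1 base point, plus one point per
# comma-separated segment, plus up to 3 points (one per 100 characters).
def content_score(text : String) : Float64
  text = text.strip
  return 0.0 if text.size < MIN_PARAGRAPH_CHARS
  score = 1.0
  score += text.split(',').size
  score += Math.min(text.size // 100, 3)
  score
end

# Fraction of a node's text that sits inside <a> elements.
def link_density(node : XML::Node) : Float64
  total = node.content.strip.size
  return 0.0 if total == 0
  linked = node.xpath_nodes(".//a").sum { |a| a.content.strip.size }
  linked / total
end

# Candidate score: summed paragraph scores, scaled down by link density.
def candidate_score(node : XML::Node) : Float64
  raw = node.xpath_nodes(".//p").sum { |para| content_score(para.content) }
  raw * (1.0 - link_density(node))
end

doc = XML.parse_html(<<-HTML)
  <html><body>
    <div id="article">
      <p>Readable prose tends to be long, punctuated with commas, and light on links.</p>
      <p>A second paragraph adds more text, more commas, and therefore more score.</p>
    </div>
    <div id="nav">
      <p><a href="/a">Home page link</a> <a href="/b">Another navigation link</a></p>
    </div>
  </body></html>
  HTML

article = doc.xpath_node("//div[@id='article']").not_nil!
nav = doc.xpath_node("//div[@id='nav']").not_nil!

# The prose block should score well above the link-heavy block.
puts "article: #{candidate_score(article)}"
puts "nav:     #{candidate_score(nav)}"
```

The link-density multiplier is what keeps navigation blocks from winning: even when their text is long enough to earn points, almost all of it sits inside links, so the final score collapses toward zero.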

Quality Gates

crystal tool format --check src spec
./bin/ameba src spec
crystal spec

Documentation

License

MIT. Upstream readability-rust is Apache 2.0. Mozilla Readability.js is Apache 2.0.
