readability.cr
readability
A Crystal port of readability-rust (commit 72dd86cf), which is itself a port of Mozilla's Readability.js.
Extracts the main article content from web pages, removing navigation, ads, and clutter.
Installation
Add the dependency to your shard.yml:

dependencies:
  readability:
    github: dsisnero/readability

Then run:

shards install
Library Usage
Basic Article Extraction
require "readability"
html = <<-HTML
<!DOCTYPE html>
<html>
<head>
<title>Sample Article</title>
<meta name="author" content="John Doe">
</head>
<body>
<article>
<h1>Article Title</h1>
<p>This is the main content of the article.</p>
<p>More substantial content here.</p>
</article>
<aside>Sidebar content to be removed</aside>
</body>
</html>
HTML
parser = Readability::Parser.new(html)
if article = parser.parse
  puts "Title: #{article.title}"
  puts "By: #{article.byline}"
  puts "Length: #{article.length} chars"
  puts article.text_content
end
Custom Configuration
options = Readability::ReadabilityOptions.new(
  debug: true,
  char_threshold: 250,
  keep_classes: true
)
parser = Readability::Parser.new(html, options)
article = parser.parse
Multi-Strategy Parsing
parser = Readability::Parser.new(html)
result = parser.parse_with_retry
puts "Quality: #{result.quality_score}"
puts "Strategy: #{result.strategy_used}" # Normal, Lenient, or Strict
puts "Attempts: #{result.retry_count}"
puts "Paragraphs: #{result.metrics.paragraph_count}"
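One way to picture a multi-strategy parse is a loop that runs each strategy, rates the result, and keeps the best one. The sketch below illustrates that general idea only and is not this shard's implementation: `Strategy`, `Attempt`, and `pick_best` are hypothetical names, and the quality numbers are made up.

```crystal
# Hypothetical stand-ins; the real types live in the readability shard.
enum Strategy
  Normal
  Lenient
  Strict
end

record Attempt, strategy : Strategy, quality : Float64

# Keep the attempt with the highest quality score, or nil if none ran.
def pick_best(attempts : Array(Attempt)) : Attempt?
  attempts.max_by?(&.quality)
end

# Made-up scores, one per strategy.
attempts = [
  Attempt.new(Strategy::Normal, 0.42),
  Attempt.new(Strategy::Lenient, 0.71),
  Attempt.new(Strategy::Strict, 0.35),
]

if best = pick_best(attempts)
  puts "Strategy: #{best.strategy}, quality: #{best.quality}, attempts: #{attempts.size}"
end
```

Returning a nilable `Attempt?` mirrors the nilable `article` field on `ParseResult`: callers have to handle the case where no strategy produced anything.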
Readability Assessment
if Readability.is_probably_readerable(html)
  puts "This page has readable content"
end
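In Mozilla's Readability.js this check is a cheap heuristic, not a full parse: each candidate text block of at least 140 characters adds the square root of its excess length to a running score, and the page counts as readerable once the score passes 20. The sketch below reproduces that arithmetic on a plain list of text lengths. Whether this port uses the same thresholds is an assumption, and `probably_readerable?` is a hypothetical name.

```crystal
# Sketch of the Readability.js readerable heuristic on raw text lengths.
# The defaults (140 and 20.0) are taken from Readability.js, not from this shard.
def probably_readerable?(lengths : Array(Int32),
                         min_content_length : Int32 = 140,
                         min_score : Float64 = 20.0) : Bool
  score = 0.0
  lengths.each do |len|
    # Short blocks contribute nothing.
    next if len < min_content_length
    score += Math.sqrt((len - min_content_length).to_f)
    return true if score > min_score
  end
  false
end

puts probably_readerable?([600])      # one long paragraph
puts probably_readerable?([150, 150]) # two short paragraphs
```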
Base URI for Relative URLs
parser = Readability::Parser.new(html, base_uri: "https://example.com")
article = parser.parse
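To see what resolving against a base URI means in practice, the sketch below uses `URI#resolve` from Crystal's standard library. It shows the general technique only; that the shard resolves links this way internally is an assumption, and the paths are invented for illustration.

```crystal
require "uri"

# Example base; the path component matters for relative references.
base = URI.parse("https://example.com/blog/post.html")

# A path-relative reference is resolved against the base's directory.
relative = base.resolve("images/fig.png")

# A root-relative reference replaces the whole path.
rooted = base.resolve("/img/logo.png")

# An absolute URL is returned unchanged.
absolute = base.resolve("https://cdn.example.org/a.png")

puts relative
puts rooted
puts absolute
```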
CLI Usage
The library includes a command-line tool.
# Build
shards build readability_cr
# Process a file
./bin/readability_cr -i article.html
# From stdin
curl -s https://example.com/article | ./bin/readability_cr
# Output as plain text
./bin/readability_cr -i article.html -f text
# Output as JSON (default)
./bin/readability_cr -i article.html -f json > article.json
# Output as cleaned HTML
./bin/readability_cr -i article.html -f html
# Check if page is readable (exit 0 = yes, 1 = no)
./bin/readability_cr -i page.html -c
# Set base URI for resolving relative image URLs
./bin/readability_cr -i article.html -b https://example.com
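Because `-c` reports its answer through the exit status, the check is easy to script. The following is a minimal sketch of calling it from Crystal with `Process.run`; `readable?` is a hypothetical helper, and the default binary path assumes the build step shown above.

```crystal
# Returns true when the checker exits with status 0.
# `bin` defaults to the path produced by `shards build readability_cr`.
def readable?(path : String, bin : String = "./bin/readability_cr") : Bool
  Process.run(bin, ["-i", path, "-c"]).success?
end

# Check every file named on the command line.
ARGV.each do |path|
  verdict = readable?(path) ? "readable" : "not readable"
  puts "#{path}: #{verdict}"
end
```

Taking the binary path as a parameter keeps the helper easy to test with a stand-in command.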
CLI Options
Usage: readability_cr [options]
    -i, --input=FILE             Input HTML file (default: stdin)
    -o, --output=FILE            Output file (default: stdout)
    -f, --format=FORMAT          Output format: json, text, html
    -b, --base-uri=URI           Base URI for resolving relative URLs
    -d, --debug                  Enable debug output
    -c, --check                  Check if readable (exit 0=yes, 1=no)
    --char-threshold=CHARS       Minimum character threshold (default: 500)
    --keep-classes               Keep CSS classes in output
    --disable-json-ld            Disable JSON-LD parsing
    -h, --help                   Show help
    -v, --version                Show version
Core Types
Readability::Article
| Field | Type | Description |
|---|---|---|
| `title` | `String?` | Article title |
| `content` | `String?` | Cleaned HTML content |
| `text_content` | `String?` | Plain text content |
| `length` | `Int32?` | Content length in characters |
| `excerpt` | `String?` | Article excerpt |
| `byline` | `String?` | Author name |
| `dir` | `String?` | Text direction |
| `site_name` | `String?` | Site name |
| `lang` | `String?` | Content language |
| `published_time` | `String?` | Publication date |
| `readerable` | `Bool?` | Whether content was successfully extracted |
Readability::ReadabilityOptions
| Field | Type | Default | Description |
|---|---|---|---|
| `debug` | `Bool` | `false` | Enable debug output |
| `max_elems_to_parse` | `Int32` | `0` | Max elements to parse (0=unlimited) |
| `nb_top_candidates` | `Int32` | `5` | Top candidates to consider |
| `char_threshold` | `Int32` | `25` | Minimum content length |
| `classes_to_preserve` | `Array(String)` | `[]` | CSS classes to keep |
| `keep_classes` | `Bool` | `false` | Preserve CSS classes in output |
| `disable_json_ld` | `Bool` | `false` | Skip JSON-LD parsing |
| `link_density_modifier` | `Float64` | `1.0` | Link density scaling |
| `flags` | `ReadabilityFlags` | default | Feature flags |
Readability::ParseResult
| Field | Type | Description |
|---|---|---|
| `article` | `Article?` | Extracted article |
| `quality_score` | `Float64` | 0.0–1.0 quality rating |
| `strategy_used` | `ParseStrategy` | Which strategy was best |
| `retry_count` | `Int32` | Number of attempts |
| `metrics` | `ParseMetrics` | Detailed extraction metrics |
Module Structure
src/
  readability.cr                 # Core types + Parser class
  readability/
    regexps.cr                   # 21 regex patterns for content detection
    utils.cr                     # URL, DOM, text helpers + analyzers
    browser_compatibility.cr     # Browser-specific HTML preprocessing
    non_standard_html.cr         # Malformed HTML cleanup
    html_rewriter.cr             # HTML rewriting + article content cleaning
    performance_cache.cr         # Regex/score caching
    cli.cr                       # CLI formatter (JSON/text/HTML output)
  cli.cr                         # CLI entry point
Algorithm
Follows Mozilla's Readability.js algorithm:
- Preprocessing: Remove scripts, styles, noscript; convert font tags; unwrap noscript images
- Content Discovery: Find <p>, <td>, <pre> elements with substantial text
- Scoring: Score ancestors based on text length, commas, images, class/id weights
- Link Density: Penalize high link density (navigation, link farms)
- Candidate Selection: Pick best candidate, check parent for better aggregation
- Content Cleaning: Remove nav/header/footer/sidebar, strip presentational attrs
- Output: Return cleaned HTML, plain text, and metadata
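The scoring and link-density steps can be sketched in a few lines. The formulas below follow Mozilla's Readability.js: a base point, one point per comma-separated segment, and up to three points for length, with the total scaled down by link density. They are an illustration and not this shard's actual code. All three method names are hypothetical.

```crystal
# Readability.js-style score for one paragraph of text.
def paragraph_score(text : String) : Float64
  score = 1.0                            # base point for the paragraph
  score += text.split(',').size          # comma-separated segments
  score += Math.min(text.size // 100, 3) # one point per 100 chars, max 3
  score
end

# Share of a node's text that sits inside links (0.0 to 1.0).
def link_density(text_length : Int32, link_text_length : Int32) : Float64
  return 0.0 if text_length == 0
  link_text_length / text_length
end

# Link-heavy candidates (navigation, link farms) lose most of their score.
def adjusted_score(score : Float64, density : Float64) : Float64
  score * (1.0 - density)
end

text = "Crystal is fast, type-safe, and pleasant to write."
score = paragraph_score(text)
puts score
puts adjusted_score(score, link_density(200, 50))
```

A block whose text is mostly links ends up with a score near zero, which is how menus and link lists drop out of candidate selection.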
Quality Gates
crystal tool format --check src spec
./bin/ameba src spec
crystal spec
License
MIT. Upstream readability-rust is Apache 2.0. Mozilla Readability.js is Apache 2.0.