crystal-html5 v0.7.0
Crystal-HTML5
The Crystal-HTML5 shard is a pure Crystal implementation of an HTML5-compliant tokenizer and parser with streaming support. The relevant specifications include:
- https://html.spec.whatwg.org/multipage/syntax.html
- https://html.spec.whatwg.org/multipage/syntax.html#tokenization
The shard also provides CSS selector support by implementing the W3C Selectors Level 3 specification, and streaming parsing via both token-level iteration and SAX-style event callbacks.
Tokenization is done by creating a Tokenizer for an IO. It is the caller's responsibility to ensure that the IO supplies UTF-8 encoded HTML. The tokenization algorithm implemented by this shard is not a line-by-line transliteration of the relatively verbose state machine in the WHATWG specification. It takes a more direct approach in which the program counter implies the state, such as whether it is tokenizing a tag or a text node. Specification compliance is verified by comparing expected and actual outputs over a test suite rather than by aiming for algorithmic fidelity.
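To make "the program counter implies the state" concrete, here is a toy sketch. It is not the shard's tokenizer: ToyKind, ToyToken and toy_tokenize are hypothetical names, and the code only handles bare tags and text (no attributes, comments or entities). The point is that there is no explicit state variable; which branch is executing is the state.

```crystal
# Toy illustration only; this is NOT how the shard is implemented.
enum ToyKind
  StartTag
  EndTag
  Text
end

record ToyToken, kind : ToyKind, data : String

def toy_tokenize(input : String) : Array(ToyToken)
  tokens = [] of ToyToken
  chars = input.chars
  i = 0
  while i < chars.size
    if chars[i] == '<'
      # Executing this branch *is* the "tag" state.
      i += 1 # skip '<'
      closing = i < chars.size && chars[i] == '/'
      i += 1 if closing
      start = i
      while i < chars.size && chars[i] != '>'
        i += 1
      end
      kind = closing ? ToyKind::EndTag : ToyKind::StartTag
      tokens << ToyToken.new(kind, chars[start...i].join)
      i += 1 # skip '>'
    else
      # Executing this branch *is* the "text" state.
      start = i
      while i < chars.size && chars[i] != '<'
        i += 1
      end
      tokens << ToyToken.new(ToyKind::Text, chars[start...i].join)
    end
  end
  tokens
end

toy_tokenize("<p>Hi</p>").each do |t|
  puts "#{t.kind}: #{t.data}"
end
# StartTag: p
# Text: Hi
# EndTag: p
```

A transliteration of the specification would instead keep a state field and dispatch on it for every input character.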
Parsing is done by calling HTML5.parse with either a String containing HTML or an IO instance. HTML5.parse returns the document root as an HTML5::Node instance.
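As a minimal sketch of the IO variant, the snippet below parses from an IO::Memory (standing in for a file or socket) and walks the returned root. The helper element_names is a hypothetical name introduced for illustration; it uses only the traversal calls that appear in the usage examples of this README (element?, data, first_child, next_sibling).

```crystal
require "html5"

# Hypothetical helper: collect element names in document order.
def element_names(node, names = [] of String)
  names << node.data if node.element?
  child = node.first_child
  while child
    element_names(child, names)
    child = child.next_sibling
  end
  names
end

# Any IO should work; IO::Memory is used here so the example is self-contained.
io = IO::Memory.new("<p>Hi</p>")
doc = HTML5.parse(io)

puts element_names(doc).join(" > ")
# html > head > body > p
```

The html, head and body elements appear even though the input omits them, because HTML5 tree construction inserts them implicitly.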
Parsing a fragment is done by calling HTML5.parse_fragment with either a String containing a fragment of HTML5 or an IO instance. If the fragment is the InnerHTML for an existing element, pass that element as the context. HTML5.parse_fragment returns a list of the HTML5::Node instances that were found.
Streaming is supported at two levels: HTML5.each_token / HTML5.token_iterator for lightweight token-level streaming in constant memory, and HTML5.stream for SAX-style event callbacks during full HTML5-compliant tree construction.
Installation
- Add the dependency to your shard.yml:

      dependencies:
        html5:
          github: naqvis/crystal-html5

- Run shards install
Usage
Example 1: Process each anchor <a> node.
require "html5"
html = <<-HTML5
<!DOCTYPE html><html lang="en-US">
<head>
<title>Hello,World!</title>
</head>
<body>
<div class="container">
<header>
<!-- Logo -->
<h1>City Gallery</h1>
</header>
<nav>
<ul>
<li><a href="/London">London</a></li>
<li><a href="/Paris">Paris</a></li>
<li><a href="/Tokyo">Tokyo</a></li>
</ul>
</nav>
<article>
<h1>London</h1>
<img src="pic_mountain.jpg" alt="Mountain View" style="width:304px;height:228px;">
<p>London is the capital city of England. It is the most populous city in the United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
<p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
</article>
<footer>Copyright © W3Schools.com</footer>
</div>
</body>
</html>
HTML5
def process(node)
  if node.element? && node.data == "a"
    # Do something with node
    href = node["href"]?
    puts "#{node.first_child.try &.data} => #{href.try &.val}"
    # print all attributes
    node.attr.each do |a|
      # puts "#{a.key} = \"#{a.val}\""
    end
  end
  # Recurse into every child of the current node
  c = node.first_child
  while c
    process(c)
    c = c.next_sibling
  end
end
doc = HTML5.parse(html)
process(doc)
# Output
# London => /London
# Paris => /Paris
# Tokyo => /Tokyo
Example 2: Parse an HTML document or a fragment of HTML
require "html5"
def parse_html(html, context)
  if context.empty?
    doc = HTML5.parse(html)
  else
    # A context of the form "namespace tag" carries an explicit namespace
    namespace = ""
    if i = context.index(' ')
      namespace, context = context[...i], context[i + 1..]
    end
    cnode = HTML5::Node.new(
      type: HTML5::NodeType::Element,
      data_atom: HTML5::Atom.lookup(context.to_slice),
      data: context,
      namespace: namespace,
    )
    nodes = HTML5.parse_fragment(html, cnode)
    # Wrap the fragment nodes in a document node so callers get a single root
    doc = HTML5::Node.new(type: HTML5::NodeType::Document)
    nodes.each do |n|
      doc.append_child(n)
    end
  end
  doc
end
html = %(<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>)
doc = parse_html(html, "body")
process(doc)
# Output
# Foo => foo
# BarBaz => /bar/baz
Example 3: Render HTML5::Node to HTML
require "html5"
html = %(<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>)
doc = HTML5.parse(html)
doc.render(STDOUT)
# Output
# <html><head></head><body><p>Links:</p><ul><li><a href="foo">Foo</a></li><li><a href="/bar/baz">BarBaz</a></li></ul></body></html>
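If the rendered markup is needed as a String rather than printed, one option is to render into a String::Builder. This is a sketch that assumes render accepts any IO and not only STDOUT; the README only shows it being called with STDOUT.

```crystal
require "html5"

html = %(<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>)
doc = HTML5.parse(html)

# Assumption: render works with any IO; String.build yields one.
rendered = String.build do |io|
  doc.render(io)
end

# The missing </li> end tags are present in the serialized tree.
puts rendered.includes?(%(<a href="foo">Foo</a></li>))
```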
Example 3: XPath Query
require "html5"
html = %(<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>)
doc = HTML5.parse(html)
# Find all `a` elements
list = doc.xpath_nodes("//a")
# Find all `a` elements that have an `href` attribute.
list = doc.xpath_nodes("//a[@href]")
# Find all `a` elements with an `href` attribute and return only the `href` values.
list = doc.xpath_nodes("//a/@href")
list.each { |a| pp a.inner_text }
# Find the second `a` element
a = doc.xpath("//a[2]")
# Count all `a` elements.
v = doc.xpath_float("count(//a)")
Refer to the specs for more sample usages, and to the Crystal XPath2 shard for details of the functions and features supported by the XPath implementation.
Example 4: CSS Selector
html = <<-HTML
<html>
<body>
<table id="t1">
<tr><td>Hello</td></tr>
</table>
<table id="t2">
<tr><td>123</td><td>other</td></tr>
<tr><td>foo</td><td>columns</td></tr>
<tr><td>bar</td><td>are</td></tr>
<tr><td>xyz</td><td>ignored</td></tr>
</table>
</body>
</html>
HTML
node = HTML5.parse(html)
p node.css("#t2 tr td:first-child").map(&.inner_text).to_a # => ["123", "foo", "bar", "xyz"]
p node.css("#t2 tr td:first-child").map(&.to_html(true)).to_a # => ["<td>123</td>", "<td>foo</td>", "<td>bar</td>", "<td>xyz</td>"]
html = <<-HTML
<p>
<h2 id="foo">a header</h2>
<h2 id="bar">another header</h2>
</p>
HTML
node = HTML5.parse(html)
p node.css("h2#foo").map(&.to_html(true)).to_a # => ["<h2 id=\"foo\">a header</h2>"]
html = <<-HTML
<div>
<p id=p1>
<p id=p2 class=jo>
<p id=p3>
<a href="some.html" id=a1>link1</a>
<a href="some.png" id=a2>link2</a>
<div id=bla>
<p id=p4 class=jo>
<p id=p5 class=bu>
<p id=p6 class=jo>
</div>
</div>
HTML
node = HTML5.parse(html)
# select all p nodes whose id contains `p`
p node.css("p[id*=p]").map(&.["id"].val).to_a # => ["p1", "p2", "p3", "p4", "p5", "p6"]
# select all nodes with class "jo"
p node.css("p.jo").map(&.["id"].val).to_a # => ["p2", "p4", "p6"]
p node.css(".jo").map(&.["id"].val).to_a # => ["p2", "p4", "p6"]
# select a elements whose href ends with .png
p node.css(%q{a[href$=".png"]}).map(&.["id"].val).to_a # => ["a2"]
# find all a tags directly inside <p id=p3> whose href contains `html`
p node.css(%q{p[id=p3] > a[href*="html"]}).map(&.["id"].val).to_a # => ["a1"]
# :scope pseudo-class - matches the element on which css() was called
element = node.css("#p3").first
p element.css(":scope").map(&.["id"].val).to_a # => ["p3"] (the element itself)
p element.css(":scope > a").map(&.["id"].val).to_a # => ["a1", "a2"] (direct children)
Refer to spec/css specs for more sample usages.
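The selector calls above also combine with attribute access to extract links. The following sketch reuses the markup from the earlier examples and relies only on calls already shown (css, inner_text, and the ["name"].val attribute lookup); the a[href] attribute-presence selector is part of Selectors Level 3 and is assumed to be supported on that basis.

```crystal
require "html5"

html = %(<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>)
doc = HTML5.parse(html)

# Every anchor that carries an href attribute
hrefs = doc.css("a[href]").map(&.["href"].val).to_a
p hrefs # => ["foo", "/bar/baz"]

# Link text of anchors that are direct children of list items
labels = doc.css("ul > li > a").map(&.inner_text).to_a
p labels # => ["Foo", "BarBaz"]
```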
Example 5: Streaming — Token-Level
Process HTML as a stream of tokens without building a parse tree. Runs in constant memory regardless of input size.
require "html5"
html = %(<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>)
# Block-based iteration
HTML5.each_token(html) do |token|
  case token.type
  when .start_tag?
    print "<#{token.data}>"
  when .end_tag?
    print "</#{token.data}>"
  when .text?
    print token.data
  end
end
puts
# Pull-based iterator
HTML5.token_iterator(html).each do |token|
  if token.type.start_tag? && token.data == "a"
    token.attr.each do |a|
      puts a.val if a.key == "href"
    end
  end
end
# Output of the pull-based iterator
# foo
# /bar/baz
Both each_token and token_iterator accept an IO or a String. The token stream reflects the raw markup — no implicit tags or tree corrections are applied.
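As a small sketch of the IO form, the snippet below counts start tags by name while reading from an IO::Memory, which stands in for any IO such as a file or socket. It uses only the token accessors shown above (type.start_tag? and data).

```crystal
require "html5"

html = %(<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>)

# Tally start tags without building a tree.
counts = Hash(String, Int32).new(0)
HTML5.each_token(IO::Memory.new(html)) do |token|
  counts[token.data] += 1 if token.type.start_tag?
end

p counts # => {"p" => 1, "ul" => 1, "li" => 2, "a" => 2}
```

Because the token stream reflects the raw markup, no html, head or body entries appear in the tally.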
Example 6: Streaming — SAX-Style Events
Get incremental callbacks as the HTML5 parser constructs the document tree. Implement the HTML5::StreamingHandler module and pass it to HTML5.stream.
require "html5"
class LinkExtractor
  include HTML5::StreamingHandler
  getter links = [] of String

  def on_element_open(tag : String, attrs : Array(HTML5::Attribute), namespace : String)
    if tag == "a"
      attrs.each do |attr|
        links << attr.val if attr.key == "href"
      end
    end
  end
end
html = %(<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>)
extractor = LinkExtractor.new
doc = HTML5.stream(html, extractor)
puts extractor.links # => ["foo", "/bar/baz"]
# doc is the complete parse tree, same as HTML5.parse would return
The handler receives events as they happen during parsing:
| Callback | When |
|---|---|
| on_element_open(tag, attrs, namespace) | An element is added to the tree |
| on_element_close(tag, namespace) | An element is closed (popped from the open elements stack) |
| on_text(text) | A text node is added |
| on_comment(text) | A comment node is added |
| on_doctype(data) | A doctype declaration is found |
| on_document_end(doc) | Parsing is complete; receives the full Node tree |
All callbacks have default no-op implementations — override only the ones you need. The parser still builds the full DOM tree internally (required by the HTML5 spec for correct handling of misnested markup), but your handler receives events incrementally.
HTML5.stream accepts an IO or a String and returns the complete document Node, just like HTML5.parse.
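A handler can override several callbacks at once. The sketch below records open, close and text events in arrival order. EventLog is a hypothetical class name, and the String type restrictions on on_element_close and on_text are assumptions modelled on the on_element_open signature shown earlier; the callback table does not list parameter types.

```crystal
require "html5"

# Hypothetical handler that logs events in the order they are reported.
class EventLog
  include HTML5::StreamingHandler
  getter events = [] of String

  def on_element_open(tag : String, attrs : Array(HTML5::Attribute), namespace : String)
    @events << "open #{tag}"
  end

  # Assumption: tag and namespace are Strings, as in on_element_open.
  def on_element_close(tag : String, namespace : String)
    @events << "close #{tag}"
  end

  # Assumption: text is delivered as a String.
  def on_text(text : String)
    @events << "text #{text.inspect}"
  end
end

log = EventLog.new
HTML5.stream(%(<p>Hi</p>), log)
log.events.each { |e| puts e }
```

Since the events come from tree construction rather than from the raw token stream, the log may also contain entries for the implied html, head and body elements.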
Development
To run all tests:
crystal spec
Contributing
- Fork it (https://github.com/naqvis/crystal-html5/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
Contributors
- Ali Naqvi - creator and maintainer
MIT License