sanitize
sanitize is a Crystal library for transforming HTML/XML trees. It's primarily used to sanitize HTML from untrusted sources in order to prevent XSS attacks and other security problems.
It builds on the standard library's XML module to parse HTML/XML. Because that module is based on libxml2, it is a solid parser that turns malformed and malicious input into valid and safe markup.
- Code: https://github.com/straight-shoota/sanitize
- API docs: https://straight-shoota.github.io/sanitize/api/latest/
- Issue tracker: https://github.com/straight-shoota/sanitize/issues
- Shardbox: https://shardbox.org/shards/sanitize
Installation
- Add the dependency to your shard.yml:

      dependencies:
        sanitize:
          github: straight-shoota/sanitize

- Run shards install
Sanitization Features
The Sanitize::Policy::HTMLSanitizer policy applies the following sanitization steps. Except for the first one (which is essential to the entire process), all can be disabled or configured.
- Turns malformed and malicious HTML into valid and safe markup.
- Strips HTML elements and attributes not included in the safe list.
- Sanitizes URL attributes (like href or src) with a customizable sanitization policy.
- Adds rel="nofollow" to all links and rel="noopener" to links with target.
- Validates values of accepted attributes align, width and height.
- Filters class attributes based on a whitelist (by default all classes are rejected).
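
As a rough illustration of these steps, the sketch below runs a few hand-written inputs through the Sanitize::Policy::HTMLSanitizer.common policy and prints the results. The sample strings and labels are made up for this example. No expected output is shown, because the exact result depends on the policy's configured safe list.

```crystal
require "sanitize"

sanitizer = Sanitize::Policy::HTMLSanitizer.common

# Hypothetical inputs, roughly one per sanitization step listed above.
samples = {
  "malformed markup" => %(<p>unclosed <b>bold),
  "unsafe element"   => %(<script>alert(1)</script><p>text</p>),
  "URL attribute"    => %(<a href="javascript:alert(1)">link</a>),
  "link with target" => %(<a href="https://example.com" target="_blank">link</a>),
  "class attribute"  => %(<p class="intro">text</p>),
}

# Print each input's sanitized form so the effect of every step can be inspected.
samples.each do |label, html|
  puts "#{label}: #{sanitizer.process(html)}"
end
```

Comparing each printed line with its input shows which elements and attributes the policy kept, rewrote or dropped.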
Usage
Transformation is based on rules defined by Sanitize::Policy implementations.
The recommended standard policy for HTML sanitization is Sanitize::Policy::HTMLSanitizer.common, which provides good defaults for most use cases. It sanitizes user input against a known safe list of accepted elements and their attributes.
require "sanitize"
sanitizer = Sanitize::Policy::HTMLSanitizer.common
sanitizer.process(%(<a href="javascript:alert('foo')">foo</a>)) # => %(foo)
sanitizer.process(%(<p><a href="foo">foo</a></p>)) # => %(<p><a href="foo" rel="nofollow">foo</a></p>)
sanitizer.process(%(<img src="foo.jpg">)) # => %(<img src="foo.jpg">)
sanitizer.process(%(<table><tr><td>foo</td><td>bar</td></tr></table>)) # => %(<table><tr><td>foo</td><td>bar</td></tr></table>)
Sanitization should always run after any other processing (for example, rendering Markdown) and is a must when including HTML from untrusted sources in a web page.
With Markd
A typical format for user-generated content is Markdown. Even though it has only a very limited feature set compared to HTML, it can still produce potentially harmful HTML, and it is usually possible to embed raw HTML directly. So sanitization is necessary.
The most common Markdown renderer is markd, so here is an example of how to use it with sanitize:
require "markd"
require "sanitize"

sanitizer = Sanitize::Policy::HTMLSanitizer.common
# Allow classes with `language-` prefix which are used for syntax highlighting.
sanitizer.valid_classes << /language-.+/
markdown = <<-MD
Sanitization with [https://shardbox.org/shards/sanitize](sanitize) is not that
**difficult**.
```cr
puts "Hello World!"
```
<p><a href="javascript:alert("XSS attack!")">Hello world!</a></p>
MD
html = Markd.to_html(markdown)
sanitized = sanitizer.process(html)
puts sanitized
The result:
<p>Sanitization with <a href="sanitize" rel="nofollow">https://shardbox.org/shards/sanitize</a> is not that
<strong>difficult</strong>.</p>
<pre><code class="language-cr">puts "Hello World!"
</code></pre>
<p>Hello world!</p>
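
The render-then-sanitize order shown above can be wrapped in a small helper so that callers cannot skip the sanitization step. This is only a sketch: render_untrusted_markdown is a hypothetical name, not part of sanitize or markd, and it uses the default common policy unless another sanitizer is passed in.

```crystal
require "markd"
require "sanitize"

# Hypothetical helper (not part of either library): renders Markdown to HTML
# first, then sanitizes the rendered HTML as the final step.
def render_untrusted_markdown(source : String, sanitizer = Sanitize::Policy::HTMLSanitizer.common)
  sanitizer.process(Markd.to_html(source))
end

puts render_untrusted_markdown("Some **bold** text")
```

Passing a preconfigured sanitizer as the second argument (for example one with additional valid_classes) keeps the policy setup in one place.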
Limitations
Sanitizing CSS is not supported. Thus style attributes can't be accepted in a safe way. CSS sanitization features may be added when a CSS parsing library is available.
Security
If you want to privately disclose security issues, please contact straightshoota on Keybase or straightshoota@gmail.com (PGP: DF2D C9E9 FFB9 6AE0 2070 D5BC F0F3 4963 7AC5 087A).
Contributing
- Fork it (https://github.com/straight-shoota/sanitize/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
Contributors
- Johannes Müller - creator and maintainer
Apache License 2.0