da_html_escape.cr
DA_HTML_ESCAPE is a Crystal shard to escape/unescape HTML.
A much better shard: https://github.com/kostya/myhtml
More info: https://github.com/kostya/entities/pull/1#issuecomment-330849435
If you are still curious about this shard:
HTML entity decoding is done by the kostya/myhtml shard.
Encoding characters into HTML entities is done by taking the codepoints and converting them to hexadecimal HTML entities. I ported segments of HTMLEntities by Paul Battley for the encoding and most of the specs/tests.
This shard escapes all non-ASCII characters to hexadecimal only. No named or decimal entities are used when escaping. This shard is also useless for XML entities.
I use this because the HTML module in Crystal's standard library only escapes a few characters. I decided to play it extra safe and escape all non-ASCII characters.
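The hexadecimal encoding described above can be sketched with the standard library alone. This is a minimal illustration, not the shard's actual code: `hex_escape` and `UNSAFE_ASCII` are hypothetical names, and the choice of which ASCII characters count as unsafe (`&`, `<`, `>`, `"`, `'`) is an assumption.

```crystal
# Hypothetical helper, not part of DA_HTML_ESCAPE's API.
# Assumption: these ASCII characters are treated as unsafe.
UNSAFE_ASCII = {'&', '<', '>', '"', '\''}

def hex_escape(input : String) : String
  String.build do |io|
    input.each_char do |char|
      if !char.ascii? || UNSAFE_ASCII.includes?(char)
        # Write the codepoint as a lowercase hexadecimal character reference.
        io << "&#x" << char.ord.to_s(16) << ';'
      else
        # Safe ASCII characters pass through unchanged.
        io << char
      end
    end
  end
end

puts hex_escape("<\u00e9lan>") # => &#x3c;&#xe9;lan&#x3e;
```

Only hexadecimal references are produced, so the output never depends on a table of named entities.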
Also, HTML.unescape and XML.parse_html do not fully unescape certain characters. kostya/myhtml is the only one that properly unescapes the following into < brackets:
bracket = "
< < < < < < < <
< < < < < < <
< < < < < < <
< < < < < <
< < < < < < <
< < < < < <
< < < < < < < <
< < < < < <
< < < < < < < <
< < \x3c \x3C \u003c \u003C
"
Notes:
Further security info: OWASP: Cross Site Scripting Prevention
ASCII Table:
List of hexadecimal entities with counterpart codepoints:
- http://www.howtocreate.co.uk/sidehtmlentity.html
- https://dev.w3.org/html5/html-author/charref
- Multibyte chars: https://www.w3schools.com/charsets/ref_html_entities_v.asp
- Convert chars: https://r12a.github.io/apps/conversion/
Searchable list of entities: http://www.fileformat.info/info/unicode/char/0000/index.htm
Usage:
require "da_html_escape"
# Escape unsafe/non-ASCII codepoints using hexadecimal entities:
raw = "<élan>"
DA_HTML_ESCAPE.escape(raw) # => "&#x3c;&#xe9;lan&#x3e;"
# Unescaping:
escaped = "&eacute;lan"
DA_HTML_ESCAPE.unescape_once(escaped) # => "élan"
# :unescape! keeps looping until it can no
# longer unescape any more:
escaped = "&amp;amp;eacute;lan"
DA_HTML_ESCAPE.unescape!(escaped) # => "élan"
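The "keep looping" behaviour of :unescape! amounts to decoding until the string stops changing. The sketch below illustrates that idea only: it uses the standard library's `HTML.unescape` as a stand-in decoder (the shard itself decodes with kostya/myhtml), and both the name `unescape_until_stable` and the `max_passes` safety limit are assumptions, not part of the shard.

```crystal
require "html"

# Hypothetical helper: decode repeatedly until a pass changes nothing.
# Assumption: max_passes is an illustrative guard against pathological input.
def unescape_until_stable(input : String, max_passes : Int32 = 10) : String
  current = input
  max_passes.times do
    decoded = HTML.unescape(current)
    # A pass that changes nothing means there is nothing left to decode.
    return decoded if decoded == current
    current = decoded
  end
  current
end

puts unescape_until_stable("&amp;amp;lt;b&amp;gt;") # => <b>
```

Each pass removes one layer of escaping, so doubly and triply escaped input converges after a few passes.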
Licence
This code is free to use under the terms of the MIT licence. See the file LICENSE for more details.
Useless Benchmarks:
NOTE: Encoding is twice as slow as using HTML.escape
from the Crystal standard library. The main reason is that DA_HTML_ESCAPE
replaces control characters with spaces and non-ASCII characters with hexadecimal HTML entities.
$ crystal run perf/benchmark.cr --release --no-debug
# 100 iterations
:escape 0.400000 0.030000 0.430000 ( 0.466291)
:unescape_once 0.760000 0.000000 0.760000 ( 0.822908)
:unescape! 1.810000 0.010000 1.820000 ( 2.019186)
$ neofetch
CPU: AMD Athlon 5350 APU with Radeon R3 (4) @ 2.050GHz
Memory: 1555MiB / 7934MiB