da_html_escape.cr

A Crystal shard to escape/unescape strings for HTML5.

DA_HTML_ESCAPE is a Crystal shard to escape/unescape HTML.

A much better shard: https://github.com/kostya/myhtml
More info: https://github.com/kostya/entities/pull/1#issuecomment-330849435

If you are still curious about this shard:

HTML entity decoding is done by the kostya/myhtml shard.

Encoding characters into HTML entities is done by taking the codepoints and converting them to hexadecimal HTML entities. I ported segments of HTMLEntities by Paul Battley for the encoding and most of the specs/tests.

This shard escapes all non-ASCII characters to hexadecimal only. No named or decimal entities are used when escaping. This shard is also useless for XML entities.
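
The encoding approach described above can be sketched in a few lines of Crystal. This is a minimal illustration, not the shard's actual source: the helper name hex_escape, the UNSAFE_ASCII set, and the choice to turn every ASCII control character (newlines and tabs included) into a space are assumptions made for the example.

```crystal
# Hypothetical set of ASCII characters to escape; the shard's real list may differ.
UNSAFE_ASCII = {'<', '>', '&', '"', '\''}

# Hypothetical helper: escapes unsafe ASCII and all non-ASCII characters
# as lowercase hexadecimal entities, without named or decimal entities.
def hex_escape(input : String) : String
  String.build do |io|
    input.each_char do |char|
      code = char.ord
      if code < 0x20 || code == 0x7f
        # Assumption: every ASCII control character becomes a space.
        io << ' '
      elsif code > 0x7f || UNSAFE_ASCII.includes?(char)
        # Codepoint written as a hexadecimal entity, e.g. U+00E9 -> &#xe9;
        io << "&#x" << code.to_s(16) << ';'
      else
        io << char
      end
    end
  end
end

puts hex_escape("<\u00e9lan>") # prints &#x3c;&#xe9;lan&#x3e;
```

Working from codepoints keeps the escaper free of any entity-name table, which is why only hexadecimal entities appear in the output.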

I use this because the HTML module in Crystal's standard library only escapes a few characters. I decided to play it extra safe and escape all non-ASCII characters.

Also, HTML.unescape and XML.parse_html do not fully unescape certain characters. kostya/myhtml is the only one that properly unescapes all of the following into < brackets:

    bracket = "
      < &lt &lt; &LT &LT; &#60 &#060 &#0060
      &#00060 &#000060 &#0000060 &#60; &#060; &#0060; &#00060;
      &#000060; &#0000060; &#x3c &#x03c &#x003c &#x0003c &#x00003c
      &#x000003c &#x3c; &#x03c; &#x003c; &#x0003c; &#x00003c;
      &#x000003c; &#X3c &#X03c &#X003c &#X0003c &#X00003c &#X000003c
      &#X3c; &#X03c; &#X003c; &#X0003c; &#X00003c; &#X000003c;
      &#x3C &#x03C &#x003C &#x0003C &#x00003C &#x000003C &#x3C; &#x03C;
      &#x003C; &#x0003C; &#x00003C; &#x000003C; &#X3C &#X03C
      &#X003C &#X0003C &#X00003C &#X000003C &#X3C; &#X03C; &#X003C; &#X0003C;
      &#X00003C; &#X000003C; \x3c \x3C \u003c \u003C
    "
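
To see how a given Crystal version handles these spellings, a small check along the following lines may help. It is only a sketch: the helper name failed_spellings and the SAMPLES constant are made up for the example, only a handful of the spellings above are included, and the result depends on the standard library version, so no expected output is shown.

```crystal
require "html"

# A few of the spellings from the list above. "\u003c" is resolved by the
# Crystal compiler, so HTML.unescape only ever sees a literal "<" for it.
SAMPLES = ["<", "&lt", "&lt;", "&LT;", "&#60", "&#0000060;", "&#x3c", "&#X00003C;", "\u003c"]

# Hypothetical helper: returns the spellings that HTML.unescape
# does not turn into a plain "<".
def failed_spellings(samples : Array(String)) : Array(String)
  samples.reject { |sample| HTML.unescape(sample) == "<" }
end

failed_spellings(SAMPLES).each do |sample|
  puts "not decoded: #{sample}"
end
```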

Notes:

Further security info: OWASP: Cross Site Scripting Prevention

ASCII Table:

List of hexadecimal entities with counterpart codepoints:

Searchable list of entities: http://www.fileformat.info/info/unicode/char/0000/index.htm

Usage:

  require "da_html_escape"

  # Escape unsafe/non-ASCII codepoints using hexadecimal entities:
  raw = "<élan>"
  DA_HTML_ESCAPE.escape(raw) # => "&#x3c;&#xe9;lan&#x3e;"

  # Unescaping:
  escaped = "&eacute;lan"
  DA_HTML_ESCAPE.unescape_once(escaped) # => "élan"

  # unescape! keeps looping until there is
  # nothing left to unescape:
  escaped = "&amp;amp;amp;eacute;lan"
  DA_HTML_ESCAPE.unescape!(escaped) # => "élan"
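
The "keep looping" behaviour of unescape! can be pictured as a fixed-point loop: unescape once, compare with the previous value, and stop when nothing changes. The sketch below is an assumption about the general idea, not the shard's code. It uses HTML.unescape from the standard library as a stand-in for the single pass (the shard itself relies on kostya/myhtml), and both the name unescape_until_stable and the max_passes guard are hypothetical.

```crystal
require "html"

# Hypothetical helper: repeats a single unescape pass until the string
# stops changing, or until max_passes passes have been made.
def unescape_until_stable(input : String, max_passes : Int32 = 10) : String
  current = input
  max_passes.times do
    unescaped = HTML.unescape(current)
    # Fixed point reached: another pass would change nothing.
    return unescaped if unescaped == current
    current = unescaped
  end
  current
end

puts unescape_until_stable("&amp;amp;lt;") # prints <
```

The pass limit is there so that the loop always terminates, even on long chains of nested escapes.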

Licence

This code is free to use under the terms of the MIT licence. See the file LICENSE for more details.

Useless Benchmarks:

NOTE: Encoding is twice as slow as HTML.escape from the Crystal standard library. The main reason is that DA_HTML_ESCAPE replaces control characters with spaces and non-ASCII characters with hexadecimal HTML entities.

$ crystal run perf/benchmark.cr --release --no-debug
  # 100 iterations
  :escape           0.400000   0.030000   0.430000 (  0.466291)
  :unescape_once    0.760000   0.000000   0.760000 (  0.822908)
  :unescape!        1.810000   0.010000   1.820000 (  2.019186)

$ neofetch
  CPU: AMD Athlon 5350 APU with Radeon R3 (4) @ 2.050GHz
  Memory: 1555MiB / 7934MiB