zh-chardet.cr

A Crystal library for detecting and reading Chinese character encodings.

Supported Encodings

This library supports detection and decoding of the following encodings, with specific heuristics to disambiguate between them:

  • UTF-8
  • UTF-16 (BE/LE with BOM)
  • GB18030 (superset of GBK and GB2312)
  • Big5 (Traditional Chinese)
  • EUC-TW (Traditional Chinese, with SS2 detection)
  • ISO-2022-CN
  • HZ-GB-2312 (custom decoder included)
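HZ-GB-2312 is the one format in this list that wraps GB2312 in plain 7-bit ASCII, which is why it needs a decoding step of its own. The following is a minimal sketch of such a decoder, based on the HZ escape rules from RFC 1843. It is not the library's implementation: `decode_hz` is a hypothetical helper, and the final conversion assumes the platform's iconv knows the name "GB18030".

```crystal
# Hypothetical helper, not zh-chardet's decoder: unwraps HZ escapes into
# 8-bit GB2312 bytes, then lets the standard library convert them to UTF-8.
def decode_hz(input : Bytes) : String
  gb = IO::Memory.new # collects ASCII bytes and 8-bit GB2312 pairs
  gb_mode = false
  i = 0

  while i < input.size
    byte = input[i]
    follower = input[i + 1]? # nil when `byte` is the last byte

    if gb_mode
      if follower
        if byte == 0x7E && follower == 0x7D # "~}" returns to ASCII mode
          gb_mode = false
        else
          # Each 7-bit pair becomes a GB2312 pair by setting the high bit.
          gb.write_byte(byte | 0x80_u8)
          gb.write_byte(follower | 0x80_u8)
        end
      end
      i += 2 # a dangling half pair at the end is dropped
    elsif byte == 0x7E && follower
      case follower
      when 0x7B # "~{" enters GB mode
        gb_mode = true
      when 0x7E # "~~" is a literal tilde
        gb.write_byte(0x7E_u8)
      when 0x0A # "~\n" is a line continuation and emits nothing
      else
        # Unknown escape: keep both bytes as they are.
        gb.write_byte(byte)
        gb.write_byte(follower)
      end
      i += 2
    else
      gb.write_byte(byte)
      i += 1
    end
  end

  # GB18030 is a superset of GB2312, so it can decode the collected bytes.
  String.new(gb.to_slice, "GB18030")
end

puts decode_hz("Hi ~{VPND~} ~~ok".to_slice)
```

Here the pair "VP" (0x56 0x50) maps to the GB2312 bytes 0xD6 0xD0, and "ND" to 0xCE 0xC4. Malformed input (for example a missing "~}") is handled leniently in this sketch; a production decoder would likely want stricter checks.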

Installation

  1. Add the dependency to your shard.yml:

    dependencies:
      zh-chardet:
        github: chivi/zh-chardet.cr
    
  2. Run shards install

Usage

require "zh-chardet"

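# 1. Detect Encoding from Bytes
# Returns the encoding name as a String (e.g., "GB18030", "UTF-8").
# The sample is at most the first 1000 bytes of the file; read_string
# returns them as a String holding the raw, undecoded bytes.
bytes = File.open("file.txt", "r") { |f| f.read_string({f.size, 1000}.min) }
encoding = ZhChardet.chardet(bytes)
puts "Detected: #{encoding}"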

# 2. Read String (Auto-detect and decode)
# Automatically detects the encoding and converts the content to a UTF-8 String.
# Handles custom decoding for formats like HZ-GB-2312.
text = ZhChardet.read_string(bytes)
puts text

# 3. Read File directly
# Convenience method that reads a file's content as a UTF-8 String.
text = ZhChardet.read_file("path/to/file.txt")
puts text

How it works

The library uses a priority-based detection algorithm:

  1. BOM Check: Looks for a UTF-8 or UTF-16 byte order mark.
  2. Distinctive Sequences: Looks for escape sequences (ISO-2022-CN) or unique markers (HZ).
  3. Validity Check: Performs strict UTF-8 validation.
  4. Heuristic Scoring: If the input is strictly valid in more than one multi-byte encoding (GB18030, Big5, EUC-TW), the library measures the density of valid Han characters (\p{Han}) and of specific byte sequences (such as 0x8E for EUC-TW) to pick the most likely encoding.
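
The priority order above can be sketched with the Crystal standard library alone. This is an illustration under stated assumptions, not the library's code: `sniff_encoding`, `han_density`, `guess_by_han_density`, and `CANDIDATES` are hypothetical names, the marker checks are simplified, the 0x8E signal for EUC-TW is left out, and the iconv names "GB18030", "BIG5", and "EUC-TW" are assumed to be available on the platform.

```crystal
# Hypothetical sketch of the four steps; none of these names are zh-chardet's API.
CANDIDATES = %w(GB18030 BIG5 EUC-TW)

def starts_with?(bytes : Bytes, prefix : Bytes) : Bool
  bytes.size >= prefix.size && bytes[0, prefix.size] == prefix
end

# Steps 1-3. Returns nil when the input needs heuristic scoring (step 4).
def sniff_encoding(bytes : Bytes) : String?
  # Step 1: a byte order mark settles the question immediately.
  return "UTF-8" if starts_with?(bytes, Bytes[0xEF, 0xBB, 0xBF])
  return "UTF-16BE" if starts_with?(bytes, Bytes[0xFE, 0xFF])
  return "UTF-16LE" if starts_with?(bytes, Bytes[0xFF, 0xFE])

  # Step 2: ISO-2022-CN and HZ are 7-bit formats, so their markers are only
  # searched for when no byte has the high bit set.
  if bytes.all? { |b| b < 0x80 }
    ascii = String.new(bytes)
    return "ISO-2022-CN" if ascii.includes?("\e$)") # ESC $ ) designation
    return "HZ-GB-2312" if ascii.includes?("~{") && ascii.includes?("~}")
  end

  # Step 3: strict UTF-8 validation (plain ASCII passes as well).
  return "UTF-8" if String.new(bytes).valid_encoding?

  nil
end

# Step 4 helper: share of decoded characters that are Han.
# Input that is invalid in `encoding` (or an unknown encoding name) scores 0.
def han_density(bytes : Bytes, encoding : String) : Float64
  text = String.new(bytes, encoding) # raises ArgumentError on invalid input
  return 0.0 if text.empty?
  text.scan(/\p{Han}/).size / text.size
rescue ArgumentError
  0.0
end

# Step 4: pick the candidate with the highest Han density.
def guess_by_han_density(bytes : Bytes) : String
  CANDIDATES.max_by { |encoding| han_density(bytes, encoding) }
end

sample = "~{VPND~}".to_slice
puts sniff_encoding(sample) || guess_by_han_density(sample)
```

Han density alone often cannot separate GB18030 from Big5, because many byte pairs decode to valid Han characters in both. That is presumably why the library combines it with encoding-specific byte sequences.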

Contributing

  1. Fork it (https://github.com/chivi/zh-chardet.cr/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

License

MIT License