zh-chardet.cr
A Crystal library for detecting and reading Chinese character encodings.
Supported Encodings
This library supports detection and decoding of the following encodings, with specific heuristics to disambiguate between them:
- UTF-8
- UTF-16 (BE/LE with BOM)
- GB18030 (Superset of GBK and GB2312)
- Big5 (Traditional Chinese)
- EUC-TW (Traditional Chinese, with SS2 detection)
- ISO-2022-CN
- HZ-GB-2312 (custom decoder included)
Installation
- Add the dependency to your shard.yml:

  dependencies:
    zh-chardet:
      github: chivi/zh-chardet.cr

- Run shards install
Usage
require "zh-chardet"
# 1. Detect Encoding from Bytes
# Returns the encoding name as a String (e.g., "GB18030", "UTF-8")
bytes = File.open("file.txt", "r") { |f| f.read_string({f.size, 1000}.min) }
encoding = ZhChardet.chardet(bytes)
puts "Detected: #{encoding}"
# 2. Read String (Auto-detect and decode)
# Automatically detects the encoding and converts the content to a UTF-8 String.
# Handles custom decoding for formats like HZ-GB-2312.
text = ZhChardet.read_string(bytes)
puts text
# 3. Read File directly
# Convenience method to read a file content as UTF-8.
text = ZhChardet.read_file("path/to/file.txt")
puts text
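To illustrate what custom HZ-GB-2312 decoding involves, here is a minimal sketch. It is not the library's actual code, and `hz_to_euc_cn` is a hypothetical name chosen for this example. It assumes the usual HZ conventions: `~{` enters two-byte GB mode, `~}` leaves it, `~~` is a literal tilde, and `~` followed by a newline is a line continuation. Inside GB mode, setting the high bit of each byte yields the EUC-CN (GB2312) form.

```crystal
# Hypothetical helper, not part of zh-chardet's documented API.
# Converts HZ-encoded bytes into EUC-CN (GB2312) bytes.
def hz_to_euc_cn(input : Bytes) : Bytes
  result = IO::Memory.new
  gb_mode = false
  i = 0
  while i < input.size
    byte = input[i]
    following = input[i + 1]?
    if gb_mode
      if following.nil?
        break # truncated two-byte pair: drop it
      elsif byte == 0x7E && following == 0x7D
        gb_mode = false # "~}" returns to ASCII mode
      else
        # Setting the high bit maps a 7-bit GB pair to its EUC-CN form.
        result.write_byte(byte | 0x80_u8)
        result.write_byte(following | 0x80_u8)
      end
      i += 2
    elsif byte == 0x7E && following
      case following
      when 0x7B then gb_mode = true          # "~{" enters GB mode
      when 0x7E then result.write_byte(byte) # "~~" is a literal tilde
      when 0x0A then nil                     # "~\n" is a line continuation
      else
        # Unknown escape: pass both bytes through unchanged.
        result.write_byte(byte)
        result.write_byte(following)
      end
      i += 2
    else
      result.write_byte(byte)
      i += 1
    end
  end
  result.to_slice
end

puts hz_to_euc_cn("~{VPND~}".to_slice).hexstring # => d6d0cec4
```

If the platform's iconv supports GB2312, the resulting bytes could then be passed to `String.new(bytes, "GB2312")` to obtain a UTF-8 String. Whether the library takes this route internally is not stated here.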
How it works
The library uses a priority-based detection algorithm:
- BOM Check: Checks for UTF-8/16 BOMs.
- Distinctive Sequences: Checks for escape sequences (ISO-2022-CN) or unique markers (HZ).
- Validity Check: Performs strict UTF-8 validation.
- Heuristic Scoring: If multiple multi-byte encodings (GB18030, Big5, EUC-TW) are strictly valid, it measures the density of valid Han characters (\p{Han}) and specific byte sequences (like 0x8E for EUC-TW) to determine the most likely encoding.
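The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the library's implementation: the module `DetectSketch`, its method names, and the returned labels such as "UTF-16BE" are hypothetical. The HZ check (both `~{` and `~}` present) is deliberately simplistic, and the scoring step is reduced to a single `han_density` helper that works on text already decoded with a candidate encoding.

```crystal
# Illustrative sketch only; module and method names are hypothetical,
# not zh-chardet's actual implementation.
module DetectSketch
  extend self

  # Steps 1-3: returns an encoding name, or nil when heuristic scoring is needed.
  def detect_early(bytes : Bytes) : String?
    bom_encoding(bytes) || escape_encoding(bytes) ||
      (String.new(bytes).valid_encoding? ? "UTF-8" : nil)
  end

  # Step 1: byte order marks.
  def bom_encoding(bytes : Bytes) : String?
    return "UTF-8" if prefix?(bytes, Bytes[0xEF, 0xBB, 0xBF])
    return "UTF-16BE" if prefix?(bytes, Bytes[0xFE, 0xFF])
    return "UTF-16LE" if prefix?(bytes, Bytes[0xFF, 0xFE])
    nil
  end

  # Step 2: distinctive escape sequences and markers.
  def escape_encoding(bytes : Bytes) : String?
    # ESC $ ) A designates GB 2312; ESC $ ) G designates CNS 11643 plane 1.
    if contains?(bytes, "\e$)A".to_slice) || contains?(bytes, "\e$)G".to_slice)
      return "ISO-2022-CN"
    end
    # HZ wraps GB text in "~{" ... "~}".
    if contains?(bytes, "~{".to_slice) && contains?(bytes, "~}".to_slice)
      return "HZ-GB-2312"
    end
    nil
  end

  # Step 4 ingredient: fraction of characters that belong to the Han script.
  def han_density(text : String) : Float64
    return 0.0 if text.empty?
    text.scan(/\p{Han}/).size / text.size
  end

  private def prefix?(bytes : Bytes, prefix : Bytes) : Bool
    bytes.size >= prefix.size && bytes[0, prefix.size] == prefix
  end

  private def contains?(bytes : Bytes, needle : Bytes) : Bool
    return false if needle.size > bytes.size
    (0..bytes.size - needle.size).any? { |i| bytes[i, needle.size] == needle }
  end
end
```

The ordering matters: ISO-2022-CN and HZ text is 7-bit, so it would also pass UTF-8 validation, which is why the escape check runs before the validity check. A scoring step could decode the input with each remaining candidate and prefer the one with the highest `han_density`.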
Contributing
- Fork it (https://github.com/chivi/zh-chardet.cr/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
License
MIT License