uni_text_seg v0.1.3
Unicode Text Segmentation
Shard TextSegment implements Unicode Text Segmentation according to Unicode Standard Annex #29 (Unicode version 13.0.0) to determine the grapheme cluster boundaries of Unicode text.
In Crystal, the String class provides a codepoints method that returns the Unicode code points of a string. However, multiple code points may combine into a single user-perceived character, which the Unicode specification calls a grapheme cluster. Here are some examples:
String | Bytes (UTF-8) | Code points | Grapheme clusters |
---|---|---|---|
Käse | 6 bytes: 4b 61 cc 88 73 65 | 5 code points: 4b 61 308 73 65 | 4 clusters: [4b],[61 308],[73],[65] |
🧙‍♂️💈 | 17 bytes: f0 9f a7 99 e2 80 8d e2 99 82 ef b8 8f f0 9f 92 88 | 5 code points: 1f9d9 200d 2642 fe0f 1f488 | 2 clusters: [1f9d9 200d 2642 fe0f],[1f488] |
🇵🇰 | 8 bytes: f0 9f 87 b5 f0 9f 87 b0 | 2 code points: 1f1f5 1f1f0 | 1 cluster: [1f1f5 1f1f0] |
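The byte and code point columns of the table can be reproduced with Crystal's standard library alone. The following is a minimal sketch; the helper name byte_and_codepoint_counts is hypothetical and not part of this shard.

```crystal
# Hypothetical helper (not part of the shard): returns the UTF-8 byte count
# and the code point count of a string, using only the standard library.
def byte_and_codepoint_counts(text : String) : {Int32, Int32}
  {text.bytesize, text.codepoints.size}
end

kaese = "Ka\u0308se"         # "Käse" written with a combining diaeresis (U+0308)
flag  = "\u{1F1F5}\u{1F1F0}" # two regional indicator symbols

puts byte_and_codepoint_counts(kaese) # => {6, 5}
puts byte_and_codepoint_counts(flag)  # => {8, 2}

# Code points in hexadecimal, matching the "Code points" column.
puts kaese.codepoints.map(&.to_s(16)).join(" ") # => 4b 61 308 73 65
```

Neither count matches the number of user-perceived characters (4 and 1 respectively), which is the gap grapheme cluster segmentation fills.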
This shard provides a tool to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.
Installation
- Add the dependency to your shard.yml:

      dependencies:
        textseg:
          github: naqvis/uni_text_seg

- Run shards install
Usage
require "textseg"
TextSegment.each_grapheme("🔮👍🏼!") do |cluster|
  pp cluster.codepoints
  pp cluster.positions
  pp cluster.str
  pp cluster.bytes
end
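Building on the Usage example, counting user-perceived characters or splitting a string at cluster boundaries could look like the sketch below. It relies only on TextSegment.each_grapheme and cluster.str as shown above. The helper names grapheme_count and grapheme_strings are hypothetical, and the sketch assumes cluster.str returns a String.

```crystal
require "textseg"

# Hypothetical helper: counts grapheme clusters by iterating with the
# each_grapheme call from the Usage example.
def grapheme_count(text : String) : Int32
  count = 0
  TextSegment.each_grapheme(text) { |_cluster| count += 1 }
  count
end

# Hypothetical helper: collects the string form of each cluster.
# Assumes cluster.str is a String.
def grapheme_strings(text : String) : Array(String)
  parts = [] of String
  TextSegment.each_grapheme(text) { |cluster| parts << cluster.str }
  parts
end

kaese = "Ka\u0308se" # "Käse" with a combining diaeresis (U+0308)
puts grapheme_count(kaese)
pp grapheme_strings(kaese)
```

Going by the table in the introduction, the count for this input should be 4, with the second element holding both the letter a and the combining diaeresis.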
Development
To run all tests:
crystal spec
Contributing
- Fork it (https://github.com/naqvis/uni_text_seg/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
Contributors
- Ali Naqvi - creator and maintainer
- May 6, 2021
MIT License