Unicode Text Segmentation

The TextSegment shard implements Unicode text segmentation as specified in Unicode Standard Annex #29 (Unicode version 13.0.0) to determine the grapheme cluster boundaries of Unicode text.

In Crystal, the String class provides a codepoints method that returns the Unicode code points of a string. However, multiple code points may combine into one user-perceived character, which the Unicode specification calls a grapheme cluster. Here are some examples:

| String | Bytes (UTF-8) | Code points | Grapheme clusters |
| --- | --- | --- | --- |
| Käse | 6 bytes: `4b 61 cc 88 73 65` | 5 code points: `4b 61 308 73 65` | 4 clusters: `[4b],[61 308],[73],[65]` |
| 🧙‍♂️💈 | 17 bytes: `f0 9f a7 99 e2 80 8d e2 99 82 ef b8 8f f0 9f 92 88` | 5 code points: `1f9d9 200d 2642 fe0f 1f488` | 2 clusters: `[1f9d9 200d 2642 fe0f],[1f488]` |
| 🇵🇰 | 8 bytes: `f0 9f 87 b5 f0 9f 87 b0` | 2 code points: `1f1f5 1f1f0` | 1 cluster: `[1f1f5 1f1f0]` |
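
The byte and code point counts in these examples can be reproduced with Crystal's standard library alone. The following is a minimal sketch, not part of the shard; the variable names are illustrative, and the strings are written with escapes so that the decomposed form of "ä" (a followed by U+0308) is explicit.

```crystal
# Standard library only: compare byte counts and code point counts.
kaese = "Ka\u0308se"             # "a" followed by combining diaeresis U+0308
flag  = "\u{1F1F5}\u{1F1F0}"     # two regional indicator symbols

puts kaese.bytesize                              # number of UTF-8 bytes
puts kaese.codepoints.size                       # number of code points
puts kaese.codepoints.map(&.to_s(16)).join(" ")  # code points in hex

puts flag.bytesize
puts flag.codepoints.size
```

Neither count matches what a reader perceives as characters, which is the gap that grapheme cluster segmentation fills.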

This shard provides a tool to iterate over these grapheme clusters. It can be used to count the user-perceived characters in a string, to split strings at their intended places, or to extract individual characters that form a unit.

Installation

Add the dependency to your shard.yml:

dependencies:
  textseg:
    github: naqvis/uni_text_seg

Run shards install

Usage

require "textseg"

# Iterate over the grapheme clusters of the string and inspect each one.
TextSegment.each_grapheme("🔮👍🏼!") do |cluster|
  pp cluster.codepoints
  pp cluster.positions
  pp cluster.str
  pp cluster.bytes
end
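
Counting user-perceived characters, one of the use cases mentioned earlier, could be sketched as follows. This relies only on the `TextSegment.each_grapheme` call shown in the usage example; `grapheme_count` is a hypothetical helper written for illustration and is not part of the shard.

```crystal
require "textseg"

# Hypothetical helper (not provided by the shard): counts grapheme
# clusters by iterating with TextSegment.each_grapheme.
def grapheme_count(text : String) : Int32
  count = 0
  TextSegment.each_grapheme(text) { |_cluster| count += 1 }
  count
end

puts grapheme_count("Ka\u0308se")
```

For the decomposed "Käse" listed in the examples, the expected count is 4 clusters, compared with 5 code points.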

Development

To run all tests:

crystal spec

Contributing

1. Fork it (https://github.com/naqvis/uni_text_seg/fork)
2. Create your feature branch (git checkout -b my-new-feature)
3. Commit your changes (git commit -am 'Add some feature')
4. Push to the branch (git push origin my-new-feature)
5. Create a new Pull Request

Contributors

Ali Naqvi - creator and maintainer

License

MIT License
