Unicode Text Segmentation - Grapheme clusters

Just published a shard to determine the grapheme cluster boundaries of Unicode text.

In Crystal, the String class provides a codepoints method that returns a string's Unicode code points. However, multiple code points may combine into one user-perceived character, which the Unicode specification calls a grapheme cluster.
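As a rough illustration of that mismatch, here is a minimal sketch that uses only the standard String#codepoints and String#size methods, not the shard itself. The variable names are just for illustration, and the example assumes the usual "e plus combining acute accent" spelling of "é".

```crystal
# "é" written as a base letter plus a combining accent:
# U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT).
decomposed = "e\u0301"

# The same visible character as a single precomposed code point, U+00E9.
precomposed = "\u00E9"

p decomposed.codepoints  # => [101, 769]
p precomposed.codepoints # => [233]

# String#size counts code points, so the two strings report different
# lengths even though a reader sees one character in each.
p decomposed.size  # => 2
p precomposed.size # => 1
```

Both strings render as one user-perceived character, so counting code points overstates the length of the first one. Closing that gap is what grapheme cluster segmentation is for.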

This shard provides a tool to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.

Wow, this is great!

Ideally I’d like this to eventually be in the standard library, if it’s well optimized. Graphemes are very important in some application domains, and it would be really nice to have support for them out of the box (like in Swift).

Yes, it would be awesome to include this in stdlib.
