The Crystal Programming Language Forum

Iterate over "extended grapheme clusters" in a String

Many languages now have the ability to treat a text string as a sequence or collection of extended grapheme clusters, rather than Unicode code points. Is there an equivalent in Crystal?

Examples:

There’s no such thing yet. In the meantime you can use a regex like this one:
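The regex itself did not survive in this post. As a minimal sketch, assuming it was the PCRE `\X` escape (the one used in the reply below), collecting the clusters could look like this; the sample string is only an illustration:

```crystal
# Assumption: \X matches one extended grapheme cluster (PCRE escape).
s = "e\u{301}a" # "e" + combining acute accent, then "a"

# String#scan without a block returns an Array(Regex::MatchData);
# m[0] is the full text of each match.
clusters = s.scan(/\X/).map { |m| m[0] }

clusters.each do |g|
  puts "#{g}\t#{g.codepoints}"
end
```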

Thanks!

I’m not much of a Rubyist… would something like this be considered idiomatic?

def each_grapheme(s : String, &)
  s.scan(/\X/) do |match|
    yield match[0]
  end
end

def graphemes(s : String) : Array(String)
  result = [] of String
  each_grapheme(s) { |g| result << g }
  result # the last expression is the return value
end

# Example from https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html
s = "\u{E9}\u{65}\u{301}\u{D55C}\u{1112}\u{1161}\u{11AB}"
each_grapheme(s) do |g|
  puts "#{g}\t#{g.codepoints}"
end

Edit: is it possible to mimic Ruby’s “enumerator” API, which has a #to_a method? Not that it’s really necessary, but these kinds of toy problems help me learn my way around a new language.

I think it’s a good API. The main reasons we didn’t add it to the standard library in that form are lack of time and not wanting to depend on the regex engine. A dedicated grapheme type might also be nice to have. So for now all of that is pending.

There’s no good way to do an Enumerator like Ruby’s. There have been past discussions, but none of the proposed approaches is as efficient as a plain old Iterator.

I guess it should be easy to implement an Iterator for this: keep the current byte index, initially zero, match the regex starting at that index, and advance the index to the end of each match. Actually, we should probably add an Iterator for the scan method, though I don’t know what it would be called (each_scan reads weird, maybe scan_iterator).
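A rough sketch of the Iterator described above might look like the following. `GraphemeIterator` is a hypothetical name, not something in the standard library, and the sketch still relies on the regex engine's `\X` escape rather than a native segmentation algorithm. Because `Iterator` includes `Enumerable`, it also gets `#to_a` for free, which covers the Ruby-style use case asked about earlier.

```crystal
# Hypothetical GraphemeIterator: an illustration, not a stdlib class.
class GraphemeIterator
  include Iterator(String)

  def initialize(@string : String)
    @byte_index = 0 # where the next match starts
  end

  def next
    # Assumption: \X matches one extended grapheme cluster.
    match = /\X/.match_at_byte_index(@string, @byte_index)
    return stop unless match

    # \X consumes at least one character, so the index always advances.
    @byte_index = match.byte_end(0)
    match[0]
  end
end

s = "\u{E9}\u{65}\u{301}\u{D55C}\u{1112}\u{1161}\u{11AB}"
GraphemeIterator.new(s).each do |g|
  puts "#{g}\t#{g.codepoints}"
end

# Iterator includes Enumerable, so #to_a, #map, #first(n), etc. are available.
graphemes = GraphemeIterator.new(s).to_a
```

Since the iterator is lazy, something like `GraphemeIterator.new(s).first(2)` only runs the regex as far as it needs to.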


@anodize See Unicode Text Segmentation - Grapheme clusters and https://github.com/crystal-lang/crystal/pull/10721