Linguistic sorting?

String#<=> uses to_unsafe.memcmp, so it’s a byte-wise comparison rather than anything language-aware. Sorting UTF-8 text for a particular language would require a character-order table for that language, so that "ä" and "a" sort properly relative to each other. A table of characters to ignore when sorting, such as "'", is also necessary.
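
To make the two-table idea concrete, here is a minimal sketch using only the standard library. The ORDER and IGNORED tables and the sort_key helper are hypothetical names invented for this illustration, and the character order is an assumption rather than the real rules of any language.

```crystal
# Hypothetical tables for illustration; a real language needs far more entries.
ORDER   = "aäbcdefghijklmnoöpqrstuüvwxyz" # assumed character order
IGNORED = ['\'', '-']                     # assumed characters to skip

# Builds a sort key: one rank per character, skipping ignored characters.
# Characters missing from ORDER rank after all listed ones.
def sort_key(word : String) : Array(Int32)
  word.downcase.chars
    .reject { |c| IGNORED.includes?(c) }
    .map { |c| ORDER.index(c) || ORDER.size + c.ord }
end

words = ["zebra", "äpfel", "apfel", "o'neil", "obrien"]

p words.sort                        # byte-wise, as String#<=> does
p words.sort_by { |w| sort_key(w) } # table-driven
```

The byte-wise sort puts "äpfel" last (its first byte is 0xC3) and "o'neil" before "obrien"; the table-driven sort places "äpfel" right after "apfel" and "obrien" before "o'neil". Words with identical keys, such as "obrien" and "o'brien", would tie, which is one reason a single-level table is not enough.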

Has anyone done this for Crystal? There is a treatise on Unicode sorting, UTS #10: Unicode Collation Algorithm, which describes a multi-level sort with weights, a lot more than just two tables, but I don’t know of an implementation.
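
As a rough illustration of what "multi-level with weights" means, the sketch below gives each character a primary (base letter), secondary (accent), and tertiary (case) weight, then builds a key from all primaries, then all secondaries, then all tertiaries, with 0 as a level separator. The WEIGHTS table and the uca_like_key helper are hypothetical, and the weight values are made up; the real weights come from the tables published with UTS #10, and the full algorithm also covers normalization, contractions, and variable weighting, none of which is attempted here.

```crystal
# Hypothetical {primary, secondary, tertiary} weights, not real UCA data.
WEIGHTS = {
  'a' => {1, 1, 1},
  'A' => {1, 1, 2}, # same letter, differs only in case (tertiary)
  'ä' => {1, 2, 1}, # same letter, differs in accent (secondary)
  'b' => {2, 1, 1},
}

# Builds a key level by level, so accent differences only matter when
# all base letters are equal, and case only when accents are equal too.
# Raises KeyError for characters missing from the toy table.
def uca_like_key(word : String) : Array(Int32)
  elements = word.chars.map { |c| WEIGHTS[c] }
  key = [] of Int32
  3.times do |level|
    key.concat(elements.map { |e| e[level] })
    key << 0 # level separator
  end
  key
end

words = ["b", "äb", "Ab", "ab", "aa"]
p words.sort_by { |w| uca_like_key(w) }
```

With these made-up weights the result is ["aa", "ab", "Ab", "äb", "b"], whereas the byte-wise words.sort yields ["Ab", "aa", "ab", "b", "äb"].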

Thanks

Bruce

None that I’m aware of. Your best bet would be to go with ICU, and luckily there is a Crystal binding for it. GitHub - olbat/icu.cr: A Crystal binding/wrapper to the ICU library

Collator might be what you are looking for:
https://olbat.github.io/icu.cr/ICU/Collator.html

Hope it helps.

Ali Naqvi
