Linguistic sorting?

BrucePerens · November 25, 2021, 8:36pm

String#<=> uses unsafe.memcmp, so it’s a numeric byte sort rather than anything language-aware. Sorting UTF-8 codepoints for a particular language would require a character-order table for that language, so that "ä" and "a" sort correctly relative to each other. A table of characters to ignore when sorting, like "'", is also necessary.
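To make the two-table idea concrete, here is a minimal sketch. It is written in Python only for brevity; the same shape should carry over to a key-based sort in Crystal. The alphabet in `ORDER`, the `IGNORED` set, and the helper name `sort_key` are all hypothetical and chosen for illustration; they are not a real locale definition.

```python
# Hypothetical character-order table: "ä" sorts right after "a", and so on.
# This is an illustrative alphabet, not an official locale definition.
ORDER = "aäbcdefghijklmnoöpqrsßtuüvwxyz"
RANK = {ch: i for i, ch in enumerate(ORDER)}

# Hypothetical table of characters that are skipped when comparing.
IGNORED = {"'", "-"}


def sort_key(s):
    """Map a string to a list of ranks taken from the order table.

    Characters missing from the table sort after all known ones,
    ordered by code point.
    """
    return [
        RANK.get(ch, len(RANK) + ord(ch))
        for ch in s.lower()
        if ch not in IGNORED
    ]


words = ["zebra", "äpfel", "apfel", "o'brien", "obst", "bär"]

# Byte-wise order, comparable to what a memcmp-based comparison gives.
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# Table-driven order.
by_table = sorted(words, key=sort_key)

print(by_bytes)
print(by_table)
```

In the byte-wise result "äpfel" lands after "zebra" because its first UTF-8 byte is larger than any ASCII letter; with the table it sorts next to "apfel".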

Has anyone done this for Crystal? There is a treatise on Unicode sorting, UTS #10: Unicode Collation Algorithm, which describes a multi-level sort with weights, a lot more than just two tables, but I don’t know of an implementation.
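For a rough feel of what "multi-level" means, the sketch below compares base letters first, then accents, then case. It is a heavily simplified illustration in Python and not an implementation of UTS #10: it uses no collation element table, no locale tailoring, and no variable weighting. The function name `multilevel_key` is hypothetical.

```python
import unicodedata


def multilevel_key(s):
    """Build a three-level key: base letters, then accents, then case.

    A simplified illustration of multi-level comparison, not the
    Unicode Collation Algorithm itself.
    """
    primary = []    # base letters, case-folded
    secondary = []  # combining marks attached to each base letter
    tertiary = []   # 0 for lowercase or uncased, 1 for uppercase

    # NFD splits "é" into "e" plus a combining acute accent.
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.casefold())
        secondary.append(())
        tertiary.append(1 if ch.isupper() else 0)

    # Tuples compare element by element, so a later level only
    # matters when all earlier levels are equal.
    return (primary, secondary, tertiary)


words = ["résumé", "resume", "Resume", "résister", "rest"]
sorted_words = sorted(words, key=multilevel_key)
print(sorted_words)
```

Here accents and case only break ties between strings whose base letters are identical, which is the core idea behind the weighted levels.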

Thanks

Bruce

naqvis · November 26, 2021, 4:31pm

None I’m aware of. Your best bet would be to go with ICU, and luckily there are Crystal bindings for it. GitHub - olbat/icu.cr: A Crystal binding/wrapper to the ICU library

Collator might be what you are looking for:
https://olbat.github.io/icu.cr/ICU/Collator.html

Hope it helps.

Ali Naqvi
