Unicode Character & String Width

I published a shard that computes the fixed display width of Unicode characters and strings.

This is awesome. As with "Unicode Text Segmentation - Grapheme clusters", it would be great to merge this into the stdlib. These are relevant tools for building truly internationalized software and should be included in the standard library.

There's even a use case in the compiler for this: "Formatter calculates the width of a fullwidth character as 1" (issue #11034 in crystal-lang/crystal on GitHub).

A while back I wrote my own tiny piece of code for calculating the display width of a Unicode string, using the wcwidth function from libc. In my benchmarks it appears to be about 35 times faster than naqvis/uni_char_width. I'm not an expert on these things, so it's possible I'm missing something. Here is the code:

# Bind wcwidth(3) from the C library.
lib LibC
  alias WChar = UInt32
  fun wcwidth(c : WChar) : Int
end

module Unicode
  extend self

  # Returns the display width of *s* in columns, as reported by wcwidth.
  def width(s : String) : Int32
    width = 0
    s.each_char do |ch|
      wclen = LibC.wcwidth(ch.ord.to_u32)
      # wcwidth returns -1 for non-printable characters; count those as 1 column.
      wclen = 1 if wclen < 0
      width += wclen
    end
    width
  end
end

The shard's implementation might not be optimized, but the comparison you are making is apples to oranges.

POSIX wcwidth and the uni_char_width shard, which implements Unicode Standard Annex #11 (East Asian Width), serve different purposes.

A simple example:

# posix wcwidth
pp Unicode.width("つのだ☆HIRO") # => 8

# uni_char_width

# without east-asian context
pp UnicodeCharWidth.width("つのだ☆HIRO") # => 11

# with east-asian context
c = UnicodeCharWidth::Condition.new(false)
c.east_asian = true
pp c.width("つのだ☆HIRO") # => 12
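
The 11 versus 12 results above can be reproduced by hand. As a rough sketch (this is not how uni_char_width is implemented): in Unicode Standard Annex #11, Hiragana characters are wide and U+2606 WHITE STAR (☆) is ambiguous, so ☆ counts as 1 column normally and 2 in an East Asian context. The `WidthSketch` module and its constants are hypothetical names made up for this illustration, and the hard-coded ranges cover only the characters in this one string.

```crystal
# Hypothetical helper for illustration only; not part of uni_char_width
# or the stdlib. Handles just the code points used in the example string.
module WidthSketch
  extend self

  # Hiragana block (U+3040..U+309F): East Asian Width "W" (wide).
  HIRAGANA = 0x3040..0x309F
  # U+2605 BLACK STAR and U+2606 WHITE STAR: East Asian Width "A" (ambiguous).
  AMBIGUOUS_STARS = 0x2605..0x2606

  def width(s : String, east_asian : Bool = false) : Int32
    total = 0
    s.each_char do |ch|
      cp = ch.ord
      total += if HIRAGANA.includes?(cp)
                 2 # wide characters always take two columns
               elsif AMBIGUOUS_STARS.includes?(cp)
                 east_asian ? 2 : 1 # ambiguous: depends on context
               else
                 1 # everything else in this example is narrow
               end
    end
    total
  end
end

puts WidthSketch.width("つのだ☆HIRO")                   # => 11 (3*2 + 1 + 4)
puts WidthSketch.width("つのだ☆HIRO", east_asian: true) # => 12 (3*2 + 2 + 4)
```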

HTH

Thank you for the explanation.