Unicode Character & String Width

naqvis · May 8, 2021, 1:27pm

I published a shard that computes the fixed display width of Unicode characters and strings.

straight-shoota · May 8, 2021, 3:33pm

This is awesome. As with Unicode Text Segmentation - Grapheme clusters, it would be great to merge this into stdlib. I think these are relevant tools for building truly internationalized software and should be included in the standard library.

straight-shoota · July 29, 2021, 11:06am

There’s even a use case in the compiler for this: “Formatter calculates the width of a fullwidth character as 1” (Issue #11034, crystal-lang/crystal).

bloovis · January 8, 2024, 11:49am

A while back I wrote my own tiny piece of code for calculating the display width of a Unicode string, using the wcwidth function from libc. In my benchmarks it appears to be about 35 times faster than naqvis/uni_char_width. I'm not an expert on these things, so it’s possible I’m missing something. Here is the code:

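lib LibC
  alias WChar = UInt32
  # wcwidth(3): number of columns for a wide character, or -1 if it is not printable
  fun wcwidth(c : WChar) : Int
end

module Unicode
  extend self

  # Display width of a string: the sum of wcwidth over its characters.
  def width(s : String) : Int32
    width = 0
    chreader = Char::Reader.new(s)
    chreader.each do |ch|
      wc : LibC::WChar = ch.ord.to_u
      wclen = LibC.wcwidth(wc)
      # Count characters that wcwidth reports as non-printable (-1) as one column.
      if wclen < 0
        wclen = 1
      end
      width += wclen
    end
    width
  end
end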
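
One thing worth checking with a wcwidth-based approach: as far as I know, glibc's wcwidth consults the current locale, and a program that never calls setlocale runs in the "C" locale, where non-ASCII characters may be reported as non-printable (-1). Below is a minimal sketch of adopting the environment's locale first. The `LocaleSketch` lib, the `libc_width` helper and the `LC_ALL = 6` value are assumptions for illustration (6 is the Linux value; other platforms differ), not part of the code above or of any shard.

```crystal
# Hypothetical binding; a separate lib name avoids clashing with existing LibC declarations.
lib LocaleSketch
  LC_ALL = 6 # assumption: value on Linux (glibc/musl); macOS, for example, uses 0
  fun setlocale(category : Int32, locale : UInt8*) : UInt8*
  fun wcwidth(c : UInt32) : Int32
end

# An empty locale string asks libc to use the locale from the environment (LANG, LC_*).
LocaleSketch.setlocale(LocaleSketch::LC_ALL, "")

# Same fallback as the code above: characters reported as -1 count as one column.
def libc_width(s : String) : Int32
  s.each_char.sum do |ch|
    w = LocaleSketch.wcwidth(ch.ord.to_u32)
    w < 0 ? 1 : w
  end
end

pp libc_width("つのだ☆HIRO")
```

The result depends on the platform's libc and on the locale in effect, so no expected output is shown here.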

naqvis · January 9, 2024, 7:10am

The shard implementation might not be optimized, but the comparison you are making is apples to oranges.

POSIX wcwidth and the uni_char_width shard, which implements Unicode Standard Annex #11 (East Asian Width), serve different purposes.

A simple example:

# posix wcwidth
pp Unicode.width("つのだ☆HIRO") # => 8

# uni_char_width

# without east-asian context
pp UnicodeCharWidth.width("つのだ☆HIRO") # => 11

# with east-asian context
c = UnicodeCharWidth::Condition.new(false)
c.east_asian = true
pp c.width("つのだ☆HIRO") # => 12
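
The difference between 11 and 12 comes from how UAX #11 classes are mapped to columns: wide (W) characters take two columns, while ambiguous (A) characters take one column normally and two in an East Asian context. Here is a rough sketch of that rule with a tiny hand-picked table. The `WIDE` and `AMBIGUOUS` constants and the `sketch_width` helper are hypothetical and cover only the characters in this example; this is not how the shard is implemented.

```crystal
# Hypothetical, hand-picked subset of the East Asian Width data; the real tables are far larger.
WIDE      = [0x3041..0x3096] # Hiragana letters, class W
AMBIGUOUS = [0x2605..0x2606] # black and white star, class A

def sketch_width(s : String, east_asian : Bool = false) : Int32
  s.each_char.sum do |ch|
    cp = ch.ord
    if WIDE.any?(&.includes?(cp))
      2 # wide characters always take two columns
    elsif AMBIGUOUS.any?(&.includes?(cp))
      east_asian ? 2 : 1 # ambiguous width depends on context
    else
      1 # everything else is treated as narrow in this sketch
    end
  end
end

pp sketch_width("つのだ☆HIRO")       # => 11
pp sketch_width("つのだ☆HIRO", true) # => 12
```

The three Hiragana characters contribute 6 columns, HIRO contributes 4, and ☆ contributes 1 or 2 depending on the context flag.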

HTH

bloovis · January 9, 2024, 1:12pm

Thank you for the explanation.
