String#ascii_only?

self · March 28, 2024, 11:36pm

I’m probably wrong, but couldn’t String#ascii_only? be simplified?
The current implementation is:

  def ascii_only? : Bool
    if @bytesize == size
      each_byte do |byte|
        return false unless byte < 0x80
      end
      true
    else
      false
    end
  end

I think it could be shortened to:

  def ascii_only_2? : Bool
    to_slice.all? &.< 0x80
  end

or (faster):

  def ascii_only_3? : Bool
    return false  if (@length > 0 && @length != @bytesize)
    to_slice.each{ |byte| return false  if byte >= 0x80 }
    true
  end

In --release build, both #ascii_only_3? and #ascii_only_2? are much faster than #ascii_only? by a factor of 10 (depending upon the corpus of strings).
In non---release build, #ascii_only_3? is faster is faster than #ascii_only? and #ascii_only_2?.
I’d go with #ascii_only_3?.

Blacksmoke16 · March 29, 2024, 12:21am

Can you share your benchmark code? When I ran it, ascii_only_2? was indeed faster than the current implementation, but only by like 20%, not a factor of 10.

self · March 29, 2024, 12:45am

strs_ascii = [
  "abcdef",
  "a",
  "ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication.",
  "Because of technical limitations of computer systems at the time it was invented, ASCII has just 128 code points, of which only 95 are printable characters, which severely limited its scope.",
  "ASCII was developed in part from telegraph code.",
  "",
]
strs_non_ascii = [
  "ASCII was developed in part from telegraph code. …",
  "ASCII was developed in part from telegraph code. 😎",
  "😎 ASCII was developed in part from telegraph code.",
  "\xED\xA0\x80\xED\xBF\xBF",
  "あ",
]
strs = strs_ascii + strs_non_ascii

class String
  def ascii_only_2? : Bool
    to_slice.all? &.< 0x80
  end
  def ascii_only_3? : Bool
    return false  if (@length > 0 && @length != @bytesize)
    to_slice.each{ |byte| return false  if byte >= 0x80 }
    true
  end
end

strs.each{ |str|
  if (str.ascii_only_2? != str.ascii_only?)
    puts "#{str.inspect}.ascii_only_2? != #{str.inspect}.ascii_only?"
    exit 1
  end
  if (str.ascii_only_3? != str.ascii_only?)
    puts "#{str.inspect}.ascii_only_3? != #{str.inspect}.ascii_only?"
    exit 1
  end
}

require "benchmark"
Benchmark.ips do |x|
  x.report("ascii_only?"  ) { strs.each( &.ascii_only?   )}
  x.report("ascii_only_2?") { strs.each( &.ascii_only_2? )}
  x.report("ascii_only_3?") { strs.each( &.ascii_only_3? )}
end

jgaskins · March 29, 2024, 12:46am

Hmm, I’m not seeing a notable difference in performance. I also threw in another attempt, which I realized after writing it is basically a manual inlining of your 2 implementation.

require "benchmark"

{
  "hello",
  "." * 1024,
}.each do |string|
  puts
  puts "String size: #{string.bytesize}B"
  value = nil
  Benchmark.ips do |x|
    # Without doing multiple iterations per block, it runs too
    # quickly to get a useful measurement because it completes
    # in ~1ns.
    iterations = 1000

    x.report "ascii_only?" { iterations.times { value = string.ascii_only? } }
    x.report "ascii_only_2?" { iterations.times { value = string.ascii_only_2? } }
    x.report "ascii_only_3?" { iterations.times { value = string.ascii_only_3? } }
    x.report "ascii_only_4?" { iterations.times { value = string.ascii_only_4? } }
  end
  pp value: value # Side effect so LLVM won't optimize out the blocks
end

class String
  def ascii_only_2? : Bool
    to_slice.all? &.< 0x80
  end

  def ascii_only_3? : Bool
    return false if (@length > 0 && @length != @bytesize)
    to_slice.each { |byte| return false if byte >= 0x80 }
    true
  end

  def ascii_only_4? : Bool
    each_byte do |byte|
      return false if byte >= 0x80
    end
    true
  end
end

➜  Code crystal run --release bench_ascii_only.cr

String size: 5B
  ascii_only? 355.19k (  2.82µs) (± 0.73%)  0.0B/op   1.12× slower
ascii_only_2? 355.44k (  2.81µs) (± 0.34%)  0.0B/op   1.12× slower
ascii_only_3? 396.39k (  2.52µs) (± 5.54%)  0.0B/op        fastest
ascii_only_4? 354.27k (  2.82µs) (± 0.88%)  0.0B/op   1.12× slower
{value: true}

String size: 1024B
  ascii_only?   3.03k (330.48µs) (± 0.47%)  0.0B/op   1.01× slower
ascii_only_2?   3.04k (328.77µs) (± 0.24%)  0.0B/op   1.00× slower
ascii_only_3?   3.04k (329.19µs) (± 0.34%)  0.0B/op   1.00× slower
ascii_only_4?   3.04k (328.75µs) (± 0.27%)  0.0B/op        fastest
{value: true}

jgaskins · March 29, 2024, 1:03am

Hmm, it just occurred to me that String#size memoizes the character count, so doing any benchmark that uses that instance variable repeatedly on a single string instance is going to have skewed results.

self · March 29, 2024, 1:30am

Looks like I need to learn how to benchmark. :-)
There doesn’t seem much of a difference.

require "benchmark"

{
  "hello",
  "." * 1024,
  "abcdef",
  "a",
  "ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication.",
  "Because of technical limitations of computer systems at the time it was invented, ASCII has just 128 code points, of which only 95 are printable characters, which severely limited its scope.",
  "ASCII was developed in part from telegraph code.",
  "",
  "ASCII was developed in part from telegraph code. …",
  "ASCII was developed in part from telegraph code. 😎",
  "😎 ASCII was developed in part from telegraph code.",
  "\xED\xA0\x80\xED\xBF\xBF",
  "あ",
}.each do |string|
  puts
  puts "String size: #{string.bytesize}B + x"
  value = nil
  Benchmark.ips do |x|
    # Without doing multiple iterations per block, it runs too
    # quickly to get a useful measurement because it completes
    # in ~1ns.
    n = 1000
    n2 = 3

    x.report "ascii_only?" { n.times {
    	s = string + Random.rand(Int32::MAX).to_s(36);
    	n2.times { value = s.ascii_only? }
    }}
    x.report "ascii_only_2?" { n.times {
    	s = string + Random.rand(Int32::MAX).to_s(36);
    	n2.times { value = s.ascii_only_2? }
    }}
    x.report "ascii_only_3?" { n.times {
    	s = string + Random.rand(Int32::MAX).to_s(36);
    	n2.times { value = s.ascii_only_3? }
    }}
    x.report "ascii_only_4?" { n.times {
    	s = string + Random.rand(Int32::MAX).to_s(36);
    	n2.times { value = s.ascii_only_4? }
    }}
  end
  pp value: value # Side effect so LLVM won't optimize out the blocks
end

class String
  def ascii_only_2? : Bool
    to_slice.all? &.< 0x80
  end

  def ascii_only_3? : Bool
    return false if (@length > 0 && @length != @bytesize)
    to_slice.each { |byte| return false if byte >= 0x80 }
    true
  end

  def ascii_only_4? : Bool
    each_byte do |byte|
      return false if byte >= 0x80
    end
    true
  end
end

String size: 5B + x
  ascii_only?  20.82k ( 48.02µs) (± 1.20%)  62.8kB/op   1.00× slower
ascii_only_2?  19.91k ( 50.23µs) (± 1.22%)  62.8kB/op   1.05× slower
ascii_only_3?  20.85k ( 47.96µs) (± 1.12%)  62.8kB/op        fastest
ascii_only_4?  19.87k ( 50.32µs) (± 0.97%)  62.8kB/op   1.05× slower
{value: true}

String size: 1024B + x
  ascii_only? 840.82  (  1.19ms) (± 1.17%)  1.32MB/op   1.01× slower
ascii_only_2? 849.63  (  1.18ms) (± 0.93%)  1.32MB/op   1.00× slower
ascii_only_3? 830.96  (  1.20ms) (± 1.24%)  1.32MB/op   1.02× slower
ascii_only_4? 849.88  (  1.18ms) (± 1.16%)  1.32MB/op        fastest
{value: true}

String size: 6B + x
  ascii_only?  20.31k ( 49.23µs) (± 1.28%)  62.8kB/op   1.00× slower
ascii_only_2?  19.30k ( 51.82µs) (± 2.14%)  62.8kB/op   1.05× slower
ascii_only_3?  20.34k ( 49.16µs) (± 1.21%)  62.8kB/op        fastest
ascii_only_4?  19.42k ( 51.49µs) (± 1.17%)  62.8kB/op   1.05× slower
{value: true}

String size: 1B + x
  ascii_only?  23.39k ( 42.75µs) (± 1.31%)  62.8kB/op        fastest
ascii_only_2?  22.16k ( 45.13µs) (± 1.27%)  62.8kB/op   1.06× slower
ascii_only_3?  23.38k ( 42.77µs) (± 1.01%)  62.8kB/op   1.00× slower
ascii_only_4?  22.03k ( 45.39µs) (± 1.15%)  62.8kB/op   1.06× slower
{value: true}

String size: 138B + x
  ascii_only?   4.49k (222.54µs) (± 1.01%)  188kB/op   1.03× slower
ascii_only_2?   4.63k (215.86µs) (± 1.37%)  188kB/op   1.00× slower
ascii_only_3?   4.55k (219.87µs) (± 1.02%)  188kB/op   1.02× slower
ascii_only_4?   4.64k (215.36µs) (± 1.07%)  188kB/op        fastest
{value: true}

String size: 190B + x
  ascii_only?   3.58k (279.23µs) (± 1.03%)  250kB/op   1.04× slower
ascii_only_2?   3.73k (268.17µs) (± 1.21%)  250kB/op        fastest
ascii_only_3?   3.60k (277.42µs) (± 1.15%)  250kB/op   1.03× slower
ascii_only_4?   3.62k (276.01µs) (± 1.08%)  250kB/op   1.03× slower
{value: true}

String size: 48B + x
  ascii_only?  10.00k ( 99.98µs) (± 1.49%)  109kB/op   1.01× slower
ascii_only_2?   9.98k (100.20µs) (± 1.62%)  109kB/op   1.01× slower
ascii_only_3?  10.03k ( 99.69µs) (± 1.27%)  109kB/op   1.01× slower
ascii_only_4?  10.09k ( 99.08µs) (± 1.42%)  109kB/op        fastest
{value: true}

String size: 0B + x
  ascii_only?  36.55k ( 27.36µs) (± 1.10%)  31.3kB/op        fastest
ascii_only_2?  33.43k ( 29.91µs) (± 1.34%)  31.3kB/op   1.09× slower
ascii_only_3?  36.53k ( 27.38µs) (± 1.10%)  31.4kB/op   1.00× slower
ascii_only_4?  33.53k ( 29.82µs) (± 0.89%)  31.3kB/op   1.09× slower
{value: true}

String size: 52B + x
  ascii_only?  23.01k ( 43.46µs) (± 1.96%)  109kB/op   1.02× slower
ascii_only_2?  10.96k ( 91.25µs) (± 1.11%)  109kB/op   2.14× slower
ascii_only_3?  23.42k ( 42.69µs) (± 1.84%)  109kB/op        fastest
ascii_only_4?  10.95k ( 91.35µs) (± 1.28%)  109kB/op   2.14× slower
{value: false}

String size: 53B + x
  ascii_only?  23.15k ( 43.19µs) (± 1.57%)  109kB/op   1.02× slower
ascii_only_2?  10.93k ( 91.52µs) (± 1.13%)  109kB/op   2.17× slower
ascii_only_3?  23.69k ( 42.22µs) (± 1.38%)  109kB/op        fastest
ascii_only_4?  11.03k ( 90.69µs) (± 1.17%)  109kB/op   2.15× slower
{value: false}

String size: 53B + x
  ascii_only?  22.74k ( 43.97µs) (± 1.85%)  109kB/op   1.02× slower
ascii_only_2?  22.69k ( 44.06µs) (± 1.87%)  109kB/op   1.02× slower
ascii_only_3?  23.19k ( 43.12µs) (± 1.88%)  109kB/op        fastest
ascii_only_4?  22.74k ( 43.98µs) (± 1.80%)  109kB/op   1.02× slower
{value: false}

String size: 6B + x
  ascii_only?  26.42k ( 37.86µs) (± 1.03%)  62.8kB/op   1.00× slower
ascii_only_2?  26.42k ( 37.85µs) (± 1.06%)  62.8kB/op   1.00× slower
ascii_only_3?  26.45k ( 37.81µs) (± 1.19%)  62.8kB/op   1.00× slower
ascii_only_4?  26.50k ( 37.73µs) (± 1.00%)  62.8kB/op        fastest
{value: false}

String size: 3B + x
  ascii_only?  27.19k ( 36.78µs) (± 1.17%)  62.8kB/op   1.03× slower
ascii_only_2?  27.18k ( 36.79µs) (± 1.21%)  62.8kB/op   1.03× slower
ascii_only_3?  27.87k ( 35.88µs) (± 1.31%)  62.8kB/op        fastest
ascii_only_4?  27.20k ( 36.76µs) (± 1.11%)  62.8kB/op   1.02× slower
{value: false}

self · March 30, 2024, 1:34pm

I’ll throw in yet another implementation that checks multiple bytes at once.

I thought that memory alignment plays a role here, but I don’t know enough about that.
(What would I check? alignof(String)? alignof(Bytes), because String#to_slice is a Bytes? alignof(UInt8*), because String#to_unsafe is a Pointer(UInt8)?)

Suprisingly though, the 16 bytes (128 bits) version is faster than the 8 bytes (64 bits) version.

Maybe it would make sense to use is_ascii from simdutf?

Also related:

github.com/crystal-lang/crystal

Cache code range in `String`

opened 09:35AM - 22 Apr 22 UTC

straight-shoota

kind:feature performance topic:stdlib:text tough-cookie

I just stumbled across this article in the current issue of Ruby Weekly: [*Code …Ranges: A Deeper Look at Ruby Strings*](https://shopify.engineering/code-ranges-ruby-strings). It's an interesting read about an internal optimization Ruby interpreters use to characterize a string. Basically, it caches the results of `ascii_only?` and `valid_encoding?`, thus they run in *O(1)* instead of *O(N)* (on repeated use). We could apply this optimization to Crystal's `String` class as well. It's even easier to implement because strings are immutable in Crystal (so we don't have to worry about invalidation) and there is only one valid encoding (UTF-8). It's also more efficient for string literals because the compiler can calculate code range values ahead of time. For implementing this, we need 2 bits in the string header to represent the four states: * ascii_only * valid * broken * unknown We could try to squeeze that into the existing header values, which would reduce the effectively available string size a bit. So it would probably be considered a breaking change. But I don't think there would be any problems with increasing the size of the string header. It just needs to be synchronized between compiler and runtime (which can be signalled via a compile time flag).

class String
  
  def ascii_only_128? : Bool
    case @bytesize
    when     0 ; return true
    when .<  4 ; return (to_slice.all? &.< 0x80)
    when .<  8
      slice32 = Slice( UInt32 ).new( to_unsafe.as( UInt32* ), @bytesize // 4 )  #/ (fix syntax highlighting in the forum)
      return false  if slice32.any?{ |b4| (b4 & 0x80808080_u32) != 0 }
      num_rest_bytes = @bytesize - slice32.bytesize
      unless num_rest_bytes == 0
        rest_slice8 = to_slice[ slice32.bytesize, num_rest_bytes ]
        return false  if rest_slice8.any?{ |b| (b & 0x80_u8) != 0 }
      end
      return true
    when .< 16
      slice64 = Slice( UInt64 ).new( to_unsafe.as( UInt64* ), @bytesize // 8 )  #/ (fix syntax highlighting in the forum)
      return false  if slice64.any?{ |b8| (b8 & 0x8080808080808080_u64) != 0 }
      num_rest_bytes = @bytesize - slice64.bytesize
      unless num_rest_bytes == 0
        rest_slice8 = to_slice[ slice64.bytesize, num_rest_bytes ]
        return false  if rest_slice8.any?{ |b| (b & 0x80_u8) != 0 }
      end
      return true
    else
      slice128 = Slice( UInt128 ).new( to_unsafe.as( UInt128* ), @bytesize // 16 )  #/ (fix syntax highlighting in the forum)
      return false  if slice128.any?{ |b16| (b16 & 0x80808080808080808080808080808080_u128) != 0 }
      num_rest_bytes = @bytesize - slice128.bytesize
      unless num_rest_bytes == 0
        rest_slice8 = to_slice[ slice128.bytesize, num_rest_bytes ]
        return false  if rest_slice8.any?{ |b| (b & 0x80_u8) != 0 }
      end
      return true
    end
  end
end

require "benchmark"

strs = {
  "hello",
  "." * 1024,
  "a",
  "ASCII was developed in part from telegraph code.",
  "",
  "ASCII was developed in part from telegraph code. 😎",
  "😎 ASCII was developed in part from telegraph code.",
  "\xED\xA0\x80\xED\xBF\xBF",
  "あ",
  "ASCII was developed in part from telegraph code.____",
  "ASCII was developed in part from telegraph code. 😎___",
}

strs.each do |str|
  puts "#{str.inspect} ..."
  if (str.ascii_only_128? != str.ascii_only?)
    puts "ERR: #{str.inspect}.ascii_only_128? (#{str.ascii_only_128?.inspect}) != #{str.inspect}.ascii_only? (#{str.ascii_only?.inspect})"
  end
end

strs.each do |str|
  puts
  puts "String size: #{str.bytesize} B (ASCII-only?: #{str.ascii_only?.inspect})"
  value = nil
  Benchmark.ips do |x|
    # Without doing multiple iterations per block, it runs too
    # quickly to get a useful measurement because it completes
    # in ~1ns.
    n = 500
    n2 = 2
    
    x.report "ascii_only?" { n.times {
      new_str = String.new( str.to_slice )
      n2.times { value = new_str.ascii_only? }
    }}
    x.report "ascii_only_128?" { n.times {
      new_str = String.new( str.to_slice )
      n2.times { value = new_str.ascii_only_128? }
    }}
  end
  pp value: value # Side effect so LLVM won't optimize out the blocks
end

String size: 5 B (ASCII-only?: true)
    ascii_only?  90.21k ( 11.09µs) (± 0.74%)  15.7kB/op   1.44× slower
ascii_only_128? 130.02k (  7.69µs) (± 5.37%)  15.7kB/op        fastest
{value: true}

String size: 1024 B (ASCII-only?: true)
    ascii_only?   1.34k (744.37µs) (± 2.00%)  656kB/op   7.66× slower
ascii_only_128?  10.28k ( 97.23µs) (± 3.40%)  656kB/op        fastest
{value: true}

String size: 1 B (ASCII-only?: true)
    ascii_only? 186.71k (  5.36µs) (± 1.02%)  7.86kB/op   1.07× slower
ascii_only_128? 199.20k (  5.02µs) (± 0.61%)  7.86kB/op        fastest
{value: true}

String size: 48 B (ASCII-only?: true)
    ascii_only?  23.64k ( 42.30µs) (± 1.15%)  31.2kB/op   3.89× slower
ascii_only_128?  92.06k ( 10.86µs) (± 1.16%)  31.2kB/op        fastest
{value: true}

String size: 0 B (ASCII-only?: true)
    ascii_only? 635.57k (  1.57µs) (± 1.04%)  0.0B/op   1.42× slower
ascii_only_128? 902.90k (  1.11µs) (± 1.38%)  0.0B/op        fastest
{value: true}

String size: 53 B (ASCII-only?: false)
    ascii_only?  32.17k ( 31.08µs) (± 1.76%)  39.1kB/op   2.25× slower
ascii_only_128?  72.51k ( 13.79µs) (± 1.24%)  39.1kB/op        fastest
{value: false}

String size: 53 B (ASCII-only?: false)
    ascii_only?  36.98k ( 27.04µs) (± 0.74%)  39.1kB/op   2.36× slower
ascii_only_128?  87.32k ( 11.45µs) (± 0.87%)  39.1kB/op        fastest
{value: false}

String size: 6 B (ASCII-only?: false)
    ascii_only?  93.38k ( 10.71µs) (± 0.83%)  15.7kB/op   1.36× slower
ascii_only_128? 127.33k (  7.85µs) (± 0.76%)  15.7kB/op        fastest
{value: false}

String size: 3 B (ASCII-only?: false)
    ascii_only? 129.55k (  7.72µs) (± 0.73%)  15.7kB/op   1.09× slower
ascii_only_128? 140.86k (  7.10µs) (± 1.21%)  15.7kB/op        fastest
{value: false}

String size: 52 B (ASCII-only?: true)
    ascii_only?  21.19k ( 47.19µs) (± 0.83%)  39.1kB/op   3.26× slower
ascii_only_128?  69.16k ( 14.46µs) (± 0.53%)  39.1kB/op        fastest
{value: true}

String size: 56 B (ASCII-only?: false)
    ascii_only?  36.88k ( 27.11µs) (± 0.66%)  39.1kB/op   2.13× slower
ascii_only_128?  78.55k ( 12.73µs) (± 0.66%)  39.1kB/op        fastest
{value: false}

kojix2 · March 31, 2024, 12:35am

I ran it on my own computer.

String size: 5 B (ASCII-only?: true)
    ascii_only?  74.89k ( 13.35µs) (± 0.63%)  15.9kB/op   1.16× slower
ascii_only_128?  86.88k ( 11.51µs) (± 0.55%)  15.9kB/op        fastest
{value: true}

String size: 1024 B (ASCII-only?: true)
    ascii_only?   1.80k (555.54µs) (± 0.06%)  656kB/op   5.23× slower
ascii_only_128?   9.41k (106.29µs) (± 0.44%)  656kB/op        fastest
{value: true}

String size: 1 B (ASCII-only?: true)
    ascii_only? 144.13k (  6.94µs) (± 0.13%)  7.91kB/op   1.04× slower
ascii_only_128? 149.47k (  6.69µs) (± 0.14%)  7.91kB/op        fastest
{value: true}

String size: 48 B (ASCII-only?: true)
    ascii_only?  27.11k ( 36.89µs) (± 0.12%)  31.3kB/op   3.57× slower
ascii_only_128?  96.86k ( 10.32µs) (± 0.25%)  31.3kB/op        fastest
{value: true}

String size: 0 B (ASCII-only?: true)
    ascii_only? 883.34k (  1.13µs) (± 0.07%)  0.0B/op   1.11× slower
ascii_only_128? 980.73k (  1.02µs) (± 0.01%)  0.0B/op        fastest
{value: true}

String size: 53 B (ASCII-only?: false)
    ascii_only?  44.50k ( 22.47µs) (± 0.12%)  39.2kB/op   1.79× slower
ascii_only_128?  79.71k ( 12.55µs) (± 0.22%)  39.2kB/op        fastest
{value: false}

String size: 53 B (ASCII-only?: false)
    ascii_only?  35.76k ( 27.96µs) (± 0.05%)  39.2kB/op   2.58× slower
ascii_only_128?  92.23k ( 10.84µs) (± 0.34%)  39.2kB/op        fastest
{value: false}

String size: 6 B (ASCII-only?: false)
    ascii_only?  99.75k ( 10.03µs) (± 0.15%)  15.7kB/op   1.37× slower
ascii_only_128? 137.15k (  7.29µs) (± 0.20%)  15.7kB/op        fastest
{value: false}

String size: 3 B (ASCII-only?: false)
    ascii_only? 130.78k (  7.65µs) (± 0.17%)  15.7kB/op   1.08× slower
ascii_only_128? 140.87k (  7.10µs) (± 0.20%)  15.7kB/op        fastest
{value: false}

String size: 52 B (ASCII-only?: true)
    ascii_only?  28.47k ( 35.12µs) (± 0.08%)  39.2kB/op   2.54× slower
ascii_only_128?  72.44k ( 13.80µs) (± 0.38%)  39.2kB/op        fastest
{value: true}

String size: 56 B (ASCII-only?: false)
    ascii_only?  34.36k ( 29.11µs) (± 0.19%)  39.2kB/op   2.29× slower
ascii_only_128?  78.57k ( 12.73µs) (± 0.21%)  39.2kB/op        fastest
{value: false}

I have used ascii_only? to validate DNA sequence data. I think this speed-up very useful.

self · April 1, 2024, 11:33pm

I was curious about the simdutf lib. I have hardly any experience with C, let alone C++, but that didn’t stop me from trying anyway. :-) It’s probably all wrong, but here’s what I did.

brew install simdutf (on a Mac)

It’s a C++ lib, so it needs a C wrapper.

simdutf_wrapper.h:

#ifndef SIMDUTF_WRAPPER_H
#define SIMDUTF_WRAPPER_H

#include <stddef.h>  // for size_t

#ifdef __cplusplus
extern "C" {
#endif

// declare the wrapper function
int simdutf_validate_ascii( const char* str, size_t len );

#ifdef __cplusplus
}
#endif

#endif // SIMDUTF_WRAPPER_H

simdutf_wrapper.cpp:

#include "simdutf_wrapper.h"
#include <simdutf.h>

int simdutf_validate_ascii( const char* str, size_t len ){
	return simdutf::validate_ascii( str, len );
}

(Now that I think about it, this isn’t a C wrapper but a C++ wrapper. )

Then compile that to an object file:
gcc -O3 -I/opt/homebrew/include -std=c++20 -c simdutf_wrapper.cpp -o simdutf_wrapper.o
or
clang++ -std=c++20 -I/opt/homebrew/include -c -O3 simdutf_wrapper.cpp

You can now use that in Crystal:

@[Link(ldflags: "#{__DIR__}/simdutf_wrapper/simdutf_wrapper.o -lsimdutf")]
lib SIMDUTF
  fun simdutf_validate_ascii( str : LibC::Char*, len : LibC::Int ) : LibC::Int
  # are the types LibC::Int or LibC::Long ?
end

class String
  def ascii_only_simdutf? : Bool
    SIMDUTF.simdutf_validate_ascii( to_unsafe, @bytesize ) != 0
  end
end

I don’t want to repeat the whole benchmark code here, but my findings can be summarized like this:

For short strings, using simdutf is slower.
For long strings with a non-ASCII character near the beginning, using simdutf is slower.
Only for long strings (>= 512 B or so) with a non-ASCII character near the end or long ASCII-only strings, using simdutf is faster, but not like night and day, unless you get into the megabytes.

I’ll attribute the slowness not to simdutf itself but to the overhead of calling an external library.

String size: 0 B (ASCII-only?: true, string is: "")
        ascii_only? 317.77k (  3.15µs) (± 1.34%)  0.0B/op   1.42× slower
    ascii_only_128? 452.48k (  2.21µs) (± 1.85%)  0.0B/op        fastest
ascii_only_simdutf? 109.53k (  9.13µs) (± 1.19%)  0.0B/op   4.13× slower
{value: true}

String size: 1 B (ASCII-only?: true, string is: "a")
        ascii_only? 104.73k (  9.55µs) (± 2.02%)  15.6kB/op   1.08× slower
    ascii_only_128? 112.84k (  8.86µs) (± 1.72%)  15.6kB/op        fastest
ascii_only_simdutf?  30.20k ( 33.11µs) (± 2.69%)  15.6kB/op   3.74× slower
{value: true}

String size: 5 B (ASCII-only?: true, string is: "hello")
        ascii_only?  50.94k ( 19.63µs) (± 2.08%)  31.2kB/op   1.55× slower
    ascii_only_128?  78.78k ( 12.69µs) (± 1.92%)  31.2kB/op        fastest
ascii_only_simdutf?  25.88k ( 38.63µs) (± 1.81%)  31.2kB/op   3.04× slower
{value: true}

String size: 1048580 B (ASCII-only?: false, string is: "😎" + "a" * 1_048_576)
        ascii_only?   1.40  (716.31ms) (± 0.98%)  0.98GB/op  19.77× slower
    ascii_only_128?  27.60  ( 36.23ms) (± 1.57%)  0.98GB/op        fastest
ascii_only_simdutf?  16.84  ( 59.37ms) (± 2.02%)  0.98GB/op   1.64× slower
{value: false}

String size: 1048580 B (ASCII-only?: false, string is: "a" * 1_048_576 + "😎")
        ascii_only?   1.44  (694.73ms) (± 1.15%)  0.98GB/op  13.05× slower
    ascii_only_128?  12.11  ( 82.57ms) (± 0.84%)  0.98GB/op   1.55× slower
ascii_only_simdutf?  18.78  ( 53.25ms) (± 1.28%)  0.98GB/op        fastest
{value: false}

String size: 1048576 B (ASCII-only?: true, string is: "a" * 1_048_576)
        ascii_only? 716.79m (  1.40s ) (± 0.26%)  0.98GB/op  27.27× slower
    ascii_only_128?  12.52  ( 79.86ms) (± 0.99%)  0.98GB/op   1.56× slower
ascii_only_simdutf?  19.55  ( 51.15ms) (± 2.79%)  0.98GB/op        fastest
{value: true}

If these numbers are anything to go by, I’d say using simdutf::validate_ascii doesn’t make sense unless you are dealing with very long strings and you are fairly certain that the strings are ASCII-only. The signature would then have to be something like: ascii_only?( probably_ascii : Bool = false ). Not a fan.

jgaskins · April 1, 2024, 11:48pm

Is there overhead there? My understanding is that since Crystal is ABI-compatible with C, calling into an external library should be equivalent in performance to calling a method defined in Crystal.

self · April 2, 2024, 4:40pm

Are you asking me? I really don’t know. Maybe someone more knowledgeable than me can chime in and find my mistakes. :-)

I might add that if I use simdutf::validate_ascii_with_errors instead of simdutf::validate_ascii then that’s faster than my ascii_only_128? implementation for strings longer than 1024 bytes or so, no matter where a non-ASCII character is in the string. So we could have something like:

def ascii_only_combined? : Bool
  return (case @bytesize
    when       0 ; true    # optimization for empty strings
    when .< 1024 ; ascii_only_128?
    else         ; ascii_only_simdutf_with_errors?
  end)
end

The same might be true for simdutf::validate_utf8_with_errors, which also returns the number of codepoints for valid UTF-8-encoded strings.

self · April 2, 2024, 4:50pm

That’s what Rust does by the way, where the actual implementation of is_ascii isn’t on str but on slice (which makes sense to me, because the implementation in Crystal does indeed work on Bytes).

kojix2 · April 25, 2024, 3:50am

Please continue.
There is at least one person here who would like to see ascii_only? speed up.

ysbaddaden · April 25, 2024, 12:02pm

String#length may be precomputed at compile time for static strings (to be verified). It will have to be computed at runtime for dynamic strings (i.e. iterate the entire string for utf8 codepoints) and calling it can slow things down, or not as it will only impact the first call (it’s then memoized). Either can skew your benchmark to appear faster or slower.

You’ll want to disassemble the compiled program. It’s possible that LLVM vectorized the loop to use leverage SSE or another CPU feature. It’s also likely to have inlined the calls. That will be faster in the end than calling into a C wrapper to a C++ library, especially for small strings.

self · April 25, 2024, 10:31pm

Yeah. Thanks for pointing me in the right direction.
I don’t think I’ll find the time to look at any assembler code though, and I wouldn’t really know how to read it.

The next best thing I can think of is compiling the different implementations of ascii_only? and valid_encoding? into an executable each, doing just one iteration in each executable (maybe with a ton of different strings each, all created at runtime), and having the benchmark code spawn processes many times. I think that would make sure the benchmark isn’t skewed by any LLVM optimizations.

I would assume though that this approach doesn’t provide any useful results, as I expect the overhead of process creation to be orders of magnitude greater than the actual method call. But who knows.

straight-shoota · April 26, 2024, 12:59am

Yeah, no that won’t yield any useful results.

Topic		Replies	Views
Do ascii/binary strings exist? Help & Support	25	440	March 21, 2022
A blog article on performant vs idiomatic code (using Crystal examples) News blog	15	513	February 11, 2024
Add validity flags to String Crystal Contrib	5	258	October 13, 2023
Loop two String.each_char blocks at the same time Help & Support	8	311	March 3, 2022
Unicode as syntax	17	468	July 1, 2023

String#ascii_only?

Related Topics