String#ascii_only?

I’m probably wrong, but couldn’t String#ascii_only? be simplified?
The current implementation is:

  def ascii_only? : Bool
    if @bytesize == size
      each_byte do |byte|
        return false unless byte < 0x80
      end
      true
    else
      false
    end
  end

I think it could be shortened to:

  def ascii_only_2? : Bool
    to_slice.all? &.< 0x80
  end

or (faster):

  def ascii_only_3? : Bool
    return false  if (@length > 0 && @length != @bytesize)
    to_slice.each{ |byte| return false  if byte >= 0x80 }
    true
  end

In --release build, both #ascii_only_3? and #ascii_only_2? are much faster than #ascii_only? by a factor of 10 (depending upon the corpus of strings).
In non---release build, #ascii_only_3? is faster is faster than #ascii_only? and #ascii_only_2?.
I’d go with #ascii_only_3?.

1 Like

Can you share your benchmark code? When I ran it, ascii_only_2? was indeed faster than the current implementation, but only by like 20%, not a factor of 10.

strs_ascii = [
  "abcdef",
  "a",
  "ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication.",
  "Because of technical limitations of computer systems at the time it was invented, ASCII has just 128 code points, of which only 95 are printable characters, which severely limited its scope.",
  "ASCII was developed in part from telegraph code.",
  "",
]
strs_non_ascii = [
  "ASCII was developed in part from telegraph code. …",
  "ASCII was developed in part from telegraph code. 😎",
  "😎 ASCII was developed in part from telegraph code.",
  "\xED\xA0\x80\xED\xBF\xBF",
  "あ",
]
strs = strs_ascii + strs_non_ascii

class String
  def ascii_only_2? : Bool
    to_slice.all? &.< 0x80
  end
  def ascii_only_3? : Bool
    return false  if (@length > 0 && @length != @bytesize)
    to_slice.each{ |byte| return false  if byte >= 0x80 }
    true
  end
end

strs.each{ |str|
  if (str.ascii_only_2? != str.ascii_only?)
    puts "#{str.inspect}.ascii_only_2? != #{str.inspect}.ascii_only?"
    exit 1
  end
  if (str.ascii_only_3? != str.ascii_only?)
    puts "#{str.inspect}.ascii_only_3? != #{str.inspect}.ascii_only?"
    exit 1
  end
}

require "benchmark"
Benchmark.ips do |x|
  x.report("ascii_only?"  ) { strs.each( &.ascii_only?   )}
  x.report("ascii_only_2?") { strs.each( &.ascii_only_2? )}
  x.report("ascii_only_3?") { strs.each( &.ascii_only_3? )}
end

Hmm, I’m not seeing a notable difference in performance. I also threw in another attempt, which I realized after writing it is basically a manual inlining of your 2 implementation. :smile:

require "benchmark"

{
  "hello",
  "." * 1024,
}.each do |string|
  puts
  puts "String size: #{string.bytesize}B"
  value = nil
  Benchmark.ips do |x|
    # Without doing multiple iterations per block, it runs too
    # quickly to get a useful measurement because it completes
    # in ~1ns.
    iterations = 1000

    x.report "ascii_only?" { iterations.times { value = string.ascii_only? } }
    x.report "ascii_only_2?" { iterations.times { value = string.ascii_only_2? } }
    x.report "ascii_only_3?" { iterations.times { value = string.ascii_only_3? } }
    x.report "ascii_only_4?" { iterations.times { value = string.ascii_only_4? } }
  end
  pp value: value # Side effect so LLVM won't optimize out the blocks
end

class String
  def ascii_only_2? : Bool
    to_slice.all? &.< 0x80
  end

  def ascii_only_3? : Bool
    return false if (@length > 0 && @length != @bytesize)
    to_slice.each { |byte| return false if byte >= 0x80 }
    true
  end

  def ascii_only_4? : Bool
    each_byte do |byte|
      return false if byte >= 0x80
    end
    true
  end
end
➜  Code crystal run --release bench_ascii_only.cr

String size: 5B
  ascii_only? 355.19k (  2.82µs) (± 0.73%)  0.0B/op   1.12× slower
ascii_only_2? 355.44k (  2.81µs) (± 0.34%)  0.0B/op   1.12× slower
ascii_only_3? 396.39k (  2.52µs) (± 5.54%)  0.0B/op        fastest
ascii_only_4? 354.27k (  2.82µs) (± 0.88%)  0.0B/op   1.12× slower
{value: true}

String size: 1024B
  ascii_only?   3.03k (330.48µs) (± 0.47%)  0.0B/op   1.01× slower
ascii_only_2?   3.04k (328.77µs) (± 0.24%)  0.0B/op   1.00× slower
ascii_only_3?   3.04k (329.19µs) (± 0.34%)  0.0B/op   1.00× slower
ascii_only_4?   3.04k (328.75µs) (± 0.27%)  0.0B/op        fastest
{value: true}

Hmm, it just occurred to me that String#size memoizes the character count, so doing any benchmark that uses that instance variable repeatedly on a single string instance is going to have skewed results.

Looks like I need to learn how to benchmark. :-)
There doesn’t seem much of a difference.

require "benchmark"

{
  "hello",
  "." * 1024,
  "abcdef",
  "a",
  "ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication.",
  "Because of technical limitations of computer systems at the time it was invented, ASCII has just 128 code points, of which only 95 are printable characters, which severely limited its scope.",
  "ASCII was developed in part from telegraph code.",
  "",
  "ASCII was developed in part from telegraph code. …",
  "ASCII was developed in part from telegraph code. 😎",
  "😎 ASCII was developed in part from telegraph code.",
  "\xED\xA0\x80\xED\xBF\xBF",
  "あ",
}.each do |string|
  puts
  puts "String size: #{string.bytesize}B + x"
  value = nil
  Benchmark.ips do |x|
    # Without doing multiple iterations per block, it runs too
    # quickly to get a useful measurement because it completes
    # in ~1ns.
    n = 1000
    n2 = 3

    x.report "ascii_only?" { n.times {
    	s = string + Random.rand(Int32::MAX).to_s(36);
    	n2.times { value = s.ascii_only? }
    }}
    x.report "ascii_only_2?" { n.times {
    	s = string + Random.rand(Int32::MAX).to_s(36);
    	n2.times { value = s.ascii_only_2? }
    }}
    x.report "ascii_only_3?" { n.times {
    	s = string + Random.rand(Int32::MAX).to_s(36);
    	n2.times { value = s.ascii_only_3? }
    }}
    x.report "ascii_only_4?" { n.times {
    	s = string + Random.rand(Int32::MAX).to_s(36);
    	n2.times { value = s.ascii_only_4? }
    }}
  end
  pp value: value # Side effect so LLVM won't optimize out the blocks
end

class String
  def ascii_only_2? : Bool
    to_slice.all? &.< 0x80
  end

  def ascii_only_3? : Bool
    return false if (@length > 0 && @length != @bytesize)
    to_slice.each { |byte| return false if byte >= 0x80 }
    true
  end

  def ascii_only_4? : Bool
    each_byte do |byte|
      return false if byte >= 0x80
    end
    true
  end
end
String size: 5B + x
  ascii_only?  20.82k ( 48.02µs) (± 1.20%)  62.8kB/op   1.00× slower
ascii_only_2?  19.91k ( 50.23µs) (± 1.22%)  62.8kB/op   1.05× slower
ascii_only_3?  20.85k ( 47.96µs) (± 1.12%)  62.8kB/op        fastest
ascii_only_4?  19.87k ( 50.32µs) (± 0.97%)  62.8kB/op   1.05× slower
{value: true}

String size: 1024B + x
  ascii_only? 840.82  (  1.19ms) (± 1.17%)  1.32MB/op   1.01× slower
ascii_only_2? 849.63  (  1.18ms) (± 0.93%)  1.32MB/op   1.00× slower
ascii_only_3? 830.96  (  1.20ms) (± 1.24%)  1.32MB/op   1.02× slower
ascii_only_4? 849.88  (  1.18ms) (± 1.16%)  1.32MB/op        fastest
{value: true}

String size: 6B + x
  ascii_only?  20.31k ( 49.23µs) (± 1.28%)  62.8kB/op   1.00× slower
ascii_only_2?  19.30k ( 51.82µs) (± 2.14%)  62.8kB/op   1.05× slower
ascii_only_3?  20.34k ( 49.16µs) (± 1.21%)  62.8kB/op        fastest
ascii_only_4?  19.42k ( 51.49µs) (± 1.17%)  62.8kB/op   1.05× slower
{value: true}

String size: 1B + x
  ascii_only?  23.39k ( 42.75µs) (± 1.31%)  62.8kB/op        fastest
ascii_only_2?  22.16k ( 45.13µs) (± 1.27%)  62.8kB/op   1.06× slower
ascii_only_3?  23.38k ( 42.77µs) (± 1.01%)  62.8kB/op   1.00× slower
ascii_only_4?  22.03k ( 45.39µs) (± 1.15%)  62.8kB/op   1.06× slower
{value: true}

String size: 138B + x
  ascii_only?   4.49k (222.54µs) (± 1.01%)  188kB/op   1.03× slower
ascii_only_2?   4.63k (215.86µs) (± 1.37%)  188kB/op   1.00× slower
ascii_only_3?   4.55k (219.87µs) (± 1.02%)  188kB/op   1.02× slower
ascii_only_4?   4.64k (215.36µs) (± 1.07%)  188kB/op        fastest
{value: true}

String size: 190B + x
  ascii_only?   3.58k (279.23µs) (± 1.03%)  250kB/op   1.04× slower
ascii_only_2?   3.73k (268.17µs) (± 1.21%)  250kB/op        fastest
ascii_only_3?   3.60k (277.42µs) (± 1.15%)  250kB/op   1.03× slower
ascii_only_4?   3.62k (276.01µs) (± 1.08%)  250kB/op   1.03× slower
{value: true}

String size: 48B + x
  ascii_only?  10.00k ( 99.98µs) (± 1.49%)  109kB/op   1.01× slower
ascii_only_2?   9.98k (100.20µs) (± 1.62%)  109kB/op   1.01× slower
ascii_only_3?  10.03k ( 99.69µs) (± 1.27%)  109kB/op   1.01× slower
ascii_only_4?  10.09k ( 99.08µs) (± 1.42%)  109kB/op        fastest
{value: true}

String size: 0B + x
  ascii_only?  36.55k ( 27.36µs) (± 1.10%)  31.3kB/op        fastest
ascii_only_2?  33.43k ( 29.91µs) (± 1.34%)  31.3kB/op   1.09× slower
ascii_only_3?  36.53k ( 27.38µs) (± 1.10%)  31.4kB/op   1.00× slower
ascii_only_4?  33.53k ( 29.82µs) (± 0.89%)  31.3kB/op   1.09× slower
{value: true}

String size: 52B + x
  ascii_only?  23.01k ( 43.46µs) (± 1.96%)  109kB/op   1.02× slower
ascii_only_2?  10.96k ( 91.25µs) (± 1.11%)  109kB/op   2.14× slower
ascii_only_3?  23.42k ( 42.69µs) (± 1.84%)  109kB/op        fastest
ascii_only_4?  10.95k ( 91.35µs) (± 1.28%)  109kB/op   2.14× slower
{value: false}

String size: 53B + x
  ascii_only?  23.15k ( 43.19µs) (± 1.57%)  109kB/op   1.02× slower
ascii_only_2?  10.93k ( 91.52µs) (± 1.13%)  109kB/op   2.17× slower
ascii_only_3?  23.69k ( 42.22µs) (± 1.38%)  109kB/op        fastest
ascii_only_4?  11.03k ( 90.69µs) (± 1.17%)  109kB/op   2.15× slower
{value: false}

String size: 53B + x
  ascii_only?  22.74k ( 43.97µs) (± 1.85%)  109kB/op   1.02× slower
ascii_only_2?  22.69k ( 44.06µs) (± 1.87%)  109kB/op   1.02× slower
ascii_only_3?  23.19k ( 43.12µs) (± 1.88%)  109kB/op        fastest
ascii_only_4?  22.74k ( 43.98µs) (± 1.80%)  109kB/op   1.02× slower
{value: false}

String size: 6B + x
  ascii_only?  26.42k ( 37.86µs) (± 1.03%)  62.8kB/op   1.00× slower
ascii_only_2?  26.42k ( 37.85µs) (± 1.06%)  62.8kB/op   1.00× slower
ascii_only_3?  26.45k ( 37.81µs) (± 1.19%)  62.8kB/op   1.00× slower
ascii_only_4?  26.50k ( 37.73µs) (± 1.00%)  62.8kB/op        fastest
{value: false}

String size: 3B + x
  ascii_only?  27.19k ( 36.78µs) (± 1.17%)  62.8kB/op   1.03× slower
ascii_only_2?  27.18k ( 36.79µs) (± 1.21%)  62.8kB/op   1.03× slower
ascii_only_3?  27.87k ( 35.88µs) (± 1.31%)  62.8kB/op        fastest
ascii_only_4?  27.20k ( 36.76µs) (± 1.11%)  62.8kB/op   1.02× slower
{value: false}

I’ll throw in yet another implementation that checks multiple bytes at once.

I thought that memory alignment plays a role here, but I don’t know enough about that.
(What would I check? alignof(String)? alignof(Bytes), because String#to_slice is a Bytes? alignof(UInt8*), because String#to_unsafe is a Pointer(UInt8)?)

Suprisingly though, the 16 bytes (128 bits) version is faster than the 8 bytes (64 bits) version.

Maybe it would make sense to use is_ascii from simdutf?

Also related:

class String
  
  def ascii_only_128? : Bool
    case @bytesize
    when     0 ; return true
    when .<  4 ; return (to_slice.all? &.< 0x80)
    when .<  8
      slice32 = Slice( UInt32 ).new( to_unsafe.as( UInt32* ), @bytesize // 4 )  #/ (fix syntax highlighting in the forum)
      return false  if slice32.any?{ |b4| (b4 & 0x80808080_u32) != 0 }
      num_rest_bytes = @bytesize - slice32.bytesize
      unless num_rest_bytes == 0
        rest_slice8 = to_slice[ slice32.bytesize, num_rest_bytes ]
        return false  if rest_slice8.any?{ |b| (b & 0x80_u8) != 0 }
      end
      return true
    when .< 16
      slice64 = Slice( UInt64 ).new( to_unsafe.as( UInt64* ), @bytesize // 8 )  #/ (fix syntax highlighting in the forum)
      return false  if slice64.any?{ |b8| (b8 & 0x8080808080808080_u64) != 0 }
      num_rest_bytes = @bytesize - slice64.bytesize
      unless num_rest_bytes == 0
        rest_slice8 = to_slice[ slice64.bytesize, num_rest_bytes ]
        return false  if rest_slice8.any?{ |b| (b & 0x80_u8) != 0 }
      end
      return true
    else
      slice128 = Slice( UInt128 ).new( to_unsafe.as( UInt128* ), @bytesize // 16 )  #/ (fix syntax highlighting in the forum)
      return false  if slice128.any?{ |b16| (b16 & 0x80808080808080808080808080808080_u128) != 0 }
      num_rest_bytes = @bytesize - slice128.bytesize
      unless num_rest_bytes == 0
        rest_slice8 = to_slice[ slice128.bytesize, num_rest_bytes ]
        return false  if rest_slice8.any?{ |b| (b & 0x80_u8) != 0 }
      end
      return true
    end
  end
end

require "benchmark"

strs = {
  "hello",
  "." * 1024,
  "a",
  "ASCII was developed in part from telegraph code.",
  "",
  "ASCII was developed in part from telegraph code. 😎",
  "😎 ASCII was developed in part from telegraph code.",
  "\xED\xA0\x80\xED\xBF\xBF",
  "あ",
  "ASCII was developed in part from telegraph code.____",
  "ASCII was developed in part from telegraph code. 😎___",
}

strs.each do |str|
  puts "#{str.inspect} ..."
  if (str.ascii_only_128? != str.ascii_only?)
    puts "ERR: #{str.inspect}.ascii_only_128? (#{str.ascii_only_128?.inspect}) != #{str.inspect}.ascii_only? (#{str.ascii_only?.inspect})"
  end
end

strs.each do |str|
  puts
  puts "String size: #{str.bytesize} B (ASCII-only?: #{str.ascii_only?.inspect})"
  value = nil
  Benchmark.ips do |x|
    # Without doing multiple iterations per block, it runs too
    # quickly to get a useful measurement because it completes
    # in ~1ns.
    n = 500
    n2 = 2
    
    x.report "ascii_only?" { n.times {
      new_str = String.new( str.to_slice )
      n2.times { value = new_str.ascii_only? }
    }}
    x.report "ascii_only_128?" { n.times {
      new_str = String.new( str.to_slice )
      n2.times { value = new_str.ascii_only_128? }
    }}
  end
  pp value: value # Side effect so LLVM won't optimize out the blocks
end
String size: 5 B (ASCII-only?: true)
    ascii_only?  90.21k ( 11.09µs) (± 0.74%)  15.7kB/op   1.44× slower
ascii_only_128? 130.02k (  7.69µs) (± 5.37%)  15.7kB/op        fastest
{value: true}

String size: 1024 B (ASCII-only?: true)
    ascii_only?   1.34k (744.37µs) (± 2.00%)  656kB/op   7.66× slower
ascii_only_128?  10.28k ( 97.23µs) (± 3.40%)  656kB/op        fastest
{value: true}

String size: 1 B (ASCII-only?: true)
    ascii_only? 186.71k (  5.36µs) (± 1.02%)  7.86kB/op   1.07× slower
ascii_only_128? 199.20k (  5.02µs) (± 0.61%)  7.86kB/op        fastest
{value: true}

String size: 48 B (ASCII-only?: true)
    ascii_only?  23.64k ( 42.30µs) (± 1.15%)  31.2kB/op   3.89× slower
ascii_only_128?  92.06k ( 10.86µs) (± 1.16%)  31.2kB/op        fastest
{value: true}

String size: 0 B (ASCII-only?: true)
    ascii_only? 635.57k (  1.57µs) (± 1.04%)  0.0B/op   1.42× slower
ascii_only_128? 902.90k (  1.11µs) (± 1.38%)  0.0B/op        fastest
{value: true}

String size: 53 B (ASCII-only?: false)
    ascii_only?  32.17k ( 31.08µs) (± 1.76%)  39.1kB/op   2.25× slower
ascii_only_128?  72.51k ( 13.79µs) (± 1.24%)  39.1kB/op        fastest
{value: false}

String size: 53 B (ASCII-only?: false)
    ascii_only?  36.98k ( 27.04µs) (± 0.74%)  39.1kB/op   2.36× slower
ascii_only_128?  87.32k ( 11.45µs) (± 0.87%)  39.1kB/op        fastest
{value: false}

String size: 6 B (ASCII-only?: false)
    ascii_only?  93.38k ( 10.71µs) (± 0.83%)  15.7kB/op   1.36× slower
ascii_only_128? 127.33k (  7.85µs) (± 0.76%)  15.7kB/op        fastest
{value: false}

String size: 3 B (ASCII-only?: false)
    ascii_only? 129.55k (  7.72µs) (± 0.73%)  15.7kB/op   1.09× slower
ascii_only_128? 140.86k (  7.10µs) (± 1.21%)  15.7kB/op        fastest
{value: false}

String size: 52 B (ASCII-only?: true)
    ascii_only?  21.19k ( 47.19µs) (± 0.83%)  39.1kB/op   3.26× slower
ascii_only_128?  69.16k ( 14.46µs) (± 0.53%)  39.1kB/op        fastest
{value: true}

String size: 56 B (ASCII-only?: false)
    ascii_only?  36.88k ( 27.11µs) (± 0.66%)  39.1kB/op   2.13× slower
ascii_only_128?  78.55k ( 12.73µs) (± 0.66%)  39.1kB/op        fastest
{value: false}
1 Like

I ran it on my own computer.

String size: 5 B (ASCII-only?: true)
    ascii_only?  74.89k ( 13.35µs) (± 0.63%)  15.9kB/op   1.16× slower
ascii_only_128?  86.88k ( 11.51µs) (± 0.55%)  15.9kB/op        fastest
{value: true}

String size: 1024 B (ASCII-only?: true)
    ascii_only?   1.80k (555.54µs) (± 0.06%)  656kB/op   5.23× slower
ascii_only_128?   9.41k (106.29µs) (± 0.44%)  656kB/op        fastest
{value: true}

String size: 1 B (ASCII-only?: true)
    ascii_only? 144.13k (  6.94µs) (± 0.13%)  7.91kB/op   1.04× slower
ascii_only_128? 149.47k (  6.69µs) (± 0.14%)  7.91kB/op        fastest
{value: true}

String size: 48 B (ASCII-only?: true)
    ascii_only?  27.11k ( 36.89µs) (± 0.12%)  31.3kB/op   3.57× slower
ascii_only_128?  96.86k ( 10.32µs) (± 0.25%)  31.3kB/op        fastest
{value: true}

String size: 0 B (ASCII-only?: true)
    ascii_only? 883.34k (  1.13µs) (± 0.07%)  0.0B/op   1.11× slower
ascii_only_128? 980.73k (  1.02µs) (± 0.01%)  0.0B/op        fastest
{value: true}

String size: 53 B (ASCII-only?: false)
    ascii_only?  44.50k ( 22.47µs) (± 0.12%)  39.2kB/op   1.79× slower
ascii_only_128?  79.71k ( 12.55µs) (± 0.22%)  39.2kB/op        fastest
{value: false}

String size: 53 B (ASCII-only?: false)
    ascii_only?  35.76k ( 27.96µs) (± 0.05%)  39.2kB/op   2.58× slower
ascii_only_128?  92.23k ( 10.84µs) (± 0.34%)  39.2kB/op        fastest
{value: false}

String size: 6 B (ASCII-only?: false)
    ascii_only?  99.75k ( 10.03µs) (± 0.15%)  15.7kB/op   1.37× slower
ascii_only_128? 137.15k (  7.29µs) (± 0.20%)  15.7kB/op        fastest
{value: false}

String size: 3 B (ASCII-only?: false)
    ascii_only? 130.78k (  7.65µs) (± 0.17%)  15.7kB/op   1.08× slower
ascii_only_128? 140.87k (  7.10µs) (± 0.20%)  15.7kB/op        fastest
{value: false}

String size: 52 B (ASCII-only?: true)
    ascii_only?  28.47k ( 35.12µs) (± 0.08%)  39.2kB/op   2.54× slower
ascii_only_128?  72.44k ( 13.80µs) (± 0.38%)  39.2kB/op        fastest
{value: true}

String size: 56 B (ASCII-only?: false)
    ascii_only?  34.36k ( 29.11µs) (± 0.19%)  39.2kB/op   2.29× slower
ascii_only_128?  78.57k ( 12.73µs) (± 0.21%)  39.2kB/op        fastest
{value: false}

I have used ascii_only? to validate DNA sequence data. I think this speed-up very useful.

1 Like

I was curious about the simdutf lib. I have hardly any experience with C, let alone C++, but that didn’t stop me from trying anyway. :-) It’s probably all wrong, but here’s what I did.

brew install simdutf (on a Mac)

It’s a C++ lib, so it needs a C wrapper.

simdutf_wrapper.h:

#ifndef SIMDUTF_WRAPPER_H
#define SIMDUTF_WRAPPER_H

#include <stddef.h>  // for size_t

#ifdef __cplusplus
extern "C" {
#endif

// declare the wrapper function
int simdutf_validate_ascii( const char* str, size_t len );

#ifdef __cplusplus
}
#endif

#endif // SIMDUTF_WRAPPER_H

simdutf_wrapper.cpp:

#include "simdutf_wrapper.h"
#include <simdutf.h>

int simdutf_validate_ascii( const char* str, size_t len ){
	return simdutf::validate_ascii( str, len );
}

(Now that I think about it, this isn’t a C wrapper but a C++ wrapper. :person_shrugging:)

Then compile that to an object file:
gcc -O3 -I/opt/homebrew/include -std=c++20 -c simdutf_wrapper.cpp -o simdutf_wrapper.o
or
clang++ -std=c++20 -I/opt/homebrew/include -c -O3 simdutf_wrapper.cpp

You can now use that in Crystal:

@[Link(ldflags: "#{__DIR__}/simdutf_wrapper/simdutf_wrapper.o -lsimdutf")]
lib SIMDUTF
  fun simdutf_validate_ascii( str : LibC::Char*, len : LibC::Int ) : LibC::Int
  # are the types LibC::Int or LibC::Long ?
end

class String
  def ascii_only_simdutf? : Bool
    SIMDUTF.simdutf_validate_ascii( to_unsafe, @bytesize ) != 0
  end
end

I don’t want to repeat the whole benchmark code here, but my findings can be summarized like this:

  • For short strings, using simdutf is slower.
  • For long strings with a non-ASCII character near the beginning, using simdutf is slower.
  • Only for long strings (>= 512 B or so) with a non-ASCII character near the end or long ASCII-only strings, using simdutf is faster, but not like night and day, unless you get into the megabytes.

I’ll attribute the slowness not to simdutf itself but to the overhead of calling an external library.

String size: 0 B (ASCII-only?: true, string is: "")
        ascii_only? 317.77k (  3.15µs) (± 1.34%)  0.0B/op   1.42× slower
    ascii_only_128? 452.48k (  2.21µs) (± 1.85%)  0.0B/op        fastest
ascii_only_simdutf? 109.53k (  9.13µs) (± 1.19%)  0.0B/op   4.13× slower
{value: true}

String size: 1 B (ASCII-only?: true, string is: "a")
        ascii_only? 104.73k (  9.55µs) (± 2.02%)  15.6kB/op   1.08× slower
    ascii_only_128? 112.84k (  8.86µs) (± 1.72%)  15.6kB/op        fastest
ascii_only_simdutf?  30.20k ( 33.11µs) (± 2.69%)  15.6kB/op   3.74× slower
{value: true}

String size: 5 B (ASCII-only?: true, string is: "hello")
        ascii_only?  50.94k ( 19.63µs) (± 2.08%)  31.2kB/op   1.55× slower
    ascii_only_128?  78.78k ( 12.69µs) (± 1.92%)  31.2kB/op        fastest
ascii_only_simdutf?  25.88k ( 38.63µs) (± 1.81%)  31.2kB/op   3.04× slower
{value: true}

String size: 1048580 B (ASCII-only?: false, string is: "😎" + "a" * 1_048_576)
        ascii_only?   1.40  (716.31ms) (± 0.98%)  0.98GB/op  19.77× slower
    ascii_only_128?  27.60  ( 36.23ms) (± 1.57%)  0.98GB/op        fastest
ascii_only_simdutf?  16.84  ( 59.37ms) (± 2.02%)  0.98GB/op   1.64× slower
{value: false}

String size: 1048580 B (ASCII-only?: false, string is: "a" * 1_048_576 + "😎")
        ascii_only?   1.44  (694.73ms) (± 1.15%)  0.98GB/op  13.05× slower
    ascii_only_128?  12.11  ( 82.57ms) (± 0.84%)  0.98GB/op   1.55× slower
ascii_only_simdutf?  18.78  ( 53.25ms) (± 1.28%)  0.98GB/op        fastest
{value: false}

String size: 1048576 B (ASCII-only?: true, string is: "a" * 1_048_576)
        ascii_only? 716.79m (  1.40s ) (± 0.26%)  0.98GB/op  27.27× slower
    ascii_only_128?  12.52  ( 79.86ms) (± 0.99%)  0.98GB/op   1.56× slower
ascii_only_simdutf?  19.55  ( 51.15ms) (± 2.79%)  0.98GB/op        fastest
{value: true}

If these numbers are anything to go by, I’d say using simdutf::validate_ascii doesn’t make sense unless you are dealing with very long strings and you are fairly certain that the strings are ASCII-only. The signature would then have to be something like: ascii_only?( probably_ascii : Bool = false ). Not a fan.

Is there overhead there? My understanding is that since Crystal is ABI-compatible with C, calling into an external library should be equivalent in performance to calling a method defined in Crystal.

1 Like

Are you asking me? I really don’t know. Maybe someone more knowledgeable than me can chime in and find my mistakes. :-)

I might add that if I use simdutf::validate_ascii_with_errors instead of simdutf::validate_ascii then that’s faster than my ascii_only_128? implementation for strings longer than 1024 bytes or so, no matter where a non-ASCII character is in the string. So we could have something like:

def ascii_only_combined? : Bool
  return (case @bytesize
    when       0 ; true    # optimization for empty strings
    when .< 1024 ; ascii_only_128?
    else         ; ascii_only_simdutf_with_errors?
  end)
end

The same might be true for simdutf::validate_utf8_with_errors, which also returns the number of codepoints for valid UTF-8-encoded strings.

1 Like

That’s what Rust does by the way, where the actual implementation of is_ascii isn’t on str but on slice (which makes sense to me, because the implementation in Crystal does indeed work on Bytes).

2 Likes

Please continue.
There is at least one person here who would like to see ascii_only? speed up.

2 Likes

String#length may be precomputed at compile time for static strings (to be verified). It will have to be computed at runtime for dynamic strings (i.e. iterate the entire string for utf8 codepoints) and calling it can slow things down, or not as it will only impact the first call (it’s then memoized). Either can skew your benchmark to appear faster or slower.

You’ll want to disassemble the compiled program. It’s possible that LLVM vectorized the loop to use leverage SSE or another CPU feature. It’s also likely to have inlined the calls. That will be faster in the end than calling into a C wrapper to a C++ library, especially for small strings.

2 Likes

Yeah. Thanks for pointing me in the right direction.
I don’t think I’ll find the time to look at any assembler code though, and I wouldn’t really know how to read it.

The next best thing I can think of is compiling the different implementations of ascii_only? and valid_encoding? into an executable each, doing just one iteration in each executable (maybe with a ton of different strings each, all created at runtime), and having the benchmark code spawn processes many times. I think that would make sure the benchmark isn’t skewed by any LLVM optimizations.

I would assume though that this approach doesn’t provide any useful results, as I expect the overhead of process creation to be orders of magnitude greater than the actual method call. But who knows.

Yeah, no that won’t yield any useful results.

1 Like

I’ll throw in yet another implementation that checks multiple bytes at once.

I thought that memory alignment plays a role here, but I don’t know enough about that.

Pointer always assumes correct alignment, so the code could crash on some cpus.
The relevant part is that alignof(UInt8) = 1 while alignof(UInt64) = 8, so you’re just assuming that the Pointer will just have correct alignment.

x86 is a bit special because it’s quite fast at accessing memory from unaligned locations and doesn’t need special instructions for that (at least for regular registers).

I tried doing this in a way that should work with any alignment: Optimized String#ascii_only? ($4772390) · Snippets · GitLab

2 Likes

maybe create a PR or a shards?