I was curious about the simdutf lib. I have hardly any experience with C, let alone C++, but that didn’t stop me from trying anyway. :-) It’s probably all wrong, but here’s what I did.
Install the library (on a Mac):
brew install simdutf
It’s a C++ lib, so it needs a C wrapper.
simdutf_wrapper.h:
#ifndef SIMDUTF_WRAPPER_H
#define SIMDUTF_WRAPPER_H
#include <stddef.h> // for size_t
#ifdef __cplusplus
extern "C" {
#endif
// declare the wrapper function
int simdutf_validate_ascii( const char* str, size_t len );
#ifdef __cplusplus
}
#endif
#endif // SIMDUTF_WRAPPER_H
simdutf_wrapper.cpp:
#include "simdutf_wrapper.h"
#include <simdutf.h>
// simdutf::validate_ascii returns bool; it converts to 1 (valid) or 0 (invalid)
int simdutf_validate_ascii( const char* str, size_t len ){
  return simdutf::validate_ascii( str, len );
}
(Now that I think about it, this isn’t a wrapper written in C but one written in C++; the extern "C" block is what gives it a C-callable interface.)
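To sanity-check a binding like this, a scalar function with the same C signature can be handy. The following is only a sketch: scalar_validate_ascii is a made-up name, it does not use simdutf at all, and it simply mirrors the wrapper's convention of returning 1 for ASCII-only input and 0 otherwise.

```cpp
#include <cstddef>

// Hypothetical scalar reference with the same C signature as the wrapper.
// It does not call simdutf; it only exists to cross-check results.
extern "C" int scalar_validate_ascii(const char* str, std::size_t len) {
    unsigned char seen = 0;
    for (std::size_t i = 0; i < len; ++i) {
        // OR all bytes together so any set high bit survives to the end.
        seen |= static_cast<unsigned char>(str[i]);
    }
    // ASCII bytes never have the high bit (0x80) set.
    return (seen & 0x80) == 0 ? 1 : 0;
}
```

Comparing its result with simdutf_validate_ascii on a few inputs should show whether the pointer and the length arrive on the C++ side as intended.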
Then compile that to an object file:
gcc -O3 -I/opt/homebrew/include -std=c++20 -c simdutf_wrapper.cpp -o simdutf_wrapper.o
or
clang++ -std=c++20 -I/opt/homebrew/include -c -O3 simdutf_wrapper.cpp
You can now use that in Crystal:
@[Link(ldflags: "#{__DIR__}/simdutf_wrapper/simdutf_wrapper.o -lsimdutf")]
lib SIMDUTF
  # len is size_t in the C header, hence LibC::SizeT; the return type is int, hence LibC::Int
  fun simdutf_validate_ascii( str : LibC::Char*, len : LibC::SizeT ) : LibC::Int
end
class String
  def ascii_only_simdutf? : Bool
    SIMDUTF.simdutf_validate_ascii( to_unsafe, @bytesize ) != 0
  end
end
I don’t want to repeat the whole benchmark code here, but my findings can be summarized like this:
- For short strings, using simdutf is slower.
- For long strings with a non-ASCII character near the beginning, using simdutf is slower.
- Only for long strings (>= 512 B or so) with a non-ASCII character near the end or long ASCII-only strings, using simdutf is faster, but not like night and day, unless you get into the megabytes.
I’ll attribute the slowness not to simdutf itself but to the overhead of calling an external library.
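To illustrate why an inline check is hard to beat on short strings or when a non-ASCII byte comes early, here is a rough sketch of a chunked check with early exit. It is not the code that was benchmarked here: ascii_only_chunked is a hypothetical name, and the 8-byte chunk size is just an assumption for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns true if all `len` bytes are below 0x80.
bool ascii_only_chunked(const char* str, std::size_t len) {
    const std::uint64_t high_bits = 0x8080808080808080ULL;
    std::size_t i = 0;
    // Test 8 bytes per iteration and stop at the first chunk with a high bit set.
    for (; i + 8 <= len; i += 8) {
        std::uint64_t chunk;
        std::memcpy(&chunk, str + i, sizeof chunk);  // safe unaligned load
        if (chunk & high_bits) return false;
    }
    // Handle the remaining 0..7 bytes one at a time.
    for (; i < len; ++i) {
        if (static_cast<unsigned char>(str[i]) & 0x80) return false;
    }
    return true;
}
```

With a non-ASCII byte in the first chunk, a loop like this returns after a single comparison, and for a five-byte string it does only a handful of byte tests. In both cases any fixed per-call cost would make up most of the measured time.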
String size: 0 B (ASCII-only?: true, string is: "")
ascii_only? 317.77k ( 3.15µs) (± 1.34%) 0.0B/op 1.42× slower
ascii_only_128? 452.48k ( 2.21µs) (± 1.85%) 0.0B/op fastest
ascii_only_simdutf? 109.53k ( 9.13µs) (± 1.19%) 0.0B/op 4.13× slower
{value: true}
String size: 1 B (ASCII-only?: true, string is: "a")
ascii_only? 104.73k ( 9.55µs) (± 2.02%) 15.6kB/op 1.08× slower
ascii_only_128? 112.84k ( 8.86µs) (± 1.72%) 15.6kB/op fastest
ascii_only_simdutf? 30.20k ( 33.11µs) (± 2.69%) 15.6kB/op 3.74× slower
{value: true}
String size: 5 B (ASCII-only?: true, string is: "hello")
ascii_only? 50.94k ( 19.63µs) (± 2.08%) 31.2kB/op 1.55× slower
ascii_only_128? 78.78k ( 12.69µs) (± 1.92%) 31.2kB/op fastest
ascii_only_simdutf? 25.88k ( 38.63µs) (± 1.81%) 31.2kB/op 3.04× slower
{value: true}
String size: 1048580 B (ASCII-only?: false, string is: "😎" + "a" * 1_048_576)
ascii_only? 1.40 (716.31ms) (± 0.98%) 0.98GB/op 19.77× slower
ascii_only_128? 27.60 ( 36.23ms) (± 1.57%) 0.98GB/op fastest
ascii_only_simdutf? 16.84 ( 59.37ms) (± 2.02%) 0.98GB/op 1.64× slower
{value: false}
String size: 1048580 B (ASCII-only?: false, string is: "a" * 1_048_576 + "😎")
ascii_only? 1.44 (694.73ms) (± 1.15%) 0.98GB/op 13.05× slower
ascii_only_128? 12.11 ( 82.57ms) (± 0.84%) 0.98GB/op 1.55× slower
ascii_only_simdutf? 18.78 ( 53.25ms) (± 1.28%) 0.98GB/op fastest
{value: false}
String size: 1048576 B (ASCII-only?: true, string is: "a" * 1_048_576)
ascii_only? 716.79m ( 1.40s ) (± 0.26%) 0.98GB/op 27.27× slower
ascii_only_128? 12.52 ( 79.86ms) (± 0.99%) 0.98GB/op 1.56× slower
ascii_only_simdutf? 19.55 ( 51.15ms) (± 2.79%) 0.98GB/op fastest
{value: true}
If these numbers are anything to go by, I’d say using simdutf::validate_ascii doesn’t make sense unless you are dealing with very long strings and you are fairly certain that the strings are ASCII-only. The signature would then have to be something like: ascii_only?( probably_ascii : Bool = false ). Not a fan.