This is charconv, a pure Crystal implementation of libiconv. It is intended as a drop-in replacement with no C dependencies and improved performance.
Crystal's String#encode and IO encoding support rely on the system's libiconv, which has some downsides. Besides not being written in Crystal, it brings:

- FFI overhead
- Platform-dependent behavior (macOS iconv vs GNU libiconv vs musl)
- Performance
- Static linking
Performance

| Conversion | charconv | system iconv | Speedup |
|---|---|---|---|
| ASCII → ASCII | 73 µs | 11.9 ms | 162x |
| ISO-8859-1 → UTF-8 | 2.1 ms | 14.2 ms | 6.9x |
| CP1252 → UTF-8 | 2.5 ms | 17.2 ms | 6.9x |
| UTF-8 → ISO-8859-1 | 3.4 ms | 14.6 ms | 4.3x |
| UTF-16BE → UTF-8 | 3.7 ms | 10.8 ms | 2.9x |
| UTF-8 → UTF-16LE | 4.6 ms | 10.1 ms | 2.2x |
I took some advice from some Casey Muratori videos on iconv and writing a terminal. It worked out pretty well.
Encoding Support
Most encodings are supported; they are all listed in the README.md.
One-shot conversion
```crystal
result = CharConv.convert("Hello wörld", "UTF-8", "ISO-8859-1")
```
Doing Crystal on Windows is certainly easier with fewer dependencies, so as a general rule I'm all for replacing external C dependencies with Crystal libraries. Not that iconv has given me any trouble so far…
On the other hand, the number of maintainers able to fix bugs in such a library will certainly go down (compared to libiconv). But that's always the risk for programming languages with a smaller user base.
I will try this out later today; if it works out, I would be very happy to use it in more projects.
And those results are… not in line with yours. Or well, the Crystal ones are mostly in line. Is the Mac running the thing through Rosetta, or what is going on here?
That is a great question. I will have to run this on a server and figure out why the performance is so different. I am surprised the majority flipped to slower. Definitely worth looking into.
There are a lot of massive array literals of tuples; Slice.literal would work better (but it doesn't support tuples as element types yet, so some kind of AoS-to-SoA rewriting is necessary).
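A sketch of the AoS-to-SoA rewrite this comment suggests, assuming a conversion table stored as an array of byte-to-codepoint tuples. The three CP1252 mapping entries are real, but the `MAP_*` names and `lookup` helper are made up for illustration; splitting the tuple array into parallel `Slice.literal` columns lets the compiler embed the data as read-only static memory.

```crystal
# AoS form this would replace:
#   MAP = [{0x80_u8, 0x20AC_u16}, {0x82_u8, 0x201A_u16}, {0x83_u8, 0x0192_u16}]
# SoA form: one Slice.literal per column (primitive element types only).
MAP_BYTES      = Slice.literal(0x80_u8, 0x82_u8, 0x83_u8)
MAP_CODEPOINTS = Slice.literal(0x20AC_u16, 0x201A_u16, 0x0192_u16)

# Look up a codepoint by indexing both columns at the same position.
def lookup(byte : UInt8) : UInt16?
  if idx = MAP_BYTES.index(byte)
    MAP_CODEPOINTS[idx]
  end
end

puts lookup(0x80_u8) # the euro sign, U+20AC
```

Since such tables are typically sorted by byte value, a binary search over the key column would keep lookups fast even for the larger mappings.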
Could you please give me code to run this benchmark on my (latest) Arch Linux laptop?
If this is real, we really should port this shard into Crystal's standard library.
There are writing systems where a single complex character carries the meaning of a whole word. One of them is Kanji (漢字). The CJK world has many variations — in Japanese, Kanji is mixed with Hiragana (ひらがな), Katakana (カタカナ), and even emoji. Encoding in these cultures is complicated. Vendors have historically shipped subtly different implementations under the same name, and libiconv has accumulated edge-case handling. I'm somewhat skeptical that charconv handles all of this correctly.
I hope it does, though…