The cast to a pointer to a larger integer type is a particularly clever way of leveraging the hardware to reduce the user time. I just did a small test, and it significantly improved the speed of writing to a buffer in a tight loop. This is definitely a trick that I will be keeping up my sleeve!
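For readers who want to see the "long words" idea outside Crystal, here is a minimal C sketch of counting newlines eight bytes at a time. It is not the article's code: the function name `count_newlines_wordwise` is made up for illustration, and it uses `memcpy` into a `uint64_t` rather than a raw pointer cast, since a direct cast can run into alignment and aliasing trouble in C. Compilers typically turn that `memcpy` into a single load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: counts '\n' bytes in buf[0..len) one 64-bit word at a time. */
size_t count_newlines_wordwise(const char *buf, size_t len) {
    assert(buf != NULL);

    const uint64_t ones = 0x0101010101010101ULL;   /* 0x01 in every byte */
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;   /* 0x7F in every byte */
    const uint64_t nl   = ones * (uint64_t)'\n';   /* '\n' in every byte */

    size_t count = 0;
    size_t i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, 8);          /* safe unaligned 8-byte load */

        uint64_t x = w ^ nl;             /* bytes equal to '\n' become 0x00 */
        /* t has 0x80 in each byte position where x is zero, 0x00 elsewhere */
        uint64_t t = ~(((x & low7) + low7) | x | low7);
        /* move each flag to bit 0 of its byte, then sum the eight bytes */
        count += (size_t)(((t >> 7) * ones) >> 56);
    }

    /* leftover tail of fewer than 8 bytes, one byte at a time */
    for (; i < len; i++) {
        count += (buf[i] == '\n');
    }
    return count;
}
```

Because the function only counts matches and never reports positions, byte order does not matter, so the same code should behave identically on little-endian and big-endian machines.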
IMO this doesn’t really answer an important question: is wc truly “simple” compared to our Crystal code? Why can’t wc simply do what we have written, except expressed in C code?
At other times, things happen. Someone thought that grep could have a better user interface and created ack, it got reimplemented as ag in C (with some tricks up its sleeve), which got reimplemented as pt in Go, which in turn got reimplemented as rg in Rust (with even more tricks)…
But the side effect of this little arms race is that I don’t bother with tag files or source code search engines anymore. With SSDs, I just rg whenever I need to find anything. Faster searching has basically obsoleted an entire class of tools for me.
And that’s just by optimizing an existing tool, so @nogginly, do carry on.
(oh, and I wonder if rg is faster than wc. If I could figure out how to make it count newlines…)
My main gripe is that the article frequently compares performance against wc (presumably GNU’s) without actually trying to rank wc’s source complexity against the Crystal snippets. For reference, wc has a heuristic that selects between a plain loop and rawmemchr (not even the memchr mentioned in the article), and there is a whole AVX2-specific variant which pushes the “long words” idea even further. Thus comparisons against wc are not fair unless they account for those differences; only the comparisons against the first “simple line counter” are fair as they stand.
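To give a rough idea of what such a heuristic can look like, here is a small C sketch that switches between a plain byte loop and a `memchr` scan depending on how long the lines in the previous chunk were. This is not GNU wc’s actual code: the names, the chunking and the `LONG_LINE_THRESHOLD` value are all assumptions made for illustration, and it uses standard `memchr` rather than the GNU-specific `rawmemchr`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff: average bytes per line above which memchr is used. */
#define LONG_LINE_THRESHOLD 15

/* Plain loop: cheap per byte, good when newlines are frequent. */
static size_t count_plain(const char *buf, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        n += (buf[i] == '\n');
    }
    return n;
}

/* memchr scan: has per-call overhead, but skips long lines quickly. */
static size_t count_memchr(const char *buf, size_t len) {
    size_t n = 0;
    const char *p = buf;
    const char *end = buf + len;
    while (p < end && (p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        n++;
        p++;   /* continue just past the newline that was found */
    }
    return n;
}

/* Processes buf in chunks and picks the strategy for each chunk from the
 * average line length observed in the previous one. */
size_t count_newlines_adaptive(const char *buf, size_t len, size_t chunk) {
    assert(buf != NULL && chunk > 0);

    size_t total = 0;
    int long_lines = 0;   /* start with the plain loop */

    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? (len - off) : chunk;
        size_t found = long_lines ? count_memchr(buf + off, n)
                                  : count_plain(buf + off, n);
        total += found;
        /* "long lines" means at least LONG_LINE_THRESHOLD bytes per newline */
        long_lines = (found * LONG_LINE_THRESHOLD <= n);
    }
    return total;
}
```

Both strategies return the same count; the choice only affects speed, which is why this kind of detail makes wc less “simple” than a single loop suggests.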
Good point. I used it as a baseline since it was so well known and nothing more. I’ve posted an update with a “Furthermore” section at the end where I try to address this.
@jzakiya, you are correct in that the program does not, and cannot, modify the number of threads used as workers by the Crystal runtime. I’ve updated the post to clarify that I set certain env vars before running the benchmarks. I intentionally rely on CRYSTAL_WORKERS within the programs since that tells me how many “real threads” are running to ensure I didn’t “oversaturate” the threads with more fibres than threads.
Thanks for finding this. I’ll admit, my goal (as I’ve now updated the post to reflect) was not to denigrate wc in any way. It was a convenient baseline as something everyone would know, and it provided me with a line in the sand to compare the naive slow initial versions and the later faster versions.