A blog article on optimizing counting lines (using Crystal examples)

Hola folks. I’ve just pushed out another blog post, here’s the link:

I really enjoyed the feedback and discussion on my last post, and looking forward to your thoughts and comments this time as well.

2024-03-03 Updated article based on some feedback.


Great article!

The cast to a pointer of a larger type of int is a particularly clever way of levereaging the hardware to reduce the user time. I just did a small test, and it significantly improved the speed of writing to a buffer in a tight loop. This is definitely a trick that I will be keeping on my sleeve!

IMO this doesn’t really answer an important question: is wc truly “simple” compared to our Crystal code? Why can’t wc simply do what we have written, except expressed in C code?

CRYSTAL_WORKERS is hardwired set to 4.
In your code this doesn’t change the number of workers from 4.

# Number of concurrent workers

Even this won’t change the number of active workers from 4.
But it will set NUM_PAR to them.

# Number of concurrent workers
NUM_PAR = (ENV["CRYSTAL_WORKERS"]? || System.cpu_count).to_i

Are you sure you are setting CRYSTAL_WORKERS to the number of available system threads,
without having to externally do this before running the code:

$ CRYSTAL_WORKERS=XX ./your_program

316.5 ms


The flip side of “if it ain’t broke”…

At other times, things happen. Someone thought that grep could have a better user interface and created ack, it got reimplemented as ag in C (with some tricks up it sleave), which got reimplemented as pt in Go, which in turn got reimplemented as rg in Rust (with even more tricks)…

But the side effect of this little arms race is that I don’t bother with tag files or source code search engines anymore. With SSD disks, I just rg whenever I need to find anything. Faster searching has basically obsoleted an entire type of tools for me.

And that’s just by optimizing an existing tool, so @nogginly, do carry on. :grin:

(oh, and I wonder if rg is faster than wc. If I could figure out how to make it count newlines…)


My main gripe is how it frequently compares performance against wc (presumably GNU’s) without actually trying to rank wc’s source complexity against the Crystal snippets. For reference, it has a heuristic that selects between a plain loop and rawmemchr (not even the memchr mentioned in the article), and there is a whole AVX2-specific variant which pushes the “long words” idea even further. Thus comparisons against wc are not fair without accounting for those differences, and only comparisons against the first “simple line counter” are.

Good point. I used it as a baseline since it was so well known and nothing more. I’m posted an update with a “Furthermore” section at the end where I try to address this.

@jzakiya, you are correct in that the program does not, and cannot, modify the number of threads used as workers by the Crystal runtime. I’ve updated the post to clarify that I set certain env vars before running the benchmarks. I intentionally rely on CRYSTAL_WORKERS within the programs since that tells me how many “real threads” are running to ensure I didn’t “oversaturate” the threads with more fibres than threads.

Thanks for finding this. I’ll admit, my goal (as I’ve now updated the post to reflect) was not to denigrate wc in any way. It was a convenient baseline as something everyone would know, and it provided me with a line in the sand to compare the naive slow initial versions and the later faster versions.

This was a great read, thank you so much for sharing!