First, I’m not a developer. I have only a shallow knowledge of programming, so please be patient with me.
Second, yeah, English isn’t my first language. Not even second.
My question is: why are there big differences in timings between these three functions implementing the Sieve of Eratosthenes, when the main difference is the indexing style(?). I’m sorry if this is a very noob question. I don’t really know when to use which style. It would be great if you could point me to a “Crystal Tips and Tricks” page.
PS: I ran these on my Realme 9 Pro (Snapdragon 695, 6/128) through Termux, with Crystal 1.11.2.
At a quick glance, I suspect that the first two are creating arrays and allocating memory. The last one, on the other hand, does not create an array, so I thought there would be no unnecessary memory allocation. (Sorry if I’m wrong)
If you are on Mac or Linux, you can use Instruments.app, perf, valgrind, or other common tools to determine the cause.
So solutions 1 and 2 are in the same ballpark, but solution 3 is orders of magnitude faster.
I suspect that the nested loop in solution 3 might be easier for the compiler to optimize. It’s essentially just a StepIterator. The other examples have more complex chained iterators that are harder to optimize.
You can even gain a bit more performance by dropping this iterator entirely: instead of step(*args).each { block } you can write step(*args) { block }. This yields to the block directly without allocating an iterator instance. That allows even more code optimizations (it’s already much faster even without any optimizations), so it’s an instant performance boost.
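To make the two forms concrete, here is a minimal sketch (not the code from the original post; `limit`, `sieve`, and `i` are made-up stand-ins):

```crystal
require "bit_array"

limit = 100
sieve = BitArray.new(limit + 1, true)
i = 3

# Iterator style: allocates a StepIterator, then drives it with #each.
(i * i).step(to: limit, by: i).each { |j| sieve[j] = false }

# Block style: yields straight into the block, no iterator instance at all.
(i * i).step(to: limit, by: i) { |j| sieve[j] = false }
```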
There are a couple more unnecessary .each calls in your code. You can drop them all. In most cases it doesn’t make much difference when you already have an iterator (because then it’s a no-op). Sometimes, as mentioned in the previous paragraph, it has a significant effect.
Similarly, .each.count in the measure blocks should also be just .count. This ends up using BitArray#count which is more optimized than Iterator#count.
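For illustration, a hedged sketch with a stand-in BitArray (not the original measure block):

```crystal
require "bit_array"

# Stand-in for the BitArray used in the measure blocks.
sieve = BitArray.new(1_000, true)

puts sieve.each.count(true) # Iterator#count: pulls every element through the iterator
puts sieve.count(true)      # resolves to the BitArray's own #count, as noted above
```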
I tested without --release; 1 and 2 are similar, but 3 is about 60x faster.
Just from reading the code, there should never be such a HUGE difference, so I suspect it’s a Crystal bug?
@extgcd, can you use crenv to install several versions (it’s better to cover a bigger span) and test the code again? That would confirm whether this is a regression.
You should never run performance comparisons for “production” code without --release. Performance measuring is usually pretty meaningless if you disable optimizations.
Hold your horses. Why do you think there should not be such a huge difference in performance? And why would it be a regression?
The performance characteristics are pretty similar with Crystal 1.2.0 (that’s the first version with Math.isqrt, so I didn’t test any older ones). If anything, performance has improved since then (which is to be expected).
I don’t think there’s anything wrong here. You can express the same algorithm in different ways. Some are less efficient than others. That’s it. The more efficient ones probably benefit from some powerful optimizations which are unavailable to the other ones.
As you said, measuring is perhaps meaningless without --release, but I thought the person who asked the question wanted an answer as to why some are less efficient than others.
Sorry, I didn’t read your comment before my previous reply. Some of my thinking was the same as yours: it doesn’t make much difference when you already have an iterator (because then it’s a no-op). But at the very least, #3 shouldn’t be 60x faster; I tested it many times. If you could show proof of, e.g., where #1 and #2 can’t be optimized but #3 can, that would make more sense.
Only Iterator#each is a no-op.
Your transformations remove some iterators entirely which is a completely different level of efficiency.
See my previous comment:
@extgcd mentioned running it on a Snapdragon, which IIRC is a Qualcomm Arm CPU. I don’t have a Linux Arm machine, but running it on the Apple M1, I see an order of magnitude difference in performance between 1 and 2:
One thing that jumps out immediately as a source of slowdown is the use of Enumerable#skip. It works like the OFFSET clause in SQL in that it still has to evaluate that entry even if it isn’t going to do anything with it. This means that even though you have constant-time random access in the BitArray, getting to your starting point happens in linear time.
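A rough sketch of the cost difference (made-up sizes, not the original code):

```crystal
require "bit_array"

sieve = BitArray.new(1_000_000, true)
start = 500_000

# Via skip: the iterator still has to step through the first `start` elements
# one by one before it yields anything, so just reaching `start` is linear work.
count_via_skip = sieve.each.skip(start).count(true)

# Via indexing: BitArray random access is constant time, so we can jump
# straight to `start` and only touch the elements we actually care about.
count_via_index = 0
start.upto(sieve.size - 1) { |i| count_via_index += 1 if sieve[i] }

puts count_via_skip  # => 500000
puts count_via_index # => 500000
```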
Based on @straight-shoota’s results from running this code, the iteration in skip also seems to be much slower on Arm than on Intel/AMD CPUs, but I don’t fully understand why. Checking with Benchmark.memory shows 1.2MB allocated per method invocation, so I don’t think differences in the number of Iterator allocations are the problem.
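Roughly, that kind of measurement can be reproduced like this (run_sieve is a hypothetical stand-in for one of the three methods, not the original code):

```crystal
require "benchmark"
require "bit_array"

# Hypothetical stand-in for one of the three sieve methods; details differ from the OP's code.
def run_sieve(limit)
  sieve = BitArray.new(limit + 1, true)
  2.step(to: Math.isqrt(limit)) do |i|
    next unless sieve[i]
    (i * i).step(to: limit, by: i) { |j| sieve[j] = false }
  end
  sieve
end

# Benchmark.memory reports the number of bytes allocated while the block runs.
puts Benchmark.memory { run_sieve(1_000_000) }
```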
I ran this code through Instruments (the LLVM/Clang profiler in Xcode) and it looks like some of the chained iterators may just be difficult to optimize on Arm. You can see in the screenshot below that the next functions are inlined, but all of the top functions are still nested next calls. The only explanation I can imagine is that Intel/AMD processors might be able to hold more state in the CPU cache and registers than Arm processors in this scenario, but I don’t know if that’s true. And even if it is, there is likely an upper bound on how many chained iterators they can handle before running into the same problem and that upper bound is simply not reached in this scenario.
There’s a big difference between 1 and 2. Could this be because aarch64-linux-android isn’t tier 1 (not sure)?
After removing ‘each’, the timing for 3 is
00:00:00.048794375
I’m still getting used to using ‘each’. I use it to avoid allocations, but I didn’t expect that it could cause a significant slowdown when not used properly. Most of my toy programs are now as fast as their Rust equivalents. I’ll make a separate post for the others because I don’t think the slowdowns are related to indexing.
Again, my system is 64 bit Android 13, Qualcomm Snapdragon 695 (2 x Arm Cortex A78 + 6 x Arm Cortex A55), 6 / 128.
Yes. Benchmarks without it aren’t useful, even relative to each other. When optimizations are performed with --release, some operations might be 2x as fast, others might be 20x.
Yes, of course. And I’m actually surprised that my timings are close to @jgaskins’ timings. The SD 695 is an older lower-midrange Arm SoC; I was expecting the M1 to be much faster.
Hehe, the M1 is actually a year older than the SD 695. And 2 of the 3 timings were ~2x as fast on the M1, which seems about right. I don’t know why the third wasn’t at the same ratio, but if it had to keep going back to RAM a lot, that latency could explain at least part of it. Otherwise, I’m not sure.