Why such big differences?

First, I’m not a developer. I have only a shallow knowledge of programming, so please be patient with me.

Second, yeah, English isn’t my first language. Not even my second.

My question is: why are there such big differences in timings between these three functions implementing the Sieve of Eratosthenes, when the main difference between them is the indexing style? I’m sorry if this is a very noob question. I don’t really know when to use which style. It would be great if you could point me to a “Crystal Tips and Tricks” page.

PS: I ran these on my Realme 9 Pro (Snapdragon 695, 6/128 GB) through Termux, with Crystal 1.11.2.

require "bit_array"

# Variant 1: index iterators chained from each_index
def sieve_1(limit : UInt32) : BitArray
  is_prime = BitArray.new(limit + 1, true)
  is_prime[0] = is_prime[1] = false

  is_prime.each_index.skip(4).step(2).each { |i| is_prime[i] = false }
  (3..Math.isqrt(is_prime.size - 1)).each.step(2).select { |x| is_prime[x] }.each do |i|
    is_prime.each_index.skip(i*i).step(2*i).each { |j| is_prime[j] = false }
  end

  is_prime
end

# Variant 2: iterators chained from Range literals
def sieve_2(limit : UInt32) : BitArray
  is_prime = BitArray.new(limit + 1, true)
  is_prime[0] = is_prime[1] = false

  (4...is_prime.size).each.step(2).each { |i| is_prime[i] = false }
  (3..Math.isqrt(is_prime.size - 1)).each.step(2).select { |x| is_prime[x] }.each do |i|
    (i*i...is_prime.size).each.step(2*i).each { |j| is_prime[j] = false }
  end

  is_prime
end

# Variant 3: Number#step, a single step iterator per loop
def sieve_3(limit : UInt32) : BitArray
  is_prime = BitArray.new(limit + 1, true)
  is_prime[0] = is_prime[1] = false

  4.step(to: is_prime.size - 1, by: 2).each { |i| is_prime[i] = false }
  3.step(to: Math.isqrt(is_prime.size - 1), by: 2).select { |x| is_prime[x] }.each do |i|
    (i*i).step(to: is_prime.size - 1, by: 2*i).each { |j| is_prime[j] = false }
  end

  is_prime
end
 
elapsed = Time.measure do
  puts sieve_1(10u32**7).each.count { |x| x }
end
puts elapsed

elapsed = Time.measure do
  puts sieve_2(10u32**7).each.count { |x| x }
end
puts elapsed

elapsed = Time.measure do
  puts sieve_3(10u32**7).each.count { |x| x }
end
puts elapsed

At a quick glance, I suspect that the first two are creating arrays and allocating memory. The last one, on the other hand, does not create an array, so I thought there would be no unnecessary memory allocation. (Sorry if I’m wrong.)
If you are on macOS or Linux, you can use Instruments.app, perf, valgrind, or other common profiling tools to determine the cause.
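
You can also get a rough allocation count from within Crystal itself: Benchmark.memory reports the bytes allocated while a block runs (a minimal sketch, assuming the sieve functions above are defined; a smaller limit keeps it quick):

require "benchmark"

# Prints the number of bytes allocated on the GC heap while each block runs:
puts Benchmark.memory { sieve_1(10u32**6) }
puts Benchmark.memory { sieve_3(10u32**6) }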


Could you tell us what kind of performance differences you see?

I can run the code myself and see what comes out of that. But your results might be different, and then we wouldn’t be talking about the same data.

This is the output on my machine (with --release):

664579
00:00:02.656616357
664579
00:00:01.523873195
664579
00:00:00.043423086

So solutions 1 and 2 are in the same ballpark, but solution 3 is orders of magnitude faster.

I suspect that the nested loop in solution 3 might be easier for the compiler to optimize. It’s essentially just a StepIterator. The other examples have more complex chained iterators that are harder to optimize.

You can even gain a bit more performance by dropping this iterator entirely: instead of step(*args).each { block } you can write step(*args) { block }. This yields to the block directly without allocating an iterator instance. That allows even more code optimizations (and it’s already much faster even without any optimizations), so it’s an instant performance boost.
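
For example (a quick sketch; the bound n is arbitrary):

n = 100
# Allocates a step iterator, then pulls each value through it:
4.step(to: n, by: 2).each { |i| p i }
# Yields each value to the block directly; no iterator instance is allocated:
4.step(to: n, by: 2) { |i| p i }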

There are a couple more unnecessary .each calls in your code. You can drop them all. In most cases it doesn’t make much difference when you already have an iterator (because then it’s a no-op). But sometimes, as mentioned in the previous paragraph, it has a significant effect.
Similarly, .each.count in the measure blocks should just be .count. This ends up using BitArray#count, which is more optimized than Iterator#count.
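
Applied to one of the measure blocks, that would look like this (a sketch; same measurement, just without the extra iterator):

elapsed = Time.measure do
  puts sieve_3(10u32**7).count { |x| x }
end
puts elapsed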


Interesting …

Obviously, the big performance difference comes from the following nested blocks.

# 1
  (3..Math.isqrt(is_prime.size - 1)).each.step(2).select { |x| is_prime[x] }.each do |i|
    is_prime.each_index.skip(i*i).step(2*i).each { |j| is_prime[j] = false }
  end
# 2
  (3..Math.isqrt(is_prime.size - 1)).each.step(2).select { |x| is_prime[x] }.each do |i|
    (i*i...is_prime.size).each.step(2*i).each { |j| is_prime[j] = false }
  end
# 3
  3.step(to: Math.isqrt(is_prime.size - 1), by: 2).select { |x| is_prime[x] }.each do |i|
    (i*i).step(to: is_prime.size - 1, by: 2*i).each { |j| is_prime[j] = false }
  end

I tested without --release: 1 and 2 are similar, but 3 is ~60x faster.

Just from reading the code, there should never be such a HUGE difference. I suspect it’s a Crystal bug?

@extgcd, can you use crenv to install several versions (ideally with a bigger span between them) and test this code again? That would confirm whether it is a regression issue.


You should never run performance comparisons for “production” code without --release. Performance measurements are usually pretty meaningless if you disable optimizations.

Hold your horses. Why do you think there should not be such a huge difference in performance? And why would it be a regression?

The performance characteristics are pretty similar with Crystal 1.2.0 (that’s the first version with Math.isqrt, so I didn’t test any older ones). If anything, performance has improved since then (which is to be expected).

I don’t think there’s anything wrong here. You can express the same algorithm in different ways. Some are less efficient than others. That’s it. The more efficient ones probably benefit from some powerful optimizations which are unavailable to the other ones.


As you said, perhaps measuring is meaningless without --release, but I thought the person who asked the question wanted an answer as to why some versions are less efficient than others.

Sorry, I didn’t read your comment before my previous reply. Some of my thinking was the same as yours: it doesn’t make much difference when you already have an iterator (because then it’s a no-op). But at the very least, that shouldn’t make #3 60x faster; I tested it many times. If you could show proof of where #1 and #2 can’t be optimized but #3 can, that would make more sense.


Well, it seems the performance is definitely related to the excessive each calls.

Compare the outputs below: after I deleted the useless each calls, I got significantly different results.

Left:

╰─ $ cr run 1.cr
664579
00:00:30.760603411
664579
00:00:28.618370458
664579
00:00:00.521438629

Right:

╰─ $ cr run 2.cr
664579
00:00:30.486961448
664579
00:00:00.441668254
664579
00:00:00.441089385

Although the issue is still there: is 60x faster reasonable when removing five useless each calls (supposedly just no-ops)?

Only Iterator#each is a no-op.
Your transformations remove some iterators entirely, which is a completely different level of efficiency.
See my previous comment.
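
To illustrate which each is the no-op (a minimal sketch):

r = (4...100)
iter = r.each     # Range#each without a block allocates a Range::ItemIterator
same = iter.each  # Iterator#each without a block just returns self: the no-op case
r.each { |i| i }  # Range#each with a block iterates directly, no iterator at all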

I consider the iterator conversions to be the key reason.

e.g.

(4...is_prime.size).each.step(2).each

Range(Int32, Int32) → Range::ItemIterator(Int32, Int32) → Iterator::StepByIterator(Range::ItemIterator(Int32, Int32), Int32, Int32) → Iterator::StepByIterator(Range::ItemIterator(Int32, Int32), Int32, Int32)


(4...is_prime.size).step(2)

Range(Int32, Int32) → Steppable::StepIterator(Int32, Int32, Int32)

#1 is the same case.
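
You can watch those conversions happen with typeof (a quick sketch; the exact type names may vary between Crystal versions):

require "bit_array"

is_prime = BitArray.new(10, true)
# Nested iterators: each wrapper type contains the previous one
puts typeof((4...is_prime.size).each.step(2))
# Direct: a single step iterator built straight from the Range
puts typeof((4...is_prime.size).step(2))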

@extgcd mentioned running it on a Snapdragon, which IIRC is a Qualcomm Arm CPU. I don’t have a Linux Arm machine, but running it on the Apple M1, I see an order of magnitude difference in performance between 1 and 2:

664579
00:00:09.103127417
664579
00:00:01.558068791
664579
00:00:00.044070083

One thing that jumps out immediately as a source of slowdown is the use of Enumerable#skip. It works like the OFFSET clause in SQL in that it still has to evaluate that entry even if it isn’t going to do anything with it. This means that even though you have constant-time random access in the BitArray, getting to your starting point happens in linear time.
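
A rough sketch of the difference (sizes are arbitrary):

require "bit_array"

bits = BitArray.new(1_000_000)
# Iterator-based: skip generates and discards the first 999_000 indices one by one
bits.each_index.skip(999_000).each { |j| bits[j] = false }
# Range-based: starts directly at the offset, nothing to iterate past
(999_000...bits.size).each { |j| bits[j] = false }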

Based on @straight-shoota’s results in running this code, the iteration in skip also seems to be much slower on Arm than on Intel/AMD CPUs, but I don’t fully understand why. Checking with Benchmark.memory shows 1.2MB allocated per method invocation so I don’t think differences in the number of Iterator allocations are the problem.

I ran this code through Instruments (the LLVM/Clang profiler in Xcode) and it looks like some of the chained iterators may just be difficult to optimize on Arm. You can see in the screenshot below that the next functions are inlined, but all of the top functions are still nested next calls. The only explanation I can imagine is that Intel/AMD processors might be able to hold more state in the CPU cache and registers than Arm processors in this scenario, but I don’t know if that’s true. And even if it is, there is likely an upper bound on how many chained iterators they can handle before running into the same problem and that upper bound is simply not reached in this scenario.


Sorry for the late reply; these are the timings:

664579
00:00:10.285220777
664579
00:00:02.999019061
664579
00:00:00.072995937

There’s a big difference between 1 and 2. Could this be because aarch64-linux-android isn’t tier 1 (not sure)?

After removing each, the timing for 3 is:

00:00:00.048794375

I’m still getting used to using each. I use it to avoid allocations, but I didn’t expect that it could cause a significant slowdown when not used properly. Most of my toy programs are now as fast as their Rust equivalents. I’ll make a separate post for the others, since I don’t think the slowdowns are related to indexing.

Again, my system is 64-bit Android 13 on a Qualcomm Snapdragon 695 (2 × Arm Cortex-A78 + 6 × Arm Cortex-A55), 6/128 GB.

@jgaskins @extgcd

Have you added --release when running?

Yes. Benchmarks without it aren’t useful, even relative to each other. When optimizations are performed with --release, some operations might be 2x as fast, others might be 20x.

Okay, it seems that there is a significant difference in optimization between the M1 and the AMD Ryzen 7 7840HS.

original version:

 ╰─ $ cr build 1.cr --release

 ╰─ $ ./1
664579
00:00:02.549920594
664579
00:00:01.223750872
664579
00:00:00.035162855

Yep, that’s what I was saying.

Yes, of course. And I’m actually surprised that my timings are near @jgaskins’s timings. The SD 695 is an old, lower-midrange Arm SoC. I was expecting the M1 to be much faster.

Hehe, the M1 is actually a year older than the SD 695. And 2 of the 3 timings were ~2x as fast on the M1, which seems about right. I don’t know why the third wasn’t at the same ratio, but if it had to keep going back to RAM a lot, that latency could explain at least part of it. Otherwise, I’m not sure.
