Why such big differences?

First, I’m not a developer. I have only a shallow knowledge of programming, so please be patient with me.

Second, yeah, English isn’t my first language. Not even my second.

My question is: why are there such big differences in timings between these three functions implementing the Sieve of Eratosthenes, when the main difference between them is the indexing style? I’m sorry if this is a very noob question. I don’t really know when to use which style. It would be great if you could point me to a “Crystal Tips and Tricks” page.

PS: I ran these on my Realme 9 Pro (Snapdragon 695, 6/128 GB) through Termux, with Crystal 1.11.2.

require "bit_array"

# Variant 1: index iterators chained from each_index
def sieve_1(limit : UInt32) : BitArray
  is_prime = BitArray.new(limit + 1, true)
  is_prime[0] = is_prime[1] = false

  is_prime.each_index.skip(4).step(2).each { |i| is_prime[i] = false }
  (3..Math.isqrt(is_prime.size - 1)).each.step(2).select { |x| is_prime[x] }.each do |i|
    is_prime.each_index.skip(i*i).step(2*i).each { |j| is_prime[j] = false }
  end

  is_prime
end

# Variant 2: iterators chained from Range literals
def sieve_2(limit : UInt32) : BitArray
  is_prime = BitArray.new(limit + 1, true)
  is_prime[0] = is_prime[1] = false

  (4...is_prime.size).each.step(2).each { |i| is_prime[i] = false }
  (3..Math.isqrt(is_prime.size - 1)).each.step(2).select { |x| is_prime[x] }.each do |i|
    (i*i...is_prime.size).each.step(2*i).each { |j| is_prime[j] = false }
  end

  is_prime
end

# Variant 3: Number#step, a single step iterator per loop
def sieve_3(limit : UInt32) : BitArray
  is_prime = BitArray.new(limit + 1, true)
  is_prime[0] = is_prime[1] = false

  4.step(to: is_prime.size - 1, by: 2).each { |i| is_prime[i] = false }
  3.step(to: Math.isqrt(is_prime.size - 1), by: 2).select { |x| is_prime[x] }.each do |i|
    (i*i).step(to: is_prime.size - 1, by: 2*i).each { |j| is_prime[j] = false }
  end

  is_prime
end
 
elapsed = Time.measure do
  puts sieve_1(10u32**7).each.count { |x| x }
end
puts elapsed

elapsed = Time.measure do
  puts sieve_2(10u32**7).each.count { |x| x }
end
puts elapsed

elapsed = Time.measure do
  puts sieve_3(10u32**7).each.count { |x| x }
end
puts elapsed

At a quick glance, I suspect that the first two are creating arrays and allocating memory. The last one, on the other hand, does not create an array, so I thought there would be no unnecessary memory allocation. (Sorry if I’m wrong.)
If you are on macOS or Linux, you can use Instruments.app, perf, valgrind, or other common profiling tools to determine the cause.
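
You can also get a rough allocation count from within Crystal itself: Benchmark.memory reports the bytes allocated while a block runs (a minimal sketch, assuming the sieve functions above are defined; a smaller limit keeps it quick):

require "benchmark"

# Prints the number of bytes allocated on the GC heap while each block runs:
puts Benchmark.memory { sieve_1(10u32**6) }
puts Benchmark.memory { sieve_3(10u32**6) }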


Could you tell us what kind of performance differences you see?

I can run the code myself and see what comes out of that. But your results might be different, and then we wouldn’t be talking about the same data.

This is the output on my machine (with --release):

664579
00:00:02.656616357
664579
00:00:01.523873195
664579
00:00:00.043423086

So solutions 1 and 2 are in the same ballpark, but solution 3 is orders of magnitude faster.

I suspect that the nested loop in solution 3 might be easier for the compiler to optimize. It’s essentially just a StepIterator. The other examples have more complex chained iterators that are harder to optimize.

You can even gain a bit more performance by dropping this iterator entirely: instead of step(*args).each { block } you can write step(*args) { block }. This yields to the block directly without allocating an iterator instance. That allows even more code optimizations (and it’s already much faster even without any optimizations), so it’s an instant performance boost.
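
For example (a quick sketch; the bound n is arbitrary):

n = 100
# Allocates a step iterator, then pulls each value through it:
4.step(to: n, by: 2).each { |i| p i }
# Yields each value to the block directly; no iterator instance is allocated:
4.step(to: n, by: 2) { |i| p i }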

There are a couple more unnecessary .each calls in your code. You can drop them all. In most cases it doesn’t make much difference when you already have an iterator (because then it’s a no-op). But sometimes, as mentioned in the previous paragraph, it has a significant effect.
Similarly, .each.count in the measure blocks should just be .count. This ends up using BitArray#count, which is more optimized than Iterator#count.
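
Applied to one of the measure blocks, that would look like this (a sketch; same measurement, just without the extra iterator):

elapsed = Time.measure do
  puts sieve_3(10u32**7).count { |x| x }
end
puts elapsed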


Interesting …

Obviously, the big performance difference comes from the following nested blocks.

# 1
  (3..Math.isqrt(is_prime.size - 1)).each.step(2).select { |x| is_prime[x] }.each do |i|
    is_prime.each_index.skip(i*i).step(2*i).each { |j| is_prime[j] = false }
  end
# 2
  (3..Math.isqrt(is_prime.size - 1)).each.step(2).select { |x| is_prime[x] }.each do |i|
    (i*i...is_prime.size).each.step(2*i).each { |j| is_prime[j] = false }
  end
# 3
  3.step(to: Math.isqrt(is_prime.size - 1), by: 2).select { |x| is_prime[x] }.each do |i|
    (i*i).step(to: is_prime.size - 1, by: 2*i).each { |j| is_prime[j] = false }
  end

I tested without --release: 1 and 2 are similar, but 3 is ~60x faster.

Just from reading the code, there should never be such a HUGE difference. I suspect it’s a Crystal bug?

@extgcd, can you use crenv to install several versions (ideally with a bigger span between them) and test this code again? That would confirm whether it is a regression issue.


You should never run performance comparisons for “production” code without --release. Performance measurements are usually pretty meaningless if you disable optimizations.

Hold your horses. Why do you think there should not be such a huge difference in performance? And why would it be a regression?

The performance characteristics are pretty similar with Crystal 1.2.0 (that’s the first version with Math.isqrt, so I didn’t test any older ones). If anything, performance has improved since then (which is to be expected).

I don’t think there’s anything wrong here. You can express the same algorithm in different ways. Some are less efficient than others. That’s it. The more efficient ones probably benefit from some powerful optimizations which are unavailable to the other ones.


As you said, perhaps measuring is meaningless without --release, but I thought the person who asked the question wanted an answer as to why some versions are less efficient than others.

Sorry, I didn’t read your comment before my previous reply. Some of my thinking was the same as yours: it doesn’t make much difference when you already have an iterator (because then it’s a no-op). But at the very least, that shouldn’t make #3 60x faster; I tested it many times. If you could show proof of where #1 and #2 can’t be optimized but #3 can, that would make more sense.


Well, it seems the performance is definitely related to the excessive each calls.

Compare the outputs below: after I deleted the useless each calls, I got significantly different results.

Left:

╰─ $ cr run 1.cr
664579
00:00:30.760603411
664579
00:00:28.618370458
664579
00:00:00.521438629

Right:

╰─ $ cr run 2.cr
664579
00:00:30.486961448
664579
00:00:00.441668254
664579
00:00:00.441089385

Although the issue is still there: is 60x faster reasonable when removing five useless each calls (supposedly just no-ops)?

Only Iterator#each is a no-op.
Your transformations remove some iterators entirely, which is a completely different level of efficiency.
See my previous comment.
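
To illustrate which each is the no-op (a minimal sketch):

r = (4...100)
iter = r.each     # Range#each without a block allocates a Range::ItemIterator
same = iter.each  # Iterator#each without a block just returns self: the no-op case
r.each { |i| i }  # Range#each with a block iterates directly, no iterator at all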

I consider the iterator conversions to be the key reason.

e.g.

(4...is_prime.size).each.step(2).each

Range(Int32, Int32) → Range::ItemIterator(Int32, Int32) → Iterator::StepByIterator(Range::ItemIterator(Int32, Int32), Int32, Int32) → Iterator::StepByIterator(Range::ItemIterator(Int32, Int32), Int32, Int32)


(4...is_prime.size).step(2)

Range(Int32, Int32) → Steppable::StepIterator(Int32, Int32, Int32)

#1 is the same case.
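
You can watch those conversions happen with typeof (a quick sketch; the exact type names may vary between Crystal versions):

require "bit_array"

is_prime = BitArray.new(10, true)
# Nested iterators: each wrapper type contains the previous one
puts typeof((4...is_prime.size).each.step(2))
# Direct: a single step iterator built straight from the Range
puts typeof((4...is_prime.size).step(2))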

@extgcd mentioned running it on a Snapdragon, which IIRC is a Qualcomm Arm CPU. I don’t have a Linux Arm machine, but running it on the Apple M1, I see an order of magnitude difference in performance between 1 and 2:

664579
00:00:09.103127417
664579
00:00:01.558068791
664579
00:00:00.044070083

One thing that jumps out immediately as a source of slowdown is the use of Enumerable#skip. It works like the OFFSET clause in SQL in that it still has to evaluate that entry even if it isn’t going to do anything with it. This means that even though you have constant-time random access in the BitArray, getting to your starting point happens in linear time.
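
A rough sketch of the difference (sizes are arbitrary):

require "bit_array"

bits = BitArray.new(1_000_000)
# Iterator-based: skip generates and discards the first 999_000 indices one by one
bits.each_index.skip(999_000).each { |j| bits[j] = false }
# Range-based: starts directly at the offset, nothing to iterate past
(999_000...bits.size).each { |j| bits[j] = false }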

Based on @straight-shoota’s results in running this code, the iteration in skip also seems to be much slower on Arm than on Intel/AMD CPUs, but I don’t fully understand why. Checking with Benchmark.memory shows 1.2MB allocated per method invocation so I don’t think differences in the number of Iterator allocations are the problem.

I ran this code through Instruments (the LLVM/Clang profiler in Xcode) and it looks like some of the chained iterators may just be difficult to optimize on Arm. You can see in the screenshot below that the next functions are inlined, but all of the top functions are still nested next calls. The only explanation I can imagine is that Intel/AMD processors might be able to hold more state in the CPU cache and registers than Arm processors in this scenario, but I don’t know if that’s true. And even if it is, there is likely an upper bound on how many chained iterators they can handle before running into the same problem and that upper bound is simply not reached in this scenario.


Sorry for the late reply; these are the timings:

664579
00:00:10.285220777
664579
00:00:02.999019061
664579
00:00:00.072995937

There’s a big difference between 1 and 2. Could this be because aarch64-linux-android isn’t tier 1 (not sure)?

After removing each, the timing for 3 is:

00:00:00.048794375

I’m still getting used to using each. I use it to avoid allocations, but I didn’t expect that it could cause a significant slowdown when not used properly. Most of my toy programs are now as fast as their Rust equivalents. I’ll make a separate post for the others, since I don’t think the slowdowns are related to indexing.

Again, my system is 64-bit Android 13 on a Qualcomm Snapdragon 695 (2 × Arm Cortex-A78 + 6 × Arm Cortex-A55), 6/128 GB.

@jgaskins @extgcd

Have you added --release when running?

Yes. Benchmarks without it aren’t useful, even relative to each other. When optimizations are performed with --release, some operations might be 2x as fast, others might be 20x.

Okay, it seems that there is a significant difference in optimization between the M1 and the AMD Ryzen 7 7840HS.

original version:

 ╰─ $ cr build 1.cr --release

 ╰─ $ ./1
664579
00:00:02.549920594
664579
00:00:01.223750872
664579
00:00:00.035162855

Yep, that’s what I was saying.

Yes, of course. And I’m actually surprised that my timings are near @jgaskins’s timings. The SD 695 is an old, lower-midrange Arm SoC. I was expecting the M1 to be much faster.

Hehe, the M1 is actually a year older than the SD 695. And 2 of the 3 timings were ~2x as fast on the M1, which seems about right. I don’t know why the third wasn’t at the same ratio, but if it had to keep going back to RAM a lot, that latency could explain at least part of it. Otherwise, I’m not sure.
