How to make this faster?

I have a toy program for generating first N 100-smooth numbers (i.e. each has all its prime factors less than 100). The Crystal version runs in about 1.2s while the Rust version runs in less than .5s. So rust version is more than 2x faster. My other programs that use arrays(sieve, totients, etc) have around the same perf as their Rust versions but this one isn’t.

Crystal

def get_100_smooths(limit : UInt32) : Array(UInt32)
    primes = Array(UInt32){2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97}
    sp_idxs = Array.new(primes.size, 0)
    cands = primes.dup
    smooths = Array.new(limit.to_i32 + 1, 1u32)
  
    1.upto(to: smooths.size - 1) do |si|
       smooths[si] = cands.min
   
       0.upto(to: cands.size - 1).select { |ci| cands[ci] == smooths[si]}.each do |ci|
          sp_idxs[ci] += 1
          cands[ci] = smooths[sp_idxs[ci]] * primes[ci]
       end
   end
  
   smooths
end
elapsed = Time.measure do
   puts get_100_smooths(5_000_000u32).last
end
puts elapsed

Rust

fn get_100_smooths(lim: u32) -> Vec<u32> {                                                                                       
      const PRIMES: [u32; 25] = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97];   
      let mut sp_idxs = [0usize; PRIMES.len()];
      let mut cands = PRIMES;
      let mut smooths = vec![1u32; (lim + 1) as usize];
   
      (1..smooths.len()).for_each(|si| {
            smooths[si] = *cands.iter().min().unwrap();
    
            (0..cands.len()).for_each(|ci| {
                  if cands[ci] == smooths[si] {
                        sp_idxs[ci] += 1;
                        cands[ci] = smooths[sp_idxs[ci]] * PRIMES[ci];
                  }
             });
      });

      smooths
}
fn main() {
      let timer = std::time::Instant::now();
      let smooths = get_100_smooths(5_000_000);
  
      println!("{:?}", smooths.last());
      println!("{:?}", timer.elapsed());
}

These were ran on a aarch64-linux-android through termux.

Maybe unrelated:
When I compile a rust version (this or the toy programs) compilation takes less than 2s. But with Crystal it takes about 15s. The part “Codegen (bc + obj)” takes about 13s. Is that normal? These are small Crystal programs and 15s doesn’t seem appropriate.

I’m not very experienced with Rust, but it looks like your Rust implementation is using some array types, with a constant, static size. The equivalent in Crystal would probably be StaticArray (I’m not entirely sure about the semantic details in Rust).

Another difference in the implementation is that the inner loop uses an interator in Crystal, while in Rust it’s just a for_each. Using the same shape in Crystal would avoid allocating an iterator, thus free up a lot of time spent allocating and GCing.

This is similar to what I wrote in Why so big differences? - #3 by straight-shoota

If you want best performance, try to avoid Iterator if possible.

1 Like

I test on my local laptop (AMD Ryzen 7 7840HS), Rust version slower than Crystal version.

 ╰─ $ cr build --release a.cr

 ╰─ $ ./a 
2998212735
00:00:00.664267188
 ╰─ $ rustc -C target-cpu=native a.rs -o b
 ╰─ $ ./b
Some(2998212735)
2.838247485s

Is it possible use Rust compiler is incorrect?

I removed ‘select’ from the inner loop and time now is around Rust’s . I also removed it from some of my codes and some had noticeable perf improvement while others didn’t (like sieve, maybe because its just checking for truthiness). It’s a pity that I cannot use Iterators more freely since they can really shorten code(reduce locs) while being clear. Thanks for your time!

I compiled rust with ‘cargo clean && cargo r -r’. And it also passes ‘cargo t’, without ‘-r’. That difference is really weird.

PS:
I’m not sure but I don’t think I see an optimization flag in your Rust compile command.

FYI. I made your original Crystal code almost twice as fast on my system.
Linux kernel 6.5.13, Crystal 1.11.2, AMD Ryzen 9 5900HX

Here’s a more optimized Crystal version.

def get_100_smooths(limit : UInt32) : Array(UInt32)
  primes  = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97] of UInt32
  sp_idxs = Array.new(primes.size, 0)
  cands   = primes.dup
  smooths = Array.new(limit.to_i32 + 1, 1u32)

  (1...smooths.size).each do |si|
    smooths[si] = cands.min

    cands.size.times do |ci|
      if cands[ci] == smooths[si]
        sp_idxs[ci] += 1
        cands[ci] = smooths[sp_idxs[ci]] * primes[ci]
  end end end
  smooths
end

elapsed = Time.measure do
   puts get_100_smooths(5_000_000u32).last
end
puts elapsed

Typical time differences.

➜  crystal run --release forumcode_orig.cr 
2998212735
00:00:00.654938502

➜  crystal run --release forumcode_opt.cr
2998212735
00:00:00.385553579
1 Like

This is about the same as my modified code after following @straight-shoota’s suggestions. I removed ‘select’ and used an ‘if’. I also used a StaticArray for ‘primes’, ‘sp_idxs’ and ‘cands’. Your version runs in around 0.7s while my modified version runs in around 0.5s. It’s really a pity Iterators aren’t as optimized as their imperative equivalents.

With StaticArray.

def get_100_smooths(limit : UInt32) : Array(UInt32)
  primes  = UInt32.static_array(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97)
  sp_idxs = StaticArray(UInt32, 25).new(0)
  cands   = primes.dup
  smooths = Array.new(limit.to_i32 + 1, 1u32)

  (1...smooths.size).each do |si|
    smooths[si] = cands.min

    cands.size.times do |ci|
      if cands[ci] == smooths[si]
        sp_idxs[ci] += 1
        cands[ci] = smooths[sp_idxs[ci]] * primes[ci]
  end end end
  smooths
end

elapsed = Time.measure do
   puts get_100_smooths(5_000_000u32).last
end

puts elapsed

Slightly faster times.

➜  crystal run --release forumcode_opt.cr
2998212735
00:00:00.331489153
1 Like

Just for newbies, and to show another alternative. A smidgen less code.
Has same performance.

def get_100_smooths(limit : UInt32) : Array(UInt32)
  primes  = UInt32.static_array(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97)
  sp_idxs = StaticArray(UInt32, 25).new(0)
  cands   = primes.dup
  smooths = Array.new(limit.to_i32 + 1, 1u32)

  (1...smooths.size).each do |si|
    smooths[si] = cands.min

    cands.size.times do |ci|
      next unless cands[ci] == smooths[si]
      sp_idxs[ci] += 1
      cands[ci] = smooths[sp_idxs[ci]] * primes[ci]
  end end
  smooths
end
1 Like

The select never shorter you code instead of use only loop in this case.

In fact, use select Iterator with a loop is consider rarely, 10 loops with if/else in it always faster than 4 loops nested 4 select, i hope you understand what I mean .

next in the external Iterator(loop), this may be the most common usage.