How to make this faster?

extgcd · January 22, 2024, 9:55am

I have a toy program for generating first N 100-smooth numbers (i.e. each has all its prime factors less than 100). The Crystal version runs in about 1.2s while the Rust version runs in less than .5s. So rust version is more than 2x faster. My other programs that use arrays(sieve, totients, etc) have around the same perf as their Rust versions but this one isn’t.

Crystal

def get_100_smooths(limit : UInt32) : Array(UInt32)
    primes = Array(UInt32){2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97}
    sp_idxs = Array.new(primes.size, 0)
    cands = primes.dup
    smooths = Array.new(limit.to_i32 + 1, 1u32)
  
    1.upto(to: smooths.size - 1) do |si|
       smooths[si] = cands.min
   
       0.upto(to: cands.size - 1).select { |ci| cands[ci] == smooths[si]}.each do |ci|
          sp_idxs[ci] += 1
          cands[ci] = smooths[sp_idxs[ci]] * primes[ci]
       end
   end
  
   smooths
end
elapsed = Time.measure do
   puts get_100_smooths(5_000_000u32).last
end
puts elapsed

Rust

fn get_100_smooths(lim: u32) -> Vec<u32> {                                                                                       
      const PRIMES: [u32; 25] = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97];   
      let mut sp_idxs = [0usize; PRIMES.len()];
      let mut cands = PRIMES;
      let mut smooths = vec![1u32; (lim + 1) as usize];
   
      (1..smooths.len()).for_each(|si| {
            smooths[si] = *cands.iter().min().unwrap();
    
            (0..cands.len()).for_each(|ci| {
                  if cands[ci] == smooths[si] {
                        sp_idxs[ci] += 1;
                        cands[ci] = smooths[sp_idxs[ci]] * PRIMES[ci];
                  }
             });
      });

      smooths
}
fn main() {
      let timer = std::time::Instant::now();
      let smooths = get_100_smooths(5_000_000);
  
      println!("{:?}", smooths.last());
      println!("{:?}", timer.elapsed());
}

These were ran on a aarch64-linux-android through termux.

Maybe unrelated:
When I compile a rust version (this or the toy programs) compilation takes less than 2s. But with Crystal it takes about 15s. The part “Codegen (bc + obj)” takes about 13s. Is that normal? These are small Crystal programs and 15s doesn’t seem appropriate.

straight-shoota · January 22, 2024, 11:15am

I’m not very experienced with Rust, but it looks like your Rust implementation is using some array types, with a constant, static size. The equivalent in Crystal would probably be StaticArray (I’m not entirely sure about the semantic details in Rust).

Another difference in the implementation is that the inner loop uses an interator in Crystal, while in Rust it’s just a for_each. Using the same shape in Crystal would avoid allocating an iterator, thus free up a lot of time spent allocating and GCing.

This is similar to what I wrote in Why so big differences? - #3 by straight-shoota

If you want best performance, try to avoid Iterator if possible.

zw963 · January 22, 2024, 4:47pm

I test on my local laptop (AMD Ryzen 7 7840HS), Rust version slower than Crystal version.

 ╰─ $ cr build --release a.cr

 ╰─ $ ./a 
2998212735
00:00:00.664267188

 ╰─ $ rustc -C target-cpu=native a.rs -o b
 ╰─ $ ./b
Some(2998212735)
2.838247485s

Is it possible use Rust compiler is incorrect?

extgcd · January 23, 2024, 12:03am

I removed ‘select’ from the inner loop and time now is around Rust’s . I also removed it from some of my codes and some had noticeable perf improvement while others didn’t (like sieve, maybe because its just checking for truthiness). It’s a pity that I cannot use Iterators more freely since they can really shorten code(reduce locs) while being clear. Thanks for your time!

extgcd · January 23, 2024, 12:07am

I compiled rust with ‘cargo clean && cargo r -r’. And it also passes ‘cargo t’, without ‘-r’. That difference is really weird.

PS:
I’m not sure but I don’t think I see an optimization flag in your Rust compile command.

jzakiya · January 23, 2024, 1:36am

FYI. I made your original Crystal code almost twice as fast on my system.
Linux kernel 6.5.13, Crystal 1.11.2, AMD Ryzen 9 5900HX

Here’s a more optimized Crystal version.

def get_100_smooths(limit : UInt32) : Array(UInt32)
  primes  = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97] of UInt32
  sp_idxs = Array.new(primes.size, 0)
  cands   = primes.dup
  smooths = Array.new(limit.to_i32 + 1, 1u32)

  (1...smooths.size).each do |si|
    smooths[si] = cands.min

    cands.size.times do |ci|
      if cands[ci] == smooths[si]
        sp_idxs[ci] += 1
        cands[ci] = smooths[sp_idxs[ci]] * primes[ci]
  end end end
  smooths
end

elapsed = Time.measure do
   puts get_100_smooths(5_000_000u32).last
end
puts elapsed

Typical time differences.

➜  crystal run --release forumcode_orig.cr 
2998212735
00:00:00.654938502

➜  crystal run --release forumcode_opt.cr
2998212735
00:00:00.385553579

extgcd · January 23, 2024, 2:44am

This is about the same as my modified code after following @straight-shoota’s suggestions. I removed ‘select’ and used an ‘if’. I also used a StaticArray for ‘primes’, ‘sp_idxs’ and ‘cands’. Your version runs in around 0.7s while my modified version runs in around 0.5s. It’s really a pity Iterators aren’t as optimized as their imperative equivalents.

jzakiya · January 23, 2024, 3:46am

With StaticArray.

def get_100_smooths(limit : UInt32) : Array(UInt32)
  primes  = UInt32.static_array(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97)
  sp_idxs = StaticArray(UInt32, 25).new(0)
  cands   = primes.dup
  smooths = Array.new(limit.to_i32 + 1, 1u32)

  (1...smooths.size).each do |si|
    smooths[si] = cands.min

    cands.size.times do |ci|
      if cands[ci] == smooths[si]
        sp_idxs[ci] += 1
        cands[ci] = smooths[sp_idxs[ci]] * primes[ci]
  end end end
  smooths
end

elapsed = Time.measure do
   puts get_100_smooths(5_000_000u32).last
end

puts elapsed

Slightly faster times.

➜  crystal run --release forumcode_opt.cr
2998212735
00:00:00.331489153

jzakiya · January 23, 2024, 5:43am

Just for newbies, and to show another alternative. A smidgen less code.
Has same performance.

def get_100_smooths(limit : UInt32) : Array(UInt32)
  primes  = UInt32.static_array(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97)
  sp_idxs = StaticArray(UInt32, 25).new(0)
  cands   = primes.dup
  smooths = Array.new(limit.to_i32 + 1, 1u32)

  (1...smooths.size).each do |si|
    smooths[si] = cands.min

    cands.size.times do |ci|
      next unless cands[ci] == smooths[si]
      sp_idxs[ci] += 1
      cands[ci] = smooths[sp_idxs[ci]] * primes[ci]
  end end
  smooths
end

zw963 · January 23, 2024, 8:12am

The select never shorter you code instead of use only loop in this case.

In fact, use select Iterator with a loop is consider rarely, 10 loops with if/else in it always faster than 4 loops nested 4 select, i hope you understand what I mean .

zw963 · January 23, 2024, 8:20am

next in the external Iterator(loop), this may be the most common usage.

Topic		Replies	Views
Kudos to Crystal! News	12	2148	December 29, 2021
Crystal vs Rust: A Comparison	18	10334	January 10, 2022
Faster floating point parsing algorithm	15	732	September 14, 2021
Using Rust inside a Crystal program Help & Support	22	3664	April 17, 2025
Ruby out performs Crystal significantly on this numerical algorithm Help & Support	18	1263	April 29, 2023

How to make this faster?

Related topics