Using CPUs More Efficiently

jzakiya · January 17, 2022, 7:47pm

After watching this presentation (and some referenced links) on using cpu resources more efficiently I was able to increase performance by just changing the sequence of a few lines of code (loc).

The goal is to use as many cpu execution units (EU) simultaneously as possible.
So you order sequential lines of code that can be executed independently.
Here’ one snippet.

Original
def nextp_init(rhi, kmin, modpg, primes, resinvrs)
  nextp = Slice(UInt64).new(primes.size*2)
  r_hi, r_lo = rhi, rhi - 2
  primes.each_with_index do |prime, j|
    k = (prime - 2) // modpg 
    r = (prime - 2) %  modpg + 2 
    r_inv = resinvrs[r].to_u64   
    ri = (r_inv * r_lo - 2) % modpg + 2
    ki = k * (prime + ri) + (r * ri - 2) // modpg 
    ki < kmin ? (ki = (kmin - ki) % prime; ki = prime - ki if ki > 0) : (ki -= kmin)
    nextp[j << 1] = ki.to_u64 
    ri = (r_inv * r_hi - 2) % modpg + 2 
    ki = k * (prime + ri) + (r * ri - 2) // modpg
    ki < kmin ? (ki = (kmin - ki) % prime; ki = prime - ki if ki > 0) : (ki -= kmin)
    nextp[j << 1 | 1] = ki.to_u64
  end
  nextp
end

More Efficient/Faster
def nextp_init(rhi, kmin, modpg, primes, resinvrs)
  nextp = Slice(UInt64).new(primes.size*2)
  r_hi, r_lo = rhi, rhi - 2
  primes.each_with_index do |prime, j|
    k = (prime - 2) // modpg 
    r = (prime - 2) %  modpg + 2
    r_inv = resinvrs[r].to_u64 
    rl = (r_inv * r_lo - 2) % modpg + 2
    rh = (r_inv * r_hi - 2) % modpg + 2 
    kl = k * (prime + rl) + (r * rl - 2)
    kh = k * (prime + rh) + (r * rh - 2)
    kl < kmin ? (kl = (kmin - kl) % prime; kl = prime - kl if kl > 0) : (kl -= kmin)
    kh < kmin ? (kh = (kmin - kh) % prime; kh = prime - kh if kh > 0) : (kh -= kmin)
    nextp[j << 1] = kl.to_u64 
    nextp[j << 1 | 1] = kh.to_u64 
  end
  nextp
end

Changing the code sequence such that the next loc isn’t dependent on input from the previous allows two EUs to run these instructions simultaneously. This one change in this one routine appreciably increased total runtime speed.

Here’s another snippet example from the same program.

Original
    primes.each_with_index do |prime, j| 
      k = nextp.to_unsafe[j << 1]        
      while k < kn                       
        seg[k >> s] |= 1_u64 << (k & bmask)
        k += prime  end                  
      nextp.to_unsafe[j << 1] = k - kn   
                                         
      k = nextp.to_unsafe[j << 1 | 1]    
      while k < kn                       
        seg[k >> s] |= 1_u64 << (k & bmask)
        k += prime  end                   
      nextp.to_unsafe[j << 1 | 1] = k - kn
    end                                   

More Efficient/Faster
    primes.each_with_index do |prime, j|
      kl = nextp.to_unsafe[j << 1]      
      kh = nextp.to_unsafe[j << 1 | 1]  
                                        
      while kl < kn                     
        seg[kl >> s] |= 1_u64 << (kl & bmask)
        kl += prime  end
                        
      while kh < kn     
        seg[kh >> s] |= 1_u64 << (kh & bmask)
        kh += prime  end                
        
      nextp.to_unsafe[j << 1] = kl - kn 
      nextp.to_unsafe[j << 1 | 1] = kh - kn 
    end

These are snippets from the Crystal versions of my twinprimes_ssoz sieve.
But what I’ve seen, the gcc compiled versions for C++|Nim show a much bigger speed improvement (execution time reduction) than languages using llvm, except for Rust which also sees significant improvement.

This got me wondering, are there places in Crystal’s code base where things like this can be done to improve performance and|or memory use?

I also wanted to hip users to these FREE speedups by just writing your code better.

I hope this has been useful.

vlazar · January 18, 2022, 6:23am

Reminded me about GitHub - plasma-umass/coz: Coz: Causal Profiling and Crystal bindings GitHub - RX14/coz.cr

Crystal bindings might not work completely correctly as was mentioned on this forum, because of some issues in debug info which needs to be fixed for results to be accurate. It would be nice to have such tools for optimization.

ComputerMage · January 18, 2022, 7:10pm

Great video!
I am already looking at his book on my O’Reilly Learning subscription.

Topic		Replies	Views
Why it this faster? Help & Support	2	415	February 24, 2022
How to make this faster? Help & Support	10	312	January 23, 2024
No Speed Improvement with Spawn Multithreading？ Help & Support	3	322	July 16, 2023
Goldbach's Conjecture and Crystal Code Help & Support	55	994	March 20, 2025
Array.rotate! can be faster Learning Resources	4	816	November 25, 2019

Using CPUs More Efficiently

Related topics