Crystal vs Truffleruby

I was quite surprised that Truffleruby (23.1.1) was so much faster than Crystal (1.10.1) for this task.
Apparently the Graal VM is doing some very good stuff here.

I installed Truffleruby on my Linux system via rvm (ruby version manager) – https://rvm.io/

I first wrote this in Ruby – residue_gaps.rb

def residue_gaps(pm)
  primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43]
  i = primes.index(pm)
  modpn = primes[0..i].reduce :*          # modulus is primorial pm#
  res_gaps = Hash.new(0)                  # create hash of residue gaps
  res_gaps[2] = 1; res_gaps[4] = 1        # initial gap counts of 2 and 4
  r1, r2 = 1, 2 + pm                      # initial first two residues
  while r2 < modpn / 2                    # for first half modulus residues
    if r2.gcd(modpn) == 1                 # if r2 is a residue
      res_gaps[r2 - r1] += 2              # compute gap size, double cnt for mirror pair
      r1 = r2                             # set old residue to current residue
    end
    r2 += 2                               # set next odd value to current residue
  end
  res_gaps.to_a.sort!                     # output array of [gap size, gap count]
end

pm   = ARGV[0].to_i                       # read prime value from terminal
print residue_gaps(pm)                    # display array of ai gap values
puts

Then wrote it in Crystal – residue_gaps.cr

def residue_gaps(pm)
  primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43] of UInt64
  i = primes.index(pm).not_nil!
  modpn = primes[0..i].product            # modulus is primorial pm#
  res_gaps = Hash(Int32, UInt64).new(0)   # create hash of residue gaps
  res_gaps[2] = 1; res_gaps[4] = 1        # initial gap counts of 2 and 4
  r1, r2 = 1u64, 2u64 + pm                # initial first two residues
  while r2 < modpn // 2                   # for first half modulus residues
    if r2.gcd(modpn) == 1                 # if r2 is a residue
      res_gaps[(r2 - r1).to_i32] += 2     # compute gap size, double cnt for mirror pair
      r1 = r2                             # set old residue to current residue
    end
    r2 += 2                               # set next odd value to current residue
  end
  res_gaps.to_a.sort!                     # output array of [gap size, gap count]
end

pm   = ARGV[0].to_u64                     # read prime value from terminal
print residue_gaps(pm)                    # display array of ai gap values
puts

$ crystal build --release residue_gaps.cr

Then run them like this:

$ time ruby residue_gaps.rb 17
[[2, 22275], [4, 22275], [6, 26630], [8, 6812], [10, 7734], [12, 4096], [14, 1406], [16, 432], [18, 376], [20, 24], [22, 78], [24, 20], [26, 2]]
ruby residue_gaps.rb 17  0.05s user 0.01s system 99% cpu 0.095 total

$ time ./residue_gaps 17
[{2, 22275}, {4, 22275}, {6, 26630}, {8, 6812}, {10, 7734}, {12, 4096}, {14, 1406}, {16, 432}, {18, 376}, {20, 24}, {22, 78}, {24, 20}, {26, 2}]
./residue_gaps 17  0.01s user 0.00s system 97% cpu 0.010 total

And here are timing comparisons (in seconds)


   prime inputs     |   11   |   13   |   17   |   19   |   23   |   29   |   31   |
--------------------|--------|--------|--------|--------|--------|--------|--------|
   crystal 1.10.1   |  0.002 |  0.002 |  0.010 |  0.193 |  5.123 | 175.63 |6217.55 |
--------------------|--------|--------|--------|--------|--------|--------|--------|
 truffleruby 23.1.1 |  0.045 |  0.060 |  0.095 |  0.310 |  2.968 |  91.66 |3297.21 |

As the primes increase Truffleruby becomes almost twice as fast as Crystal.
Can Crystal learn something from this?

1 Like

I remembered after posting the code that the Crystal stdlib gcd method is slow for this.
It seems Truffleruby’s stdlib gcd method is highly optimized. Might be something to learn from.

I created a customized coprime? which makes the code increasingly fast than Truffleruby.
But it’s still impressive that Trufferuby is that fast straight out the box!

# Customized gcd to determine coprimality; n > m; m odd
def coprime?(m, n)
  while m|1 != 1; t = m; m = n % m; n = t end
  m > 0
end

def residue_gaps(pm)
  primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43] of UInt64
  i = primes.index(pm).not_nil!
  modpn = primes[0..i].product            # modulus is primorial pm#
  res_gaps = Hash(Int32, UInt64).new(0)   # create hash of residue gaps
  res_gaps[2] = 1; res_gaps[4] = 1        # initial gap counts of 2 and 4
  r1, r2 = 1u64, 2u64 + pm                # initial first two residues
  while r2 < modpn // 2                   # for first half modulus residues
    if coprime?(r2, modpn)                # if r2 is a residue
      res_gaps[(r2 - r1).to_i32] += 2     # compute gap size, double cnt for mirror pair
      r1 = r2                             # set old residue to current residue
    end
    r2 += 2                               # set next odd value to current residue
  end
  res_gaps.to_a.sort!                     # output array of [gap size, gap count]
end

pm   = ARGV[0].to_u64                     # read prime value from terminal
print residue_gaps(pm)                    # display array of ai gap values
puts

Here are the time comparisons now.


   prime inputs     |   11   |   13   |   17   |   19   |   23   |   29   |   31   |
--------------------|--------|--------|--------|--------|--------|--------|--------|
   crystal 1.10.1   |  0.002 |  0.002 |  0.006 |  0.099 |  2.549 |  85.08 |2978.47 |
--------------------|--------|--------|--------|--------|--------|--------|--------|
 truffleruby 23.1.1 |  0.045 |  0.060 |  0.095 |  0.310 |  2.968 |  91.66 |3297.21 |

Unless I’m mistaken, I think this is TruffleRuby’s implementation for gcd.
I decided to use that implemenation in Crystal, and it’s definitely faster than Crystal’s stdlib. However, your custom coprime? is still much faster.

TruffleRuby’s JIT compiler can do some absolutely unbelievable things when it’s warmed up. The most mind-warping thing for me was how it can optimize out heap allocations in a method. The example Chris Seaton (former lead of the TruffleRuby project) used to give was [a, b].max. Normally, this expression would involve a heap allocation for the array and then iterate through it. When the JIT is warmed up, though, what it ends up executing is more like if you’d written: a > b ? a : b, completely sidestepping the heap allocation because the JIT detects that it isn’t needed.

That said, JIT compilers in general can take a pretty big number of iterations to optimize code that intensely and the performance can be significantly lower until they get to that point. The first executions of a method on TruffleRuby will perform worse than CRuby, not to mention deoptimizations due to new inputs can cause performance to get worse before it gets better. At my day job (a pretty large Ruby shop) we run both CRuby’s YJIT and TruffleRuby. The benefits we see from both on weekdays is limited — by the time it’s warmed up we’re deploying again and the JIT has to start over from scratch.

2 Likes

crystal implementation of gcd is pretty ineffective
there are some loops to remove trailing zero bits but this should be a single instruction instead: c++ - Fastest way to strip trailing zeroes from an unsigned int - Stack Overflow

well I guess you should provide your own implementation if the speed is important. the naive approach like in the example might be better because LLVM can recognize the patterns and optimize them out.

p/s: great news: now crystal gcd also uses the trailing_zeros_count method which is a single instruction: Use `#trailing_zeros_count` in `Int#gcd` by HertzDevil · Pull Request #14069 · crystal-lang/crystal · GitHub

2 Likes

In this case there was apparently no warmup needed, because repeated runs of the code produce essentially the same times.

I’ve noted before for truffleruby, for strictly numerical computational intense code it’s initial optimization is very good. That’s because, as in the case here, it knows all the parameter values before hand, and it doesn’t have to wait for external inputs or interact with external resources (like https requests, database responses, etc).

Because my goal wasn’t really to test jits, I didn’t try running CRuby with the various jits available for it, especally yjit which was recently updated by Shopify (its creators) and now performs much better with Rails apps, etc.

Yes, using trailing_zeroes_count not only makes that code shorter|simpler but should also make it significantly faster.

But this is also a testament on how well this community|forum works.

In many cases I first write code in Ruby then translate it to Crystal (sometime vice versa) specifically to see performance differences. In this case, I couldn’t believe Crystal was inherently slower than truffleruby for this, because ultimately it’s compiled down to code to run on the same cpu.

So thank you @HertzDevil for quickly jumping on this and assessing|fixing the apparent bottleneck.

I don’t think that’s accurate. Ruby is still a dynamic language. TruffleRuby can’t possibly know the parameter values or types before it executes the code.

In some cases it might be able to quickly infer types and optimize for them, but even then that’s not what I saw in running your code. On repeated executions of the residue_gaps method on multiple machines, I saw very different execution times in a repeatable pattern, and it even got about 76% slower before it got faster:

➜  Code ruby -v residue_gaps.rb 29
truffleruby 23.1.1, like ruby 3.2.2, Oracle GraalVM Native [aarch64-darwin]
       user     system      total        real
 133.984666   0.154553 134.139219 (134.120904)
       user     system      total        real
 151.947384   0.179215 152.126599 (152.990778)
       user     system      total        real
 150.383223   0.166429 150.549652 (151.434672)
       user     system      total        real
 236.379312   0.927856 237.307168 (237.050406)
       user     system      total        real
 129.522166   0.050365 129.572531 (129.883487)

This pattern occurred every time I ran it with the same input, including on an Intel machine — runs 1-4 trended more slowly (presumably due to deoptimizations) and the 5th being slightly faster than the 1st. I’m not sure how you were measuring repeated executions to get similar results each time, but what I did was wrap the residue_gaps(pm) call in 5.times { Benchmark.bm { gaps = residue_gaps(pm) } }).

That’s where I work. :smiling_face: The folks working on YJIT and TruffleRuby there are on a sister team to mine.

(Oh. Does Shopify use Crystal for anything?)

1 Like