Ruby out performs Crystal significantly on this numerical algorithm

Yesterday I ran a Ruby version of my Crystal implementation of the
Miller-Rabin primality test. I was surprised to find Ruby is significantly
faster for the same code with the same inputs.

Here’s the Ruby code.

def prime? (n, k = 4)
  primes =  [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103]
  return primes.include? n if n <= primes.last
  modpn = 23286236435849736090006331688050736307 # 101# (101 primorial)
  return false if modpn.gcd(n) != 1
  return true  if n < primes.last ** 2
  # Start Miller-Rabin primality test
  neg_one_mod = t = d = n - 1       # these are even as n is always odd
  while d.even?; d >>= 1 end        # shift out factors of 2 to make d odd
  witnesses = k.times.map{ 2 + rand(n - 4) } # create k random bases
  witnesses.each do |b|             # perform Miller-Rabin bases tests
    y = b.pow(d, n)                 # y = (b**d) mod n
    s = d
    until y == 1 || y == neg_one_mod || s == t
      y = (y * y) % n               # y = (y**2) mod n
      s <<= 1
    end
    return false unless y == neg_one_mod || s.odd?
  end
  true
end

Here’s the Crystal code.

require "big"

@[Link("gmp")]
lib LibGMP
  fun mpz_powm_sec = __gmpz_powm_sec(rop : MPZ*, base : MPZ*, exp : MPZ*, mod : MPZ*)
end

def powmod(b, e, m) # y = b^e mod m
  y = BigInt.new
  LibGMP.mpz_powm_sec(y, b.to_big_i, e.to_big_i, m.to_big_i)
  y
end

def prime? (n, k = 4)
  primes =  [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103]
  return primes.includes? n if n <= primes.last
  modpn = 232862364358497360900063316880507363070u128.to_big_i # 101# (101 primorial)
  return false if modpn.gcd(n.to_big_i) != 1
  return true  if n < primes.last ** 2
  # Start Miller-Rabin primality test
  neg_one_mod = t = d = n - 1    # these are even as n is always odd
  d >>= d.trailing_zeros_count   # shift out factors of 2 to make d odd
  witnesses = k.times.map{ 2 + rand(n - 4).to_big_i } # create k random bases
  witnesses.each do |b|          # perform Miller-Rabin bases tests
    y = powmod(b, d, n)          # y = (b**d) mod n
    s = d
    until y == 1 || y == neg_one_mod || s == t
      y = (y &* y) % n           # y = (y**2) mod n
      s <<= 1
    end
    return false unless y == neg_one_mod || s.odd?
  end
  true
end

My system

➜ inxi
CPU: 8-core AMD Ryzen 9 5900HX with Radeon Graphics (-MT MCP-)
speed/min/max: 1262/1200/4679 MHz Kernel: 6.1.22-pclos1 x86_64 Up: 11h 2m
Mem: 6124.9/39489.3 MiB (15.5%) Storage: 953.87 GiB (13.0% used) Procs: 349
Shell: Zsh inxi: 3.3.26

Here are some tests I did with various large digit primes.

# Ruby tests
# $ ruby testfile.rb

def tm; t = Time.now; yield; Time.now - t end

#n = (70965694293 << 200006) - 1      # 60,219 digits
#n = (4111286921397 << 66420) + 5     # 20,008 digits
#n = (4122429552750669 << 16567) + 7  #  5,003 digits
n = (12379 << 12379) - 1             #  3,731 digits
print "\n number = #{n} is prime? "; print " in ", tm{ print prime?(n) }, " secs"
puts

# Crystal tests
# $ crystal run --release testfile.cr

def tm; t=Time.monotonic; yield; (Time.monotonic-t).total_seconds.round(3) end

#n = ("70965694293".to_big_i << 200006) - 1      # 60,219 digits
#n = ("4111286921397".to_big_i << 66420) + 5     # 20,008 digits
#n = ("4122429552750669".to_big_i << 16567) + 7  #  5,003 digits
n = ("12379".to_big_i << 12379) - 1             #  3,731 digits
print "\n number = #{n} is prime? "; print " in ", tm{ print prime?(n) }, " secs"
put

Best times after multiple runs.
Ruby gets progressively faster (or Crystal slower) as the prime digits increase.

Prime Digits  | 3,731 |  5,003 |  20,008 |  60,219  |
--------------|-------|--------|------- -|----------|
Ruby 3.2.2    | 0.959 |  1.693 |  49.001 |  727.575 |
--------------|-------|--------|-------- |----------|
Crystal 1.7.3 | 1.843 |  4.423 | 279.322 | 7600.197 |

This was a big surprise, as I hadn’t run this code in Ruby in a while.
This shows Ruby’s efforts to increase performance has paid off, at least here.

But what’s really concerning is that the performance difference isn’t constant,
but Crystal becomes increasingly relatively slower as the prime digits increase.
I suspect it has to do with the BigInts implementation, as Ruby automatically
uses the appropriate integer type and memory as needed.

So, can the Crystal version be significantly sped up?

(FYI, when I run the Ruby code with TruffleRuby and JRuby it’s slower than Crystal.)

1 Like

Crystal bigint is slow. I tried to brute force one advent of code problem last year (the modulo problem) with bigint and it ran like forever.

mpz_powm_sec is intended to be slow (the sec stands for “secure”) in order to be resilient against timing attacks. Did you try mpz_powm?

4 Likes

Please look at this from the point of view of a user doing numerical algorithms.
I used the resources that are available in the standard documentation for both languages.
I coded the same routine exactly the same (per implementation details).
And Ruby, to my surprise with 3.2.2, creamed Crystal 1.7.3, the compiled language.

I’m a fan and user of both languages, but seeing is believing!

Whatever Ruby did to perform this algorithm in 727 secs (12mins, 7 secs) vs 7600 secs (2hrs, 6min, 40secs) shouldn’t be ignored.

If Crystal is going to market itself as “fast”, to compete with C|C++ and Rust, and if BigInts is known to be slow, it needs to do something about it, as these time comparisons with Ruby are troubling (at least to me).

I know this doesn’t apply to most areas (that aren’t numerically heavy), but I would urge you to put this on the 2.0 TODO list to correct. If that means discarding BigInts, and coming up with something new AND better, so be it. But this doesn’t currently position Crystal for use with many arbitrary integer size algorithms and mathematical fields (like cryptography and number theory), at least if performance is important.

2 Likes

I don’t understand why you see this as a problem in Crystal’s stdlib.
The algorithm you’ve analyzed is a custom binding to mpz_powm_sec, a function implemented by libgmp and currently not exposed in Crystal’s standard library.
@Hertzdevil already suggested to try using mpz_powm instead. That’s what Ruby uses as well, btw.
And for numerical applications this is probably the better choice due to its performance and there’s no need for the security aspect.
This should also be a requirement for a fair comparison between performance of a Ruby and Crystal implementation.

6 Likes

You’re missing the points that I’m emphasizing.

Right off the bat the Ruby implementation is significantly faster because its numerical implementation is (has become) apparently inherently faster. It automatically adjusts for integer sizes (and whatever that physically entails) without the user having to figure out how to manually assign integer types, etc.

The performance problem in the Crystal code isn’t the powmod function, it’s most likely the invocation and use of BigInts to do ALL the calculations with, which I know for a fact is slow, because I’ve done versions constrained to UInt64 values to not have to use them.

Ruby used to have their equivalents with Bignums but did away with them and now just has Ints. It’s clear from these timings, whatever they are doing internally is superior to what Crystal is doing, while probably using the same underlying numerical libraries.

Given this, I have to assume Crystal can do better using the same libraries too.

How to do that is way above my paygrade in competence and skills in compilers, LLVM, wrapping libraries, etc. But if nothing else, understand Ruby approaches and implementations for doing this, and adopt what’s applicable.

The bottom line has to be though, Crystal needs to become allot faster for these types of applications if it wants to compete with other languages for use in those applications.

1 Like

On my M2 I’m getting 0.856s with mpz_powm and 2.293s with mpz_powm_sec. The Ruby snippet takes 33 seconds to run for the same value of n. I also tried the 20,008-digit value and it took 46.3s and 359s on Crystal with the two funs respectively. Did you pass any special command-line arguments to Ruby?

2 Likes

I ran it on my own computer.
The time for both Crystal and Ruby was almost the same. The results were almost the same with or without the ruby --yjit option. I checked if yjit is enabled with `RubyVM::YJIT.enabled?

Prime Digits  | 3,731 |  5,003 |  20,008 |
--------------|-------|--------|------- -|
Ruby 3.2.2    | 0.980 |  1.721 |  50.695 |
Ruby (--yjit) | 0.975 |  1.716 |  50.785 |
--------------|-------|--------|-------- |
Crystal 1.7.3 | 0.971 |  1.711 |  50.518 |

It is strange that all the results are the same, but is it just a coincidence?

# inxi
CPU: 8-core AMD Ryzen 7 5700G with Radeon Graphics (-MT MCP-)
speed/min/max: 1751/1400/4672 MHz Kernel: 5.19.0-38-generic x86_64 Up: 1h 18m
Mem: 6094.6/63674.8 MiB (9.6%) Storage: 4.1 TiB (3.8% used) Procs: 424
Shell: Bash inxi: 3.3.13
1 Like

@kojix2 those Crystal numbers must be wrong.

Do $ crystal build --release yourfile then $ ./yourfile and see what you get.

@HertzDevil I didn’t pass anything to Ruby or Crystal not shown.
When I changed mpz_powm_sec to mpz_powm I get no change.

But the code spends most of the time in the until...end loop anyway, doing
y = (y &* y) % n; s << 1
which are all using BigInts.

It is almost certain then that Ruby isn’t using BigInts from the GMP library.

I built the executable with the `crystal build --release" command. I am sure of that.
And I just checked it on another PC. The time is still almost the same.

I had a similar situation before. At that time, most of the time was consumed by the bound library.
Let’s take a profile to investigate the performance of Crystal’s code.


I am not at all familiar with static languages, so the only time I run the profiler is to examine Crystal. However, looking at the diagram, it appears that the external library libgmp.so is consuming most of the program execution time. Therefore, it is quite possible that the same time is consumed whether it is Ruby or Crystal. In other words, it may not be very appropriate as a benchmark for the language itself.

5 Likes

Okay, thanks for clarifying.
Makes sense as a general observation. It’s just that the specific example in this thread is not a good one to make that point, that confused me.

We can certainly use more tools for scientific computing in Crystal. I don’t think we’ll have much of it in stdlib, so it’s more a thing for external libraries.

2 Likes

Ruby could be built with GMP, otherwise it falls back to its own internal implementation.

1 Like

@kojix2 can you list the specs of the systems you ran the code on.

@HertzDevil I looked at the link you provide for the Ruby pow function in the ruby/bignum.c file, in lines 7124-7170.

It selects 3 different pow functions: int_pow_tmp1, int_pow_tmp2, int_pow_tmp3

As I write this, I haven’t taken the time to study all 7201 loc in the file, to find their definitions and use patterns.

But the code after this, to the end of the file (7172 - 7201), is this.

/*
 *  Bignum objects hold integers outside the range of
 *  Fixnum. Bignum objects are created
 *  automatically when integer calculations would otherwise overflow a
 *  Fixnum. When a calculation involving
 *  Bignum objects returns a result that will fit in a
 *  Fixnum, the result is automatically converted.
 *
 *  For the purposes of the bitwise operations and <code>[]</code>, a
 *  Bignum is treated as if it were an infinite-length
 *  bitstring with 2's complement representation.
 *
 *  While Fixnum values are immediate, Bignum
 *  objects are not---assignment and parameter passing work with
 *  references to objects, not the objects themselves.
 *
 */

void
Init_Bignum(void)
{
    rb_define_method(rb_cInteger, "coerce", rb_int_coerce, 1);

#if USE_GMP
    /* The version of loaded GMP. */
    rb_define_const(rb_cInteger, "GMP_VERSION", rb_sprintf("GMP %s", gmp_version));
#endif

    power_cache_init();
}

So, yes, Ruby is using GMP functions, but not exclusively it seems.

But here’s what I’d like the devs to consider as it imagines what Crystal 2.0 should be.
(For purposes here, when I use compiler I mean all the steps and stages of turning source
code into executable machine code.)

This is the (so-called) 21st Century. I shouldn’t have to (and don’t want to) be required
to manually tell the compiler the size of numbers. If a compiler can give warnings about
numbers being to big, it can just use the appropriate size to handle them, like Ruby does.

The language should be designed to free the programmer from doing (essentially) a form of
assembly language programming. I shouldn’t have to be required to pre-know the size of all
the numbers my program could encounter. The language should be designed to handle that.

Most numeric libraries (like GMP, MPFR, Boost) where designed|created decades ago in|for C|C++. And to use them you have to follow the paradigm from which they where created.

But Ruby has apparently shown we can do better, and not only be locked into what they provide.

In my mind, you may need to create a hybrid structure that allows for a static and dynamic
compiler to get the best performance. This at least should be philosophically considered.
In fact, the melding of the best of Ruby-Crystal could be a killer-solutions language.

But on a more practical note to address the current issues of this code, I think it would be
useful to engage in a discussion with the Ruby devs, including Matz, into what|how drove their
thinking and design process in creating their current numeric|computational structure.

This is a reason why I suggested setting up a numeric|computational working group, to at
least identify the current state-of-the-art in these fields. It’s too much of a job to have
to focus on the details to create one design of a language and simultaneously imagine the
philosophy, technology, and details to create the next better one.

4 Likes

This is what Common Lisp has done since the 80s. Variables can hold anything, but you can also add type restrictions all over the place to speed things up or get certain types of behavior. Of course it’s up to the implementation to actually respect these declarations, but most high performance ones seem to. But it’s also a double-edged sword, because getting code highly tuned for performance means extra verbosity, and it can mean extra steps and cognition on the part of the programmer when messing with code that expects overflows. I’ve encountered this plenty of times in the past. The compiler also has to be much more complex and smart about how it generates code, since it has to add a bunch of extra checks at runtime for overflows and such.

Numbers will also automatically upcast to a big num as-needed in Common Lisp, transparently, so maybe that’s where the Ruby devs got the idea. I believe Python does the same, but I haven’t used Python since 3.0 was still new, so I could be wrong. Again, double-edged sword. It can be really nice sometimes, but it can also be a pain when you want to do lower-level stuff, because sometimes, you want a number to stay an exact type. Some use cases just benefit from more strictness, some cases require overflows to be a thing, and some cases just need to be more aware of memory usage. Not to mention the benefits that static typing brings to finding bugs…

EDIT: for completeness sake, I ran the code on two separate and very different machines as well.

Machine 1

CPU: 10-core Intel Core i9-10850K (-MT MCP-) speed/min/max: 3600/800/5200 MHz
Kernel: 6.2.9 x86_64 Up: 6d 20h 46m Mem: 4439.0/64245.5 MiB (6.9%)
Storage: 4.09 TiB (24.0% used) Procs: 346 Shell: Bash inxi: 3.3.12

Ruby: 0.924378318
Crystal (mpz_powm_sec, crystal build --release --no-debug): 1.715
Crystal (mpz_powm, crystal build --release --no-debug): 0.924

Machine 2

CPU: dual core AMD Ryzen 3 3200U with Radeon Vega Mobile Gfx (-MT MCP-)
speed/min/max: 2114/1400/2600 MHz Kernel: 6.2.9 x86_64 Up: 6d 13h 50m
Mem: 3326.4/9870.8 MiB (33.7%) Storage: 494.65 GiB (30.2% used) Procs: 264
Shell: Bash inxi: 3.3.12

Ruby: 1.639495515
Crystal (mpz_powm_sec, crystal build --release --no-debug): 2.668
Crystal (mpz_powm, crystal build --release --no-debug): 1.532

Ruby is 3.0.6 on both machines.

7 Likes

OK, after I used the correct version, I confirm using

@[Link("gmp")]
lib LibGMP
  fun mpz_powm = __gmpz_powm(rop : MPZ*, base : MPZ*, exp : MPZ*, mod : MPZ*)
end

def powmodgmp(b, e, m)
  y = BigInt.new
  LibGMP.mpz_powm(y, b.to_big_i, e.to_big_i, m.to_big_i)
  y
end

instead of

@[Link("gmp")]
lib LibGMP
  fun mpz_powm_sec = __gmpz_powm_sec(rop : MPZ*, base : MPZ*, exp : MPZ*, mod : MPZ*)
end

def powmodgmp(b, e, m)
  y = BigInt.new
  LibGMP.mpz_powm_sec(y, b.to_big_i, e.to_big_i, m.to_big_i)
  y
end

produce comparable time with Ruby. :slight_smile:

I also just found out you can use the form without the _sec in Benchmark but not with it.
I assume there are other functions you can use w/o the security versions, which would be nice to document so others aren’t subjected to silent poor performance because of not knowing the version they used is not designed for speed.

4 Likes

Studying the Ruby source reveals it performs math with different integer type sizes, and doesn’t do everything with Bignums when possible. Here is a way to do the same in Crystal. (However, currently Crystal exhibits overflow, and doesn’t do the (y &* y) % n) operations accurately in all cases. See Numeric type math differences, no BigInt.to_(i|u)128 methods · Issue #13301 · crystal-lang/crystal · GitHub)

Here’s an example of how to mimic what Ruby is doing.
Whatever m is, the answer for r is in range (0..m-1).
Thus, for optimal performance, do an initial b % m, and b is now within that range.
Then all subsequent math can be done with that integer type, and should not overflow (if done correctly).

Here’s the code.

require "big"

@[Link("gmp")]
lib LibGMP
  fun mpz_powm = __gmpz_powm(rop : MPZ*, base : MPZ*, exp : MPZ*, mod : MPZ*)
end

def powmodgmp(b, e, m)
  y = BigInt.new
  LibGMP.mpz_powm(y, b.to_big_i, e.to_big_i, m.to_big_i)
  y
end

def powmodint(b, e, m)
  r = typeof(m).new(1)
  while e > 0
    r = (r &* b) % m if e.odd?
    e >>= 1
    b = (b &* b) % m
  end
  r
end

def powmod(b, e, m)
  if typeof(m) == typeof(BigInt.new)
    powmodgmp(b, e, m)
  else
    b = b % m;  b = typeof(m).new(b)
    powmodint(b, e, m)
  end
end

Here are some tests

require "benchmark"

def powmodtests(b, e, m)
  Benchmark.ips do |x|
    x.report("powmod")    { powmod(b, e, m) }
    x.report("powmodgmp") { powmodgmp(b, e, m) }
    puts
  end
end

def tests(b, e, m)
  puts "r = #{r = powmod(b, e, m)} of class #{r.class} using powmod"
  puts "r = #{r = powmodgmp(b, e, m)} of class #{r.class} using powmodgmp"
  powmodtests(b, e, m)
  puts
end

puts
b = "329832983".to_big_i
e = 4843
m = "498422".to_big_i
tests b, e, m

b = "329832983".to_big_i
e = 4843
m = 498422.to_i64
tests b, e, m

b = "329832983".to_big_i
e = 4843
m = 498422.to_u64
tests b, e, m

Here are the results

➜  crystal-projects crystal run --release powmod-benmarks.cr

r = 107323 of class BigInt using powmod
r = 107323 of class BigInt using powmodgmp

   powmod   3.85M (259.66ns) (± 0.59%)  32.0B/op   1.00× slower
powmodgmp   3.85M (259.55ns) (± 0.28%)  32.0B/op        fastest

r = 107323 of class Int64 using powmod
r = 107323 of class BigInt using powmodgmp

   powmod  14.83M ( 67.41ns) (± 0.34%)  16.0B/op        fastest
powmodgmp   3.69M (270.83ns) (± 0.32%)  48.0B/op   4.02× slower

r = 107323 of class UInt64 using powmod
r = 107323 of class BigInt using powmodgmp

   powmod  14.64M ( 68.32ns) (± 0.64%)  16.0B/op        fastest
powmodgmp   3.72M (268.83ns) (± 0.36%)  48.0B/op   3.93× slower

In fact, it’s very difficult to use different Int and Float types in the Crystal language. at least for me. To be honest, I don’t really know how to do it.

I appreciate where you’re coming from on this, but I disagree on your "should"s. Your particular use-cases are not universal, and your desired qualities in a programming language (seemingly: maximal ease-of-use in arithmetic operations, maximal performance, “best-guess” compilation to ease development, expansive standard library) are only a particular set of goals that could be used when designing and implementing a particular programming language.

I have different qualities that I want Crystal to retain or gain (such as type safety, readability, predictability (how easy it is to reason about what code does), usefulness of error messages, and, like you, performance). Some of those qualities conflict somewhat with your desired qualities, such as how (I expect) that abstracting numeric types so that the underlying representation is unknown would lead to a loss of predictability.

Additionally, there are simply unavoidable limitations that make tradeoffs necessary. The dev team does not have the resources of a Google or even a Mozilla, so not every error message has been handcrafted to be as useful as possible. Arithmetic with numeric types of specific sizes will necessarily be faster and less memory-intensive than arithmetic that involves numeric types that expand as-needed. Using type inference will necessarily lead to slower compilation times than requiring all variables to be explicitly typed. Modern hardware and software help with these things (performance is less of an issue if the machine is faster, LLVM helps the dev team develop and maintain a compiler without excessive effort to handle specific compilation targets), but they only help so much.

I absolutely support your ability to push for Crystal to be designed and implemented according to the goals that you think are important, but the language as it exists is not wrong for attempting to accomplish different goals than yours. As I understand it, Crystal is designed as a general-purpose language with certain goals related to developer experience (easy to write, easy to read, type inference), safety (underlying static typing), and performance (compiled), and I believe that if all of your goals for the language were accomplished then it would be a different, non-general-purpose programming language.

2 Likes

Check following thread.