BigInt performance

➜  ~ crystal -v
Crystal 0.35.1 [5999ae29b] (2020-06-19)

LLVM: 8.0.0
Default target: x86_64-unknown-linux-gnu

Intel Core i7-6700HQ bits: 64 type: 4C/8T; 800/3500 MHz 

I ran a comparison of a primality test on the same prime value as a u64 and as a BigInt, and the BigInt version was substantially slower than the u64 version.

Does this expose an inefficient implementation of BigInt in Crystal, or is this an inherent characteristic to be expected (and lived with)?

I would have assumed BigInt would check whether the value fits in a native machine word before translating it to an arbitrary-precision format.

Also, when running the test code I see in htop that, for the u64 value, the process runs on a single thread close to the max system clock frequency until finished, but for BigInt the work is spread over the (8) threads at much lower frequencies.

#primetest.cr

require "big"
 
def prime?(n)                     # P3 Prime Generator primality test
  return n | 1 == 3 if n < 5      # n: 2,3|true; 0,1,4|false 
  return false if n.gcd(6) != 1   # this filters out 2/3 of all integers
  pc = typeof(n).new(5)           # first P3 prime candidates sequence value
  until pc*pc > n
    return false if n % pc == 0 || n % (pc + 2) == 0  # if n is composite
    pc += 6                       # 1st prime candidate for next residues group
  end
  true
end

def prime1?(n)                    # P3 Prime Generator primality test
  return n | 1 == 3 if n < 5      # n: 2,3|true; 0,1,4|false 
  return false if n.gcd(6) != 1   # this filters out 2/3 of all integers
  pc = typeof(n).new(5)           # first P3 prime candidates sequence value
  sqrtN = typeof(n).new(Math.sqrt(n))
  until pc > sqrtN
    return false if n % pc == 0 || n % (pc + 2) == 0  # if n is composite
    pc += 6                       # 1st prime candidate for next residues group
  end
  true
end

def tm; t = Time.monotonic; yield; Time.monotonic - t end

# 20 digit prime
n = 12345678901234567891u64          # u64 value
print "\n number = #{n} is prime "; print " in ", tm{ print prime?(n)  }, " secs"
print "\n number = #{n} is prime "; print " in ", tm{ print prime1?(n) }, " secs"
puts

# 20 digit prime
n = "12345678901234567891".to_big_i  # BigInt value
print "\n number = #{n} is prime "; print " in ", tm{ print prime?(n)  }, " secs"
print "\n number = #{n} is prime "; print " in ", tm{ print prime1?(n) }, " secs"
puts

Timing results.

# For u64
 number = 12345678901234567891 is prime true in 00:00:10.394791582 secs
 number = 12345678901234567891 is prime true in 00:00:09.589179257 secs

# For BigInt
 number = 12345678901234567891 is prime true in 00:01:58.417328276 secs
 number = 12345678901234567891 is prime true in 00:01:38.544755966 secs

Going from 9-10 seconds to 98-118 seconds is an unexpected 10x+ difference for the same value.

Crystal doesn’t have its own implementation of arbitrary-precision integers; it has bindings to libgmp, a popular C library for this. Not much can be done on the Crystal side to make it work faster.

Please see this thread: https://github.com/crystal-lang/crystal/issues/6646

tl;dr: we should add methods that mutate BigInt instances to avoid allocating memory when it isn’t needed. We map to libgmp, but we aren’t providing a way to do optimization tricks (like reusing instances).

I don’t have time to implement this (or anything else, for that matter) but PRs are welcome. Thank you!
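As a rough sketch of what such a mutating method could look like, one could bind libgmp's in-place-capable addition directly. The `add!` name and the extra binding below are assumptions for illustration (mirroring the `divisible_p` binding shown later in this thread), not current stdlib API:

```crystal
require "big"

# Hypothetical sketch of the mutating-method idea: bind libgmp's
# __gmpz_add_ui (rop may alias op1, so the addition happens in place)
# and expose it as BigInt#add!. Not stdlib API; illustration only.
lib LibGMP
  fun mpz_add_ui = __gmpz_add_ui(rop : MPZ*, op1 : MPZ*, op2 : LibC::ULong)
end

struct BigInt
  def add!(other : Int) : self
    # self converts to MPZ* via to_unsafe; the receiver's limbs are
    # updated in place instead of allocating a fresh BigInt.
    LibGMP.mpz_add_ui(self, self, LibC::ULong.new(other))
    self
  end
end

pc = 5.to_big_i
pc.add!(6)
puts pc # => 11
```

In a loop that repeatedly does `pc += 6`, replacing `+` with something like `add!` would avoid one allocation per iteration, which is where much of the BigInt overhead in the primality test comes from.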


With the upcoming PR https://github.com/crystal-lang/crystal/pull/9825 and some other operations I added, I managed to write this code:

# primetest.cr

require "big"

def prime?(n)
  return n | 1 == 3 if n < 5
  return false if n.gcd(6) != 1
  pc = typeof(n).new(5)
  pc_plus_2 = pc + 2
  pc_pow = BigInt.new(0)
  while true
    pc_pow.set(pc)
    pc_pow.mul!(pc_pow)
    break if pc_pow > n

    return false if n.divisible_by?(pc) || n.divisible_by?(pc_plus_2)
    pc.add!(6)
    pc_plus_2.add!(6)
  end
  true
end

def prime1?(n)
  return n | 1 == 3 if n < 5
  return false if n.gcd(6) != 1
  pc = typeof(n).new(5)
  pc_plus_2 = pc + 2
  sqrtN = typeof(n).new(Math.sqrt(n))
  until pc > sqrtN
    return false if n.divisible_by?(pc) || n.divisible_by?(pc_plus_2)
    pc.add!(6)
    pc_plus_2.add!(6)
  end
  true
end

def tm
  t = Time.monotonic; yield; Time.monotonic - t
end

# 20 digit prime
n = "12345678901234567891".to_big_i # BigInt value
print "\n number = #{n} is prime "; print " in ", tm { print prime?(n) }, " secs"
print "\n number = #{n} is prime "; print " in ", tm { print prime1?(n) }, " secs"
puts

The idea is to avoid, as much as possible, creating new big int instances over and over.

With that code, the times I get are:

 number = 12345678901234567891 is prime true in 00:00:23.027752339 secs
 number = 12345678901234567891 is prime true in 00:00:17.209176363 secs

Compared to the i64 version:

 number = 12345678901234567891 is prime true in 00:00:08.049221991 secs
 number = 12345678901234567891 is prime true in 00:00:06.345145682 secs

It’s unfortunate that you can’t use the same code to handle both Int64 and BigInt (well, you can, but it will be slow), but at least there’s a way to make it pretty fast.

Do those times look more reasonable? I don’t know what other library we could compare against. I know Go has big integers; maybe port the code to Go and see how that behaves.

Thank you @asterite. As always I’ve learned something new from you.

I wasn’t aware of divisible_by?. Just replacing in my original code n % pc == 0, etc, with n.divisible_by?(pc) produces an appreciable speedup.
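For reference, that substitution is a one-line change in the trial-division loop. A minimal standalone check that both forms agree (since 12345678901234567891 is prime, neither 5 nor 7 divides it, so both checks are false):

```crystal
require "big"

n = "12345678901234567891".to_big_i
pc = 5.to_big_i

# original form: % allocates a BigInt remainder, then compares it to zero
old_check = (n % pc == 0) || (n % (pc + 2) == 0)
# replacement: divisible_by? performs the same test
new_check = n.divisible_by?(pc) || n.divisible_by?(pc + 2)

puts old_check == new_check # => true
```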

Original

 number = 12345678901234567891 is prime true in 00:00:10.082019066 secs
 number = 12345678901234567891 is prime true in 00:00:08.972764222 secs

 number = 12345678901234567891 is prime true in 00:01:56.321716363 secs
 number = 12345678901234567891 is prime true in 00:01:38.831483471 secs

Using n.divisible_by?

 number = 12345678901234567891 is prime true in 00:00:06.536580194 secs
 number = 12345678901234567891 is prime true in 00:00:04.819788844 secs

 number = 12345678901234567891 is prime true in 00:01:52.170111907 secs
 number = 12345678901234567891 is prime true in 00:01:34.532012667 sec

When I compile your code (0.35.1) I get the following error.

➜  rosettacode crystal run primetest2.cr --release
Showing last frame. Use --error-trace for full trace.

In primetest2.cr:10:12

 10 | pc_pow.set(pc)
             ^--
Error: undefined method 'set' for BigInt

Yes, for now you can only compile my code on my machine. It isn’t upstream yet, not even in the form of a PR.

I ran this little benchmark to compare divisible_by? with alternatives.

require "big"
require "benchmark"

Benchmark.ips do |x|
  a, n, r = 100000, 1213335, 5
  puts "for i32 value"
  x.report("n % r") { a.times { n % r == 0 } }
  x.report("zero?") { a.times { (n % r).zero? } }
  x.report("divis") { a.times { n.divisible_by?(r) } }
end
puts
Benchmark.ips do |x|
  a, n, r = 100000, 1213335u32, 5
  puts "for u32 value"
  x.report("n % r") { a.times { n % r == 0 } }
  x.report("zero?") { a.times { (n % r).zero? } }
  x.report("divis") { a.times { n.divisible_by?(r) } }
end
puts
Benchmark.ips do |x|
  a, n, r = 100000, 1213335i64, 5
  puts "for i64 value"
  x.report("n % r") { a.times { n % r == 0 } }
  x.report("zero?") { a.times { (n % r).zero? } }
  x.report("divis") { a.times { n.divisible_by?(r) } }
end
puts
Benchmark.ips do |x|
  a, n, r = 100000, 1213335u64, 5
  puts "for u64 value"
  x.report("n % r") { a.times { n % r == 0 } }
  x.report("zero?") { a.times { (n % r).zero? } }
  x.report("divis") { a.times { n.divisible_by?(r) } }
end
puts
Benchmark.ips do |x|
  a, n, r = 100000, "1213335".to_big_i, 5
  puts "for BigInt value"
  x.report("n % r") { a.times { n % r == 0 } }
  x.report("zero?") { a.times { (n % r).zero? } }
  x.report("divis") { a.times { n.divisible_by?(r) } }
end

Here are the results. Not much difference here, so I’m wondering why divisible_by? was so much faster in the previous examples.

for i32 value
n % r  33.42k ( 29.93µs) (± 7.00%)  0.0B/op   1.02× slower
zero?  33.98k ( 29.43µs) (± 2.19%)  0.0B/op        fastest
divis  33.45k ( 29.89µs) (± 6.94%)  0.0B/op   1.02× slower

for u32 value
n % r  33.24k ( 30.09µs) (± 7.96%)  0.0B/op   1.02× slower
zero?  33.97k ( 29.44µs) (± 2.48%)  0.0B/op        fastest
divis  33.50k ( 29.85µs) (± 7.17%)  0.0B/op   1.01× slower

for i64 value
n % r  33.65k ( 29.71µs) (± 6.95%)  0.0B/op   1.01× slower
zero?  34.05k ( 29.37µs) (± 2.49%)  0.0B/op        fastest
divis  33.52k ( 29.83µs) (± 7.29%)  0.0B/op   1.02× slower

for u64 value
n % r  33.63k ( 29.74µs) (± 6.90%)  0.0B/op   1.01× slower
zero?  34.06k ( 29.36µs) (± 2.30%)  0.0B/op        fastest
divis  33.56k ( 29.80µs) (± 6.54%)  0.0B/op   1.01× slower

for BigInt value
n % r 516.57  (  1.94ms) (± 6.68%)  0.0B/op   1.04× slower
zero? 519.32  (  1.93ms) (± 5.12%)  0.0B/op   1.04× slower
divis 538.47  (  1.86ms) (± 4.32%)  0.0B/op        fastest

Because on my local machine I overrode it to use a GMP call that computes whether one number is divisible by another.

I’ll share the code later today.

Here you go:

require "big"
require "benchmark"

lib LibGMP
  fun divisible_p = __gmpz_divisible_p(n : MPZ*, d : MPZ*) : Int
end

struct BigInt
  def fast_divisible_by?(other : BigInt)
    LibGMP.divisible_p(self, other) != 0
  end
end

a = 1240.to_big_i
b = 20.to_big_i

p! a.divisible_by?(b), a.fast_divisible_by?(b)

Benchmark.ips do |x|
  x.report("slow") do
    a.divisible_by?(b)
  end
  x.report("fast") do
    a.fast_divisible_by?(b)
  end
end

Output:

slow  20.85M ( 47.96ns) (± 5.18%)  16.0B/op   4.67× slower
fast  97.29M ( 10.28ns) (± 5.71%)   0.0B/op        fastest

In my code I actually replaced divisible_by? with the optimized code.

Note that the slow version allocates memory (16 B per op). That’s because it divides the two numbers, producing (allocating) a new BigInt which is then compared against zero. The optimized version just lets GMP do it, which, as you can see, doesn’t involve allocating memory on Crystal’s side (and I guess GMP doesn’t need to allocate memory for that either, but I don’t know).

divisible_by? is defined on Int and just does a.remainder(b) == 0. That’s why you don’t see a big difference in your benchmark: you are essentially benchmarking the same code (except that remainder might be slightly faster than %).
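The stdlib definition amounts to the following (paraphrased here as `my_divisible_by?` to avoid clashing with the real method; check `src/int.cr` for the exact source):

```crystal
# Reopen Int with a paraphrase of divisible_by?: it is simply
# remainder(num) == 0, so benchmarking n % r against divisible_by?
# mostly measures % versus remainder.
struct Int
  def my_divisible_by?(num) : Bool
    remainder(num) == 0
  end
end

puts 1213335.my_divisible_by?(5) # => true
puts 1213336.my_divisible_by?(5) # => false
```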

Another note: I started implementing BigInt by binding to LibGMP because I know it’s useful. I never used it, though, so many things right now are unoptimized and incomplete.


As always @asterite you da man! :grin:

Now, is this code going to find its way into 1.0, or before?

Again, I really appreciate you jumping on this issue to improve BigInts performance.

I can’t wait to see the final improvements!

Crystal is going to give all other languages a serious run for the money for numerical processing!!

If you want such code to be in 1.0 please send PRs and contribute. I don’t have time anymore to work on Crystal. So it’s either that, or other people contribute. I don’t think there’s a focus to improve this for 1.0, and maybe not even later on (it seems there are other priorities).

I understand. I’ll see if I can submit some PRs.

I ran my benchmarks with your code and see some interesting differences.
The performance of n % r == 0 and (n % r).zero? is much slower than divisible_by? for i32|i64 input values, but comparable when using u32|u64 values. There may be some room to internally optimize % by converting i32|i64 operands to u32|u64 where mathematically allowable.
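A sketch of that idea, as a hypothetical `fast_mod` helper (not stdlib): when both operands are known non-negative, reinterpret the i64 operands as u64 so the unsigned remainder path can be used, and fall back to the ordinary floored `%` otherwise.

```crystal
# Hypothetical helper, illustration only: reroute non-negative signed
# operands through the unsigned remainder; otherwise use floored %.
def fast_mod(n : Int64, r : Int64) : Int64
  if n >= 0 && r > 0
    (n.to_u64! % r.to_u64!).to_i64!
  else
    n % r
  end
end

puts fast_mod(1213335_i64, 5_i64) # => 0
puts fast_mod(-7_i64, 5_i64)      # => 3 (Crystal's % is floored)
```

Whether the branch actually pays off in practice would need careful benchmarking, especially given the dead-code-elimination caveat raised below.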

require "big"
require "benchmark"

lib LibGMP
  fun divisible_p = __gmpz_divisible_p(n : MPZ*, d : MPZ*) : Int
end

struct BigInt
  def fast_divisible_by?(other : BigInt)
    LibGMP.divisible_p(self, other) != 0
  end
end

a = 1240.to_big_i
b = 20.to_big_i

p! a.divisible_by?(b), a.fast_divisible_by?(b)
puts
Benchmark.ips do |x|
  x.report("slow") { a.divisible_by?(b) }
  x.report("fast") { a.fast_divisible_by?(b) }
end
puts
Benchmark.ips do |x|
  n, r = 1213335i32, 5i32
  puts "for i32 value"
  x.report("n % r") { n % r == 0 }
  x.report("zero?") { (n % r).zero? }
  x.report("divis") { n.divisible_by?(r) }
  #x.report("fastd") { n.fast_divisible_by?(r) }
end
puts
Benchmark.ips do |x|
  n, r = 1213335u32, 5u32
  puts "for u32 value"
  x.report("n % r") { n % r == 0 }
  x.report("zero?") { (n % r).zero? }
  x.report("divis") { n.divisible_by?(r) }
  #x.report("fastd") { n.fast_divisible_by?(r) }
end
puts
Benchmark.ips do |x|
  n, r = 1213335i64, 5i64
  puts "for i64 value"
  x.report("n % r") { n % r == 0 }
  x.report("zero?") { (n % r).zero? }
  x.report("divis") { n.divisible_by?(r) }
  #x.report("fastd") { n.fast_divisible_by?(r) }
end
puts
Benchmark.ips do |x|
  n, r = 1213335u64, 5u64
  puts "for u64 value"
  x.report("n % r") { n % r == 0 }
  x.report("zero?") { (n % r).zero? }
  x.report("divis") { n.divisible_by?(r) }
  #x.report("fastd") { n.fast_divisible_by?(r) }
end
puts
Benchmark.ips do |x|
  n, r = "1213335".to_big_i, "5".to_big_i
  puts "for BigInt value"
  x.report("n % r") { n % r == 0 }
  x.report("zero?") { (n % r).zero? }
  x.report("divis") { n.divisible_by?(r) }
  x.report("fastd") { n.fast_divisible_by?(r) }
end

Timings.

a.divisible_by?(b)      # => true
a.fast_divisible_by?(b) # => true

slow  18.66M ( 53.60ns) (± 3.81%)  16.0B/op   4.54× slower
fast  84.77M ( 11.80ns) (± 3.21%)   0.0B/op        fastest

for i32 value
n % r 233.73M (  4.28ns) (± 2.43%)  0.0B/op   2.11× slower
zero? 268.30M (  3.73ns) (± 4.16%)  0.0B/op   1.84× slower
divis 492.98M (  2.03ns) (± 3.31%)  0.0B/op        fastest

for u32 value
n % r 593.24M (  1.69ns) (± 3.28%)  0.0B/op   1.00× slower
zero? 591.86M (  1.69ns) (± 3.74%)  0.0B/op   1.00× slower
divis 594.57M (  1.68ns) (± 3.31%)  0.0B/op        fastest

for i64 value
n % r  87.94M ( 11.37ns) (± 3.37%)  0.0B/op   6.73× slower
zero?  89.91M ( 11.12ns) (± 2.66%)  0.0B/op   6.58× slower
divis 592.03M (  1.69ns) (± 3.76%)  0.0B/op        fastest

for u64 value
n % r 596.28M (  1.68ns) (± 3.86%)  0.0B/op        fastest
zero? 594.16M (  1.68ns) (± 3.46%)  0.0B/op   1.00× slower
divis 592.02M (  1.69ns) (± 3.62%)  0.0B/op   1.01× slower

for BigInt value
n % r  16.74M ( 59.73ns) (± 3.49%)  16.0B/op   5.08× slower
zero?  16.69M ( 59.91ns) (± 6.08%)  16.0B/op   5.09× slower
divis  18.60M ( 53.76ns) (± 4.80%)  16.0B/op   4.57× slower
fastd  85.03M ( 11.76ns) (± 2.79%)   0.0B/op        fastest

It seems that some of the benchmark results are wrong. LLVM is very good at optimizing, and some operations are simply not executed because their result is not used anywhere in the program. To avoid this, assign the result to a variable and puts its value at the end of the program.

require "benchmark"

res_a = false
res_b = true

Benchmark.ips do |x|
  n, r = 1213335u64, 5u64
  puts "for u64 value"

  # This operation will be skipped by LLVM
  x.report("        n % r") { n % r == 0 }
  # Because the result will be used later in the program, this operation will be executed. 
  x.report("res_a = n % r") {res_a = (n % r == 0) }
  # A dumb assignment, to check that the assignment itself doesn't slow things down.
  x.report("res_b = false") {res_b = false }
end

puts res_a
puts res_b
Output:

for u64 value
        n % r 330.19M (  3.03ns) (±12.24%)  0.0B/op        fastest
res_a = n % r  99.61M ( 10.04ns) (± 8.41%)  0.0B/op   3.31× slower
res_b = false 271.73M (  3.68ns) (±11.53%)  0.0B/op   1.22× slower
true
false

Remember that divisible_by? uses remainder instead of %, and % is much slower than remainder. That’s why divisible_by? comes out faster in your benchmark. You should also benchmark n.remainder(r).

Also check the differences between % and remainder when using negative values.
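A quick worked example of that difference: Crystal's `%` is floored modulo (the result takes the sign of the divisor), while `remainder` truncates toward zero (the result takes the sign of the dividend). For a divisibility check both are zero exactly when the number divides evenly, which is why divisible_by? can safely use the cheaper remainder:

```crystal
# Floored modulo vs truncated remainder on negative operands.
puts -7 % 5            # => 3   (result has the divisor's sign)
puts (-7).remainder(5) # => -2  (result has the dividend's sign)
puts 7 % -5            # => -3
puts 7.remainder(-5)   # => 2
```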