The Crystal Programming Language Forum

How to Parallelize this?

I’ve got a multi-threaded version of this algorithm working in D, Nim, and Rust, and am now trying to do a Crystal version.

The serial section looks like this:

  cnts = [] of typeof(end_num)
  lastwins = [] of typeof(end_num)
  restwins.each_with_index do |r_hi, i|
    l, c = twins_sieve(r_hi, kmin, kmax, kb, start_num, end_num, modpg, primes, resinvrs)
    lastwins << l; cnts << c
    print "\r#{i+1} of #{pairscnt} twinpairs done"
  end

I tried to make it multi-threaded doing this:

  cnts = [] of typeof(end_num)
  lastwins = [] of typeof(end_num)
  restwins.each_with_index do |r_hi, i|  # sieve twinpair restracks
    spawn do 
      l, c = twins_sieve(r_hi, kmin, kmax, kb, start_num, end_num, modpg, primes, resinvrs)
      lastwins << l; cnts << c
      print "\r#{i+1} of #{pairscnt} twinpairs done"
    end
  end

But get this type of error (using -Dpreview-mt)

Unhandled exception: Index out of bounds (IndexError)
  from ???
  from twinprimes_ssoz.cr:101:5 in 'twinprimes_ssoz'
  from /home/jzakiya/crystal/share/crystal/src/crystal/main.cr:106:5 in 'main'
  from __libc_start_main
  from ../sysdeps/x86_64/start.S:122:0 in '_start'
  from ???

But even when I try to simply print out the index like this

  restwins.each_with_index do |r_hi, i|
    spawn do     
      print "\r#{i+1} of #{pairscnt} twinpairs done"
    end
  end

I still get this error

135 of 135 twinpairs doneUnhandled exception: Index out of bounds (IndexError)
  from ???
  from twinprimes_ssoz.cr:101:5 in 'twinprimes_ssoz'
  from /home/jzakiya/crystal/share/crystal/src/crystal/main.cr:106:5 in 'main'
  from __libc_start_main
  from ../sysdeps/x86_64/start.S:122:0 in '_start'
  from ???

Everything compiled as:

$ crystal build test.cr -Ddisable_overflow -Dpreview_mt --release

$ crystal -v
Crystal 0.33.0 [612825a53] (2020-02-14)

LLVM: 8.0.0
Default target: x86_64-unknown-linux-gnu

This part jumps out, you are changing a shared Array, without synchronization. The results can be weird, though I would not think it should break with exception, but possibly give wrong results in occasion.

Can you share a compiling full code?

Here’s the single-threaded version, which produces correct outputs.

Here’s the multi-threaded version with the above problems.

For comparison, here’s the multi-threaded Rust version you can run to see how it should operate (use h/top to see all your system threads working).

 cnts = Array(UInt64).new(restwins.size + 1) {0u64}
  lastwins = Array(UInt64).new(restwins.size + 1) {0u64}
  done = Channel(Nil).new()

  restwins.each_with_index do |r_hi, i|  # sieve twinpair restracks
    spawn do
      l, c = twins_sieve(r_hi, kmin, kmax, kb, start_num, end_num, modpg, primes, resinvrs)
      lastwins[i] = l.to_u64
      cnts[i] = c.to_u64
      print "\r#{i+1} of #{pairscnt} twinpairs done"
      done.send(nil)
    end  
  end
  restwins.size.times do
    done.receive
  end

Something like this should work, though the speedup should not be huge.

The done Channel is used to synchronise the multiple fibers and ensure you only continue after all sieves have been processed

1 Like

Thanks @lribeiro, here’s the cleaned up code I used that’s works. It’s now in the twinprimes_ssoz.cr gist.

  cnts = Array(UInt64).new(pairscnt) {0u64}
  lastwins = Array(UInt64).new(pairscnt) {0u64}
  done = Channel(Nil).new()

  restwins.each_with_index do |r_hi, i|  # sieve twinpair restracks
    spawn do
      l, c = twins_sieve(r_hi, kmin, kmax, kb, start_num, end_num, modpg, primes, resinvrs)
      lastwins[i] = l.to_u64
      cnts[i] = c.to_u64
      print "\r#{i+1} of #{pairscnt} twinpairs done"
      done.send(nil)
    end  
  end
  pairscnt.times { done.receive }  # wait for all threads to finish

Observations:

  • I get the correct results!! :relaxed:
  • the threads run in parallel
  • threaded version x times faster than serial (single thread) version
  • must manually set at runtime CRYSTAL_WORKERS=8 for opt performance
  • performance ~1.6x slower than Rust version for 10^12 (1 trillion) input
  • Rust doesn’t leak mem (no gc), Crystal mem use slowly grows with threads (gc)
  • output pairscnt done counter not monotonic (as in Nim and Rust)
  • this is good algorithm to use to test/develop multi-threading with

Suggestions:

  • create a synch() method to replace doing x.times { done.receive }
  • detect at runtime max system threads to use as default workers as in Rust|Nim
  • fix counting monotonically between threads, as in Rust
  • allow turning off runtime array bounds checking (if not done), as in Rust
  • optimize gc (better yet, eliminate it, as in Rust, Nim getting there)

The key thing for me was to always gets the correct results.
I suspect over time Crystal’s multi-threading will get much faster/simpler.
Iconic Crystal code is much shorter/simpler and (to me) easier to understand.

I’ll continue to play with the code to see if I can make it faster (in all areas).

3 Likes

BTW for those interested in the mathematical basis for the algorithm, see my paper below.

3 Likes

This is an update.

I tweaked the Crystal gist source code so that it now produces correct results over the full unsigned 64-bit integer space. (Some intermediary arithmetic was losing accuracy, had to increase integer types).

Below is some output on my laptop (I7 8 threads, 16GB mem).
The first example starts with the range’s end_num = 2**64 - 1, and I reduce this by one digit on the end for the next two examples. The times get progressively faster because the number of sieving primes become progressively smaller.

As you see, I get this warning about memory block allocation, which doesn’t occur for smaller inputs, but it didn’t prevent the runs from completing and giving correct results.

For the first case, I can modify the sieving details to get around computing the sieving primes (which take allot of time) but right now it’s not a priority, as it requires adding a lot of code for such a minimal use case. It’s an implementation detail to optimize performance, which doesn’t change the structural basis of the algorithm.

I just wanted to make the devs aware of the warnings, to see if they can be eliminated in the future.

➜  crystal-projects CRYSTAL_WORKERS=8 ./twinprimes_ssoz 18446744073709551615 18446744073709000000
threads = 8
using Prime Generator parameters for P5
segment size = 18388 resgroups; seg array is [1 x 288] 64-bits
twinprime candidates = 55164; resgroups = 18388
GC Warning: Repeated allocation of very large block (appr. size 1610616832):
        May lead to memory leak and poor performance
each of 3 threads has nextp[2 x 203280218] array
setup time = 20.48965 secs
perform twinprimes ssoz sieve
1 of 3 twinpairs done
sieve time = 14.048933 secs
total time = 34.538583 secs
last segment = 18388 resgroups; segment slices = 1
total twins = 361; last twin = 18446744073709550772+/-1

➜  crystal-projects CRYSTAL_WORKERS=8 ./twinprimes_ssoz 1844674407370955161 1844674407370900000 
threads = 8
using Prime Generator parameters for P5
segment size = 1839 resgroups; seg array is [1 x 29] 64-bits
twinprime candidates = 5517; resgroups = 1839
each of 3 threads has nextp[2 x 67999020] array
setup time = 6.077505 secs
perform twinprimes ssoz sieve
GC Warning: Repeated allocation of very large block (appr. size 1087987712):
        May lead to memory leak and poor performance
1 of 3 twinpairs done
sieve time = 4.763295 secs
total time = 10.8408 secs
last segment = 1839 resgroups; segment slices = 1
total twins = 41; last twin = 1844674407370954350+/-1

➜  crystal-projects CRYSTAL_WORKERS=8 ./twinprimes_ssoz 184467440737095516 184467440737090000 
threads = 8
using Prime Generator parameters for P5
segment size = 185 resgroups; seg array is [1 x 3] 64-bits
twinprime candidates = 555; resgroups = 185
each of 3 threads has nextp[2 x 22822678] array
setup time = 1.730782 secs
perform twinprimes ssoz sieve
GC Warning: Repeated allocation of very large block (appr. size 365166592):
        May lead to memory leak and poor performance
1 of 3 twinpairs done
sieve time = 1.598124 secs
total time = 3.328906 secs
last segment = 185 resgroups; segment slices = 1
total twins = 3; last twin = 184467440737093488+/-1