Perplexing spawn arithmetic overflow errors

I’ve done everything I can think of to understand what’s causing this, and what I’m left to conclude is when you try to spawn too many threads something in the implementation is broken.

In the example below, the input is 15_000_000_000_000, which uses a generator which causes 1485 threads to be spawned. This works correctly.

➜ CRYSTAL_WORKERS=8 ./twinprimes_ssoz 15000000000000
threads = 8
using Prime Generator parameters for P13
segment size = 1572864 resgroups; seg array is [1 x 24576] 64-bits
twinprime candidates = 741758242500; resgroups = 499500500
each of 1485 threads has nextp[2 x 274802] array
setup time = 0.010818 secs
perform twinprimes ssoz sieve
1485 of 1485 twinpairs done
sieve time = 1040.380192 secs
total time = 1040.39101 secs
last segment = 902612 resgroups; segment slices = 318
total twins = 23097407328; last twin = 14999999998692+/-1

But when the input is 16_000_000_000_000, it switches to using a larger generator that spawns off 22,275 threads, which causes the errors to begin after reaching about 179 threads/spawns.

➜ CRYSTAL_WORKERS=8 ./twinprimes_ssoz 16000000000000                        
threads = 8
using Prime Generator parameters for P17
segment size = 1572864 resgroups; seg array is [1 x 24576] 64-bits
twinprime candidates = 698125408200; resgroups = 31341208
each of 22275 threads has nextp[2 x 283139] array
setup time = 0.032408 secs
perform twinprimes ssoz sieve
175 of 22275 twinpairs doneUnhandled exception in spawn: Arithmetic overflow (OverflowError)
  from ???
  from ???
  from ???
  from ???
  from ???
  from ???
Unhandled exception in spawn: Arithmetic overflow (OverflowError)
  from ???
  from ???
  from ???
  from ???
  from ???
  from ???
Unhandled exception in spawn: Arithmetic overflow (OverflowError)

I’ve verified over and over that the internal math of the spawned code does not overflow. I’ve run the code with larger inputs, but restricted it to the generator that spawns the 1485 threads, and it produces correct outputs.

There is something happening with spawning larger number of threads that’s causing the problems, and I haven’t been able to restrict the process to the offending issue(s).

I think this is something that needs to be understood and fixed before 1.0, because really, 22K threads is not allot in many numerical algorithms, like this.

Here’s the code for just the spawning of the multi-threads.

cnts = Array(UInt64).new(pairscnt, 0)
lastwins = Array(UInt64).new(pairscnt, 0)
done = Channel(Nil).new(pairscnt)

restwins.each_with_index do |r_hi, i|
  spawn do
    lastwins[i], cnts[i] = twins_sieve(r_hi, kmin, kmax, kb, start_num, end_num, modpg, primes, resinvrs)
    print "\r#{i + 1} of #{pairscnt} twinpairs done"
    done.send(nil)
end end
pairscnt.times { done.receive }  # wait for all threads to finish

Here’s the source code, with compiling and running instructions at the top.

Does the error reproduce without -Dpreview_mt? If yes, this might be relevant for 1.0. But multithreading is currently just a preview.

Yes, it also crashes for the single-threaded version, compiled without -Dpreview_mt

Below is code output for single-threaded versions for same output.

➜  ./twinprimes_ssoz_ser 16000000000000
threads = 8
using Prime Generator parameters for P17
segment size = 1572864 resgroups; seg array is [1 x 24576] 64-bits
twinprime candidates = 698125408200; resgroups = 31341208
each of 22275 threads has nextp[2 x 283139] array
setup time = 0.033134 secs
perform twinprimes ssoz sieve
179 of 22275 twinpairs doneUnhandled exception: Arithmetic overflow (OverflowError)
  from twinprimes_ssoz_optser.cr:166:16 in 'nextp_init'
  from twinprimes_ssoz_optser.cr:194:11 in 'twinprimes_ssoz'
  from /home/jzakiya/crystal/share/crystal/src/crystal/main.cr:110:5 in 'main'
  from __libc_start_main
  from ../sysdeps/x86_64/start.S:122:0 in '_start'
  from ???

Here is the code for the single-threaded version.

Oh, nice so the single-threaded version doesn’t even use fibers (ofc that would be silly for a computation-bound algorithm). I tried the concurrent version with CRYSTAL_WORKERS=1 and without -Dpreview_mt and the arithemtic overflow reproduces in both cases.

But the no-concurrency version is best for isolating the problem. It suggest that this has nothing to do with threads or fibers because there are none. I even compiled without the fiber cleanup loop, so there is really no concurrency involved. The no-concurrency version with argument 16000000000000 always fails after 175 of 22275 twinpairs are done. And the arithmetic exception always raises at twinprimes_ssoz_ser.cr:166:16 in 'nextp_init'. (always = all the times I ran it)

Note: I saw a GC allocation warning at some point, but I suppose this was just a side effect of many fibers + many error messages simultaneously.

I checked that there shouldn’t (couldn’t) be any overflows in nextp_int because when using the larger generator all those values are much smaller than for the smaller generator (that’s one reason to use the larger generator, you trade off using more threads/sieve iterations, but the range values become much smaller).

No other language version (D, Go, Nim, Rust) show this phenomena, so I know the code is correct. There’s something about the Crystal implementation that is snagging at that point, and I haven’t found a way to isolate it in the code to explicitly reproduce it directly.

I added reporting on the values of prime and modpg and now the error appears three lines down.

The same behaviour reproduces without --release. It’s at 175 of 22275 again (no-concurrency version, arg 16000000000000), but in a different line of nextp_init: twinprimes_ssoz_ser.cr:172:16 in 'nextp_init'.

I changed the loop to only print out the |i, r_hi| pairs, to see what value it crashes at. Nothing in those values provides me a useful clue why it crashes at that point.

Here’s the code change.

restwins.each_with_index do |r_hi, i|
    print "i = #{i}, r_hi = #{r_hi}| "
    l, c = twins_sieve(r_hi, kmin, kmax, kb, start_num, end_num, modpg, primes, resinvrs)
    lastwins << l; cnts << c
    #print "\r#{i + 1} of #{pairscnt} twinpairs done"
  end

Here’s the output upto when it crashes.

./twinprimes_ssoz_optsertest 16000000000000    
threads = 8
using Prime Generator parameters for P17
segment size = 1572864 resgroups; seg array is [1 x 24576] 64-bits
twinprime candidates = 698125408200; resgroups = 31341208
each of 22275 threads has nextp[2 x 283139] array
setup time = 0.032291 secs
perform twinprimes ssoz sieve
i = 0, r_hi = 31| i = 1, r_hi = 43| i = 2, r_hi = 61| i = 3, r_hi = 73|
i = 4, r_hi = 103| i = 5, r_hi = 109| i = 6, r_hi = 139| 
i = 7, r_hi = 151| i = 8, r_hi = 181| i = 9, r_hi = 193| 
i = 10, r_hi = 199| i = 11, r_hi = 229| i = 12, r_hi = 241| 
i = 13, r_hi = 271| i = 14, r_hi = 283| i = 15, r_hi = 313| 
i = 16, r_hi = 349| i = 17, r_hi = 361| i = 18, r_hi = 421| 
i = 19, r_hi = 433| i = 20, r_hi = 439| i = 21, r_hi = 463| 
i = 22, r_hi = 523| i = 23, r_hi = 571| i = 24, r_hi = 589| 
i = 25, r_hi = 601| i = 26, r_hi = 619| i = 27, r_hi = 643| 
i = 28, r_hi = 661| i = 29, r_hi = 703| i = 30, r_hi = 811| 
i = 31, r_hi = 823| i = 32, r_hi = 829| i = 33, r_hi = 841| 
i = 34, r_hi = 853| i = 35, r_hi = 859| i = 36, r_hi = 883| 
i = 37, r_hi = 943| i = 38, r_hi = 991| i = 39, r_hi = 1009| 
i = 40, r_hi = 1021| i = 41, r_hi = 1033| i = 42, r_hi = 1051| 
i = 43, r_hi = 1063| i = 44, r_hi = 1093| i = 45, r_hi = 1123| 
i = 46, r_hi = 1153| i = 47, r_hi = 1189| i = 48, r_hi = 1219| 
i = 49, r_hi = 1231| i = 50, r_hi = 1249| i = 51, r_hi = 1273| 
i = 52, r_hi = 1279| i = 53, r_hi = 1291| i = 54, r_hi = 1303| 
i = 55, r_hi = 1321| i = 56, r_hi = 1363| i = 57, r_hi = 1369| 
i = 58, r_hi = 1429| i = 59, r_hi = 1453| i = 60, r_hi = 1459| 
i = 61, r_hi = 1483| i = 62, r_hi = 1489| i = 63, r_hi = 1501| 
i = 64, r_hi = 1543| i = 65, r_hi = 1579| i = 66, r_hi = 1609| 
i = 67, r_hi = 1621| i = 68, r_hi = 1669| i = 69, r_hi = 1681| 
i = 70, r_hi = 1693| i = 71, r_hi = 1699| i = 72, r_hi = 1711| 
i = 73, r_hi = 1723| i = 74, r_hi = 1741| i = 75, r_hi = 1789| 
i = 76, r_hi = 1831| i = 77, r_hi = 1849| i = 78, r_hi = 1873| 
i = 79, r_hi = 1879| i = 80, r_hi = 1891| i = 81, r_hi = 1909| 
i = 82, r_hi = 1933| i = 83, r_hi = 1951| i = 84, r_hi = 1999| 
i = 85, r_hi = 2029| i = 86, r_hi = 2071| i = 87, r_hi = 2083| 
i = 88, r_hi = 2089| i = 89, r_hi = 2113| i = 90, r_hi = 2131| 
i = 91, r_hi = 2143| i = 92, r_hi = 2203| i = 93, r_hi = 2209| 
i = 94, r_hi = 2239| i = 95, r_hi = 2269| i = 96, r_hi = 2281| 
i = 97, r_hi = 2293| i = 98, r_hi = 2311| i = 99, r_hi = 2341| 
i = 100, r_hi = 2371| i = 101, r_hi = 2383| i = 102, r_hi = 2413| 
i = 103, r_hi = 2419| i = 104, r_hi = 2449| i = 105, r_hi = 2461| 
i = 106, r_hi = 2479| i = 107, r_hi = 2491| i = 108, r_hi = 2503| 
i = 109, r_hi = 2539| i = 110, r_hi = 2551| i = 111, r_hi = 2581| 
i = 112, r_hi = 2593| i = 113, r_hi = 2623| i = 114, r_hi = 2659| 
i = 115, r_hi = 2689| i = 116, r_hi = 2701| i = 117, r_hi = 2713| 
i = 118, r_hi = 2731| i = 119, r_hi = 2749| i = 120, r_hi = 2791| 
i = 121, r_hi = 2803| i = 122, r_hi = 2833| i = 123, r_hi = 2869| 
i = 124, r_hi = 2881| i = 125, r_hi = 2911| i = 126, r_hi = 2923| 
i = 127, r_hi = 2929| i = 128, r_hi = 2971| i = 129, r_hi = 3001| 
i = 130, r_hi = 3013| i = 131, r_hi = 3121| i = 132, r_hi = 3139| 
i = 133, r_hi = 3151| i = 134, r_hi = 3163| i = 135, r_hi = 3169| 
i = 136, r_hi = 3193| i = 137, r_hi = 3253| i = 138, r_hi = 3259| 
 = 139, r_hi = 3301| i = 140, r_hi = 3319| i = 141, r_hi = 3331| 
i = 142, r_hi = 3361| i = 143, r_hi = 3373| i = 144, r_hi = 3391| 
i = 145, r_hi = 3403| i = 146, r_hi = 3433| i = 147, r_hi = 3463| 
i = 148, r_hi = 3469| i = 149, r_hi = 3529| i = 150, r_hi = 3541| 
i = 151, r_hi = 3559| i = 152, r_hi = 3571| i = 153, r_hi = 3583| 
i = 154, r_hi = 3613| i = 155, r_hi = 3631| i = 156, r_hi = 3673| 
i = 157, r_hi = 3721| i = 158, r_hi = 3739| i = 159, r_hi = 3763| 
i = 160, r_hi = 3769| i = 161, r_hi = 3781| i = 162, r_hi = 3799| 
i = 163, r_hi = 3823| i = 164, r_hi = 3853| i = 165, r_hi = 3919| 
i = 166, r_hi = 3931| i = 167, r_hi = 3979| i = 168, r_hi = 4003| 
i = 169, r_hi = 4009| i = 170, r_hi = 4021| i = 171, r_hi = 4033| 
i = 172, r_hi = 4051| i = 173, r_hi = 4093| i = 174, r_hi = 4129| 
i = 175, r_hi = 4141| i = 176, r_hi = 4159| i = 177, r_hi = 4183| 
i = 178, r_hi = 4189| i = 179, r_hi = 4219| 
Unhandled exception: Arithmetic overflow (OverflowError)
  from twinprimes_ssoz_optsertest.cr:166:16 in 'nextp_init'
  from twinprimes_ssoz_optsertest.cr:194:11 in 'twinprimes_ssoz'
  from /home/jzakiya/crystal/share/crystal/src/crystal/main.cr:110:5 in 'main'
  from __libc_start_main
  from ../sysdeps/x86_64/start.S:122:0 in '_start'
  from ???

I found what’s causing the problem, but don’t know why it’s happening.

I figured it must have been a divide by zero, so I looked at the restwins values past i = 179 and this is what shows.

threads = 8
using Prime Generator parameters for P17
segment size = 1572864 resgroups; seg array is [1 x 24576] 64-bits
twinprime candidates = 698125408200; resgroups = 31341208
each of 22275 threads has nextp[2 x 283139] array
setup time = 0.035199 secs
perform twinprimes ssoz sieve
modpg = 510510, resinvrs[178] = 0, resinvrs[179] = 253829, resinvrs[180] = 0
^Z
[6]  + 27726 suspended  ./twinprimes_ssoz_optsertest 16000000000000
,

This is causing the problem at: r_inv = resinvrs[r] because for some r its inverse is equated to zero (0).

OK, let me try to see why that’s happening now. :face_with_head_bandage:

Drat, I showed the wrong array values (those were good).
Let me go to bed and look at this with fresh eyes and brain tomorrow.

OK.
What is empirically known is that when the code uses P17, for which restwins has 22275 twinpairs/pairscnt, its inverse array resinvrs has size 510512, which is passed to nextp_init, problems start to occur. I don’t think its the size of the values that are the problem, I think it has something to do with the size of the arrays.

So I stripped out using resinvrs (from the code), and used modinv directly in nextp_init to create the residue inverses. I also stripped out using array indexing for the next array in nextp_init, and just did next << ki.

Here’s the typed of output it now produces. Notice, it’s even crashing for small input values.

➜  ./twinprimes_ssoz_optsertest 16000000000000
threads = 8
using Prime Generator parameters for P17
segment size = 1572864 resgroups; seg array is [1 x 24576] 64-bits
twinprime candidates = 698125408200; resgroups = 31341208
each of 22275 threads has nextp[2 x 283139] array
setup time = 0.036773 secs
perform twinprimes ssoz sieve
modpg = 510510, restwins[178] = 4189, restwins[179] = 4219, restwins[180] = 4231
restwins.min = 31, restwins.max = 510511, restwins.size = 22275
resinvrs.min = 0, resinvrs.max = 510509, resinvrs.size = 510512
primes.min = 19, primes.max = 3999971, primes.size = 283139
179 of 22275 twinpairs donei = 179, r_hi = 4219| 
Unhandled exception: Arithmetic overflow (OverflowError)
  from /home/jzakiya/crystal/share/crystal/src/gc/boehm.cr:110:5 in 'twinprimes_ssoz'
  from /home/jzakiya/crystal/share/crystal/src/crystal/main.cr:110:5 in 'main'
  from __libc_start_main
  from ../sysdeps/x86_64/start.S:122:0 in '_start'
  from ???
➜  ./twinprimes_ssoz_optsertest 1000000       
threads = 8
using Prime Generator parameters for P5
segment size = 33334 resgroups; seg array is [1 x 521] 64-bits
twinprime candidates = 100002; resgroups = 33334
each of 3 threads has nextp[2 x 165] array
setup time = 0.000203 secs
perform twinprimes ssoz sieve
Unhandled exception: Index out of bounds (IndexError)
  from twinprimes_ssoz_optsertest.cr:228:8 in 'twinprimes_ssoz'
  from /home/jzakiya/crystal/share/crystal/src/crystal/main.cr:110:5 in 'main'
  from __libc_start_main
  from ../sysdeps/x86_64/start.S:122:0 in '_start'
  from ???

The output shows GC|Boehm in its listing.

How do I compile with no GC (turn off the GC)?

https://crystal-lang.org/reference/using_the_compiler/index.html

crystal build -Dgc_none

In this line:

ri = (r_hi * r_inv - 2) % modpg + 2 # compute r's ri for r_hi

I see these values:

typeof(r_hi)                 # => Int32
typeof(r_inv)                # => Int32
r_hi                         # => 4219
r_inv                        # => 509129
modpg                        # => 510510
r_hi * r_inv                 # => Unhandled exception: Arithmetic overflow (OverflowError)

I tried without GC and it had the same effect.

@asterite Owell, that’s really expected to fail. Maybe the confusing problem is just that the stack trace is inaccurate because it points to six lines earlier?

Oh, I see. Yes, backtraces generally don’t work well. I just don’t think it’s related to overflow errors, it happens everywhere.

@asterite if we are ever able to meet in the future I’ll buy you dinner! :blush:

You probably just saved me hours of continued anguish and depression, and the continued demise of my diminishing cognitive capabilities. :smile:

2**31 - 1     = 2,147,783,647
509129 * 4219 = 2,148,015,251 -> overflow for Int32

The fixes are very simple (now that I exactly know the problem), but here’s a simple one (least amount of structural code change). It only requires three lines to be change.

These lines are in the nextp_init method.

# original, which causes problem
r_inv = resinvrs[r]                # an Int32
ri = (r_lo* r_inv - 2) % modpg + 2 # an Int32 result
...
ri = (r_hi* r_inv - 2) % modpg + 2 # an Int32 result

# a simple fix, to make math return UInt64 result
r_inv = resinvrs[r].to_u64          # now UInt64
ri = (r_inv * r_lo - 2) % modpg + 2 # switch mult order
....
ri = (r_inv * r_hi - 2) % modpg + 2 # switch mult order

And the final proof!!

➜ CRYSTAL_WORKERS=8 ./twinprimes_ssoz 16000000000000     
threads = 8
using Prime Generator parameters for P17
segment size = 1572864 resgroups; seg array is [1 x 24576] 64-bits
twinprime candidates = 698125408200; resgroups = 31341208
each of 22275 threads has nextp[2 x 283139] array
setup time = 0.061425 secs
perform twinprimes ssoz sieve
22275 of 22275 twinpairs done
sieve time = 1066.045843 secs
total time = 1066.107268 secs
last segment = 1456792 resgroups; segment slices = 20
total twins = 24528698600; last twin = 15999999998640+/-1

FYI: 1066 secs = 17mins 46secs

Wow, I have the whole day in front of me now, to do something fun with! :star_struck:

@asterite, U da man!

1 Like

Yay! :tada:

Just in case it’s not clear: the backtrace situation is far from ideal right now. Well, more close to the opposite of that :stuck_out_tongue: . The way I found out it was that was:

  • skip the loop when i < 175 so it reaches that point faster
  • add p 1, p 2, p 3, etc., between every pair of lines I suspected it would be crashing
  • run the program, see what’s the maximum number

It’s tedious but it works. And again, totally not ideal.

3 Likes

FYI, I’ve updated both gists, so they should produce correct results now over the full 64-bit range.

Also, it would be nice to update the compiler reference page to include how to compile without the GC: -Dgc_none.

https://crystal-lang.org/reference/using_the_compiler/index.html

1 Like

FYI on benchmarks for this algorithm between different Crystal versions.
I found (for this use case) it performs faster on 0.34 than for (current) 0.36.1.
The *.exe, though, is somewhat smaller for 0.36.1 than for 0.34.

$ crystal build twinprimes_ssoz.cr -Dpreview_mt --release
$ CRYSTAL_WORKERS=8 ./twinprimes_ssoz 1000000000000

    Input     |     0.34   |    0.36.1
--------------|------------|-----------
1000000000000 |  55.9 secs |  57.2 secs
5000000000000 | 306.5 secs | 319.1 secs

Just a small note: none of the two snippets you provided a bit above compile.