Serious 1.11 regressions with multi-threading

As usual, I test/benchmark each new release with some of my standard apps.

Unfortunately, I’ve found that 1.11 has serious regressions, and in fact breaks multi-threading.

Using this parallel twin primes sieve program, 1.11 is much slower than 1.10.1, and it also now stops processing inputs in parallel once they become “too big” (reach a certain size).

Examples: this input is much slower using 1.11:

$ CRYSTAL_WORKERS=16 ./twinprimes_ssoz 10_000_000_000_000

And this input now breaks multi-threading:

$ CRYSTAL_WORKERS=16 ./twinprimes_ssoz 100_000_000_000_000

EDIT: I have 2 versions of this implementation, one that uses a bit_array and one that uses a regular array. The gist shows the bit_array version runs in parallel, but the one using a regular array breaks for this input. Both run much slower with 1.11 vs 1.10.1.

3 Likes

Could you please clarify how “breaks multi-threading” manifests? Does the program crash? Or does it deadlock?

There are a couple of minor changes related to multi-threading in 1.11; any of them could have an effect. We’ll have to find out which one is at fault.

Here’s the gist of the version that uses a regular array instead of a bit_array.

It will hang (never start?) the multi-threaded processing for certain input values, requiring a hard kill to stop it.

Here’s its output for a value that does it:

➜  crystal-projects CRYSTAL_WORKERS=16 ./twinprimes_ssoz 100_000_000_000_000
threads = 16
using Prime Generator parameters for P17
segment size = 1572864 resgroups; seg array is [1 x 24576] 64-bits
twinprime candidates = 4363283778975; resgroups = 195882549
each of 22275 threads has nextp[2 x 664572] array
setup time = 0.025751 secs
perform twinprimes ssoz sieve
1 of 22275 twinpairs done       

It never moves past this point, and needs a CTRL-Z, etc., to stop it.
Using htop shows it’s running on only 1 thread, instead of using all the threads as it normally does.

However, this behavior didn’t start with 1.11.
I posted about this before, and it never got fixed.

As 1.11 only came out yesterday (Tue Jan 9, 2024), I haven’t had a chance to do extensive testing yet, but from the little I’ve been able to do, 1.11 seems to be much more data dependent than all the other versions.

For example, for an input of 10_000_000_000_000 using the bit_array version, 1.10.1 takes ~155 secs while 1.11 takes ~177 secs. That is consistent over multiple runs. However, for some other (smaller) values, 1.11 is still slower, but not by as much.

And another thing: the 1.11.0-dev-1 nightly is much faster (besides having a smaller stripped binary).

Even after the limited testing I’ve been able to do so far, it seems clear that 1.11 does not produce consistently better results than 1.10.1 for this class of programs I have.

I just built Crystal 1.11.0 from sources (make release=1 interpreter=1 FLAGS=-Dpreview_mt with LLVM 16 on Ubuntu 20.04).

I built the array and bitarray versions of your program with --release -Dpreview_mt and I can’t get the process to hang. It’s heavily burning my laptop’s CPU and it’s taking forever to run, so I killed it at 105 twinpairs (for 100T) and 234 (for 10T). All my CPU cores were busy.

Note: I’d still recommend replacing the Channel(Nil) with a Channel({Int32, UInt64, UInt64}) to send the calculations, so the main fiber can safely mutate the lastwins and cnts arrays, in addition to waiting for all fibers to be done. For example:

chan = Channel({Int32, UInt64, UInt64}).new(pairscnt)

restwins.each_with_index do |r_hi, i|
  spawn do
    twin, cnt = twins_sieve(...)
    chan.send({i, twin, cnt})  # only the main fiber mutates lastwins/cnts
  end
end

pairscnt.times do
  i, twin, cnt = chan.receive
  lastwins[i] = twin
  cnts[i] = cnt
end

I’d love it if you could investigate the performance regression and give us some pointers as to what actually caused your program to slow down. I fail to see why MT would be the culprit, as your program doesn’t do anything concurrent; the fibers are CPU-bound and won’t yield until they’re done, so the threads merely allow more fibers to run at once.
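To illustrate (a toy sketch, not taken from your program): a purely CPU-bound fiber only interacts with the scheduler when it sends its result, so with -Dpreview_mt and N workers, up to N such fibers simply run to completion in parallel.

done = Channel(Int64).new

8.times do
  spawn do
    sum = 0_i64
    (1_i64..100_000_000_i64).each { |n| sum &+= n } # pure computation: no I/O, no yield points
    done.send(sum)                                  # first place this fiber can switch out
  end
end

8.times { done.receive }
puts "all fibers finished"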

1 Like

I’m wondering if memory isn’t your problem.

Your program spawns an unbounded number of fibers, which requires allocating a bunch of small objects in memory plus as many fiber stacks, each at least the OS page size (4KB on Linux). This can be replaced with a fixed number of fibers, for instance the number of CPUs, using a Channel to push the work.
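Here is a minimal, self-contained sketch of that pattern (the identifiers are illustrative, not taken from your program); the diff further down applies the same idea to the sieve:

jobs    = (1..100).to_a                    # work items to process
workers = System.cpu_count.to_i            # one worker fiber per CPU
chan    = Channel(Int32).new(workers * 2)  # bounded queue of pending work
done    = Channel(Float64).new(workers)    # results flow back to the main fiber

workers.times do
  spawn do
    while job = chan.receive?              # nil once the channel is closed => fiber exits
      done.send(Math.sqrt(job.to_f))       # stand-in for the real CPU-bound work
    end
  end
end

spawn do                                   # feed the queue from its own fiber
  jobs.each { |job| chan.send(job) }
  chan.close                               # lets the workers' receive? return nil
end

total = 0.0
jobs.size.times { total += done.receive }  # main fiber aggregates the results safely
puts total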

You also don’t need i nor the lastwins and cnts arrays: you can safely do that computation as you receive the values over the done channel.

Last but not least, the concurrent atomic add + print from each fiber is killing the performance. Reporting after done.receive allows skipping the atomic and may avoid the OS locking stdout (writes to stdout below PIPE_BUF are guaranteed to be atomic on POSIX targets, hence concurrent writes block each other).

All these changes reduce the execution time from over a minute down to 25s for 100_000_000_000 on my laptop with 8 crystal workers (== System.cpu_count).

237c237,238
<   puts "threads = #{System.cpu_count}"
---
>   nthreads = System.cpu_count.to_i32
>   puts "threads = #{nthreads}"
256,258c257,259
<   cnts = Array(UInt64).new(pairscnt, 0)  # number of twinprimes found per thread
<   lastwins = Array(UInt64).new(pairscnt, 0) # largest twinprime val for each thread
<   done = Channel(Nil).new(pairscnt)
---
>   # start a fixed number of fibers to avoid bloating memory
>   chan = Channel(Int32).new(nthreads * 32)
>   done = Channel({UInt64, UInt64}).new(nthreads) # NOTE: increase buffer size => faster but reports every once in a while
260,261c261
<   threadscnt = Atomic.new(0)             # count of finished threads
<   restwins.each_with_index do |r_hi, i|  # sieve twinpair restracks
---
>   nthreads.times do
263,268c263,282
<       lastwins[i], cnts[i] = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
<       print "\r#{threadscnt.add(1)} of #{pairscnt} twinpairs done"
<       done.send(nil)
<   end end
<   pairscnt.times { done.receive }        # wait for all threads to finish
<   print "\r#{pairscnt} of #{pairscnt} twinpairs done"
---
>       while r_hi = chan.receive?         # sieve twinpair restracks
>         twin, cnt = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
>         done.send({twin, cnt})
>       end
>     end
>   end
>
>   # push work from another channel to avoid blocking main fiber
>   spawn do
>     restwins.each { |r_hi| chan.send(r_hi) }
>   end
>
>   threadscnt = 0
>   last_twin = 0_u64
>   pairscnt.times do
>     twin, cnt = done.receive
>     last_twin = twin if twin > last_twin # find largest hi_tp twinprime in range
>     twinscnt += cnt                      # compute number of twinprimes in range
>     print "\r#{threadscnt += 1} of #{pairscnt} twinpairs done"
>   end
270,271d283
<   last_twin = lastwins.max               # find largest hi_tp twinprime in range
<   twinscnt += cnts.sum                   # compute number of twinprimes in range
283a296
>
5 Likes

I just applied your changes.
It compiles with no errors, but hangs where the threads are supposed to run.
I kind of understand the theory of it, but it would be nice if you would explain it in detail.
This is definitely something that isn’t covered in the documentation: the whys and hows.

Here’s my code section implemented with your changes.
Where did I mess up?

# start a fixed number of fibers to avoid bloating memory
  chan = Channel(Int32).new(nthreads * 32)
  done = Channel({UInt64, UInt64}).new(nthreads) # NOTE: increase buffer size => faster but reports every once in a while

  nthreads.times do
    while r_hi = chan.receive?         # sieve twinpair restracks
      twin, cnt = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
      done.send({twin, cnt})
  end end

  # push work from another channel to avoid blocking main fiber
  spawn do
    restwins.each { |r_hi| chan.send(r_hi) }
  end

  threadscnt = 0
  last_twin  = 0_u64
  pairscnt.times do
    twin, cnt = done.receive
    last_twin = twin if twin > last_twin # find largest hi_tp twinprime in range
    twinscnt += cnt                      # compute number of twinprimes in range
    print "\r#{threadscnt += 1} of #{pairscnt} twinpairs done"
  end
  print "\r#{pairscnt} of #{pairscnt} twinpairs done"

OK, it never runs because in the nthreads.times do loop the while loop never runs: no channel gets activated before it asks for data, so it just sits there.

The spawned twins_sieve threads have to start first, but this configuration doesn’t give them a chance.
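Comparing against the diff above, the missing piece looks like the spawn do around the while loop (it would have been unchanged context in the diff, so it never shows up with a > marker). A sketch of that section with the spawn put back in:

  nthreads.times do
    spawn do                             # each worker runs in its own fiber...
      while r_hi = chan.receive?         # ...so this receive blocks the worker, not the main fiber
        twin, cnt = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
        done.send({twin, cnt})
      end
    end
  end

With the workers inside fibers, the main fiber gets past this loop, reaches the spawn that feeds chan, and then sits in the done.receive loop collecting results.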

1 Like