Serious 1.11 regressions with multi-threading

As usual, I test/benchmark each new release with some of my standard apps.

Unfortunately, I’ve found that 1.11 has serious regressions, and in fact breaks multi-threading.

Using this parallel twin primes sieve program, 1.11 is much slower than 1.10.1, and it also now stops processing inputs in parallel once they become “too big” (reach a certain size).

Examples: this input is much slower using 1.11:

$ CRYSTAL_WORKERS=16 ./twinprimes_ssoz 10_000_000_000_000

And this input now breaks multi-threading:

$ CRYSTAL_WORKERS=16 ./twinprimes_ssoz 100_000_000_000_000

EDIT: I have 2 versions of this implementation, one that uses a bit_array and one that uses a regular array. The gist shows the bit_array version runs in parallel, but the one using a regular array breaks for this input. Both run much slower with 1.11 vs 1.10.1.

3 Likes

Could you please clarify how “breaks multi-threading” manifests? Does the program crash? Or does it deadlock?

There are a couple of minor changes related to multi-threading in 1.11; any of them could have an effect. We’ll have to find out which one is at fault.

Here’s the gist of the version that uses a regular array instead of a bit_array.

It will hang (never start?) the multi-threaded processing for certain input values, requiring a hard kill to stop it.

Here’s its output for a value that does it:

➜  crystal-projects CRYSTAL_WORKERS=16 ./twinprimes_ssoz 100_000_000_000_000
threads = 16
using Prime Generator parameters for P17
segment size = 1572864 resgroups; seg array is [1 x 24576] 64-bits
twinprime candidates = 4363283778975; resgroups = 195882549
each of 22275 threads has nextp[2 x 664572] array
setup time = 0.025751 secs
perform twinprimes ssoz sieve
1 of 22275 twinpairs done       

It never moves past this point, and needs a CTRL-Z, etc., to stop it.
Using htop shows it’s running on only 1 thread, instead of using all the threads as it normally does.

However, this behavior didn’t start with 1.11.
I posted about this before, and it never got fixed.

As 1.11 only came out yesterday (Tue Jan 9, 2024), I haven’t had a chance to do extensive testing yet, but from the little I’ve been able to do, 1.11 seems to be much more data dependent than all the other versions.

For example, for an input of 10_000_000_000_000 using the bit_array version, 1.10.1 takes ~155 secs while 1.11 takes ~177 secs. That is consistent over multiple runs. However, for some other (smaller) values, 1.11 is still slower, but not by as much.

And another thing: the 1.11.0-dev-1 nightly is much faster (besides having a smaller stripped binary).

Even after the limited testing I’ve been able to do so far, it seems clear that 1.11 does not produce consistently better results than 1.10.1 for this class of programs I have.

I just built Crystal 1.11.0 from sources (make release=1 interpreter=1 FLAGS=-Dpreview_mt with LLVM 16 on Ubuntu 20.04).

I built the array and bitarray versions of your program with --release -Dpreview_mt and I can’t get the process to hang. It’s heavily burning my laptop’s CPU and it’s taking forever to run, so I killed it at 105 twinpairs (for 100T) and 234 (for 10T). All my CPU cores were busy.

Note: I’d still recommend replacing the Channel(Nil) with a Channel({Int32, UInt64, UInt64}) to send the calculations, so the main fiber can safely mutate the lastwins and cnts arrays, in addition to waiting for all fibers to be done. For example:

chan = Channel({Int32, UInt64, UInt64}).new(pairscnt)

restwins.each_with_index do |r_hi, i|
  spawn do
    twin, cnt = twins_sieve(...)
    chan.send({i, twin, cnt})  # only the main fiber mutates lastwins/cnts
  end
end

pairscnt.times do
  i, twin, cnt = chan.receive
  lastwins[i] = twin
  cnts[i] = cnt
end

I’d love it if you could investigate the performance regression and give us some pointers as to what actually caused your program to slow down. I fail to see why MT would be the culprit, as your program doesn’t do anything concurrent; the fibers are CPU-bound and won’t yield until they’re done, so the threads merely allow more fibers to run at once.
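To illustrate (a toy sketch, not taken from your program): a purely CPU-bound fiber only interacts with the scheduler when it sends its result, so with -Dpreview_mt and N workers, up to N such fibers simply run to completion in parallel.

done = Channel(Int64).new

8.times do
  spawn do
    sum = 0_i64
    (1_i64..100_000_000_i64).each { |n| sum &+= n } # pure computation: no I/O, no yield points
    done.send(sum)                                  # first place this fiber can switch out
  end
end

8.times { done.receive }
puts "all fibers finished"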

1 Like

I’m wondering if memory isn’t your problem.

Your program spawns an unbounded number of fibers, which requires allocating a bunch of small objects in memory plus as many fiber stacks, each at least the OS page size (4KB on Linux). This can be replaced with a fixed number of fibers, for instance the number of CPUs, using a Channel to push the work.
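Here is a minimal, self-contained sketch of that pattern (the identifiers are illustrative, not taken from your program); the diff further down applies the same idea to the sieve:

jobs    = (1..100).to_a                    # work items to process
workers = System.cpu_count.to_i            # one worker fiber per CPU
chan    = Channel(Int32).new(workers * 2)  # bounded queue of pending work
done    = Channel(Float64).new(workers)    # results flow back to the main fiber

workers.times do
  spawn do
    while job = chan.receive?              # nil once the channel is closed => fiber exits
      done.send(Math.sqrt(job.to_f))       # stand-in for the real CPU-bound work
    end
  end
end

spawn do                                   # feed the queue from its own fiber
  jobs.each { |job| chan.send(job) }
  chan.close                               # lets the workers' receive? return nil
end

total = 0.0
jobs.size.times { total += done.receive }  # main fiber aggregates the results safely
puts total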

You also don’t need i nor the lastwins and cnts arrays: you can safely do that computation as you receive the values over the done channel.

Last but not least, the concurrent atomic add + print from each fiber is killing the performance. Reporting after done.receive allows skipping the atomic and may avoid the OS locking stdout (writes to stdout below PIPE_BUF are guaranteed to be atomic on POSIX targets, hence concurrent writes block each other).

All these changes reduce the execution time from over a minute down to 25s for 100_000_000_000 on my laptop with 8 crystal workers (== System.cpu_count).

237c237,238
<   puts "threads = #{System.cpu_count}"
---
>   nthreads = System.cpu_count.to_i32
>   puts "threads = #{nthreads}"
256,258c257,259
<   cnts = Array(UInt64).new(pairscnt, 0)  # number of twinprimes found per thread
<   lastwins = Array(UInt64).new(pairscnt, 0) # largest twinprime val for each thread
<   done = Channel(Nil).new(pairscnt)
---
>   # start a fixed number of fibers to avoid bloating memory
>   chan = Channel(Int32).new(nthreads * 32)
>   done = Channel({UInt64, UInt64}).new(nthreads) # NOTE: increase buffer size => faster but reports every once in a while
260,261c261
<   threadscnt = Atomic.new(0)             # count of finished threads
<   restwins.each_with_index do |r_hi, i|  # sieve twinpair restracks
---
>   nthreads.times do
263,268c263,282
<       lastwins[i], cnts[i] = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
<       print "\r#{threadscnt.add(1)} of #{pairscnt} twinpairs done"
<       done.send(nil)
<   end end
<   pairscnt.times { done.receive }        # wait for all threads to finish
<   print "\r#{pairscnt} of #{pairscnt} twinpairs done"
---
>       while r_hi = chan.receive?         # sieve twinpair restracks
>         twin, cnt = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
>         done.send({twin, cnt})
>       end
>     end
>   end
>
>   # push work from another channel to avoid blocking main fiber
>   spawn do
>     restwins.each { |r_hi| chan.send(r_hi) }
>   end
>
>   threadscnt = 0
>   last_twin = 0_u64
>   pairscnt.times do
>     twin, cnt = done.receive
>     last_twin = twin if twin > last_twin # find largest hi_tp twinprime in range
>     twinscnt += cnt                      # compute number of twinprimes in range
>     print "\r#{threadscnt += 1} of #{pairscnt} twinpairs done"
>   end
270,271d283
<   last_twin = lastwins.max               # find largest hi_tp twinprime in range
<   twinscnt += cnts.sum                   # compute number of twinprimes in range
283a296
>
5 Likes

I just applied your changes.
It compiles with no errors, but hangs where the threads are supposed to run.
I kind of understand the theory of it, but it would be nice if you would explain it in detail.
This is definitely something that isn’t covered in the documentation: the whys and hows.

Here’s my code section implemented with your changes.
Where did I mess up?

# start a fixed number of fibers to avoid bloating memory
  chan = Channel(Int32).new(nthreads * 32)
  done = Channel({UInt64, UInt64}).new(nthreads) # NOTE: increase buffer size => faster but reports every once in a while

  nthreads.times do
    while r_hi = chan.receive?         # sieve twinpair restracks
      twin, cnt = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
      done.send({twin, cnt})
  end end

  # push work from another channel to avoid blocking main fiber
  spawn do
    restwins.each { |r_hi| chan.send(r_hi) }
  end

  threadscnt = 0
  last_twin  = 0_u64
  pairscnt.times do
    twin, cnt = done.receive
    last_twin = twin if twin > last_twin # find largest hi_tp twinprime in range
    twinscnt += cnt                      # compute number of twinprimes in range
    print "\r#{threadscnt += 1} of #{pairscnt} twinpairs done"
  end
  print "\r#{pairscnt} of #{pairscnt} twinpairs done"

OK, it never runs because in the nthreads.times do loop the while loop never runs: no channel gets activated before it asks for data, so it just sits there.

The spawned twins_sieve threads have to start first, but this configuration doesn’t give them a chance.
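Comparing against the diff above, the missing piece looks like the spawn do around the while loop (it would have been unchanged context in the diff, so it never shows up with a > marker). A sketch of that section with the spawn put back in:

  nthreads.times do
    spawn do                             # each worker runs in its own fiber...
      while r_hi = chan.receive?         # ...so this receive blocks the worker, not the main fiber
        twin, cnt = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
        done.send({twin, cnt})
      end
    end
  end

With the workers inside fibers, the main fiber gets past this loop, reaches the spawn that feeds chan, and then sits in the done.receive loop collecting results.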

1 Like