Multi-threading ceases past certain input values

jzakiya · May 9, 2022, 9:07pm

I just got a Lenovo Legion Slim 7 (sweet), AMD Ryzen 9 5900HX, 3.3-4.5 GHz, 8C|16T,
and was rerunning some benchmarks, and noticed a problem with multi-threading.
But the problem also occurs running the same code on my i7 6700HQ, 4C|8T laptop.

When I run the code below, single input values up to 60_000_000_000_000
run with multi-threading, i.e. in parallel. When the input is 61_000_000_000_000
or greater (no matter how few CRYSTAL_WORKERS=n, n > 1, I use), it
only runs one thread (you can see this using htop). There is no problem with the
code because every other language this is done in doesn’t exhibit this behavior.

It seems some hardware limit (?) is hit for allowing multi-threading, then it
reverts to single-threading after inputs reach that limit. I’m just guessing here.

I confirmed this behavior with 1.3.2, 1.4.0, and 1.4.1.

CRYSTAL_WORKERS=8 ./twinprimes_ssoz 60000000000000 => uses 8 threads

CRYSTAL_WORKERS=8 ./twinprimes_ssoz 61000000000000 => uses 1 thread

Below is the code with the compiling instructions in it.


https://gist.github.com/jzakiya/2b65b609f091dcbb6f792f16c63a8ac4

twinprimes_ssoz.cr

# This Crystal source file is a multiple threaded implementation to perform an
# extremely fast Segmented Sieve of Zakiya (SSoZ) to find Twin Primes <= N.

# Inputs are single values N, or ranges N1 and N2, of 64-bits, 0 -- 2^64 - 1.
# Output is the number of twin primes <= N, or in range N1 to N2; the last
# twin prime value for the range; and the total time of execution.

# This code was developed on a System76 laptop with an Intel I7 6700HQ cpu,
# 2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. Parameter tuning
# probably needed to optimize for other hardware systems (ARM, PowerPC, etc).


bararchy · May 11, 2022, 8:24am

Looks like something that should be opened as an issue in the main github repo?

straight-shoota · May 11, 2022, 4:32pm

Let’s figure out what is happening exactly here before opening an issue.

yxhuvud · May 11, 2022, 8:23pm

  restwins.each_with_index do |r_hi, i|  # sieve twinpair restracks
    spawn do
      lastwins[i], cnts[i] = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
      print "\r#{threadscnt.add(1)} of #{pairscnt} twinpairs done"
      done.send(nil)
    end
  end
  pairscnt.times { done.receive }        # wait for

So here one fiber is spawned per element of restwins, but the code then waits for pairscnt completions. These numbers may or may not be guaranteed to be identical (I don’t grasp the whole of your code), but at the very least it becomes easier to see that it does the right thing when the same variable is used for both.
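A minimal sketch of that suggestion, not taken from the gist: `fake_sieve` and the residue values are hypothetical stand-ins for `twins_sieve` and the real `restwins`. The only point is that the collection driving the `spawn` calls also drives the number of `receive` calls.

```crystal
# Hypothetical stand-in for the per-residue sieve; it only squares its input.
def fake_sieve(r_hi : Int32) : Int32
  r_hi * r_hi
end

restwins = [11, 17, 29, 41]                      # made-up residue values
results = Array(Int32).new(restwins.size, 0)     # one result slot per fiber
done = Channel(Nil).new(restwins.size)           # buffered so senders never block

restwins.each_with_index do |r_hi, i|
  spawn do
    results[i] = fake_sieve(r_hi)                # each fiber writes only its own slot
    done.send(nil)
  end
end

# Wait on the same collection that drove the spawns,
# so the two counts cannot drift apart.
restwins.size.times { done.receive }

puts results.sum
```

With this shape a reader does not need to know how `pairscnt` is computed to see that every spawned fiber is waited for.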

jzakiya · May 11, 2022, 9:24pm

Here pairscnt is the number of elements in restwins, so this just ensures the program waits until all the data is processed. This should have nothing to do with the operation of the threads.

Again, this is not a coding issue. I have personally implemented this program in 5 other languages (D, Go, C++, Nim, Rust) and none have this problem for any valid inputs.

I understand Crystal’s concurrency/multi-threading model/implementation is young, so it’s good this is caught now. On my older i7 6700HQ laptop (circa 2016) I never tried to use inputs this large with Crystal because it took so much longer than most of the other languages did with 8 threads. However, the AMD 5900HX has 16 threads at up to ~4.2+ GHz per thread in turbo mode, and so I was checking the Crystal version for these larger values, which now take much less time.

So it’s not the program. Once you hit that observed input threshold, the program continues to run correctly, but only with a single thread. Something in the threading model/implementation is causing the problem. It shouldn’t be data dependent.

RespiteSage · May 12, 2022, 2:43pm

Using htop to show the process tree, I can run your code with values all the way up to 18446744073709551615 (UInt64::Max) and still get 8 threads. However, the 7 child processes are barely doing anything.

EDIT: This is wrong. I managed to forget the multithreading flag when compiling. However, in the process of investigating this I modified the code in a way that actually does run truly multithreaded without the weird behavior. I’m currently trying to minimize the changes to your code so I can understand what’s happening.

RespiteSage · May 12, 2022, 5:03pm

Does this work for you?


https://gist.github.com/RespiteSage/af8d70c519e8cf2948d90d9cf2836628

twinprimes_ssoz_test.cr


I’m still trying to understand what happens with your code, but I think this should work. I’m currently working on a version that uses dedicated worker fibers taking inputs from a channel because I’m running into scheduling issues. As far as I can tell, having 20,000 fibers makes it very unlikely that the status fiber will run; helpfully, the Fiber.yield works well enough, but it’s not a guarantee. It would be very nice if there was a way to prioritize a particular fiber or manage what fibers are running on what threads, but I don’t see a way to do that right now.

EDIT: I’ve updated the gist above to allow switching between your original parallelization code, the code I had before using Fiber.yield, and an implementation that uses a smaller number of workers customized to the number of threads. The two implementations I wrote both work past the limit that your implementation has, and they have similar performance in the limited testing I did. I didn’t have time to run multiple trials to see if this was really the case, but it seems like the worker-based implementation may actually scale better to larger inputs than your original implementation does. Through all this, I still don’t know why your code behaves the way it does. Oh, and you may notice I changed the input parsing; that’s just so I can break up all the zeros in the input and actually tell different inputs apart.

EDIT 2: I tried to do a manual binary search to find what values cause the issue in your implementation using the executable that can switch between implementations, but I’m getting inconsistent results. The value you gave, 61e12, will cause the “single thread” issue for a while and then suddenly start working. Something to note is that when there’s an issue it’s that I’m actually getting 2 threads with high CPU usage and the other 6 threads with basically zero usage, so the threads are there but for some reason some of them aren’t doing anything.
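The worker-based approach described above (a small, fixed number of fibers pulling inputs from a channel instead of one fiber per residue) could look roughly like this. It is a sketch under assumptions, not the code from the gist: `fake_sieve`, the job values, and the default of 4 workers are all hypothetical, and the fibers only spread across threads when the program is built with `-Dpreview_mt`.

```crystal
# Hypothetical stand-in for the per-residue sieve work.
def fake_sieve(r_hi : Int32) : Int32
  r_hi * 2
end

restwins = (1..1000).to_a                              # made-up job list
worker_count = ENV.fetch("CRYSTAL_WORKERS", "4").to_i  # assumed default of 4

jobs = Channel(Int32).new(restwins.size)     # buffered: producer never blocks
results = Channel(Int32).new(restwins.size)  # buffered: workers never block

# A fixed pool of worker fibers instead of one fiber per job.
worker_count.times do
  spawn do
    # receive? returns nil once the channel is closed and drained,
    # which ends the loop and lets the fiber finish.
    while r_hi = jobs.receive?
      results.send(fake_sieve(r_hi))
    end
  end
end

restwins.each { |r_hi| jobs.send(r_hi) }
jobs.close

# Collect exactly one result per job on the main fiber.
total = 0
restwins.size.times { total += results.receive }
puts total
```

Keeping the number of runnable fibers close to the number of threads means the scheduler has far fewer fibers to juggle, which is the scheduling concern raised above.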

jzakiya · May 12, 2022, 8:49pm

Hey @RespiteSage I ran your first version, and just saw the second, but haven’t run it yet.

Here’s my setup to run the code.
In your favorite terminal (I use Konsole in KDE).

Open a tab and run htop to see the threads’ activity.
In a 2nd tab run watch -n1 "grep \"^[c]pu MHz\" /proc/cpuinfo" to see the threads’ speed.
In a 3rd tab run the program.

This way I can see/monitor the threads activity and speed while the program runs.

I ran your 1st version, and it was a bit faster than mine, but still failed at the same place.
I’ll run the 2nd version and see what it does (these runs are on the AMD system w/16Ts).

I think the fundamental issue is that Crystal doesn’t yet have a true parallel processing implementation (like OpenMP, Rust, etc) and tries to mimic it with fibers. Go has a similar issue, and it eats up a lot more memory than Crystal (the most of the 6 languages I’ve done).

For a true parallel implementation that directly controls the threads, there wouldn’t be this kind of data dependency based on the values of the inputs.

jzakiya · May 25, 2022, 10:19pm

I just ran a version of the code using a bitarray for the seg array and it works correctly (all threads are working simultaneously) for all the large values I tested it with.

So apparently the problem(s) have something to do with memory allocation for the seg array.
Using a bitarray doesn’t have the problem, while using 64-bit memory elements does.
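For readers comparing the two representations, here is a small sketch of holding the same flags in an array of 64-bit words versus a BitArray. It is not the code from the gist: it assumes the 64-bit version packs one flag per bit of each UInt64, and `kb` and the marked positions are made up. It shows only the difference in how flags are stored and addressed, not the cause of the threading behavior.

```crystal
require "bit_array"

kb = 200  # hypothetical number of flags in a segment

# Representation 1: flags packed by hand into 64-bit words.
words = Array(UInt64).new((kb + 63) // 64, 0_u64)

# Representation 2: the standard library's BitArray.
bits = BitArray.new(kb)

[3, 64, 130].each do |k|
  words[k >> 6] |= 1_u64 << (k & 63)  # word index = k / 64, bit index = k % 64
  bits[k] = true
end

# Both representations should report the same number of set flags.
words_count = words.sum { |w| w.popcount.to_i }
bits_count = bits.count(true)
puts "#{words_count} #{bits_count}"
```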

Here’s the code using bit_array.


https://gist.github.com/jzakiya/2b65b609f091dcbb6f792f16c63a8ac4

twinprimes_ssoz.cr


jzakiya · July 8, 2022, 5:01pm

This problem still persists with 1.5.0, using the seg array as array of 64-bit mem elements, but doesn’t exist when using a bit_array.
