Multi-threading ceases past certain input values

I just got a Lenovo Legion Slim 7 (sweet), AMD Ryzen 9 5900HX, 3.3-4.5 GHz, 8C|16T,
and was rerunning some benchmarks when I noticed a problem with multi-threading.
The same problem also occurs running this code on my i7 6700HQ, 4C|8T laptop.

When I run the code below, single input values up to 60_000_000_000_000
run with multi-threading, i.e. in parallel. When the input is 61_000_000_000_000
or greater (no matter how few CRYSTAL_WORKERS=n, n > 1, I use), it
only runs one thread (you can see this using htop). There is no problem with the
code, because every other language I've implemented this in doesn't exhibit this behavior.

It seems some hardware limit (?) is hit that disallows multi-threading, and it
reverts to single-threading once inputs reach that limit. I'm just guessing here.

I confirmed this behavior with 1.3.2, 1.4.0, and 1.4.1.

CRYSTAL_WORKERS=8 ./twinprimes_ssoz 60000000000000 => uses 8 threads

CRYSTAL_WORKERS=8 ./twinprimes_ssoz 61000000000000 => uses 1 thread

Below is the code with the compiling instructions in it.

Looks like something that should be opened as an issue in the main GitHub repo?

Let's figure out exactly what is happening here before opening an issue.

  restwins.each_with_index do |r_hi, i|  # sieve twinpair restracks
    spawn do
      lastwins[i], cnts[i] = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
      print "\r#{threadscnt.add(1)} of #{pairscnt} twinpairs done"
      done.send(nil)                     # signal this fiber's completion
    end
  end
  pairscnt.times { done.receive }        # wait for all fibers to finish

So here one fiber is created per element of restwins, but pairscnt completions are waited for. These two numbers may or may not be guaranteed to be identical (I don't grasp the whole of your code), but at the very least it becomes easier to see that the code does the right thing when the same variable is used in both places.

Here pairscnt is the number of elements in restwins, so this just ensures we wait until all the data is processed. This should have nothing to do with the operation of the threads.
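As a minimal, self-contained illustration of the fan-out/fan-in pattern under discussion (variable names here are placeholders, not the original program's): one fiber is spawned per work item, each signals completion on a channel, and the main fiber waits for exactly as many signals as fibers spawned.

```crystal
# One fiber per work item; each signals completion on `done`.
items   = (1..8).to_a
results = Array(Int32).new(items.size, 0)
done    = Channel(Nil).new

items.each_with_index do |item, i|
  spawn do
    results[i] = item * item  # stand-in for the real sieve work
    done.send(nil)            # signal this fiber is finished
  end
end

items.size.times { done.receive }  # wait on the same count we spawned
puts results.sum                   # => 204
```

Waiting on the same count that was spawned (items.size here) is what makes the fan-in obviously correct, which is the point being made above.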

Again, this is not a coding issue. I have personally implemented this program in 5 other languages (D, Go, C++, Nim, Rust) and none have this problem for any valid inputs.

I understand Crystal's concurrency/multi-threading model/implementation is young, so it's good this is caught now. On my older i7 6700HQ laptop (circa 2016) I never tried inputs this large with Crystal because it took so much longer than most of the other languages did with 8 threads. However, the AMD 5900HX has 16 threads at up to ~4.2+ GHz per thread in turbo mode, so I was checking the Crystal version with these larger values, which now take much less time.

So it’s not the program. Once you hit that observed input threshold, the program continues to run correctly, but only with a single thread. Something in the threading model/implementation is causing the problem. It shouldn’t be data dependent.

Using htop to show the process tree, I can run your code with values all the way up to 18446744073709551615 (UInt64::Max) and still get 8 threads. However, the 7 child processes are barely doing anything.

EDIT: This is wrong. I managed to forget the multithreading flag when compiling. However, in the process of investigating this I modified the code in a way that actually does run truly multithreaded without the weird behavior. I’m currently trying to minimize the changes to your code so I can understand what’s happening.
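For anyone else reproducing this: Crystal only enables multiple scheduler threads when the binary is compiled with the -Dpreview_mt flag (the source filename below is an assumption):

```shell
# Without -Dpreview_mt, all fibers run on a single thread and
# CRYSTAL_WORKERS has no effect at run time.
crystal build --release -Dpreview_mt twinprimes_ssoz.cr

# Then pick the worker-thread count when running.
CRYSTAL_WORKERS=8 ./twinprimes_ssoz 60000000000000
```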

Does this work for you?

I'm still trying to understand what happens with your code, but I think this should work. I'm currently working on a version that uses dedicated worker fibers taking inputs from a channel, because I'm running into scheduling issues. As far as I can tell, having 20,000 fibers makes it very unlikely that the status fiber will run; thankfully, Fiber.yield works well enough, but it's not a guarantee. It would be very nice if there were a way to prioritize a particular fiber or manage which fibers run on which threads, but I don't see a way to do that right now.
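A sketch of the worker-fiber approach described here, under the assumption that work items can be fed through a channel (names and the job payload are illustrative, not taken from the gist):

```crystal
# A fixed pool of worker fibers pulls jobs from a channel instead of
# spawning one fiber per job, keeping the fiber count near the thread count.
workers = 4
jobs    = Channel(Int32).new(workers)
results = Channel(Int32).new

workers.times do
  spawn do
    while job = jobs.receive?   # receive? returns nil once `jobs` is closed
      results.send(job * job)   # stand-in for the real sieve work
    end
  end
end

spawn do
  (1..10).each { |n| jobs.send(n) }
  jobs.close                    # lets the workers drain and exit cleanly
end

total = 0
10.times { total += results.receive }
puts total  # => 385
```

With only a handful of long-lived fibers, the scheduler has far fewer runnable fibers to choose from, which is why a status fiber stands a better chance of running under this design.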

EDIT: I've updated the gist above to allow switching between your original parallelization code, the code I had before using Fiber.yield, and an implementation that uses a smaller number of workers customized to the number of threads. The two implementations I wrote both work past the limit that your implementation has, and they have similar performance in the limited testing I did. I didn't have time to run multiple trials to see if this was really the case, but it seems like the worker-based implementation may actually scale better to larger inputs than your original implementation does. Through all this, I still don't know why your code behaves the way it does. Oh, and you may notice I changed the input parsing; that's just so I can break up all the zeros in the input and actually tell different inputs apart.
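For reference, one way to accept underscore-separated numeric input like that (this uses Crystal's standard String#to_u64 and is not necessarily the exact change in the gist) is the underscore option:

```crystal
# `underscore: true` lets the parser skip `_` digit separators.
input = "61_000_000_000_000"
value = input.to_u64(underscore: true)
puts value  # => 61000000000000
```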

EDIT 2: I tried to do a manual binary search to find which values cause the issue in your implementation, using the executable that can switch between implementations, but I'm getting inconsistent results. The value you gave, 61e12, will cause the "single thread" issue for a while and then suddenly start working. Something to note is that when there's an issue, I'm actually getting 2 threads with high CPU usage and the other 6 threads with basically zero usage, so the threads are there, but for some reason some of them aren't doing anything.

Hey @RespiteSage I ran your first version, and just saw the second, but haven’t run it yet.

Here’s my setup to run the code.
In your favorite terminal (I use Konsole in KDE).

  1. Open a tab and run htop, to see the threads' activity.
  2. In a 2nd tab, run watch -n1 'grep "^[c]pu MHz" /proc/cpuinfo' to see the threads' speed.
  3. In a 3rd tab, run the program.

This way I can see/monitor the threads activity and speed while the program runs.

I ran your 1st version, and it was a bit faster than mine, but it still failed at the same place.
I'll run the 2nd version and see what it does (these were on the AMD system w/16Ts).

I think the fundamental issue is that Crystal doesn't yet have a true parallel processing implementation (like OpenMP, Rust, etc.) and tries to mimic one with fibers. Go has a similar issue, and it eats up a lot more memory than Crystal (the most of the 6 languages I've implemented this in).

For a true parallel implementation that directly controls the threads, there wouldn’t be this kind of data dependency based on the values of the inputs.

I just ran a version of the code using a bit_array for the seg array, and it works correctly (all threads work simultaneously) for all the large values I tested it with.

So apparently the problem has something to do with memory allocation for the seg array:
using a bit_array doesn't have the problem, while using 64-bit memory elements does.

Here’s the code using bit_array.

This problem still persists with 1.5.0 when using the seg array as an array of 64-bit memory elements, but doesn't exist when using a bit_array.
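For anyone comparing the two representations, this is the size difference in question (a generic sketch, not the gist's actual code): BitArray packs one bit per entry, while Array(UInt64) spends a full 64-bit word per entry, a 64x difference in the segment buffer's footprint.

```crystal
require "bit_array"

n = 1_000_000
seg_bits  = BitArray.new(n)          # ~n / 8 bytes of storage
seg_words = Array(UInt64).new(n, 0)  # n * 8 bytes of storage

# Both support the same basic marking operations a sieve segment needs.
seg_bits[42]  = true
seg_words[42] = 1_u64

puts seg_bits[42]   # => true
puts seg_words[42]  # => 1
```

That the much smaller allocation avoids the problem does point toward allocation size (and GC/allocator behavior under -Dpreview_mt) rather than the sieve logic itself.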