Wait-groups fiber pooling in 1.19

I’ve been testing my Segmented Prime Sieves code with 1.19 to great success.
It seems all the old problems w/fibers are gone, and now the code runs reliably.

The one significant remaining issue is the number of fibers that can run concurrently.

Because Crystal doesn’t have true multi-thread parallelism yet, I’m limited in the size of
the numbers I can process, since larger inputs make the code spawn more fibers.

In Rust, D, etc., the available system threads are simply recycled across the spawned tasks
until all are done. But in Crystal I have to pre-allocate all of the fibers and wait
for them to finish, as shown in this code.

  require "wait_group"

  cnts     = Array(UInt64).new(pairscnt, 0)   # per-task twin-pair counts
  lastwins = Array(UInt64).new(pairscnt, 0)   # per-task last twin-pair values
  wg = WaitGroup.new(pairscnt)                # one wait-group slot per task

  threadscnt = Atomic.new(0)                  # finished-task counter for progress output
  restwins.each_with_index do |r_hi, i|
    spawn do
      lastwins[i], cnts[i] = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
      print "\r#{threadscnt.add(1)} of #{pairscnt} twinpairs done"
      wg.done
    end
  end
  wg.wait

However, once pairscnt reaches a certain size this error is encountered:

Unhandled exception: Cannot allocate new fiber stack: Cannot allocate memory (RuntimeError)
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in 'main'
  from /lib64/libc.so.6 in '??'
  from /lib64/libc.so.6 in '__libc_start_main'
  from ./primes_ssoz_newref in '_start'
  from ???

My question: Is there a way to do fiber pool sharing to get around this?

Can I create a finite number of fibers that get reused as earlier ones finish?

The Crystal code now runs much closer to the Rust, D, etc. times.
But they also have consistently low memory use, while Crystal’s grows with fiber use.
Thus, after a point, Crystal can’t process the large input values they can.

But again, I’m very pleased with the 1.19 fiber improvements.
Can’t wait for true multi-thread parallelism to be on par with the others.

Unless something has changed since this post, fibers already draw from a stack pool.

What you’re likely seeing is the difference between Rust threads beginning execution immediately (the main thread spawning the others would be preempted occasionally, allowing the other threads on the same CPU to run) and Crystal fibers waiting for the main fiber to yield before starting (they’re cooperative and are never preempted). All of the Crystal fibers belonging to the same thread, even with -Dpreview_mt, wait until the main fiber yields.

Assuming your twins_sieve is CPU-bound, if you throw an occasional Fiber.yield into your loop every n iterations (where n depends on how fine- or coarse-grained those workloads are) in order to let some of the fibers you’ve accumulated begin execution, it should let them run to completion before spawning more.
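A minimal, self-contained sketch of that suggestion (toy workload and names, not the poster’s twins_sieve; the yield interval of 64 is just an example value):

  require "wait_group"

  # Toy stand-in for a CPU-bound task.
  def busy_work(n : Int32) : Int64
    sum = 0_i64
    1_000_000.times { |x| sum &+= x % (n + 1) }
    sum
  end

  jobs = 10_000
  wg = WaitGroup.new(jobs)

  jobs.times do |i|
    spawn do
      busy_work(i)
      wg.done
    end
    # Yield periodically so already-spawned fibers can run to completion
    # (returning their stacks to the pool) before more are spawned.
    Fiber.yield if (i + 1) % 64 == 0
  end

  wg.wait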

The OS limits the virtual memory that can be pre-allocated by the process (see ulimit). Either you up the system limit to start hundreds of thousands of fibers (pointless), or you start a certain number of threads to handle the fibers.

When CPU bound, starting significantly more fibers (or threads) than there are available logical CPUs is wasteful. Instead, use channels to distribute work to, and collect results from, a fixed set of fibers (aka the actor model).
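A rough, self-contained sketch of that pattern (placeholder job type and workload, not the poster’s sieve; actual parallelism still depends on running with multiple scheduler threads, e.g. -Dpreview_mt plus CRYSTAL_WORKERS, or an execution context):

  WORKERS = 16

  # Placeholder CPU-bound task.
  def heavy_computation(i : Int32) : Int64
    sum = 0_i64
    1_000_000.times { |x| sum &+= x % (i + 1) }
    sum
  end

  jobs    = Channel(Int32).new(WORKERS)   # work items to process
  results = Channel({Int32, Int64}).new   # (index, result) pairs

  # A fixed pool of worker fibers, each pulling jobs until the channel is closed.
  WORKERS.times do
    spawn do
      while i = jobs.receive?
        results.send({i, heavy_computation(i)})
      end
    end
  end

  total_jobs = 1_000

  # Feed the work from a separate fiber so the main fiber can drain results.
  spawn do
    total_jobs.times { |i| jobs.send(i) }
    jobs.close
  end

  # Collect the results as they arrive.
  collected = Array(Int64).new(total_jobs, 0_i64)
  total_jobs.times do
    i, value = results.receive
    collected[i] = value
  end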

Wrong. It has for multiple versions now. You can have every bit of control you may want. See Fiber::ExecutionContext - Crystal 1.19.0 for details.


Here’s a specific example of the manifestation of the problem.
I implement the same algorithm in Rust and Crystal.

For this example, I’m counting the number of primes up to 150 trillion.
Rust takes 56m 54s (on a busy system), using ~3GB of 16GB of system memory.
That’s because the same amount of memory is used for each of 16 threads.
The threads are recycled to process each new work unit as necessary until all are finished.

➜  primes_ssoz git:(master) ✗ ./target/release/primes_ssoz192 150_000_000_000_000 
threads = 16
using Prime Generator parameters for P17
segment size = 3932160 resgroups; seg array is [1 x 61440] 64-bits
prime candidates = 27078803619840; resgroups = 293823824
each of 92160 threads has nextp[1 x 803194] array
setup time = 0.023483559 secs
perform primes ssoz sieve
92160 of 92160 residues done
sieve time = 3414.268745392 secs
total time = 3414.29557705 secs
last segment = 2843984 resgroups; segment slices = 75
total primes = 4745670621117; last prime = 149999999999977
➜  primes_ssoz git:(master) ✗ 

Crystal cannot currently process numbers of this size because of this error.

➜  crystal-projects CRYSTAL_WORKERS=16 ./primes_ssoz_newref 150_000_000_000_000
threads = 16
using Prime Generator parameters for P17
segment size = 3,932,160 resgroups; seg array is [1 x 61,440] 64-bits
prime candidates = 27,078,803,619,840; resgroups = 293,823,824
each of 92160 threads has nextp[1 x 803,194] array
setup time = 0.026696 secs
perform primes ssoz sieve
Unhandled exception: Cannot allocate new fiber stack: Cannot allocate memory (RuntimeError)
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in 'main'
  from /lib64/libc.so.6 in '??'
  from /lib64/libc.so.6 in '__libc_start_main'
  from ./primes_ssoz_newref in '_start'
  from ???
➜  crystal-projects

Rust|D run each work unit in its own thread, each using the same amount of memory.
Thus a constant amount of total system memory is used for the full run time.

Obviously Crystal is not doing this, as the error message states.
It seems to want to allocate memory for all 92160 work units (fibers) before executing them,
as it apparently never starts the multi-threaded sieve processing.

Thus, they are not doing the same type of parallel multi-threading.

Unfortunately, Crystal’s way of doing multi-threading is still not as good as Rust|D.
No matter how much I like Crystal, it’s still an empirically measurable deficiency.

I can write the code faster|easier in Crystal, but use Rust|D for faster performance.

Here’s the Crystal code.

Here’s the Rust code.

Dude. Please. Make an effort.

Try starting 92160 parallel threads in C, D, Rust or Zig and it would crash with the exact same error: OOM, because you exhausted the allowed virtual memory. Each thread preallocates 8MB of stack by default on Linux… which is exactly what we preallocate for fiber stacks.
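For scale: 92,160 stacks × 8 MB each comes to roughly 720 GB of virtual memory to reserve, about 45 times the 16 GB of RAM in the example above.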

Why the hell would you expect spawning 92160 fibers in Crystal to magically work?

If you have 16 logical cores, then:

  1. resize the default execution context to 16 (it’s set to 1 by default) and spawn exactly 16 fibers to handle the work; the parallel schedulers will spread the fibers across the 16 threads (see the sketch after this list).

  2. distribute work to each fiber just the same as you distribute the work to the 16 threads in the Rust version.

  3. enjoy: the efficiency of adding parallelism will be equivalent in Crystal, because it will work just the same as your Rust version (go figure) :tada:
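As a hedged illustration of steps 1 and 2: the Fiber::ExecutionContext API is still marked experimental, so the exact class names, whether a compile-time flag such as -Dexecution_context is needed, and whether the default context can be resized may vary by version. This sketch therefore creates a dedicated parallel context instead of resizing the default one, and uses a toy do_work in place of the real sieve:

  require "wait_group"

  # Toy stand-in for the per-residue sieve work.
  def do_work(i : Int32) : Int64
    sum = 0_i64
    1_000_000.times { |x| sum &+= x % (i + 1) }
    sum
  end

  WORKERS    = 16
  TOTAL_JOBS = 92_160

  # A context whose schedulers run fibers on up to WORKERS threads.
  ctx = Fiber::ExecutionContext::Parallel.new("sievers", WORKERS)

  wg = WaitGroup.new(WORKERS)
  partials = Array(Int64).new(WORKERS, 0_i64)

  WORKERS.times do |w|
    ctx.spawn do
      # Each worker fiber handles every WORKERS-th job index.
      acc = 0_i64
      (w...TOTAL_JOBS).step(WORKERS) { |i| acc &+= do_work(i) }
      partials[w] = acc
      wg.done
    end
  end

  wg.wait
  puts partials.sum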


You are not answering my question.

Rust|D, etc., automatically allocate the parallel work over the number of available system threads. I don’t need to know|care how it’s being done. And I don’t need to set external runtime variables like CRYSTAL_WORKERS=X to use all the threads.

I would expect (hope) that Crystal could perform functionally like them. I actually assume it can, but don’t currently know how to do it. Crystal should be (needs to be) able to handle as many parallel work units as a user wants, allocated over a given number of threads. The only system limitations should be memory and run time.

Instead of getting angry, showing how to do what you said would answer my question.

You are not listening to the answer. If I’m reading your Rust code correctly, you’re using Rayon’s parallel iterator implementation to handle the parallelization for you. That’s simply not the same thing you’re doing in Crystal, so you’re not comparing equivalent code.

The answer, as you’ve already been told, is to stop creating thousands of fibers. It’s a common concurrency pattern to have several workers that ingest data from a queue and output results to another queue, which I believe is exactly what @ysbaddaden suggested. There should be one fiber per worker, not one fiber per unit of data.

EDIT:

Because it was bugging me still after I replied, here are a couple other points:

  • Rayon is a Rust library, not part of the standard library. Is it more mature than any parallelization capabilities in Crystal? Yes, of course. As has been discussed many times here, Rust has had a great deal more resources directed into the development of the language and the ecosystem, and there’s nothing to do about that except try to write good Crystal libraries now.
  • Your lack of knowledge about concurrency is not a deficiency in the language.

It’s amazing! I ask a simple question on how to do something and get lectures on what I don’t know.

I obviously know exactly what Rust is doing because I programmed it to do it, and the same for D, C++, Nim and every other language I’ve implemented the algorithm in.

I’m making a really simple request.

Provide me with Crystal code examples that will let the program run correctly to completion, like the Rust, etc. code does.

Try adding this hack and replacing your residues.each_with_index with residues.par_each_with_index(16) (or whatever number of threads you want to run in parallel), and removing the spawn inside the block:

module Enumerable(T)
  # Runs the block for each element, with at most `nfibers` fibers in flight.
  def par_each_with_index(nfibers, &block : T, Int32 ->)
    free = Atomic.new(nfibers)        # number of free worker slots
    each_with_index do |elt, idx|
      while free.get == 0             # wait until a slot opens up
        Fiber.yield
      end
      free.add(-1)                    # claim a slot before spawning
      spawn do
        block.call elt, idx
        free.add(1)                   # release the slot when the work is done
      end
    end
    while free.get != nfibers         # wait for every spawned fiber to finish
      Fiber.yield
    end
  end
end

Caveat: my concurrency/multithreading knowledge isn’t 0 but very near. Use at your own risk.
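For example, a self-contained toy usage (not the sieve code) would look something like this:

  results = Array(Int64).new(1_000, 0_i64)

  (0...1_000).par_each_with_index(16) do |n, i|
    # CPU-bound placeholder work per element.
    results[i] = n.to_i64 * n
  end

  puts results.sum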

Your question was already answered toward the beginning of the thread. This isn’t the first time you’ve ignored a reply containing exactly what you’ve asked for. It’s at least the third you’ve ignored from me specifically.

It comes across as if you’re intentionally avoiding actually receiving help. I legitimately can’t tell if you’re trolling at this point.

This reads as complaining that 92k Crystal fibers use more memory than 16 Rust threads. Since you also aren’t posting any code that shows how you’re setting up the work in Rust, there’s no way for anyone to read any further into it.

This is subjective cosplaying as objective. It may not fit what you’re trying to do with it as well as you’d like, but that doesn’t make it “deficient” and it’s certainly not “empirically” or “measurably” so. Crystal’s approach to scheduling work is a huge improvement over manually futzing with threads for the kinds of work most people are doing, which often involves short bursts of work followed by waiting on I/O.

Although I don’t fully understand all the technical details of the thread, based on my current knowledge, it seems to simply express a desire for a practical library like Rayon. In that sense, I agree.

Even in Rust, which has far more development resources, Rayon itself is not part of the standard library. And in open source, if you really want something, the answer is often simple: you have to roll up your sleeves and build it yourself.

Or, if you’re feeling shamelessly lazy, you can ask AI to help prototype it.