Wait-groups fiber pooling in 1.19

I’ve been testing my Segmented Prime Sieves code with 1.19 to great success.
It seems all the old problems w/fibers are gone, and now the code runs reliably.

The one significant remaining issue is the number of fibers that can run concurrently.

Because Crystal doesn’t have true multi-thread parallelism yet, I’m limited in the size of
the numbers I can process, since larger inputs make the code spawn more fibers.

In Rust, D, etc., the available system threads are simply recycled across the spawned tasks
until all are done. But in Crystal I have to pre-allocate all of the fibers and wait
for them to finish, as shown in this code.

  require "wait_group"

  cnts     = Array(UInt64).new(pairscnt, 0)   # per-task twin-pair counts
  lastwins = Array(UInt64).new(pairscnt, 0)   # per-task last twin-pair values
  wg = WaitGroup.new(pairscnt)                # one wait-group slot per task

  threadscnt = Atomic.new(0)                  # finished-task counter for progress output
  restwins.each_with_index do |r_hi, i|
    spawn do
      lastwins[i], cnts[i] = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
      print "\r#{threadscnt.add(1)} of #{pairscnt} twinpairs done"
      wg.done
    end
  end
  wg.wait

However, once pairscnt reaches a certain size this error is encountered:

Unhandled exception: Cannot allocate new fiber stack: Cannot allocate memory (RuntimeError)
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in 'main'
  from /lib64/libc.so.6 in '??'
  from /lib64/libc.so.6 in '__libc_start_main'
  from ./primes_ssoz_newref in '_start'
  from ???

My question: Is there a way to do fiber pool sharing to get around this?

Can I create a finite number of fibers that get reused as earlier ones finish?

The Crystal code now runs much closer to the Rust, D, etc. times.
But they also have consistently low memory use, while Crystal’s grows with fiber use.
Thus, after a point, Crystal can’t process the large input values they can.

But again, I’m very pleased with the 1.19 fiber improvements.
Can’t wait for true multi-thread parallelism to be on par with the others.

Unless something has changed since this post, fibers already draw from a stack pool.

What you’re likely seeing is the difference between Rust threads beginning execution immediately (the main thread spawning the others would be preempted occasionally, allowing the other threads on the same CPU to run) and Crystal fibers waiting for the main fiber to yield before starting (they’re cooperative and are never preempted). All of the Crystal fibers belonging to the same thread, even with -Dpreview_mt, wait until the main fiber yields.

Assuming your twins_sieve is CPU-bound, if you throw an occasional Fiber.yield into your loop every n iterations (where n depends on how fine- or coarse-grained those workloads are) in order to let some of the fibers you’ve accumulated begin execution, it should let them run to completion before spawning more.
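A minimal, self-contained sketch of that suggestion (toy workload and names, not the poster’s twins_sieve; the yield interval of 64 is just an example value):

  require "wait_group"

  # Toy stand-in for a CPU-bound task.
  def busy_work(n : Int32) : Int64
    sum = 0_i64
    1_000_000.times { |x| sum &+= x % (n + 1) }
    sum
  end

  jobs = 10_000
  wg = WaitGroup.new(jobs)

  jobs.times do |i|
    spawn do
      busy_work(i)
      wg.done
    end
    # Yield periodically so already-spawned fibers can run to completion
    # (returning their stacks to the pool) before more are spawned.
    Fiber.yield if (i + 1) % 64 == 0
  end

  wg.wait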

The OS limits the virtual memory that can be pre-allocated by the process (see ulimit). Either you up the system limit to start hundreds of thousands of fibers (pointless), or you start a certain number of threads to handle the fibers.

When CPU bound, starting significantly more fibers (or threads) than there are available logical CPUs is wasteful. Instead, use channels to distribute work to, and collect results from, a fixed set of fibers (aka the actor model).
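A rough, self-contained sketch of that pattern (placeholder job type and workload, not the poster’s sieve; actual parallelism still depends on running with multiple scheduler threads, e.g. -Dpreview_mt plus CRYSTAL_WORKERS, or an execution context):

  WORKERS = 16

  # Placeholder CPU-bound task.
  def heavy_computation(i : Int32) : Int64
    sum = 0_i64
    1_000_000.times { |x| sum &+= x % (i + 1) }
    sum
  end

  jobs    = Channel(Int32).new(WORKERS)   # work items to process
  results = Channel({Int32, Int64}).new   # (index, result) pairs

  # A fixed pool of worker fibers, each pulling jobs until the channel is closed.
  WORKERS.times do
    spawn do
      while i = jobs.receive?
        results.send({i, heavy_computation(i)})
      end
    end
  end

  total_jobs = 1_000

  # Feed the work from a separate fiber so the main fiber can drain results.
  spawn do
    total_jobs.times { |i| jobs.send(i) }
    jobs.close
  end

  # Collect the results as they arrive.
  collected = Array(Int64).new(total_jobs, 0_i64)
  total_jobs.times do
    i, value = results.receive
    collected[i] = value
  end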

Wrong. It has for multiple versions now. You can have every bit of control you may want. See Fiber::ExecutionContext - Crystal 1.19.0 for details.


Here’s a specific example of the manifestation of the problem.
I implement the same algorithm in Rust and Crystal.

For this example, I’m counting the number of primes up to 150 trillion.
Rust takes 56m 54s (on a busy system), using ~3GB of 16GB of system memory.
That’s because the same amount of memory is used for each of 16 threads.
The threads are recycled to process each new work unit as necessary until all are finished.

➜  primes_ssoz git:(master) ✗ ./target/release/primes_ssoz192 150_000_000_000_000 
threads = 16
using Prime Generator parameters for P17
segment size = 3932160 resgroups; seg array is [1 x 61440] 64-bits
prime candidates = 27078803619840; resgroups = 293823824
each of 92160 threads has nextp[1 x 803194] array
setup time = 0.023483559 secs
perform primes ssoz sieve
92160 of 92160 residues done
sieve time = 3414.268745392 secs
total time = 3414.29557705 secs
last segment = 2843984 resgroups; segment slices = 75
total primes = 4745670621117; last prime = 149999999999977
➜  primes_ssoz git:(master) ✗ 

Crystal cannot currently process numbers of this size because of this error.

➜  crystal-projects CRYSTAL_WORKERS=16 ./primes_ssoz_newref 150_000_000_000_000
threads = 16
using Prime Generator parameters for P17
segment size = 3,932,160 resgroups; seg array is [1 x 61,440] 64-bits
prime candidates = 27,078,803,619,840; resgroups = 293,823,824
each of 92160 threads has nextp[1 x 803,194] array
setup time = 0.026696 secs
perform primes ssoz sieve
Unhandled exception: Cannot allocate new fiber stack: Cannot allocate memory (RuntimeError)
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in '??'
  from ./primes_ssoz_newref in 'main'
  from /lib64/libc.so.6 in '??'
  from /lib64/libc.so.6 in '__libc_start_main'
  from ./primes_ssoz_newref in '_start'
  from ???
➜  crystal-projects

Rust|D run each work unit in its own thread, each using the same amount of memory.
Thus a constant amount of total system memory is used for the full run time.

Obviously Crystal is not doing this, as the error message states.
It seems to want to allocate memory for all 92160 work units (fibers) before executing them,
as it apparently never starts the multi-threaded sieve processing.

Thus, they are not doing the same type of parallel multi-threading.

Unfortunately, Crystal’s way of doing multi-threading is still not as good as Rust|D.
No matter how much I like Crystal, it’s still an empirically measurable deficiency.

I can write the code faster|easier in Crystal, but use Rust|D for faster performance.

Here’s the Crystal code.

Here’s the Rust code.

Dude. Please. Make an effort.

Try starting 92160 parallel threads in C, D, Rust or Zig and it would crash with the exact same error: OOM, because you exhausted the allowed virtual memory. Each thread preallocates 8MB of stack by default on Linux… which is exactly what we preallocate for fiber stacks.
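For scale: 92,160 stacks × 8 MB each comes to roughly 720 GB of virtual memory to reserve, about 45 times the 16 GB of RAM in the example above.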

Why the hell would you expect spawning 92160 fibers in Crystal to magically work?

If you have 16 logical cores, then:

  1. resize the default execution context to 16 (it’s set to 1 by default) and spawn exactly 16 fibers to handle the work; the parallel schedulers will spread the fibers across the 16 threads (see the sketch after this list).

  2. distribute work to each fiber just the same as you distribute the work to the 16 threads in the Rust version.

  3. enjoy: the efficiency of adding parallelism will be equivalent in Crystal, because it will work just the same as your Rust version (go figure) :tada:
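As a hedged illustration of steps 1 and 2: the Fiber::ExecutionContext API is still marked experimental, so the exact class names, whether a compile-time flag such as -Dexecution_context is needed, and whether the default context can be resized may vary by version. This sketch therefore creates a dedicated parallel context instead of resizing the default one, and uses a toy do_work in place of the real sieve:

  require "wait_group"

  # Toy stand-in for the per-residue sieve work.
  def do_work(i : Int32) : Int64
    sum = 0_i64
    1_000_000.times { |x| sum &+= x % (i + 1) }
    sum
  end

  WORKERS    = 16
  TOTAL_JOBS = 92_160

  # A context whose schedulers run fibers on up to WORKERS threads.
  ctx = Fiber::ExecutionContext::Parallel.new("sievers", WORKERS)

  wg = WaitGroup.new(WORKERS)
  partials = Array(Int64).new(WORKERS, 0_i64)

  WORKERS.times do |w|
    ctx.spawn do
      # Each worker fiber handles every WORKERS-th job index.
      acc = 0_i64
      (w...TOTAL_JOBS).step(WORKERS) { |i| acc &+= do_work(i) }
      partials[w] = acc
      wg.done
    end
  end

  wg.wait
  puts partials.sum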


You are not answering my question.

Rust|D, etc., automatically allocate the parallel work over the number of available system threads. I don’t need to know|care how it’s being done. And I don’t need to set external runtime variables like CRYSTAL_WORKERS=X to use all the threads.

I would expect (hope) that Crystal could perform functionally like them. I actually assume it can, but don’t currently know how to do it. Crystal should be (needs to be) able to handle as many parallel work units as a user wants, allocated over a given number of threads. The only system limitations should be memory and run time.

Instead of getting angry, showing how to do what you said would answer my question.

You are not listening to the answer. If I’m reading your Rust code correctly, you’re using Rayon’s parallel iterator implementation to handle the parallelization for you. That’s simply not the same thing you’re doing in Crystal, so you’re not comparing equivalent code.

The answer, as you’ve already been told, is to stop creating thousands of fibers. It’s a common concurrency pattern to have several workers that ingest data from a queue and output results to another queue, which I believe is exactly what @ysbaddaden suggested. There should be one fiber per worker, not one fiber per unit of data.

EDIT:

Because it was bugging me still after I replied, here are a couple other points:

  • Rayon is a Rust library, not part of the standard library. Is it more mature than any parallelization capabilities in Crystal? Yes, of course. As has been discussed many times here, Rust has had a great deal more resources directed into the development of the language and the ecosystem, and there’s nothing to do about that except try to write good Crystal libraries now.
  • Your lack of knowledge about concurrency is not a deficiency in the language.

It’s amazing! I ask a simple question on how to do something and get lectures on what I don’t know.

I obviously know exactly what Rust is doing because I programmed it to do it, and the same for D, C++, Nim and every other language I’ve implemented the algorithm in.

I’m making a really simple request.

Provide me with Crystal code examples that will let the program run correctly to completion, like the Rust, etc. code does.

Try adding this hack and replacing your residues.each_with_index with residues.par_each_with_index(16) (or whatever number of threads you want to run in parallel), and removing the spawn inside the block:

module Enumerable(T)
  # Runs the block for each element, with at most `nfibers` fibers in flight.
  def par_each_with_index(nfibers, &block : T, Int32 ->)
    free = Atomic.new(nfibers)        # number of free worker slots
    each_with_index do |elt, idx|
      while free.get == 0             # wait until a slot opens up
        Fiber.yield
      end
      free.add(-1)                    # claim a slot before spawning
      spawn do
        block.call elt, idx
        free.add(1)                   # release the slot when the work is done
      end
    end
    while free.get != nfibers         # wait for every spawned fiber to finish
      Fiber.yield
    end
  end
end

Caveat: my concurrency/multithreading knowledge isn’t 0 but very near. Use at your own risk.
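For example, a self-contained toy usage (not the sieve code) would look something like this:

  results = Array(Int64).new(1_000, 0_i64)

  (0...1_000).par_each_with_index(16) do |n, i|
    # CPU-bound placeholder work per element.
    results[i] = n.to_i64 * n
  end

  puts results.sum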

Your question was already answered toward the beginning of the thread. This isn’t the first time you’ve ignored a reply containing exactly what you’ve asked for. It’s at least the third you’ve ignored from me specifically.

It comes across as if you’re intentionally avoiding actually receiving help. I legitimately can’t tell if you’re trolling at this point.

This reads as complaining that 92k Crystal fibers use more memory than 16 Rust threads. Since you also aren’t posting any code that shows how you’re setting up the work in Rust, there’s no way for anyone to read any further into it.

This is subjective cosplaying as objective. It may not fit what you’re trying to do with it as well as you’d like, but that doesn’t make it “deficient” and it’s certainly not “empirically” or “measurably” so. Crystal’s approach to scheduling work is a huge improvement over manually futzing with threads for the kinds of work most people are doing, which often involves short bursts of work followed by waiting on I/O.

Although I don’t fully understand all the technical details of the thread, based on my current knowledge, it seems to simply express a desire for a practical library like Rayon. In that sense, I agree.

Even in Rust, which has far more development resources, Rayon itself is not part of the standard library. And in open source, if you really want something, the answer is often simple: you have to roll up your sleeves and build it yourself.

Or, if you’re feeling shamelessly lazy, you can ask AI to help prototype it.