Timeline for multithreading support

Where does implementing true parallelism (vs concurrency) fit into 2.0?

That’s the one thing that I would really like to see in Crystal asap.


Multithreading is already available as a preview feature. At the moment I’m not aware of any breaking changes that would be necessary for promoting it to a standard feature. So it could probably be a 1.x feature, not 2.0.
But there’s also no clear path for that yet.
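For anyone who wants to try the preview today, here’s a minimal sketch (the -Dpreview_mt flag and the CRYSTAL_WORKERS variable are the ones the runtime actually uses):

  # compile with the preview flag:  crystal build -Dpreview_mt prog.cr
  # run with a chosen worker count: CRYSTAL_WORKERS=4 ./prog
  channel = Channel(Int32).new

  4.times do |i|
    spawn { channel.send(i * i) }  # fibers are scheduled across the worker threads
  end

  4.times { puts channel.receive }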

There are still memory leak issues with the current concurrency model, which is closer to Go’s than to the true parallelism available in C++, Nim, Rust, etc.

See these threads for examples.

I would love to see a roadmap to provide true multi-threaded parallelism, even if it’s for 3.0, that doesn’t use channels as the means to synchronize operations.

There is a whole class of parallel algorithms|applications I would love to implement in Crystal that currently run best in Rust|Nim|C++.

I’ve extracted this thread from the original topic because it’s a separate discussion.

I may be missing some definition. I understand “true parallelism” is usually used as a pleonasm and just emphasizes that code execution happens in parallel, instead of just concurrently. Does “true parallelism” have a different meaning for you than just “parallelism”? What is the difference?

Yes, that’s basically it.

Sometimes I see people refer to multi-threading and concurrency together.

What I mean is operations that can run simultaneously in a thread|core, independent of the others, each with its own dedicated memory that is released upon completion, and requiring no communication between threads.

Do you really mean “simultaneously in a thread|core”? As I would understand that it seems quite a challenge. One thread/core can usually only execute a single operation at a time, not multiple ones simultaneously.
But you probably mean each execution context runs in a different thread, right? And they run simultaneously without communication between them.

That should actually be pretty much doable with fibers in preview_mt. The hard part of multi-threading is communication. If you don’t need that, it’s just like spawning a number of separate processes.
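Something like this minimal sketch, compiled with -Dpreview_mt: each fiber does independent work in its own slot, and the only communication is a completion signal.

  results = Array(Int64).new(8, 0_i64)
  done = Channel(Nil).new(8)

  8.times do |i|
    spawn do
      results[i] = (1_i64..1_000_000_i64 * (i + 1)).sum  # independent work, own slot
      done.send(nil)
    end
  end

  8.times { done.receive }  # the only synchronization point
  puts results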

Yes. Let me use code examples so I can be crystal clear (pun intended). :blush:

The following Twin Primes Sieve algorithm is implemented in Rust, Nim, C++, and Crystal.

Rust (fastest)

C++ and Nim have literally the same speed because both compile with GCC.

Nim

C++

Crystal is currently the slowest and has memory use issues (it eats memory).

Crystal

Rust uses the crate Rayon (which uses other crates) for parallelism.
C++ uses|compiles OpenMP (https://www.openmp.org/) in this implementation.
Nim uses its own parallelism model, which compiles to C code.

To spawn parallel threads in Rust I use Rayon’s .par_iter() method:

  let (lastwins, cnts): (Vec<_>, Vec<_>) = { // store outputs in these arrays
    let counter = RelaxedCounter::new();
    restwins.par_iter().map( |r_hi| {
      let out = twins_sieve(*r_hi, kmin, kmax, ks, start_num, end_num, modpg,
                            &primes.to_vec(), &resinvrs);
      print!("\r{} of {} twinpairs done", counter.increment(), pairscnt);
      out
    }).unzip()
  };

For C++ I use OpenMP’s parallel for pragma.

  #pragma omp parallel for
  for (int indx = 0; indx < pairscnt; indx += 1) { // for nextp|seg twinpair row indexes
     twins_sieve(indx);           // sieve the selected twinpair restracks
     cout << "\r" << (indx + 1) << " of " << pairscnt << " threads done";
  }                               // update 'twinscnt' w/seg twinprime cnts

For Nim I use its parallel spawn structure.

  #parallel:                       # perform in parallel
  for indx, r_hi in restwins:      # for each twin pair row index
    spawn twins_sieve(r_hi.uint, Kmin, Kmax, Kb, start_num, end_num, modpg, primes, resinvrs, indx)
    stdout.write("\r", (indx + 1), " of ", pairscnt, " twinpairs done")
  sync()                           # when all the threads finish

But (currently) for Crystal I have to set up channels, which caused memory leaks until I got help to fix that, but it still has other problems, and stops working once the input values get beyond a certain size.

  cnts = Array(UInt64).new(pairscnt, 0)      # the number of twinprimes found per thread
  lastwins = Array(UInt64).new(pairscnt, 0)  # the largest twinprime val for each thread
  done = Channel(Nil).new(pairscnt)

  restwins.each_with_index do |r_hi, i|      # sieve twinpair restracks
    spawn do
      lastwins[i], cnts[i] = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
      print "\r#{i + 1} of #{pairscnt} twinpairs done"
      done.send(nil)
    end
  end
  pairscnt.times { done.receive }  # wait for all threads to finish

Ruby has long had fibers to spawn independent operations that can be performed in parallel if they involve IO and don’t conflict with the GIL. It also has threads, which can only run one at a time.

Ractors in Ruby 3.0 are meant to be its foray into parallelism that isn’t blocked by the GIL.

Looking at how these languages have|are implementing this capability, I think it would also be best for Crystal to offload this functionality into a separate official shard.

I know this is a hard problem, but you have many examples of various approaches to study. From my perspective as a user, I would like it to be simple to use and performant, with no side effects (excessive memory use, etc.).

I hope this has clarified things.

If there is a specific problem with Crystal’s multi-threading, please open an issue at Issues · crystal-lang/crystal · GitHub, ideally with a reduced reproducible example.

The issue to me, as the basis for considering features in Crystal 2.x, and later, goes beyond specific problems with the current multi-threading model.

It’s a more fundamental question: what should the model be for Crystal to implement competitive parallel processing that performs as well as, or better than, Rust, Nim, C++, et al?

Because Rust uses LLVM too, that may be the best model to assess first.

But I know there were long forum threads on this topic in Rust, and it seems they have settled on Rayon as the de facto implementation to use for most parallel processing applications.

I know Crystal’s core team and resources are way smaller than for Rust, which makes it even more imperative to consider these fundamental questions first.

I’d rather wait for a kickass implementation for 2.5, or greater, than to continue with the current model, which probably will never be as easy to use and performant as Rust, et al.

Have you tried to write that same program in Go? Crystal uses the same concurrency model as Go.

If the problems you are seeing are that it works slowly or that there are memory leaks, those are performance issues and potential bugs. But there’s nothing wrong with the concurrency model used by Crystal (Go is very popular and fast) and that will likely never change.

I will try to run your code on my machine when I have time.

Oh, I was about to optimize that program by using parallel_map to spawn one fiber per thread. Luckily I searched the forums because it seems I already did that: 1.0 multi-threading memory use issues - #15 by asterite

Multithreading support exists in the runtime. What’s lacking is standard library support like Go’s WaitGroup, maybe a parallel map, etc. But those are relatively easy compared to implementing the runtime support for multithreading. And those things can also come in shards, similar to how you use Rayon to achieve parallelism in Rust. I’m not saying that’s ideal, but it’s definitely possible.
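For illustration, here is a minimal parallel map sketch along those lines. The parallel_map helper below is hypothetical, not an existing stdlib or shard API; it just pairs one fiber per element with a completion channel:

  def parallel_map(inputs : Array(T), &block : T -> U) forall T, U
    results = Array(U?).new(inputs.size, nil)
    done = Channel(Nil).new(inputs.size)

    inputs.each_with_index do |input, i|
      spawn do
        results[i] = block.call(input)  # each fiber writes only its own slot
        done.send(nil)
      end
    end

    inputs.size.times { done.receive }  # wait for every fiber to finish
    results.map &.not_nil!
  end

  # usage: squares = parallel_map([1, 2, 3, 4]) { |n| n * n }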


Yes, I wrote|have a Go version, and it has even worse memory behavior (it gorges on memory) for this application.

I posted this problem to the Go forum, and the response was that the latest version of Go (at that time) “fixed” that problem, but you had to write your code using modules, and it wasn’t shown how to do that. It just became too big a PITA for me to be bothered with, so I just gave up on Go (for now).

And that’s another thing which has made me appreciate Rust.

Rust is a very pedantic language, and hard to learn, but they know that. It has excellent documentation, which for a lot of people is still not enough to understand how to actually program real situations the Rust way. But they took my original code and, between 4 or 5 people, rewrote or tweaked specific sections to create the current version in the gist. So that implementation is a group effort from some Rust gurus, doing things to my original code that are nowhere in the written documentation.

But again, Rust has way more years of people-development-hours (and now a Foundation) behind it compared to Crystal, and it shows.

But Crystal is sooo much easier to program in. For ME this is the one area of functionality I’m waiting on more than anything else for it to improve in.


Big Specific Wish/Feature for Parallelism.

Eliminate the use of CRYSTAL_WORKERS=n to set the number of usable threads at runtime.

$ CRYSTAL_WORKERS=8 ./your_program

This is very user unfriendly: it looks ugly, it requires the user to know how many threads are usable on each system the code will run on, and no other language requires its users to do anything like this. The default should be to use all the available threads on a system.
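For what it’s worth, the runtime can already query that number; a one-line check using the existing System.cpu_count API:

  puts System.cpu_count  # e.g. 8 on a 4-core/8-thread machine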

Please search GOMAXPROCS on the Internet. I’m not saying being able to configure this is good, but the claim that no other language does this is false.

The point is to make doing parallel processing better (simpler, easier, faster) for users. Whether some other languages force their users to do something similar (maybe to be mathematically accurate I should have said ...I am aware of) doesn’t change the point.

As a user, it’s just cumbersome and annoying to have to do this (and sometimes I forget). ALL of the languages I have personally used don’t require you to do this.


If you add export CRYSTAL_WORKERS=8 to your shell config file, such as ~/.zshrc or ~/.bashrc, you won’t need to specify it every time.

I’d be very careful with such a default. I don’t think there is a reasonable default that fits all use cases. Some applications are fine and most efficient with a minimal number of threads. Others require as many as possible for maximum efficiency. But then there are also limitations on shared system resources.

I totally agree that it’s not very user friendly that you have to use an environment variable. But multithreading is only a preview feature right now, therefore it should be acceptable to require some extra config.

For the future, we’ll have to iterate on ideas how this can be improved. For example, I could imagine offering different thread-allocation strategies that can be selected at compile time (potentially still allowing runtime override).
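As a rough illustration of that idea, Crystal’s existing compile-time flags could express the selection. The all_cores flag below is hypothetical, not an existing option:

  {% if flag?(:all_cores) %}
    # built with: crystal build -Dpreview_mt -Dall_cores prog.cr
    puts "strategy: one worker per logical CPU (#{System.cpu_count})"
  {% else %}
    puts "strategy: fixed default worker count"
  {% end %}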


For the future, we’ll have to iterate on ideas how this can be improved. For example, I could imagine offering different thread-allocation strategies that can be selected at compile time (potentially still allowing runtime override).

I can definitely see it being possible to set up completely new thread pools during runtime. I’m doing some exploration around this in my nested_scheduler shard that I hope will some day give inspiration for #6468. For that issue to be solved it will need to be possible (and default) to work inside the current pool of threads, but it turned out easier as a first step to get something working by always creating new threads instead.

See GitHub - yxhuvud/nested_scheduler (Shard for creating separate groups of fibers in a hierarchical way and to collect results and errors in a structured way) for reference. It is still missing a lot of features: I haven’t yet built support for collecting values or even exceptions, and cancellation/deadlines are pretty far out too. And I imagine there will be a lot of variations of wanted strategies to handle all of that - sometimes people want results, sometimes not, etc.


Each program has its own design considerations.
One needs workers = Cpu.count.
Another needs workers = 2 * disks.
Some prefer low worker counts because they don’t do many parallel operations.

Crystal seems to prefer low resource usage, or not hogging the entire system, or both.
For some applications this is correct. For others, the norm is to hog the entire system (e.g. a webserver).

Perhaps there should be an optional per-program override for the default number of workers:

# Only used if the environment variable is empty
CRYSTAL_WORKERS_DEFAULT = ->() { ... }
# or
Thread.workers_default = ->() { 128 }
# or
Scheduler.workers_default = ..
# or
...