Timeline for multithreading support

Where does implementing true parallelism (vs concurrency) fit into 2.0?

That’s the one thing that I would really like to see in Crystal asap.


Multithreading is already available as a preview feature. At the moment I’m not aware of any breaking changes that would be necessary for promoting it to a standard feature. So it could probably be a 1.x feature, not 2.0.
But there’s also no clear path for that yet.
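For anyone who wants to try the preview today, here’s a minimal sketch (the -Dpreview_mt flag and the CRYSTAL_WORKERS variable are the ones the runtime actually uses):

  # compile with the preview flag:  crystal build -Dpreview_mt prog.cr
  # run with a chosen worker count: CRYSTAL_WORKERS=4 ./prog
  channel = Channel(Int32).new

  4.times do |i|
    spawn { channel.send(i * i) }  # fibers are scheduled across the worker threads
  end

  4.times { puts channel.receive }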

There are still memory leak issues with the current concurrency model, which is closer to Go’s than to the true parallelism available in C++, Nim, Rust, etc.

See these threads for examples.

I would love to see a roadmap to provide true multi-threaded parallelism, even if it’s for 3.0, that doesn’t use channels as the means to synchronize operations.

There is a whole class of parallel algorithms|applications I would love to implement in Crystal that currently run best in Rust|Nim|C++.

I’ve extracted this thread from the original topic because it’s a separate discussion.

I may be missing some definition. I understand “true parallelism” is usually used as a pleonasm and just emphasizes that code execution happens in parallel, instead of just concurrently. Does “true parallelism” have a different meaning for you than just “parallelism”? What is the difference?

Yes, that’s basically it.

Sometimes I see people refer to multi-threading and concurrency together.

What I mean is operations that can run simultaneously in a thread|core, independent of the others, each with its own dedicated memory that is released upon completion, and requiring no communication between threads.

Do you really mean “simultaneously in a thread|core”? As I would understand that it seems quite a challenge. One thread/core can usually only execute a single operation at a time, not multiple ones simultaneously.
But you probably mean each execution context runs in a different thread, right? And they run simultaneously without communication between them.

That should actually be pretty much doable with fibers in preview_mt. The hard part of multi-threading is communication. If you don’t need that, it’s just like spawning a number of separate processes.
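Something like this minimal sketch, compiled with -Dpreview_mt: each fiber does independent work in its own slot, and the only communication is a completion signal.

  results = Array(Int64).new(8, 0_i64)
  done = Channel(Nil).new(8)

  8.times do |i|
    spawn do
      results[i] = (1_i64..1_000_000_i64 * (i + 1)).sum  # independent work, own slot
      done.send(nil)
    end
  end

  8.times { done.receive }  # the only synchronization point
  puts results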

Yes. Let me use code examples so I can be crystal clear (pun intended). :blush:

The following Twin Primes Sieve algorithm is implemented in Rust, Nim, C++, and Crystal.

Rust (fastest)

C++ and Nim have literally the same speed because both compile with GCC.

Nim

C++

Crystal is currently the slowest and has memory use issues (it eats memory).

Crystal

Rust uses the crate Rayon (which uses other crates) for parallelism.
C++ uses|compiles OpenMP (https://www.openmp.org/) in this implementation.
Nim uses its own parallelism model, which compiles to C code.

To spawn parallel threads in Rust I use Rayon’s .par_iter() method:

  let (lastwins, cnts): (Vec<_>, Vec<_>) = { // store outputs in these arrays
    let counter = RelaxedCounter::new();
    restwins.par_iter().map( |r_hi| {
      let out = twins_sieve(*r_hi, kmin, kmax, ks, start_num, end_num, modpg,
                            &primes.to_vec(), &resinvrs);
      print!("\r{} of {} twinpairs done", counter.increment(), pairscnt);
      out
    }).unzip()
  };

For C++ I use OpenMP’s parallel for pragma.

  #pragma omp parallel for
  for (int indx = 0; indx < pairscnt; indx += 1) { // for nextp|seg twinpair row indexes
     twins_sieve(indx);           // sieve the selected twinpair restracks
     cout << "\r" << (indx + 1) << " of " << pairscnt << " threads done";
  }                               // update 'twinscnt' w/seg twinprime cnts

For Nim I use its parallel spawn structure.

  #parallel:                       # perform in parallel
  for indx, r_hi in restwins:      # for each twin pair row index
    spawn twins_sieve(r_hi.uint, Kmin, Kmax, Kb, start_num, end_num, modpg, primes, resinvrs, indx)
    stdout.write("\r", (indx + 1), " of ", pairscnt, " twinpairs done")
  sync()                           # when all the threads finish

But (currently) for Crystal I have to set up channels, which caused memory leaks until I got help to fix that, but it still has other problems, and stops working once the input values get beyond a certain size.

  cnts = Array(UInt64).new(pairscnt, 0)      # the number of twinprimes found per thread
  lastwins = Array(UInt64).new(pairscnt, 0)  # the largest twinprime val for each thread
  done = Channel(Nil).new(pairscnt)

  restwins.each_with_index do |r_hi, i|      # sieve twinpair restracks
    spawn do
      lastwins[i], cnts[i] = twins_sieve(r_hi, kmin, kmax, ks, start_num, end_num, modpg, primes, resinvrs)
      print "\r#{i + 1} of #{pairscnt} twinpairs done"
      done.send(nil)
    end
  end
  pairscnt.times { done.receive }  # wait for all threads to finish

Ruby has long had fibers to spawn independent operations that can be performed in parallel if they involve IO and don’t conflict with the GIL. It also has threads, which can only run one at a time.

Ractors in Ruby 3.0 are meant to be its foray into parallelism that isn’t blocked by the GIL.

Looking at how these languages have|are implementing this capability, I think it would also be best for Crystal to offload this functionality into a separate official shard.

I know this is a hard problem, but you have many examples of various approaches to study. From my perspective as a user, I would like it to be simple to use and performant, with no side effects (excessive memory use, etc.).

I hope this has clarified things.

If there is a specific problem with Crystal’s multi-threading, please open an issue at Issues · crystal-lang/crystal · GitHub, ideally with a reduced reproducible example.

The issue to me, as the basis for considering features in Crystal 2.x, and later, goes beyond specific problems with the current multi-threading model.

It’s a more fundamental question: what should the model be for Crystal to implement competitive parallel processing that performs as well as, or better than, Rust, Nim, C++, et al?

Because Rust uses LLVM too, that may be the best model to assess first.

But I know there were long forum threads on this topic in Rust, and it seems they have settled on Rayon as the de facto implementation to use for most parallel processing applications.

I know Crystal’s core team and resources are way smaller than for Rust, which makes it even more imperative to consider these fundamental questions first.

I’d rather wait for a kickass implementation for 2.5, or greater, than to continue with the current model, which probably will never be as easy to use and performant as Rust, et al.

Have you tried to write that same program in Go? Crystal uses the same concurrency model as Go.

If the problems you are seeing are that it works slowly or that there are memory leaks, those are performance issues and potential bugs. But there’s nothing wrong with the concurrency model used by Crystal (Go is very popular and fast) and that will likely never change.

I will try to run your code on my machine when I have time.

Oh, I was about to optimize that program by using parallel_map to spawn one fiber per thread. Luckily I searched the forums because it seems I already did that: 1.0 multi-threading memory use issues - #15 by asterite

Multithreading support exists in the runtime. What’s lacking is standard library support like Go’s WaitGroup, maybe a parallel map, etc. But those are relatively easy compared to implementing the runtime support for multithreading. And those things can also come in shards, similar to how you use Rayon to achieve parallelism in Rust. I’m not saying that’s ideal, but it’s definitely possible.
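For illustration, here is a minimal parallel map sketch along those lines. The parallel_map helper below is hypothetical, not an existing stdlib or shard API; it just pairs one fiber per element with a completion channel:

  def parallel_map(inputs : Array(T), &block : T -> U) forall T, U
    results = Array(U?).new(inputs.size, nil)
    done = Channel(Nil).new(inputs.size)

    inputs.each_with_index do |input, i|
      spawn do
        results[i] = block.call(input)  # each fiber writes only its own slot
        done.send(nil)
      end
    end

    inputs.size.times { done.receive }  # wait for every fiber to finish
    results.map &.not_nil!
  end

  # usage: squares = parallel_map([1, 2, 3, 4]) { |n| n * n }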


Yes, I wrote|have a Go version, and it has even worse memory behavior (it gorges on memory) for this application.

I posted this problem to the Go forum, and the response was that the latest version of Go (at that time) “fixed” that problem, but you had to write your code using modules, and it wasn’t shown how to do that. It just became too big a PITA for me to be bothered with, so I just gave up on Go (for now).

And that’s another thing which has made me appreciate Rust.

Rust is a very pedantic language, and hard to learn, but they know that. It has excellent documentation, which for a lot of people is still not enough to understand how to actually program real situations the Rust way. But they took my original code and, between 4 or 5 people, rewrote or tweaked specific sections to create the current version in the gist. So that implementation is a group effort from some Rust gurus, doing things to my original code that are nowhere in the written documentation.

But again, Rust has way more years of people-development-hours (and now a Foundation) behind it compared to Crystal, and it shows.

But Crystal is sooo much easier to program in. For ME this is the one area of functionality I’m waiting on more than anything else for it to improve in.


Big Specific Wish/Feature for Parallelism.

Eliminate the use of CRYSTAL_WORKERS=n to set the number of usable threads at runtime.

$ CRYSTAL_WORKERS=8 ./your_program

This is very user unfriendly: it looks ugly, it requires the user to know how many threads are usable on each system the code will run on, and no other language requires its users to do anything like this. The default should be to use all the available threads on a system.
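For what it’s worth, the runtime can already query that number; a one-line check using the existing System.cpu_count API:

  puts System.cpu_count  # e.g. 8 on a 4-core/8-thread machine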

Please search GOMAXPROCS on the Internet. I’m not saying being able to configure this is good, but the claim that no other language does this is false.

The point is to make doing parallel processing better (simpler, easier, faster) for users. Whether some other languages force their users to do something similar (maybe to be mathematically accurate I should have said ...I am aware of) doesn’t change the point.

As a user, it’s just cumbersome and annoying to have to do this (and sometimes I forget). ALL of the languages I have personally used don’t require you to do this.


If you add export CRYSTAL_WORKERS=8 to your shell config file, such as ~/.zshrc or ~/.bashrc, you won’t need to specify it every time.

I’d be very careful with such a default. I don’t think there is a reasonable default that fits all use cases. Some applications are fine and most efficient with a minimal number of threads. Others require as many as possible for maximum efficiency. But then there are also limitations on shared system resources.

I totally agree that it’s not very user friendly that you have to use an environment variable. But multithreading is only a preview feature right now, therefore it should be acceptable to require some extra config.

For the future, we’ll have to iterate on ideas how this can be improved. For example, I could imagine offering different thread-allocation strategies that can be selected at compile time (potentially still allowing runtime override).
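As a rough illustration of that idea, Crystal’s existing compile-time flags could express the selection. The all_cores flag below is hypothetical, not an existing option:

  {% if flag?(:all_cores) %}
    # built with: crystal build -Dpreview_mt -Dall_cores prog.cr
    puts "strategy: one worker per logical CPU (#{System.cpu_count})"
  {% else %}
    puts "strategy: fixed default worker count"
  {% end %}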


For the future, we’ll have to iterate on ideas how this can be improved. For example, I could imagine offering different thread-allocation strategies that can be selected at compile time (potentially still allowing runtime override).

I can definitely see it being possible to set up completely new thread pools during runtime. I’m doing some exploration around this in my nested_scheduler shard that I hope will some day give inspiration for #6468. For that issue to be solved it will need to be possible (and default) to work inside the current pool of threads, but it turned out easier as a first step to get something working by always creating new threads instead.

See GitHub - yxhuvud/nested_scheduler (Shard for creating separate groups of fibers in a hierarchical way and to collect results and errors in a structured way) for reference. It is still missing a lot of features: I haven’t yet built support for collecting values or even exceptions, and cancellation/deadlines are pretty far out too. And I imagine there will be a lot of variations of wanted strategies to handle all of that - sometimes people want results, sometimes not, etc.


Each program has its own design considerations.
One needs workers = Cpu.count.
Another needs workers = 2 * disks.
Some prefer low worker counts because they don’t do many parallel operations.

Crystal seems to prefer low resource usage, or not hogging the entire system, or both.
For some applications this is correct. For others, the norm is to hog the entire system (e.g. a webserver).

Perhaps there should be an optional per-program override for the default number of workers:

# Only used if the environment variable is empty
CRYSTAL_WORKERS_DEFAULT = ->() { ... }
# or
Thread.workers_default = ->() { 128 }
# or
Scheduler.workers_default = ..
# or
...