Can CRYSTAL_WORKERS=X be set in source code?

I, as a user, want full programmatic control over the total resource base of the hardware I’m using, including its threads.

Crystal is currently the only compiled language I am aware of that doesn’t provide, by default, full use of a system’s threads. All the other languages I’ve written apps in (C|C++, D, Go, Nim, Rust) default to the number of threads the system has.

I realize Crystal’s multi-threading model is young, and is really a concurrency model based on fibers rather than a true parallel processing model based on threads. I hope Crystal will (soon) have a true parallel programming model. It needs one to compete against these other languages in the domains where it’s necessary.

Maybe circa 2010/11 a default of 4 threads made sense, when most systems in existence were Intel-based 32-bit 2C|4T systems. A decade later, most (commercially available, home, mobile) systems are at least 64-bit 4C|8T. Within the next 5-10 years, the base systems will be 8|12|16-thread AMD|Arm|et al systems.

Limiting access to system resources shouldn’t be a part of the language. Any limitation is an arbitrary assessment somebody made of what a user needs to be able to control on their own system.

I would love for Crystal to implement|mimic the ease and performance Rust has with its Rayon crate for parallel programming. IMO, it’s better than OpenMP for C|C++, at least for what I’ve used it for. It’s so simple to use, and it hides all the technicalities from the user. I would urge the team to at least philosophically understand its approach to performant|safe parallel programming.

Thus, I want to be able to programmatically control all the threads my system has, in a safe and performant way. Should be real easy, right! :grin:

This may be a bit of a hack, but maybe something like this would work?

#!/usr/bin/env crystal

# CRYSTAL_WORKERS is read by the runtime before any user code runs,
# so the trick is to re-exec the program with the variable already set.
unless ENV["LAUNCHED_WITH_OPTIONS"]?
  puts "Launching with options..."
  system "LAUNCHED_WITH_OPTIONS=1 CRYSTAL_WORKERS=#{System.cpu_count} #{Process.executable_path}"
  exit
end

puts "Launched with options!"

Every one of those languages has far more developers working on it than Crystal, with orders of magnitude more money poured into its development. You’re right about Go, at least since 1.5 (GOMAXPROCS defaulted to 1 for 6 years after Go hit 1.0), and I can’t speak to Nim, but the last time I used C, C++, D, and Rust, they all required you to create, manage, and collect your own threads. Do they have concurrency primitives now that don’t have a 1:1 mapping to POSIX threads?

Additionally, the number of cores on a machine isn’t sufficient context for a decision like this. It seems like the right move, especially for your specific use cases, but it carries the assumption that your program is the only thing using significant CPU on the whole machine, which won’t be true for folks deploying web services with it. For example, the JVM does exactly what you’re asking for here, and it causes problems in containerized deployment environments because it consumes CPU and RAM based on what the machine has rather than what’s available to the container. These kinds of things all need to be considered.

4 Likes

Crystal lets you redefine main for exactly this kind of thing:

fun main(argc : Int32, argv : UInt8**) : Int32
  # The runtime reads CRYSTAL_WORKERS inside Crystal.main, so set it first.
  # Note: the GC isn't initialized yet, so avoid anything that allocates
  # (string literals and LibC calls are fine).
  LibC.setenv("CRYSTAL_WORKERS", "8", 1)
  Crystal.main(argc, argv)
end
8 Likes

Fortran and C mostly don’t have that either. You have to go quite far up in the current standards to find threading support. It’s typically not a language construct but an aspect of the runtime. Crystal has a much more managed runtime than C, so supporting threading is a bigger undertaking in Crystal than in C.

1 Like

@jzakiya In the program you are trying this on, did running it with 8 workers instead of 4 lead to a performance improvement?

Oh yes, it’s designed to run faster with more threads. If you want, I can show you the results. Even better, I can show you the results for the Rust version, which can (now) do everything I want in parallel (on an 8-thread system).

I think having the ability to use fewer than the maximum system threads can be useful, but it shouldn’t be the default. And it should all be under program control.

That may apply to the kind of software you write, but it hardly works as a general rule.

I certainly wouldn’t want some random simple CLI tool to spin up ridiculously many threads by default just because it happens to run on a machine with 50 cores.

We should give developers more flexibility for defining the multithreading characteristics of a program. As a simple first step, I would consider a compile-time variable to specify the default value (it could still be overridden at runtime, but it would provide a default that adapts to the kind of program). A special value representing the number of available cores might also be a good idea to improve that story (specifically for the high-performance computing use cases mentioned by @jzakiya).
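A rough sketch of how those two pieces could fit together; the WORKERS_DEFAULT compile-time variable is hypothetical, only CRYSTAL_WORKERS exists today:

# Hypothetical compile-time default, read from the compiler's
# environment at build time (e.g. WORKERS_DEFAULT=8 crystal build app.cr).
DEFAULT_WORKERS = {{ (env("WORKERS_DEFAULT") || "4").id }}

# The runtime variable still wins; 0 stands in for the special
# "use all available cores" value mentioned above.
def effective_workers : Int32
  requested = ENV["CRYSTAL_WORKERS"]?.try(&.to_i?) || DEFAULT_WORKERS
  requested == 0 ? System.cpu_count.to_i : requested
end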

3 Likes

Strong disagree. Having to tune to avoid contention is no better than having to tune to use as many resources as you possibly can. A number like 4 that’s some reasonable way in between 1 core and all the cores is a decent compromise between our two use cases.

Even stronger disagree here. I already mentioned above how the JVM controlling how many resources it uses makes it unstable in containerized environments. The person running it, not the person who wrote it, should control how many resources it gets.

Is there something stopping you from putting export CRYSTAL_WORKERS=8 in your shell rc file? This would give you exactly what you want.

2 Likes

For CLI tooling, I would be surprised if it used anything but one core by default, TBH. Multicore or not should be an easy override on the command line (like --target=multicore:4). None of the regular CLI tools I use spins up additional threads.

I think there is another case to be made for allowing this to be set at runtime by the program. In the case where you distribute a Crystal program, it would be nice to be able to specify the number of workers for the user, or to allow the user to set the count via some sort of app config.

2 Likes

Let me say it one more time: I want anyone who uses my program to be able to programmatically control the use of whatever threads exist on the system it’s run on.

Please stop talking about what I can do on my system, external to the source code.

I don’t want anybody to have to do anything other than run the program and have it optimally use all the threads available on whatever OS|hardware it’s run on, in the easiest way possible.

This is the first time I’ve seen you say this, and even using the site search I can’t find any instance where you’ve expressed this as the reason behind your complaint, but I’ve seen you talk a half-dozen times about how you don’t like setting it every time you personally run your programs.

I was giving you a reasonable solution to the specific problem you’ve been expressing. You ignored my replies rather than clarifying why you wanted it, both in this thread and in another one last year where I gave you the same advice.

If you want to do things in code, you could provide a bootstrap script that sets up the environment variable for the user before running your program. This is a very common approach.

2 Likes

In an attempt to take this discussion in a useful direction: should the thread pool be created explicitly, with the caller having to add the fibers to it (maybe with some block magic to make it prettier)? That would allow arbitrary code to be run before initialization, which would solve @jzakiya’s problem. The overhead compared to the actual logic in the fibers should be negligible for non-toy programs.

I think a “default” factory method that reads the environment variable, with a fallback to 4, would be useful. If more complex configuration is desired in the future, this would provide a cleaner way than cramming complex data types into environment variables. Also, if someone wants to have a few fibers share 2 threads while another set shares another 2 threads, they could do that.
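To make that concrete, a minimal sketch of such a factory; FiberPool and its API are hypothetical, nothing like it exists in the stdlib today:

class FiberPool
  getter workers : Int32

  def initialize(@workers)
  end

  # Reads CRYSTAL_WORKERS the way the current scheduler does, with the
  # same fallback of 4 when the variable is unset or malformed.
  def self.default : FiberPool
    new(ENV["CRYSTAL_WORKERS"]?.try(&.to_i?) || 4)
  end
end

Separate groups of fibers sharing separate pairs of threads would then just be two FiberPool.new(2) instances.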

I tried @pseudonym’s answer and it works! @jzakiya, maybe you can try this, though it may still not let you control the number of CPUs programmatically as a parameter of a CLI tool.

Each program has its own design considerations.
One needs workers = Cpu.count.
Another needs workers = 2 * disks.
Some prefer low worker counts because they don’t do many parallel operations.

Perhaps there should be an optional per-program override for the default number of workers:

# Only used if the environment variable is empty
CRYSTAL_WORKERS_DEFAULT = ->() { ... }
# or
Thread.workers_default = ->() { 128 }
# or
Scheduler.workers_default = ..
# or
...

(Apparently I previously posted this in the wrong thread)

I think what we should do is have an environment variable, read at compile time, that controls the number of threads, with the possibility to rely on the number of CPUs. Then OP could compile their program like that and distribute it, making sure it uses all CPUs. Problem solved.

Yeah, I think that’s what I suggested in Can CRYSTAL_WORKERS=X be set in source code? - #21 by straight-shoota

For some use cases you may need more flexibility, though. For that you could override Crystal::Scheduler.worker_count. That should allow you to read the value from a configuration file, for example. Maybe we should document this method as part of the stdlib API.
Or for maximum flexibility of thread creation, you can override Crystal::Scheduler.init_workers.
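
For illustration, a sketch of the worker_count override; Crystal::Scheduler is internal and undocumented, so the method’s name, signature, and visibility may change between releases, and workers.conf is just a made-up config file:

# Reopen the internal scheduler class (multithreading needs -Dpreview_mt).
class Crystal::Scheduler
  private def self.worker_count
    # Hypothetical policy: a config file wins, then CRYSTAL_WORKERS,
    # then every CPU the machine reports.
    if File.exists?("workers.conf")
      File.read("workers.conf").strip.to_i? || System.cpu_count.to_i
    else
      ENV["CRYSTAL_WORKERS"]?.try(&.to_i?) || System.cpu_count.to_i
    end
  end
end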

1 Like

How would you set 2 * disks? Or cpu_count.clamp(1, 4) for a service that has at most 4 active Fibers?

Yes please. This seems like the simplest and most comprehensive solution.

My opinion here is that I agree that there should be a reasonably small number of threads created by default, and that number should be possible to modify using a global variable.

But one size does not fit all, and I do agree with jzakiya that programmatic control over threads is good to have. A single global value does not give enough control. I think it should be possible to set up separate thread pools with separate scheduling that run specific tasks, both for one-off computations and for long-running tasks.

I want something like

require "nested_scheduler"

NestedScheduler::ThreadPool.nursery(thread_count: 16) do |pool|
  16.times do
    pool.spawn { perform_slow_computation }
  end
end

which would then spawn 16 threads and start a fiber for each thread. Then the created threads would live until all fibers have completed.

The example works using GitHub - yxhuvud/nested_scheduler: Shard for creating separate groups of fibers in a hierarchical way and to collect results and errors in a structured way, which implements it by overriding a lot of private stuff (expect it to break for a while on every new Crystal release). It is a first stab at trying to get #6468 somewhere. Much is lacking (and there are probably many bugs), but the big showstopper keeping me from starting the discussion of upstreaming it into Crystal proper is that it doesn’t yet support setting up a nursery where the fibers use the same thread pool as the parent, which I feel is a necessary feature, since setting up threads is too costly for many use cases.

3 Likes