Here I want to conceptually explain my parallel implementation so that people reading this thread (now|future) will understand how to use these techniques.
First, threads|cores are what your hardware supplies, fibers are Crystal's lightweight units of concurrent execution, and Channels are the communication abstraction between them.
The biggest difference between my parallel implementation and the previous one is the use of `Channel`s to pass only information, not data.
You should ideally prevent|minimize inter-thread communication so that each thread is not dependent on information|data from another.
For this application, there's no need for information|data sharing between threads, so `Channel`s don't need to be used for that purpose.
Here, each thread does arithmetic processing and output display, and only needs to signal to the outside that it's finished, by sending a message.
Thus for this task, we do in parallel the math to process|display a batch of seeds used to perform the Collatz algorithm.
So for `BATCH_SIZE = 2_000_000` there are 500 batches of 2_000_000 seeds that need processing.
Thus `spawn` allocates each batch to a separate thread until all 500 are done.
So `done = Channel(Nil).new(batches)` creates a channel to send 500 `done` messages, one for each thread.
All we need is a way to indicate the `spawn`ed threads are all `done`.
Thus we use a `Channel` to pass a message that each thread has finished: `done.send(nil)`
Outside the `spawn`ing process we sit and wait for 500 `done` messages to know we're finished: `batches.times { done.receive }`
That's it. Simple!!
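To make the flow concrete, here's a minimal sketch of that pattern. The names (`RANGE_END`, `first`, `last`) and the batch bookkeeping are illustrative assumptions, not the exact code from this thread:

```crystal
# Sketch of the batch + done-channel pattern described above.
BATCH_SIZE = 2_000_000
RANGE_END  = 1_000_000_000            # hypothetical total seed count
batches    = RANGE_END // BATCH_SIZE  # => 500 batches

done = Channel(Nil).new(batches)      # buffered: room for one message per batch

batches.times do |i|
  spawn do
    first = i * BATCH_SIZE + 1
    last  = first + BATCH_SIZE - 1
    # ... run the Collatz algorithm over seeds first..last, and
    # process|display the results entirely inside this fiber ...
    done.send(nil)                    # signal only that this batch finished
  end
end

batches.times { done.receive }        # block until all batches report done
```

Note that nothing but the completion signal ever crosses the channel; each fiber owns its batch's data from start to finish.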
Spawning Threads
The `spawn`ing process allocates a batch to an unused thread, in round-robin fashion. If you set `CRYSTAL_WORKERS` to the max number of hardware threads you get maximum performance, and maximum system memory use. You can also set a pool of threads for separate tasks, but don't make their total greater than your hardware thread count.
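For example, building and running with multithreading enabled might look like this (the filename `collatz.cr` and the worker count of 8 are assumptions; match the count to your own hardware):

```shell
# Compile with the multithreaded runtime, then set the worker-thread count.
crystal build --release -Dpreview_mt collatz.cr
CRYSTAL_WORKERS=8 ./collatz
```

If `CRYSTAL_WORKERS` is not set, the runtime falls back to its default worker count, so setting it explicitly is how you trade performance against memory use.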
Memory Use
Total system memory use here is a function of the `results`|`BATCH_SIZE` size and the number of threads used. A smaller `BATCH_SIZE` means smaller `results` Slices, which means lower memory per thread (and vice versa). And fewer threads in use means lower system memory use, and vice versa. Thus if system memory use is a concern, pick a smaller `BATCH_SIZE` and/or allocate fewer threads.
This covers most of the conceptual basics for Crystal parallel coding design, at least for this simple task.
I would encourage you to understand the conceptual flow of your problem as much as possible first before coding.
The simpler you can make your problem, the simpler your code can be, and the more likely it will be correct.
Extra: A Wish
I really enjoyed this thread, and personally learned new stuff. I think extracting the flow and information of this thread into a blog post, or even better a video, would be an excellent tool to teach people how to program for performance using Crystal.
This kind of tutorial information would go a long way to show people coming from other languages the benefits|utility|simplicity of using Crystal.