1.0 multi-threading memory use issues

This is a request to fix, I hope fixable, issue with multi-threading before 1.0.

I have a multi-threaded twinprimes sieve that I’ve done now in Nim, Rust, D, and Crystal.

The Crystal implementation is the slowest, but that’s not the immediate issue of concern.

Each thread does an independent process of the sieve.
The issue with Crystal is it doesn’t release the memory used by each thread, so as each
thread’s process runs, you see memory usage (via htop, etc) steadily increase until the sieve
finishes, then the active memory use drops back to the initial amount.

For all the other language implementations the memory use hits some constant amounts
which doesn’t deviate much over the entire process.

This problem severely restricts the input size for the sieve because the larger the range values the higher the memory use, until it eats up all the usable memory and hangs the system (if I don’t abort first).

I don’t know if this is a garbage collector issue, and/or some other structural problem.
I think it would be good to fix this issue before 1.0 to at least present Crystal as a viable
language for performing multi-thread applications, regarding memory use too.

I am sure this also affects speed, because it overflows the systems cache causing Crystal
to be substantially slower (almost twice as slow to Rust depending on input size).

So fixing the memory leak issues will probably have a significant benefit on performance too.

Thanks for your consideration.

4 Likes

Do you have any sample code?
It would be nice to have some reproducible code to work with to figure out what is going on.

Here’s a gist of the code. Just follow the instructions to compile and run.

FYI, here are the Nim and Rust versions you can compile to see the
difference in performance and memory use compared to Crystal version.

Nim

Rust (the fastest so far)

What number are you using as input?

A good number to use is 1 trillion (1_000_000_000_000),
as its large enough to see the memory use steadily increase,
but small enough to finish quickly. Of course, this all depends
on how many threads your system has and its clock speed. YMMV.

I think the problem with the memory is that spawn accesses those big arrays. Because those are in the stack of the method that’s being executed, and they are referenced by spawn, its fiber can’t be freed… or something like that, I don’t have much time to think right now.

Try to never share memory and instead communicate data using channels. Always.

With this change it ends up consuming about 94MB:

2 Likes

Could you share those numbers?

Hey @asterite, I ran your version, and it did stabilize the memory issues, but is slower than the original.

Here are some quick times run on my old 2011 Lenovo, I5 2|4 cores|threads, 2.9 GHz.

Input: 1_000_000_000_000

Rust: 135 secs
Nim: 139 secs
Crystal original: 175 secs
Crystal Asterite: 185 secs

Are you using MT? And if so, how many workers?

I won’t have time to optimize this, but, as usual, a profiler can help.

Also, it’s impossible for Crystal to always be faster than other languages. And it’s very hard for Crystal to beat Rust in these kind of things.

1 Like

Here is how its compiled:

$ crystal build twinprimes_ssoz.cr -Dpreview_mt --release

Is that the MT you’re referring too?

Like I said, I’m bringing attention to this so it can be improved in the future.
Rust hit 1.0 in 2015, and has many more resources than Crystal, so I expect it to be way ahead. But people will always compare languages, so its always good to know where you stand and need to get better.

Now that I see how to do it for stable memory I can choose which to use.
Thank you very much, as always. Someone else may be able to focus on this.
As Crystal matures I’m sure it will get better all around.

2 Likes

Yes, thanks, that’s what I meant with MT.

The default number of workers (threads) is 4. I just tried it with 8 and it’s a bit faster. For me it goes from 83s to 66s.

What happens if you set the env var CRYSTAL_WORKERS to 8 ? (or how many cpus you have)

Also, I was profiling this and I see a lot of union types which might slow down things. Here’s the signature generated by the compiler:

*nextp_init<Int32, (Int32 | UInt64), Int32, (Array(Int32) | Array(UInt64)), Array(Int32)>:Array(UInt64)

Having a union of Int32 | UInt64 will be slow. Same for that array of union. If possible, try to annotate the method arguments with the types you expect. With numeric computations this is extremely important.

2 Likes

Here’s a version without unions though it’s not working faster for me: https://gist.github.com/asterite/22860653aa8ec9ba57c1811e646a9e4a

That’s probably the limit of Crystal and the current MT implementation, GC, etc.

FYI, It has same speed as your first version, but now leaks memory like the original,
so it’s an all around no win.

In general, the more threads you have to use the faster the results, so set CRYSTAL_WORKERS=Max_threads for fastest operation.

It would be much nicer in the future for multi-threading to just default using the available number of threads like in D/Nim/Rust.

1 Like

@jzakiya I have some new code. Can you try it?

I checked your Rust code. It doesn’t use channels like Crystal. It does a parallel map to compute things, then uses that to compute the final results. Of course that’s faster than using channels because there’s no need to communicate anything between fibers.

So, one has to be careful about comparing benchmarks that use different algorithms.

Also, I replaced Array with Slice because I think that’s more similar to Rust’s vec, plus they never grow so it’s okay to use Slice.

There’s no parallel map so I’m using an implementation courtesy of @waj

In my machine it ends up running a bit faster and consuming just 20MB of memory.

Can you try it on your machine and compare the time/memory to Rust and Nim? I’m not sure this will beat them, but maybe the times will be similar.

3 Likes

Here are the results for an input of 1 trillion (1_000_000_000_000):

Asterite Version 1: 185.1 secs; constant memory use
Asterite Version 2: 185.1 secs; increasing memory use
Asterite Version 3: 172.5 secs; constant memory use
Original Version: 174.5 secs; increasing memory use

So the latest version is a total win in both performance and memory use.

But I gotta say, the code seems like magic, because there’s nothing in the docs which would ever lead one to write anything like it to optimize performance/memory for parallel processing (multi-threading).

Actually, the algorithm is the same, what is being tested is the implementation for each language. That’s why this is a good algorithm to use to test the languages variations.

As you’ve demonstrated, to get the best current Crystal implementation you had to engage in guru wizardry I (most others) could never ever conceive of doing.

Also, Crystal’s parallel model is too reliant on just channels, as in many cases (like this one) there is no absolute need for communication between threads

Of course, Rust’s not having garbage collection and direct memory ownership provides and advantage here, which may be something to consider in the development of Crystal’s parallel process paradigm.

Rust has basic multi-threading baked into its core, but uses external crates like Rayon to do all the extra development. This is maybe the best thing to do for Crystal too, to create an official shard that contains the latest/greatest capabilities for multi-threading.

So the newest version is both faster and memory conservative, but I have no Idea how/why you did what you did. A good writeup (and copious documentation) would be very helpful for us mere mortals to understand the details of the implementation, and how/when to use them for future cases.

3 Likes

It’s not guru wizardry. The language is incomplete. The std has no facolities to help writing concurrent or multithreaded code other than the primitives to build those (spawn and channel). Eventually there will be, and then no wizardry will be needed.

I think the next main slowness, as you pointed out, is the memory management. I believe Rust uses jemalloc which is extremely fast. We use bohem GC and it’s probably slower, but this is just a guess.

And no, there will never be manual memory management in Crystal. Similar to how there will never be manual memory management in Ruby

1 Like

Another improvement could be having a pool of arrays so these are reused. I think that should give a big boost, but it’s harder to code.

And what about -Dgc_none? Looks like I still can use it if I don’t need GC. src/gc/none.cr

2 Likes

Sure, you can do it. Then a program that allocates memory can only run for a few seconds before all your computer’s memory is consumed.

To be clear, -Dgc_none is a compiler dev tool to try out code without involving the more complex GC. In reality it never makes sense to use it, and I’d like that flag to be eventually removed from the language (same as the --prelude option).