1.0 multi-threading memory use issues

FYI, It has same speed as your first version, but now leaks memory like the original,
so it’s an all around no win.

In general, the more threads you have to use the faster the results, so set CRYSTAL_WORKERS=Max_threads for fastest operation.

It would be much nicer in the future for multi-threading to just default using the available number of threads like in D/Nim/Rust.

1 Like

@jzakiya I have some new code. Can you try it?

I checked your Rust code. It doesn’t use channels like Crystal. It does a parallel map to compute things, then uses that to compute the final results. Of course that’s faster than using channels because there’s no need to communicate anything between fibers.

So, one has to be careful about comparing benchmarks that use different algorithms.

Also, I replaced Array with Slice because I think that’s more similar to Rust’s vec, plus they never grow so it’s okay to use Slice.

There’s no parallel map so I’m using an implementation courtesy of @waj

In my machine it ends up running a bit faster and consuming just 20MB of memory.

Can you try it on your machine and compare the time/memory to Rust and Nim? I’m not sure this will beat them, but maybe the times will be similar.

3 Likes

Here are the results for an input of 1 trillion (1_000_000_000_000):

Asterite Version 1: 185.1 secs; constant memory use
Asterite Version 2: 185.1 secs; increasing memory use
Asterite Version 3: 172.5 secs; constant memory use
Original Version: 174.5 secs; increasing memory use

So the latest version is a total win in both performance and memory use.

But I gotta say, the code seems like magic, because there’s nothing in the docs which would ever lead one to write anything like it to optimize performance/memory for parallel processing (multi-threading).

Actually, the algorithm is the same, what is being tested is the implementation for each language. That’s why this is a good algorithm to use to test the languages variations.

As you’ve demonstrated, to get the best current Crystal implementation you had to engage in guru wizardry I (most others) could never ever conceive of doing.

Also, Crystal’s parallel model is too reliant on just channels, as in many cases (like this one) there is no absolute need for communication between threads

Of course, Rust’s not having garbage collection and direct memory ownership provides and advantage here, which may be something to consider in the development of Crystal’s parallel process paradigm.

Rust has basic multi-threading baked into its core, but uses external crates like Rayon to do all the extra development. This is maybe the best thing to do for Crystal too, to create an official shard that contains the latest/greatest capabilities for multi-threading.

So the newest version is both faster and memory conservative, but I have no Idea how/why you did what you did. A good writeup (and copious documentation) would be very helpful for us mere mortals to understand the details of the implementation, and how/when to use them for future cases.

3 Likes

It’s not guru wizardry. The language is incomplete. The std has no facolities to help writing concurrent or multithreaded code other than the primitives to build those (spawn and channel). Eventually there will be, and then no wizardry will be needed.

I think the next main slowness, as you pointed out, is the memory management. I believe Rust uses jemalloc which is extremely fast. We use bohem GC and it’s probably slower, but this is just a guess.

And no, there will never be manual memory management in Crystal. Similar to how there will never be manual memory management in Ruby

1 Like

Another improvement could be having a pool of arrays so these are reused. I think that should give a big boost, but it’s harder to code.

And what about -Dgc_none? Looks like I still can use it if I don’t need GC. src/gc/none.cr

2 Likes

Sure, you can do it. Then a program that allocates memory can only run for a few seconds before all your computer’s memory is consumed.

To be clear, -Dgc_none is a compiler dev tool to try out code without involving the more complex GC. In reality it never makes sense to use it, and I’d like that flag to be eventually removed from the language (same as the --prelude option).

Hey @asterite, after studying your code, I just replaced the Arrays with Slices in the two methods twins_sieve and nextp_init in my original version, as you did in your code, and here are the results, for input of 1_000_000_000_000.

My original using Arrays: 174.5 secs
My original using Slices: 161.1 secs

The memory still grows as before, but that’s seem to be due to, as your stated, the incomplete multi-threading model.

I want to say I really appreciate how you jumped on this issue and made the code much better. I have no doubt Crystal will catch up to Nim, Rust, et al, in multi-processing very soon.
Please feel free to use this application as you please as a test case.
Here is the updated gist of my code using Slices.

4 Likes

Cool!

Yeah, I love optimizing things.

It strange that the memory keeps goigg up for you. In my case it never goes beyond 20MB. Maybe a difference of OS? No idea.

There already is, one can call LibC.malloc as easily as any other C function.

My laptop for these results is a 2011 Lenovo, I5 2C|4T, 2.9 GHz, Linux kernel 5.7.15.
Crystal 0.35.1

Actually, I want to experiment with that flag as part of my old idea to use ARC (automatic Reference Counting) as possible option to GC.

2 Likes

And how do you use in crystal? I’ve got no idea and I thought it would be impossible due to the strong type system

Yes, to be absolutely clear, doing so is inherently unsafe and can easily lead to memory leaks! LibC.malloc returns a raw Pointer, which you can then wrap in a slice and pass around like any other, but doing so discards guarantees the compiler offers (memory safety).

That’s where ARC should help.

1 Like

ARC would be a dream come true, for sure! I just don’t feel capable of implementing it personally.

1 Like

I am planning to do it but don’t have enough time. It may be equal or even may be more work than debugger support (I spent 6 months on it).
But I am still willing to work on it as it is an interesting idea to implement.

3 Likes

You mean ARC like the ones implemented in Swift & ObjectiveC with the LLVM support? ( retain and release). That would be awesome, I think.

1 Like

You may want to follow the development of Nim’s ARC.

https://forum.nim-lang.org/t/5734

2 Likes

Yes, exactly that approach.

1 Like