1.0 multi-threading memory use issues

jzakiya · August 18, 2020, 6:22pm

This is a request to fix, I hope fixable, issue with multi-threading before 1.0.

I have a multi-threaded twinprimes sieve that I’ve done now in Nim, Rust, D, and Crystal.

The Crystal implementation is the slowest, but that’s not the immediate issue of concern.

Each thread does an independent process of the sieve.
The issue with Crystal is it doesn’t release the memory used by each thread, so as each
thread’s process runs, you see memory usage (via htop, etc) steadily increase until the sieve
finishes, then the active memory use drops back to the initial amount.

For all the other language implementations the memory use hits some constant amounts
which doesn’t deviate much over the entire process.

This problem severely restricts the input size for the sieve because the larger the range values the higher the memory use, until it eats up all the usable memory and hangs the system (if I don’t abort first).

I don’t know if this is a garbage collector issue, and/or some other structural problem.
I think it would be good to fix this issue before 1.0 to at least present Crystal as a viable
language for performing multi-thread applications, regarding memory use too.

I am sure this also affects speed, because it overflows the systems cache causing Crystal
to be substantially slower (almost twice as slow to Rust depending on input size).

So fixing the memory leak issues will probably have a significant benefit on performance too.

Thanks for your consideration.

wontruefree · August 18, 2020, 6:52pm

Do you have any sample code?
It would be nice to have some reproducible code to work with to figure out what is going on.

jzakiya · August 18, 2020, 7:35pm

Here’s a gist of the code. Just follow the instructions to compile and run.

gist.github.com

https://gist.github.com/jzakiya/2b65b609f091dcbb6f792f16c63a8ac4

twinprimes_ssoz.cr

# This Crystal source file is a multiple threaded implementation to perform an
# extremely fast Segmented Sieve of Zakiya (SSoZ) to find Twin Primes <= N.

# Inputs are single values N, or ranges N1 and N2, of 64-bits, 0 -- 2^64 - 1.
# Output is the number of twin primes <= N, or in range N1 to N2; the last
# twin prime value for the range; and the total time of execution.

# This code was developed on a System76 laptop with an Intel I7 6700HQ cpu,
# 2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. Parameter tuning
# probably needed to optimize for other hardware systems (ARM, PowerPC, etc).

This file has been truncated. show original

jzakiya · August 19, 2020, 9:45pm

FYI, here are the Nim and Rust versions you can compile to see the
difference in performance and memory use compared to Crystal version.

Nim

gist.github.com

https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e

twinprimes_ssoz.nim

#[
 This Nim source file is a multiple threaded implementation to perform an
 extremely fast Segmented Sieve of Zakiya (SSoZ) to find Twin Primes <= N.

 Inputs are single values N, or ranges N1 and N2, of 64-bits, 0 -- 2^64 - 1.
 Output is the number of twin primes <= N, or in range N1 to N2; the last
 twin prime value for the range; and the total time of execution.

 This code was developed on a System76 laptop with an Intel I7 6700HQ cpu,
 2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. Parameter tuning

This file has been truncated. show original

Rust (the fastest so far)

gist.github.com

https://gist.github.com/jzakiya/b96b0b70cf377dfd8feb3f35eb437225

twinprimes_ssoz.rs

// This Rust source file is a multiple threaded implementation to perform an
// extremely fast Segmented Sieve of Zakiya (SSoZ) to find Twin Primes <= N.

// Inputs are single values N, or ranges N1 and N2, of 64-bits, 0 -- 2^64 - 1.
// Output is the number of twin primes <= N, or in range N1 to N2; the last
// twin prime value for the range; and the total time of execution.

// This code was developed on a System76 laptop with an Intel I7 6700HQ cpu,
// 2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. Parameter tuning
// probably needed to optimize for other hardware systems (ARM, PowerPC, etc).

This file has been truncated. show original

asterite · August 19, 2020, 9:57pm

What number are you using as input?

jzakiya · August 19, 2020, 10:05pm

A good number to use is 1 trillion (1_000_000_000_000),
as its large enough to see the memory use steadily increase,
but small enough to finish quickly. Of course, this all depends
on how many threads your system has and its clock speed. YMMV.

asterite · August 20, 2020, 1:18pm

I think the problem with the memory is that spawn accesses those big arrays. Because those are in the stack of the method that’s being executed, and they are referenced by spawn, its fiber can’t be freed… or something like that, I don’t have much time to think right now.

Try to never share memory and instead communicate data using channels. Always.

With this change it ends up consuming about 94MB:

gist.github.com

https://gist.github.com/asterite/921d8b74928edbc5c5b496f3c89a213c

twinprimes.cr

# This Crystal source file is a multiple threaded implementation to perform an
# extremely fast Segmented Sieve of Zakiya (SSoZ) to find Twin Primes <= N.

# Inputs are single values N, or ranges N1 and N2, of 64-bits, 0 -- 2^64 - 1.
# Output is the number of twin primes <= N, or in range N1 to N2; the last
# twin prime value for the range; and the total time of execution.

# This code was developed on a System76 laptop with an Intel I7 6700HQ cpu,
# 2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. Parameter tuning
# probably needed to optimize for other hardware systems (ARM, PowerPC, etc).

This file has been truncated. show original

asterite · August 20, 2020, 1:18pm

Could you share those numbers?

jzakiya · August 20, 2020, 7:50pm

Hey @asterite, I ran your version, and it did stabilize the memory issues, but is slower than the original.

Here are some quick times run on my old 2011 Lenovo, I5 2|4 cores|threads, 2.9 GHz.

Input: 1_000_000_000_000

Rust: 135 secs
Nim: 139 secs
Crystal original: 175 secs
Crystal Asterite: 185 secs

asterite · August 20, 2020, 8:33pm

Are you using MT? And if so, how many workers?

I won’t have time to optimize this, but, as usual, a profiler can help.

Also, it’s impossible for Crystal to always be faster than other languages. And it’s very hard for Crystal to beat Rust in these kind of things.

jzakiya · August 20, 2020, 8:47pm

Here is how its compiled:

$ crystal build twinprimes_ssoz.cr -Dpreview_mt --release

Is that the MT you’re referring too?

Like I said, I’m bringing attention to this so it can be improved in the future.
Rust hit 1.0 in 2015, and has many more resources than Crystal, so I expect it to be way ahead. But people will always compare languages, so its always good to know where you stand and need to get better.

Now that I see how to do it for stable memory I can choose which to use.
Thank you very much, as always. Someone else may be able to focus on this.
As Crystal matures I’m sure it will get better all around.

asterite · August 20, 2020, 8:53pm

Yes, thanks, that’s what I meant with MT.

The default number of workers (threads) is 4. I just tried it with 8 and it’s a bit faster. For me it goes from 83s to 66s.

What happens if you set the env var CRYSTAL_WORKERS to 8 ? (or how many cpus you have)

Also, I was profiling this and I see a lot of union types which might slow down things. Here’s the signature generated by the compiler:

*nextp_init<Int32, (Int32 | UInt64), Int32, (Array(Int32) | Array(UInt64)), Array(Int32)>:Array(UInt64)

Having a union of Int32 | UInt64 will be slow. Same for that array of union. If possible, try to annotate the method arguments with the types you expect. With numeric computations this is extremely important.

asterite · August 20, 2020, 9:00pm

Here’s a version without unions though it’s not working faster for me: https://gist.github.com/asterite/22860653aa8ec9ba57c1811e646a9e4a

That’s probably the limit of Crystal and the current MT implementation, GC, etc.

jzakiya · August 20, 2020, 9:38pm

FYI, It has same speed as your first version, but now leaks memory like the original,
so it’s an all around no win.

In general, the more threads you have to use the faster the results, so set CRYSTAL_WORKERS=Max_threads for fastest operation.

It would be much nicer in the future for multi-threading to just default using the available number of threads like in D/Nim/Rust.

asterite · August 21, 2020, 12:41pm

@jzakiya I have some new code. Can you try it?

gist.github.com

https://gist.github.com/asterite/3207fe5e31c0b944fd57c754f9dfa56f

maybe_faster.cr

# This Crystal source file is a multiple threaded implementation to perform an
# extremely fast Segmented Sieve of Zakiya (SSoZ) to find Twin Primes <= N.

# Inputs are single values N, or ranges N1 and N2, of 64-bits, 0 -- 2^64 - 1.
# Output is the number of twin primes <= N, or in range N1 to N2; the last
# twin prime value for the range; and the total time of execution.

# This code was developed on a System76 laptop with an Intel I7 6700HQ cpu,
# 2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. Parameter tuning
# probably needed to optimize for other hardware systems (ARM, PowerPC, etc).

This file has been truncated. show original

I checked your Rust code. It doesn’t use channels like Crystal. It does a parallel map to compute things, then uses that to compute the final results. Of course that’s faster than using channels because there’s no need to communicate anything between fibers.

So, one has to be careful about comparing benchmarks that use different algorithms.

Also, I replaced Array with Slice because I think that’s more similar to Rust’s vec, plus they never grow so it’s okay to use Slice.

There’s no parallel map so I’m using an implementation courtesy of @waj

In my machine it ends up running a bit faster and consuming just 20MB of memory.

Can you try it on your machine and compare the time/memory to Rust and Nim? I’m not sure this will beat them, but maybe the times will be similar.

jzakiya · August 21, 2020, 8:37pm

Here are the results for an input of 1 trillion (1_000_000_000_000):

Asterite Version 1: 185.1 secs; constant memory use
Asterite Version 2: 185.1 secs; increasing memory use
Asterite Version 3: 172.5 secs; constant memory use
Original Version: 174.5 secs; increasing memory use

So the latest version is a total win in both performance and memory use.

But I gotta say, the code seems like magic, because there’s nothing in the docs which would ever lead one to write anything like it to optimize performance/memory for parallel processing (multi-threading).

Actually, the algorithm is the same, what is being tested is the implementation for each language. That’s why this is a good algorithm to use to test the languages variations.

As you’ve demonstrated, to get the best current Crystal implementation you had to engage in guru wizardry I (most others) could never ever conceive of doing.

Also, Crystal’s parallel model is too reliant on just channels, as in many cases (like this one) there is no absolute need for communication between threads

Of course, Rust’s not having garbage collection and direct memory ownership provides and advantage here, which may be something to consider in the development of Crystal’s parallel process paradigm.

Rust has basic multi-threading baked into its core, but uses external crates like Rayon to do all the extra development. This is maybe the best thing to do for Crystal too, to create an official shard that contains the latest/greatest capabilities for multi-threading.

So the newest version is both faster and memory conservative, but I have no Idea how/why you did what you did. A good writeup (and copious documentation) would be very helpful for us mere mortals to understand the details of the implementation, and how/when to use them for future cases.

asterite · August 22, 2020, 11:14am

It’s not guru wizardry. The language is incomplete. The std has no facolities to help writing concurrent or multithreaded code other than the primitives to build those (spawn and channel). Eventually there will be, and then no wizardry will be needed.

I think the next main slowness, as you pointed out, is the memory management. I believe Rust uses jemalloc which is extremely fast. We use bohem GC and it’s probably slower, but this is just a guess.

And no, there will never be manual memory management in Crystal. Similar to how there will never be manual memory management in Ruby

asterite · August 22, 2020, 11:20am

Another improvement could be having a pool of arrays so these are reused. I think that should give a big boost, but it’s harder to code.

ComputerMage · August 22, 2020, 2:00pm

And what about -Dgc_none? Looks like I still can use it if I don’t need GC. src/gc/none.cr

asterite · August 22, 2020, 2:26pm

Sure, you can do it. Then a program that allocates memory can only run for a few seconds before all your computer’s memory is consumed.

To be clear, -Dgc_none is a compiler dev tool to try out code without involving the more complex GC. In reality it never makes sense to use it, and I’d like that flag to be eventually removed from the language (same as the --prelude option).

Topic		Replies	Views
Kudos to Crystal! News	12	2162	December 29, 2021
Timeline for multithreading support Crystal Contrib	31	2285	September 24, 2024
How to Parallelize this? Help & Support	6	801	March 25, 2020
Perplexing spawn arithmetic overflow errors Help & Support	23	513	March 24, 2021
Multi-threading ceases past certain input values Help & Support	9	524	July 8, 2022

1.0 multi-threading memory use issues

Related topics