Kudos to Crystal!

jzakiya · December 19, 2021, 8:43pm

The last couple of days I took some time to recompile some code (my segmented parallel threaded Twin Primes sieve) in the newly updated versions of Crystal (1.2.2), Rust (1.57), and Nim (1.6.2).

Previously (and still) the Rust version is the fastest, but Crystal is now better in memory management, at least for this one app.

This program counts the number of Twin Primes within the (unsigned) 64-bit number space, i.e. numbers upto 2**64 - 1 = 18,446,744,073,709,551,615.

As the number ranges get bigger, it generates more sieving primes, and each thread consumes more system memory for processing.

Just over a year ago Crystal crashed pretty quickly as the input numbers grew, because it was eating up memory like pacman. But with some help from this forum, and the latest version, it now can process large inputs that Rust and Nim currently crash on, because they need more system memory to finish.

So I just want to congratulate the Crystal devs for continually improving it to the point where I see it now usable for heavy duty numerical computation applications as well.

Here’s an example of a run Rust and Nim crashed on because they were increasing memory usage that eventually ate up all my 16GB of system memory. Crystal hit about ~11/12 GB, and stayed there when all the threads were most active over the computationally heavy part of the run.

➜  crystal-projects CRYSTAL_WORKERS=8 ./twinprimes_ssoz 18400000000000000000 18400000000000005950
threads = 8
using Prime Generator parameters for P5
segment size = 199 resgroups; seg array is [1 x 4] 64-bits
twinprime candidates = 597; resgroups = 199
GC Warning: Repeated allocation of very large block (appr. size 1610616832):
        May lead to memory leak and poor performance
each of 3 threads has nextp[2 x 203034842] array
setup time = 22.434504 secs
perform twinprimes ssoz sieve
3 of 3 twinpairs done
sieve time = 18.054866 secs
total time = 40.48937 secs
last segment = 199 resgroups; segment slices = 1
total twins = 6; last twin = 18400000000000005942+/-1

fork-bomber · December 20, 2021, 8:04am

That’s very cool!

Would it be possible to access your code for all the langs to replicate your results ?

Thanks

jzakiya · December 20, 2021, 5:07pm

Here are gists of my different working versions so far (someone else did Java too).
The Go version technically works, but the code has lots of memory use issues that I don’t know how to fix (and its forum wasn’t helpful in fixing them either).
I also haven’t updated the D implementation to mimic the rest.

My system is:
System76 Gazelle laptop (2016), Intel i7-6700HQ, 2.6 - 3.5 GHz, 4C/8T, 16GB mem
GCC: 11.2.0 (C++, Nim)
LLVM: 12.0.1

The more memory you have the larger numbers/ranges you can do physically (not accounting for time).

I haven’t optimized the implementations for large numbers/ranges because it would add more code beyond the necessity to demonstrate the algorithm works. Please feel free to optimize the code to your hearts content, since it’s open source.

I’m very interested in seeing results on different hardware I don’t have to run it on (AMD, ARM, M1, PowerPC, RISC, etc).

Crystal
https://gist.github.com/jzakiya/2b65…792f16c63a8ac4

Rust
https://gist.github.com/jzakiya/b96b…eb3f35eb437225

Nim
https://gist.github.com/jzakiya/6c7e…dd62e3e3b2341e

C++
https://gist.github.com/jzakiya/fa76…599983be175a3f

D
https://gist.github.com/jzakiya/ae93…c7f97ff8ad0f61

Go
https://gist.github.com/jzakiya/fbc7…0ff7c2476373d9

beta-ziliani · December 20, 2021, 6:49pm

Awesome to hear that @jzakiya ! I wonder if you tried Julia, since it’s one of the go-to languages for such tasks. Because, you know, why staying happy with the results so far when you can try to find someone to beat you once again!

jzakiya · December 20, 2021, 8:44pm

Just what I need, more work to do.

Julia is on my wish list, but I don’t feel like (right now) learning a new language in my life.
In its forum I was trying to entice/challenge someone to do a faster Julia version, with no luck. I’d also like to see an (optimized) Java, Swift, Kotlin, C#, Zig…too. Anything compiled.

What’s actually more important to me is for the math that forms the foundation of the algorithm be academically recognized and taught in schools (universities and under).

I’d settle to at least see it included in the prime sieves list in Wikipedia (I tried to enter it but they wouldn’t let me because I was the creator).

What’s interesting in this case, Rust/C++ don’t have GCs and steadily eat memory, but Nim does (not sure what default is for 1.6.2 though) and still eats memory too, while Crystal doesn’t (after I implemented it not to). But this gets into language and compiler design that I leave to those gurus to ponder and understand.

fork-bomber · December 21, 2021, 9:14am

Thanks @jzakiya for ponying up your code! Insightful!

A kind of related question for folks here: Has there ever been an ‘official’ Crystal ‘entry’ to the Computer Language Benchmarks Game ?

The general consensus on those kind of microbenchmarks is that they are of questionable value to do language comparisons but I know from personal experience how influential they can be.

Would be neat to have the ability to use the ‘state of the art’ (!) language comparison suite I think.

Thoughts appreciated!

beta-ziliani · December 21, 2021, 5:51pm

Of course I didn’t mean it to put more work on you

jzakiya · December 21, 2021, 6:51pm

Oh I know, I was just playing with you.

But I wouldn’t mind seeing old school Ada, FORTRAN and Pascal/Delphi versions under my Christmas tree too. Santa needs to get them elves aworking.

FYI
I rebooted and only opened a terminal, and the base system uses ~500MB of my 16GB system memory, and reran all the different languages.

Crystal consistently topped out using 2.5 - 3.5 GB less memory than the rest. The setup times were roughly the same, which involved single core processing stuff.

The differences were in the parallel sieving processing. Crystal is about 6-9% slower (for those number sizes/ranges) but it hits it’s top memory use and just stays there like a rock until finished. (I used htop to monitor this.)

Like I said, Crystal has come a looong way these past couple of years, and I can’t wait to see what 2022 brings (with Crystal, politics and climate change excluded).

jzakiya · December 22, 2021, 9:41pm

CRYSTAL_WORKERS

This application has made me realize the utility of having CRYSTAL_WORKERS.

I can increase the effective number range processing by adjusting the number of CRYSTAL_WORKERS I can manually set to use at runtime.

To use 8 threads simultaneously, I reduced the input numbers from 64 to 62 bits.
Using htop to monitor memory/threads activity, I was able to eventually get this to run for a range of 19,505,950. This was on the razor’s edge of where it would run/crash, and backing off using CRYSTAL_WORKERS=7 allowed it to run all the time, though slower.

➜  crystal-projects CRYSTAL_WORKERS=8 ./twinprimes_ssoz 4410000000000000000 4410000000019505950
threads = 8
using Prime Generator parameters for P7
segment size = 92887 resgroups; seg array is [1 x 1452] 64-bits
twinprime candidates = 1393305; resgroups = 92887
GC Warning: Repeated allocation of very large block (appr. size 1610616832):
        May lead to memory leak and poor performance
each of 15 threads has nextp[2 x 102886522] array
setup time = 10.332723 secs
perform twinprimes ssoz sieve
GC Warning: Repeated allocation of very large block (appr. size 1646186496):
        May lead to memory leak and poor performance
2 of 15 twinpairs doneGC Warning: Repeated allocation of very large block (appr. size 1646186496):
        May lead to memory leak and poor performance
15 of 15 twinpairs done
sieve time = 26.278561 secs
total time = 36.611284 secs
last segment = 92887 resgroups; segment slices = 1
total twins = 13835; last twin = 4410000000019503222+/-1

I was able to reliably increase the range to 20+M by using 7 workers, as below.

➜  crystal-projects CRYSTAL_WORKERS=7 ./twinprimes_ssoz 4410000000000000000 4410000000020505950
threads = 8
using Prime Generator parameters for P7
segment size = 97649 resgroups; seg array is [1 x 1526] 64-bits
twinprime candidates = 1464735; resgroups = 97649
GC Warning: Repeated allocation of very large block (appr. size 805310464):
        May lead to memory leak and poor performance
each of 15 threads has nextp[2 x 102886522] array
setup time = 11.12939 secs
perform twinprimes ssoz sieve
GC Warning: Repeated allocation of very large block (appr. size 1646186496):
        May lead to memory leak and poor performance
1 of 15 twinpairs doneGC Warning: Repeated allocation of very large block (appr. size 1646186496):
        May lead to memory leak and poor performance
15 of 15 twinpairs done
sieve time = 33.82635 secs
total time = 44.95574 secs
last segment = 97649 resgroups; segment slices = 1
total twins = 14564; last twin = 4410000000020502132+/-1

Notice in both instances, the number of Twin Primes are about 1% of the number of twin candidates in each range.

Being able to manually set/limit the usable number of threads at runtime is a feature I’m not aware other languages have. In cases like this, I can now maximize memory use by limiting the number of simultaneous run threads, to get results for larger ranges.

Again, kudos to the devs for thinking about this.
(Though I suggest changing the default from 4 workers to the system number when compiling for multi-threading, then users won’t need to manually increase workers to get max performance, which is the usual desired case.)

I hope this little tip is helpful to others doing multi-threading if they encounter similar memory use issues.

ADDITION
Using only 2 workers I was able to process a range of over 1 billion 64-bit numbers, with headroom to spare. I think this is really damn impressive!

➜  crystal-projects CRYSTAL_WORKERS=2 ./twinprimes_ssoz 18400000000000000000 18400000001000005950
threads = 8
using Prime Generator parameters for P7
segment size = 131072 resgroups; seg array is [1 x 2048] 64-bits
twinprime candidates = 71429010; resgroups = 4761934
GC Warning: Repeated allocation of very large block (appr. size 805310464):
        May lead to memory leak and poor performance
each of 15 threads has nextp[2 x 203034841] array
setup time = 21.855971 secs
perform twinprimes ssoz sieve
1 of 15 twinpairs doneGC Warning: Repeated allocation of very large block (appr. size 3248558080):
        May lead to memory leak and poor performance
10 of 15 twinpairs doneGC Warning: Repeated allocation of very large block (appr. size 3248558080):
        May lead to memory leak and poor performance
15 of 15 twinpairs done
sieve time = 385.933012 secs
total time = 407.788983 secs
last segment = 43342 resgroups; segment slices = 37
total twins = 670310; last twin = 18400000001000005188+/-1

shirleyquirk · December 23, 2021, 7:13pm

Hi, this was brought to our attention on the Nim forum and I took a peek at why the Nim version was crashing.
Nim makes a copy of the primes array for each thread, which results in the oom crash, so for the Nim code to work on those large inputs, instead of passing primes to each worker thread, it’s enough to make that seq global. Pseudocode fix on that linked thread.

How does Crystal know it’s safe to pass the primes array by reference?

I do like how CRYSTAL_WORKERS works, that’s very handy. The equivalent for Nims threadpool is setMaxPoolSize

straight-shoota · December 23, 2021, 9:55pm

It doesn’t. Crystal has no mechanic that would duplicate the arguments to a spawned method, which is the case in Nim if I understand correctly?
There is nothing special happening with spawn in Crystal. It just create a new fiber and when that runs, it executes the block body. The array is shared between all fibers because the local var is closured; that always happens when a local variable from an outer scope is accessed within a proc (which the fiber body is).

jzakiya · December 24, 2021, 12:36am

I posted this question in the Rust forum, it was same issue (copy vs reference), just needed changing passing into twins_sieve: &primes.to_vec() → &primes

Rust now use a smidgen less memory for those inputs: ~11GB vs ~11.7 GB max.

I guess also need to make array resinvrs global in Nim too?

jzakiya · December 29, 2021, 10:25pm

FYI.

Today (Wed 2021/12/29) I updated code to significantly improve performance, and lower memory use. Here’s the gist link again, with updated code.

gist.github.com

https://gist.github.com/jzakiya/2b65b609f091dcbb6f792f16c63a8ac4

twinprimes_ssoz.cr

# This Crystal source file is a multiple threaded implementation to perform an
# extremely fast Segmented Sieve of Zakiya (SSoZ) to find Twin Primes <= N.

# Inputs are single values N, or ranges N1 and N2, of 64-bits, 0 -- 2^64 - 1.
# Output is the number of twin primes <= N, or in range N1 to N2; the last
# twin prime value for the range; and the total time of execution.

# This code was developed on a System76 laptop with an Intel I7 6700HQ cpu,
# 2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. Parameter tuning
# probably needed to optimize for other hardware systems (ARM, PowerPC, etc).

This file has been truncated. show original

I only had to add 1 loc, and change 4 others.

Originally sozpg generated|returned all the primes <= sqrt(end_num).
For small ranges over large input values, allot of those primes aren’t needed|used because they are larger than the input range (end_num - start_num). The only primes needed are those who have at least one multiple within the inputs range.

Now sozpg still has to generate all those primes, but it now only returns those with multiples within range, which greatly reduces the number of primes used in each thread, which means each thread also uses less memory.

Here’s an example to show performance gain for these 64-bit input numbers.

Original code.

➜  crystal-projects CRYSTAL_WORKERS=8 ./twinprimes_ssozgist 18446000000000000000 18446000000010991000
threads = 8
using Prime Generator parameters for P5
segment size = 65536 resgroups; seg array is [1 x 1024] 64-bits
twinprime candidates = 1099104; resgroups = 366368
each of 3 threads has nextp[2 x 203276347] array
setup time = 21.260179 secs
perform twinprimes ssoz sieve
3 of 3 twinpairs done
sieve time = 23.309584 secs
total time = 44.569763 secs
last segment = 38688 resgroups; segment slices = 6
total twins = 7366; last twin = 18446000000010989208+/-1

New Code.

➜  crystal-projects CRYSTAL_WORKERS=8 ./twinprimes_ssozgist1 18446000000000000000 18446000000010991000
threads = 8
using Prime Generator parameters for P5
segment size = 65536 resgroups; seg array is [1 x 1024] 64-bits
twinprime candidates = 1099104; resgroups = 366368
each of 3 threads has nextp[2 x 4168873] array
setup time = 21.493201 secs
perform twinprimes ssoz sieve
3 of 3 twinpairs done
sieve time = 0.485734 secs
total time = 21.978935 secs
last segment = 38688 resgroups; segment slices = 6
total twins = 7366; last twin = 18446000000010989208+/-1

The original code takes a total of 44.56 secs, with the sieve taking 23.31 secs
The new code takes at total of 21.97 secs, with the sieve taking 0.4857 secs

The original code used 203,276,347 primes per thread:
each of 3 threads has nextp[2 x 203276347] array

The new code used only 4,168,873 primes/thread:
each of 3 threads has nextp[2 x 4168873] array

I could reduce the primes generation using look-up tables, but that’s another story.
Just thought I’d pass this along for those interested in the code|algorithm.

Topic		Replies	Views
1.0 multi-threading memory use issues Crystal Contrib	32	1688	August 29, 2020
Twin Primes Segmented Sieve of Zakiya (SSoZ) Explained News	16	616	June 29, 2022
Timeline for multithreading support Crystal Contrib	31	2293	September 24, 2024
How to Parallelize this? Help & Support	6	801	March 25, 2020
Perplexing spawn arithmetic overflow errors Help & Support	23	513	March 24, 2021

Kudos to Crystal!

Related topics