Twin Primes Segmented Sieve of Zakiya (SSoZ) Explained

I just released this new paper which prominently features Crystal code.

Twin Primes Segmented Sieve of Zakiya (SSoZ) Explained

Over the past few years I’ve gotten help in developing the code from the Forum and thought
I’d share the results of the refined code’s comparative performance to some other languages.

I previously had posted about it here

Most recently I brought up a problem I observed with multi-threading here

It seems using Crystal’s BitArray eliminates that problem.
After its implementation speedup by @Hertzdevil with 1.4.0 (April 2022) I now use bitarrays for the reference implementations.

Allot of refinements and performance increases have occurred as Crystal has developed.
The most prominent area still needing development to match Rust’s, et al, performance level is for true parallel processing|multi-threading. But that’s known and should improve (hopefully soon).

Here are the Crystal sources for the code in the paper.

twinprimes_ssoz

cousinprimes_ssoz

I would love to see what times people get on different hardware, like an M1, ARMs, et al.

9 Likes

Excellent work. At least on this area, we can say that Crystal is (practically) as fast as C.

Combined with its more simple syntax and approach, those are some of the reasons I like Crystal so much.

2 Likes

The Crystal version was by far the easiest|most fun code to do, though I initially developed the code in 2017/18 in Nim. Back then (5|4 years ago) Crystal wasn’t “there” as a viable language to do this with.

The Rust code had heavy input from its forum, as I still don’t have an organic understanding of some of it (around the parallel stuff), i.e., I still wouldn’t be able to write that code by myself from scratch. But the Rust forum people are (mostly) really nice and patient and love to code, so many jumped in to help improve the code and show how to do things the Rust Way.

For what my focus is on (currently) Crystal is 95% there, as I can do these types of applications with only native components (no external shards needed). But I think for multi-threading stuff it may be easier to develop it externally in a shard until its matures, then possibly bring it back in.

Even though Rust has the native components to do multi-threading (threads, locks, mutexes, etc) almost everybody uses the Rayon crate for serious work. And for C++ multi-threading was done with OpenMP, which comes standard with gcc/g++ on most systems. It would be true nirvana to be able to write the Crystal parallel code with parallel iterators like in Rust, and achieve its performance too.

I started writing this paper last year, in December 2021, and waited until I got enough $$$ to buy the Lenovo with the AMD Ryzen 9 5900HX in May 2022, with fast 16 threads, to do comparisons with. I am glad I waited, because you can clearly see the difference in performance using more threads.

Hopefully when people read the paper and see how simple the Crystal code is (though I wrote all the versions to look structurally the same) they’ll consider using it to at least prototype their applications with, and then translate it to another language they think will perform better. I also hope people will create versions in other languages. The more the merrier.

And again, I’d really like people to post times for the benchmark examples for other hardware, especially Apple M1s, etc.

4 Likes

Nice!

Just ran it on my M1 Max with 8 threads (it has 10 cores, but 2 are underclocked to save battery on low-demand workloads) and it took 25m9s with an input of 60000000000000.

➜  Code time CRYSTAL_WORKERS=8 ./twin_ssoz 60000000000000
threads = 10
using Prime Generator parameters for P17
segment size = 1572864 resgroups for seg bitarray
twinprime candidates = 2617970280750; resgroups = 117529530
each of 22275 threads has nextp[2 x 523764] array
setup time = 0.030346 secs
perform twinprimes ssoz sieve
22275 of 22275 twinpairs done
sieve time = 1508.82154 secs
total time = 1508.851886 secs
last segment = 1137594 resgroups; segment slices = 75
total twins = 84209699420; last twin = 59999999999868+/-1
CRYSTAL_WORKERS=8 ./twin_ssoz 60000000000000  11406.56s user 28.68s system 757% cpu 25:08.97 total
``

What the heck are the specs on that M1?
I ran it on my Lenovo AMD 5900HX with 16 threads cpu and it took 1791.5 secs (29m51s).

Which one of these systems do you have?

Can you post times for all the benchmark values in the paper for an apples-to-apples :wink: comparison.

Yeah, the performance is absurd. It’s the M1 Max with 32GB RAM, but the entire M1 line has the same per-core performance.

Apple doesn’t advertise clock speed but Geekbench says the clock speed for the performance cores is 3.2GHz. But not all the cores run at that speed — 2 of them are underclocked to (I think) 2.0GHz and when the CPU isn’t under load the kernel schedules processes to them to conserve power.

Sure, I’ll give it a shot.

@jzakiya thanks for the work and write up. Very interesting!

Can you please upload your pdf to some other service besides academia.edu? The experience they are providing me is vastly inferior to others I’m used to like arxiv.org or research gate!

1 Like

Can you describe the problem?
When you get to the page just scroll down to read the paper, and/or download it.

I also have my papers on https://vixra.org/, and will upload this one there too, probably later today.

But try Academia.edu again, because it’s been viewed/downloaded over 100x now and yours is the first complaint I’ve received about getting access to the paper. If you’d like, I can also email it to you.

It asked me to create an account so I never downloaded the paper.

At the bottom of the home page there is a arrow that says scroll down to read paper.
So just do that and you can see the paper, and download it for FREE.

I was on mobile, I could not see the images, the browser kept freezing, I would scroll a page, wait a minute for an add to load, scroll half a page, the browser would be unresponsive to all touch events for 30s, then another add would load, then I’d scroll another half page.

I created an account but was told I need to upgrade from the free account to download the “Free PDF”.

Now I’m back at my laptop and having a much better browser experience.

Oh, to be clear I’m using a fairly new Galaxy phone, had a nice 5G connection that was responsive on other websites, and was on Chrome.

Ultimately it was the ‘pay to download a pdf you are looking at right now’ thing that pushed me over the edge. I’m glad to hear others are having a reasonable experience though. I was ready to write off academia.edu as a scam.

Can you share this paper directly here? i could not download it, because i can’t register it use credit card.


EDIT: Oops, i found the download link, thank you!

Here are the links to the paper on vixra.org.
Actually, the pdf looks better here than on academia.edu.

1 Like

I found over the weekend a few places in the code giving me arithmetic overflow errors on certain inputs. I’ve updated the code to eliminate the problems, but I don’t see why they existed (I thought I accounted for their possibilities, but guess I didn’t).

Here are the gists again to the updated code.
twinprimes_ssoz
cousinprimes_ssoz

Here’s the nature of the problem.
I was testing for values at the upper 64-bit range, (2^64 -1 = 18446744073709551615)
When running the code with these inputs it gives the correct results (need 16GB of mem).

CRYSTAL_WORKERS=16 ./twinprimes_ssoz 18436744073709551615 18436744073709000000
threads = 16
using Prime Generator parameters for P5
segment size = 18388 resgroups for seg bitarray
twinprime candidates = 55167; resgroups = 18389
each of 3 threads has nextp[2 x 331650] array
setup time = 14.445965 secs
perform twinprimes ssoz sieve
3 of 3 twinpairs done
sieve time = 0.00788 secs
total time = 14.453845 secs
last segment = 18389 resgroups; segment slices = 1
total twins = 395; last twin = 18436744073709550662+/-1

But when I used values at the absolute upper limit I got arithmetic overflow errors.

CRYSTAL_WORKERS=16 ./twinprimes_ssoz 18446744073709551615 18446744073709000000
threads = 16
using Prime Generator parameters for P5
segment size = 18388 resgroups for seg bitarray
twinprime candidates = 55164; resgroups = 18388
prime =  555829Unhandled exception: Arithmetic overflow (OverflowError)
  from crystal/share/crystal/src/kernel.cr:25:1 in 'twinprimes_ssoz'
  from twinprimes_ssozgist.cr:284:1 in '__crystal_main'
  from crystal/share/crystal/src/crystal/main.cr:115:5 in 'main'
  from /lib64/libc.so.6 in '??'
  from /lib64/libc.so.6 in '__libc_start_main'
  from ../sysdeps/x86_64/start.S:117 in '_start'
  from ???

That error message was of no help, but I eventually placed some puts lines on certain values in places and figured it out.

The first place causing an error was in the routine sozpg, which generates sieving primes.

n, rem = start_num.divmod prime 
primes << prime if (res_0 <= prime <= val) && (prime &* (n + 1) <= end_num || rem == 0)
                                                     ^

Adding the & eliminated that overflow, but I don’t see why it was occurring, because both prime and end_num are u64 values.

The other overflow came in twins_sieve, where again, adding &s fixed the overflows.

ki += 1    if ((ki &* modpg) &+ r_hi - 2)  < start_num
k_max -= 1 if ((k_max - 1) &* modpg &+ r_hi) > end_num

Again ki, k_max, start|end_num are all u64s.

1: Can someone explain why these overflows occur.
2: Will 1.5.0 fix having to explicitly put these & operations in the code?

After the fixes the code runs correctly for these inputs.

CRYSTAL_WORKERS=16 ./twinprimes_ssoz 18446744073709551615 18446744073709000000
threads = 16
using Prime Generator parameters for P5
segment size = 18388 resgroups for seg bitarray
twinprime candidates = 55164; resgroups = 18388
each of 3 threads has nextp[2 x 203280218] array
setup time = 18.342097 secs
perform twinprimes ssoz sieve
3 of 3 twinpairs done
sieve time = 5.344191 secs
total time = 23.686288 secs
last segment = 18388 resgroups; segment slices = 1
total twins = 361; last twin = 18446744073709550772+/-1

FYI: 18446744073709550772+/-1 is the largest twin prime within 2^64.

If I’m understanding your code correctly, here we have some prime less than sqrt(end_num) (because we’re assuming that the first conditional passes in order for us to get the overflow). We then get start_num.divmod prime, which gives us n and rem such that prime * n + rem == start_num (which is just what divmod does). Then you’re doing prime * (n + 1), which is start_num - rem + prime. So all that’s needed to overflow is for start_num - rem + prime to be greater than UInt64::MAX. Since the value of rem depends on the value of prime, let’s focus on potentially problematic values of prime. Any prime in this context that might overflow UInt64 has to conform to (UInt64::MAX - start_num) < prime <= sqrt(end_num). For end_num == 18446744073709551615 and start_num == 18446744073709000000, a quick automated search starting at UInt64::MAX - start_num finds that prime == 555829 (with start_num % prime == 1742) does cause the overflow: 18446744073709000000 + 555829 - 1742 > UInt64::MAX.

I’m also not sure your &* does what you want it to. It seems like you’re just avoiding the overflow exception by introducing what is likely an error into your code. Since you’re using UInt64 everywhere, end_num must be less than or equal to UInt64::MAX, so any overflow in prime * (n + 1) <= end_num is logically violating the intended check. As I mentioned above, prime * (n + 1) here is equivalent to start_num - rem + prime, so it should be safe to change your check to start_num <= end_num - prime + rem. Since prime should always be positive and rem < prime, end_num - prime + rem is less than end_num and therefore less than UInt64::MAX.

I’m not going to try and follow the logic and constraints related to your second overflow location due to the difficulty I’m having following your code, but I hope this is a satisfactory explanation of the behavior you’re seeing.

2 Likes

@RespiteSage thank you, Thank You, THANK YOU! :smiley:

One reason I was so perplexed by this error is because every other language implementation (including Rust) didn’t flag it. But once you gave your analysis, it was like a lightbulb went off in my head, and I realized all the other implementations were silently failing too, as you suggested.

This explains why for those inputs the routine sozpg was returning all the primes up to the sqrt(UInt64:MAX) when most should have been filtered out. Following your analysis here’s all I had to do to fix it.

from
primes << prime if (res_0 <= prime <= val) && (prime &* (n + 1) <= end_num || rem == 0)
to
primes << prime if (res_0 <= prime <= val) && (prime * n <= end_num - prime || rem == 0)

And here’s the difference in the outputs (not run under optimum conditions).
Notice the decrease in the sieving primes used, and reduced residue sieve time.

CRYSTAL_WORKERS=16 ./twinprimes_ssoz 18446744073709551615 18446744073709000000
threads = 16
using Prime Generator parameters for P5
segment size = 18388 resgroups for seg bitarray
twinprime candidates = 55164; resgroups = 18388
each of 3 threads has nextp[2 x 203280218] array # max number of sieve primes
setup time = 18.539833 secs
perform twinprimes ssoz sieve
3 of 3 twinpairs done
sieve time = 6.765588 secs
total time = 25.305421 secs
last segment = 18388 resgroups; segment slices = 1
total twins = 361; last twin = 18446744073709550772+/-1

CRYSTAL_WORKERS=16 ./twinprimes_ssoz 18446744073709551615 18446744073709000000
threads = 16
using Prime Generator parameters for P5
segment size = 18388 resgroups for seg bitarray
twinprime candidates = 55164; resgroups = 18388
each of 3 threads has nextp[2 x 330664] array # correct filtered number of sieve primes
setup time = 17.112455 secs
perform twinprimes ssoz sieve
3 of 3 twinpairs done
sieve time = 0.008516 secs
total time = 17.120971 secs
last segment = 18388 resgroups; segment slices = 1
total twins = 361; last twin = 18446744073709550772+/-1

So now I’ll go back and change all the coded versions, and put out an updated paper to reflect it.

This forum and language are the best!! :star_struck:

3 Likes