Is there a way to optimize this program?

I saw a Twitter post comparing the speed of a function implemented in Java and Go. (x.com)

The post says:

now the main question is why the Java version is faster than the GO one
Java took 2.1 min and GO took 3.26 min

I wanted to implement it in Crystal just to see how it goes:

def collatz(seed : UInt32)
  steps = 0_u32
  while seed > 1
    while seed%2 == 0
      steps += 1
      seed /= 2
    end
    if seed > 1
      steps +=1
      seed = seed*3 + 1
    end
  end
  return steps
end

seed = 1_u32
start = Time.measure do
  while seed != 1_000_000_000
    collatzResult = collatz(seed)
    if seed%1_000_000 == 0
      puts "Seed #{seed} Steps: #{collatzResult}"
    end
  seed += 1
  end
end
puts "collatz took: #{start}"

(Probably not a very good implementation due to my lack of Crystal knowledge)

I compiled it using crystal build --release --no-debug --mcpu native test.cr and then I ran the binary.
It took 00:08:45.836527390 to complete.
This is way too much execution time, and not even close to the time the Go code (from the Twitter post) took on my machine, which was 3m52.714522569s. I'm not going to stop using Crystal because of this; I'm just wondering why it takes soooo long compared to the Go version.

This is the Go code:

package main

import (
  "fmt"
  "time"
)

func collatz(seed int) int {
  var steps = 0
  for seed > 1 {
    for seed%2 == 0 {
      steps++
      seed /= 2
    }
    if seed > 1 {
      steps++
      seed = seed*3 + 1
    }
  }
  return steps
}

func main() {
  start := time.Now()
  for seed := 1; seed <= 1_000_000_000; seed++ {
    collatzResult := collatz(seed)
    if seed%1_000_000 == 0 {
      fmt.Printf("Seed: %d Steps %d\r", seed, collatzResult)
    }
  }
  fmt.Printf("%s took %v\n", "collatz", time.Since(start))
}

Part of the problem, I think, is that seed /= 2 is not the same between Go/Java and Crystal. E.g. in Go/Java 1 / 2 is 0, while in Crystal it's 0.5. For this kind of floor division you need to use //. Also, from what I read, the int type in Go is 64 bits on 64-bit machines, whereas the Crystal code uses UInt32.
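To illustrate just the division behavior (not the benchmark code itself):

7 / 2  # => 3.5 (Float64; Crystal's Int#/ always returns a float)
7 // 2 # => 3   (Int32; floor division, like Go/Java integer /)

So in the original version seed gets reassigned to a Float64 inside #collatz, which means the inner loop ends up doing floating-point math on a union-typed variable; that likely accounts for a big chunk of the slowdown.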

I lowered the initial max seed to 25 million and it took 15.38s.

I then changed the division line to seed //= 2 and changed the integer type from UInt32 to Int64, and that brought it down to around 8.1s.

The next catch is that Crystal always wants to give you a "correct" result. So if you were to do something like:

foo = Int64::MAX
foo += 4

You get Unhandled exception: Arithmetic overflow (OverflowError), whereas doing this in Go:

var foo int64 = math.MaxInt64
foo = foo + int64(4)

You get -9223372036854775805.

You can get the same value Go does by using the wrapping addition operator (&+) as per Operators - Crystal. E.g.

foo = Int64::MAX
foo &+= 4

So knowing this, after updating all of the arithmetic to use the wrapping operators, the final result I got was around 5.2s, which is a bit better than Go, which was around 6.3s.
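For reference, the collatz method with those changes applied looks roughly like this (a sketch; the exact code I benchmarked may have differed slightly):

def collatz(seed : Int64) : Int64
  steps = 0_i64
  while seed > 1
    while seed % 2 == 0
      steps &+= 1
      seed //= 2 # floor division instead of Float64 division
    end
    if seed > 1
      steps &+= 1
      seed = seed &* 3 &+ 1 # wrapping multiply/add, matching Go's overflow behavior
    end
  end
  steps
end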

EDIT: For the full billion seeds, my updated Crystal code took 00:04:25.825707214, whereas the Go code took 5m13.056021727s.


On my local machine, the Crystal version is slower than the Go one.

Go version:

collatz took 3m12.440701372s

real    3m12.443s
user    3m12.582s
sys     0m0.040s

Crystal version:

collatz took: 00:04:10.769206254

Considering that comparing puts to fmt.Printf may be unfair, I removed the puts call in the Crystal version, but it is still slower than the Go version.

 ╰─ $ ./1
collatz took: 00:04:04.742989327

Any idea?

def collatz(seed : Int64)
  steps = 0_i64

  while seed > 1
    while seed % 2 == 0
      steps &+= 1
      seed //= 2
    end
    if seed > 1
      steps &+= 1
      seed = seed*3 + 1
    end
  end

  steps
end

seed = 1_i64
start = Time.measure do
  while seed != 1_000_000_000
    collatzResult = collatz(seed)
    if seed % 1_000_000 == 0
      collatzResult
    end
    seed += 1
  end
end
puts "collatz took: #{start}"

You missed using the wrapping operators in the seed = seed * 3 + 1 line.

You mean this?

def collatz(seed : Int64)
  steps = 0_i64

  while seed > 1
    while seed % 2 == 0
      steps &+= 1
      seed //= 2
    end
    if seed > 1
      steps &+= 1
      seed = seed * 3 &+ 1
    end
  end

  steps
end

seed = 1_i64
start = Time.measure do
  while seed != 1_000_000_000
    collatzResult = collatz(seed)
    if seed % 1_000_000 == 0
      collatzResult
    end
    seed &+= 1
  end
end
puts "collatz took: #{start}"

Still slower than the Go version.

 ╰─ $ ./1
collatz took: 00:04:02.554176834

BTW: if I change seed = seed * 3 &+ 1 to seed = (seed &* 3) &+ 1, the code seems to not do any work and completes almost immediately.

 ╰─ $ ./1
collatz took: 00:00:00.000000040

Maybe LLVM can just totally optimize everything away if all the math uses wrapping operators? :person_shrugging: In my example I didn't make the seed += 1 in the while loop wrapping, only the stuff within #collatz.

EDIT: Yeah, I bet that's what is happening. If you add back in puts "Seed #{seed} Steps: #{collatzResult}", it runs as expected.
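If you want to keep puts out of the timing but still stop the work from being treated as dead code, one option (just a sketch, reusing the Int64 #collatz from above) is to accumulate the step counts and print the total once:

total = 0_i64
seed = 1_i64
elapsed = Time.measure do
  while seed != 1_000_000_000
    total &+= collatz(seed) # consuming the result should keep the loop from being eliminated
    seed += 1
  end
end
puts "total steps: #{total}"
puts "collatz took: #{elapsed}"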

Same as before.

Although, I forgot to say, I built the binary with crystal build --release 1.cr by default. After trying crystal build --release --no-debug --mcpu native 1.cr, the time went from 00:04:06.331007188 down to 00:03:25.761518151, which is already very close to the Go version's 3m12.440701372s.

What does mcpu stand for? What is the default value for this option? Should I always set it to native?

BTW: I use an AMD CPU; this is probably the reason the Crystal version is slightly slower than the Go version.

Adding puts back or adding the --no-debug option doesn't have much impact.

But --mcpu native clearly improves performance; I really want to know what the downside of setting it as the default would be.

The downside is that if you compile the program on a CPU which supports, say, AVX-512, the resulting binary won't work on a CPU without AVX-512.


Very little of the standard library is automatically vectorizable without explicit SIMD. We are quite far from that.

Thanks for sharing. I added --mcpu native as the default for my crystal build and tried building a Lucky app on my local AMD 7840HS laptop, which generated a slightly smaller binary, but when I copied it to my remote (Intel) VPS and ran it, I got Illegal instruction (core dumped), haha.

You could still use the flag, but give it the CPU of the machine it’ll run on instead of native.
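For example (the CPU name here is purely illustrative): crystal build --release --mcpu skylake app.cr builds for a Skylake-class target even if you compile on a different CPU.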


I asked about mcpu and such a while ago with some of the other core team members, so I'm just throwing down some of my notes here for posterity.

  • Can get a list of valid LLVM targets via clang -print-targets.
  • You can then use one of those architectures with llc, for example llc -march=x86-64 -mattr=help.
    • You can then use this list to see the valid options you can pass to --mcpu or --mattr. E.g. crystal build --mcpu=raptorlake --mattr=+feature1,-feature2 to build for raptor lake, enabling feature1 while disabling feature2
  • Can use clang -E -v - </dev/null to see what target options and such you get by default
  • Can use clang -march=native -E -v - </dev/null to see what they are for your specific CPU
  • Target-features, target-cpu, LLVM, Clang and general confusion - LLVM Project - LLVM Discussion Forums also provides some pointers

It looks like the situation is a bit more complicated when using zig cc to generate a static binary. E.g. my remote VPS uses an Intel® Xeon® Processor E5-2637 v4, which should be mcpu=broadwell, but building with shards build --target=x86_64-linux-musl --static --no-debug --link-flags=-s --release and then linking with zig cc still doesn't work. :joy:

One more thing: I recompiled my Crystal compiler with --mcpu=native enabled, and it works! It seems to give better performance when running crystal build.

I had to change the Makefile manually to work around this.

Should we add a new FLAGS option to the Makefile so that enabling --mcpu=native directly is possible?


EDIT: created a new issue for this: Add Makefile support `--mcpu=native` as override FLAGS to permit build crystal compiler can enable this option optional for a better performance. · Issue #14744 · crystal-lang/crystal · GitHub

Slightly modified Go code to keep the final step count visible.

// collatz.go
package main

import (
  "fmt"
  "time"
)

func collatz(seed int) int {
  var steps = 0
  for seed > 1 {
    for seed%2 == 0 {
      steps++
      seed /= 2
    }
    if seed > 1 {
      steps++
      seed = seed*3 + 1
    }
  }
  return steps
}

func main() {
  start := time.Now()
  for seed := 1; seed <= 1_000_000_000; seed++ {
    collatzResult := collatz(seed)
    if seed%1_000_000 == 0 {
      fmt.Printf("Seed: %d Steps %d\r", seed, collatzResult)
    }
  }
  fmt.Printf("\n%s took %v", "collatz", time.Since(start))
}

My slightly modified Crystal code to produce the same results/display as Go.

Notice steps|seed are UInt64.
Replaced while with until to produce the same loop results as Go.
No explicit modulo (%) use.
seed //= 2 is just seed >>= 1
seed % 2 == 0 is just seed.even?
seed % 1_000_000 == 0 is seed.divisible_by? 1_000_000
until seed > 1_000_000_000 for the same results as Go

# collatz1.cr
def collatz(seed : UInt64)
  steps = 0_u64
  while seed > 1
    while seed.even?
      steps += 1
      seed >>= 1
    end
    if seed > 1
      steps += 1
      seed = seed*3 + 1
    end
  end
  return steps
end

seed = 1_u64
start = Time.measure do
  until seed > 1_000_000_000
    collatzResult = collatz(seed)
    if seed.divisible_by? 1_000_000
      print "Seed #{seed} Steps: #{collatzResult}\r"
    end
    seed += 1
  end
end
puts "\ncollatz took: #{start}"

My system:
Lenovo Legion Slim 5
AMD Ryzen 7 8845HS, 8C|16T, 3.8GHz - 5.1GHz
For this application, single cores ran at 4.96-5.01GHz

Using Go 1.22.5

# Compiled as: go build collatz.go - renamed binary to collatz_go

➜  ~ ./collatz_go
Seed: 1000000000 Steps 100
collatz took 3m1.472309147s% 

Using Crystal 1.12.2

# Compiled as: crystal build --release collatz1.cr - renamed to collatz1_cr

➜  ~ ./collatz1_cr
Seed 1000000000 Steps: 100
collatz took: 00:02:43.841266775

We can further optimize the code by understanding conceptually what the algorithm is doing, and then using optimized instructions to perform it.

For the loop below, seed's trailing zeros are stripped out to make it an odd number: 1100011000 -> 1100011. Here 3 trailing zeros are removed, steps is increased by that amount, and seed is shifted right 3 times.

while seed.even?
  steps += 1
  seed >>= 1
end

We can optimize that in Crystal by doing this.

if seed.even?
  tz_count = seed.trailing_zeros_count
  steps += tz_count
  seed >>= tz_count
end

But we don't really need the if seed.even? check, because if seed is odd then tz_count = 0 and steps|seed are unchanged (e.g. 5 is 0b101, so 5.trailing_zeros_count is 0 and seed >>= 0 is a no-op).

So all we need to do is:

while seed > 1
  tz_count = seed.trailing_zeros_count
  steps += tz_count
  seed >>= tz_count
  if seed > 1
    ....

So if we make collatz2.cr:

def collatz(seed : UInt64)
  steps = 0_u64
  while seed > 1
    if seed.even?
      tz_count = seed.trailing_zeros_count
      steps += tz_count
      seed >>= tz_count
    end
    if seed > 1
      steps += 1
      seed = seed*3 + 1
    end
  end
  return steps
end

seed = 1_u64
start = Time.measure do
  until seed > 1_000_000_000
    collatzResult = collatz(seed)
    if seed.divisible_by? 1_000_000
      print "Seed #{seed} Steps: #{collatzResult}\r"
    end
    seed += 1
  end
end
puts "\ncollatz took: #{start}"

And make collatz3.cr:

def collatz(seed : UInt64)
  steps = 0_u64
  while seed > 1
    tz_count = seed.trailing_zeros_count
    steps += tz_count
    seed >>= tz_count
    if seed > 1
      steps += 1
      seed = seed*3 + 1
    end
  end
  return steps
end

seed = 1_u64
start = Time.measure do
  until seed > 1_000_000_000
    collatzResult = collatz(seed)
    if seed.divisible_by? 1_000_000
      print "Seed #{seed} Steps: #{collatzResult}\r"
    end
    seed += 1
  end
end
puts "\ncollatz took: #{start}"

We now get the following reduced times:

➜  ~ ./collatz2_cr
Seed 1000000000 Steps: 100
collatz took: 00:02:17.671643941

➜  ~ ./collatz3_cr
Seed 1000000000 Steps: 100
collatz took: 00:02:02.146402263

This just goes to show that to produce optimized code when going from one language to another (especially a compiled language) you may have to do more than a mere instruction-to-instruction translation.

If you instead conceptually understand what the algorithm|code is doing, you can then use the full instruction-set power of the language to optimize the conceptual operation of the code.

It would be interesting to see if similar optimizations can be done in Go.


Apparently the compile option --mcpu native makes a big difference, at least on my AMD Ryzen 7 8845HS system.

Compiling as crystal build --release --mcpu native collatz(1|2|3).cr produces these faster times.

➜  ~ ./collatz1_cr_mcpu
Seed 1000000000 Steps: 100
collatz took: 00:02:40.652895810
➜  ~
➜  ~ ./collatz2_cr_mcpu
Seed 1000000000 Steps: 100
collatz took: 00:02:00.446539529
➜  ~
➜  ~ ./collatz3_cr_mcpu
Seed 1000000000 Steps: 100
collatz took: 00:01:49.747841400

So for this code, compiling with --mcpu native improves performance.

This option should be discussed and covered in the Crystal documentation, as I only learned about it in this thread.

It will allow Crystal to compare better to languages like Rust, C|C++, etc.


I think so, too. In fact, I built my Crystal compiler with --mcpu native and have been using it on my local laptop (AMD 7840HS) for a while; it works like a charm.

I tried adding a new flag to the Makefile to make it easier for people to use and to draw attention to it, but it was rejected.

I no longer have an Intel based system, so I can't run tests on one. But it would be interesting to see how well compiling with --mcpu native works on other systems, e.g. Intel, M1|2, Raspberry Pi, etc.

The (stripped) binaries on my AMD Ryzen 7 8845HS system differ slightly in size between the two builds, but the sizes are the same across each version of collatz(1|2|3).cr:

--release               : 430472 bytes
--release --mcpu native : 446856 bytes

So apparently --mcpu native lets LLVM use CPU features and instruction scheduling that a plain --release build doesn't emit on its own.
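If you want to see what actually changes, Crystal can emit the generated assembly for each build so you can diff the two (just a suggestion):

crystal build --release --emit asm collatz3.cr
crystal build --release --mcpu native --emit asm collatz3.cr

Each run produces a .s file you can compare, so rename the first one before running the second.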

I don’t know if this is something LLVM should automatically handle, or should it be left to users.

But Crystal's Performance documentation needs to discuss this, because as shown here, it can provide a significant performance increase for the same source code.

I think it's probably always a good choice to use --mcpu native if you don't need to run the binary on a machine other than the one it was compiled on.

Wow, I didn’t know using the --mcpu native option could speed things up.
That is helpful. I’m using a Ryzen PC too.