I think something definitely changed, or there’s something strange going on…
I’m reading the new Crystal book, and there’s a benchmark for a selection sort algorithm in different languages: Crystal-Programming/Chapter01/selectionSort at main · PacktPublishing/Crystal-Programming · GitHub
On my machine the Go version is twice as fast as Crystal.
Here’s the Crystal code:
def selection_sort(arr)
  # For each element index...
  arr.each_index do |i|
    # Find the smallest element after it
    min = (i...arr.size).min_by { |j| arr[j] }
    # Swap positions with the smallest element
    arr[i], arr[min] = arr[min], arr[i]
  end
end

# Produce a reversed list of 30k elements
list = (1..30000).to_a.reverse
# Sort it and then print its head and tail
selection_sort(list)
p list[0...10]
p list[-10..-1]
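(The timings below come from wrapping the call in Time.measure, same as in the later snippets; for this first version that looks roughly like this — assuming a --release build:)

# Timing harness for the version above (my assumption: built with
#   crystal build --release selection_sort.cr)
list = (1..30000).to_a.reverse
puts Time.measure {
  selection_sort(list)
}
p list[0...10]
p list[-10..-1]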
I said “okay, maybe let’s just copy the code from Go, and remove the overflow checks and bounds checks”. Here’s the resulting code:
def selection_sort(arr)
  n = arr.size
  i = 0
  while i < n
    min = i
    j = i &+ 1
    while j < n
      if arr.to_unsafe[j] < arr.to_unsafe[min]
        min = j
      end
      j &+= 1
    end
    arr.to_unsafe[i], arr.to_unsafe[min] = arr.to_unsafe[min], arr.to_unsafe[i]
    i &+= 1
  end
end

# Produce a reversed list of 30k elements
list = (1..30000).to_a.reverse
# Sort it and then print its head and tail
puts Time.measure {
  selection_sort(list)
}
p list[0...10]
p list[-10..-1]
But, it turns out, this is slower than the original code (the first one runs in 0.74s, the second one in 0.86s).
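(In between the two approaches there’s also Indexable#unsafe_fetch, which skips only the bounds check without dropping to raw pointers — a sketch I haven’t timed:)

def selection_sort(arr)
  n = arr.size
  i = 0
  while i < n
    min = i
    j = i &+ 1
    while j < n
      # unsafe_fetch reads without a bounds check, but stays on the Array
      if arr.unsafe_fetch(j) < arr.unsafe_fetch(min)
        min = j
      end
      j &+= 1
    end
    # the swap keeps the bounds-checked []= — it only runs once per outer pass
    arr[i], arr[min] = arr[min], arr[i]
    i &+= 1
  end
end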
I even went full pointers:
def selection_sort(arr)
  n = arr.size
  ptr = arr.to_unsafe
  i = 0
  while i < n
    min = i
    j = i &+ 1
    while j < n
      if ptr[j] < ptr[min]
        min = j
      end
      j &+= 1
    end
    tmp = ptr[i]
    ptr[i] = ptr[min]
    ptr[min] = tmp
    i &+= 1
  end
end

# Produce a reversed list of 30k elements
list = (1..30000).to_a.reverse
slice = Slice.new(list.to_unsafe, list.size)
# Sort it and then print its head and tail
puts Time.measure {
  selection_sort(slice)
}
p list[0...10]
p list[-10..-1]
It still takes 0.86s for me.
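(These are all single Time.measure runs; to rule out noise, the standard library’s Benchmark.ips could be used instead — an untested sketch, with the versions renamed to made-up names so they can coexist in one file, and a dup so every iteration sorts the reversed list again rather than an already-sorted one. It takes a while, since each iteration is close to a second.)

require "benchmark"

# selection_sort_idiomatic / selection_sort_unsafe are made-up names for the
# first and second versions above, renamed so both can be defined at once.
base = (1..30000).to_a.reverse

Benchmark.ips do |x|
  x.report("idiomatic") { selection_sort_idiomatic(base.dup) }
  x.report("unsafe")    { selection_sort_unsafe(base.dup) }
end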
Note: I’m using LLVM 13. Maybe with LLVM 14 this is faster?
I tried putting a @[NoInline] annotation on selection_sort to see the generated LLVM code. Here it is:
define void @"*selection_sort<Slice(Int32)>:Nil"(%"Slice(Int32)" %arr) local_unnamed_addr #24 !dbg !29895 {
alloca:
  %arr.fca.0.extract = extractvalue %"Slice(Int32)" %arr, 0, !dbg !29896
  %arr.fca.2.extract = extractvalue %"Slice(Int32)" %arr, 2, !dbg !29896
  %0 = icmp sgt i32 %arr.fca.0.extract, 0, !dbg !29897
  br i1 %0, label %while2.preheader.preheader, label %fail, !dbg !29897

while2.preheader.preheader: ; preds = %alloca
  %1 = zext i32 %arr.fca.0.extract to i64, !dbg !29898
  %wide.trip.count = zext i32 %arr.fca.0.extract to i64, !dbg !29897
  br label %while2.preheader, !dbg !29898

while2.preheader: ; preds = %while2.preheader.preheader, %fail4
  %indvars.iv12 = phi i64 [ 0, %while2.preheader.preheader ], [ %indvars.iv.next13, %fail4 ]
  %indvars.iv = phi i64 [ 1, %while2.preheader.preheader ], [ %indvars.iv.next, %fail4 ]
  %indvars.iv.next13 = add nuw nsw i64 %indvars.iv12, 1, !dbg !29899
  %2 = icmp ult i64 %indvars.iv.next13, %1, !dbg !29898
  %3 = trunc i64 %indvars.iv12 to i32, !dbg !29898
  br i1 %2, label %body3.preheader, label %fail4, !dbg !29898

body3.preheader: ; preds = %while2.preheader
  br label %body3, !dbg !29898

fail: ; preds = %fail4, %alloca
  ret void, !dbg !29900

body3: ; preds = %body3.preheader, %body3
  %indvars.iv10 = phi i64 [ %indvars.iv.next11, %body3 ], [ %indvars.iv, %body3.preheader ]
  %min.06 = phi i32 [ %j.0.min.0, %body3 ], [ %3, %body3.preheader ]
  %scevgep = getelementptr i32, i32* %arr.fca.2.extract, i64 %indvars.iv10, !dbg !29901
  %4 = load i32, i32* %scevgep, align 4, !dbg !29901
  %5 = sext i32 %min.06 to i64, !dbg !29903
  %6 = getelementptr inbounds i32, i32* %arr.fca.2.extract, i64 %5, !dbg !29903
  %7 = load i32, i32* %6, align 4, !dbg !29906
  %8 = icmp slt i32 %4, %7, !dbg !29907
  %tmp = trunc i64 %indvars.iv10 to i32
  %j.0.min.0 = select i1 %8, i32 %tmp, i32 %min.06, !dbg !29908
  %indvars.iv.next11 = add nuw nsw i64 %indvars.iv10, 1, !dbg !29909
  %tmp15 = trunc i64 %indvars.iv.next11 to i32
  %tmp16 = trunc i64 %wide.trip.count to i32
  %exitcond.not = icmp eq i32 %tmp16, %tmp15, !dbg !29898
  br i1 %exitcond.not, label %fail4, label %body3, !dbg !29898

fail4: ; preds = %body3, %while2.preheader
  %min.0.lcssa = phi i32 [ %3, %while2.preheader ], [ %j.0.min.0, %body3 ], !dbg !29909
  %9 = getelementptr inbounds i32, i32* %arr.fca.2.extract, i64 %indvars.iv12, !dbg !29910
  %10 = load i32, i32* %9, align 4, !dbg !29913
  %11 = sext i32 %min.0.lcssa to i64, !dbg !29914
  %12 = getelementptr inbounds i32, i32* %arr.fca.2.extract, i64 %11, !dbg !29914
  %13 = load i32, i32* %12, align 4, !dbg !29917
  store i32 %13, i32* %9, align 4, !dbg !29918
  store i32 %10, i32* %12, align 4, !dbg !29920
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !29897
  %exitcond14.not = icmp eq i64 %indvars.iv.next13, %wide.trip.count, !dbg !29897
  br i1 %exitcond14.not, label %fail, label %while2.preheader, !dbg !29897
}
I have no idea what LLVM did here. What are all those indvars? Why is there a bunch of i64 stuff? (I think that’s related to the pointers… is that what’s making things slower?)
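(If anyone wants to reproduce the dump: the annotation goes right above the def, and — if I remember the flag correctly — adding --emit llvm-ir to a --release build writes the optimized IR to a .ll file next to the binary. Rough sketch, with a made-up file name:)

# sort_demo.cr (made-up name) — only meant to show where the annotation goes.
# Build with:  crystal build --release --emit llvm-ir sort_demo.cr
# which should also produce sort_demo.ll with the optimized IR.
@[NoInline]
def selection_sort(arr)
  # ... same body as the pointer version above ...
end

list = (1..30000).to_a.reverse
selection_sort(Slice.new(list.to_unsafe, list.size))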
Anyway… it’s not a huge deal, but I feel that either something changed over time, or that there’s still a lot of room for improvement.
Also: overflow detection might not be what’s impacting performance here, because it seems to be even faster than not doing it ( )
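(One way to isolate that would be to take the pointer version and switch only the arithmetic back to checked operators — an untimed sketch, with a made-up name so it can sit next to the other versions:)

# Same as the pointer version above, but with overflow-checked arithmetic:
# only &+ / &+= change back to + / +=, the unsafe indexing stays.
# selection_sort_checked is a made-up name.
def selection_sort_checked(arr)
  n = arr.size
  ptr = arr.to_unsafe
  i = 0
  while i < n
    min = i
    j = i + 1
    while j < n
      if ptr[j] < ptr[min]
        min = j
      end
      j += 1
    end
    tmp = ptr[i]
    ptr[i] = ptr[min]
    ptr[min] = tmp
    i += 1
  end
end

list = (1..30000).to_a.reverse
puts Time.measure {
  selection_sort_checked(Slice.new(list.to_unsafe, list.size))
}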