I think something definitely changed, or there’s something strange going on…
I’m reading the new Crystal book, and there’s a benchmark for a selection sort algorithm in different languages: Crystal-Programming/Chapter01/selectionSort at main · PacktPublishing/Crystal-Programming · GitHub
On my machine the Go version is twice as fast as Crystal.
Here’s the Crystal code:
def selection_sort(arr)
  # For each element index...
  arr.each_index do |i|
    # Find the smallest element after it
    min = (i...arr.size).min_by { |j| arr[j] }
    # Swap positions with the smallest element
    arr[i], arr[min] = arr[min], arr[i]
  end
end
# Produce a reversed list of 30k elements
list = (1..30000).to_a.reverse
# Sort it and then print its head and tail
selection_sort(list)
p list[0...10]
p list[-10..-1]
I said “okay, maybe let’s just copy the code from Go, and remove overflow checks and bound checks”. Here’s the resulting code:
def selection_sort(arr)
  n = arr.size
  i = 0
  while i < n
    min = i
    j = i &+ 1
    while j < n
      if arr.to_unsafe[j] < arr.to_unsafe[min]
        min = j
      end
      j &+= 1
    end
    arr.to_unsafe[i], arr.to_unsafe[min] = arr.to_unsafe[min], arr.to_unsafe[i]
    i &+= 1
  end
end
# Produce a reversed list of 30k elements
list = (1..30000).to_a.reverse
# Sort it and then print its head and tail
puts Time.measure {
  selection_sort(list)
}
p list[0...10]
p list[-10..-1]
But, it turns out, this is slower than the original code (the first one runs in 0.74s, the second one in 0.86s.)
I even went full pointers:
def selection_sort(arr)
  n = arr.size
  ptr = arr.to_unsafe
  i = 0
  while i < n
    min = i
    j = i &+ 1
    while j < n
      if ptr[j] < ptr[min]
        min = j
      end
      j &+= 1
    end
    tmp = ptr[i]
    ptr[i] = ptr[min]
    ptr[min] = tmp
    i &+= 1
  end
end
# Produce a reversed list of 30k elements
list = (1..30000).to_a.reverse
slice = Slice.new(list.to_unsafe, list.size)
# Sort it and then print its head and tail
puts Time.measure {
  selection_sort(slice)
}
p list[0...10]
p list[-10..-1]
It still takes 0.86s for me.
Note: I’m using LLVM 13. Maybe with LLVM 14 this is faster?
I tried putting a @[NoInline] annotation to selection_sort to see the generated LLVM code. Here it is:
define void @"*selection_sort<Slice(Int32)>:Nil"(%"Slice(Int32)" %arr) local_unnamed_addr #24 !dbg !29895 {
alloca:
  %arr.fca.0.extract = extractvalue %"Slice(Int32)" %arr, 0, !dbg !29896
  %arr.fca.2.extract = extractvalue %"Slice(Int32)" %arr, 2, !dbg !29896
  %0 = icmp sgt i32 %arr.fca.0.extract, 0, !dbg !29897
  br i1 %0, label %while2.preheader.preheader, label %fail, !dbg !29897
while2.preheader.preheader:                       ; preds = %alloca
  %1 = zext i32 %arr.fca.0.extract to i64, !dbg !29898
  %wide.trip.count = zext i32 %arr.fca.0.extract to i64, !dbg !29897
  br label %while2.preheader, !dbg !29898
while2.preheader:                                 ; preds = %while2.preheader.preheader, %fail4
  %indvars.iv12 = phi i64 [ 0, %while2.preheader.preheader ], [ %indvars.iv.next13, %fail4 ]
  %indvars.iv = phi i64 [ 1, %while2.preheader.preheader ], [ %indvars.iv.next, %fail4 ]
  %indvars.iv.next13 = add nuw nsw i64 %indvars.iv12, 1, !dbg !29899
  %2 = icmp ult i64 %indvars.iv.next13, %1, !dbg !29898
  %3 = trunc i64 %indvars.iv12 to i32, !dbg !29898
  br i1 %2, label %body3.preheader, label %fail4, !dbg !29898
body3.preheader:                                  ; preds = %while2.preheader
  br label %body3, !dbg !29898
fail:                                             ; preds = %fail4, %alloca
  ret void, !dbg !29900
body3:                                            ; preds = %body3.preheader, %body3
  %indvars.iv10 = phi i64 [ %indvars.iv.next11, %body3 ], [ %indvars.iv, %body3.preheader ]
  %min.06 = phi i32 [ %j.0.min.0, %body3 ], [ %3, %body3.preheader ]
  %scevgep = getelementptr i32, i32* %arr.fca.2.extract, i64 %indvars.iv10, !dbg !29901
  %4 = load i32, i32* %scevgep, align 4, !dbg !29901
  %5 = sext i32 %min.06 to i64, !dbg !29903
  %6 = getelementptr inbounds i32, i32* %arr.fca.2.extract, i64 %5, !dbg !29903
  %7 = load i32, i32* %6, align 4, !dbg !29906
  %8 = icmp slt i32 %4, %7, !dbg !29907
  %tmp = trunc i64 %indvars.iv10 to i32
  %j.0.min.0 = select i1 %8, i32 %tmp, i32 %min.06, !dbg !29908
  %indvars.iv.next11 = add nuw nsw i64 %indvars.iv10, 1, !dbg !29909
  %tmp15 = trunc i64 %indvars.iv.next11 to i32
  %tmp16 = trunc i64 %wide.trip.count to i32
  %exitcond.not = icmp eq i32 %tmp16, %tmp15, !dbg !29898
  br i1 %exitcond.not, label %fail4, label %body3, !dbg !29898
fail4:                                            ; preds = %body3, %while2.preheader
  %min.0.lcssa = phi i32 [ %3, %while2.preheader ], [ %j.0.min.0, %body3 ], !dbg !29909
  %9 = getelementptr inbounds i32, i32* %arr.fca.2.extract, i64 %indvars.iv12, !dbg !29910
  %10 = load i32, i32* %9, align 4, !dbg !29913
  %11 = sext i32 %min.0.lcssa to i64, !dbg !29914
  %12 = getelementptr inbounds i32, i32* %arr.fca.2.extract, i64 %11, !dbg !29914
  %13 = load i32, i32* %12, align 4, !dbg !29917
  store i32 %13, i32* %9, align 4, !dbg !29918
  store i32 %10, i32* %12, align 4, !dbg !29920
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !29897
  %exitcond14.not = icmp eq i64 %indvars.iv.next13, %wide.trip.count, !dbg !29897
  br i1 %exitcond14.not, label %fail, label %while2.preheader, !dbg !29897
}
I have no idea what LLVM did. What are all those indvars? Why are there a bunch of i64 stuff? (I think this is related to pointers… is this what’s making things slower?)
Anyway… it’s not a huge deal, but I feel that either something changed over time, or that there’s still a lot of room for improvement.
Also: overflow detection might not be impacting performance, because it seems that’s even faster than not doing that (  )
 )