I think something definitely changed, or there’s something strange going on…
I’m reading the new Crystal book, and there’s a benchmark for a selection sort algorithm in different languages: Crystal-Programming/Chapter01/selectionSort at main · PacktPublishing/Crystal-Programming · GitHub
On my machine the Go version is twice as fast as Crystal.
Here’s the Crystal code:
def selection_sort(arr)
  # For each element index...
  arr.each_index do |i|
    # Find the smallest element after it
    min = (i...arr.size).min_by { |j| arr[j] }
    # Swap positions with the smallest element
    arr[i], arr[min] = arr[min], arr[i]
  end
end

# Produce a reversed list of 30k elements
list = (1..30000).to_a.reverse
# Sort it and then print its head and tail
selection_sort(list)
p list[0...10]
p list[-10..-1]
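(The timings below come from wrapping the call in Time.measure, same as in the later snippets; for this first version that looks roughly like this — assuming a --release build:)

# Timing harness for the version above (my assumption: built with
#   crystal build --release selection_sort.cr)
list = (1..30000).to_a.reverse
puts Time.measure {
  selection_sort(list)
}
p list[0...10]
p list[-10..-1]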
I said “okay, maybe let’s just copy the code from Go, and remove the overflow checks and bounds checks”. Here’s the resulting code:
def selection_sort(arr)
  n = arr.size
  i = 0
  while i < n
    min = i
    j = i &+ 1
    while j < n
      if arr.to_unsafe[j] < arr.to_unsafe[min]
        min = j
      end
      j &+= 1
    end
    arr.to_unsafe[i], arr.to_unsafe[min] = arr.to_unsafe[min], arr.to_unsafe[i]
    i &+= 1
  end
end

# Produce a reversed list of 30k elements
list = (1..30000).to_a.reverse
# Sort it and then print its head and tail
puts Time.measure {
  selection_sort(list)
}
p list[0...10]
p list[-10..-1]
But, it turns out, this is slower than the original code (the first one runs in 0.74s, the second one in 0.86s).
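(In between the two approaches there’s also Indexable#unsafe_fetch, which skips only the bounds check without dropping to raw pointers — a sketch I haven’t timed:)

def selection_sort(arr)
  n = arr.size
  i = 0
  while i < n
    min = i
    j = i &+ 1
    while j < n
      # unsafe_fetch reads without a bounds check, but stays on the Array
      if arr.unsafe_fetch(j) < arr.unsafe_fetch(min)
        min = j
      end
      j &+= 1
    end
    # the swap keeps the bounds-checked []= — it only runs once per outer pass
    arr[i], arr[min] = arr[min], arr[i]
    i &+= 1
  end
end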
I even went full pointers:
def selection_sort(arr)
  n = arr.size
  ptr = arr.to_unsafe
  i = 0
  while i < n
    min = i
    j = i &+ 1
    while j < n
      if ptr[j] < ptr[min]
        min = j
      end
      j &+= 1
    end
    tmp = ptr[i]
    ptr[i] = ptr[min]
    ptr[min] = tmp
    i &+= 1
  end
end

# Produce a reversed list of 30k elements
list = (1..30000).to_a.reverse
slice = Slice.new(list.to_unsafe, list.size)
# Sort it and then print its head and tail
puts Time.measure {
  selection_sort(slice)
}
p list[0...10]
p list[-10..-1]
It still takes 0.86s for me.
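(These are all single Time.measure runs; to rule out noise, the standard library’s Benchmark.ips could be used instead — an untested sketch, with the versions renamed to made-up names so they can coexist in one file, and a dup so every iteration sorts the reversed list again rather than an already-sorted one. It takes a while, since each iteration is close to a second.)

require "benchmark"

# selection_sort_idiomatic / selection_sort_unsafe are made-up names for the
# first and second versions above, renamed so both can be defined at once.
base = (1..30000).to_a.reverse

Benchmark.ips do |x|
  x.report("idiomatic") { selection_sort_idiomatic(base.dup) }
  x.report("unsafe")    { selection_sort_unsafe(base.dup) }
end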
Note: I’m using LLVM 13. Maybe with LLVM 14 this is faster?
I tried putting a @[NoInline] annotation on selection_sort to see the generated LLVM code. Here it is:
define void @"*selection_sort<Slice(Int32)>:Nil"(%"Slice(Int32)" %arr) local_unnamed_addr #24 !dbg !29895 {
alloca:
  %arr.fca.0.extract = extractvalue %"Slice(Int32)" %arr, 0, !dbg !29896
  %arr.fca.2.extract = extractvalue %"Slice(Int32)" %arr, 2, !dbg !29896
  %0 = icmp sgt i32 %arr.fca.0.extract, 0, !dbg !29897
  br i1 %0, label %while2.preheader.preheader, label %fail, !dbg !29897

while2.preheader.preheader: ; preds = %alloca
  %1 = zext i32 %arr.fca.0.extract to i64, !dbg !29898
  %wide.trip.count = zext i32 %arr.fca.0.extract to i64, !dbg !29897
  br label %while2.preheader, !dbg !29898

while2.preheader: ; preds = %while2.preheader.preheader, %fail4
  %indvars.iv12 = phi i64 [ 0, %while2.preheader.preheader ], [ %indvars.iv.next13, %fail4 ]
  %indvars.iv = phi i64 [ 1, %while2.preheader.preheader ], [ %indvars.iv.next, %fail4 ]
  %indvars.iv.next13 = add nuw nsw i64 %indvars.iv12, 1, !dbg !29899
  %2 = icmp ult i64 %indvars.iv.next13, %1, !dbg !29898
  %3 = trunc i64 %indvars.iv12 to i32, !dbg !29898
  br i1 %2, label %body3.preheader, label %fail4, !dbg !29898

body3.preheader: ; preds = %while2.preheader
  br label %body3, !dbg !29898

fail: ; preds = %fail4, %alloca
  ret void, !dbg !29900

body3: ; preds = %body3.preheader, %body3
  %indvars.iv10 = phi i64 [ %indvars.iv.next11, %body3 ], [ %indvars.iv, %body3.preheader ]
  %min.06 = phi i32 [ %j.0.min.0, %body3 ], [ %3, %body3.preheader ]
  %scevgep = getelementptr i32, i32* %arr.fca.2.extract, i64 %indvars.iv10, !dbg !29901
  %4 = load i32, i32* %scevgep, align 4, !dbg !29901
  %5 = sext i32 %min.06 to i64, !dbg !29903
  %6 = getelementptr inbounds i32, i32* %arr.fca.2.extract, i64 %5, !dbg !29903
  %7 = load i32, i32* %6, align 4, !dbg !29906
  %8 = icmp slt i32 %4, %7, !dbg !29907
  %tmp = trunc i64 %indvars.iv10 to i32
  %j.0.min.0 = select i1 %8, i32 %tmp, i32 %min.06, !dbg !29908
  %indvars.iv.next11 = add nuw nsw i64 %indvars.iv10, 1, !dbg !29909
  %tmp15 = trunc i64 %indvars.iv.next11 to i32
  %tmp16 = trunc i64 %wide.trip.count to i32
  %exitcond.not = icmp eq i32 %tmp16, %tmp15, !dbg !29898
  br i1 %exitcond.not, label %fail4, label %body3, !dbg !29898

fail4: ; preds = %body3, %while2.preheader
  %min.0.lcssa = phi i32 [ %3, %while2.preheader ], [ %j.0.min.0, %body3 ], !dbg !29909
  %9 = getelementptr inbounds i32, i32* %arr.fca.2.extract, i64 %indvars.iv12, !dbg !29910
  %10 = load i32, i32* %9, align 4, !dbg !29913
  %11 = sext i32 %min.0.lcssa to i64, !dbg !29914
  %12 = getelementptr inbounds i32, i32* %arr.fca.2.extract, i64 %11, !dbg !29914
  %13 = load i32, i32* %12, align 4, !dbg !29917
  store i32 %13, i32* %9, align 4, !dbg !29918
  store i32 %10, i32* %12, align 4, !dbg !29920
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !29897
  %exitcond14.not = icmp eq i64 %indvars.iv.next13, %wide.trip.count, !dbg !29897
  br i1 %exitcond14.not, label %fail, label %while2.preheader, !dbg !29897
}
I have no idea what LLVM did here. What are all those indvars? Why is there a bunch of i64 stuff? (I think that’s related to the pointers… is that what’s making things slower?)
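(If anyone wants to reproduce the dump: the annotation goes right above the def, and — if I remember the flag correctly — adding --emit llvm-ir to a --release build writes the optimized IR to a .ll file next to the binary. Rough sketch, with a made-up file name:)

# sort_demo.cr (made-up name) — only meant to show where the annotation goes.
# Build with:  crystal build --release --emit llvm-ir sort_demo.cr
# which should also produce sort_demo.ll with the optimized IR.
@[NoInline]
def selection_sort(arr)
  # ... same body as the pointer version above ...
end

list = (1..30000).to_a.reverse
selection_sort(Slice.new(list.to_unsafe, list.size))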
Anyway… it’s not a huge deal, but I feel that either something changed over time, or that there’s still a lot of room for improvement.
Also: overflow detection might not be what’s impacting performance here, because it seems to be even faster than not doing it ( )
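(One way to isolate that would be to take the pointer version and switch only the arithmetic back to checked operators — an untimed sketch, with a made-up name so it can sit next to the other versions:)

# Same as the pointer version above, but with overflow-checked arithmetic:
# only &+ / &+= change back to + / +=, the unsafe indexing stays.
# selection_sort_checked is a made-up name.
def selection_sort_checked(arr)
  n = arr.size
  ptr = arr.to_unsafe
  i = 0
  while i < n
    min = i
    j = i + 1
    while j < n
      if ptr[j] < ptr[min]
        min = j
      end
      j += 1
    end
    tmp = ptr[i]
    ptr[i] = ptr[min]
    ptr[min] = tmp
    i += 1
  end
end

list = (1..30000).to_a.reverse
puts Time.measure {
  selection_sort_checked(Slice.new(list.to_unsafe, list.size))
}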