https://marioarias.hashnode.dev/a-response-to-the-luajit-is-wicked-fast-video
A short blogpost about luajit, C and Crystal
I tried this with Clang 18 and Crystal 1.17.0 on Debian, which also uses LLVM 18. I got very different results:
$ hyperfine -w 3 ./clangimg
Benchmark 1: ./clangimg
Time (mean ± σ): 180.9 ms ± 3.5 ms [User: 180.1 ms, System: 0.8 ms]
Range (min … max): 176.2 ms … 187.2 ms 16 runs
$ hyperfine -w 3 ./crystalimg
Benchmark 1: ./crystalimg
Time (mean ± σ): 213.1 ms ± 3.2 ms [User: 211.6 ms, System: 1.5 ms]
Range (min … max): 208.7 ms … 222.4 ms 14 runs
But also I wouldn’t expect Crystal to be significantly faster than Clang for simply using LLVM.
Yeah, I’d expect the Crystal code to compile to essentially the same thing as the C code here, except that it also does a bunch of overflow checks. That matches the results hertzdevil got.
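To make the overflow checks mentioned above concrete, here is a minimal sketch. The values and variable names are made up for illustration and are not taken from the benchmark. It contrasts the checked `to_u8` and `+` with their unchecked counterparts `to_u8!` and `&+`:

```crystal
# Hypothetical values, chosen only to show the difference.
big = 300.7   # does not fit in a UInt8
small = 150.7 # fits once the fraction is truncated

# The checked conversion raises when the value is out of range.
begin
  big.to_u8
rescue ex : OverflowError
  puts "to_u8 raised #{ex.class}"
end

# In range, both conversions truncate toward zero and agree.
puts small.to_u8
puts small.to_u8!

# Integer arithmetic is checked by default as well.
x = 255_u8
begin
  x + 1_u8
rescue OverflowError
  puts "checked add raised"
end

# The wrapping operator skips the check.
puts x &+ 1_u8
```

The checked forms need an extra comparison and branch per call, which is the cost being discussed here.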
You can compare all the versions with a single command:
hyperfine --warmup 3 -m 50 ./clangimg ./clangimg-mc ./crystalimg ./gccimg ./gccimg-mc
In my case:
Summary
./crystalimg ran
1.13 ± 0.02 times faster than ./gccimg
1.14 ± 0.02 times faster than ./clangimg
1.19 ± 0.02 times faster than ./gccimg-mc
1.23 ± 0.02 times faster than ./clangimg-mc
Also, may I point out that, at least in the Crystal version, f is 0, so every pixel is set to 0: all the green values in image_ramp_green are 0, all the RGB values are 0 too, and they convert to a grey value of 0. So the code doesn’t actually DO anything?
Looking at the snippet, none of the img[i].* assignments do anything, because those are local copies on the stack.
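A small sketch of that copy behaviour, using a cut-down, hypothetical `Pixel` struct rather than the `RGBPixel` from the benchmark: structs are value types, so `pixels[0]` hands back a copy, and a setter called on that copy never reaches the array.

```crystal
# Hypothetical one-field struct, only for illustration.
struct Pixel
  property green : UInt8

  def initialize(@green : UInt8 = 0_u8)
  end
end

pixels = Array.new(3) { Pixel.new }

# The setter runs on a temporary copy returned by pixels[0].
pixels[0].green = 200_u8
puts pixels[0].green # => 0

# Reading the copy, changing it and storing it back does stick.
px = pixels[0]
px.green = 200_u8
pixels[0] = px
puts pixels[0].green # => 200
```

With a class instead of a struct the first assignment would be visible, because the array would then hold references.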
All you guys are pointing out that my code doesn’t do anything (and I suspected as much), but why does following the same semantics and pattern as in two other languages (which I consider natural) lead to different results? And if it does, what would the correct version be?
At least for the value of f, for the code to make sense you should not cast to int at that point, since it’s a very small value and will be truncated to 0.
I have not even looked at the other versions.
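For illustration, a minimal sketch of the truncation, assuming the same n = 400 * 400 as in the benchmark (the variable names here are my own):

```crystal
n = 400 * 400

# Step between neighbouring ramp values, kept as a Float64.
f_float = 255.0 / (n - 1)

# Converting the step itself to an integer drops the fraction.
f_int = f_float.to_i

puts f_float # a small positive value, well below 1
puts f_int   # => 0

# With the integer step every ramp value is i * 0, so the image stays black.
puts (n - 1) * f_int # => 0
```

Keeping f as a float and converting only the product `i * f` avoids this.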
The C code at least seems to actually keep the f value, so I modified the Crystal one to be equivalent in that respect. I also added the missing assignment to img[i].alpha, made the image_to_gray values the same, and then ran them via hyperfine:
struct RGBPixel
property red, blue, green, alpha
def initialize(@red : UInt8 = 0, @blue : UInt8 = 0, @green : UInt8 = 0, @alpha : UInt8 = 255)
end
end
def image_ramp_green(n)
img = Array.new(n) { RGBPixel.new }
f = 255.0/(n - 1)
(0...n).each { |i|
img[i].green = (i * f).to_u8
img[i].alpha = 255
}
img
end
def image_to_gray(img, n)
(0...n).each do |i|
y = (0.3 * img[i].red + 0.59 * img[i].green + 0.1 * img[i].blue).to_u8
img[i].red = y
img[i].green = y
img[i].blue = y
end
end
N = 400 * 400
img = image_ramp_green(N)
(0...1000).each { image_to_gray(img, N) }
> hyperfine --warmup 3 -m 50 ./crystalimg ./clangimg
Benchmark 1: ./crystalimg
Time (mean ± σ): 223.2 ms ± 2.0 ms [User: 220.4 ms, System: 1.4 ms]
Range (min … max): 220.0 ms … 229.4 ms 50 runs
Benchmark 2: ./clangimg
Time (mean ± σ): 255.5 ms ± 3.0 ms [User: 253.2 ms, System: 0.9 ms]
Range (min … max): 249.9 ms … 267.0 ms 50 runs
Summary
./crystalimg ran
1.14 ± 0.02 times faster than ./clangimg
I suspect it could be made much faster using some sort of SIMD vectorization tho ;-)
All the time is spent on the overflow checks in the #to_u8 calls. If we replace them with #to_u8!, we get this result:
Benchmark 1: ./crystalimg
Time (mean ± σ): 1.7 ms ± 0.2 ms [User: 1.2 ms, System: 0.5 ms]
Range (min … max): 1.5 ms … 2.6 ms 552 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Benchmark 2: ./clangimg
Time (mean ± σ): 126.6 ms ± 0.5 ms [User: 125.1 ms, System: 1.1 ms]
Range (min … max): 126.1 ms … 128.0 ms 23 runs
Summary
./crystalimg ran
72.62 ± 7.56 times faster than ./clangimg
Which is exactly what I meant: the assignments are no-ops, since img[i] copies the pixel to the stack. The following is much closer (note that img[i] = img[i].copy_with(...) performs two bounds checks, as the array cannot be proved to remain unchanged in between):
record RGBPixel, red : UInt8 = 0, green : UInt8 = 0, blue : UInt8 = 0, alpha : UInt8 = 0
def image_ramp_green(n)
img = Array.new(n) { RGBPixel.new }
f = 255.0/(n - 1)
(0...n).each do |i|
img.update(i, &.copy_with(
green: (i * f).to_u8!,
alpha: 255_u8,
))
end
img
end
def image_to_gray(img, n)
(0...n).each do |i|
y = (0.3 * img[i].red + 0.59 * img[i].green + 0.1 * img[i].blue).to_u8!
img.update(i, &.copy_with(
red: y,
green: y,
blue: y,
))
end
end
N = 400 * 400
img = image_ramp_green(N)
(0...1000).each { image_to_gray(img, N) }
Summary
./clangimg ran
1.25 ± 0.01 times faster than ./crystalimg
The remaining difference is the double indirection in Array for the img variable; if we use a Slice instead, the Crystal and C versions will run at the same speed.
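A rough sketch of what the Slice variant could look like, assuming the only change is the container type. The tiny n of 256 and the final `p` are there just to make the result easy to inspect; this is not the benchmarked program, and nothing here measures speed.

```crystal
record RGBPixel, red : UInt8 = 0, green : UInt8 = 0, blue : UInt8 = 0, alpha : UInt8 = 0

def image_ramp_green(n)
  # Slice instead of Array: a pointer plus a size, with no separate buffer object.
  img = Slice.new(n) { RGBPixel.new }
  f = 255.0 / (n - 1)
  n.times do |i|
    img.update(i, &.copy_with(green: (i * f).to_u8!, alpha: 255_u8))
  end
  img
end

def image_to_gray(img)
  img.size.times do |i|
    # update yields the current pixel and stores whatever the block returns.
    img.update(i) do |px|
      y = (0.3 * px.red + 0.59 * px.green + 0.1 * px.blue).to_u8!
      px.copy_with(red: y, green: y, blue: y)
    end
  end
end

img = image_ramp_green(256) # small, hypothetical size for the demo
image_to_gray(img)
p img[255]
```

Slice supports the same `update` call used above, so the loop bodies can stay as they were in the Array version.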
I wrote a version that modifies the struct contents in place, and it has similar performance.