https://marioarias.hashnode.dev/a-response-to-the-luajit-is-wicked-fast-video

A short blogpost about luajit, C and Crystal

I tried this with Clang 18 and Crystal 1.17.0 on Debian, which also has LLVM 18. I got very different results:

$ hyperfine -w 3 ./clangimg
Benchmark 1: ./clangimg
  Time (mean ± σ):     180.9 ms ±   3.5 ms    [User: 180.1 ms, System: 0.8 ms]
  Range (min … max):   176.2 ms … 187.2 ms    16 runs

$ hyperfine -w 3 ./crystalimg
Benchmark 1: ./crystalimg
  Time (mean ± σ):     213.1 ms ±   3.2 ms    [User: 211.6 ms, System: 1.5 ms]
  Range (min … max):   208.7 ms … 222.4 ms    14 runs

But I also wouldn’t expect Crystal to be significantly faster than Clang simply because it uses LLVM.

Yeah, I’d expect the Crystal code to compile to essentially the same thing as the C code here, except that it also does a bunch of overflow checks. That matches the results hertzdevil got.

You can compare all the versions with a single command:

hyperfine --warmup 3 -m 50 ./clangimg ./clangimg-mc ./crystalimg ./gccimg ./gccimg-mc

In my case:

Summary
  ./crystalimg ran
    1.13 ± 0.02 times faster than ./gccimg
    1.14 ± 0.02 times faster than ./clangimg
    1.19 ± 0.02 times faster than ./gccimg-mc
    1.23 ± 0.02 times faster than ./clangimg-mc

Also, may I point out that, at least in the Crystal version, f is 0? That means every pixel is set to 0: all the green values in image_ramp_green are 0, all the RGB values are 0, and they convert to a grey value of 0. So the code doesn’t actually DO anything.

Looking at the snippet, none of the img[i].* assignments do anything, because those are local copies on the stack.
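To make the copy semantics concrete, here is a minimal sketch. The `Pixel` struct is a hypothetical stand-in for the pixel type in the snippet, not the original code. Because a Crystal struct is a value type, `Array#[]` hands back a copy, so a setter called on that result never reaches the array:

```crystal
# Hypothetical single-field stand-in for the pixel struct under discussion.
struct Pixel
  property green : UInt8

  def initialize(@green : UInt8 = 0_u8)
  end
end

img = Array.new(2) { Pixel.new }

# Array#[] returns a copy of the struct, so this setter changes a temporary.
img[0].green = 200_u8
puts img[0].green # still 0

# Read the element, modify the local copy, then write it back.
px = img[1]
px.green = 200_u8
img[1] = px
puts img[1].green # 200
```

The read, modify, write-back pattern is the plain way to change a struct stored in an array.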

You are all pointing out that my code doesn’t do anything (and I suspected as much). But why does following the same semantics and pattern as the other two languages (which I consider natural) lead to different results, and what would the correct version be?

At least for the value of f, for the code to make sense you should not cast to int at that point, since it’s a very small value and will be 0.
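A small sketch of why the cast matters. The exact expression in the blog post is not shown in this thread, so `f_truncated` below is an assumed reconstruction of the problem rather than the original line:

```crystal
n = 400 * 400

# Assumed reconstruction: the ramp step is about 0.0016, so truncating it
# to an integer leaves 0.
f_truncated = (255.0 / (n - 1)).to_i
# Keeping the step as a Float64 preserves it.
f = 255.0 / (n - 1)

i = 100_000
puts i * f_truncated  # 0
puts((i * f).to_u8)   # 159
```

With the truncated step every pixel in the ramp comes out as 0, whereas the float step yields the intended gradient.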

I have not even looked at the other versions.

The C code at least seems to actually keep the f value, so I modified the Crystal one to be equivalent in that respect. I also added the missing assignment to img[i].alpha, made the image_to_gray values the same, and then ran both via hyperfine:

struct RGBPixel
  property red, blue, green, alpha

  def initialize(@red : UInt8 = 0, @blue : UInt8 = 0, @green : UInt8 = 0, @alpha : UInt8 = 255)
  end
end

def image_ramp_green(n)
  img = Array.new(n) { RGBPixel.new }
  f = 255.0/(n - 1)
  (0...n).each { |i| 
    img[i].green = (i * f).to_u8 
    img[i].alpha = 255
  }
  img
end

def image_to_gray(img, n)
  (0...n).each do |i|
    y = (0.3 * img[i].red + 0.59 * img[i].green + 0.1 * img[i].blue).to_u8
    img[i].red = y
    img[i].green = y
    img[i].blue = y
  end
end

N = 400 * 400
img = image_ramp_green(N)
(0...1000).each { image_to_gray(img, N) }

> hyperfine --warmup 3 -m 50 ./crystalimg ./clangimg
Benchmark 1: ./crystalimg
  Time (mean ± σ):     223.2 ms ±   2.0 ms    [User: 220.4 ms, System: 1.4 ms]
  Range (min … max):   220.0 ms … 229.4 ms    50 runs

Benchmark 2: ./clangimg
  Time (mean ± σ):     255.5 ms ±   3.0 ms    [User: 253.2 ms, System: 0.9 ms]
  Range (min … max):   249.9 ms … 267.0 ms    50 runs

Summary
  ./crystalimg ran
    1.14 ± 0.02 times faster than ./clangimg

I suspect it could be made much faster using some sort of SIMD vectorization tho ;-)

All the time is spent on the overflow checks in the #to_u8 calls. If we replace them with #to_u8!, we get this result:

Benchmark 1: ./crystalimg
  Time (mean ± σ):       1.7 ms ±   0.2 ms    [User: 1.2 ms, System: 0.5 ms]
  Range (min … max):     1.5 ms …   2.6 ms    552 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: ./clangimg
  Time (mean ± σ):     126.6 ms ±   0.5 ms    [User: 125.1 ms, System: 1.1 ms]
  Range (min … max):   126.1 ms … 128.0 ms    23 runs
 
Summary
  ./crystalimg ran
   72.62 ± 7.56 times faster than ./clangimg
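For readers unfamiliar with the two conversions, here is a minimal sketch of how they differ. The values are arbitrary illustrations and not taken from the benchmark:

```crystal
in_range = 200.9
puts in_range.to_u8  # 200 (checked conversion, truncates toward zero)
puts in_range.to_u8! # 200 (unchecked conversion, same result when in range)

out_of_range = 300.0
begin
  # The checked conversion raises when the value does not fit in a UInt8.
  out_of_range.to_u8
rescue OverflowError
  puts "OverflowError"
end
# to_u8! skips that check; its result for an out-of-range float should not
# be relied on, so it is deliberately not called here.
```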

Which is exactly what I meant: the assignments are no-ops, since img[i] copies the pixel to the stack. The following is much closer. (Note that img[i] = img[i].copy_with(...) performs two bounds checks, as the compiler cannot prove that the array remains unchanged in between.)

record RGBPixel, red : UInt8 = 0, green : UInt8 = 0, blue : UInt8 = 0, alpha : UInt8 = 0

def image_ramp_green(n)
  img = Array.new(n) { RGBPixel.new }
  f = 255.0/(n - 1)
  (0...n).each do |i|
    img.update(i, &.copy_with(
      green: (i * f).to_u8!,
      alpha: 255_u8,
    ))
  end
  img
end

def image_to_gray(img, n)
  (0...n).each do |i|
    y = (0.3 * img[i].red + 0.59 * img[i].green + 0.1 * img[i].blue).to_u8!
    img.update(i, &.copy_with(
      red: y,
      green: y,
      blue: y,
    ))
  end
end

N = 400 * 400
img = image_ramp_green(N)
(0...1000).each { image_to_gray(img, N) }

Summary
  ./clangimg ran
    1.25 ± 0.01 times faster than ./crystalimg

The remaining difference is the double indirection in Array for the img variable; if we use a Slice instead, the Crystal and C versions will run at the same speed.
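A Slice-based variant might look like the sketch below. It is not the code that was benchmarked in this thread: the function signatures, the explicit read-and-write-back loop, and the small demo size are illustrative choices, and no timing is claimed for it.

```crystal
record RGBPixel, red : UInt8 = 0_u8, green : UInt8 = 0_u8, blue : UInt8 = 0_u8, alpha : UInt8 = 255_u8

# Builds the green ramp directly in a Slice, which is a pointer plus a size.
def image_ramp_green(n : Int32) : Slice(RGBPixel)
  f = 255.0 / (n - 1)
  Slice(RGBPixel).new(n) { |i| RGBPixel.new(green: (i * f).to_u8!) }
end

# Converts every pixel to gray in place.
def image_to_gray(img : Slice(RGBPixel)) : Nil
  img.size.times do |i|
    px = img[i]
    y = (0.3 * px.red + 0.59 * px.green + 0.1 * px.blue).to_u8!
    # Write the modified copy back so the change is actually stored.
    img[i] = px.copy_with(red: y, green: y, blue: y)
  end
end

img = image_ramp_green(256) # small size for the demo; the thread uses 400 * 400
image_to_gray(img)
puts img[10].red # 5
```

Passing the Slice itself also makes the separate n parameter unnecessary, since the size travels with the pointer.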

I wrote a version that modifies the struct contents, and it has similar performance.