https://marioarias.hashnode.dev/a-response-to-the-luajit-is-wicked-fast-video

MarioAriasC · August 4, 2025, 6:31pm

https://marioarias.hashnode.dev/a-response-to-the-luajit-is-wicked-fast-video

A short blog post about LuaJIT, C, and Crystal.

HertzDevil · August 4, 2025, 7:53pm

I tried this with Clang 18 and Crystal 1.17.0 on Debian, which also uses LLVM 18. I got very different results:

$ hyperfine -w 3 ./clangimg
Benchmark 1: ./clangimg
  Time (mean ± σ):     180.9 ms ±   3.5 ms    [User: 180.1 ms, System: 0.8 ms]
  Range (min … max):   176.2 ms … 187.2 ms    16 runs

$ hyperfine -w 3 ./crystalimg
Benchmark 1: ./crystalimg
  Time (mean ± σ):     213.1 ms ±   3.2 ms    [User: 211.6 ms, System: 1.5 ms]
  Range (min … max):   208.7 ms … 222.4 ms    14 runs

But I also wouldn’t expect Crystal to be significantly faster than Clang simply because it uses LLVM.

yxhuvud · August 4, 2025, 8:36pm

Yeah, I’d expect the Crystal code to compile to essentially the same thing as the C code here, except that it also does a bunch of overflow checks. That matches the results HertzDevil got.
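
For anyone who has not run into the overflow checks mentioned above, here is a minimal sketch of the difference between the checked and unchecked forms. The literal values are made up for illustration and are not taken from the benchmark.

```crystal
# Checked conversion: raises OverflowError when the value does not fit in UInt8.
raised = false
begin
  300.7.to_u8
rescue OverflowError
  raised = true
end
puts raised # true

# Unchecked conversion: no range check; an in-range float is simply truncated.
truncated = 127.9.to_u8!
puts truncated # 127

# Integer arithmetic has the same split: + is checked, &+ wraps around.
wrapped = 255_u8 &+ 1_u8
puts wrapped # 0
```

The checked forms are the default, so code ported line by line from C ends up with a range check on every conversion unless the `!` and `&` variants are used.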

ralsina · August 4, 2025, 8:42pm

You can compare all the versions with a single command:

hyperfine --warmup 3 -m 50 ./clangimg ./clangimg-mc ./crystalimg ./gccimg ./gccimg-mc

In my case:

Summary
  ./crystalimg ran
    1.13 ± 0.02 times faster than ./gccimg
    1.14 ± 0.02 times faster than ./clangimg
    1.19 ± 0.02 times faster than ./gccimg-mc
    1.23 ± 0.02 times faster than ./clangimg-mc

ralsina · August 4, 2025, 8:52pm

Also, may I point out that, at least in the Crystal version, f is 0? That means every pixel is set to 0: all the green values in image_ramp_green are 0, all the RGB values are 0, and they convert to a grey value of 0. So the code doesn’t actually DO anything.

HertzDevil · August 4, 2025, 9:51pm

Looking at the snippet, none of the img[i].* assignments do anything, because those are local copies on the stack.
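
A minimal sketch of that behaviour, using a hypothetical one-field Pixel struct rather than the RGBPixel from the blog post:

```crystal
# Hypothetical struct for illustration only.
struct Pixel
  property green : UInt8

  def initialize(@green : UInt8 = 0_u8)
  end
end

pixels = Array.new(2) { Pixel.new }

# Array#[] returns a copy of the struct, so this setter mutates a temporary.
pixels[0].green = 200_u8
puts pixels[0].green # 0, the element in the array is unchanged

# Writing the whole struct back through Array#[]= does change the element.
pixel = pixels[1]
pixel.green = 200_u8
pixels[1] = pixel
puts pixels[1].green # 200
```

Declaring Pixel as a class instead of a struct would make the first form work, at the cost of a heap allocation per pixel.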

MarioAriasC · August 4, 2025, 11:59pm

You are all pointing out that my code doesn’t do anything (and I suspected as much), but why does following the same semantics and pattern as the other two languages (which I consider natural) lead to different results? And if it does, what would the correct version be?

ralsina · August 5, 2025, 11:04am

At least for the value of f, for the code to make sense you should not cast to int at that point, since it’s a very small value and will be truncated to 0.
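
A small sketch of what goes wrong, assuming f was computed as 255.0/(n - 1) and then converted to an integer (I am guessing at the exact expression in the post; the names here are only for illustration):

```crystal
n = 400 * 400

f_float = 255.0 / (n - 1) # roughly 0.0016
f_int = f_float.to_i      # truncated to 0

# With the truncated factor, every pixel on the ramp ends up as 0.
mid_with_int = (80_000 * f_int).to_u8
# With the float factor, the middle of the ramp is mid-grey.
mid_with_float = (80_000 * f_float).to_u8

puts f_int          # 0
puts mid_with_int   # 0
puts mid_with_float # 127
```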

I have not even looked at the other versions.

ralsina · August 5, 2025, 12:52pm

The C code at least seems to actually keep the f value, so I modified the Crystal one to be equivalent in that respect. I also added the missing assignment to img[i].alpha and made the image_to_gray values the same, then ran them via hyperfine:

struct RGBPixel
  property red, blue, green, alpha

  def initialize(@red : UInt8 = 0, @blue : UInt8 = 0, @green : UInt8 = 0, @alpha : UInt8 = 255)
  end
end

def image_ramp_green(n)
  img = Array.new(n) { RGBPixel.new }
  f = 255.0/(n - 1)
  (0...n).each do |i|
    img[i].green = (i * f).to_u8
    img[i].alpha = 255
  end
  img
end

def image_to_gray(img, n)
  (0...n).each do |i|
    y = (0.3 * img[i].red + 0.59 * img[i].green + 0.1 * img[i].blue).to_u8
    img[i].red = y
    img[i].green = y
    img[i].blue = y
  end
end

N = 400 * 400
img = image_ramp_green(N)
(0...1000).each { image_to_gray(img, N) }

> hyperfine --warmup 3 -m 50  ./crystalimg ./clangimg
Benchmark 1: ./crystalimg
  Time (mean ± σ):     223.2 ms ±   2.0 ms    [User: 220.4 ms, System: 1.4 ms]
  Range (min … max):   220.0 ms … 229.4 ms    50 runs

Benchmark 2: ./clangimg
  Time (mean ± σ):     255.5 ms ±   3.0 ms    [User: 253.2 ms, System: 0.9 ms]
  Range (min … max):   249.9 ms … 267.0 ms    50 runs

Summary
  ./crystalimg ran
    1.14 ± 0.02 times faster than ./clangimg

I suspect it could be made much faster using some sort of SIMD vectorization tho ;-)

HertzDevil · August 5, 2025, 2:06pm

All the time is spent on the overflow checks in the #to_u8 calls. If we replace them with #to_u8!, we get this result:

Benchmark 1: ./crystalimg
  Time (mean ± σ):       1.7 ms ±   0.2 ms    [User: 1.2 ms, System: 0.5 ms]
  Range (min … max):     1.5 ms …   2.6 ms    552 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: ./clangimg
  Time (mean ± σ):     126.6 ms ±   0.5 ms    [User: 125.1 ms, System: 1.1 ms]
  Range (min … max):   126.1 ms … 128.0 ms    23 runs
 
Summary
  ./crystalimg ran
   72.62 ± 7.56 times faster than ./clangimg

This is exactly what I meant: the assignments are no-ops, since img[i] copies the pixel to the stack, so once the overflow checks are gone there is almost nothing left for the loops to do. The following is much closer to the C version (note that img[i] = img[i].copy_with(...) would incur two bounds checks, because the array cannot be proved to remain unchanged in between):

record RGBPixel, red : UInt8 = 0, green : UInt8 = 0, blue : UInt8 = 0, alpha : UInt8 = 0

def image_ramp_green(n)
  img = Array.new(n) { RGBPixel.new }
  f = 255.0/(n - 1)
  (0...n).each do |i|
    img.update(i, &.copy_with(
      green: (i * f).to_u8!,
      alpha: 255_u8,
    ))
  end
  img
end

def image_to_gray(img, n)
  (0...n).each do |i|
    y = (0.3 * img[i].red + 0.59 * img[i].green + 0.1 * img[i].blue).to_u8!
    img.update(i, &.copy_with(
      red: y,
      green: y,
      blue: y,
    ))
  end
end

N = 400 * 400
img = image_ramp_green(N)
(0...1000).each { image_to_gray(img, N) }

Summary
  ./clangimg ran
    1.25 ± 0.01 times faster than ./crystalimg

The remaining difference is the double indirection in Array for the img variable; if we use a Slice instead, the Crystal and C versions will run at the same speed.
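
A sketch of what the Slice variant could look like. It is the code above with Array.new swapped for Slice.new and a small N so it is easy to check by hand; I have not timed this exact version.

```crystal
record RGBPixel, red : UInt8 = 0, green : UInt8 = 0, blue : UInt8 = 0, alpha : UInt8 = 0

def image_ramp_green(n)
  # A Slice is a pointer plus a size held by value, so indexing avoids the
  # extra hop through the Array object.
  img = Slice.new(n) { RGBPixel.new }
  f = 255.0 / (n - 1)
  n.times do |i|
    img.update(i, &.copy_with(green: (i * f).to_u8!, alpha: 255_u8))
  end
  img
end

def image_to_gray(img, n)
  n.times do |i|
    y = (0.3 * img[i].red + 0.59 * img[i].green + 0.1 * img[i].blue).to_u8!
    img.update(i, &.copy_with(red: y, green: y, blue: y))
  end
end

# Small N chosen for the sketch so that f is exactly 1.0.
n = 256
img = image_ramp_green(n)
ramp_last_green = img[n - 1].green
image_to_gray(img, n)

puts ramp_last_green # 255
puts img[n - 1].red  # 150
```

Slice#update comes from the same Indexable::Mutable module as Array#update, so the loop bodies do not need to change.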

MarioAriasC · August 5, 2025, 6:02pm

I wrote a version that modifies the struct contents, and it has similar performance.
