Performance: struct vs class

I was comparing implementations of aerial distance:

module GeoCalculator
  EARTH_RADIUS_IN_KM = 6371.0

  def self.aerial_distance(from, to)
    dlat = to.lat - from.lat
    dlon = to.lon - from.lon

    a  = Math.sin(dlat/2.0)**2
    a += Math.cos(from.lat)*Math.cos(to.lat)*(Math.sin(dlon/2.0)**2)
    c  = 2*Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))

    c*EARTH_RADIUS_IN_KM
  end
end

I defined Point as a struct:

struct Point
  getter :lat, :lon

  def initialize(lat : Float32, lon : Float32)
    @lat = lat
    @lon = lon
  end
end

and also as a class:

class Point
  getter :lat, :lon

  def initialize(lat : Float32, lon : Float32)
    @lat = lat
    @lon = lon
  end
end

Now, take this loop:

barcelona = Point.new(0.7223056104952821, 0.037933055776014836)
paris     = Point.new(0.8527087582226643, 0.04105401863784605)

1_000_000.times do
  GeoCalculator.aerial_distance(barcelona, paris)
end

and compile with --release (I don’t know if that is relevant).

To my surprise, the loop runs about 3x faster with the struct.

As you can see, the instances are created once before entering the loop, and the majority of the method is trigonometry.

Where does the 3x come from?

A struct is allocated on the stack, so the function has direct access to the value.
An object (class instance) is always a pointer to heap-allocated memory, so it must be dereferenced before it can be used. That adds up, which is why the struct is faster.
But I don’t think the difference should be that big. I’d expect a 10-15% difference, not 200-300%.
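
Roughly, you can see the layout difference with sizeof/instance_sizeof. A minimal sketch (StructPoint and ClassPoint are made-up names so both variants fit in one file):

struct StructPoint
  def initialize(@lat : Float32, @lon : Float32)
  end
end

class ClassPoint
  def initialize(@lat : Float32, @lon : Float32)
  end
end

puts sizeof(StructPoint)         # 8 bytes on 64-bit: the two Float32 values themselves
puts sizeof(ClassPoint)          # 8 bytes on 64-bit: just a pointer into the heap
puts instance_sizeof(ClassPoint) # the heap allocation that pointer refers to (includes the type id header)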

How long did it take to run the loop 1_000_000 times? Can you increase the number of iterations? I think the calculation is too fast, so you are losing benchmark precision. It should run for at least a minute or so to be statistically meaningful.
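
Something like this would give a direct number (a rough sketch using Time.measure from the stdlib, with the iteration count bumped):

elapsed = Time.measure do
  100_000_000.times do
    GeoCalculator.aerial_distance(barcelona, paris)
  end
end
puts elapsed.total_seconds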

Yeah, that was the obvious explanation, but, as you said, a factor of 3x seemed too much. Also, that 3x includes all the math!

How long did it take to run the loop 1_000_000 times? Can you increase the number of iterations? I think the calculation is too fast, so you are losing benchmark precision. It should run for at least a minute or so to be statistically meaningful.

Ruby (which only needs two small changes) consistently yields around 0.8s, Crystal with a class around 0.07s, and with a struct 0.02s. I have raised the iterations to 100M and it got even worse in relative terms: 58.5s, 6.6s, 1.2s.

The program as a whole only has 2 additional objects on the heap, so I doubt the GC is adding any significant overhead.
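
One way to sanity-check that, as a sketch (assuming GC.stats and its total_bytes field are available in your Crystal version):

before = GC.stats.total_bytes
1_000_000.times { GeoCalculator.aerial_distance(barcelona, paris) }
puts GC.stats.total_bytes - before # expected to stay near zero: the points are created outside the loop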

I don’t have an answer, but I know LLVM is pretty good at optimizing code with structs (scalar values). Maybe one would have to compare the generated LLVM IR, or run the code through a profiler, to draw conclusions.

@fxn I’m not sure if you are aware of the Benchmark module (github.com/crystal-lang/crystal); it’s handy for comparing implementations.
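
For example, a rough sketch for the aerial_distance case above (it reuses your Point and GeoCalculator definitions; Benchmark.ips picks the iteration count and reports iterations per second):

require "benchmark"

barcelona = Point.new(0.7223056104952821, 0.037933055776014836)
paris     = Point.new(0.8527087582226643, 0.04105401863784605)

Benchmark.ips do |x|
  x.report("aerial_distance") { GeoCalculator.aerial_distance(barcelona, paris) }
end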

The --release is important :-)

You can check the llvm-ir code if you want:

$ crystal build foo.cr --release --no-debug --emit=llvm-ir

That will generate a foo.ll file; with the optimizations applied by --release it will be about 70k lines. Searching for call double @atan2 (since aerial_distance will be inlined) reveals some differences, but I didn’t dig into why they are there. The generated code does differ at some point, though.

I tried to isolate struct vs class access with this simple benchmark:

require "benchmark"

class C
  getter :x, :y

  def initialize(@x : Float32, @y : Float32)
  end
end

c1 = C.new(1.0, 2.0)
c2 = C.new(3.0, 4.0)

struct S
  getter :x, :y

  def initialize(@x : Float32, @y : Float32)
  end
end

s1 = S.new(c1.x, c1.y)
s2 = S.new(c2.x, c2.y)

module M
  def self.f(a, b)
    a.x + b.x + a.y + b.y
  end
end

Benchmark.ips do |x|
  x.report("class") { M.f(c1, c2) }
  x.report("struct") { M.f(s1, s2) }
end

Structs are slower without --release (about 1.5x slower) and perform on par with classes with --release. I can’t reproduce the 3x at all with bare attribute access.

Yep, thanks! The way the code is written (e.g. the explicit ivar assignment) was influenced by the fact that I was comparing Ruby vs Crystal. I compared class vs struct for the sake of it, and was puzzled by the numbers.

The --release is important :-)

Indeed! In the artificial benchmark above the ratio is different.

The thing with that benchmark is that those local variables are used inside the blocks that Benchmark captures, so the vars become closured and are allocated on the heap. It’s very hard to benchmark.
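
One way to reduce that effect, just as a sketch: move the values into constants, which are not local variables and therefore are not closured:

C1 = C.new(1.0_f32, 2.0_f32)
C2 = C.new(3.0_f32, 4.0_f32)
S1 = S.new(C1.x, C1.y)
S2 = S.new(C2.x, C2.y)

Benchmark.ips do |x|
  x.report("class") { M.f(C1, C2) }
  x.report("struct") { M.f(S1, S2) }
end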

Awesome!!!

That connects with something I wondered: if all blocks are inlined, what happens with vars that originally belong to the outer scope (closured vars)? Probably related?

Can you compile your code with --emit=llvm-ir and post the .ll file here? Also compile it with --emit=asm and attach the .s file as well. I want to see what asm code is generated from the .ll file.

The original one with aerial_distance?

Yes. Benchmark is affecting the results.

Sure!

Cool! I will compare the code and check what the heck is going on.