Performance: struct vs class

I was comparing implementations of aerial distance:

module GeoCalculator
  EARTH_RADIUS_IN_KM = 6371.0

  def self.aerial_distance(from, to)
    dlat = to.lat - from.lat
    dlon = to.lon - from.lon

    a  = Math.sin(dlat/2.0)**2
    a += Math.cos(from.lat)*Math.cos(to.lat)*(Math.sin(dlon/2.0)**2)
    c  = 2*Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))

    c*EARTH_RADIUS_IN_KM
  end
end

I defined Point as a struct:

struct Point
  getter :lat, :lon

  def initialize(lat : Float32, lon : Float32)
    @lat = lat
    @lon = lon
  end
end

and also as a class:

class Point
  getter :lat, :lon

  def initialize(lat : Float32, lon : Float32)
    @lat = lat
    @lon = lon
  end
end

Now, take this loop:

barcelona = Point.new(0.7223056104952821, 0.037933055776014836)
paris     = Point.new(0.8527087582226643, 0.04105401863784605)

1_000_000.times do
  GeoCalculator.aerial_distance(barcelona, paris)
end

and compile with --release (I don’t know if that is relevant).

To my surprise, the loop runs about 3x faster with the struct.

As you can see, the instances are created once before entering the loop, and the majority of the method is trigonometry.

Where does the 3x come from?

A struct is allocated on the stack, so the function has direct access to the value.
An object (class instance) is always a pointer to heap-allocated memory, so it must be dereferenced before it can be used. That adds up, which is why the struct is faster.
But I don’t think the difference should be that big. I’d expect a 10-15% difference, not 200-300%.
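
Roughly, you can see the layout difference with sizeof/instance_sizeof. A minimal sketch (StructPoint and ClassPoint are made-up names so both variants fit in one file):

struct StructPoint
  def initialize(@lat : Float32, @lon : Float32)
  end
end

class ClassPoint
  def initialize(@lat : Float32, @lon : Float32)
  end
end

puts sizeof(StructPoint)         # 8 bytes on 64-bit: the two Float32 values themselves
puts sizeof(ClassPoint)          # 8 bytes on 64-bit: just a pointer into the heap
puts instance_sizeof(ClassPoint) # the heap allocation that pointer refers to (includes the type id header)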

How long did it take to run the loop 1_000_000 times? Can you increase the number of iterations? I think the calculation is too fast, so you are losing benchmark precision. It should run for at least a minute or so to be statistically meaningful.
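
Something like this would give a direct number (a rough sketch using Time.measure from the stdlib, with the iteration count bumped):

elapsed = Time.measure do
  100_000_000.times do
    GeoCalculator.aerial_distance(barcelona, paris)
  end
end
puts elapsed.total_seconds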

Yeah, that was the obvious explanation, but, as you said, a factor of 3x seemed too much. Also, that 3x includes all the math!

How long did it take to run the loop 1_000_000 times? Can you increase the number of iterations? I think the calculation is too fast, so you are losing benchmark precision. It should run for at least a minute or so to be statistically meaningful.

Ruby (which only needs two small changes) consistently yields around 0.8s, Crystal with a class around 0.07s, and with a struct 0.02s. I have raised the iterations to 100M and it got even worse in relative terms: 58.5s, 6.6s, 1.2s.

The program as a whole only has 2 additional objects on the heap, so I doubt the GC is adding any significant overhead.
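
One way to sanity-check that, as a sketch (assuming GC.stats and its total_bytes field are available in your Crystal version):

before = GC.stats.total_bytes
1_000_000.times { GeoCalculator.aerial_distance(barcelona, paris) }
puts GC.stats.total_bytes - before # expected to stay near zero: the points are created outside the loop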

I don’t have an answer, but I know LLVM is pretty good at optimizing code with structs (scalar values). Maybe one would have to compare the generated LLVM IR, or run the code through a profiler, to draw conclusions.

@fxn I’m not sure if you are aware of the Benchmark module (github.com/crystal-lang/crystal); it’s handy for comparing implementations.
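
For example, a rough sketch for the aerial_distance case above (it reuses your Point and GeoCalculator definitions; Benchmark.ips picks the iteration count and reports iterations per second):

require "benchmark"

barcelona = Point.new(0.7223056104952821, 0.037933055776014836)
paris     = Point.new(0.8527087582226643, 0.04105401863784605)

Benchmark.ips do |x|
  x.report("aerial_distance") { GeoCalculator.aerial_distance(barcelona, paris) }
end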

The --release is important :-)

You can check the llvm-ir code if you want:

$ crystal build foo.cr --release --no-debug --emit=llvm-ir

That will generate a foo.ll file; with the optimizations applied by --release it will be about 70k lines. Searching for call double @atan2 (since aerial_distance will be inlined) reveals some differences, but I didn’t dig into why they are there. The generated code does differ at some point, though.

I tried to isolate struct vs class access with this simple benchmark:

require "benchmark"

class C
  getter :x, :y

  def initialize(@x : Float32, @y : Float32)
  end
end

c1 = C.new(1.0, 2.0)
c2 = C.new(3.0, 4.0)

struct S
  getter :x, :y

  def initialize(@x : Float32, @y : Float32)
  end
end

s1 = S.new(c1.x, c1.y)
s2 = S.new(c2.x, c2.y)

module M
  def self.f(a, b)
    a.x + b.x + a.y + b.y
  end
end

Benchmark.ips do |x|
  x.report("class") { M.f(c1, c2) }
  x.report("struct") { M.f(s1, s2) }
end

Structs are slower without --release (about 1.5x slower) and perform on par with classes with --release. I can’t reproduce the 3x at all with bare attribute access.

Yep, thanks! The way the code is written (e.g. the explicit ivar assignment) was influenced by the fact that I was comparing Ruby vs Crystal. I compared class vs struct for the sake of it, and was puzzled by the numbers.

The --release is important :-)

Indeed! In the artificial benchmark above the ratio is different.

The thing with that benchmark is that those local variables are used inside the blocks that Benchmark captures, so the vars become closured and are allocated on the heap. It’s very hard to benchmark.
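
One way to reduce that effect, just as a sketch: move the values into constants, which are not local variables and therefore are not closured:

C1 = C.new(1.0_f32, 2.0_f32)
C2 = C.new(3.0_f32, 4.0_f32)
S1 = S.new(C1.x, C1.y)
S2 = S.new(C2.x, C2.y)

Benchmark.ips do |x|
  x.report("class") { M.f(C1, C2) }
  x.report("struct") { M.f(S1, S2) }
end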

Awesome!!!

That connects with something I wondered: if all blocks are inlined, what happens with vars that originally belong to the outer scope (closured vars)? Probably related?

Can you compile your code with --emit=llvm-ir and post the .ll file here? Also compile it with --emit=asm and attach the .s file as well. I want to see what asm code is generated from the .ll file.

The original one with aerial_distance?

Yes. Benchmark is affecting the results.

Sure!

Cool! I will compare the code and check what the heck is going on.