I was comparing implementations of aerial distance:
module GeoCalculator
  EARTH_RADIUS_IN_KM = 6371.0

  # Haversine distance in km; lat and lon are expected in radians.
  def self.aerial_distance(from, to)
    dlat = to.lat - from.lat
    dlon = to.lon - from.lon
    a = Math.sin(dlat / 2.0)**2
    a += Math.cos(from.lat) * Math.cos(to.lat) * (Math.sin(dlon / 2.0)**2)
    c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))
    c * EARTH_RADIUS_IN_KM
  end
end
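For reference, the method appears to implement the haversine formula. Written out, with the symbols being my own labels rather than anything from the code ($\varphi_1, \lambda_1$ for `from.lat`, `from.lon`; $\varphi_2, \lambda_2$ for `to.lat`, `to.lon`; all in radians; $R$ for `EARTH_RADIUS_IN_KM`):

```latex
\begin{aligned}
\Delta\varphi &= \varphi_2 - \varphi_1, \qquad \Delta\lambda = \lambda_2 - \lambda_1 \\
a &= \sin^2\!\left(\frac{\Delta\varphi}{2}\right)
     + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right) \\
d &= R\,c, \qquad R = 6371.0\ \text{km}
\end{aligned}
```

So each call costs a handful of field reads plus several `sin`, `cos`, `sqrt` and `atan2` evaluations.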
I defined a Point as a struct:
struct Point
  getter :lat, :lon
  def initialize(lat : Float32, lon : Float32)
    @lat = lat
    @lon = lon
  end
end
and also as a class:
class Point
  getter :lat, :lon
  def initialize(lat : Float32, lon : Float32)
    @lat = lat
    @lon = lon
  end
end
Now, take this loop:
barcelona = Point.new(0.7223056104952821, 0.037933055776014836)
paris = Point.new(0.8527087582226643, 0.04105401863784605)
1_000_000.times do
  GeoCalculator.aerial_distance(barcelona, paris)
end
and compile with --release (don’t know if that is relevant).
To my surprise, the loop runs about 3x faster with the struct.
As you can see, the instances are created once before entering the loop, and the bulk of the method is trigonometry.
A struct is allocated on the stack, so the function has direct access to its values.
An object is always a pointer to heap-allocated memory, so it must be dereferenced before it can be used. That adds up, which is why the struct is faster.
But I don’t think the difference should be that big. I’d expect a 10-15% difference, not 200-300%.
How long did it take to loop 1_000_000 times? Can you increase the number of iterations? I think the calculation finishes too quickly, so you are losing benchmark precision. It should run for at least a minute or so for the numbers to be statistically meaningful.
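One way to judge whether a run is long enough is to repeat it and look at the spread between runs. Here is a minimal sketch in Ruby rather than Crystal (the distance code runs in either with small changes); `time_runs` and the stand-in workload are hypothetical, not taken from the original program:

```ruby
# Hypothetical helper: time a block several times with a monotonic clock
# and return the elapsed seconds of each run.
def time_runs(runs)
  Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

# Stand-in workload (an assumption): some trigonometry in a loop.
samples = time_runs(5) do
  100_000.times { |i| Math.sin(i * 0.001) * Math.cos(i * 0.002) }
end

fastest = samples.min
slowest = samples.max
puts "fastest: #{fastest.round(4)}s, slowest: #{slowest.round(4)}s"
```

If the fastest and slowest runs are far apart, the measurement is probably too short or too noisy to compare two implementations.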
Yeah, that was the obvious difference, but, as you said, a factor of 3x seemed too much. Also, that 3x includes all the math!
Ruby (only needs two small changes) consistently yields around 0.8s, Crystal with a class around 0.07s, and with a struct 0.02s. I raised the iterations to 100M and it got even worse in relative terms: 58.5s, 6.6s, and 1.2s respectively.
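To make "worse in relative terms" concrete, here is a small Ruby sketch that computes the class-to-struct ratio from the timings quoted above. The hash layout and labels are my own; only the numbers come from the post:

```ruby
# Timings in seconds, as reported above.
timings = {
  "1M iterations"   => { "ruby" => 0.8,  "class" => 0.07, "struct" => 0.02 },
  "100M iterations" => { "ruby" => 58.5, "class" => 6.6,  "struct" => 1.2 },
}

# How many times slower the class version is than the struct version.
ratios = timings.map do |label, t|
  [label, (t["class"] / t["struct"]).round(2)]
end.to_h

ratios.each do |label, ratio|
  puts "#{label}: class takes #{ratio}x as long as struct"
end
```

The ratio goes from roughly 3.5x at 1M iterations to roughly 5.5x at 100M.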
The program as a whole only has 2 additional objects on the heap, so I doubt the GC is adding any significant overhead.
I don’t have an answer, but I know LLVM is pretty good at optimizing code with structs (scalar values). Maybe one would have to compare the generated LLVM IR, or run the code through a profiler, to draw conclusions.
That will generate a foo.ll file; with the optimizations from --release it comes to about 70k lines. Since aerial_distance gets inlined, searching for call double @atan2 reveals some differences between the two versions, but I didn’t dig into why they are there. The generated code does differ at some point.
Yep, thanks! The way the code is written (e.g. explicit ivar assignment) is a result of comparing Ruby vs Crystal. I compared class vs struct for the sake of it, and was puzzled by the numbers.
The --release is important :-)
Indeed! In the artificial benchmark above the ratio is different.
The thing with that benchmark is that those local variables are used by Benchmark, which captures the block, so the vars become closured and are allocated on the heap. It’s very hard to benchmark.
That connects with something I wondered: if all blocks are inlined, what happens with vars that belong originally to the outer scope (closured)? Probably related?
Can you compile your code with --emit=llvm-ir and post the .ll file here? Also compile it with --emit=asm and attach the .s file. I want to see what asm is generated from the .ll file.