How to use Benchmark correctly?

I found that the compiler optimizes so aggressively that some calls are optimized away into empty statements entirely. A simple example:

require "benchmark"
require "math"

Benchmark.ips do |x|
  x.report("Example A") { Math.cos(1) }
  x.report("Example B") { }
end

Benchmarking without the --release flag gives:

Example A  54.12M ( 18.48ns) (± 0.44%)  0.0B/op   8.70× slower
Example B 471.05M (  2.12ns) (± 5.06%)  0.0B/op        fastest

These results are intuitive, but Crystal prompts me to use the --release flag, and then the results are:

Example A 793.13M (  1.26ns) (± 1.37%)  0.0B/op        fastest
Example B 791.37M (  1.26ns) (± 2.15%)  0.0B/op   1.00× slower

This is clearly not the desired outcome.
How do I use the benchmark correctly?

I can’t see the problem here. In my projects, I only use --release builds to run benchmarks, since I want to check the best option for a production scenario.

Maybe I’m wrong? I’m curious to see the other responses.

For me, when I benchmark, I always run the code for at least 1000 iterations to give the benchmark some warmup time.

require "benchmark"
require "math"

Benchmark.ips do |x|
  x.report("Example A") do
    1000.times do
      Math.cos(1)
    end
  end
  x.report("Example B") do
    1000.times do
      1
    end
  end
end

I’m not familiar with any of the Math functions, so it’s a bit odd to me that these would still be equal. Though I get that testing against a static value like that isn’t very helpful, I’d still expect the static value to always be the faster option.

❯ ./bench 
Example A 773.99M (  1.29ns) (± 0.61%)  0.0B/op        fastest            
Example B 772.76M (  1.29ns) (± 1.06%)  0.0B/op   1.00× slower

:thinking:

I don’t think this is actually needed. Benchmark.ips already does a 2s warmup period before it starts measuring. Having the benchmark additionally iterate 1000 times just inflates both metrics by the same factor, so I don’t think it would really make a difference.

What is the desired outcome? As it stands, it’s showing you that there is essentially no performance hit when using Math.cos, since LLVM is able to optimize it all away in this context, probably because it’s a hardcoded scalar value vs. something only known at runtime. This is more useful than the incorrect non-release version that makes you think there is one.


That’s cool! I didn’t know that.

It’s also customizable: Benchmark - Crystal 1.9.2
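For example, something like this should work if I’m reading the API docs right (warmup and calculation are both given in seconds, defaulting to 2 and 5):

require "benchmark"

# Warm up for 4 seconds and measure for 10 seconds instead of the
# defaults (2 and 5 seconds respectively).
Benchmark.ips(warmup: 4, calculation: 10) do |x|
  x.report("Example A") { Math.cos(1) }
end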


It is! The compiler will compute Math.cos(1) at compile-time, replacing it with a constant. No work done in that place, then!

So the benchmark result is actually telling you that the compiler is optimizing this.

To avoid this, try passing the argument of Math.cos as a runtime value.

require "benchmark"
require "math"

value = ARGV[0].to_f

Benchmark.ips do |x|
  x.report("Example A") { Math.cos(value) }
  x.report("Example B") { }
end

Then you run it like this:

crystal run foo.cr --release -- 0.87234

and the result for me is:

Example A 896.49M (  1.12ns) (± 1.42%)  0.0B/op   1.01× slower
Example B 908.05M (  1.10ns) (± 1.41%)  0.0B/op        fastest

So almost as fast as doing nothing, but still slightly slower than doing nothing.


If you inspect the emitted LLVM IR, you can see that both blocks compile to nothing since Math.cos has no side effects. You must write:

require "benchmark"

x = 1
y = 0.0

Benchmark.ips do |b|
  b.report("Example A") { y = Math.cos(x) }
  b.report("Example B") { }
end

The x is to ensure the argument forms a closure and cannot be optimized away by LLVM; the y is to ensure the return value cannot be optimized away. On my machine this gives:

Example A  65.08M ( 15.37ns) (± 3.94%)  0.0B/op  13.59× slower
Example B 884.50M (  1.13ns) (± 8.33%)  0.0B/op        fastest
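For reference, you can dump the IR yourself with the compiler’s --emit flag (assuming the benchmark file is named bench.cr; this should leave a bench.ll file next to the binary):

crystal build bench.cr --release --emit llvm-ir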

Considering that we don’t want the compiler to perform this kind of optimization when benchmarking, is it possible for the benchmark standard library to achieve this effect without requiring users to write such tricks themselves?

I’m not sure I follow: why wouldn’t you want benchmarks to include compiler optimizations? And what do you mean by “without requiring users to write such tricks themselves”?

Being able to write more readable code that is ultimately as efficient as lower level/less readable code is a big win. Especially when the user doesn’t need to think about it.

EDIT: NVM, I think you’re talking about this specific context where the values are all known at compile time, whereas in real code they would be runtime values.

1 Like

For example, suppose there are two methods: method1 and method2.
We want the compiler to optimize these two methods to be as fast as possible, and rightly so.
Then we wrote the following benchmark to find out which implementation performed better:

Benchmark.ips do |b|
  b.report("Method 1") { method1 }
  b.report("Method 2") { method2 }
end

However, the compiler was smart enough to find that method1 and method2 had no side effects, and we didn’t use their results, so it optimized the above code like this:

Benchmark.ips do |b|
  b.report("Method 1") { }
  b.report("Method 2") { }
end

This is clearly not the result we expected.

To be clear, I only want to prevent the compiler from making this particular optimization, not stop it from optimizing method1 and method2 themselves.
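As a rough sketch of what such a helper might look like (purely hypothetical, nothing like this exists in the standard library), the closure trick from the earlier post could be packaged into a generic method so users don’t have to write it by hand:

require "benchmark"

# Hypothetical helper, NOT part of Crystal's stdlib: `input` and `sink`
# are closured variables, so LLVM can't prove the block is dead code
# and delete it. This is the same x/y trick shown earlier in the thread.
def opaque_report(bench, label : String, input : T, &block : T -> U) forall T, U
  sink = block.call(input) # run once so sink gets a concrete type
  bench.report(label) { sink = block.call(input) }
end

value = ARGV[0].to_f

Benchmark.ips do |b|
  opaque_report(b, "Method 1", value) { |v| Math.cos(v) }
  opaque_report(b, "Method 2", value) { |v| Math.sin(v) }
end

Whether a closured write is enough to defeat LLVM in every situation is another question; other ecosystems expose a dedicated black_box or DoNotOptimize primitive for exactly this reason.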