Benchmark: Int32 -> StaticArray(UInt8) BigEndian

aiac · June 21, 2024, 3:49pm

code

require "benchmark"

def shr(value : Int32)
  buffer = UInt8.static_array(0, 0, 0, 0)
  buffer[0] = (value >> 24).to_u8!
  buffer[1] = (value >> 16).to_u8!
  buffer[2] = (value >> 8).to_u8!
  buffer[3] = (value).to_u8!
  buffer
end

def unsafe_as(value : Int32)
  value.unsafe_as(StaticArray(UInt8, 4)).reverse!
end

def io(value : Int32)
  buffer = UInt8.static_array(0, 0, 0, 0)
  io = IO::Memory.new(buffer.to_slice)
  io.write_bytes(value, IO::ByteFormat::BigEndian)
  buffer
end

Benchmark.ips(warmup: 4, calculation: 10) do |x|
  x.report "shr" do
    shr(-100)
  end

  x.report "unsafe_as" do
    unsafe_as(-100)
  end

  x.report "io" do
    io(-100)
  end
end

result

linux

❯ crystal run --release test.cr
      shr 884.61M (  1.13ns) (± 2.81%)   0.0B/op        fastest
unsafe_as 884.51M (  1.13ns) (± 1.36%)   0.0B/op   1.00× slower
       io  49.58M ( 20.17ns) (± 1.02%)  96.0B/op  17.84× slower

windows

╰─❯ crystal run --release test.cr
      shr 888.95M (  1.12ns) (± 1.43%)   0.0B/op   1.00× slower
unsafe_as 889.03M (  1.12ns) (± 1.29%)   0.0B/op        fastest
       io  18.22M ( 54.89ns) (± 1.15%)  96.0B/op  48.80× slower

UPDATE:

Added the byte_format solution provided by @blacksmoke16:

+ def byte_format(value : Int32)
+   buffer = UInt8.static_array(0, 0, 0, 0)
+   IO::ByteFormat::BigEndian.encode(value, buffer.to_slice)
+   buffer
+ end

linux

❯ crystal run --release test.cr
        shr 848.20M (  1.18ns) (± 9.38%)   0.0B/op   1.00× slower
  unsafe_as 846.83M (  1.18ns) (± 8.63%)   0.0B/op   1.00× slower
         io  47.15M ( 21.21ns) (± 9.51%)  96.0B/op  18.02× slower
byte_format 849.82M (  1.18ns) (± 8.68%)   0.0B/op        fastest

windows

╰─❯ crystal run --release test.cr
        shr 859.45M (  1.16ns) (± 8.17%)   0.0B/op   1.00× slower
  unsafe_as 860.80M (  1.16ns) (± 7.45%)   0.0B/op        fastest
         io  15.75M ( 63.48ns) (±11.52%)  96.0B/op  54.64× slower
byte_format 860.23M (  1.16ns) (± 7.06%)   0.0B/op   1.00× slower

Blacksmoke16 · June 21, 2024, 3:56pm

FWIW that’s what the 5th column is showing. E.g. that shr and unsafe_as allocate 0 bytes of memory per operation, while io allocates 96 bytes. This is probably why the io one is the slowest.
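A quick way to look at the allocation difference outside of Benchmark.ips is Benchmark.memory, which reports the bytes allocated while a block runs. This is only a sketch: the two helpers are copied from the original post, the variable names are mine, and the exact byte counts may differ between Crystal versions and platforms.

```crystal
require "benchmark"

# Copied from the original post: goes through an IO::Memory instance.
def io(value : Int32)
  buffer = UInt8.static_array(0, 0, 0, 0)
  io = IO::Memory.new(buffer.to_slice)
  io.write_bytes(value, IO::ByteFormat::BigEndian)
  buffer
end

# Copied from the original post: encodes straight into the stack buffer.
def byte_format(value : Int32)
  buffer = UInt8.static_array(0, 0, 0, 0)
  IO::ByteFormat::BigEndian.encode(value, buffer.to_slice)
  buffer
end

# Bytes allocated on the heap by a single call of each helper.
io_bytes = Benchmark.memory { io(-100) }
byte_format_bytes = Benchmark.memory { byte_format(-100) }

puts "io:          #{io_bytes} bytes"
puts "byte_format: #{byte_format_bytes} bytes"
```

IO::Memory is a class, so each call to io creates a heap object, while the StaticArray in the other helpers lives on the stack.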

EDIT: Also seems unsafe_as is essentially what IO::ByteFormat::BigEndian.encode is doing: crystal/src/io/byte_format.cr at 04998c0c7a247153a136f1a4eecb1bbf655d1ac5 · crystal-lang/crystal · GitHub.
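For anyone who wants to confirm the variants really are interchangeable, here is a small sanity check. The four helpers are copied from the original post; the expected constant is my own, derived from -100 being 0xFFFFFF9C as a two's-complement Int32. One caveat: the unsafe_as helper always reverses the bytes, so it only yields big-endian output on a little-endian host.

```crystal
def shr(value : Int32)
  buffer = UInt8.static_array(0, 0, 0, 0)
  buffer[0] = (value >> 24).to_u8!
  buffer[1] = (value >> 16).to_u8!
  buffer[2] = (value >> 8).to_u8!
  buffer[3] = (value).to_u8!
  buffer
end

# Unconditional reverse!: assumes the host is little-endian.
def unsafe_as(value : Int32)
  value.unsafe_as(StaticArray(UInt8, 4)).reverse!
end

def io(value : Int32)
  buffer = UInt8.static_array(0, 0, 0, 0)
  io = IO::Memory.new(buffer.to_slice)
  io.write_bytes(value, IO::ByteFormat::BigEndian)
  buffer
end

def byte_format(value : Int32)
  buffer = UInt8.static_array(0, 0, 0, 0)
  IO::ByteFormat::BigEndian.encode(value, buffer.to_slice)
  buffer
end

value = -100 # 0xFFFFFF9C
expected = UInt8.static_array(0xFF, 0xFF, 0xFF, 0x9C)

# Every variant should produce the same big-endian byte sequence.
{shr(value), unsafe_as(value), io(value), byte_format(value)}.each do |bytes|
  raise "unexpected bytes: #{bytes}" unless bytes == expected
end
puts "all four variants agree"
```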

jgaskins · June 21, 2024, 6:30pm

Any time you see a Benchmark.ips entry taking ~1ns, you’ve hit the floor for how low you can measure. This usually means one or both of these things:

The operation is faster than 1ns
LLVM is optimizing out the block entirely

Running your code on my machine indicated that both of these things were happening. To get an accurate benchmark, we need to run multiple iterations within each report block, and we need a side effect so LLVM can't optimize the report block out entirely.

require "benchmark"

def shr(value : Int32)
  buffer = UInt8.static_array(0, 0, 0, 0)
  buffer[0] = (value >> 24).to_u8!
  buffer[1] = (value >> 16).to_u8!
  buffer[2] = (value >> 8).to_u8!
  buffer[3] = (value).to_u8!
  buffer
end

def unsafe_as(value : Int32)
  value.unsafe_as(StaticArray(UInt8, 4)).reverse!
end

def io(value : Int32)
  buffer = UInt8.static_array(0, 0, 0, 0)
  io = IO::Memory.new(buffer.to_slice)
  io.write_bytes(value, IO::ByteFormat::BigEndian)
  buffer
end

def byte_format(value : Int32)
  buffer = UInt8.static_array(0, 0, 0, 0)
  IO::ByteFormat::BigEndian.encode(value, buffer.to_slice)
  buffer
end

values = [nil] of StaticArray(UInt8, 4)?
Benchmark.ips do |x|
  iterations = 1_000
  x.report "shr" { iterations.times { values[0] = shr(-100) } }
  x.report "unsafe_as" { iterations.times { values[0] = unsafe_as(-100) } }
  x.report "io" { iterations.times { values[0] = io(-100) } }
  x.report "byte_format" { iterations.times { values[0] = byte_format -100 } }
end

Here, we run the methods 1000x per measurement and mutate an element of an array allocated at the outermost scope to store the result of the method as our side effect. This fixes both issues above. With that in place, these are the results on my machine:

        shr   2.13M (468.65ns) (± 0.72%)    0.0B/op        fastest
  unsafe_as   2.12M (472.51ns) (± 1.78%)    0.0B/op   1.01× slower
         io  48.67k ( 20.55µs) (± 1.76%)  93.8kB/op  43.84× slower
byte_format   2.12M (470.85ns) (± 1.88%)    0.0B/op   1.00× slower

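Every timing here is 1,000x larger than the per-call cost because of iterations = 1_000. The entries reported in nanoseconds are really picoseconds per call (shr at 468.65ns per report is about 0.47ns, or roughly 469ps, per call), and the io entry reported as 20.55µs is about 20.55ns per call. The same scaling applies to memory: 93.8kB/op for io is 1,000 allocations of 96 bytes, which matches the 96.0B/op in the original results.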
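To make the per-call numbers concrete, here is a small sketch that divides the reported timings by the iteration count. The figures are copied from the results above; the hash name is just for illustration.

```crystal
iterations = 1_000

# Per-report timings from the results above, all converted to nanoseconds.
per_report_ns = {
  "shr"         => 468.65,
  "unsafe_as"   => 472.51,
  "io"          => 20_550.0, # 20.55µs
  "byte_format" => 470.85,
}

# Each report block ran the method `iterations` times, so divide to get one call.
per_report_ns.each do |name, ns|
  puts "#{name}: #{(ns / iterations).round(3)} ns per call"
end
```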

aiac · June 22, 2024, 3:40am

thanks, very helpful