How to easily get consistent object hash?

Every time a program runs this code:

struct Foo
  def initialize(@foo : String)
  end
end

pp Foo.new("bar").hash

I get a different Int64 result. How can I get a result that is consistent across runs? I guess I could use OpenSSL::SHA1, but I couldn’t make it work due to a missing #string method…

Edit: see the post with benchmarks further down this thread.

I could do

require "openssl"

digest = OpenSSL::Digest.new("SHA1")
digest.update(foo.inspect)
pp digest.hexdigest

But this looks like a bit of overhead – the inspect part in particular.

In my case, I don’t need cryptographic security; I just want a unique value that changes whenever a Foo instance variable changes, and that is consistent across runs. With the SHA1 digest I get 1.13M IPS with this code:

Benchmark code
require "openssl"

struct Foo
  def initialize(@foo : String)
  end
end

foo = Foo.new("bar")
digest = OpenSSL::Digest.new("SHA1")
io = IO::Memory.new

require "benchmark"

Benchmark.ips do |x|
  x.report do
    io.rewind
    foo.inspect(io)
    digest.update(io)
    hash = digest.hexdigest
  end
end

It’s kinda slow :confused:

I think this is what you want:

struct Foo
  def initialize(@foo : String)
  end
  
  def hash(hasher)
    hasher = @foo.hash(hasher)
  end
end

This would require explicitly defining the hash function on every object. That’s not what I want, @drujensen


The problem here is that the hasher uses seed initialization, different on each run, to avoid certain security issues:

  @@seed = uninitialized UInt64[2]
  Crystal::System::Random.random_bytes(Slice.new(pointerof(@@seed).as(UInt8*), sizeof(typeof(@@seed))))

Apparently there’s no way to define a custom hasher.

My guess is you should implement your own hashing system.
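To illustrate the mechanism, here is a sketch of the idea in Python (illustrative only, not Crystal’s actual algorithm): folding a per-process random seed into the hash state makes results repeatable within a run but different between runs.

```python
# Sketch of a seeded hash (illustrative only, not Crystal's actual hasher):
# the seed is folded into the initial state, so a different random seed per
# process yields different hashes for the same input.
def seeded_hash(data: bytes, seed: int) -> int:
    h = seed ^ 0x811C9DC5  # fold the seed into an FNV-1a-style initial state
    for byte in data:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF  # mix, wrapping at 32 bits
    return h

# Same seed (same run): identical results.
print(seeded_hash(b"bar", 7) == seeded_hash(b"bar", 7))  # True
# Different seed (a new run): a different result for the same input.
print(seeded_hash(b"bar", 7) == seeded_hash(b"bar", 8))  # False
```

Each mixing step is bijective in the state, so two different seeds can never converge to the same value for the same input.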


Based on the comparison on this page:

https://blake2.net

I’d expect SHA-1 to be pretty fast. That page claims SHA-1 runs at 909 MiB per second. Maybe there’s some overhead in using OpenSSL, or at least in the way Crystal calls into OpenSSL?

You might want to pick up the minimal C source for one of the faster digests and see what happens if you try to call that directly from Crystal (or rewrite it in Crystal, if you’re more ambitious! :slight_smile: ). It happens that I wanted to try out blake2b yesterday, so I know a nice small repository for that source is at:

One nice thing about blake2 is that you can select what size you want for the digest values, from 1 to 64 bytes.

The dumb little C test program that I wrote yesterday is able to process 2,397 files (with a total of 37,169,518 bytes) in 0.21 seconds. There are two versions of blake2: blake2s for CPUs with 32-bit ints, and blake2b for CPUs with 64-bit ints. My simple program took advantage of blake2b.

Hey, @drosehn, thanks for the response!

I don’t need cryptographic security, however, so I’m looking at MurMur3 and xxHash (see https://github.com/Cyan4973/xxHash – it’s 20 times faster than SHA1). I’ve found https://github.com/kuende/murmur3 and also https://github.com/jreinert/xxhash-crystal, but the latter is a binding, which doesn’t suit my case. MurMur3 is ~4 times faster than Crystal’s SHA1. Still slow, though :smiley:

In your benchmark above, I don’t think it’s SHA1 that is slow, but turning the digest into a hex string:

require "openssl"

struct Foo
  def initialize(@foo : String)
  end
end

foo = Foo.new("bar")
digest = OpenSSL::Digest.new("SHA1")
io = IO::Memory.new

require "benchmark"

Benchmark.ips do |x|
  x.report "SHA1 no-hexdigest" do
    io.rewind
    foo.inspect(io)
    digest.update(io)
  end

  x.report "SHA1 hexdigest" do
    io.rewind
    foo.inspect(io)
    digest.update(io)
    hash = digest.hexdigest
  end

  x.report "nohash" do
    io.rewind
    foo.inspect(io)
  end
end

Here it gives:

SHA1 no-hexdigest  13.18M ( 75.85ns) (± 2.03%)    0 B/op   1.08× slower
   SHA1 hexdigest   1.11M (904.56ns) (± 2.04%)  192 B/op  12.91× slower
           nohash  14.27M ( 70.05ns) (± 1.76%)    0 B/op        fastest

I have no clue why it’s so slow to get the hex value however.

You’re a bit wrong: the actual computation happens in OpenSSL::Digest#finish, as seen here


I had a little chat with the original Crystal::Hasher author, and he proposed just doing

hasher = Crystal::Hasher.new(1, 1)
foo.hash(hasher).result

It works, and it’s consistent and fast, but the result will differ between machines because of the little-/big-endian difference. So it’s not suitable for my purposes, unfortunately…
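The endianness issue can be sketched in Python (a hypothetical illustration): the same 64-bit value occupies memory in a different byte order on little- vs big-endian machines, so a hash computed over the raw bytes differs too.

```python
import struct

value = 0x0123456789ABCDEF

# The same 64-bit integer as laid out in memory on the two architectures:
little = struct.pack("<Q", value)  # least significant byte first
big = struct.pack(">Q", value)     # most significant byte first

# A byte-oriented hash sees different input on each architecture:
print(little == big)        # False
print(little == big[::-1])  # True: same bytes, opposite order
```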

Why do you need this?

I’m building a job processing system which needs to be aware of jobs with the same payload.

Yuri Sokolov advised me to use FNV-1a for this case. It’s consistent both across runs and across machines. And super-fast:

module FNV
  extend self

  INIT32  =         0x811c9dc5_u32
  INIT64  = 0xcbf29ce484222325_u64
  PRIME32 =         0x01000193_u32
  PRIME64 =      0x100000001b3_u64

  def fnv1a_32(bytes : Bytes)
    hash = INIT32

    bytes.each do |byte|
      hash = hash ^ byte
      hash = hash &* PRIME32
    end

    hash
  end

  def fnv1a_64(bytes : Bytes)
    hash = INIT64

    bytes.each do |byte|
      hash = hash ^ byte
      hash = hash &* PRIME64
    end

    hash
  end
end

small_payload = "foo".to_slice
big_payload = ("a" * 1000).to_slice

require "benchmark"

Benchmark.ips do |x|
  x.report "small 32" do
    hash = FNV.fnv1a_32(small_payload).to_s
  end

  x.report "small 64" do
    hash = FNV.fnv1a_64(small_payload).to_s
  end

  x.report "big 32" do
    hash = FNV.fnv1a_32(big_payload).to_s
  end

  x.report "big 64" do
    hash = FNV.fnv1a_64(big_payload).to_s
  end
end
  small 32  19.43M ( 51.47ns) (± 1.81%)   32 B/op        fastest
  small 64  11.07M ( 90.36ns) (± 1.73%)   48 B/op   1.76× slower
    big 32 505.53k (  1.98µs) (± 5.51%)   32 B/op  38.44× slower
    big 64 506.68k (  1.97µs) (± 4.15%)   48 B/op  38.35× slower

I’m happy, it fits my use-case perfectly!

Update: it’s very slow on big entries. I’m not so happy again…
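For what it’s worth, the FNV-1a loop above is easy to cross-check against published test vectors. The same 32-bit version in Python:

```python
# FNV-1a 32-bit, the same loop as the Crystal version above.
FNV32_INIT = 0x811C9DC5
FNV32_PRIME = 0x01000193

def fnv1a_32(data: bytes) -> int:
    h = FNV32_INIT
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # wrapping multiply, like Crystal's &*
    return h

# Matches the published FNV-1a test vector for "foo":
print(hex(fnv1a_32(b"foo")))  # 0xa9f37ed7
```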

But it’s not clear why you need a hash for a payload… why not serialize each job using JSON?

Because a JSON serialization result can be of arbitrary length, and I’m about to store it in Redis, so I’m trying to optimize things as much as possible :slight_smile:

I also managed to try https://github.com/ysbaddaden/siphash.cr, and it works better than FNV on big strings. I’ve compared the three – OpenSSL::SHA1, FNV and SipHash – on 10-, 100-, 500- and 10000-byte strings. I’ve drawn my conclusions, and I’m going to give the developer a choice of which algorithm to use for a job.

Benchmark code
require "openssl"
require "./fnv1a"
require "siphash"
require "benchmark"
require "random/secure"

payload_10 = ("a" * 10).to_slice
payload_100 = ("a" * 100).to_slice
payload_500 = ("a" * 500).to_slice
payload_10000 = ("a" * 10000).to_slice

sha_1digest = OpenSSL::Digest.new("SHA1")
siphash_key = StaticArray(UInt8, 16).new(0)

Benchmark.ips do |x|
  x.report "sha1 10" do
    hash = sha_1digest.update(payload_10).to_s
  end

  x.report "sha1 100" do
    hash = sha_1digest.update(payload_100).to_s
  end

  x.report "sha1 500" do
    hash = sha_1digest.update(payload_500).to_s
  end

  x.report "sha1 10000" do
    hash = sha_1digest.update(payload_10000).to_s
  end

  x.report "fnv1a 10" do
    hash = FNV.fnv1a_32(payload_10).to_s
  end

  x.report "fnv1a 100" do
    hash = FNV.fnv1a_32(payload_100).to_s
  end

  x.report "fnv1a 500" do
    hash = FNV.fnv1a_32(payload_500).to_s
  end

  x.report "fnv1a 10000" do
    hash = FNV.fnv1a_32(payload_10000).to_s
  end

  x.report "siphash 10" do
    hash = SipHash(1, 3).siphash(payload_10, siphash_key).to_s
  end

  x.report "siphash 100" do
    hash = SipHash(1, 3).siphash(payload_100, siphash_key).to_s
  end

  x.report "siphash 500" do
    hash = SipHash(1, 3).siphash(payload_500, siphash_key).to_s
  end

  x.report "siphash 10000" do
    hash = SipHash(1, 3).siphash(payload_10000, siphash_key).to_s
  end
end

Results:

      sha1 10 854.76k (  1.17µs) (± 0.82%)  368 B/op   17.50× slower
     sha1 100 710.62k (  1.41µs) (± 1.20%)  368 B/op   21.05× slower
     sha1 500 468.16k (  2.14µs) (± 0.96%)  368 B/op   31.95× slower
   sha1 10000  54.48k ( 18.36µs) (± 4.26%)  368 B/op  274.55× slower
     fnv1a 10  14.96M ( 66.86ns) (± 1.15%)   32 B/op         fastest
    fnv1a 100   4.04M (247.47ns) (± 1.57%)   32 B/op    3.70× slower
    fnv1a 500 971.51k (  1.03µs) (± 2.13%)   32 B/op   15.40× slower
  fnv1a 10000  53.39k ( 18.73µs) (± 2.13%)   32 B/op  280.18× slower
   siphash 10   8.74M (114.47ns) (± 0.86%)   32 B/op    1.71× slower
  siphash 100   5.23M (191.16ns) (± 0.81%)   48 B/op    2.86× slower
  siphash 500   2.24M (445.82ns) (± 0.91%)   48 B/op    6.67× slower
siphash 10000 161.62k (  6.19µs) (± 2.34%)   48 B/op   92.55× slower

A bit off-topic, but it’s the first time I’ve seen the &* operator:

hash = hash &* PRIME32

What does it mean, exactly?

The &* is the same as the current *. In upcoming versions, * and other operations like + will raise on overflow, while the &-prefixed variants (&*, &+, …) will keep wrapping.
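A quick way to see the difference (a Python sketch): a wrapping multiply on a UInt32 keeps only the low 32 bits of the product instead of overflowing, which can be emulated by masking.

```python
def wrapping_mul_u32(a: int, b: int) -> int:
    # Emulates Crystal's UInt32 &*: multiply, keep only the low 32 bits.
    return (a * b) & 0xFFFFFFFF

# 2**31 * 2 == 2**32, which wraps around to 0 in 32 bits:
print(wrapping_mul_u32(2**31, 2))  # 0
# Small products are unaffected:
print(wrapping_mul_u32(3, 5))  # 15
```

This wrap-around is exactly what `hash = hash &* PRIME32` relies on in the FNV loop earlier in the thread.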


Crystal doesn’t run on any big-endian architectures currently, and it probably never will – big-endian architectures are rare these days. I wouldn’t worry about it and would just use Crystal::Hasher.


:cry: This was an interesting read, sad it’ll be deleted