How to easily get consistent object hash?

Every time a program runs this code:

struct Foo
  def initialize(@foo : String)
  end
end

pp Foo.new("bar").hash

I get a different Int64 result. How can I get a result that is consistent across runs? I guess I could use OpenSSL::SHA1, but I couldn’t make it work due to a missing #string method…

Edit: see the post with benchmarks further down this thread.

I could do

require "openssl"

digest = OpenSSL::Digest.new("SHA1")
digest.update(foo.inspect)
pp digest.hexdigest

But this looks like a bit of overhead – the inspect part in particular.

In my case, I don’t need cryptographic security; I just want a unique value that changes whenever a Foo instance variable changes, and that is consistent across runs. With the SHA1 digest I get 1.13M IPS with this code:

Benchmark code
require "openssl"

struct Foo
  def initialize(@foo : String)
  end
end

foo = Foo.new("bar")
digest = OpenSSL::Digest.new("SHA1")
io = IO::Memory.new

require "benchmark"

Benchmark.ips do |x|
  x.report do
    io.rewind
    foo.inspect(io)
    digest.update(io)
    hash = digest.hexdigest
  end
end

It’s kinda slow :confused:

I think this is what you want:

struct Foo
  def initialize(@foo : String)
  end
  
  def hash(hasher)
    hasher = @foo.hash(hasher)
  end
end

This would require explicitly defining the hash function on every object. That’s not what I want, @drujensen


The problem here is that the hasher uses seed initialization, different on each run, to avoid certain security issues:

  @@seed = uninitialized UInt64[2]
  Crystal::System::Random.random_bytes(Slice.new(pointerof(@@seed).as(UInt8*), sizeof(typeof(@@seed))))

Apparently there’s no way to define a custom hasher.

My guess is you should implement your own hashing system.
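To illustrate the mechanism, here is a sketch of the idea in Python (illustrative only, not Crystal’s actual algorithm): folding a per-process random seed into the hash state makes results repeatable within a run but different between runs.

```python
# Sketch of a seeded hash (illustrative only, not Crystal's actual hasher):
# the seed is folded into the initial state, so a different random seed per
# process yields different hashes for the same input.
def seeded_hash(data: bytes, seed: int) -> int:
    h = seed ^ 0x811C9DC5  # fold the seed into an FNV-1a-style initial state
    for byte in data:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF  # mix, wrapping at 32 bits
    return h

# Same seed (same run): identical results.
print(seeded_hash(b"bar", 7) == seeded_hash(b"bar", 7))  # True
# Different seed (a new run): a different result for the same input.
print(seeded_hash(b"bar", 7) == seeded_hash(b"bar", 8))  # False
```

Each mixing step is bijective in the state, so two different seeds can never converge to the same value for the same input.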


Based on the comparison on this page:

https://blake2.net

I’d expect SHA-1 to be pretty fast. That page claims SHA-1 runs at 909 MiB per second. Maybe there’s some overhead in using OpenSSL, or at least in the way Crystal calls into OpenSSL?

You might want to pick up the minimal C source for one of the faster digests and see what happens if you try to call that directly from Crystal (or rewrite it in Crystal, if you’re more ambitious! :slight_smile: ). It happens that I wanted to try out blake2b yesterday, so I know a nice small repository for that source is at:

One nice thing about blake2 is that you can select what size you want for the digest values, from 1 to 64 bytes.

The dumb little C test program that I wrote yesterday is able to process 2,397 files (with a total of 37,169,518 bytes) in 0.21 seconds. There are two versions of blake2: blake2s for CPUs with 32-bit ints, and blake2b for CPUs with 64-bit ints. My simple program took advantage of blake2b.

Hey, @drosehn, thanks for the response!

I don’t need cryptographic security, however, so I’m looking at MurMur3 and xxHash (see https://github.com/Cyan4973/xxHash – it’s 20 times faster than SHA1). I’ve found https://github.com/kuende/murmur3 and also https://github.com/jreinert/xxhash-crystal, but the latter is a binding, which doesn’t suit my case. MurMur3 is ~4 times faster than Crystal’s SHA1. Still slow, though :smiley:

In your benchmark above, I don’t think it’s SHA1 that is slow, but turning the digest into a hex string:

require "openssl"

struct Foo
  def initialize(@foo : String)
  end
end

foo = Foo.new("bar")
digest = OpenSSL::Digest.new("SHA1")
io = IO::Memory.new

require "benchmark"

Benchmark.ips do |x|
  x.report "SHA1 no-hexdigest" do
    io.rewind
    foo.inspect(io)
    digest.update(io)
  end

  x.report "SHA1 hexdigest" do
    io.rewind
    foo.inspect(io)
    digest.update(io)
    hash = digest.hexdigest
  end

  x.report "nohash" do
    io.rewind
    foo.inspect(io)
  end
end

Here it gives:

SHA1 no-hexdigest  13.18M ( 75.85ns) (± 2.03%)    0 B/op   1.08× slower
   SHA1 hexdigest   1.11M (904.56ns) (± 2.04%)  192 B/op  12.91× slower
           nohash  14.27M ( 70.05ns) (± 1.76%)    0 B/op        fastest

I have no clue why it’s so slow to get the hex value however.

You’re a bit wrong: the actual computation happens in OpenSSL::Digest#finish, as seen here


I had a little chat with the original Crystal::Hasher author, and he proposed just doing

hasher = Crystal::Hasher.new(1, 1)
foo.hash(hasher).result

It works, and it’s consistent and fast, but the result will differ between machines because of the little-/big-endian difference. So it’s not suitable for my purposes, unfortunately…
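The endianness issue can be sketched in Python (a hypothetical illustration): the same 64-bit value occupies memory in a different byte order on little- vs big-endian machines, so a hash computed over the raw bytes differs too.

```python
import struct

value = 0x0123456789ABCDEF

# The same 64-bit integer as laid out in memory on the two architectures:
little = struct.pack("<Q", value)  # least significant byte first
big = struct.pack(">Q", value)     # most significant byte first

# A byte-oriented hash sees different input on each architecture:
print(little == big)        # False
print(little == big[::-1])  # True: same bytes, opposite order
```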

Why do you need this?

I’m building a job processing system which needs to be aware of jobs with the same payload.

Yuri Sokolov advised me to use FNV-1a for this case. It’s consistent both across runs and across machines. And super-fast:

module FNV
  extend self

  INIT32  =         0x811c9dc5_u32
  INIT64  = 0xcbf29ce484222325_u64
  PRIME32 =         0x01000193_u32
  PRIME64 =      0x100000001b3_u64

  def fnv1a_32(bytes : Bytes)
    hash = INIT32

    bytes.each do |byte|
      hash = hash ^ byte
      hash = hash &* PRIME32
    end

    hash
  end

  def fnv1a_64(bytes : Bytes)
    hash = INIT64

    bytes.each do |byte|
      hash = hash ^ byte
      hash = hash &* PRIME64
    end

    hash
  end
end

small_payload = "foo".to_slice
big_payload = ("a" * 1000).to_slice

require "benchmark"

Benchmark.ips do |x|
  x.report "small 32" do
    hash = FNV.fnv1a_32(small_payload).to_s
  end

  x.report "small 64" do
    hash = FNV.fnv1a_64(small_payload).to_s
  end

  x.report "big 32" do
    hash = FNV.fnv1a_32(big_payload).to_s
  end

  x.report "big 64" do
    hash = FNV.fnv1a_64(big_payload).to_s
  end
end
  small 32  19.43M ( 51.47ns) (± 1.81%)   32 B/op        fastest
  small 64  11.07M ( 90.36ns) (± 1.73%)   48 B/op   1.76× slower
    big 32 505.53k (  1.98µs) (± 5.51%)   32 B/op  38.44× slower
    big 64 506.68k (  1.97µs) (± 4.15%)   48 B/op  38.35× slower

I’m happy, it fits my use-case perfectly!

Update: it’s very slow on big entries. I’m not so happy again…
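For what it’s worth, the FNV-1a loop above is easy to cross-check against published test vectors. The same 32-bit version in Python:

```python
# FNV-1a 32-bit, the same loop as the Crystal version above.
FNV32_INIT = 0x811C9DC5
FNV32_PRIME = 0x01000193

def fnv1a_32(data: bytes) -> int:
    h = FNV32_INIT
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # wrapping multiply, like Crystal's &*
    return h

# Matches the published FNV-1a test vector for "foo":
print(hex(fnv1a_32(b"foo")))  # 0xa9f37ed7
```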

But it’s not clear why you need a hash for a payload… why not serialize each job using JSON?

Because a JSON serialization result can be of arbitrary length, and I’m about to store it in Redis, so I’m trying to optimize things as much as possible :slight_smile:

I also managed to try https://github.com/ysbaddaden/siphash.cr, and it works better than FNV on big strings. I’ve compared the three – OpenSSL::SHA1, FNV and SipHash – on 10-, 100-, 500- and 10000-byte strings. I’ve drawn my conclusions, and I’m going to give the developer a choice of which algorithm to use for a job.

Benchmark code
require "openssl"
require "./fnv1a"
require "siphash"
require "benchmark"
require "random/secure"

payload_10 = ("a" * 10).to_slice
payload_100 = ("a" * 100).to_slice
payload_500 = ("a" * 500).to_slice
payload_10000 = ("a" * 10000).to_slice

sha_1digest = OpenSSL::Digest.new("SHA1")
siphash_key = StaticArray(UInt8, 16).new(0)

Benchmark.ips do |x|
  x.report "sha1 10" do
    hash = sha_1digest.update(payload_10).to_s
  end

  x.report "sha1 100" do
    hash = sha_1digest.update(payload_100).to_s
  end

  x.report "sha1 500" do
    hash = sha_1digest.update(payload_500).to_s
  end

  x.report "sha1 10000" do
    hash = sha_1digest.update(payload_10000).to_s
  end

  x.report "fnv1a 10" do
    hash = FNV.fnv1a_32(payload_10).to_s
  end

  x.report "fnv1a 100" do
    hash = FNV.fnv1a_32(payload_100).to_s
  end

  x.report "fnv1a 500" do
    hash = FNV.fnv1a_32(payload_500).to_s
  end

  x.report "fnv1a 10000" do
    hash = FNV.fnv1a_32(payload_10000).to_s
  end

  x.report "siphash 10" do
    hash = SipHash(1, 3).siphash(payload_10, siphash_key).to_s
  end

  x.report "siphash 100" do
    hash = SipHash(1, 3).siphash(payload_100, siphash_key).to_s
  end

  x.report "siphash 500" do
    hash = SipHash(1, 3).siphash(payload_500, siphash_key).to_s
  end

  x.report "siphash 10000" do
    hash = SipHash(1, 3).siphash(payload_10000, siphash_key).to_s
  end
end

Results:

      sha1 10 854.76k (  1.17µs) (± 0.82%)  368 B/op   17.50× slower
     sha1 100 710.62k (  1.41µs) (± 1.20%)  368 B/op   21.05× slower
     sha1 500 468.16k (  2.14µs) (± 0.96%)  368 B/op   31.95× slower
   sha1 10000  54.48k ( 18.36µs) (± 4.26%)  368 B/op  274.55× slower
     fnv1a 10  14.96M ( 66.86ns) (± 1.15%)   32 B/op         fastest
    fnv1a 100   4.04M (247.47ns) (± 1.57%)   32 B/op    3.70× slower
    fnv1a 500 971.51k (  1.03µs) (± 2.13%)   32 B/op   15.40× slower
  fnv1a 10000  53.39k ( 18.73µs) (± 2.13%)   32 B/op  280.18× slower
   siphash 10   8.74M (114.47ns) (± 0.86%)   32 B/op    1.71× slower
  siphash 100   5.23M (191.16ns) (± 0.81%)   48 B/op    2.86× slower
  siphash 500   2.24M (445.82ns) (± 0.91%)   48 B/op    6.67× slower
siphash 10000 161.62k (  6.19µs) (± 2.34%)   48 B/op   92.55× slower

A bit off-topic, but it’s the first time I’ve seen the &* operator:

hash = hash &* PRIME32

What does it mean, exactly?

The &* is the same as the current *. In upcoming versions, * and other operations like + will raise on overflow, while the &-prefixed variants (&*, &+, …) will keep wrapping.
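A quick way to see the difference (a Python sketch): a wrapping multiply on a UInt32 keeps only the low 32 bits of the product instead of overflowing, which can be emulated by masking.

```python
def wrapping_mul_u32(a: int, b: int) -> int:
    # Emulates Crystal's UInt32 &*: multiply, keep only the low 32 bits.
    return (a * b) & 0xFFFFFFFF

# 2**31 * 2 == 2**32, which wraps around to 0 in 32 bits:
print(wrapping_mul_u32(2**31, 2))  # 0
# Small products are unaffected:
print(wrapping_mul_u32(3, 5))  # 15
```

This wrap-around is exactly what `hash = hash &* PRIME32` relies on in the FNV loop earlier in the thread.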


Crystal doesn’t run on any big-endian architectures currently, and it probably never will – big-endian architectures are rare these days. I wouldn’t worry about it and would just use Crystal::Hasher.


:cry: This was an interesting read, sad it’ll be deleted