Is there a way to Digest large files?

Does anyone know how to stream a large file and compute a digest from it? We can't File.read a file over 2 GB. I went through the docs but didn't see a way to pass an IO to Digest::Base.

There used to be a limit requiring file contents to fit in a 32-bit size, but I don't know if that is still the case. What error are you getting?

I tried to do it via:

require "digest/sha1"
require "digest/md5"

large_file = "3gb.file"
file = File.open(large_file)

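slice = Bytes.new(256_000_000)

digest = Digest::SHA1.hexdigest do |ctx|
  # IO#read returns the number of bytes read; 0 means end of file
  while (bytes_read = file.read(slice)) > 0
    # Only hash the part of the buffer that was actually filled
    ctx.update slice[0, bytes_read]
  end
end
file.close

puts digest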

But it throws the same arithmetic overflow error I get when doing File.read on the 3 GB file.

The error in particular is:

Unhandled exception: Arithmetic overflow (OverflowError)
  from ../../../../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/digest/sha1.cr:38:19 in 'update_impl'
  from ../../../../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/digest/base.cr:107:5 in 'update'
  from ../../../../../../../../../play:19:9 in '__crystal_main'
  from ../../../../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/crystal/main.cr:105:5 in 'main_user_code'
  from ../../../../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/crystal/main.cr:91:7 in 'main'
  from ../../../../../../../../../usr/local/Cellar/crystal/0.35.1_1/src/crystal/main.cr:114:3 in 'main'

Looks like files are now 64 bit capable

But slices are not.

It’s a bug in the sha1 code, it should be using wrapping arithmetic. Could you please report it? Thank you!


This line:

Should be:

      @length_low &+= 8

And similarly for the += 1 a bit below that.
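
To illustrate the difference, here is a minimal sketch of checked versus wrapping addition on a UInt32. The variable names `low` and `high` are hypothetical stand-ins for a bit counter split into two 32-bit halves; this is not the actual code from sha1.cr.

```crystal
# Hypothetical 64-bit bit counter kept in two UInt32 halves.
low = UInt32::MAX - 7_u32 # 8 bits away from wrapping around
high = 0_u32

# Checked addition: `+` raises OverflowError when the result does not fit.
overflowed = false
begin
  _ = low + 8_u32
rescue OverflowError
  overflowed = true
end

# Wrapping addition: `&+` discards the overflow, so the carry into the
# high half can be propagated by hand.
low &+= 8_u32
high &+= 1_u32 if low == 0

puts overflowed # true
puts low        # 0
puts high       # 1
```

Since the counter advances by 8 per byte, a checked UInt32 would run out after 2^32 / 8 bytes (512 MiB) of input, which a 3 GB file easily exceeds.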

@asterite Perfect. I wanted to confirm it was a bug before opening an issue.

A current workaround is to use OpenSSL:

require "openssl"

File.open(large_file) do |f|
  slice = Bytes.new(256_000)
  io = OpenSSL::DigestIO.new(f, "MD5")

  # Reading through the DigestIO feeds each chunk to the digest; stop at EOF
  while io.read(slice) > 0; end

  puts io.hexdigest
end

@kalinon Not related to your original inquiry, but you might want to check out Blake3 if you’re going to be hashing large files regularly. @Didactic.Drunk has an implementation for Crystal. https://github.com/didactic-drunk/blake3.cr
