Digest weirdness

What’s going on here?

require "digest/sha256"

class Summer < IO
  getter digest : Digest::SHA256

  def initialize(@io : IO)
    @digest = Digest::SHA256.new
  end

  def read(slice : Bytes) : Int32
    raise IO::Error.new("can't read from this")
  end

  def write(slice : Bytes) : Nil
    @digest.update(slice)
    @io.write(slice)
  end
end
im = Vips::Image.new_from_file("test.png")

summer = Summer.new(File.open("test5.png", "w"))

im.write_to_target(summer, ".png")

puts summer.digest.hexfinal
puts Digest::SHA256.new.update(File.read("test5.png").to_slice).hexfinal

Why is it outputting two different hashes when they’re created from the same data?

Does it change the output if you close the file for writing before reading it on the last line?

The reason I ask is that it may not have written the last chunk to disk before reading the file’s contents on the last line, in which case the hashes might not actually have been created from the same data. Closing the file will flush the file’s output buffer before closing — File in Crystal is an IO::Buffered, plus your filesystem likely has its own kernel-level buffers to flush.
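As a minimal sketch of that effect (hypothetical file name; assumes the default IO::Buffered buffer, which is larger than the write):

```crystal
# A write smaller than IO::Buffered's internal buffer stays in
# memory until a flush or close.
f = File.open("sketch1.txt", "w")
f << "hello"
# The bytes likely haven't reached the file yet:
puts File.read("sketch1.txt").bytesize # probably 0
f.close # flushes the buffer, then closes the descriptor
puts File.read("sketch1.txt").bytesize # 5
```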


Are you using a C binding that depends on another SSL library like BoringSSL? I’ve experienced some weird behavior from the stdlib digest functions because of that before.

Indeed, that’s the case. Adding in a strategically placed close fixes the problem.

But without the close, the last chunk is never written and the file is left incomplete. Shouldn’t some finalizer run when the program ends and flush out any buffers?

This happens with kernel-level filesystem buffers, but the Crystal IO::Buffered buffers won’t be flushed automatically. The finalize method on an IO will close it, but that’s called by the GC, which I don’t think is called on program exit. This might be a good argument for it to do that, though.
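A tiny sketch of that distinction (hypothetical class; whether the finalizer fires at all is up to the GC):

```crystal
class Resource
  def finalize
    # Invoked by the GC when (if) it collects this object;
    # nothing guarantees it runs before the process exits.
    puts "collected"
  end
end

Resource.new
GC.collect # may or may not run the finalizer before exit
```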

The way I always make sure my files are closed (or, really, any object that requires cleanup) is to use the form of File.open that takes a block. That way, I can’t screw it up. :smile:
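For instance (hypothetical file name), the block form closes, and therefore flushes, when the block exits, even if it raises:

```crystal
File.open("sketch2.txt", "w") do |f|
  f << "hello"
end # the file is flushed and closed here, exception or not

puts File.read("sketch2.txt") # "hello"
```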

Well, that’s a massive gotcha to leave lying around, I’d say. What use is a finalizer if it’s only called by the GC?

I have a feeling that it’s implemented that way as part of the object lifecycle rather than the program lifecycle since finalize is to initialize in method naming as “final” is to “initial” in English.

I’m sure it’s tempting for someone to come into the thread to say “but that behavior is documented”. I don’t think that would be conducive to conversation, though. A programming language or library doing what folks expect is far more valuable than depending on people reading documentation (with the caveat that it’s not actually feasible to do what everyone expects because not everyone has the same set of expectations of a given thing).

I think it’s reasonable to expect that finalize would finalize your objects on exit. I opened an issue on the crystal repo to discuss it.

When else would it be called? (Unless you call it explicitly, which would be pretty much the same as calling #close explicitly, which was missing in your example).


Because the program is shutting down. I know it’s documented as being called by the GC, but still jumped to the (wrong) assumption that it would also be called at program shutdown (as everything is being, metaphorically speaking, garbage collected).

Rather than being GC specific, I was under the impression that it was more of an object lifecycle method. Like PHP’s __destruct(), which is also called on exit. I don’t know enough about finalizers in other languages to tell whether PHP is an outlier in that regard.

The program may be shutting down, but shutdown doesn’t require a garbage collection cycle. You just free everything at shutdown in a non-GC way. Finalization is always about the GC, not scope or program lifetime.

I don’t think there should be any expectation that #close is called on exit unless you use an ensure, or a block that does that for you like File.open. The GC is not about scope.

That’s a theoretically pure but unhelpful stance for the problem at hand. The gist of the matter is that the hapless developer just asked for a file, was handed a buffered IO, which then caused havoc because everything is simply thrown away at program end.

While it’s technically correct, depending on your viewpoint, it’s a bit of a buried mine for new developers that haven’t yet noticed that File inherits from IO::Buffered and realized the implications of that.

A bit of googling does establish that this end-of-lifecycle behavior varies wildly between languages. But it also seems that for each “finalize isn’t guaranteed to be called” language, there’s someone trying to hack up a solution that gives them guaranteed destructors.

In Crystal one could close the gap with at_exit, but I have a feeling that’s not a solution we’d want every random shard that just wants to clean up after itself to use. Python’s weakref.finalize might serve as an inspiration, but I think a simple object method is nicer.
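A rough sketch of that at_exit workaround (hypothetical file name; this only covers normal process exit, not crashes):

```crystal
f = File.open("sketch3.txt", "w")
at_exit { f.close } # runs when the program exits normally,
                    # flushing the buffered bytes on the way out
f << "some data"
# No explicit close anywhere else; the at_exit hook handles it.
```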

This is true, but it is also not very relevant to the underlying issue, which is that the optimization of using buffered IO broke the program. It may be ok behavior though, there are certainly holes in the buffer abstraction that we don’t want to plug because the performance cost would be too high.

But if the behavioral difference from unbuffered IO is not desirable, then it should perhaps be fixed. There are ways to handle outstanding writes that don’t involve the GC, so let’s not focus too much on the implementation of a fix rather than on whether it makes sense to fix it in the first place. I for one think I would prefer to have pending data be written (but not fsynced), though I certainly have not thought through all the consequences of doing that.

How do other languages that provide buffered file IO by default handle the issue?

Crystal looks like Ruby (where the interpreter would ensure that files are properly closed) while still allowing one to go low-level enough to shoot oneself in the foot, so what’s “right” isn’t always obvious.

Coming from the interpreted world, I lean more towards having the compiler save my butt, but I’m sure someone coming from C/C++ would be more in favor of the opposite.

Looks like C does it through atexit, and a quick test seems to confirm that.

I just checked SBCL and can confirm that it does not flush buffered streams on exit. But there are also multiple ways to tell SBCL to exit when calling EXIT, and you could probably hook into one of those. None involve the GC. The normal way would be to use the language’s equivalent of an ensure or a File.open.

I can’t confirm if Java flushes on exit or not, but I can confirm that Java does NOT do a GC at exit, so there’s that.

Ensuring the GC runs at exit seems like a bad way to approach this because it could cause unintended pauses. Also, what should happen if an exception is raised during finalization?

I think exceptions in finalize are actually already known to overflow the stack.