Digest weirdness

Xen · November 17, 2023, 9:58pm

What’s going on here?

require "digest/sha256"
require "vips" # assumed: the crystal-vips shard, which provides Vips::Image

class Summer < IO
  @count = 0
  getter digest : Digest::SHA256

  def initialize(@io : IO)
    @digest = Digest::SHA256.new
  end

  def read(slice : Bytes) : Nil
    raise IO::Error.new("can't read from this")
  end

  def write(slice : Bytes) : Nil
    @digest.update(slice)
    @io.write(slice)
  end
end

im = Vips::Image.new_from_file("test.png")

summer = Summer.new(File.open("test5.png", "w"))

im.write_to_target(summer, ".png")

puts summer.digest.hexfinal
puts Digest::SHA256.new.update(File.read("test5.png").to_slice).hexfinal

Why is it outputting two different hashes when they’re created from the same data?

jgaskins · November 18, 2023, 12:00am

Does it change the output if you close the file for writing before reading it on the last line?

jgaskins · November 18, 2023, 12:12am

The reason I ask is that it may not have written the last chunk to disk before reading the file’s contents on the last line, in which case the hashes might not actually have been created from the same data. Closing the file will flush the file’s output buffer before closing — File in Crystal is an IO::Buffered, plus your filesystem likely has its own kernel-level buffers to flush.
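A minimal sketch of the buffering effect described above, assuming a small write that fits inside the default IO::Buffered buffer. The temporary path and variable names are made up for illustration.

```crystal
# Hypothetical demo: a small write stays in the IO::Buffered buffer until close.
path = File.tempname("buffered", ".bin")
file = File.open(path, "w")
file.write("hello".to_slice)

# The file exists, but the five bytes have not been handed to the OS yet.
size_before_close = File.size(path)

file.close # flushes the pending buffer, then closes the descriptor
size_after_close = File.size(path)

puts "before close: #{size_before_close} bytes"
puts "after close:  #{size_after_close} bytes"

File.delete(path)
```

Anything that reads the file between the write and the close sees only what has already been flushed, which is why a hash of the file on disk can differ from a hash of the bytes that were written.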

syeopite · November 18, 2023, 10:32am

Are you using a C binding that depends on another SSL library like BoringSSL? I’ve experienced some weird behaviors from the stdlib digest functions because of that before.

Xen · November 18, 2023, 11:10am

Indeed, that’s the case. Adding in a strategically placed close fixes the problem.

Without the close, though, the last chunk is never written and the file is left broken. Shouldn’t some finalizer run when the program ends and flush out any buffers?
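One way the "strategically placed close" could look, sketched without Vips so it runs on its own. ClosingSummer and its close method are hypothetical additions for illustration, not the code that was actually used; the plain write call stands in for im.write_to_target.

```crystal
require "digest/sha256"

# Hypothetical variant of Summer that forwards close to the wrapped IO.
class ClosingSummer < IO
  getter digest : Digest::SHA256

  def initialize(@io : IO)
    @digest = Digest::SHA256.new
  end

  def read(slice : Bytes)
    raise IO::Error.new("can't read from this")
  end

  def write(slice : Bytes) : Nil
    @digest.update(slice)
    @io.write(slice)
  end

  # Closing the wrapper closes the file, which flushes its buffer first.
  def close : Nil
    @io.close
  end
end

path = File.tempname("summer", ".bin")
summer = ClosingSummer.new(File.open(path, "w"))
summer.write("stand-in for image bytes".to_slice)
summer.close # without this, the read below could see a truncated file

streamed = summer.digest.hexfinal
on_disk = Digest::SHA256.new.update(File.read(path).to_slice).hexfinal
puts streamed == on_disk

File.delete(path)
```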

jgaskins · November 18, 2023, 4:07pm

This happens with kernel-level filesystem buffers, but the Crystal IO::Buffered buffers won’t be flushed automatically. The finalize method on an IO will close it, but that’s called by the GC, which I don’t think is called on program exit. This might be a good argument for it to do that, though.

The way I always make sure my files are closed (or, really, any object that requires cleanup) is to use the form of File.open that takes a block. That way, I can’t screw it up.
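A short sketch of the block form mentioned above. The temporary path is an assumption for the example; the point is that the file is closed, and therefore flushed, when the block ends, even if the block raises.

```crystal
path = File.tempname("block", ".txt")

# File.open with a block closes the file when the block returns or raises.
File.open(path, "w") do |file|
  file.print "last chunk"
end

# By this point the buffer has been flushed, so the read sees everything.
contents = File.read(path)
puts contents

File.delete(path)
```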

Xen · November 18, 2023, 9:10pm

Well, that’s a massive gotcha to leave lying around, I’d say. What use is a finalizer if it’s only called by the GC?

jgaskins · November 18, 2023, 10:34pm

I have a feeling that it’s implemented that way as part of the object lifecycle rather than the program lifecycle since finalize is to initialize in method naming as “final” is to “initial” in English.

I’m sure it’s tempting for someone to come into the thread to say “but that behavior is documented”. I don’t think that would be conducive to conversation, though. A programming language or library doing what folks expect is far more valuable than depending on people reading documentation (with the caveat that it’s not actually feasible to do what everyone expects because not everyone has the same set of expectations of a given thing).

I think it’s reasonable to expect that finalize would finalize your objects on exit. I opened an issue on the crystal repo to discuss it.

straight-shoota · November 18, 2023, 10:38pm

When else would it be called? (Unless you call it explicitly, which would be pretty much the same as calling #close explicitly, which was missing in your example).

Xen · November 18, 2023, 11:01pm

Because the program is shutting down. I know it’s documented as being called by the GC, but still jumped to the (wrong) assumption that it would also be called at program shutdown (as everything is being, metaphorically speaking, garbage collected).

Rather than being GC specific, I was under the impression that it was more of an object lifecycle method. Like PHP’s __destruct(), which is also called on exit. I don’t know enough about finalizers in other languages to tell if PHP is an abnormality in that regard.

MistressRemilia · November 18, 2023, 11:11pm

The program may be shutting down, but shutdown doesn’t require a garbage collection cycle. You just free everything at shutdown in a non-GC way. Finalization is always about the GC, not scope or program lifetime.

I don’t think there should be any expectation that #close is called on exit unless you use an ensure, or a block that does that for you like File.open. The GC is not about scope.

Xen · November 19, 2023, 1:40pm

That’s a very theoretically pure, but unhelpful stance for the problem at hand. The gist of the matter is that the hapless developer just asked for a file, was handed a buffered IO which then caused havoc because everything is just thrown out at program end.

While it’s technically correct, depending on your viewpoint, it’s a bit of a buried mine for new developers that haven’t yet noticed that File inherits from IO::Buffered and realized the implications of that.

But a bit of googling does establish the fact that the object end of lifecycle aspect varies wildly between languages. But it does seem that for each “finalize isn’t guaranteed to be called” language, there’s someone trying to hack up a solution that gives them guaranteed destructors.

In Crystal one could close the gap with at_exit, but I’ve got a feeling that’s not a solution we’d want any random shard that just wants to clean up after itself to use. Python’s weakref.finalize might serve as an inspiration, but I think a simple object method is nicer.
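For illustration only, a sketch of what closing the gap with at_exit might look like in application code. The path and variable names are assumptions, and this is the per-program workaround, not a proposal for how the standard library should behave.

```crystal
path = File.tempname("at_exit", ".txt")
file = File.open(path, "w")

# Hypothetical safety net: close (and so flush) the file when the program exits.
at_exit do
  file.close unless file.closed?
  puts File.read(path) # the pending data is on disk by now
  File.delete(path)
end

file.print "pending data"
# The program ends here without an explicit close.
```

Handlers registered with at_exit run in reverse order of registration, so a handler like this has to be registered before any other handler that expects the file to still be open.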

yxhuvud · November 19, 2023, 7:35pm

This is true, but it is also not very relevant to the underlying issue, which is that the optimization of using buffered IO broke the program. It may be ok behavior though, there are certainly holes in the buffer abstraction that we don’t want to plug because the performance cost would be too high.

But if the behavioral difference to unbuffered IO is not desirable, then it should perhaps be fixed. There are ways to handle outstanding writes that don’t involve the GC, so please let us not focus too much on the implementation of a fix rather than on whether it makes sense to fix it in the first place. I for one think I would prefer to have pending data be written (but not fsynced), but I certainly have not thought through all consequences of doing that.
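To make the buffered versus unbuffered difference concrete, here is a small sketch using the sync, flush and fsync methods that File gets from IO::Buffered and File itself. The temporary path and the variable names are assumptions for the example.

```crystal
path = File.tempname("sync", ".bin")
file = File.open(path, "w")

# With sync enabled, each write bypasses the buffer and goes straight to the OS.
file.sync = true
file.write("hello".to_slice)
size_unbuffered = File.size(path)

# With sync disabled (the default), small writes wait in the buffer.
file.sync = false
file.write("world".to_slice)
size_buffered = File.size(path)

# flush hands the pending bytes to the OS ("written").
file.flush
size_flushed = File.size(path)

# fsync additionally asks the OS to push the data to the storage device.
file.fsync
file.close

puts [size_unbuffered, size_buffered, size_flushed]
File.delete(path)
```

"Written but not fsynced" corresponds to the flush step here: the data becomes visible to other readers of the file, without paying for a sync to the device.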

How do other languages that provide buffered file IO by default handle the issue?

Xen · November 19, 2023, 7:57pm

Crystal looks like Ruby (where the interpreter would ensure that files are properly closed) while still allowing one to go low-level enough to shoot oneself in the foot, so what’s “right” isn’t always obvious.

Coming from the interpreted world, I lean more towards having the compiler save my butt, but I’m sure someone coming from C/C++ would be more in favor of the opposite.

MistressRemilia · November 19, 2023, 7:57pm

Looks like C does it through atexit, and a quick test seems to confirm that.

I just checked SBCL and can confirm that it does not flush buffered streams on exit. But there are also multiple ways to tell SBCL to exit when calling EXIT, and you could probably hook into one of those. None involve the GC. The normal way would be to use the language’s equivalent to an ensure or a File.open.

I can’t confirm if Java flushes on exit or not, but I can confirm that Java does NOT do a GC at exit, so there’s that.

Ensuring the GC runs at exit seems like a bad way to approach this because it could cause unintended pauses. Also, what should happen if an exception is raised during finalization?

jgaskins · November 24, 2023, 4:46am

I think exceptions in finalize are actually already known to overflow the stack.
