Data loss when writing long lines with File.print after File.set_encoding

I ran into an interesting problem with my Crystal text editor: writing long lines to a file can cause characters to be dropped in some situations. Specifically, if I call set_encoding with invalid: :skip, then printing a very long line causes bytes to be dropped at 1K offsets. That 1K pattern makes me think there's a buffer-boundary problem somewhere in File.print's encoding path.

I’m using Crystal 1.18.2 on Fedora Linux 43 (x86_64).

Here is a test program that demonstrates the problem: when I run it, it produces a file of 2999 bytes instead of the expected 3001. If I delete the set_encoding line, the problem goes away.

longline = "0123456789" * 300
File.open("junk", "w") do |f|
  f.set_encoding("UTF-8", invalid: :skip) # This causes the data loss.
  f.print(longline)
  f.print("\n")
end
size = File.info("junk").size
puts "file junk has #{size} bytes, should be 3001"

It’s not just files; it happens with IO::Memory, too. In this example, all of the | characters are filtered out of the buffer:

longline = ("." * 1024 + "|") * 4
buffer = IO::Memory.new
buffer.set_encoding "UTF-8", invalid: :skip
buffer << longline

puts buffer

puts "buffer has #{buffer.size} bytes, should be 4100"
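Until this is fixed, one possible workaround (a sketch, assuming the data is already held in a Crystal String rather than raw bytes) is to drop invalid byte sequences yourself with String#scrub before writing, so set_encoding isn't needed at all. Passing an empty replacement string to scrub should discard the bad sequences, which matches what invalid: :skip is meant to do; treat that replacement argument as an assumption about the scrub signature.

```crystal
longline = ("." * 1024 + "|") * 4
buffer = IO::Memory.new
# No set_encoding here: String#scrub("") removes invalid UTF-8
# sequences up front, the same effect invalid: :skip is supposed
# to have. For this input, which is already valid UTF-8, scrub
# is a no-op and all 4100 bytes survive.
buffer << longline.scrub("")
puts "buffer has #{buffer.size} bytes, should be 4100"
```

The same approach works for the File example: scrub the string once before f.print and leave the file's encoding alone.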

This seems to be a bug. Can you file an issue on the GitHub repo?


Thank you; that is a better test program. I filed a bug and included your program.


Added a PR for it. It may not be the right solution, but at the very least it points out where the problem is.
