Writing the same content to a gzip file in several writes gives a different .gz file size

Hello everyone,
I want to write a very large string to a gzip file. Because string length is limited, I split the string into several substrings and write each substring to the .gz file separately, but the resulting .gz file is a little bigger.

I made a demo, write2gzfile.cr, to show the difference:

require "gzip"

aa = "AB"*100
bb = "12"*100

# write everything to the gzip file in a single call
gzio = File.open("./fileaabb.gz", "w")
write2gz(gzio, aa+"\n"+bb)
gzio.close

# write the same content to the gzip file in two separate calls
gzio = File.open("./fileaa.bb.gz", "w")
write2gz(gzio, aa)
write2gz(gzio, bb)
gzio.close


def write2gz(io : IO, content = "")
  Gzip::Writer.open(io) do |gzip|
    gzip.puts(content)
  end
end

After building and running it, I get these two files:

$ ls -l file*gz
-rw-r--r-- 1 root root         31 Dec 19 21:46 fileaabb.gz
-rw-r--r-- 1 root root         52 Dec 19 21:46 fileaa.bb.gz

fileaabb.gz is 31 bytes, but fileaa.bb.gz is 52 bytes.
If the content to be written is larger, the size difference between the resulting .gz files may be larger too.

So I want to make the resulting .gz file as small as possible. How can I do that?

Thanks~
Regar
Si

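The difference comes from the GZ data format. fileaabb.gz compresses the entire data into a single flate container. fileaa.bb.gz, on the other hand, compresses each component into a separate flate container. This adds some overhead for the additional header info: every gzip container carries its own header (at least 10 bytes) and trailer (8 bytes: a CRC-32 checksum and the uncompressed size), and the compressor starts from scratch for each one, so it cannot reuse patterns from the data written earlier.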

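To demonstrate this, look at the contents of fileaa.bb.gz: it contains the byte sequence 8b1f 0008 twice. (That is how a hex dump in 16-bit little-endian words prints it; the actual bytes on disk are 1f 8b 08 00, i.e. the gzip magic number 1f 8b, the compression method 08 for deflate, and a flags byte.) This sequence marks the start of a gzip container. The first one contains the first line of ABs and the second one contains the second line of 12s.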
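
If you would rather check this from Crystal than from a hex dump, here is a minimal sketch that counts how often the signature 1f 8b 08 appears in each file. The helper name count_gzip_signatures is made up for this example. It is only a rough check, not a gzip parser: the same three bytes could in principle also appear by chance inside compressed data.

```crystal
# Hypothetical helper (not part of the standard library): counts occurrences
# of the byte sequence 1f 8b 08 (gzip magic number followed by the deflate
# method byte) in a list of bytes.
def count_gzip_signatures(bytes)
  count = 0
  (0..bytes.size - 3).each do |i|
    count += 1 if bytes[i] == 0x1f && bytes[i + 1] == 0x8b && bytes[i + 2] == 0x08
  end
  count
end

# File names are the ones from the demo above; missing files are skipped.
["./fileaabb.gz", "./fileaa.bb.gz"].each do |path|
  next unless File.exists?(path)
  puts "#{path}: #{count_gzip_signatures(File.read(path).bytes)} gzip signature(s)"
end
```

Given the explanation above, fileaabb.gz should report one signature and fileaa.bb.gz two.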

To avoid the additional overhead, you probably want to reuse the Gzip::Writer instance to output only a single flate container.

require "gzip"

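# Gzip::Writer.open expects a block, so use .new to get a writer to keep around.
# sync_close: true also closes the underlying file when the writer is closed.
gzio = Gzip::Writer.new(File.open("./fileaabb.gz", "w"), sync_close: true)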
gzio.puts "AB"*100
gzio.puts "12"*100
gzio.close
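
For the original use case (a big string that has already been cut into substrings), the same idea can be sketched in block form, so that both the writer and the file are closed automatically. The method name write_chunks_gz and the chunks array are made up for this example. It uses the same Gzip module as the rest of this thread; as far as I know, newer Crystal versions moved it to Compress::Gzip (require "compress/gzip"), so adjust the names if your compiler complains.

```crystal
require "gzip"

# Hypothetical helper: writes every chunk through ONE Gzip::Writer,
# so the output file contains a single gzip container.
def write_chunks_gz(path : String, chunks : Array(String))
  File.open(path, "w") do |file|
    Gzip::Writer.open(file) do |gzip|
      chunks.each { |chunk| gzip.puts(chunk) }
    end
  end
end

write_chunks_gz("./fileaabb.gz", ["AB" * 100, "12" * 100])
```

Because the writer stays open across all chunks, the header and trailer are written only once, no matter how many substrings you pass in.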

Thanks~