Compress::Gzip::Reader cannot open a file in BGZF format. Is this a bug?

kojix2 · January 4, 2024, 12:45pm

BGZF stands for Blocked GNU Zip Format. It is a compression format commonly used in genomics-related file formats. A BGZF file is a series of compressed blocks. With a pre-built index, only the required range needs to be decompressed, which allows random access to large compressed files.
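As a rough illustration of the block structure described above, the following Crystal sketch walks through BGZF blocks by reading the block size stored in each gzip header. It assumes the header layout and the 28-byte empty end-of-file block given in the SAM/BAM specification. `skip_bgzf_block` and `BGZF_EOF` are hypothetical names introduced here; they are not part of Crystal's standard library or htslib.

```crystal
# Byte values of the empty end-of-file block that bgzip appends to a file,
# as given in the SAM/BAM specification (assumption of this sketch).
BGZF_EOF = Bytes[0x1f, 0x8b, 0x08, 0x04, 0, 0, 0, 0, 0, 0xff, 0x06, 0, 0x42, 0x43, 0x02, 0, 0x1b, 0, 0x03, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical helper: reads one BGZF block header from `io`, skips the rest
# of the block, and returns the total block size in bytes (nil at end of input).
def skip_bgzf_block(io : IO) : Int32?
  fixed = Bytes.new(12) # ID1 ID2 CM FLG MTIME(4) XFL OS XLEN(2)
  return nil unless io.read_fully?(fixed)
  raise "not a gzip member" unless fixed[0] == 0x1f && fixed[1] == 0x8b
  raise "no extra field" if (fixed[3] & 0x04) == 0

  # XLEN is a 2-byte little-endian integer.
  xlen = IO::ByteFormat::LittleEndian.decode(UInt16, fixed[10, 2]).to_i
  extra = Bytes.new(xlen)
  io.read_fully(extra)

  # Look for the "BC" subfield; its 2-byte payload is BSIZE = block size - 1.
  pos = 0
  while pos + 4 <= extra.size
    slen = IO::ByteFormat::LittleEndian.decode(UInt16, extra[pos + 2, 2]).to_i
    if extra[pos] == 'B'.ord && extra[pos + 1] == 'C'.ord && slen == 2
      block_size = IO::ByteFormat::LittleEndian.decode(UInt16, extra[pos + 4, 2]).to_i + 1
      io.skip(block_size - fixed.size - xlen) # compressed data, CRC32, ISIZE
      return block_size
    end
    pos += 4 + slen
  end
  raise "BC subfield not found"
end

# Two concatenated EOF blocks stand in for a real multi-block file.
io = IO::Memory.new
2.times { io.write(BGZF_EOF) }
io.rewind

sizes = [] of Int32
while size = skip_bgzf_block(io)
  sizes << size
end
p sizes # => [28, 28]
```

Because each header records its own block length, a reader can move from block to block without inflating anything, which is what makes index-based random access possible.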

It is generated with the command bgzip, but can be decompressed with zcat or gzip -d.

I thought BGZF was almost fully compatible with Gzip. However, I found that Crystal’s Compress::Gzip::Reader cannot read it. I wasn’t sure whether this is expected behavior or a bug, so I’m reporting it here instead of opening a GitHub issue.

Thank you.

kojix2 · January 4, 2024, 1:04pm

Steps to Reproduce

Install tabix:

```
sudo apt install tabix

dpkg -L tabix | grep bgzip
# /usr/bin/bgzip
# /usr/share/man/man1/bgzip.1.gz
```

Compress a file with bgzip:
```
bgzip -k your.txt
```
Decompress and view the file:
```
zcat your.txt.gz
```

Read the compressed file in Crystal:

```
require "compress/gzip"

string = File.open("your.txt.gz") do |file|
  Compress::Gzip::Reader.open(file) do |gzip|
    gzip.gets_to_end
  end
end
```

```
Unhandled exception: deflate: invalid stored block lengths (Compress::Deflate::Error)
```

straight-shoota · January 4, 2024, 9:54pm

I think it’s “just” a missing feature, not a bug.

Of course, missing a feature can in some way be seen as a bug because it means the implementation is incomplete. Depends on how far you want to stretch the definitions…
It would be a bug if Compress::Gzip claimed to support everything gzip does.

Anyway, contributions are welcome.

kojix2 · January 7, 2024, 2:48pm

Since the standard library needs to work reliably and be maintained over the long term, it may be more appropriate for BGZF to be supported by a third-party library.
In any case, thank you.

kojix2 · April 30, 2024, 12:53pm

I ran into this problem again today and was a bit annoyed, so I spent a few hours investigating. The BGZF format is frequently used for biological data, so I can’t afford to be blocked every time I open such a file.

Then I identified a very suspicious part.

github.com/crystal-lang/crystal/blob/8b9e299362d6028c8aa51f7093200683dc9028e0/src/compress/gzip/header.cr#L44

```
flg = Flg.new(header[3])

seconds = IO::ByteFormat::LittleEndian.decode(Int32, header.to_slice[4, 4])
@modification_time = Time.unix(seconds).to_local

xfl = header[8]
@os = header[9]

if flg.extra?
  xlen = io.read_byte.not_nil!
  @extra = Bytes.new(xlen)
  io.read_fully(@extra)
end

if flg.name?
  @name = io.gets('\0', chomp: true)
end

if flg.comment?
  @comment = io.gets('\0', chomp: true)
```

According to the gzip specification (RFC 1952), the length of the EXTRA field (XLEN) is 2 bytes. However, only one byte is read here.

The length appears to be stored in little-endian order, so the single byte that is read still looks like a plausible length. I think that further delayed the discovery of the problem.

In the official Samtools slide explaining the BGZF format, the LEN is shown as | 0 | 6 |.
However, the actual file contains | 6 | 0 |. I suspect the actual file is correct, but at least there is some confusion.
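To make the byte-order point concrete, here is a minimal Crystal sketch that reads the extra field from the first 18 bytes of an empty BGZF block, once with a 1-byte length and once with a 2-byte little-endian length. The header bytes are assumed from the SAM/BAM specification, and `read_extra` is a hypothetical helper written for this illustration, not the standard library's code.

```crystal
# First 18 bytes of an empty BGZF block: 10 fixed header bytes, XLEN (2 bytes),
# then the 6-byte "BC" subfield. Values assumed from the SAM/BAM specification.
HEADER = Bytes[0x1f, 0x8b, 0x08, 0x04, 0, 0, 0, 0, 0, 0xff, 0x06, 0, 0x42, 0x43, 0x02, 0, 0x1b, 0]

# Hypothetical helper: reads the extra field with a 1-byte or 2-byte XLEN and
# returns the extra bytes plus the number of header bytes left unread.
def read_extra(header : Bytes, two_byte_xlen : Bool)
  io = IO::Memory.new(header)
  io.skip(10) # ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS
  xlen = if two_byte_xlen
           io.read_bytes(UInt16, IO::ByteFormat::LittleEndian).to_i
         else
           io.read_byte.not_nil!.to_i
         end
  extra = Bytes.new(xlen)
  io.read_fully(extra)
  {extra.to_a, io.size - io.pos}
end

one_byte = read_extra(HEADER, two_byte_xlen: false)
two_bytes = read_extra(HEADER, two_byte_xlen: true)

p one_byte  # => {[0, 66, 67, 2, 0, 27], 1}
p two_bytes # => {[66, 67, 2, 0, 27, 0], 0}

# Reading the same two bytes as big-endian would give a very different length.
big_endian_len = IO::ByteFormat::BigEndian.decode(UInt16, HEADER[10, 2])
p big_endian_len # => 1536
```

With the 1-byte read, the extra field starts one byte too early and one header byte is left over. That stray byte would then be handed to the deflate reader, which is one plausible explanation for the "invalid stored block lengths" error.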

Perhaps I should create a pull request.

This may be related to the previous issue.

zw963 · May 1, 2024, 10:17am

I think you could create a shard to support this format. naqvis’s GitHub has many Crystal shards that can serve as examples.

I followed magic.cr and created my first C library shard.

kojix2 · May 1, 2024, 11:45am

Hi @zw963
thanks for your comment.

I checked this issue one more time today and thought for sure it must be a bug in the standard library, so I submitted a pull request.

> I think you could create a shard to support this format. naqvis’s GitHub has many Crystal shards that can serve as examples.
>
> I followed magic.cr and created my first C library shard.

Yes, an htslib binding for reading BGZF (BAM/BCF/fq.gz) files, which are often used in bioinformatics, has already been created and uploaded to GitHub. It is far from perfect, though.
