Compress::Gzip::Reader cannot open a file in BGZF format. Is this a bug?

BGZF stands for Blocked GNU Zip Format. It is a compression format commonly used for genomics file formats. A BGZF file is a series of independently compressed blocks. With a pre-built index, only the blocks covering the required range need to be decompressed, which allows random access to large compressed files.

It is generated with the command bgzip, but can be decompressed with zcat or gzip -d.

I thought BGZF was almost fully compatible with Gzip. However, I found that Crystal’s Compress::Gzip::Reader cannot read it. I wasn’t sure whether this is expected behavior or a bug, so I’m reporting it here instead of opening a GitHub issue.

Thank you.

Steps to Reproduce

  1. Install tabix:

    sudo apt install tabix
    
    dpkg -L tabix | grep bgzip
    # /usr/bin/bgzip
    # /usr/share/man/man1/bgzip.1.gz
    
  2. Compress a file with bgzip:

    bgzip -k your.txt
    
  3. Decompress and view the file:

    zcat your.txt.gz
    
  4. Read the compressed file in Crystal:

    require "compress/gzip"
    
    string = File.open("your.txt.gz") do |file|
      Compress::Gzip::Reader.open(file) do |gzip|
        gzip.gets_to_end
      end
    end
    
Unhandled exception: deflate: invalid stored block lengths (Compress::Deflate::Error)
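
As noted above, zcat and gzip -d can decompress BGZF files, so one possible stopgap is to delegate decompression to the external tool. This is only a minimal sketch: it assumes a `gzip` binary is available on the PATH, and `read_via_gzip` is a hypothetical helper name, not part of the standard library.

```crystal
# Hypothetical helper: decompress a gzip-compatible file (including BGZF)
# by running the external `gzip -dc` command and capturing its stdout.
# Assumes a `gzip` binary is installed and on the PATH.
def read_via_gzip(path : String) : String
  output = Process.run("gzip", ["-dc", path], error: Process::Redirect::Inherit) do |process|
    # Read everything gzip writes to its standard output.
    process.output.gets_to_end
  end
  # The block form of Process.run sets `$?` to the exit status.
  raise "gzip failed for #{path}" unless $?.success?
  output
end
```

With the file from step 2, `string = read_via_gzip("your.txt.gz")` would take the place of the Compress::Gzip::Reader block. Like gets_to_end, it loads the whole file into memory.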

I think it’s “just” a missing feature, not a bug.

Of course, missing a feature can in some way be seen as a bug because it means the implementation is incomplete. Depends on how far you want to stretch the definitions…
It would be a bug if Compress::Gzip claimed to support everything gzip does.

Anyway, contributions are welcome :laughing:

Since the standard library needs to work reliably and be maintained over the long term, it may be more appropriate for BGZF to be supported by a third-party library.
In any case, thank you.

I ran into this problem again today and was a bit annoyed, so I spent a few hours investigating. The BGZF format is frequently used for biological data, so I can’t afford to be interrupted every time I open such a file.

Then I identified a very suspicious part.

According to the specification, the LEN of the EXTRA field is 2 bytes. However, only one byte is read here.

The length is presumably stored in little-endian order. I think that made the problem harder to notice.

This is the official slide from Samtools explaining the BGZF file. The LEN is | 0 | 6 |.
However, the actual file is | 6 | 0 |. I suspect the actual file is correct, but at least there is some confusion.
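
To make the byte order concrete, here is a minimal sketch that reads the extra-field lengths as 2-byte little-endian values. The input is the 28-byte empty block that the BGZF specification defines as the end-of-file marker. The variable names are my own, and this is not the standard library's code.

```crystal
# The 28-byte BGZF end-of-file marker (an empty BGZF block).
# Layout: 10-byte fixed gzip header, XLEN (2 bytes), the "BC" subfield
# (SI1, SI2, 2-byte SLEN, 2-byte BSIZE), empty deflate data, CRC32, ISIZE.
data = Bytes[
  0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff,
  0x06, 0x00,
  0x42, 0x43, 0x02, 0x00, 0x1b, 0x00,
  0x03, 0x00,
  0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00,
]

io = IO::Memory.new(data)
io.skip(10) # ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS

# Every length is two bytes, least significant byte first.
xlen = io.read_bytes(UInt16, IO::ByteFormat::LittleEndian)
si1 = io.read_bytes(UInt8)
si2 = io.read_bytes(UInt8)
slen = io.read_bytes(UInt16, IO::ByteFormat::LittleEndian)
bsize = io.read_bytes(UInt16, IO::ByteFormat::LittleEndian)

puts "XLEN=#{xlen} subfield=#{si1.chr}#{si2.chr} SLEN=#{slen} BSIZE=#{bsize}"
# prints "XLEN=6 subfield=BC SLEN=2 BSIZE=27"
```

The bytes | 6 | 0 | decode to 6 only when both are read in little-endian order. BSIZE is the total block size minus one, which matches the 28 bytes above.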

Perhaps I should create a pull request.

This may be related to the previous issue.

I think you could create a shard to support this format. naqvis's GitHub has many Crystal shards that can serve as examples.

I followed magic.cr and created my first C library shard.

Hi @zw963
thanks for your comment.

I checked this issue one more time today and thought for sure it must be a bug in the standard library, so I submitted a pull request.

Yes, an htslib binding for reading BGZF (BAM/BCF/fq.gz) files, which are often used in bioinformatics, has already been created and uploaded to GitHub. It is far from perfect, though.