BGZF stands for Blocked GNU Zip Format. It is a compression format commonly used by genomics file formats. A BGZF file is a series of independently compressed blocks. With a pre-built index, only the blocks covering the required range need to be decompressed, which allows random access to large compressed files.
It is generated with the command bgzip, but can be decompressed with zcat or gzip -d.
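To illustrate the block idea, here is a minimal Python sketch. It is not real BGZF: it uses plain gzip members without the BGZF extra field, and the `index` list is a hypothetical stand-in for a real index file. It only shows why a stream of concatenated blocks can be read both as one file and block by block.

```python
import gzip

# Hypothetical toy data: each chunk becomes its own gzip member ("block").
chunks = [b"ACGT" * 8, b"TTGA" * 8, b"GGCC" * 8]

stream = b""
index = []  # toy index: (compressed offset, compressed length) per block
for chunk in chunks:
    block = gzip.compress(chunk)
    index.append((len(stream), len(block)))
    stream += block

# A reader that understands multi-member gzip sees one continuous file,
# which is what zcat or gzip -d do with a BGZF file.
whole = gzip.decompress(stream)

# Random access: decompress only the second block, located via the toy index.
offset, length = index[1]
second = gzip.decompress(stream[offset:offset + length])
```

Real BGZF additionally stores each block's size in the gzip header's extra field, so a reader can hop from block to block without decompressing anything.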
I thought BGZF was almost fully compatible with Gzip. However, I found that Crystal’s Compress::Gzip::Reader cannot read it. I wasn’t sure whether this was expected behavior or a bug, so I’m reporting it here instead of opening a GitHub issue.
Of course, missing a feature can in some way be seen as a bug because it means the implementation is incomplete. Depends on how far you want to stretch the definitions…
It would be a bug if Compress::Gzip claimed to support everything gzip does.
Since the standard library needs to work reliably and be maintained over the long term, it may be more appropriate for BGZF to be supported by a third-party library.
In any case, thank you.
I ran into this problem again today and was a bit annoyed, so I spent a few hours investigating. The BGZF format is frequently used for biological data, so I couldn’t afford to be interrupted every time I opened such a file.
Then I identified a very suspicious part.
According to the specification, the LEN of the EXTRA field is 2 bytes. However, only one byte is read here.
It is likely that the length is stored in little-endian order, so for small values the first byte alone already holds the correct length. I think that further delayed the discovery of the problem.
This is the official slide from Samtools explaining the BGZF file. The LEN is | 0 | 6 |.
However, in the actual file it is | 6 | 0 |. I suspect the actual file is correct, but there is at least some confusion.
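The byte order can be checked with a small Python sketch. It uses the 28-byte empty block that the SAM specification gives as the BGZF end-of-file marker. The helper name `read_xlen` and the one-byte emulation are my own illustration, not the Crystal code.

```python
import gzip
import struct

# The empty BGZF block that marks end of file (28 bytes, from the SAM spec).
EOF_BLOCK = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 "
    "1b 00 03 00 00 00 00 00 00 00 00 00"
)

def read_xlen(buf):
    # Hypothetical helper: the length of the extra field follows the
    # 10 fixed header bytes and is a 2-byte little-endian integer.
    (xlen,) = struct.unpack_from("<H", buf, 10)
    return xlen

xlen = read_xlen(EOF_BLOCK)                            # bytes 06 00 on disk
big_endian = struct.unpack_from(">H", EOF_BLOCK, 10)[0]  # same bytes read as | 0 | 6 | order

# BGZF subfield inside the extra field: SI1, SI2, SLEN, BSIZE (block size minus 1).
si1, si2, slen, bsize = struct.unpack_from("<BBHH", EOF_BLOCK, 12)

# Emulating a reader that consumes only ONE byte for the length:
# the value is still 6, but the extra field now starts one byte too early.
one_byte_len = EOF_BLOCK[10]
extra_wrong = EOF_BLOCK[11:11 + one_byte_len]
extra_right = EOF_BLOCK[12:12 + xlen]

# The block itself is a valid gzip member with an empty payload.
payload = gzip.decompress(EOF_BLOCK)
```

With a one-byte read the length still comes out as 6, but every later read is shifted by one byte, so the compressed data that follows the header is no longer where the reader expects it.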
Perhaps I should create a pull request.
This may be related to the previous issue.
Yes, an htslib binding for reading BGZF (BAM/BCF/fq.gz) files, which are often used in bioinformatics, has already been created and uploaded to GitHub. It is far from perfect, though.