How to read files with extremely large lines

bistace · November 9, 2021, 1:06pm

Hi,

I am working with FASTA files, that are divided into records. Each record is composed of a header and a sequence, the latter being folded over several lines if needed. Here is an example of a small fasta file:

>chr1
AACAATTGCAATGATCTATCCCCATCATGATGAAATTTCCCAAGATTACCCGGGCCTGTC
GGCCAAGGCTATATACTCGTTGAATACATCAGTGTAGCGCGCGTGCGGCCCAGAACATCT
AAGGGCATCACAGACCTGTTATTGCCTCAAACTTCCGTCGCCTAAACGGCGATAGTCCCT
>chr2
CTAAGAAGCTAGCTGCGGAGGGATGGCTCCGCATAGCTAGTTAGCAGGCTGAGGTCTCGT
TCGTTAACGGAATTAACCAGACAAATCGCTCCACCAACTAAGAACGGCCATGCACCACCA
CCCATAGAATCAAGAAAGAGCTCTCAGTCTGTCAATCCTTGCTATGTCTGGACCTGGTAA
GTTTCCCCGTGTTGAGTCAAATTAAGCCGCAGGCTCCACGCCTG

As I’m pretty new to Crystal, I am trying to develop relatively simple things and so I am trying to create a fasta file processor to output statistics. The problem I am facing is that records of the file can be several GB in size and I get an overflow error when trying to read the sequence. Here is how I read the file:

  input_file = File.open(filename: file, mode: "r")
  sequence_header = input_file.gets("\n", chomp: true)
  sequence = input_file.gets(">", chomp: true).not_nil!.gsub("\n", "")
  while sequence
    # do something with it
    sequence_header = input_file.gets("\n", chomp: true)
    begin
      # here is the line that generates the overflow error
      sequence = input_file.gets(">", chomp: true).not_nil!.gsub("\n", "")
    rescue
      break
    end
  end

So I put the first header in sequence_header and the first sequence in sequence. While I can read from the file, I do something with the sequence. The file can be extremely large and I need to be able to read it fast but as I said, the sequences can also be very large.

Would you know a way for me to avoid this overflow error?

EDIT: I said that the sequences are folded over several lines, but it is not systematically the case so the line in itself can be very large.

RespiteSage · November 9, 2021, 4:35pm

IO.gets has several versions which take a maximum number of characters to read, which seems ideal for your situation: docs. You would need to process the line in chunks and differentiate between records by checking for \n, but those methods seem like they’re made for exactly your use case.

orangeSi · November 10, 2021, 5:20am

io.gets(">") maybe not work for below fasta format:

>id1 >dfdf
ATCG
>id2
ATCG

I just read the fasta file line by line File.each_line(fasta) yet.

bistace · November 10, 2021, 6:57am

Thank you @RespiteSage, it would indeed be a solution for this particular problem. However, there are instances where I would need to store the full sequence (which may be larger that 2GB). Would you know any way to do this, other than subclassing String to increase its size to an i64?

@orangeSi The fasta in your example is not valid. The ‘>’ character has to be the first one on the line and can’t appear anywhere else. Moreover, reading a large fasta line by line would be very slow compared to reading in chunks (I think, I have not tested it).

straight-shoota · November 10, 2021, 11:16am

Where and why do you need to store these sequences? Technically, you should always be able to slice it into reasonably sized pieces.

However, I’d also maybe consider using a different internal data representation. Strings are very inefficient for this kind of thing. It might make sense as a human readable storage format, but when you read that data into your program, you could parse it into more efficient format for further processing. That’s an additional computing overhead, though. So it depends on your use case whether that makes any sense or not.

RespiteSage · November 10, 2021, 4:25pm

It would help to know more about what you’re trying to do specifically. For example, do you need to store the whole sequence or just one record at a time? Are you filtering the records based on something?

However, I’ll do my best to help in the general case. You’re running into two issues:

Strings are an inefficient storage method for a limited alphabet.
Sequence types (String, Array, etc.) are generally indexed using Int32* in Crystal, which limits the available size to about 2**30 elements.

Your general problem might be solved by tackling either of these issues (or both, if you really need it or just want to be thorough).

The first problem could potentially be solved by creating a custom storage structure. This depends on what alphabet you’re using. If you only have FASTA sequences of nucleic acids with no ambiguity or gaps (i.e. an alphabet of just G, C, A, and T), you could potentially fit one letter in two bits (e.g. G = 00, C = 01, A = 10, T = 11). Therefore you could fit 4 letters in one byte, which is 4 times as efficient as ASCII. You could use Elorest’s bitfields for this purpose. If you have a larger alphabet, the benefits of this method start to drop off pretty quickly, though. And then you’d store these records in a normal Array or something.

The second problem is easy to solve, and I’d recommend you start with this approach. Since sequence types are limited in their indexing, you just combine sequence types. So, for example, you read your FASTA sequences 16 characters at a time (chunk = file.gets(16)) and store those strings in an array. Suddenly the larger structure (the array) can store 16 times as much data with the same number of indices. Your processing code becomes more complex, but that’s a necessity with the amount of data you’re handling. The larger your chunks are, the more data you can store overall.

I hope one or both of those approaches are useful to you. Something else to keep in mind, if you have a file where an individual line might be >= 2 GB, is that the second approach will only help you if you have enough RAM. You’re not going to fit the contents of a 20 GB file in 16 GB of RAM unless you compress it (which is what the first approach does).

* It might be UInt32.

orangeSi · November 11, 2021, 1:29am

yes, my above fasta example format is not valid, but sometimes I see that: like NCBI some database fasta file

>dfsdf(fdfs>fdser)
ATCG

So I just want to make sure the program work well even with this special format fasta file.

orangeSi · November 11, 2021, 10:34am

The below code maybe work with my unvalid fasta format file:

def read_fasta
  fafile = "test.fa"
  lines = "
>q1(>dfdf)
ATCG
ATCG
>q2(>dfsdf)
ATCGA
ATCGA

>q3
ATCGAT
ATCGAT

"
  File.write(fafile, lines)
  faio = File.open(fafile, "r")
  fstheader = ""
  add_flag = ""
  while header = faio.gets
    header = add_flag + header if add_flag == ">"
    if header.starts_with?(">")
      faid = header
      if faseq = faio.gets("\n>", chomp: true).not_nil!.gsub("\n", "")
        yield faid, faseq
        add_flag = ">"
      end
    end
  end
end

read_fasta() do |faid, faseq|
  puts "faid=#{faid}\nfaseq=#{faseq}"
end

Topic		Replies	Views
Issue reading really large file with read_file Help & Support	6	454	February 22, 2021
Reading a file from the end to the beginning line by line Help & Support	7	443	December 1, 2020
Is there a way to Digest large files? Help & Support	9	620	September 2, 2020
File.read unexpected error Help & Support	3	135	May 20, 2024
File offsets above Int32 Help & Support	5	285	October 2, 2019

How to read files with extremely large lines

Related topics