How to read files with extremely large lines

Hi,

I am working with FASTA files, that are divided into records. Each record is composed of a header and a sequence, the latter being folded over several lines if needed. Here is an example of a small fasta file:

>chr1
AACAATTGCAATGATCTATCCCCATCATGATGAAATTTCCCAAGATTACCCGGGCCTGTC
GGCCAAGGCTATATACTCGTTGAATACATCAGTGTAGCGCGCGTGCGGCCCAGAACATCT
AAGGGCATCACAGACCTGTTATTGCCTCAAACTTCCGTCGCCTAAACGGCGATAGTCCCT
>chr2
CTAAGAAGCTAGCTGCGGAGGGATGGCTCCGCATAGCTAGTTAGCAGGCTGAGGTCTCGT
TCGTTAACGGAATTAACCAGACAAATCGCTCCACCAACTAAGAACGGCCATGCACCACCA
CCCATAGAATCAAGAAAGAGCTCTCAGTCTGTCAATCCTTGCTATGTCTGGACCTGGTAA
GTTTCCCCGTGTTGAGTCAAATTAAGCCGCAGGCTCCACGCCTG

As I’m pretty new to Crystal, I am trying to develop relatively simple things and so I am trying to create a fasta file processor to output statistics. The problem I am facing is that records of the file can be several GB in size and I get an overflow error when trying to read the sequence. Here is how I read the file:

  input_file = File.open(filename: file, mode: "r")
  sequence_header = input_file.gets("\n", chomp: true)
  sequence = input_file.gets(">", chomp: true).not_nil!.gsub("\n", "")
  while sequence
    # do something with it
    sequence_header = input_file.gets("\n", chomp: true)
    begin
      # here is the line that generates the overflow error
      sequence = input_file.gets(">", chomp: true).not_nil!.gsub("\n", "")
    rescue
      break
    end
  end

So I put the first header in sequence_header and the first sequence in sequence. While I can read from the file, I do something with the sequence. The file can be extremely large and I need to be able to read it fast but as I said, the sequences can also be very large.

Would you know a way for me to avoid this overflow error?

EDIT: I said that the sequences are folded over several lines, but it is not systematically the case so the line in itself can be very large.

IO.gets has several versions which take a maximum number of characters to read, which seems ideal for your situation: docs. You would need to process the line in chunks and differentiate between records by checking for \n, but those methods seem like they’re made for exactly your use case.

1 Like

io.gets(">") maybe not work for below fasta format:

>id1 >dfdf
ATCG
>id2
ATCG

I just read the fasta file line by line File.each_line(fasta) yet.

Thank you @RespiteSage, it would indeed be a solution for this particular problem. However, there are instances where I would need to store the full sequence (which may be larger that 2GB). Would you know any way to do this, other than subclassing String to increase its size to an i64?

@orangeSi The fasta in your example is not valid. The ‘>’ character has to be the first one on the line and can’t appear anywhere else. Moreover, reading a large fasta line by line would be very slow compared to reading in chunks (I think, I have not tested it).

Where and why do you need to store these sequences? Technically, you should always be able to slice it into reasonably sized pieces.

However, I’d also maybe consider using a different internal data representation. Strings are very inefficient for this kind of thing. It might make sense as a human readable storage format, but when you read that data into your program, you could parse it into more efficient format for further processing. That’s an additional computing overhead, though. So it depends on your use case whether that makes any sense or not.

1 Like

It would help to know more about what you’re trying to do specifically. For example, do you need to store the whole sequence or just one record at a time? Are you filtering the records based on something?

However, I’ll do my best to help in the general case. You’re running into two issues:

  1. Strings are an inefficient storage method for a limited alphabet.
  2. Sequence types (String, Array, etc.) are generally indexed using Int32* in Crystal, which limits the available size to about 2**30 elements.

Your general problem might be solved by tackling either of these issues (or both, if you really need it or just want to be thorough).

The first problem could potentially be solved by creating a custom storage structure. This depends on what alphabet you’re using. If you only have FASTA sequences of nucleic acids with no ambiguity or gaps (i.e. an alphabet of just G, C, A, and T), you could potentially fit one letter in two bits (e.g. G = 00, C = 01, A = 10, T = 11). Therefore you could fit 4 letters in one byte, which is 4 times as efficient as ASCII. You could use Elorest’s bitfields for this purpose. If you have a larger alphabet, the benefits of this method start to drop off pretty quickly, though. And then you’d store these records in a normal Array or something.

The second problem is easy to solve, and I’d recommend you start with this approach. Since sequence types are limited in their indexing, you just combine sequence types. So, for example, you read your FASTA sequences 16 characters at a time (chunk = file.gets(16)) and store those strings in an array. Suddenly the larger structure (the array) can store 16 times as much data with the same number of indices. Your processing code becomes more complex, but that’s a necessity with the amount of data you’re handling. The larger your chunks are, the more data you can store overall.

I hope one or both of those approaches are useful to you. Something else to keep in mind, if you have a file where an individual line might be >= 2 GB, is that the second approach will only help you if you have enough RAM. You’re not going to fit the contents of a 20 GB file in 16 GB of RAM unless you compress it (which is what the first approach does).

* It might be UInt32.

yes, my above fasta example format is not valid, but sometimes I see that: like NCBI some database fasta file

>dfsdf(fdfs>fdser)
ATCG

So I just want to make sure the program work well even with this special format fasta file.

1 Like

The below code maybe work with my unvalid fasta format file:

def read_fasta
  fafile = "test.fa"
  lines = "
>q1(>dfdf)
ATCG
ATCG
>q2(>dfsdf)
ATCGA
ATCGA

>q3
ATCGAT
ATCGAT

"
  File.write(fafile, lines)
  faio = File.open(fafile, "r")
  fstheader = ""
  add_flag = ""
  while header = faio.gets
    header = add_flag + header if add_flag == ">"
    if header.starts_with?(">")
      faid = header
      if faseq = faio.gets("\n>", chomp: true).not_nil!.gsub("\n", "")
        yield faid, faseq
        add_flag = ">"
      end
    end
  end
end

read_fasta() do |faid, faseq|
  puts "faid=#{faid}\nfaseq=#{faseq}"
end