I am working with FASTA files, which are divided into records. Each record consists of a header and a sequence, the latter being folded over several lines if needed. Here is an example of a small FASTA file:
    >chr1
    AACAATTGCAATGATCTATCCCCATCATGATGAAATTTCCCAAGATTACCCGGGCCTGTC
    GGCCAAGGCTATATACTCGTTGAATACATCAGTGTAGCGCGCGTGCGGCCCAGAACATCT
    AAGGGCATCACAGACCTGTTATTGCCTCAAACTTCCGTCGCCTAAACGGCGATAGTCCCT
    >chr2
    CTAAGAAGCTAGCTGCGGAGGGATGGCTCCGCATAGCTAGTTAGCAGGCTGAGGTCTCGT
    TCGTTAACGGAATTAACCAGACAAATCGCTCCACCAACTAAGAACGGCCATGCACCACCA
    CCCATAGAATCAAGAAAGAGCTCTCAGTCTGTCAATCCTTGCTATGTCTGGACCTGGTAA
    GTTTCCCCGTGTTGAGTCAAATTAAGCCGCAGGCTCCACGCCTG
As I'm pretty new to Crystal, I am starting with something relatively simple: a FASTA file processor that outputs statistics. The problem I am facing is that a single record can be several GB in size, and I get an overflow error when trying to read its sequence. Here is how I read the file:
    input_file = File.open(filename: file, mode: "r")
    sequence_header = input_file.gets("\n", chomp: true)
    sequence = input_file.gets(">", chomp: true).not_nil!.gsub("\n", "")
    while sequence
      # do something with it
      sequence_header = input_file.gets("\n", chomp: true)
      begin
        # here is the line that generates the overflow error
        sequence = input_file.gets(">", chomp: true).not_nil!.gsub("\n", "")
      rescue
        break
      end
    end
So I put the first header in `sequence_header` and the first sequence in `sequence`. Then, as long as I can read from the file, I do something with the sequence and read the next record. The file can be extremely large and I need to read it fast but, as I said, the individual sequences can also be very large.
Would you know a way for me to avoid this overflow error?
EDIT: I said that the sequences are folded over several lines, but this is not always the case, so a single line can itself be very large.
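One possible direction, given as a sketch rather than a definitive fix: since neither a record nor a single line is guaranteed to be small, the file could be read in fixed-size byte chunks and the statistics accumulated on the fly, so that no record is ever held in one `String`. This assumes the overflow comes from building a string larger than Crystal's `Int32`-based string size limit, and that the statistics can be computed incrementally. The names `SeqStats` and `each_fasta_stats`, the chunk size, and the choice of statistics (length and G/C count) are hypothetical and only serve as an illustration.

```crystal
# Hypothetical per-record statistics. Int64 counters are used so that
# multi-GB sequences do not overflow a 32-bit integer.
record SeqStats, name : String, length : Int64, gc : Int64

# Reads FASTA data from `io` in fixed-size chunks and yields one SeqStats
# per record. Only the header text is kept in memory, never the sequence.
def each_fasta_stats(io : IO, chunk_size : Int32 = 65_536, &)
  buffer = Bytes.new(chunk_size)
  header = IO::Memory.new
  in_header = false # true while reading the bytes of a ">..." line
  in_record = false # true once the first ">" has been seen
  length = 0_i64
  gc = 0_i64

  while (n = io.read(buffer)) > 0
    buffer[0, n].each do |byte|
      if in_header
        if byte == '\n'.ord
          in_header = false
        else
          header.write_byte(byte)
        end
      elsif byte == '>'.ord
        # A new record starts: emit the previous one, then reset the counters.
        yield SeqStats.new(header.to_s, length, gc) if in_record
        header.clear
        length = 0_i64
        gc = 0_i64
        in_header = true
        in_record = true
      elsif byte != '\n'.ord && byte != '\r'.ord
        # Sequence byte; line breaks are skipped, so folding does not matter.
        length += 1
        gc += 1 if byte == 'G'.ord || byte == 'C'.ord || byte == 'g'.ord || byte == 'c'.ord
      end
    end
  end

  # Emit the last record, which is not followed by another ">".
  yield SeqStats.new(header.to_s, length, gc) if in_record
end

# Usage with a tiny in-memory example; a File opened with File.open
# could be passed in the same way.
each_fasta_stats(IO::Memory.new(">chr1\nACGT\nGG\n>chr2\nAT\n")) do |stats|
  puts "#{stats.name}\t#{stats.length}\t#{stats.gc}"
end
```

Because the sketch treats `>` as a record start only outside header lines and ignores line breaks inside sequences, it should behave the same whether a sequence is folded or written on one very long line. It does assume that `>` never appears inside sequence data.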