I am working with FASTA files, that are divided into records. Each record is composed of a header and a sequence, the latter being folded over several lines if needed. Here is an example of a small fasta file:
As I’m pretty new to Crystal, I am trying to develop relatively simple things and so I am trying to create a fasta file processor to output statistics. The problem I am facing is that records of the file can be several GB in size and I get an overflow error when trying to read the sequence. Here is how I read the file:
input_file = File.open(filename: file, mode: "r")
sequence_header = input_file.gets("\n", chomp: true)
sequence = input_file.gets(">", chomp: true).not_nil!.gsub("\n", "")
while sequence
# do something with it
sequence_header = input_file.gets("\n", chomp: true)
begin
# here is the line that generates the overflow error
sequence = input_file.gets(">", chomp: true).not_nil!.gsub("\n", "")
rescue
break
end
end
So I put the first header in sequence_header and the first sequence in sequence. While I can read from the file, I do something with the sequence. The file can be extremely large and I need to be able to read it fast but as I said, the sequences can also be very large.
Would you know a way for me to avoid this overflow error?
EDIT: I said that the sequences are folded over several lines, but it is not systematically the case so the line in itself can be very large.
IO.gets has several versions which take a maximum number of characters to read, which seems ideal for your situation: docs. You would need to process the line in chunks and differentiate between records by checking for \n, but those methods seem like they’re made for exactly your use case.
Thank you @RespiteSage, it would indeed be a solution for this particular problem. However, there are instances where I would need to store the full sequence (which may be larger that 2GB). Would you know any way to do this, other than subclassing String to increase its size to an i64?
@orangeSi The fasta in your example is not valid. The ‘>’ character has to be the first one on the line and can’t appear anywhere else. Moreover, reading a large fasta line by line would be very slow compared to reading in chunks (I think, I have not tested it).
Where and why do you need to store these sequences? Technically, you should always be able to slice it into reasonably sized pieces.
However, I’d also maybe consider using a different internal data representation. Strings are very inefficient for this kind of thing. It might make sense as a human readable storage format, but when you read that data into your program, you could parse it into more efficient format for further processing. That’s an additional computing overhead, though. So it depends on your use case whether that makes any sense or not.
It would help to know more about what you’re trying to do specifically. For example, do you need to store the whole sequence or just one record at a time? Are you filtering the records based on something?
However, I’ll do my best to help in the general case. You’re running into two issues:
Strings are an inefficient storage method for a limited alphabet.
Sequence types (String, Array, etc.) are generally indexed using Int32* in Crystal, which limits the available size to about 2**30 elements.
Your general problem might be solved by tackling either of these issues (or both, if you really need it or just want to be thorough).
The first problem could potentially be solved by creating a custom storage structure. This depends on what alphabet you’re using. If you only have FASTA sequences of nucleic acids with no ambiguity or gaps (i.e. an alphabet of just G, C, A, and T), you could potentially fit one letter in two bits (e.g. G = 00, C = 01, A = 10, T = 11). Therefore you could fit 4 letters in one byte, which is 4 times as efficient as ASCII. You could use Elorest’s bitfields for this purpose. If you have a larger alphabet, the benefits of this method start to drop off pretty quickly, though. And then you’d store these records in a normal Array or something.
The second problem is easy to solve, and I’d recommend you start with this approach. Since sequence types are limited in their indexing, you just combine sequence types. So, for example, you read your FASTA sequences 16 characters at a time (chunk = file.gets(16)) and store those strings in an array. Suddenly the larger structure (the array) can store 16 times as much data with the same number of indices. Your processing code becomes more complex, but that’s a necessity with the amount of data you’re handling. The larger your chunks are, the more data you can store overall.
I hope one or both of those approaches are useful to you. Something else to keep in mind, if you have a file where an individual line might be >= 2 GB, is that the second approach will only help you if you have enough RAM. You’re not going to fit the contents of a 20 GB file in 16 GB of RAM unless you compress it (which is what the first approach does).