How to iterate CSV objects multiple times

After opening a CSV file and iterating through it once, it’s not possible to go through it again: once its @traversed property is set to true, #next returns false until the file is loaded again.

I reviewed the source code of Crystal’s csv module and noticed that the #rewind method was recently removed from many iterators, which might have broken CSV object manipulation. I no longer see any way to traverse a CSV object more than once; I have to load the file again.

Am I missing something?
Any thoughts?

Here’s an example of the problem:

require "csv"

File.open(filename) do |infile|
  csv_rows = CSV.new(infile, headers: true) # the named argument is headers:

  csv_rows.each do |row|
    print row # prints every row object
  end

  csv_rows.each do |row|
    print row # never reaches here
  end
end

Hi! I don’t think CSV ever had a rewind function. We could consider adding it. But each in CSV, iterators, etc., unlike Ruby, doesn’t automatically rewind. Your best bet is to open the file again and read the CSV.

That is, my advice would be to do something like this:

require "csv"

def each_csv_row(filename)
  File.open(filename) do |infile|
    csv_rows = CSV.new(infile, headers: true)
    csv_rows.each do |row|
      yield row
    end
  end
end

each_csv_row(filename) do |row|
  # ...
end

Thank you very much for your quick response. I think it would be a nice-to-have feature, probably for CSV only (not for all iterables), since it’s common to iterate over CSV objects several times for different calculations, and loading the file again doesn’t seem like the most intuitive way to do it. Your solution works perfectly fine though. Thanks a lot, man.

I think adding rewind to CSV is doable and easy

Feel free to create a GitHub issue with a feature request. Maybe someone will implement it.

Essentially, the data is going to be read and parsed again anyway, whether there’s a CSV#rewind method or not. To avoid that you’d need to read the CSV data into a buffer and use that for consecutive iterations.
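A minimal sketch of the buffering idea, assuming the data fits in memory. The sample string and column layout are made up for illustration; CSV.parse also accepts an IO such as an opened file.

```crystal
require "csv"

# Hypothetical sample data standing in for the file contents.
rows = CSV.parse("name,score\nada,3\nlin,5\n")

# The first parsed row holds the header names; the rest is data.
data = rows.skip(1)

# The array can be traversed as many times as needed without re-parsing.
total = data.sum { |r| r[1].to_i }
max = data.max_of { |r| r[1].to_i }

puts total
puts max
```

The trade-off is that every row stays in memory at once, instead of one row at a time.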

A relatively easy implementation would be to rewind the file IO using infile.pos = 0 thus reusing the existing file descriptor. You’d still need a new CSV instance, but that should be fine.
IMO that’s a pretty neat solution and I’m not sure there should be a #rewind method for that.
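Roughly what I have in mind, as a sketch only. An IO::Memory with made-up sample data stands in for infile here so the snippet is self-contained; with a real file the same pos = 0 call would apply.

```crystal
require "csv"

# Hypothetical sample data; the IO::Memory plays the role of the opened file.
io = IO::Memory.new("name,score\nada,3\nlin,5\n")

# First pass: sum the score column.
sum = 0
csv = CSV.new(io, headers: true)
csv.each do |row|
  sum += row["score"].to_i
end

# Move the IO back to the start and build a fresh CSV instance on it.
io.pos = 0

# Second pass: count the data rows.
count = 0
csv = CSV.new(io, headers: true)
csv.each { count += 1 }

puts sum
puts count
```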

I sent a PR for this: https://github.com/crystal-lang/crystal/pull/8912

This will be available in the next release.

Just note that you need to explicitly call rewind between each each.
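Usage would look something like this, assuming your Crystal version already includes the rewind method from that PR. The sample data is made up.

```crystal
require "csv"

# Hypothetical sample data.
csv = CSV.new("name,score\nada,3\nlin,5\n", headers: true)

first_pass = [] of String
csv.each { |row| first_pass << row["name"] }

# Without this call, the second each would not yield any rows.
csv.rewind

second_pass = [] of String
csv.each { |row| second_pass << row["name"] }

puts first_pass == second_pass
```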

Should each call rewind after it goes through the whole dataset?

There was a recent discussion about that. We ended up agreeing on not doing that.

If we really want to do that, we need to bring rewind back to iterators and do that in every iterator, not just in CSV.

I personally have never needed to iterate over the same thing twice. The reason is that I try to optimize for performance, and iterating twice is slower than iterating once. I think there are no or very few scenarios where you would need to iterate over something twice.

Maybe OP can explain why the CSV has to be iterated twice.

Here’s the GH issue: https://github.com/crystal-lang/crystal/issues/8504

Also, I regret a bit removing rewind from Iterator. It was a nifty feature. Even if it didn’t work in all cases, it worked great in the cases where it did, and we could have implemented each and other methods maybe more intuitively.

I might play around with the idea of bringing back rewind, then making each, map, etc., always rewind… if even possible.

EDIT: stupid “no 3 consecutive replies” rule.


This is what I came up with to be able to implement this:

https://play.crystal-lang.org/#/r/8qg0

For each iterator we actually need two iterators, as explained in the comments there. And each time we return an iterator we need to wrap it in this Ruby iterator (for lack of a better name).

We can definitely do it, but it’s quite a bit of work, and we need to make sure anyone implementing Iterator remembers to use this…

I would have expected something.each {|x| ...} etc. to do the same thing every time it is called.
So I think I’m in favor of having an automatic rewind.

It’s weird that I never noticed this non-rewinding behaviour before. I thought it was something one would notice quite easily.

I’m implementing a Scaler for CSV columns, and I had to calculate the standard deviation.
I had to do some iterations to calculate the sum and the average, and then the standard deviation, which implied iterating once again (at least in my not-so-clever solution).
I thought I could keep an inner reference to the CSV in my Scaler class for this purpose, so I wouldn’t have to load a huge CSV again.

I thought the normal behavior was that I could just use the reference again and iterate through it as many times as needed, but to my surprise (and given my limited experience with iterators) it was not. The CSV had been traversed and had to be loaded again.
I don’t know which is cheaper at a low level: keeping a reference to 500k rows * ~50 cols alive in memory, or loading the file whenever it is needed.
I hope I explained myself.
Thanks for your time guys.
And you are doing an amazing job with Crystal. :+1:

Thank you for clarifying!

Yeah, for standard deviation I guess it makes sense to traverse the data twice. Just note that when doing that, either with rewind in the next release or by calling each twice in Ruby, I’m almost sure the data is parsed twice, which can be costly.

That said, apparently there’s a way to compute it with just one traversal: https://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods

EDIT: if not doing the “rapid calculation method” I think parsing the CSV twice should be okay, because parsing is fast and memory for only one row at a time is needed.
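For reference, a single-pass version could look roughly like this. It is a sketch using a Welford-style running update, which is one of the approaches described on that page; the column name and sample values are made up.

```crystal
require "csv"

# Hypothetical sample data; in practice this would be the opened file.
csv = CSV.new("value\n2\n4\n4\n4\n5\n5\n7\n9\n", headers: true)

count = 0
mean = 0.0
m2 = 0.0 # running sum of squared deviations from the current mean

csv.each do |row|
  x = row["value"].to_f
  count += 1
  delta = x - mean
  mean += delta / count    # update the running mean
  m2 += delta * (x - mean) # uses both the old and the new mean
end

# Population standard deviation; divide by (count - 1) for the sample version.
std_dev = count > 0 ? Math.sqrt(m2 / count) : 0.0

puts mean
puts std_dev
```

Only the three running values are kept, so memory use stays constant no matter how many rows the file has.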

Thanks a lot for your time and support. I think I can continue with this as it is. As you mention, even without the rapid method, loading it twice should be just fine.
Your code snippet worked great and I’ll try to implement the rapid calculation method.
This post was mainly to clarify whether the existing functionality was incomplete or had a hidden bug.
Again, thanks a lot. :+1: