How to iterate CSV objects multiple times

FriscoGPS · March 17, 2020, 4:59am

After opening a CSV file and iterating through it once, I can’t go through it again: once its @traversed property is set to true, #next returns false until the file is loaded again.

I reviewed the source code of Crystal’s CSV module and noticed that the #rewind method was recently removed from many iterators, which might have broken CSV object manipulation. I no longer see any way to traverse a CSV object more than once; I have to load the file again.

Am I missing something?
Any thoughts?

Here’s an example of the problem:

require "csv"

File.open(filename) do |infile|
  csv_rows = CSV.new(infile, headers: true)

  csv_rows.each do |row|
    print row # prints every row object
  end

  csv_rows.each do |row|
    print row # never reaches here
  end
end

asterite · March 17, 2020, 11:19am

Hi! I don’t think CSV ever had a rewind function. We could consider adding it. But each in CSV, iterators, etc., unlike Ruby, doesn’t automatically rewind. Your best bet is to open the file again and read the CSV.

asterite · March 17, 2020, 1:14pm

That is, my advice would be to do something like this:

require "csv"

def each_csv_row(filename)
  File.open(filename) do |infile|
    csv_rows = CSV.new(infile, headers: true)
    csv_rows.each do |row|
      yield row
    end
  end
end

each_csv_row(filename) do |row|
  # ...
end

FriscoGPS · March 17, 2020, 3:03pm

Thank you very much for your quick response. I think it would be a nice-to-have feature, probably for CSV only (not for all iterables), since it’s common to iterate over CSV objects several times for different calculations, and loading the file again doesn’t seem like the most intuitive way to do it. Your solution works perfectly fine though. Thanks a lot, man.

asterite · March 17, 2020, 3:24pm

I think adding rewind to CSV is doable and easy

asterite · March 17, 2020, 3:34pm

Feel free to create a GitHub issue with a feature request. Maybe someone will implement it.

straight-shoota · March 17, 2020, 11:38pm

Essentially, that’s going to happen anyway, whether there’s a CSV#rewind method or not. To avoid that you’d need to read the CSV data into a buffer and use that for consecutive iterations.

A relatively easy implementation would be to rewind the file IO using infile.pos = 0 thus reusing the existing file descriptor. You’d still need a new CSV instance, but that should be fine.
IMO that’s a pretty neat solution and I’m not sure there should be a #rewind method for that.
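
A minimal sketch of the idea above, assuming a small sample file written to a temporary path (the file contents, column names, and variable names are hypothetical, chosen only for illustration). It rewinds the open file with pos = 0 and wraps it in a fresh CSV instance for the second pass:

```crystal
require "csv"

# Hypothetical sample data, written to a temp file so the sketch is self-contained.
path = File.tempname("sample", ".csv")
File.write(path, "name,score\nada,90\nbob,70\n")

names = [] of String
total = 0

File.open(path) do |infile|
  # First pass: collect one column.
  csv = CSV.new(infile, headers: true)
  csv.each { |row| names << row["name"] }

  # Rewind the underlying file instead of reopening it,
  # then create a new CSV instance on the same file descriptor.
  infile.pos = 0
  csv = CSV.new(infile, headers: true)

  # Second pass: sum another column.
  csv.each { |row| total += row["score"].to_i }
end

File.delete(path)

puts names
puts total
```

The data is still parsed twice here; only the reopening of the file is avoided.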

asterite · March 18, 2020, 2:11pm

I sent a PR for this: https://github.com/crystal-lang/crystal/pull/8912

asterite · March 18, 2020, 7:06pm

This will be available in the next release.

Just note that you need to explicitly call rewind between each each.
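
As a rough sketch of what that looks like, assuming a Crystal version that includes the PR above (the in-memory data and variable names are made up for illustration; a file opened with File.open should behave the same way):

```crystal
require "csv"

# Hypothetical in-memory data standing in for a CSV file.
io = IO::Memory.new("name,score\nada,90\nbob,70\n")
csv = CSV.new(io, headers: true)

first_pass = [] of String
csv.each { |row| first_pass << row["name"] }

# Without this explicit call, the second `each` would not yield any rows.
csv.rewind

second_pass = [] of String
csv.each { |row| second_pass << row["name"] }

puts first_pass
puts second_pass
```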

wontruefree · March 18, 2020, 8:20pm

Should each call rewind after it goes through the whole dataset?

asterite · March 18, 2020, 8:37pm

There was a recent discussion about that. We ended up agreeing on not doing that.

If we really want to do that, we need to bring rewind back to iterators and do that in every iterator, not just in CSV.

I personally never needed to iterate the same thing twice. The reason is that I try to optimize for performance, and iterating twice is slower than doing it once. I think there are no, or very few, scenarios where you would need to iterate something twice.

Maybe OP can explain why the CSV has to be iterated twice.

asterite · March 18, 2020, 8:41pm

Here’s the GH issue: https://github.com/crystal-lang/crystal/issues/8504

Also, I regret a bit removing rewind from Iterator. It was a nifty feature. Even if it didn’t work in all cases, it worked great in the cases where it did, and we could maybe have implemented each and other methods more intuitively.

asterite · March 18, 2020, 8:45pm

I might play around with the idea of bringing back rewind, then making each, map, etc., always rewind… if even possible.

EDIT: stupid “no 3 consecutive replies” rule.

This is what I came up with to be able to implement this:

https://play.crystal-lang.org/#/r/8qg0

For each iterator we actually need two iterators, as explained in the comments there. And each time we return an iterator we need to wrap it in this Ruby iterator (for a lack of a better name).

We can definitely do it, but it’s quite a bit of work, and we need to make sure someone implementing Iterator remembers to use this…

mavu · March 18, 2020, 10:49pm

I would have expected something.each {|x| ...} etc. to do the same thing every time it is called.
So I think I’m in favor of having an automatic rewind.

It’s weird that I never noticed this non-rewinding behaviour before. I thought it was something one would notice quite easily.

FriscoGPS · March 19, 2020, 6:05am

I’m implementing a Scaler for CSV columns, and I had to calculate the standard deviation.
I had to do some iterations to calculate the sum and the average, and then the standard deviation, which implied iterating once again (at least in my not-so-clever solution).
I thought I could keep an inner reference to the CSV in my Scaler class for this purpose, so I wouldn’t have to load a huge CSV again.

I thought the normal behavior was to just use the reference again and iterate through it as many times as needed, but to my surprise (and without much experience with iterators) it was not. The CSV had been traversed and had to be loaded again.
I don’t know which is cheaper at a low level: keeping a reference to 500k rows * ~50 cols alive in memory, or loading the file when needed.
I hope I explained myself.
Thanks for your time guys.
And you are doing an amazing job with Crystal.

asterite · March 19, 2020, 12:51pm

Thank you for clarifying!

Yeah, for standard deviation I guess it makes sense to traverse the data twice. Just note that when doing that, either with rewind in the next release or by calling each twice in Ruby, I’m almost sure the data is parsed twice, which can be costly.

That said, apparently there’s a way to compute it with just one traversal: https://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods

EDIT: if not doing the “rapid calculation method” I think parsing the CSV twice should be okay, because parsing is fast and memory for only one row at a time is needed.
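
For illustration, here is a sketch of one single-pass approach, a running update of the mean and the sum of squared deviations (commonly known as Welford’s method; picking this particular method is my assumption, and the data and the "value" column name are hypothetical):

```crystal
require "csv"

# Hypothetical single-column data, for illustration only.
data = "value\n2\n4\n4\n4\n5\n5\n7\n9\n"

count = 0
mean = 0.0
m2 = 0.0 # running sum of squared deviations from the current mean

CSV.new(data, headers: true).each do |row|
  x = row["value"].to_f
  count += 1
  delta = x - mean
  mean += delta / count
  m2 += delta * (x - mean)
end

# Population standard deviation; divide by (count - 1) instead for the sample version.
std_dev = count > 0 ? Math.sqrt(m2 / count) : 0.0

puts mean
puts std_dev
```

Only one row and three running values are held in memory at a time, so this should also work for large files.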

FriscoGPS · March 19, 2020, 2:34pm

Thanks a lot for your time and support. I think I can continue with this as it is. As you mention, even without the rapid method, loading it twice should be just fine.
Your code snippet worked great and I’ll try to implement the rapid calculation method.
This post was mainly to clarify functionality that I thought was incomplete or could have a hidden bug.
Again, thanks a lot.
