Parsing a CSV and using Structs

Hi,

Newbie here :wave: I have this bit of code for parsing a TSV file line-by-line, checking if a certain condition is met, and then counting the number of lines that meet the condition.

def parse_bim(bim : String, chrom : Int32)    
    bimFile = File.new(bim, "r")
    cont = 0
    
    bimFile.each_line() do |row|
        record = row.split()
        # ["22", "rs123456", "0", "16055490", "C", "T"]

        if record[0].to_i32 == chrom
            cont +=1
        end
    end

    bimFile.close()

    puts "# of variants on chrom #{chrom}: #{cont}"
end

What I’d like to know is whether it is possible to use a Struct to parse each line instead of using string.split() and string.to_i32. The structure of the file columns won’t change.

Thanks!

It would probably be easier to use the standard library's CSV (see the CSV docs for Crystal 1.4.1) with \t as the separator. That should make reading each line much easier, though you’ll still need to_i to convert the string into an Int32.

EDIT: Yes, you could use a struct that accepts a CSV::Row, decomposes the row into separate instance variables, and provides getters for them. But there isn’t an automated way to do this, so if you’re only using it in the one method from your example, it’s probably not worth it.
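For illustration, here is a minimal sketch of what such a struct could look like. It takes an Array(String) (which is what CSV.each_row yields to its block) rather than a CSV::Row, and the name BimRecord and all field names are assumptions on my part, guessed from the six columns in the sample row above. The string-to-number conversion still happens, just in one place.

```crystal
# Hypothetical struct; the field names are assumptions based on the
# sample row ["22", "rs123456", "0", "16055490", "C", "T"].
struct BimRecord
  getter chrom : Int32
  getter id : String
  getter genetic_distance : Float64 # assumed meaning of the third column
  getter pos : Int32
  getter allele1 : String
  getter allele2 : String

  # Decomposes one already-split row into typed instance variables.
  def initialize(row : Array(String))
    @chrom = row[0].to_i
    @id = row[1]
    @genetic_distance = row[2].to_f
    @pos = row[3].to_i
    @allele1 = row[4]
    @allele2 = row[5]
  end
end

rec = BimRecord.new("22\trs123456\t0\t16055490\tC\tT".split('\t'))
puts rec.chrom
puts rec.pos
```

Note that to_i raises if the first column is not numeric, so this only works as long as every chromosome label is an integer.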

I think it would be nice to have something like CSV::Serializable for transforming a CSV row to some type instance. But maybe it’s not super necessary because a CSV is always a mapping of keys (columns) to values, so nothing as complex as JSON or YAML.

Yeah, that’s kind of what I was alluding to. I was looking for ways to avoid row.split() and record[0].to_i32 as I don’t know how costly they are. It doesn’t make sense to always convert a string to Int32 if I know that record[0] will always hold an integer. Doing the conversion for 1 million+ rows doesn’t seem optimal.

This would still have to be done even if there were a CSV::Serializable, as the underlying data is a String anyway (see e.g. crystal/token.cr at master in crystal-lang/crystal on GitHub). It would just be less explicit.

If you use the built-in CSV type, all that is handled for you.

Thanks for the responses!

Based on a rough estimate, using CSV seems to be almost twice as slow. Here is my solution:

require "csv"

def parse_bim_csv(bim : String, chrom : Int32)
    bimFile = File.new(bim, "r")
    cont = 0

    # Each record is an Array(String) holding the tab-separated fields.
    CSV.each_row(bimFile, separator: '\t') do |record|
        if record[0].to_i32 == chrom
            cont += 1
        end
    end

    bimFile.close()

    puts "# of variants on chrom #{chrom}: #{cont}"
end

PS: I’m new to the language, I may be doing things the wrong way.

How did you do your benchmark? Building in release mode and using time resulted in both being 0.011s or so.
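For anyone wanting to reproduce the comparison, here is a rough sketch using Benchmark.ips on synthetic in-memory data, so that file I/O does not dominate the timing. The data, the constant DATA, and the helper names count_with_split and count_with_csv are made up for this example; build with --release before reading anything into the numbers.

```crystal
require "benchmark"
require "csv"

# Synthetic stand-in for a .bim file (hypothetical data):
# 10_000 tab-separated rows spread over chromosomes 1..22.
DATA = String.build do |io|
  10_000.times do |i|
    io << (i % 22 + 1) << "\trs" << i << "\t0\t" << (i + 1) << "\tC\tT\n"
  end
end

# Counts matching rows by splitting each line on whitespace.
def count_with_split(data : String, chrom : Int32) : Int32
  count = 0
  data.each_line do |row|
    count += 1 if row.split[0].to_i == chrom
  end
  count
end

# Counts matching rows by letting CSV parse each line.
def count_with_csv(data : String, chrom : Int32) : Int32
  count = 0
  CSV.each_row(data, separator: '\t') do |record|
    count += 1 if record[0].to_i == chrom
  end
  count
end

Benchmark.ips do |x|
  x.report("split") { count_with_split(DATA, 22) }
  x.report("CSV") { count_with_csv(DATA, 22) }
end
```

Both helpers should return the same count for the same input, which is worth checking before comparing their speed.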

Just a small change: convert chrom to a string once, before the loop, instead of converting the first field of every row to an integer:

require "csv"

def parse_bim_csv(bim : String, chrom : Int32)
    bimFile = File.new(bim, "r")
    cont = 0
    # Int32 has no to_str; to_s is the conversion to String.
    chrom_str = chrom.to_s

    CSV.each_row(bimFile, separator: '\t') do |record|
        if record[0] == chrom_str
            cont += 1
        end
    end

    bimFile.close()

    puts "# of variants on chrom #{chrom}: #{cont}"
end

I haven't tested the speed of naqvis/CrysDA on GitHub (a Crystal library for data analysis, wrangling, and munging), but maybe you can try it.

or use starts_with?

def parse_bim_csv(bim : String, chrom : Int32)
    bimFile = File.new(bim, "r")
    cont = 0

    # Chromosome number followed by whitespace; built once, outside the loop.
    # Int32 has no to_str, so use to_s.
    pat = /#{chrom.to_s}\s/

    bimFile.each_line() do |row|
        if row.starts_with?(pat)
            cont += 1
        end
    end

    bimFile.close()

    puts "# of variants on chrom #{chrom}: #{cont}"
end

Thanks! This is more efficient. This way I’ll do the splitting and other operations only when the condition is met and not for every row. Thanks again!
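A rough sketch of that idea, in case it helps anyone else: check the prefix first and split only the rows that match. The method name matching_records is made up for this example, and it takes an IO so it can be tried without a file; it also assumes the columns are separated by tabs.

```crystal
# Hypothetical helper: returns the split fields of every row whose first
# column equals chrom. Non-matching rows are never split.
def matching_records(io : IO, chrom : Int32) : Array(Array(String))
  # The trailing tab keeps chrom 1 from matching rows for chrom 12, 13, ...
  key = "#{chrom}\t"
  records = [] of Array(String)
  io.each_line do |row|
    next unless row.starts_with?(key)
    records << row.split('\t')
  end
  records
end

sample = IO::Memory.new("1\trs1\t0\t100\tA\tG\n22\trs123456\t0\t16055490\tC\tT\n12\trs2\t0\t200\tG\tA\n")
hits = matching_records(sample, 22)
puts hits.size
puts hits[0][1]
```

With a real file this could be called as File.open(bim) { |f| matching_records(f, chrom) }, which also closes the file when the block ends.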

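Using a precomputed string instead of a regex improves the speed a lot (roughly 260× in the benchmark below). Building the string inside the loop, on the other hand, is no faster than the regex: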

$crystal build --release x.cr

$./x
      1: use starts_with?(str)   1.52k (656.96µs) (± 2.64%)  1.37MB/op  299.23× slower
      3: use starts_with?(str) 455.10k (  2.20µs) (± 0.36%)    0.0B/op    1.00× slower
    2: use starts_with?(regex)   1.76k (568.54µs) (± 4.11%)   156kB/op  258.96× slower
  rep 1: use starts_with?(str)   1.55k (646.99µs) (± 2.24%)  1.37MB/op  294.69× slower
  rep 3: use starts_with?(str) 455.48k (  2.20µs) (± 0.56%)    0.0B/op         fastest
rep 2: use starts_with?(regex)   1.76k (567.63µs) (± 0.98%)   156kB/op  258.54× slower

$cat x.cr
require "benchmark"
#read = File.read("x")

line = "1\txx\txx\txx\txx"
chrom = 1
key_str="#{chrom}\t"
pat = /#{chrom.to_s}\t/
n = 10000

Benchmark.ips do |x|
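  # Case 1 interpolates a new string on every iteration (hence the allocations);
  # case 3 reuses the precomputed key_str.
  x.report("1: use starts_with?(str)") {
    n.times do
      line.starts_with?("#{chrom}\t")
    end
  }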

  x.report("3: use starts_with?(str)") {
    n.times do
      line.starts_with?(key_str)
    end
  }

  x.report("2: use starts_with?(regex)") {
    n.times do
      line.starts_with?(pat)
    end
  }

  x.report("rep 1: use starts_with?(str)") {
    n.times do
      line.starts_with?("#{chrom}\t")
    end
  }

  x.report("rep 3: use starts_with?(str)") {
    n.times do
      line.starts_with?(key_str)
    end
  }

  x.report("rep 2: use starts_with?(regex)") {
    n.times do
      line.starts_with?(pat)
    end
  }

end

So the code below should be faster:

def parse_bim_csv(bim : String, chrom : Int32)
    bimFile = File.new(bim, "r")
    cont = 0
  
    key_str="#{chrom}\t"

    bimFile.each_line() do |row|
        if row.starts_with?(key_str)
            cont +=1
        end
    end

    bimFile.close()

    puts "# of variants on chrom #{chrom}: #{cont}"

end