Parsing a CSV and using Structs

lln13 · June 27, 2022, 3:29pm

Hi,

Newbie here. I have this bit of code for parsing a TSV file line by line, checking whether a certain condition is met, and counting the number of lines that meet the condition.

def parse_bim(bim : String, chrom : Int32)    
    bimFile = File.new(bim, "r")
    cont = 0
    
    bimFile.each_line() do |row|
        record = row.split()
        # ["22", "rs123456", "0", "16055490", "C", "T"]

        if record[0].to_i32 == chrom
            cont +=1
        end
    end

    bimFile.close()

    puts "# of variants on chrom #{chrom}: #{cont}"
end

What I’d like to know is whether it is possible to use a Struct to parse each line instead of using string.split() and string.to_i32. The structure of the file columns won’t change.

Thanks!

Blacksmoke16 · June 27, 2022, 3:38pm

It would probably be easier to use the standard library’s CSV module (see the CSV page in the Crystal 1.4.1 docs) with a separator of '\t'. That should make reading each line much easier, though you’ll still need to use to_i to convert the string into an Int32.

EDIT: Yes, you could use a struct that accepts a CSV::Row, decomposes the row into separate ivars, and provides getters for them. But there isn’t an automated way to do this, so if you are only using it in this one method from your example, it’s probably not worth it.
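
A minimal sketch of what such a hand-written struct might look like. It takes the Array(String) that CSV.each_row yields rather than a CSV::Row, since the sample file has no header line. The name BimRecord and its field names are hypothetical, chosen only to match the six columns in the sample row above. Note that the to_i calls are still there; they have just moved into the constructor.

```crystal
require "csv"

# Hypothetical record type for one line of the file. The field names are
# assumptions based on the sample row:
#   ["22", "rs123456", "0", "16055490", "C", "T"]
struct BimRecord
  getter chrom : Int32
  getter id : String
  getter distance : String
  getter position : Int32
  getter allele1 : String
  getter allele2 : String

  # The string-to-integer conversion still happens, once per row.
  def initialize(row : Array(String))
    @chrom = row[0].to_i
    @id = row[1]
    @distance = row[2]
    @position = row[3].to_i
    @allele1 = row[4]
    @allele2 = row[5]
  end
end

# In-memory sample data so the sketch runs without a file.
data = "22\trs123456\t0\t16055490\tC\tT\n21\trs1\t0\t100\tA\tG\n"

count = 0
CSV.each_row(data, separator: '\t') do |row|
  record = BimRecord.new(row)
  count += 1 if record.chrom == 22
end

puts "# of variants on chrom 22: #{count}"
```

This buys readability (record.chrom instead of record[0]) rather than speed, which fits the point that it is probably only worth it if the record is used in more than one place.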

asterite · June 27, 2022, 5:37pm

I think it would be nice to have something like CSV::Serializable for transforming a CSV row to some type instance. But maybe it’s not super necessary because a CSV is always a mapping of keys (columns) to values, so nothing as complex as JSON or YAML.

lln13 · June 27, 2022, 5:53pm

Yeah, that’s kind of what I was alluding to. I was looking for ways to avoid row.split() and record[0].to_i32, as I don’t know how costly they are. It doesn’t make sense to always convert a string to i32 if I know that record[0] will always be an i32. Doing the conversion for 1 million+ rows doesn’t seem optimal.

Blacksmoke16 · June 27, 2022, 5:59pm

This would still have to be done even if there was a CSV::Serializable as the underlying data is a String anyway. E.g. https://github.com/crystal-lang/crystal/blob/master/src/json/token.cr#L23. It just would be less explicit.

If you use the built in CSV type, all that is handled for you.

lln13 · June 27, 2022, 7:02pm

Thanks for the responses!

Based on a rough estimate, using CSV seems to be almost twice as slow. Here is my solution:

require "csv"

def parse_bim_csv(bim : String, chrom : Int32)
    bimFile = File.new(bim, "r")
    cont = 0

    CSV.each_row(bimFile, separator: '\t') do |record|
        if record[0].to_i32 == chrom
            cont +=1
        end
    end

    bimFile.close()

    puts "# of variants on chrom #{chrom}: #{cont}"

end

PS: I’m new to the language, I may be doing things the wrong way.

Blacksmoke16 · June 27, 2022, 7:38pm

How did you do your benchmark? Building in release mode and using time resulted in both being 0.011s or so.
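
One possible way to compare the two approaches inside a single program is sketched below, using Time.measure from the standard library on synthetic in-memory data. The data shape and the row count are assumptions (the real file is not shown in this thread), and the timings only mean something when the program is built with --release.

```crystal
require "csv"

# Synthetic tab-separated data (an assumption; not the real input file).
# The chromosome column cycles through 1..22.
data = String.build do |io|
  100_000.times do |i|
    io << (i % 22 + 1) << "\trs" << i << "\t0\t" << i << "\tC\tT\n"
  end
end

chrom = 22

# Approach 1: each_line + split, as in the first post.
split_count = 0
split_time = Time.measure do
  data.each_line do |row|
    split_count += 1 if row.split[0].to_i == chrom
  end
end

# Approach 2: CSV.each_row with a tab separator.
csv_count = 0
csv_time = Time.measure do
  CSV.each_row(data, separator: '\t') do |record|
    csv_count += 1 if record[0].to_i == chrom
  end
end

puts "split: #{split_count} matches in #{split_time.total_milliseconds} ms"
puts "CSV:   #{csv_count} matches in #{csv_time.total_milliseconds} ms"
```

A single timed run like this is noisy; Benchmark.ips from the standard library repeats each block and reports the variation, so it is usually the better tool for small differences.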

orangeSi · June 28, 2022, 1:26am

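Just a small change: convert chrom to a string once, instead of converting the first column to an integer on every row, as below:

require "csv"

def parse_bim_csv(bim : String, chrom : Int32)
    bimFile = File.new(bim, "r")
    cont = 0
    # Int32 has no to_str method; to_s is the conversion to String.
    chrom_str = chrom.to_s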


    CSV.each_row(bimFile, separator: '\t') do |record|
        if record[0] == chrom_str
            cont +=1
        end
    end

    bimFile.close()

    puts "# of variants on chrom #{chrom}: #{cont}"

end

orangeSi · June 28, 2022, 1:35am

I have not tested the speed of naqvis/CrysDA on GitHub (a Crystal library for data analysis, wrangling, and munging), but maybe you can try it.

orangeSi · June 29, 2022, 5:28am

or use starts_with?

def parse_bim_csv(bim : String, chrom : Int32)
    bimFile = File.new(bim, "r")
    cont = 0
    # Int32 has no to_str method; interpolation converts chrom to a string.
    pat = /#{chrom}\s/

    bimFile.each_line() do |row|
        if row.starts_with?(pat)
            cont +=1
        end
    end

    bimFile.close()

    puts "# of variants on chrom #{chrom}: #{cont}"

end

lln13 · June 29, 2022, 1:14pm

Thanks! This is more efficient. This way I’ll do the splitting and other operations only when the condition is met and not for every row. Thanks again!
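
A rough sketch of that "filter first, split later" idea, assuming the fourth column holds an integer position as in the sample row from the first post. The method name matching_positions is made up for illustration, and a plain string prefix is used here in place of the regex.

```crystal
# Collects the fourth column of every row whose first column equals chrom.
# Rows are only split after the cheap prefix check has passed.
def matching_positions(io : IO, chrom : Int32) : Array(Int32)
  # The trailing tab keeps chrom 2 from matching rows for chrom 22.
  key = "#{chrom}\t"
  positions = [] of Int32

  io.each_line do |row|
    next unless row.starts_with?(key)
    fields = row.split('\t')
    positions << fields[3].to_i
  end

  positions
end

# In-memory sample data so the sketch runs without a file.
sample = IO::Memory.new("22\trs1\t0\t100\tC\tT\n2\trs2\t0\t200\tA\tG\n22\trs3\t0\t300\tG\tA\n")
p matching_positions(sample, 22)
```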

orangeSi · June 30, 2022, 2:15am

Using a string instead of a regex will improve the speed, provided the string is built once outside the loop. Benchmarks below:

$crystal build --release x.cr

$./x
      1: use starts_with?(str)   1.52k (656.96µs) (± 2.64%)  1.37MB/op  299.23× slower
      3: use starts_with?(str) 455.10k (  2.20µs) (± 0.36%)    0.0B/op    1.00× slower
    2: use starts_with?(regex)   1.76k (568.54µs) (± 4.11%)   156kB/op  258.96× slower
  rep 1: use starts_with?(str)   1.55k (646.99µs) (± 2.24%)  1.37MB/op  294.69× slower
  rep 3: use starts_with?(str) 455.48k (  2.20µs) (± 0.56%)    0.0B/op         fastest
rep 2: use starts_with?(regex)   1.76k (567.63µs) (± 0.98%)   156kB/op  258.54× slower

$cat x.cr
require "benchmark"
#read = File.read("x")

line = "1\txx\txx\txx\txx"
chrom = 1
key_str="#{chrom}\t"
pat = /#{chrom.to_s}\t/
n = 10000

Benchmark.ips do |x|
  x.report("1: use starts_with?(str)") {
    n.times do
      line.starts_with?("#{chrom}\t")
    end
  }

  x.report("3: use starts_with?(str)") {
    n.times do
      line.starts_with?(key_str)
    end
  }

  x.report("2: use starts_with?(regex)") {
    n.times do
      line.starts_with?(pat)
    end
  }

  x.report("rep 1: use starts_with?(str)") {
    n.times do
      line.starts_with?("#{chrom}\t")
    end
  }

  x.report("rep 3: use starts_with?(str)") {
    n.times do
      line.starts_with?(key_str)
    end
  }

  x.report("rep 2: use starts_with?(regex)") {
    n.times do
      line.starts_with?(pat)
    end
  }

end

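Case 1 builds a new string with "#{chrom}\t" on every call (1.37MB/op), while case 3 builds the key once and reports 0.0B/op. So the code below should be faster: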

def parse_bim_csv(bim : String, chrom : Int32)
    bimFile = File.new(bim, "r")
    cont = 0
  
    key_str="#{chrom}\t"

    bimFile.each_line() do |row|
        if row.starts_with?(key_str)
            cont +=1
        end
    end

    bimFile.close()

    puts "# of variants on chrom #{chrom}: #{cont}"

end
