Newbie here. I have this bit of code that parses a TSV file line by line, checks whether a certain condition is met, and counts the number of lines that meet it.
def parse_bim(bim : String, chrom : Int32)
  bimFile = File.new(bim, "r")
  cont = 0
  bimFile.each_line() do |row|
    record = row.split()
    # ["22", "rs123456", "0", "16055490", "C", "T"]
    if record[0].to_i32 == chrom
      cont += 1
    end
  end
  bimFile.close()
  puts "# of variants on chrom #{chrom}: #{cont}"
end
What I’d like to know is whether it is possible to use a Struct to parse each line instead of using string.split() and string.to_i32. The structure of the file columns won’t change.
It would probably be easier to use CSV (CSV - Crystal 1.4.1) with \t as the separator. That should make reading each line much easier, tho you’ll still need to_i to convert the string into an Int32.
EDIT: Yes, you could use a struct that accepts a CSV::Row, decomposes the row into separate ivars, and provides getters for them. But there isn’t an automated way to do this, so if you are only using it in the one method from your example, it’s probably not worth it.
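To make the struct idea concrete, here is a minimal sketch, not a tested recipe. BimRecord, its field names, and count_chrom are made up for illustration, and it takes the plain Array(String) that CSV.each_row yields rather than a CSV::Row. It also assumes the first column is always numeric; a label like "X" would make to_i raise.

```crystal
require "csv"

# Hypothetical struct: field names are guesses based on the sample row
# ["22", "rs123456", "0", "16055490", "C", "T"].
struct BimRecord
  getter chrom : Int32
  getter variant_id : String
  getter pos : Int32
  getter allele1 : String
  getter allele2 : String

  # The string-to-integer conversion still happens here, once per row.
  def initialize(row : Array(String))
    @chrom = row[0].to_i
    @variant_id = row[1]
    @pos = row[3].to_i
    @allele1 = row[4]
    @allele2 = row[5]
  end
end

# Counts the rows whose first column equals chrom.
def count_chrom(io : IO, chrom : Int32) : Int32
  count = 0
  CSV.each_row(io, separator: '\t') do |row|
    count += 1 if BimRecord.new(row).chrom == chrom
  end
  count
end

data = IO::Memory.new("22\trs123456\t0\t16055490\tC\tT\n21\trs1\t0\t100\tA\tG\n")
puts "# of variants on chrom 22: #{count_chrom(data, 22)}"
```

Note that this only tidies up the access to the columns. It does not remove the splitting or the conversion, so it should not be expected to run faster than the original.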
I think it would be nice to have something like CSV::Serializable for transforming a CSV row to some type instance. But maybe it’s not super necessary because a CSV is always a mapping of keys (columns) to values, so nothing as complex as JSON or YAML.
Yeah, that’s kind of what I was alluding to. I was looking for ways to avoid row.split() and record[0].to_i32 because I don’t know how costly they are. It doesn’t make sense to convert a string to i32 every time if I know that record[0] will always be an i32. Doing the conversion for 1 million+ rows doesn’t seem optimal.
def parse_bim_csv(bim : String, chrom : Int32)
  bimFile = File.new(bim, "r")
  cont = 0
  # Build the pattern once; Int32 has to_s, not to_str
  pat = /#{chrom}\s/
  bimFile.each_line() do |row|
    # Check the start of the raw line, no split or to_i32 needed
    if row.starts_with?(pat)
      cont += 1
    end
  end
  bimFile.close()
  puts "# of variants on chrom #{chrom}: #{cont}"
end
Thanks! This is more efficient. This way I’ll do the splitting and other operations only when the condition is met and not for every row. Thanks again!
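For anyone reading later, "split only when the condition is met" could look roughly like the sketch below. matching_records is a made-up name, it reads from any IO so it can be tried without a file, and it uses a plain string prefix instead of the regex. It assumes the columns are separated by a single tab.

```crystal
# Hypothetical helper: returns the split columns of every line on chrom.
def matching_records(io : IO, chrom : Int32) : Array(Array(String))
  prefix = "#{chrom}\t" # built once, outside the loop
  records = [] of Array(String)
  io.each_line do |row|
    # Cheap check first; non-matching lines are never split.
    next unless row.starts_with?(prefix)
    records << row.split('\t')
  end
  records
end

data = IO::Memory.new("22\trs123456\t0\t16055490\tC\tT\n21\trs1\t0\t100\tA\tG\n")
matching_records(data, 22).each do |record|
  puts record.join(", ")
end
```

Because the prefix includes the tab, chromosome 1 does not match lines that start with 11 or 12.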
Using a string instead of a regex will improve the speed. Benchmarks below:
$ crystal build --release x.cr
$ ./x
     1: use starts_with?(str)     1.52k (656.96µs) (± 2.64%)  1.37MB/op  299.23× slower
     3: use starts_with?(str)   455.10k (  2.20µs) (± 0.36%)    0.0B/op    1.00× slower
   2: use starts_with?(regex)     1.76k (568.54µs) (± 4.11%)   156kB/op  258.96× slower
 rep 1: use starts_with?(str)     1.55k (646.99µs) (± 2.24%)  1.37MB/op  294.69× slower
 rep 3: use starts_with?(str)   455.48k (  2.20µs) (± 0.56%)    0.0B/op         fastest
rep 2: use starts_with?(regex)    1.76k (567.63µs) (± 0.98%)   156kB/op  258.54× slower
$ cat x.cr
require "benchmark"

#read = File.read("x")
line = "1\txx\txx\txx\txx"
chrom = 1
key_str = "#{chrom}\t"   # prefix string built once
pat = /#{chrom.to_s}\t/  # regex built once
n = 10000

Benchmark.ips do |x|
  # 1: builds the prefix string again on every iteration
  x.report("1: use starts_with?(str)") do
    n.times do
      line.starts_with?("#{chrom}\t")
    end
  end
  # 3: reuses the prebuilt prefix string
  x.report("3: use starts_with?(str)") do
    n.times do
      line.starts_with?(key_str)
    end
  end
  # 2: reuses the prebuilt regex
  x.report("2: use starts_with?(regex)") do
    n.times do
      line.starts_with?(pat)
    end
  end
  # The same three cases repeated
  x.report("rep 1: use starts_with?(str)") do
    n.times do
      line.starts_with?("#{chrom}\t")
    end
  end
  x.report("rep 3: use starts_with?(str)") do
    n.times do
      line.starts_with?(key_str)
    end
  end
  x.report("rep 2: use starts_with?(regex)") do
    n.times do
      line.starts_with?(pat)
    end
  end
end
So the code below will be faster!
def parse_bim_csv(bim : String, chrom : Int32)
  bimFile = File.new(bim, "r")
  cont = 0
  key_str = "#{chrom}\t"
  bimFile.each_line() do |row|
    if row.starts_with?(key_str)
      cont += 1
    end
  end
  bimFile.close()
  puts "# of variants on chrom #{chrom}: #{cont}"
end
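A slightly more compact variant of the same idea, as a sketch only: count_variants is a made-up name, and it returns the count instead of printing it so the caller decides what to do with the number. File.each_line opens and closes the file itself, so there is no explicit close to forget.

```crystal
# Hypothetical variant: counts the lines whose first column is chrom.
# Assumes the columns are separated by a single tab.
def count_variants(bim : String, chrom : Int32) : Int32
  key_str = "#{chrom}\t" # built once, outside the loop
  count = 0
  File.each_line(bim) do |row|
    count += 1 if row.starts_with?(key_str)
  end
  count
end
```

It could then be called as puts "# of variants on chrom 22: #{count_variants("my.bim", 22)}", where my.bim stands in for whatever file you have.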