Read output line by line from a bash command (processing a big file) in Linux

Hello everyone, I want to read the output of a Linux bash command line by line. My code looks like this:

io =  IO::Memory.new
Process.run("cat xx.txt|cut -f 1-6|sort -k 1,1", shell: true, output: io)
io.close
io_arr = io.to_s.split(/\n/)
io_arr.each do |e|
   puts e
end

When xx.txt is a small file (1M), the code works well. But when xx.txt is a very big file (100G), I get this error:

Unhandled exception: Negative size (ArgumentError)

Thanks for any help~

Welcome to the Crystal community, @orangeSi!

The issue seems to be that IO::Memory stores its capacity as an Int32, so once you’ve read 2GB into the buffer, its size overflows to become a negative number. I don’t know if this is the exact problem you’re experiencing, but it seems pretty likely.

With very few exceptions, I doubt anyone actually needs to hold that much data in a single buffer, and it looks like you can avoid holding the entire output in memory by using IO.pipe instead of an IO::Memory and passing a block to Process.run:

reader, writer = IO.pipe
Process.run "cat xx.txt|cut -f 1-6|sort -k 1,1", shell: true, output: writer do |process|
  until process.terminated?
    line = reader.gets
    puts line
  end
end

Notice this code creates two IOs with IO.pipe — we tell the child process to write to one end and our code reads from the other end.

IO::Memory isn’t a good fit for streaming both input and output at the same time, especially across processes since it only has a single position marker for both reading and writing — that is, if you call io.puts "foo", calling io.gets won’t read "foo" since your position in the buffer is at the end of the buffer you just wrote. It also doesn’t release memory after it’s been read. That’s a feature (it’s what allows io.rewind), it just doesn’t scale well for huge inputs. :-D
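
A tiny sketch of that single-position behavior, using only IO::Memory from the standard library (the variable names are just for illustration):

```crystal
io = IO::Memory.new
io.puts "foo"

# The shared position now sits after "foo\n", so there is nothing left to read.
first_read = io.gets # nil

# Rewinding moves the position back to the start of the buffer.
io.rewind
second_read = io.gets # "foo"

p first_read
p second_read
```

The buffer still holds "foo" after the second read, which is why it keeps growing when you stream a lot of data through it.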

For such simple commands I recommend you do this in Crystal directly, with File.each_line.
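
As a rough sketch of what that could look like for the `cut -f 1-6` step only: the sample file and the `trimmed` array below are made up for the demo, and the `sort -k 1,1` step is left out on purpose, since sorting a 100G file inside the program would need all of the data at once.

```crystal
# Hypothetical sample file standing in for xx.txt.
path = File.join(Dir.tempdir, "each_line_demo.tsv")
File.write(path, "a\tb\tc\td\te\tf\tg\n1\t2\t3\n")

# Collected only so the sketch can be checked; drop this for a huge file.
trimmed = [] of String

File.each_line(path) do |line|
  # Same idea as `cut -f 1-6`: keep the first six tab-separated fields.
  row = line.split("\t").first(6).join("\t")
  puts row
  trimmed << row
end

File.delete(path)
```

File.each_line reads one line at a time, so memory use stays flat no matter how big the file is.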

Thanks, you resolved my problem!

Note that Process.run's output argument defaults to :pipe, so you don’t need to pass your own pipe, it creates one for you:

Process.run("ls -l /etc/", shell: true) do |proc|
  IO.copy(proc.output, STDOUT)
end

Thanks for the note, but I wanted to read the output line by line instead of all lines at once (which is what copying proc.output does).

I tried the code. It does read line by line, but then it hangs at the terminal forever, like this:

echo -e "1\t2\t3\n4\t5\t6" >tmp

test.cr looks like this:

ifile = ARGV[0]
reader, writer = IO.pipe
#Process.run("gzip -dc #{ifile}|cut -f 1-6|sort -k 1,1", shell: true, output: writer) do |process|
Process.run("cat #{ifile}|cut -f 1-6|sort -k 1,1", shell: true, output: writer) do |process|
	until process.terminated?
		line = reader.gets
		puts "line is #{line}"
		puts process.terminated?
	end
	puts "end the file"
end

Then I ran:

./test tmp

and got this:

line is 1	2	3
false
line is 4	5	6
false

It never printed “end the file” from test.cr.

I am using CentOS release 6.9.

proc.output is the read end of the pipe, so it is an IO you can call gets on:

Process.run("ls -l /", shell: true) do |proc|
  while line = proc.output.gets
    puts line unless line.includes? "bin"
  end
end

https://carc.in/#/r/7hai

To fix your hang it’s probably enough to check for EOF (gets returning nil), like my example above.
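
If you do want to keep your own IO.pipe, here is a sketch of how that might look. My understanding, which is an assumption and untested on your setup, is that the hang comes from the parent still holding the write end open, so reader.gets never sees EOF. The printf command and the `lines` array are only there for the demo:

```crystal
reader, writer = IO.pipe

# Collected only so the sketch can be checked.
lines = [] of String

Process.run("printf 'b\\na\\n' | sort", shell: true, output: writer) do |process|
  # The child has its own copy of the write end; close ours so EOF can arrive.
  writer.close
  # gets returns nil once every write end of the pipe is closed.
  while line = reader.gets
    puts line
    lines << line
  end
end
reader.close
```

The loop condition is the same EOF check as before; the only extra step is closing the parent's writer.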

Thanks a lot~ output.gets works well! My code now looks like this:

ifile = ARGV[0]
Process.run("cat #{ifile}|cut -f 1-6|sort -k 1,1", shell: true) do |proc|
	while line = proc.output.gets
		puts line
	end
	puts "here you get end the file"
end

proc.output.gets returns nil once it reaches the real end of the output, so the loop finishes.

Still, any reason not to convert this little command to pure Crystal? If the actual code is really like your example, it will be trivial to do - and much more efficient (and robust).

Depends on whether that’s needed. If this is invoked thousands of times per second by an automated process or by a network request, that would be a great reason to pull that functionality into Crystal. If it’s only ever run by hand, the difference in efficiency will never be noticed on anything resembling modern hardware. :-)

You’re quite right, there is also code maintainability to take into consideration.

Yes, my actual code is like that. If performance matters, I will try to write it in pure Crystal.
At least I learned how to read the output of a system command line by line.
