I’m not sure if this is an issue with my code, with Crystal, or if it’s simply because I’m going through a Process. Is there anything I should do about it, or is it just a harmless warning? Or is there a more efficient way to handle this?
Yeah, that GC Warning is annoying… maybe we can find a way to disable it.
That said:
the JSON file you mention seems to be 193MB, according to du -h
copying from ARGF to input writes to an IO::Memory in chunks, and the internal buffer of IO::Memory starts small, so it has to be resized many times. If you know you are getting a file, you could do IO::Memory.new(File.size(ARGV[0])) to avoid the extra allocations
however, why are you reading from ARGF into input, rewinding, and then passing that to the process? You could pass ARGF directly, since input accepts any IO.
the above points apply to output and error too: why not pass STDOUT and STDERR directly? That way the output from jq is piped to the program’s STDOUT/STDERR (I think you can also pass one of the values of the Process::Redirect enum).
If you can’t pipe directly, the advice of pre-allocating all of the IO::Memory instances with an estimate of their final size should improve things a bit. But piping everything is better: when I did that, the app consumed a total of 1.2MB, compared to ~500MB.
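To make those two options concrete, here is a minimal sketch of both (my own illustration, assuming the input path arrives as ARGV[0] and jq is on the PATH; the ["."] filter is just a placeholder):

```crystal
# Option 1: pre-allocate the IO::Memory so its buffer isn't doubled repeatedly.
input = IO::Memory.new(File.size(ARGV[0]))
File.open(ARGV[0]) { |file| IO.copy(file, input) }
input.rewind
Process.run("jq", ["."], input: input, output: STDOUT, error: STDERR)

# Option 2: skip the intermediate buffer entirely and hand the file IO to the process.
File.open(ARGV[0]) do |file|
  Process.run("jq", ["."], input: file, output: STDOUT, error: STDERR)
end
```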
Oh sorry, yeah, the other file I was working with was smaller and didn’t have the issue.
In this case that would work, but for the actual full program it’s possible the input data is YAML/XML, and I have to convert that to JSON first. Or the output format is XML/YAML, and I have to convert the jq output. Because of that I need to use IO::Memory to store it until I determine what should happen with it. See oq.
But it’s a good call to use STDERR for error directly since I don’t have to do anything to it. I’ll have to look into setting the buffer size, thanks for the tips.
EDIT: I wonder if it would be best to conditionally execute the Process.run based on the input format. Then if the input format is just JSON, I would be able to pass it through, and only have the more complex processing for other formats, as opposed to trying to have a common IO for each.
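A rough sketch of that conditional idea (input_format, args, and convert_to_json are hypothetical names for illustration, not oq’s actual API; the stub conversion here just copies bytes through):

```crystal
# Hypothetical stand-in for a real YAML/XML -> JSON conversion.
def convert_to_json(source : IO, target : IO)
  IO.copy(source, target) # a real implementation would transcode here
end

input_format = :json # would be detected or taken from a CLI flag
args = ["."]

case input_format
when :json
  # JSON in: nothing to buffer, pass the IO straight through to jq.
  Process.run("jq", args, input: ARGF, output: STDOUT, error: STDERR)
else
  # Other formats: convert into a buffer first, then feed that to jq.
  json = IO::Memory.new
  convert_to_json(ARGF, json)
  json.rewind
  Process.run("jq", args, input: json, output: STDOUT, error: STDERR)
end
```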
In this case that would work, but for the actual full program it’s possible the input data is YAML/XML, and I have to convert that to JSON first. Or the output format is XML/YAML, and I have to convert the jq output.
I think in all of these cases you could use a pull parser to read from, and a builder to write to an IO.
And I think you can use an IO.pipe to do that. So something like this:
require "json"
require "yaml"
read, write = IO.pipe
spawn do
# Here we read from ARGF and write to "write"
pull = JSON::PullParser.new(ARGF)
builder = YAML::Builder.new(write)
builder.stream do
builder.document do
# read from pull parser, write to builder
# -- left as exercise for the reader :-P
# (this isn't trivial)
end
end
write.close # this line is important!
end
# Just send to STDOUT
Process.run("cat", [] of String, input: read, output: STDOUT)
In the last line you can do the same trick if you need to stream the output:
read2, write2 = IO.pipe
chan = Channel(Nil).new

spawn do
  # process the output in a streaming way
  p read2.gets_to_end
  # say we are done!
  chan.send(nil)
end

Process.run("cat", [] of String, input: read, output: write2)
write2.close

# Need to wait for the reading part, otherwise it exits too soon
chan.receive
Just to be clear, a user shouldn’t be worried about the GC warning if they get it? It’s just more of an informational thing from the GC?
I think so, I’m not sure. I think the GC notices you are allocating bigger and bigger chunks of memory and warns you about it because there might be a way to avoid those allocations. But if your app needs to work with stuff in memory then I think those warnings are inevitable.
That said… I didn’t get any warning when I tried your code sample. Maybe it works differently on a mac.
The question I have is:
How does a 190MB file turn into 536875008 bytes of data, i.e. 536.87MB?
Is this because of the GC and that data will be free’d later?
I think it’s because the buffer in IO::Memory starts small and doubles over time, so the GC has to allocate at least 190MB + 190MB*2 for the last allocation (because it’s rounded to powers of 2). And I think at that point it can’t free the old bytes… I’m not sure.
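A quick back-of-the-envelope check of that doubling idea (a sketch only; the GC’s exact accounting, and the 536875008 figure, may include extra rounding or headroom):

```crystal
payload = 193_i64 * 1024 * 1024 # roughly the file size

# IO::Memory's buffer starts small and doubles on each resize.
capacity = 64_i64
capacity *= 2 while capacity < payload

puts capacity # 268435456, i.e. 256MiB for the final buffer

# During the final resize the old buffer is still live while the new one
# is allocated, so peak usage is old + new:
puts capacity // 2 + capacity # 402653184, i.e. ~384MiB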
Yes, that’s why I used spawn. I can’t remember exactly why this is needed, but I think it’s that a pipe has a small buffer, and once it’s full, writes block until the reader starts consuming it, and that won’t happen in the same fiber. But I thought Process did a spawn of its own… I’m sure someone knows the answer to this, but for me it’s still blurry.
I see this warning a lot in the Savi compiler, which often uses a hefty chunk of memory to compile a program.
And now it’s getting in the way of the integration tests I’m doing, where I test the error output of the compiler against certain test source trees: the sporadic presence or absence of these warnings gets intermingled with the output I’m testing.