The Crystal Programming Language Forum

GC Warning: Repeated allocation of very large block

I’m getting a

GC Warning: Repeated allocation of very large block (appr. size 536875008):
    May lead to memory leak and poor performance

warning when reading a large JSON file (190 MB) from ARGF, passing it to Process.run, and printing the output with puts.

test.cr

input = IO::Memory.new
output = IO::Memory.new
error = IO::Memory.new

IO.copy(ARGF, input)

input.rewind

Process.run("jq", ["."] of String, input: input, output: output, error: error)

puts output.to_s
puts error.to_s

Steps to reproduce:

  • Install jq
  • Get the json file
    • wget https://raw.githubusercontent.com/zemirco/sf-city-lots-json/master/citylots.json
  • Build the code
    • crystal build --release test.cr
  • Run it
    • ./test citylots.json > test.json

I’m not sure if this is an issue with my code, with Crystal, or just a consequence of going through a Process. Is there anything I should do about it, or is it just a harmless warning? Or is there a more efficient way to handle this?

Yeah, that GC Warning is annoying… maybe we can find a way to disable it.

That said:

  • the JSON file you mention seems to be 193 MB, according to du -h
  • copying from ARGF to input involves writing to IO::Memory in chunks, and IO::Memory starts with a small internal buffer, so it has to be resized many times. If you know you’re getting a file, you could do IO::Memory.new(File.size(ARGV[0])) to avoid the extra allocations (see the sketch after this list)
  • however, why are you reading from ARGF into input, rewinding, and then passing that to the process? You could pass ARGF directly, since input accepts an IO.
  • the above points apply to output and error too: why not pass STDOUT and STDERR directly? That way the output from jq is piped to the program’s STDOUT/STDERR (I think you can also pass one of the values of the Process::Redirect enum).
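
A minimal sketch of that pre-allocation (it assumes ARGV[0] names a regular file, as in the reproduction steps above, so File.size can be used):

# Pre-size the buffer to the file's size so IO::Memory never has to grow and re-copy.
input = IO::Memory.new(File.size(ARGV[0]))

IO.copy(ARGF, input)
input.rewind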

If you can’t pipe directly, pre-allocating all of the IO::Memory instances with estimates of the sizes you’ll get should improve things a bit. But piping everything through is better: when I did that, the app consumed a total of 1.2 MB, compared to ~500 MB.
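
For the simple JSON-in/JSON-out case, piping straight through looks like this (a minimal sketch; it assumes jq is installed, as in the reproduction steps):

# No buffering in our process: jq reads from ARGF and writes straight to our STDOUT/STDERR.
Process.run("jq", ["."], input: ARGF, output: STDOUT, error: STDERR)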

Oh sorry, yeah, the other file I was working with was smaller and didn’t have the issue.

In this case that would work, but in the actual full program it’s possible the input data is YAML/XML and I have to convert it to JSON first. Or the output format is XML/YAML and I have to convert the jq output. Because of that I need to use IO::Memory to store it until I determine what should happen with it. See oq.

But it’s a good call to pass STDERR for error directly, since I don’t have to do anything with it. I’ll have to look into setting the buffer size. Thanks for the tips.

EDIT: I wonder if it would be best to conditionally execute the Process.run based on the input format. Then if the input format is just JSON, I would be able to pass it straight through and only use the more complex processing for other formats, as opposed to trying to have a common IO for each.
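
A rough sketch of that conditional idea (the extension-based format check is just a hypothetical placeholder, and the YAML/XML-to-JSON conversion step is elided):

# Hypothetical format check: treat .json input as pass-through.
input_is_json = ARGV[0]?.try(&.ends_with?(".json")) || false

input = if input_is_json
          ARGF # JSON goes straight through to jq
        else
          buffer = IO::Memory.new
          # ... convert the YAML/XML coming from ARGF into JSON written to buffer ...
          buffer.rewind
          buffer
        end

Process.run("jq", ["."], input: input, output: STDOUT, error: STDERR)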

In this case that would work, but in the actual full program it’s possible the input data is YAML/XML and I have to convert it to JSON first. Or the output format is XML/YAML and I have to convert the jq output.

I think in all of these cases you could use a pull parser to read from, and a builder to write to an IO.

And I think you can use an IO.pipe to do that. So something like this:

require "json"
require "yaml"

read, write = IO.pipe

spawn do
  # Here we read from ARGF and write to "write"
  pull = JSON::PullParser.new(ARGF)
  builder = YAML::Builder.new(write)
  builder.stream do
    builder.document do
      # read from pull parser, write to builder
      # -- left as exercise for the reader :-P
      # (this isn't trivial)
    end
  end
  write.close # this line is important!
end

# Just send to STDOUT
Process.run("cat", [] of String, input: read, output: STDOUT)

In the last line you can do the same trick if you need to stream the output:

read2, write2 = IO.pipe

chan = Channel(Nil).new

spawn do
  # process the output in a streaming way
  p read2.gets_to_end

  # say we are done!
  chan.send(nil)
end

Process.run("cat", [] of String, input: read, output: write2)

write2.close

# Need to wait for the reading part, otherwise it exits too soon
chan.receive

Great, thanks for the ideas. I’ll have to play around with it more.

Just to be clear, a user shouldn’t be worried about the GC warning if they get it? It’s just more of an informational thing from the GC?

Just to be clear, a user shouldn’t be worried about the GC warning if they get it? It’s just more of an informational thing from the GC?

I think so, but I’m not sure. I think the GC notices you are allocating bigger and bigger chunks of memory and warns you about it because there might be a way to avoid those allocations. But if your app needs to work with everything in memory, then I think those warnings are inevitable.

That said… I didn’t get any warning when I tried your code sample. Maybe it works differently on a Mac.


Wow, asterite is on fire today

The question I have is:
How does a 190 MB file turn into 536875008 bytes of data? That’s about 537 MB.
Is this because of the GC, and will that data be freed later?

I think it’s because the buffer in IO::Memory starts small and doubles as it grows, so for the last allocation the GC has to allocate at least the ~190 MB already there plus a new buffer of roughly twice that size (capacities are rounded up to powers of two). And I think at that point it can’t free the old bytes yet… I’m not sure.
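
Just to put numbers on that (plain arithmetic, not a claim about which of the three IO::Memory buffers triggered the warning):

# IO::Memory capacities are rounded up to powers of two.
p Math.pw2ceil(190 * 1024 * 1024) # => 268435456 (256 MiB)
p 2 ** 29                         # => 536870912 (512 MiB); the 536875008 in the warning is this plus 4096 bytes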


@asterite Shouldn’t you be able to do:

read, write = IO.pipe

Process.run("jq", ["."], input: ARGF, output: write)

pp read.gets_to_end

echo '{"name": "Jim"}' | ./test

It just seems to hang. But I’m probably just missing something…

Yes, that’s why I used spawn. I can’t remember exactly why this is needed, but I think it’s that the pipe has a small buffer, and writes to it block once it’s full until the reader starts consuming it. And that won’t happen in the same fiber. But I thought Process did a spawn of its own… I’m sure someone knows the answer to this, but for me it’s still blurry.
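
For what it’s worth, here’s a minimal sketch of that snippet with the reading moved into its own fiber, following the same pattern as the examples above (assumes jq is installed):

read, write = IO.pipe

chan = Channel(String).new

spawn do
  # Drain the pipe concurrently so jq never blocks on a full pipe buffer.
  chan.send(read.gets_to_end)
end

Process.run("jq", ["."], input: ARGF, output: write)
write.close # close our copy of the write end so gets_to_end sees EOF

pp chan.receive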
