GC Warning: Repeated allocation of very large block

I’m getting a

GC Warning: Repeated allocation of very large block (appr. size 536875008):
    May lead to memory leak and poor performance

warning when reading in a large JSON file (190MB) from ARGF, passing it to Process.run, and printing the output.

test.cr

input = IO::Memory.new
output = IO::Memory.new
error = IO::Memory.new

IO.copy(ARGF, input)

input.rewind

Process.run("jq", ["."] of String, input: input, output: output, error: error)

puts output.to_s
puts error.to_s

Steps to reproduce:

  • Install jq
  • Get the json file
    • wget https://raw.githubusercontent.com/zemirco/sf-city-lots-json/master/citylots.json
  • Build the code
    • crystal build --release test.cr
  • Run it
    • ./test citylots.json > test.json

I’m not sure if this is an issue with my code, with Crystal, or with the fact that I’m going through a Process. Is there anything I should do about it, or is it just a harmless warning? Or is there a more efficient way to handle this?

Yeah, that GC Warning is annoying… maybe we can find a way to disable it.

That said:

  • the JSON file you mention seems to be 193MB, according to du -h
  • copying from ARGF to input writes to the IO::Memory in chunks, and since the internal buffer of IO::Memory starts small, it has to be resized many times. If you know you are getting a file, you could do IO::Memory.new(File.size(ARGV[0])) to avoid the extra allocations
  • however, why are you reading from ARGF into input, rewinding, and then passing that to the process? You could pass ARGF directly, since input accepts an IO.
  • the above points apply to output and error: why not pass STDOUT and STDERR directly? That way the output from jq is piped to the program’s STDOUT/STDERR (I think you can also pass one of the values of the Process::Redirect enum). See the sketch right after this list.
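
A minimal sketch of that fully piped version, assuming the same jq invocation as above (ARGF, STDOUT and STDERR are all IOs, so they can be passed straight through):

# Stream ARGF into jq and let jq's output and errors go straight to
# this program's own STDOUT/STDERR; nothing is buffered in Crystal.
Process.run("jq", ["."] of String, input: ARGF, output: STDOUT, error: STDERR)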

If you can’t pipe directly, pre-allocating all of the IO::Memory instances with estimates of what you’ll get should improve things a bit. But trying to pipe everything is better: when I did that, the app consumed a total of 1.2MB, compared to ~500MB.
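
For the pre-allocation route, a rough sketch, assuming the path comes in as ARGV[0] (note that IO::Memory.new takes an Int32 capacity while File.size returns an Int64, so a conversion is needed):

# Reserve the full file size up front so the internal buffer
# never has to be grown and copied while reading.
input = IO::Memory.new(File.size(ARGV[0]).to_i32)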

Oh sorry, yeah, the other file I was working with was smaller and didn’t have the issue.

In this case that would work, but in the actual full program it’s possible the input data is YAML/XML and I have to convert that to JSON first. Or the output format is XML/YAML and I have to convert the jq output. Because of that I need to use IO::Memory to store it until I determine what should happen with it. See oq.

But it’s a good call to use STDERR for error directly since I don’t have to do anything to it. I’ll have to look into setting the buffer size, thanks for the tips.

EDIT: I wonder if it would be best to conditionally execute Process.run based on the input format. Then, if the input format is just JSON, I would be able to pass it straight through and only use the more complex processing for the other formats, as opposed to trying to have a common IO for each.
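
A hypothetical sketch of that idea (input_format here is just a stand-in for however the format ends up being detected or passed in):

# Stand-in for real format detection; hardcoded for illustration only.
input_format = "json"

if input_format == "json"
  # JSON in: stream straight through jq with no buffering at all
  Process.run("jq", ["."] of String, input: ARGF, output: STDOUT, error: STDERR)
else
  # YAML/XML in: convert to JSON into an intermediate IO first,
  # then run jq on that (the more complex path)
end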

In this case that would work, but in the actual full program it’s possible the input data is YAML/XML and I have to convert that to JSON first. Or the output format is XML/YAML and I have to convert the jq output.

I think in all of these cases you could use a pull parser to read from, and a builder to write to an IO.

And I think you can use an IO.pipe to do that. So something like this:

require "json"
require "yaml"

read, write = IO.pipe

spawn do
  # Here we read from ARGF and write to "write"
  pull = JSON::PullParser.new(ARGF)
  builder = YAML::Builder.new(write)
  builder.stream do
    builder.document do
      # read from pull parser, write to builder
      # -- left as exercise for the reader :-P
      # (this isn't trivial)
    end
  end
  write.close # this line is important!
end

# Just send to STDOUT
Process.run("cat", [] of String, input: read, output: STDOUT)

In the last line you can do the same trick if you need to stream the output:

read2, write2 = IO.pipe

chan = Channel(Nil).new

spawn do
  # process the output in a streaming way
  p read2.gets_to_end

  # say we are done!
  chan.send(nil)
end

Process.run("cat", [] of String, input: read, output: write2)

write2.close

# Need to wait the reading part, otherwise it exits too soon
chan.receive

Great, thanks for the ideas. I’ll have to play around with it more.

Just to be clear, a user shouldn’t be worried about the GC warning if they get it? It’s just more of an informational thing from the GC?

Just to be clear, a user shouldn’t be worried about the GC warning if they get it? It’s just more of an informational thing from the GC?

I think so, I’m not sure. I think the GC notices you are allocating bigger and bigger chunks of memory and warns you about it because there might be a way to avoid those allocations. But if your app needs to work with stuff in memory then I think those warnings are inevitable.

That said… I didn’t get any warning when I tried your code sample. Maybe it works differently on a Mac.


Wow, asterite is on fire today

The question I have is:
how does a 190MB file turn into 536875008 bytes (536.87MB) of data?
Is this because of the GC, and will that data be freed later?

I think it’s because the buffer in IO::Memory starts small and doubles over time, so for the last resize the GC has to hold at least 190MB (the old buffer) plus 190MB*2 (the new one, since capacities are rounded up to powers of two; 190MB*2 rounded up that way is 512MB, which is about the 536875008 bytes reported). And I think at that point it can’t free the old bytes… I’m not sure.


@asterite Shouldn’t you be able to do:

read, write = IO.pipe

Process.run("jq", ["."], input: ARGF, output: write)

pp read.gets_to_end

echo '{"name": "Jim"}' | ./test

It just seems to hang. But I’m probably just missing something…

Yes, that’s why I used spawn. I can’t remember exactly why this is needed, but I think it’s that a pipe has a small buffer and, once it’s full, writes block until the reader starts consuming it; and that won’t happen if the reader is in the same fiber. But I thought Process did a spawn of its own… I’m sure someone knows the answer to this, but for me it’s still blurry.
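
If that’s right, one way to unblock the snippet above is to move Process.run into its own fiber so the main fiber is free to drain the pipe; a sketch:

read, write = IO.pipe

spawn do
  # Run jq in a separate fiber so the main fiber can consume the pipe;
  # otherwise the pipe's small buffer fills up and everything blocks.
  Process.run("jq", ["."], input: ARGF, output: write)
  write.close # signal EOF to the reader
end

pp read.gets_to_end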


I see this warning a lot in the Savi compiler, which often uses a hefty chunk of memory to compile a program.

And now it’s getting in the way of the integration tests I’m doing, where I test the error output of the compiler against certain test source trees: the sporadic presence or absence of these warnings is intermingled with the output I am testing.

I think I’m going to try to use the “warning proc” override mentioned by @asterite in this GitHub comment: GC Warning: Repeated allocation of very large block · Issue #2104 · crystal-lang/crystal · GitHub

I’ll set the warning proc to something with no effect (that doesn’t print), to hopefully disable the warning entirely.
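
Something like this is what I have in mind; a sketch only, assuming the stdlib’s LibGC binding exposes set_warn_proc (a wrapper around Boehm’s GC_set_warn_proc; check src/gc/boehm.cr for the exact alias names in your Crystal version):

# Assumption: LibGC.set_warn_proc binds Boehm's GC_set_warn_proc.
# Install a warn proc that does nothing, so no GC warning is ever printed.
LibGC.set_warn_proc ->(msg : LibC::Char*, word : LibGC::Word) do
  # intentionally empty: swallow all GC warnings
end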

I’m writing here in case this workaround idea is helpful to someone else.

Actually I just noticed that it looks like Crystal will silence this by default in future releases, if I understand this PR correctly: GC/Boehm: Silence GC warnings about big allocations. by yxhuvud · Pull Request #11289 · crystal-lang/crystal · GitHub

That’s great!
