The Crystal Programming Language Forum

IO.copy buffer too small?

What’s the motivation behind the 4KB buffer in IO.copy? I guess it’s a trade-off between memory and the number of syscalls/CPU usage, but might 4KB be a little too restrictive for modern systems? I don’t know whether Crystal is used much in embedded systems. A larger buffer (up to 128KB) performs up to 5 times better:

require "benchmark"

SIZES = { 4096, 8192, 16384, 131072, 262144 }
class IO
  {% for n in SIZES %}
  def self.copy_{{n}}(src, dst, limit : Int) : UInt64
    raise ArgumentError.new("Negative limit") if limit < 0

    limit = limit.to_u64

    buffer = uninitialized UInt8[{{n}}]
    remaining = limit
    while (len = src.read(buffer.to_slice[0, Math.min(buffer.size, Math.max(remaining, 0))])) > 0
      dst.write buffer.to_slice[0, len]
      remaining -= len
    end
    limit - remaining
  end
  {% end %}
end

File.open("/dev/zero") do |zero|
  File.open("/dev/null", "w") do |null|
    Benchmark.ips do |x|
      {% for n in SIZES %}
        x.report("copy 1GB with {{n}}B buffer") do
          IO.copy_{{n}}(zero, null, 1_048_576)
        end
      {% end %}
    end
  end
end
  copy 1MB with 4096B buffer   2.89k (346.34µs) (± 0.65%)  0.0B/op   4.92× slower
  copy 1MB with 8192B buffer   4.10k (243.67µs) (± 2.03%)  0.0B/op   3.46× slower
 copy 1MB with 16384B buffer   6.93k (144.22µs) (± 1.56%)  0.0B/op   2.05× slower
copy 1MB with 131072B buffer  13.52k ( 73.97µs) (± 1.50%)  0.0B/op   1.05× slower
copy 1MB with 262144B buffer  14.21k ( 70.37µs) (± 1.09%)  0.0B/op        fastest

We have seen similar real-world results in our server application by overriding IO.copy with a 128KB buffer.
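In case it’s useful, here is roughly what that override looks like, as a simplified sketch (the real IO.copy also has a limit overload, which is omitted here):

class IO
  # Redefines the two-argument IO.copy to use a 128KB stack buffer.
  def self.copy(src, dst) : UInt64
    buffer = uninitialized UInt8[131_072]
    count = 0_u64
    while (len = src.read(buffer.to_slice)) > 0
      dst.write buffer.to_slice[0, len]
      count += len
    end
    count
  end
end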

No motivation at all. 4KB is a typical size for a stack buffer. Feel free to open a PR on GitHub.

That said, the larger a stack buffer, the longer it takes LLVM to compile the program. However, I tried it with 128KB (I guess that’s 131072?) and it made no difference, so maybe that number isn’t that big for LLVM or they fixed something…


I think that every now and then the buffer size will want to be tweaked, and it’s hard to find a one-size-fits-all value.

You made it possible in #7930 to change the buffer size for an IO::Buffered instance.

The IO::Buffered default is 8192 instead of 4096 as in IO.copy.
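For reference, per-instance tweaking after #7930 looks something like this (“data.bin” is just a placeholder file name):

File.open("data.bin") do |file|
  file.buffer_size = 65_536 # must be set before the buffer is first used
  file.gets_to_end
end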

I’ve found a similar issue in Rust: https://github.com/rust-lang/rust/issues/49921

So, I think:

  1. We can change the default to 8192 to align things.
  2. I would like to have a way for people to tweak the default buffer size affecting both IO.copy and IO::Buffered. Maybe a macro that could offer an API like IO.set_default_buffer_size 262144.

Since it is in the prelude, and it needs to be a constant to use it as an argument for StaticArray, it’s a bit trickier. But it might be doable. As long as the API to change the default buffer size is nice, I’m satisfied.

WDYT?


Here are some ideas:

Allow defining specific copy definitions for different buffer sizes

I was thinking of something like this:

class IO
  macro def_copy(buffer_size)
    def self.copy_{{buffer_size}}(src, dst)
      # ...
    end
  end
end

IO.def_copy 8192

IO.copy_8192(src, dst)

IO.set_default_buffer_size

The one Brian mentioned, which I think is nice. However, since this is a global setting: maybe a shard wants one size and your code wants a different one, so who wins? Load order, which is not intuitive.

Allow configuring the default IO.copy buffer size with an environment var at compile time

That way there’s only one place where you can tweak this setting, and you configure it with an env var at compile time.
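A minimal sketch of how that could look, using Crystal’s compile-time env macro method (the variable name CRYSTAL_IO_BUFFER_SIZE is made up here):

class IO
  # Baked in at compile time; still a constant, so it can size a StaticArray.
  DEFAULT_BUFFER_SIZE = {{ (env("CRYSTAL_IO_BUFFER_SIZE") || "8192").id }}
end

Then something like CRYSTAL_IO_BUFFER_SIZE=262144 crystal build app.cr tweaks it in that one place.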

Wonder if this would influence the benchmarks comparing with sendfile… https://github.com/crystal-lang/crystal/pull/8926

The con of using an env var is that I don’t want to abuse that feature to solve this kind of thing.

The other day waj and I were tossing around some ideas about how we could inject configuration values like this and others. But that’s a story for the future.

If the load-order con of the macro is too much, I am happy to settle on the env var solution.

The load order problem is only a problem if the buffer size actually matters (in any way other than execution speed).

Even if some shard sets a certain buffer size for any reason, it will still work if the user overrides it with a value earlier in the load order.

If that is the case, I would go with a sensible default, overridable in code with a macro as suggested above, but with a check that ensures the first attempt to set a new value wins.

If a shard sets a value, it gets used, unless the user sets their own value before requiring the shard.
Everyone is happy?
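A sketch of that first-setter-wins check, guarding with the has_constant? macro method (the macro name here is made up, and it assumes later expansions see constants defined by earlier ones):

macro set_default_io_buffer_size(size)
  {% unless IO.has_constant?("DEFAULT_BUFFER_SIZE") %}
    class ::IO
      DEFAULT_BUFFER_SIZE = {{size}}
    end
  {% end %}
end

set_default_io_buffer_size 262_144 # first call wins (e.g. the app)
set_default_io_buffer_size 4_096   # later calls (e.g. a shard) expand to nothing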

Maybe the buffer doesn’t have to be on the stack? Allocate it on the heap, use it, deallocate it? In case stack size is the problem, that is.
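Something like this, assuming the GC takes care of freeing the buffer (the method name copy_heap is made up):

class IO
  def self.copy_heap(src, dst, buffer_size = 131_072) : UInt64
    # Heap allocation: the size no longer needs to be a compile-time constant.
    buffer = Bytes.new(buffer_size)
    count = 0_u64
    while (len = src.read(buffer)) > 0
      dst.write buffer[0, len]
      count += len
    end
    count
  end
end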

Could you add more powers of 2 to your list of test sizes, to help hunt for the sweet spot, as well? :)

Is it even theoretically possible to have a sweet spot? With all the different operating systems, cpu architectures, system generations, usage cases?