IO.copy buffer too small?

What’s the motivation behind the 4KB buffer in IO.copy? I guess it’s a trade-off between memory and number of syscalls/cpu usage, but might 4KB be a little too restrictive for modern systems? I don’t know if crystal is used much in embedded systems? A larger buffer (up to 128KB) performs up to 5 times better:

require "benchmark"

SIZES = { 4096, 8196, 16384, 131072, 262144 }
class IO
  {% for n in SIZES %}
  def self.copy_{{n}}(src, dst, limit : Int) : UInt64
    raise ArgumentError.new("Negative limit") if limit < 0

    limit = limit.to_u64

    buffer = uninitialized UInt8[{{n}}]
    remaining = limit
    while (len = src.read(buffer.to_slice[0, Math.min(buffer.size, Math.max(remaining, 0))])) > 0
      dst.write buffer.to_slice[0, len]
      remaining -= len
    end
    limit - remaining
  end
  {% end %}
end

File.open("/dev/zero") do |zero|
  File.open("/dev/null", "w") do |null|
    Benchmark.ips do |x|
      {% for n in SIZES %}
        x.report("copy 1GB with {{n}}B buffer") do
          IO.copy_{{n}}(zero, null, 1_048_576)
        end
      {% end %}
    end
  end
end
  copy 1GB with 4096B buffer   2.89k (346.34µs) (± 0.65%)  0.0B/op   4.92× slower
  copy 1GB with 8196B buffer   4.10k (243.67µs) (± 2.03%)  0.0B/op   3.46× slower
 copy 1GB with 16384B buffer   6.93k (144.22µs) (± 1.56%)  0.0B/op   2.05× slower
copy 1GB with 131072B buffer  13.52k ( 73.97µs) (± 1.50%)  0.0B/op   1.05× slower
copy 1GB with 262144B buffer  14.21k ( 70.37µs) (± 1.09%)  0.0B/op        fastest

We have seen similar “read-world” results in our server application by overriding IO.copy with a 128KB buffer.

No motivation at all. 4KB is a typical number for a stack buffer. Feel free to open a PR in GitHub.

That said, the larger a stack buffer, the longer it takes LLVM to compile the program. However, I tried it with 128KB (I guess that’s 131072?) and it made no difference, so maybe that number isn’t that big for LLVM or they fixed something…

1 Like

I think that every now and then the size of the buffer wants to be tweaked and is hard to find a one-size-fit-all.

You make it possible in #7930 to change the buffer size for a IO::Buffered instance.

The IO::Buffered default is 8192 instead of 4096 as in IO.copy.

I’ve found a similar issue in Rust https://github.com/rust-lang/rust/issues/49921

So, I think

  1. we can change the default to 8192 to align things.
  2. I would like to have a way for people to tweak the default buffer size affecting IO.copy and IO::Buffered. Maybe a macro that could offer an api IO.set_default_buffer_size 262144

Since it is in the prelude and it needs to be a constant to use it as an argument for StaticArray it’s a bit trickier. But might be doable. As long as the API to change the default buffer size is nice, I’m satisfied.

WDYT?

2 Likes

Here are some ideas:

Allow defining specific copy definitions for different buffer sizes

I was thinking of something like this:

module IO
  macro def_copy(buffer_size)
    def self.copy_{{buffer_size))(src, dst)
      # ...
    end
  end
end

IO.def_copy 8192

IO.copy_8192(src, dst)

IO.set_default_buffer_size

The one Brian mentioned, which I think is nice. However, since this is a global thing maybe a shard wants to do it one way, maybe your code wants a different size, who wins? Load order, which is not intuitive.

Allow configuring the default IO.copy buffer size with environment var at compile time

That way there’s only one place you can tweak this setting. And you can configure it with an env var at compile time.

1 Like

Wonder if this would influence the benchmarks comparing with sendfile… https://github.com/crystal-lang/crystal/pull/8926

The cons for using an env var is that I don’t want to abuse of that feature solve this kind of thing.

The other day we’ve been having some ideas with waj about how we could inject configuration values like this and others. But is a story for the future.

If the cons of the load order in the macro is too much, I am happy to settle on the env var solution.

The load order problem is only a problem if the buffer size actually matters (in any way other than execution speed)

Even if some shard sets a certain butter size for any reason, it will still work if the user overrides it with a value earlier in the load order.

If that is the case, I would go with a sensible default, overridable in code with a macro as suggested above, but with a check that ensures the first try to set a new value wins.

If a shard sets a value, it gets used, unless the user sets his own value before requiring the shard.
everyone is happy?

Maybe the buffer doesn’t have to be on the stack? Allocate on the heap, use it, deallocate it? In case stack size is the problem that is?

Could you add more powers of 2 to your list of test sizes, to help hunt for the sweet spot, as well? :)

Is it even theoretically possible to have a sweet spot? With all the different operating systems, cpu architectures, system generations, usage cases?

Thanks @bcardiff for the link.
Yeha this is much needed. Right now with such a small buffer size standard HTTP operations when reading files and pages abused the CPU because of the constant loops around new buffer allocation and and again and again.

It would be wonderful to just have a way to set the wanted buffer, and yeha a default which is higher then 4096 makes a LOT of sense in today’s world

@bcardiff I would also say there are other places where this is an issue: crystal/encoding.cr at 2d2b7799cdd4d3e53dee55f83229553f21f443d9 · crystal-lang/crystal · GitHub

We should set some overall way to configure buffers so it will effect all IOs.

One way to tweak the IO buffer size could be to do this:

class IO
  macro finished
    {% unless @type.constant("BUFFER_SIZE") %}
      BUFFER_SIZE = 8192
    {% end %}
  end

  def self.copy(src, dst) : Int64
    puts "copy with buffer size: #{BUFFER_SIZE}" # <= just to verify the value is correct 
    buffer = uninitialized UInt8[BUFFER_SIZE]
    count = 0_i64
    while (len = src.read(buffer.to_slice).to_i32) > 0
      dst.write buffer.to_slice[0, len]
      count &+= len
    end
    count
  end
end

Then the user can simply do this:

class IO
  BUFFER_SIZE = 131072
end

a = IO::Memory.new
b = IO::Memory.new

IO.copy(a, b)

It would be nice I think.

EDIT: However it’s a global thing.

1 Like

There are several distinct angles here, roughly ordered by effort:

  1. Customizing individual buffers
  2. Adjusting the global configuration to a maybe more appropriate default than 4K
  3. Customizing the global configuration

Numbers 2. and 3. need a global configuration (i.e. a constant) instead of magic numbers. This would be a great idea anyways.
And then there’s the question of whether a global configuration applies to all buffer applications, or if there should be separate configurations.
To answer that, we’ll need a list of all buffer applications in stdlib and think about whether it makes sense to treat them all the same or group them into different use cases.

The easiest implementation is certainly number 1., by adding a parameter for passing in an allocated buffer. This allows the user to choose the size and whether it’s heap or stack memory.

My opinion: please increase the default size to a number that other languages use. Go uses 32 * 1024 = 32768. C# uses 81920. Crystal uses 4096, which is too small these days.

I would increase it to 32768 permanently, without allowing configuration. Then I’d say if anyone still complains. If so, I’d introduce a configuration option. But configuring this and using static arrays is hard! For IO.copy maybe we can introduce an IO.copy(src, dst, buffer_size) macro, but for String#gets_to_end it’s hard because it’s an instance method.

3 Likes

Another opinion: in general when you design a product and there are multiple options, the thing you shouldn’t do is give the user options. Instead, you should try to choose the best option you can think of. Only when that’s not enough you should introduce options.

3 Likes

If we want such customizability, I’d just offer a buffer argument which the user can use with any kind of buffer. That removes the need for customizing the internal static array.
Golang provides that same option with io.CopyBuffer.

This sounds good for IO.copy :+1:

But I’m wondering if it’s also a good choice for other buffers. Golang’s bufio package for example uses 4K as well.

Note that the buffer in Go is allocated on the heap. It’s different than Crystal.

I agree with @asterite that the first action can be increasing to 32k most of the buffers (especially network based one).

On a latter date introducing some global or method bound option to customize it might be beneficial as well, especially if your application reads large files over the network or from disk, or if the application want to optimize CPU/Mem.

Right now I see critical CPU consumption issues when handling .gets_to_end because of the low buffer size, this is relevant to all HTTP operations and is felt strongly when using multiple Fibers because it will hang all operations while looping in mem allocation instead of actually allowing other Fibers to work while doing IO operations which is the whole point.

I would love to contribute a PR of mass changing all buffers to 32k if that’s needed

1 Like

For starters, I’d say that identifying those buffers is a good start to get the ball rolling.

2 Likes