Initializing an Array of 1M+ elements at compile time

I have a method, called inside a loop, that performs a numeric computation using a constant array of 1M+ elements. Currently, it generates this array every time it's called.

How can I generate the array at compile time so I can use it without generating the 1M+ elements at run time?

I suppose you could generate the literal in a macro, but you'll likely run into extremely long compile times and crazy large binaries if you do it in native Crystal with an array that size. I have a few tables I use for a sinc resampler; one of them is only 340k Float32 elements in size, and I ran into this. I forget exactly what was happening…

To get around the compiler wackiness, I moved the tables to a C library as globals, built a static library at shard install time, then used that library and referenced the globals using lib. This worked perfectly in the end. You can see this happening in sinc.cr, sinc-table-best.c, and the Makefile within this repo: RemiAudio: Files in src/remiaudio/resampler/ of tip

Also check out the postinstall in the Shard: RemiAudio: shard.yml at tip
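
For illustration, the general shape of that approach looks something like this. The identifiers here are invented for the sketch (not the actual RemiAudio ones), and the C side is assumed to be compiled into a static library by the postinstall step:

# Hypothetical C side (compiled into libtables.a at shard install time):
#
#   const float sinc_table[340000] = { /* generated values */ };
#   const float *sinc_table_ptr = sinc_table;
#
# Crystal side: bind the exported pointer and wrap it in a read-only Slice.
# The data lives in the library's static storage, so nothing is generated
# at run time.
@[Link("tables")]
lib LibTables
  $sinc_table_ptr : Float32*
end

SINC_TABLE = Slice.new(LibTables.sinc_table_ptr, 340_000, read_only: true)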

Maybe this issue got fixed since the last time I messed with Crystal, but I don’t think it did?

EDIT: some relevant issues I found in the Crystal repo:

Array must be initialized at runtime. An array literal expands to a series of #<< calls, for example.
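
Conceptually, the expansion looks something like this (simplified; the compiler's actual expansion writes into the buffer directly, but either way it happens at run time):

a = [1, 2, 3]

# behaves roughly like:
a = Array(Int32).new(initial_capacity: 3)
a << 1
a << 2
a << 3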

If the data is static you might not actually need Array though. If Slice is sufficient and the element type is a primitive, then the new Slice.literal could work.
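
For example, with a made-up squares table (depending on your compiler version, Slice.literal may be gated behind an experimental flag, so check the API docs for your release):

SQUARES = Slice.literal(0, 1, 4, 9, 16, 25)

SQUARES[3]         # => 9
SQUARES.read_only? # => true

The elements are baked into the binary's read-only data section, so nothing is allocated or computed at run time.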

While the others have given hints on how to produce it at actual compile time, I wonder if that is not just complicating things more than it needs to be. From the description it sounds like it would probably suffice to put the array in a variable and then reuse it between invocations.
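
A minimal sketch of that suggestion, with a stand-in computation (the real per-element math goes in the block):

def compute_table : Array(Int32)
  # placeholder formula; replace with the actual computation
  Array(Int32).new(1_658_880) { |i| i % 19 }
end

TABLE = compute_table # evaluated once, at program start

def hot_method(i : Int32) : Int32
  TABLE[i % TABLE.size] # every call reuses the same array
end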

If you are happy doing it at compile time, that means you only need to do it once. And if you only need to do it once, then you don't need to do it at compile time, because allocating it takes a negligible amount of time. How negligible? You are saving between microseconds and milliseconds of startup time, depending on the size of the array elements.

Here’s a trivial benchmark:

require "benchmark"

struct Bytes32
  getter v : StaticArray(UInt8, 32)
  def initialize
    @v = StaticArray(UInt8, 32).new(0)
  end
end

struct Bytes64
  getter v : StaticArray(UInt8, 64)
  def initialize
    @v = StaticArray(UInt8, 64).new(0)
  end
end

struct Bytes128
  getter v : StaticArray(UInt8, 128)
  def initialize
    @v = StaticArray(UInt8, 128).new(0)
  end
end

SIZES = [1, 2, 4, 8, 32, 64, 128]
N = 1_000_000

puts "Benchmarking allocation of #{N} elements of different sizes"
puts "-----------------------------------------------------"

SIZES.each do |size|
  puts "
--> Element size: #{size} bytes"
  Benchmark.ips do |x|
    case size
    when 1
      x.report("Array(UInt8)") { Array(UInt8).new(N, 0_u8) }
    when 2
      x.report("Array(UInt16)") { Array(UInt16).new(N, 0_u16) }
    when 4
      x.report("Array(UInt32)") { Array(UInt32).new(N, 0_u32) }
    when 8
      x.report("Array(UInt64)") { Array(UInt64).new(N, 0_u64) }
    when 32
      x.report("Array(Bytes32)") { Array(Bytes32).new(N) }
    when 64
      x.report("Array(Bytes64)") { Array(Bytes64).new(N) }
    when 128
      x.report("Array(Bytes128)") { Array(Bytes128).new(N) }
    end
  end
end

And here’s the output on my machine, running it as `crystal run --release alloc_bench.cr`:

Benchmarking allocation of 1000000 elements of different sizes
-----------------------------------------------------

--> Element size: 1 bytes
Array(UInt8)  46.52k ( 21.49µs) (± 3.34%)  0.95MB/op  fastest

--> Element size: 2 bytes
Array(UInt16)  27.13k ( 36.86µs) (± 7.55%)  1.91MB/op  fastest

--> Element size: 4 bytes
Array(UInt32)  10.85k ( 92.21µs) (± 9.40%)  3.81MB/op  fastest

--> Element size: 8 bytes
Array(UInt64)   2.09k (477.46µs) (± 7.09%)  7.63MB/op  fastest

--> Element size: 32 bytes
Array(Bytes32) 369.98  (  2.70ms) (± 7.21%)  30.5MB/op  fastest

--> Element size: 64 bytes
Array(Bytes64) 192.54  (  5.19ms) (± 4.22%)  61.0MB/op  fastest

--> Element size: 128 bytes
Array(Bytes128)  96.72  ( 10.34ms) (± 3.45%)  122MB/op  fastest

And you only save that much assuming the compile-time instantiation takes no time (which I don't know for sure).

Also, yes, you probably want to use a Slice if you can.

Footnote: the 32/64/128-byte cases are much slower because the elements are more complex than the 1/2/4/8-byte ones, but your use case may be more like that, who knows.

Footnote 2: why are you creating the array inside the loop? Move it out man!

I did not know about Slice.literal, but this sounds like what I want.

I only need to generate this collection once, then use it many times inside another method for computations.

It would be something like:

residuesP19 = Slice(Int32).literal(x,x,x)

But I need to programmatically generate 1,658,880 (x, x, x) data values to put in it.

The most important thing is I only want to generate this collection once, whether it be at compile or run time.
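
If compile-time generation is still attractive, the argument list could in principle be produced by a macro expression that expands into the Slice.literal call. Here is a small sketch with a placeholder range and formula (and keep in mind the warning earlier in this thread about compile times for tables this large):

# The range and formula are placeholders for the real generator logic.
RESIDUES_P19 = Slice.literal({{ (0...1024).map { |i| i % 19 }.splat }})

# At compile time this expands to:
#   RESIDUES_P19 = Slice.literal(0, 1, 2, ..., 18, 0, 1, ...)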

Then, how long does it take to instantiate your array?

This is a classic memoization problem: caching a heavy setup routine so it runs only once. I'm not sure how others have conquered this hill, but class vars seem to be working OK for me. This little script is pseudo-code, but you can see the pattern.

class Foo
  @@big_table : Array(SomeStruct)?
  @@mutex = Mutex.new

  # Build and cache the table; call this once during setup.
  def memoize_big_table : Nil
    @@mutex.synchronize do
      @@big_table ||= long_running_setup
    end
    nil # defeat implicit return for lazy init / call to memoize
  end

  def big_table : Array(SomeStruct)
    @@big_table || Array(SomeStruct).new # empty fallback if not memoized yet
  end

  def long_running_setup : Array(SomeStruct)
    a = Array(SomeStruct).new
    # your init code
    a
  end
end

After this runs once, it should technically be cached for all subsequent calls for as long as the process is running. The mutex may not be necessary for your application, but if it's possible for multi-threaded / fiber-based clients to call the init simultaneously, it's a good idea to protect it to avoid a race condition.

This also works fine if the class var holds the class's own type. Just use something like @@cached_foo : Foo? and adjust to suit. I prefer explicit calls during setup to invoke the memoization function before making it available in hot paths, but you don't have to do it that way. You could just set it in the getter directly with something like this:

def big_table : Foo
  @@cached_foo ||= long_running_returns_foo
end

If there’s a more canonical way to do this I’m all ears! cheers

Thanks for the suggestion.
I’ll study it and see if it can fit my use cases.