A bioinformatics user can use it happily, even if it’s very slow.

If it’s really slow, I don’t think they’ll be very happy.
A thing to note: `Array` is a very generic collection type which should work well for many purposes. It’s actually quite similar to `Slice`, with the main difference that `Array` can grow dynamically. That makes it easy to use when you don’t know in advance how much space you need: it simply grows to fit. The way this growth mechanism is implemented doesn’t work well for large sizes, though; reallocating a buffer that is gigabytes in size is quite costly. So dynamic allocations that big would be better served by a different approach anyway (ref Is increasing array index size in the future? - #3 by RX14).

`Slice` should certainly support sizes beyond `Int32::MAX`.
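To make that difference concrete, here is a minimal sketch (my own illustration, not from the thread): the `Array` below starts small and reallocates its buffer repeatedly as it grows, while the `Slice` is allocated once at its final size.

```crystal
# Array: capacity is managed for you; growing past the current capacity
# triggers a reallocation (cheap for small arrays, costly at gigabyte sizes).
arr = Array(Int64).new
1_000_000.times { |i| arr << i.to_i64 }

# Slice: one fixed-size allocation up front; it never grows or shrinks.
buf = Slice(Int64).new(1_000_000)
1_000_000.times { |i| buf[i] = i.to_i64 }
```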
If you override `BigArray#each`, then there are no significant differences between the two:

```crystal
class BigArray(T)
  def each(& : T ->)
    each_index64 do |i|
      yield unsafe_fetch(i)
    end
  end

  # `Indexable#each_index` yields `Int32`s instead
  def each_index64(& : Int64 ->) : Nil
    i = 0_i64
    while i < size
      yield i
      i += 1
    end
  end
end
```
My guess is that if `i` is still an `Int32` then it could overflow before it reaches the higher indices, whereas if `i` and `size` have the same type then LLVM is smart enough to prove that the line `i += 1` cannot overflow, and drop the overflow check on each iteration. (Note that simply replacing it with `i &+= 1` is wrong, as it’d end up with `unsafe_fetch(Int32::MIN)`.)
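A small illustration of that last point (my own example, not from the thread): with the wrapping operator the index silently jumps from `Int32::MAX` to `Int32::MIN`, whereas the checked `+` raises instead.

```crystal
i = Int32::MAX
i &+= 1 # wrapping add: no overflow check
p i     # => -2147483648 (Int32::MIN); unsafe_fetch(i) would read far out of bounds

begin
  j = Int32::MAX
  j += 1 # checked add: raises rather than wrapping
rescue OverflowError
  p "checked + raised OverflowError"
end
```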
Actually, I wasn’t as rigorous with my language as I could have been. For my use cases that need larger indexing, I am using `Slice`s, because I know their sizes before their creation and they never grow or shrink. I kind of forgot that in Crystal `Array`s are dynamic, so their capacity changes have to be managed.

So yes, I currently really need larger indices for `Slice`s, but it would also be nice to have them for `Array`s too, which would provide more flexibility in designing code.
Nice catch. I overrode `each_index` to yield `Int64` values, since `Indexable` methods can take any `Int` type, and, sure enough, iteration is consistently 0.48% slower (very minor, but the consistency is notable) instead of 80-81%.
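For anyone who wants to reproduce a comparison along those lines, a rough sketch with the stdlib `Benchmark` module could look like this (not the code that produced the numbers above; it also assumes `BigArray` exposes `new` and `<<` like `Array` does):

```crystal
require "benchmark"

plain = Array(Int32).new(10_000_000) { |i| i }
big = BigArray(Int32).new
plain.each { |v| big << v }

Benchmark.ips do |x|
  x.report("Array#each")    { sum = 0_i64; plain.each { |v| sum += v }; sum }
  x.report("BigArray#each") { sum = 0_i64; big.each { |v| sum += v }; sum }
end
```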
As I also ran into use cases where the 2B limit was exceeded, I took the implementations of Array, Hash, and Set from the standard library and changed the index type to 64-bit:

There may be bugs, but in the examples where I have used it so far, it worked well as a drop-in replacement. I did not run a benchmark, so I cannot comment on whether it introduces overhead. But if you hit the addressing limits and are stuck, migrating to the “Big” types should be straightforward (Array → BigArray, Hash → BigHash, Set → BigSet).
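For illustration, the migration could look like the sketch below; this is hypothetical usage that assumes the Big* types mirror the stdlib constructors and operators, which the post above doesn’t spell out.

```crystal
ids    = BigArray(Int64).new               # was: Array(Int64).new
counts = BigHash(String, Int64).new(0_i64) # was: Hash(String, Int64).new(0_i64)
seen   = BigSet(String).new                # was: Set(String).new

ids << 42_i64
counts["reads"] += 1
seen << "sample-1"
```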
I’m curious to know what the performance differences look like between these and the stdlib equivalents, and what differences there are in implementation (if any). If it’s reasonable and there aren’t many differences, I don’t see why we couldn’t increase the size of the stdlib containers.
Having just caught up on the last 6-ish months of this thread, I want to resurface this analysis from an earlier reply:
It’s easy to dismiss this cost if I have a small number of very large, performance-relevant `Array`s: if I have a 400-million-element array that’s always growing, the allocation time and memory usage cost of those extra eight bytes will be truly, utterly negligible. But if I instead have, for example,

then the difference in pointer size becomes non-negligible (320 megabytes, in the first case). It’s still not likely to be the main factor in your program’s performance, and someone would probably be justified in arguing that you should be using other data structures or algorithms to solve the same problems, but, as has been pointed out, if you’re concerned about handling arrays with tens of billions of elements then you probably shouldn’t be using `Array`.
Edit with a minor clarification of my intent:
I say all this not to discount the validity of the applications in which folks are using very large arrays, but rather to point out that there are other reasonable situations in which a change in the `Array` index type could result in noticeable performance regressions. There is a real (potential) cost to a change like this, and while that doesn’t mean the change definitely shouldn’t happen, it’s just not reflective of reality to state that there’s no reason not to make it.
Those sizes are slightly inaccurate. We currently have:
```crystal
struct ReferenceStorage(Array(T))
  @crystal_type_id : Int32 # exposition only
  @size : Int32
  @capacity : Int32
  @offset_to_buffer : Int32
  @buffer : T*
end
```
The current size is still 24 bytes, because `@size` has default field alignment. The `@offset_to_buffer` exists to optimize `#shift` so that it runs in constant rather than linear time, and it too needs to be extended to `UInt64`. Along with `@size` and `@capacity`, that brings the new size to 40 bytes, due to the extra padding after the type ID field; on 64-bit targets there is no way around it, while on 32-bit targets `@buffer` could be moved to the top to eliminate the padding and reduce the size to 32 bytes.
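The arithmetic can be double-checked with a couple of stand-in structs. This is my own sketch with an assumed field layout, not the real internal types; the numbers are for a 64-bit target.

```crystal
struct CurrentHeader
  @crystal_type_id : Int32 = 0
  @size : Int32 = 0
  @capacity : Int32 = 0
  @offset_to_buffer : Int32 = 0
  @buffer : Pointer(Void) = Pointer(Void).null
end

struct WidenedHeader
  @crystal_type_id : Int32 = 0 # 4 bytes, then 4 bytes of padding before the first UInt64
  @size : UInt64 = 0_u64
  @capacity : UInt64 = 0_u64
  @offset_to_buffer : UInt64 = 0_u64
  @buffer : Pointer(Void) = Pointer(Void).null
end

puts sizeof(CurrentHeader) # => 24
puts sizeof(WidenedHeader) # => 40
```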
Also, I believe `Enumerable` will break in subtle ways if an includer overrides `#size` to return `UInt64` (we should probably remove `Enumerable#size` and leave only `Indexable#size` intact, but still).
I don’t think the size and possibly performance concerns are very relevant.
The stdlib implementations of data structures are meant to be generic solutions that provide a “good enough” implementation for a broad range of common use cases. They should be universal and safe to use. Performance is of course relevant, but a lesser goal.
Currently we’re missing support for the use case of more than 2G items. (Although, as mentioned in Is increasing array index size in the future? - #22 by straight-shoota, it’s questionable whether that makes much sense for `Array` in particular.)
When performance becomes an issue for specific use cases such as huge numbers of very small collections, the answer should be specialized data structures with optimizations for the specific pattern.
This must not impede the generalization of standard data structures.
This is probably the “most” significant problem: we cannot easily break the API. But even that can be mitigated, as suggested in [RFC] Simplify integers and floats · Issue #8111 · crystal-lang/crystal · GitHub, by adding a new accessor `#size64` (or maybe a more generic `#size_t`?) while `#size` still returns `Int32` (and thus may overflow for big collections).
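A hedged sketch of what that could look like for a custom collection; the names `#size64` and `BigBuffer` are illustrative only, and none of this is an actual stdlib API.

```crystal
class BigBuffer(T)
  include Indexable(T)

  def initialize(@buffer : Slice(T))
  end

  # Full-width count for big collections.
  def size64 : Int64
    @buffer.size.to_i64
  end

  # Kept for API compatibility; raises OverflowError once the count
  # no longer fits in an Int32.
  def size : Int32
    size64.to_i32
  end

  def unsafe_fetch(index : Int) : T
    @buffer.unsafe_fetch(index)
  end
end
```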
FYI.
I just downloaded the current `Crystal-1.16.0-dev-1` nightly build, with the change to increase array indexes to `u64`. It solved the array indexing problem for one program, and it can now run to completion.
Here’s the thing.
If you can’t run a program because you can’t create big enough arrays, it doesn’t matter how much memory you can’t use to execute a program that won’t run.
This is needlessly sensational and irrelevant to the discussion. There has been nothing stopping you from implementing another array type or using one of the implementations mentioned in this thread.
As I mentioned twice now, I have no need for arrays that large, so the one I wrote, I clearly didn’t write for myself. I wrote it for you, so that you could do what you needed to do. You refused to use it. Throughout this thread, you’ve been acting like you’re being held back, but the only one holding you back this entire time has been yourself.
Huh…
Yo dude, you need to tone down the animus before you pop a vein.
It’s just programming, and you don’t tell me what I have to do.
Since jgaskins has already created BigArray, it might be worthwhile to try it out and share your feedback on its usability and performance. If more people start using it, there’s a chance it could eventually be included in the standard library.