A bioinformatics user can use it happily, even if it’s very slow.

If it’s really slow, I don’t think they’ll be very happy.
A thing to note: `Array` is a very generic collection type which should work well for many purposes. It’s actually quite similar to `Slice`, with the main difference that `Array` can grow dynamically. That makes it easy to use when you don’t know in advance how much space you need: it simply grows to fit. The way this growth mechanism is implemented doesn’t work well for large sizes, though; reallocating a buffer that is gigabytes in size is quite costly. So dynamic allocations that big would be better served by a different approach anyway (ref Is increasing array index size in the future? - #3 by RX14).

`Slice` should certainly support sizes beyond `Int32::MAX`.
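To make that difference concrete, here is a minimal sketch (my own illustration, not from the thread): the `Array` below starts small and reallocates its buffer repeatedly as it grows, while the `Slice` is allocated once at its final size.

```crystal
# Array: capacity is managed for you; growing past the current capacity
# triggers a reallocation (cheap for small arrays, costly at gigabyte sizes).
arr = Array(Int64).new
1_000_000.times { |i| arr << i.to_i64 }

# Slice: one fixed-size allocation up front; it never grows or shrinks.
buf = Slice(Int64).new(1_000_000)
1_000_000.times { |i| buf[i] = i.to_i64 }
```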
If you override `BigArray#each`, then there are no significant differences between the two:

```crystal
class BigArray(T)
  def each(& : T ->)
    each_index64 do |i|
      yield unsafe_fetch(i)
    end
  end

  # `Indexable#each_index` yields `Int32`s instead
  def each_index64(& : Int64 ->) : Nil
    i = 0_i64
    while i < size
      yield i
      i += 1
    end
  end
end
```
My guess is that if `i` is still an `Int32` then it could overflow before it reaches the higher indices, whereas if `i` and `size` have the same type then LLVM is smart enough to prove that the line `i += 1` cannot overflow, and drop the overflow check on each iteration. (Note that simply replacing it with `i &+= 1` is wrong, as it’d end up with `unsafe_fetch(Int32::MIN)`.)
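A small illustration of that last point (my own example, not from the thread): with the wrapping operator the index silently jumps from `Int32::MAX` to `Int32::MIN`, whereas the checked `+` raises instead.

```crystal
i = Int32::MAX
i &+= 1 # wrapping add: no overflow check
p i     # => -2147483648 (Int32::MIN); unsafe_fetch(i) would read far out of bounds

begin
  j = Int32::MAX
  j += 1 # checked add: raises rather than wrapping
rescue OverflowError
  p "checked + raised OverflowError"
end
```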
Actually, I wasn’t as rigorous with my language as I could have been. For my use cases that need larger indexing, I am using `Slice`s, because I know their sizes before their creation and they never grow or shrink. I kind of forgot that in Crystal `Array`s are dynamic, so their capacity changes have to be managed.

So yes, I currently really need larger indices for `Slice`s, but it would also be nice to have them for `Array`s too, which would provide more flexibility in designing code.
Nice catch. I overrode `each_index` to yield `Int64` values, since `Indexable` methods can take any `Int` type, and, sure enough, iteration is consistently 0.48% slower (very minor, but the consistency is notable) instead of 80-81%.
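For anyone who wants to reproduce a comparison along those lines, a rough sketch with the stdlib `Benchmark` module could look like this (not the code that produced the numbers above; it also assumes `BigArray` exposes `new` and `<<` like `Array` does):

```crystal
require "benchmark"

plain = Array(Int32).new(10_000_000) { |i| i }
big = BigArray(Int32).new
plain.each { |v| big << v }

Benchmark.ips do |x|
  x.report("Array#each")    { sum = 0_i64; plain.each { |v| sum += v }; sum }
  x.report("BigArray#each") { sum = 0_i64; big.each { |v| sum += v }; sum }
end
```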
As I also ran into use cases where the 2B limit was exceeded, I took the implementations of Array, Hash, and Set from the standard library and changed the index type to 64-bit:

There may be bugs, but in the examples where I have used it so far, it worked well as a drop-in replacement. I did not run a benchmark, so I cannot comment on whether it introduces overhead. But if you hit the addressing limits and are stuck, migrating to the “Big” types should be straightforward (Array → BigArray, Hash → BigHash, Set → BigSet).
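For illustration, the migration could look like the sketch below; this is hypothetical usage that assumes the Big* types mirror the stdlib constructors and operators, which the post above doesn’t spell out.

```crystal
ids    = BigArray(Int64).new               # was: Array(Int64).new
counts = BigHash(String, Int64).new(0_i64) # was: Hash(String, Int64).new(0_i64)
seen   = BigSet(String).new                # was: Set(String).new

ids << 42_i64
counts["reads"] += 1
seen << "sample-1"
```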
I’m curious to know what the performance differences look like between these and the stdlib equivalents, and what differences there are in implementation (if any). If it’s reasonable and there aren’t many differences, I don’t see why we couldn’t increase the size of the stdlib containers.
Having just caught up on the last 6-ish months of this thread, I want to resurface this analysis from an earlier reply:
It’s easy to dismiss this cost if I have a small number of very large, performance-relevant `Array`s: if I have a 400-million-element array that’s always growing, the allocation time and memory usage cost of those extra eight bytes will be truly, utterly negligible. But if I instead have, for example,

then the difference in pointer size becomes non-negligible (320 megabytes, in the first case). It’s still not likely to be the main factor in your program’s performance, and someone would probably be justified in arguing that you should be using other data structures or algorithms to solve the same problems, but, as has been pointed out, if you’re concerned about handling arrays with tens of billions of elements then you probably shouldn’t be using `Array`.
Edit with a minor clarification of my intent:
I say all this not to discount the validity of the applications in which folks are using very large arrays, but rather to point out that there are other reasonable situations in which a change in the `Array` index type could result in noticeable performance regressions. There is a real (potential) cost to a change like this, and while that doesn’t mean the change definitely shouldn’t happen, it’s just not reflective of reality to state that there’s no reason not to make it.
Those sizes are slightly inaccurate. We currently have:
```crystal
struct ReferenceStorage(Array(T))
  @crystal_type_id : Int32 # exposition only
  @size : Int32
  @capacity : Int32
  @offset_to_buffer : Int32
  @buffer : T*
end
```
The current size is still 24 bytes, because `@size` has default field alignment. The `@offset_to_buffer` exists to optimize `#shift` so that it runs in constant rather than linear time, and it too needs to be extended to `UInt64`. Along with `@size` and `@capacity`, that brings the new size to 40 bytes, due to the extra padding after the type ID field; on 64-bit targets there is no way around it, while on 32-bit targets `@buffer` could be moved to the top to eliminate the padding and reduce the size to 32 bytes.
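The arithmetic can be double-checked with a couple of stand-in structs. This is my own sketch with an assumed field layout, not the real internal types; the numbers are for a 64-bit target.

```crystal
struct CurrentHeader
  @crystal_type_id : Int32 = 0
  @size : Int32 = 0
  @capacity : Int32 = 0
  @offset_to_buffer : Int32 = 0
  @buffer : Pointer(Void) = Pointer(Void).null
end

struct WidenedHeader
  @crystal_type_id : Int32 = 0 # 4 bytes, then 4 bytes of padding before the first UInt64
  @size : UInt64 = 0_u64
  @capacity : UInt64 = 0_u64
  @offset_to_buffer : UInt64 = 0_u64
  @buffer : Pointer(Void) = Pointer(Void).null
end

puts sizeof(CurrentHeader) # => 24
puts sizeof(WidenedHeader) # => 40
```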
Also, I believe `Enumerable` will break in subtle ways if an includer overrides `#size` to return `UInt64` (we should probably remove `Enumerable#size` and leave only `Indexable#size` intact, but still).
I don’t think the size and possibly performance concerns are very relevant.
The stdlib implementations of data structures are meant to be generic solutions that provide a “good enough” implementation for a broad range of common use cases. They should be universal and safe to use. Performance is of course relevant, but a lesser goal.
Currently we’re missing support for the use case of more than 2G items. (Although, as mentioned in Is increasing array index size in the future? - #22 by straight-shoota, it’s questionable whether that makes much sense for `Array` in particular.)
When performance becomes an issue for specific use cases such as huge numbers of very small collections, the answer should be specialized data structures with optimizations for the specific pattern.
This must not impede the generalization of standard data structures.
This is probably the “most” significant problem: we cannot easily break the API. But even that can be mitigated, as suggested in [RFC] Simplify integers and floats · Issue #8111 · crystal-lang/crystal · GitHub, by adding a new accessor `#size64` (or maybe a more generic `#size_t`?) while `#size` still returns `Int32` (and thus may overflow for big collections).
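A hedged sketch of what that could look like for a custom collection; the names `#size64` and `BigBuffer` are illustrative only, and none of this is an actual stdlib API.

```crystal
class BigBuffer(T)
  include Indexable(T)

  def initialize(@buffer : Slice(T))
  end

  # Full-width count for big collections.
  def size64 : Int64
    @buffer.size.to_i64
  end

  # Kept for API compatibility; raises OverflowError once the count
  # no longer fits in an Int32.
  def size : Int32
    size64.to_i32
  end

  def unsafe_fetch(index : Int) : T
    @buffer.unsafe_fetch(index)
  end
end
```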
FYI.
I just downloaded the current `Crystal-1.16.0-dev-1` nightly build, with the change to increase array indexes to `u64`. It solved the array indexing problem for one program, and it can now run to completion.
Here’s the thing.
If you can’t run a program because you can’t create big enough arrays, it doesn’t matter how much memory you can’t use to execute a program that won’t run.
This is needlessly sensational and irrelevant to the discussion. There has been nothing stopping you from implementing another array type or using one of the implementations mentioned in this thread.
As I mentioned twice now, I have no need for arrays that large, so the one I wrote, I clearly didn’t write for myself. I wrote it for you, so that you could do what you needed to do. You refused to use it. Throughout this thread, you’ve been acting like you’re being held back, but the only one holding you back this entire time has been yourself.
Huh…
Yo dude, you need to tone down the animus before you pop a vein.
It’s just programming, and you don’t tell me what I have to do.
Since jgaskins has already created BigArray, it might be worthwhile to try it out and share your feedback on its usability and performance. If more people start using it, there’s a chance it could eventually be included in the standard library.