Currently the index of arrays is maxed at Int32::MAX, i.e. (2**31 - 1) => 2147483647.
Are there plans to increase that to at least Int64::MAX?
I seriously need to create|address larger arrays.
I think any addition of larger arrays would be its own array type, optimized to not create the kind of massive contiguous allocations required by an array larger than u32_max. What's the use case you're running into here @jzakiya?
I want to generate as many prime numbers as possible.
The program below will generate primes until you run out of memory.
def sozp240(val, residues)
  res = residues
  md, rscnt = res.last - 1, res.size
  kmax = (val - 2) // md + 1
  prms = Array(UInt64).new(kmax, 0)
  modk, r, k = 0, -1, 0
  loop do
    if (r += 1) == rscnt; r = 0; modk += md; k += 1 end
    next if prms[k] & (1u64 << r) != 0
    prm_r = res[r]
    prime = modk + prm_r
    break if prime > Math.isqrt(val)
    res.each do |ri|
      kn, rn = (prm_r * ri - 2).divmod md
      bit_r = 1u64 << res.index(rn + 2).not_nil!
      kpm = k * (prime + ri) + kn
      while kpm < kmax; prms[kpm] |= bit_r; kpm += prime end
    end
  end
  prms
end
The program encodes the primes within the prime candidates array prms. I extract the primes individually from prms to use in another routine.
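To make the setup concrete, here is a hedged sketch of the residues table sozp240 presumably takes (my reconstruction; the post doesn't show it). For modulus 240 there are phi(240) = 64 residues coprime to 240, which is exactly why each resgroup fits in one UInt64:

```crystal
# Hypothetical residues table for sozp240 (my reconstruction; the post
# doesn't show it). md = res.last - 1 must equal 240, and each UInt64
# resgroup holds one bit per residue, so there must be exactly 64 of them.
residues = (3..241).select { |n| n.gcd(240) == 1 }

puts residues.size  # 64 == phi(240), one bit per residue
puts residues.first # 7 (2, 3, 5 divide the modulus and are excluded)
puts residues.last  # 241, so md = res.last - 1 = 240

# Then: prms = sozp240(1_000_000, residues)
```

A clear bit r in prms[k] would then correspond to the surviving candidate 240*k + residues[r].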
Currently I'm array-size limited to only doing 19 billion primes (~45 minutes with a single core). The max array index is (2**31 - 1) = 2,147,483,647 elements.
The 19 billionth prime is 491_701_819_267.
That takes (491_701_819_267 - 2) // 240 + 1 = 2,048,757,581 residue groups of 64 bits. That's 2_048_757_581 * 8 = 16,390,060,648 bytes (16.39 GB) of memory.
I have 40GB on my laptop, so I can|want to do a lot more.
The 40 billionth prime is 1,066,173,339,601. That would take an array of 1,066,173,339,601 // 240 = 4,442,388,915 resgroups (index values). That's 35,539,111,320 bytes (35.5GB) of memory.
I can easily do this, and find the max possible with my memory.
That's why I need to be able to create|address larger indexed arrays.
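The memory arithmetic above can be checked in a few lines (primes and the kmax formula are taken from the post; the variable names are mine):

```crystal
# Checking the post's memory arithmetic.
# kmax = (val - 2) // md + 1, md = 240, one UInt64 (8 bytes) per resgroup.
md  = 240_u64
p19 = 491_701_819_267_u64   # the 19 billionth prime
p40 = 1_066_173_339_601_u64 # the 40 billionth prime

k19 = (p19 - 2) // md + 1
k40 = (p40 - 2) // md + 1
puts k19               # 2048757581  -> still (barely) under Int32::MAX
puts k19 * 8           # 16390060648 bytes, ~16.39 GB
puts k40               # 4442388915  -> exceeds Int32::MAX = 2147483647
puts k40 * 8           # 35539111320 bytes, ~35.5 GB
puts k40 > Int32::MAX  # true: too big to be an Array index today
```

So the 19-billion-prime sieve still fits an Int32 index, but the 40-billion one does not, even though the buffer itself fits in 40GB of RAM.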
The same issue exists for String; its size is limited to Int32::MAX.
In bioinformatics, there are many software programs that can only run on computers with 64GB or 128GB of memory, for example STAR and Kraken, and several others. Most of these programs load large dictionary-like data into memory.
That's why my home computer also has 64GB or 128GB of memory: I want to try such software at home. I buy memory and build my own PCs to keep costs down, but memory is easily corrupted. For example, if the result of md5sum changes every time I check it, I can run MemTest86+ to find out that the memory is corrupted. Is it because I always buy cheap memory?
Of course, I don't have the skills to write such complex software myself, but after reading jzakiya's comment, I realized that arrays limited to Int32::MAX elements are not as large as I thought. I agree that Crystal's array and string sizes need to be expanded, and of course I know that Crystal is an OSS project and contributions are important.
In this language string benchmark, Crystal was slower only than Rust, ranking second on the 500M and 1G text tests; however, it failed the larger text file tests.
If Crystal supported big strings, it would have a more impressive result.
Among its language competitors, Rust can support strings up to 29.2 GB, and even Go can support strings up to 17.1 GB.
That benchmark was proven flawed in this forum post so I don’t think it’s worth considering for performance.
I second RX14's suggestion of a separate type, largely because the chance of people/businesses working with such large arrays is so uncommon that it could probably be implemented as a shard, compared to the String limit, which is surprisingly more common.
IIRC someone on the core team said Int32 was used for indexing to reduce the memory footprint of these data structures. That makes sense considering how ubiquitous they are. Increasing the indexable size of an Array from 32->64 bits would increase the baseline footprint of every array in every program from 16->24 bytes (8B pointer + 4B capacity + 4B size → 8B pointer + 8B capacity + 8B size). General-purpose data structures should be optimized for the general case, and N is small in the vast majority of use cases.
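The 16->24 byte estimate can be illustrated with two hypothetical header structs (the names and layouts are mine; Crystal's real Array internals have more fields, this only shows the 32- vs 64-bit size/capacity cost):

```crystal
# Hypothetical header layouts, not the real Array internals.
struct Header32
  @buffer = Pointer(UInt8).null # 8 bytes
  @size = 0_i32                 # 4 bytes
  @capacity = 0_i32             # 4 bytes
end

struct Header64
  @buffer = Pointer(UInt8).null # 8 bytes
  @size = 0_i64                 # 8 bytes
  @capacity = 0_i64             # 8 bytes
end

puts sizeof(Header32) # 16 on 64-bit targets
puts sizeof(Header64) # 24
```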
There's nothing stopping anybody from making a BigArray or BigString class, though. If you just need an array type for common operations like enumeration, indexing, and mutation-by-index, you can use this shard. The implementation is quite simple and most of the behavior lives in the Indexable and Indexable::Mutable mixins.
When you think about how SQL databases are often split between OLTP and OLAP use cases (OLAP DBs scale horizontally for a single query but have a large up-front query cost whereas OLTP databases have a trivial up-front cost but are very limited in how they can parallelize a single query), it makes sense to also optimize data structures for how much data they’re expected to hold and how they’re intended to be used.
But it's also possible that Array might perform better on 64-bit systems with 64-bit indexes considering some arithmetic operations are faster on 64-bit values and a lot of operations are performed on the index. Hard to predict, though, because it would involve larger allocations. Benchmarks would need to be run.
This seems to be a very simple thing to do that is being rationalized into being hard.
Almost every desktop, laptop, phone, RPI, tablet, etc, use 64-bit systems|hardware, thus their native internal registers and memory are laid out to be optimized for operations at their full bit size.
In fact, the 32-bit indexing limit is archaically out of date for hardware optimization, and application necessities.
Allowing for 64-bit indexing doesn't mean you are going to use the full 2^64 memory addresses; it means you can address upper-GB-size arrays, which already require 64-bit index values.
I can’t use Crystal in some of my applications because of this archaic limitation.
32-bits is just 2+ billion array elements. That just doesn’t cut it in 2024 and beyond. I’m forced to use Rust, et al to even think about doing certain applications.
Having 64-bit indexing should be a natural part of the standard system library, and not some external feature you need to install a shard for.
This reminds me of Bill Gates saying 640Kb should be enough for any application.
Do note that most languages, notably Go, C#, Java and Rust as you've mentioned, do not support more than 2^31-1 elements for arrays by default. In particular with Rust, this comment clearly states the limitations of such an array/vector. So it's not unusual that Crystal follows suit. As previously mentioned, it's optimised for the general use case.
Crystal is always improving, and now that it supports Int128, it makes writing bioinformatics software a little easier.
The amount of data we work with is growing every year, and it’s clear that we’ll need to overcome the size limit of Int32::MAX.
I think it's simply a matter of development resources and priorities. I expect Crystal and other languages to have some way of handling large strings and arrays in their standard libraries within the next decade.
I think this exact same thing a lot, and it's surprising how often the reason turns out to be circumstances I'm unaware of that make it more complicated.
You and I have both mentioned in this thread that 64-bit CPUs are optimized to deal with 64-bit values; however, that doesn't mean all 64-bit operations are faster every time. I just wrote up some benchmarks and, on my machine, the BigArray class I wrote that can index larger than 32-bit offsets is 20% slower to allocate and 81% slower to iterate through 1k of the exact same elements. I assumed allocations would be slower, but iteration being slower surprised me considering that both of them are doing the exact same pointer arithmetic on 32- vs 64-bit offsets.
I’m on an ARM64 CPU, so maybe it’ll be different on a different CPU architecture (or even other ARM64 CPUs that are not made by Apple), but this is why I said in my previous post that benchmarks would be important here. Performance is never as straightforward as people think it is. You have to see it running.
require "benchmark"
require "../src/big_array"

puts "Allocation"
values = [Array(String).new, BigArray(String).new]
Benchmark.ips do |x|
  x.report("Array") { values.unsafe_put 0, Array(String).new }
  x.report("BigArray") { values.unsafe_put 1, BigArray(String).new }
end
# Need a side effect to avoid LLVM optimizations, but we don't actually want it to print
pp values unless values

puts
puts "Indexing"
value = ""
Benchmark.ips do |x|
  array = Array(String).new
  big_array = BigArray(String).new
  1_000.times do |i|
    array << i.to_s
    big_array << i.to_s
  end
  x.report("Array") { array.each { |v| value = v } }
  x.report("BigArray") { big_array.each { |v| value = v } }
end
pp value unless value
If you don’t want to use a shard for it, that’s an understandable opinion to have, but sensationalizing your position by saying you’re being “forced” to use another language after I just gave you the blueprint to do what you want in Crystal is not as strong an argument as you might think it is.
I think it would be useful to add a BigArray or BigString shard to verify this.
I want to say this discussion is about what can or cannot be done. Yes, the benchmark above is not very meaningful, but as a competitor, why can Go do it while Crystal can't? That is the key point.
As said in the discussion, we can add a BigArray or BigString to avoid a potential impact on performance. That's better than not having it; bioinformatics users can use it happily, even if it's very slow.
An Int64-based Array seems like the "general" case, as it handles more data without error, whereas the Int32-based Array feels like the "specific" case: reduced capacity, breaking some use cases, for a performance boost.
The "popular" case is separate from "general" and "specific". Either could be the more popular choice at a point in time, and that could change over time.
Anyway, I was curious about benchmarks with C++. I created basic Vector32 and Vector64 types (the usual (data, size, capacity) triplet), one using int32_t and the other int64_t for size, capacity and indexing. Both hold a T = struct Point { double x, y; }.
Iteration of either using pointer increment was the same. Iteration with at(i), where i was int32_t or int64_t matching the vector's size type, was also the same.
But when I say "the same", they varied wildly per run, outperforming each other back and forth between 1% and 50%. Could be CPUs hitting sustained throughput and/or throttling. I'm on an Apple M2 Max notebook with all apps quit except the terminal.
My conclusion is that 32-bit is not a speed advantage on a 64-bit Apple ARM chip, at least in the C++ ecosystem.
Here is the heart of the benchmark:
template<typename VectorType>
double benchmark(const char* vector_name, int size) {
  VectorType vector;
  for (int64_t i = 0; i < size; ++i) {
    vector.push(Point{static_cast<double>(i), static_cast<double>(i * 2)});
  }

  auto start = std::chrono::high_resolution_clock::now();
  using index_t = typename VectorType::index_t;
  double last = 0.0;
  for (index_t i = 0; i < vector.size(); ++i) {
    last = vector.at(i).x;
  }
  auto end = std::chrono::high_resolution_clock::now();

  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  std::cout << std::format("{} iteration time: {} microseconds\n", vector_name, duration.count());
  std::cout << std::format("Last: {} (to prevent optimization)\n", last);
  return duration.count();
}
Could there be another source for the difference, like the built-in Array was compiled with --release and BigArray was not?
It was in the same program. As far as I’m aware, it’s not possible to split compilation like that in Crystal. The benchmark code is in the post (just gotta pop open the details element) if you’d like to check my work.
Strong disagree. Array sizes over 2B elements are rare. I’ve been programming for over 30 years and I’ve never needed that many elements in a single in-memory data structure. To be clear, I have no doubt that those use cases exist, but a general-purpose data structure should be optimized for the general case and N is usually small.
While this is correct, unfortunately it isn’t relevant to the discussion. Arrays are used for a broad spectrum of things in Crystal programs and I’d bet my house that >99% of those use cases never exceed 1M elements. When I say “generally”, at the very least, I mean the vast majority of ways arrays are used rather than a few use cases that many people have.
The sequential access itself is fast, but this and the intermittent array reallocations keep invalidating your entire L3 cache, hence the erratic run times (don't forget that all other processes on your system are also competing for the cache). The L3 cache size is most certainly well below Int32::MAX, so Int32 and Int64 shouldn't make a big difference in that benchmark.
One important perspective here is that 32-bit linear memory was rare when C/C++ originally appeared, so it was important that size_t could be smaller than 32 bits, so as not to waste any upper bits. It is not a suggestion that making it 64 bits on a 64-bit system would make the same data structures or algorithms equally fast. This is actually a problem for AVR as well if one tries to use Array directly there; Array#size won't become an Int64, and if it does become a platform-dependent integer type, it would be due to support for legacy / embedded targets, rather than performance.
There is also the problem that the larger a piece of contiguous allocation is, the more false positive references to it the Boehm GC will identify, and your block of memory will reside longer than necessary. Already the standard library specs allocate several 1 GiB buffers and often they simply don’t go away before the whole suite finishes.
A genuine application for 64-bit sizes is when dealing with memory-mapped I/O, which doesn't involve allocations; the best way to represent that data is via some kind of Slice with a 64-bit size field. See also: `File.read`ing a file bigger than 2GB results in an `OverflowError` · Issue #14485 · crystal-lang/crystal · GitHub
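As a sketch of that idea, a hypothetical 64-bit slice type might look like this (Slice64 is an invented name; the stdlib Slice is Int32-sized today, and the malloc'ed buffer below merely stands in for an mmap'ed region):

```crystal
# Hypothetical: a read-only view with a 64-bit size, e.g. over mmap'ed memory.
struct Slice64(T)
  getter size : Int64

  def initialize(@pointer : Pointer(T), @size : Int64)
  end

  def [](index : Int64) : T
    raise IndexError.new if index < 0 || index >= @size
    @pointer[index]
  end
end

bytes = Pointer(UInt8).malloc(16) # stand-in for a pointer returned by mmap
view = Slice64(UInt8).new(bytes, 16_i64)
puts view.size    # 16
puts view[0_i64]  # 0 (Pointer.malloc zero-initializes)
```

Since no allocation is owned by the view, such a type sidesteps both the GC false-reference problem and the Int32 size limit.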