Maximum Hash size reached - so low

pfischer · July 21, 2022, 8:46pm

Hello,

I can’t store 150_000_000 items even into a simple Hash(Int32, Int32).

When I try to create Hash with precreated space - with initial_capacity in constructor - “Arithmetic overflow (OverflowError)” occurs in the Hash creation
When I ignore initial_capacity, “Maximum Hash size reached” error occurs while filling the Hash somwhere around 100_000_000 entries.

What…? Is 150_000_000 items really too much today? What about a billion entries? Why is maximum so low?

So I should implement my own BigHash?

Thanks! pf

straight-shoota · July 21, 2022, 9:45pm

This is unfortunately a limitation of the internal 32-bit index size.

See explanation here in the source code:

github.com

crystal-lang/crystal/blob/c4799c6289f900ee4a92d5dc89c3c8860010a6d9/src/hash.cr#L616-L619


      
          # After this it's 1 << 28, and with entries being Int32
          # (4 bytes) it's 1 << 30 of actual bytesize and the
          # next value would be 1 << 31 which overflows `Int32`.
          private MAXIMUM_INDICES_SIZE = 1 << 28

I’m afraid there is currently no alternative for collections using bigger size types in the standard library.

cf. Data structers for large datasets · Issue #8523 · crystal-lang/crystal · GitHub

This is a long-known problem but a solution that involves changing stdlib’s size type is hard. And apparently this limitation is rarely an issue in practice.

asterite · July 21, 2022, 11:18pm

Maybe you could describe your use case or particular problem. There might be a way to model it without a huge hash.

pfischer · July 21, 2022, 11:52pm

Imagine just a big in-memory index (object_ids → position in a file + some other metadata) or something like this.

It looks like it won’t be a big problem to copy stdlib Hash and make it based on Int64 (unfortunately, it will not be able to implement Enumerable (because size in Enumerable is Int32 etc).

mavu · July 22, 2022, 9:21am

Is my math off, or is that already 1.5gb of data if each entry has even only 10 bytes (total, including internaly used memory of the type)?

If you really need to work with such large indices of stuff, i would probably roll my own datatype.
That makes it easier later to do stuff like lazy loading of data or pagination when you run out of memory on the machine.

jgaskins · July 24, 2022, 1:35am

Interesting, is this something that’s feasible to use Redis for?

I can imagine that populating that giant index all at once wouldn’t work for that, but if that mapping is accumulated over time, it might be a decent tradeoff since it can hold 4 billion keys.

Topic		Replies	Views
How to define the initial size of a hash? Help & Support	4	287	July 15, 2021
Why does Hash.size return Int32 and not UInt32? Crystal Contrib	24	1494	March 11, 2019
Arithmetic overflow when trying to benchmark some digest Help & Support	6	300	February 13, 2021
Is increasing array index size in the future? Crystal Contrib	33	646	March 1, 2025
Is there any way to explicitly allow Arithmetic overflow, for hash calculations? Crystal Contrib	3	689	December 6, 2019

Maximum Hash size reached - so low

Related topics