Maximum Hash size reached - so low

Hello,

I can’t store 150_000_000 items even into a simple Hash(Int32, Int32).

  1. When I try to create the Hash with preallocated space, by passing initial_capacity to the constructor, an “Arithmetic overflow (OverflowError)” is raised during Hash creation.

  2. When I omit initial_capacity, a “Maximum Hash size reached” error occurs while filling the Hash, somewhere around 100_000_000 entries.

What…? Is 150_000_000 items really too much today? What about a billion entries? Why is the maximum so low?

So I should implement my own BigHash?

Thanks! pf

This is unfortunately a limitation of the internal 32-bit index size.

See explanation here in the source code:

I’m afraid there is currently no alternative for collections using bigger size types in the standard library.

cf. Data structers for large datasets · Issue #8523 · crystal-lang/crystal · GitHub

This is a long-known problem, but a solution that involves changing the stdlib’s size type is hard. And apparently this limitation is rarely an issue in practice.

Maybe you could describe your use case or particular problem. There might be a way to model it without a huge hash.

Imagine just a big in-memory index (object_ids → position in a file + some other metadata) or something like this.

It looks like it wouldn’t be a big problem to copy the stdlib Hash and base it on Int64. Unfortunately, it would not be able to implement Enumerable, because size in Enumerable is Int32, etc.

Is my math off, or is that already 1.5 GB of data if each entry takes only 10 bytes (total, including the memory the type uses internally)?
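
To check that estimate, here is a quick back-of-the-envelope calculation in Python. The 10 bytes per entry is the assumption from the post above, not a measured figure for Crystal's Hash; the real per-entry overhead is likely higher.

```python
# Rough memory estimate for a large in-memory index.
ENTRIES = 150_000_000      # item count from the original question
BYTES_PER_ENTRY = 10       # assumed per-entry cost, not a measured value


def estimate_bytes(entries: int, bytes_per_entry: int) -> int:
    """Return the total payload size in bytes."""
    return entries * bytes_per_entry


total = estimate_bytes(ENTRIES, BYTES_PER_ENTRY)
print(f"{total:,} bytes")
print(f"{total / 10**9:.2f} GB (decimal)")
print(f"{total / 2**30:.2f} GiB (binary)")
```

So the math holds: 150 million entries at 10 bytes each is 1.5 GB before any table overhead.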

If you really need to work with such large indices, I would probably roll my own data type.
That makes it easier to add things like lazy loading of data or pagination later, when you run out of memory on the machine.
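
One possible shape for such a data type, as a minimal sketch only: spread the keys over several smaller tables so that no single table has to hold every entry. This is written in Python purely to illustrate the idea; `ShardedIndex`, `shard_count`, and the file-offset values are hypothetical names and data, not anything from Crystal's stdlib.

```python
class ShardedIndex:
    """Spreads keys over several smaller dicts so no single table holds everything."""

    def __init__(self, shard_count: int = 16) -> None:
        if shard_count < 1:
            raise ValueError("shard_count must be at least 1")
        self._shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key) -> dict:
        # Pick a shard from the key's hash; Python's % is non-negative here.
        return self._shards[hash(key) % len(self._shards)]

    def __setitem__(self, key, value) -> None:
        self._shard_for(key)[key] = value

    def __getitem__(self, key):
        return self._shard_for(key)[key]

    def __contains__(self, key) -> bool:
        return key in self._shard_for(key)

    def __len__(self) -> int:
        # The total is summed across shards, so it is not tied to one table's size type.
        return sum(len(shard) for shard in self._shards)


index = ShardedIndex(shard_count=8)
for object_id in range(1000):
    index[object_id] = object_id * 64   # hypothetical file offset
print(len(index), index[10])
```

Because each shard is an ordinary table, individual shards could later be loaded lazily or moved to disk without changing the interface.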

Interesting. Would it be feasible to use Redis for this?

I can imagine that populating that giant index all at once wouldn’t work for that, but if that mapping is accumulated over time, it might be a decent tradeoff since it can hold 4 billion keys.