Why does Hash.size return Int32 and not UInt32?

It would seem better if Hash#size returned a UInt, which is more specific.

UInt is a pretty specific type. It shouldn't be used just to restrict the range of possible values, only when it is really needed.
https://carc.in/#/r/6eut

class MyHash
  # Simplified stand-in for a Hash whose size is a UInt32.
  def size
    0_u32
  end
end

h = MyHash.new
# size is 0_u32, so `h.size - 1` underflows: it either wraps around to
# 4294967295 (and "hash too big" is printed) or raises, depending on
# whether checked arithmetic is enabled. Neither is what the author meant.
if h.size - 1 > 1000
  puts "hash too big"
end

Int32 is the default integer type in Crystal. All stdlib methods returning an integer should return Int32 unless there are very specific reasons not to. The reason is that math operations mixing different types are prone to error (because an unsigned type can easily lead to an overflow). That's why for sizes and other dimensions you should always use signed integers, even if the effective value range is limited to non-negative numbers.
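A minimal sketch of the contrast, assuming a Crystal version where the plain operators are overflow-checked and the &- operator wraps (the exact behaviour depends on the compiler version):

signed = 0                          # Int32, the stdlib default
too_big = signed - 1 > 1000
puts too_big                        # => false: -1 compares as expected

unsigned = 0_u32
too_big = unsigned &- 1_u32 > 1000
puts too_big                        # => true: the subtraction wraps to 4294967295
# a plain `unsigned - 1` raises OverflowError when checked arithmetic is on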

Would checked overflow allow using more specific types like UInt32 here, since there would then be no risk of a silent error due to underflow?

In that case my example (size - 1 > 1000) will raise at runtime when size is 0. I don't think that's good behavior.
Making size a UInt32 doesn't add safety. The benefit is that you can have arrays with more than 2 billion elements, but you pay for it with less convenient math operations. It would be much better to have an Int64 size, but that would severely degrade performance on 32-bit architectures.

As I see it, the problem doesn't lie with the type of size, which indeed seems natural to fit into an unsigned integer, but with the subtraction operator. On the one hand, if we declare UInt - Number => UInt, we risk underflow; on the other hand, if UInt - Number => Int, we risk overflow. Which is nothing new, any language has to do something about this.

The word size is irrelevant here in my opinion; I read this thread as "don't use UInt at all", which seems a bit too strong a statement to me.

If you make size an Int32 you don't have problems with overflow/underflow (unless you use really big numbers). So what's the benefit of making it unsigned?
I don't think it helps catch any errors.
I think UInt32 is needed for binary protocols, hash functions and some other places where its overflow behaviour is exactly what we expect and need, but not for general computations. Yes, it is a controversial topic - I've seen two holy wars about signed vs unsigned container sizes in different languages. What is more important: one more bit of possible size, or easier-to-understand behavior?
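For instance, here is a sketch of the kind of hashing code where UInt32 wrap-around is exactly the desired behaviour. The djb2-style function is picked purely for illustration; it uses Crystal's wrapping operators (&* and &+), so the overflow is intentional rather than an error:

# Wrap-around on UInt32 is the point here, not a bug:
# the hash value is meant to be reduced modulo 2**32.
def djb2(s : String) : UInt32
  hash = 5381_u32
  s.each_byte do |byte|
    hash = (hash &* 33_u32) &+ byte.to_u32
  end
  hash
end

puts djb2("crystal")   # some UInt32 value; overflowing along the way is expected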
One more funny example.

# imagine we can't use `reverse` because e.g. the array can change inside the loop
i = arr.size - 1
while i >= 0
  puts arr[i]
  i -= 2
end

This won't work if size is unsigned; you have to use

i = arr.size - 1
while i < arr.size # yes, that's not a mistake: loop while i is less than size, because i wraps around to a huge value once it goes below zero
  puts arr[i]
  i -= 2
end

No, it's "don't use UInt for math". Most use cases for number types revolve around some kind of mathematical computation. To make that work you need signed integers, because that's the usual domain for math calculations.

This whole thread is actually a surprise to me. I hadn't thought about it much, but I was under the impression that boundary checks are built in and 0_u - 1 would just raise.

don’t use UInt for math

Why does Crystal have UInt#-() then? Surely it would convey this message much more strongly if you had to cast a UInt to an appropriate type by hand?

Raising won't help here. You get an exception instead of a silent bug - that is better, but it still isn't what you need when iterating an array. You have to pay extra effort to deal with unsigned integers (unsigned sizes in this case).

What is a negative size? Does it have a physical sense?

Maybe we should introduce a new type family: Size32 et al. that will not have math operators at all.
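As a purely hypothetical sketch (the Size32 name and its interface are invented here for illustration), such a type could wrap a UInt32, keep comparisons, and simply not define arithmetic, forcing an explicit cast before any math:

# Hypothetical sketch: a size type with comparisons but no arithmetic,
# so expressions like `size - 1` fail to compile instead of underflowing.
struct Size32
  include Comparable(Size32)

  getter value : UInt32

  def initialize(@value : UInt32)
  end

  def <=>(other : Size32)
    value <=> other.value
  end

  # Explicit, visible opt-in before doing any math with the size.
  # (Would raise if the value didn't fit into Int32.)
  def to_i32 : Int32
    value.to_i32
  end
end

size = Size32.new(0_u32)
# size - 1            # would not compile: undefined method '-' for Size32
i = size.to_i32 - 1   # => -1, via an explicit cast to a signed type
puts i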

I don't get the "physical sense" argument. Yes, we pay 1 bit (half of the possible range) to have fewer problems with overflow. If that 1 bit is important to us, we can make our own container. If it is important in most cases, we can discuss making size unsigned by default.
But having a physical sense won't by itself solve any problems or help in any other way.

No, it's the other way around for me: I don't want to think in bits, I don't really care about the integer width, I optimize for semantics, for thinking. There is no law that says Array#size should return an Int at all, and in fact in JS it doesn't. Once you have a type system, its only benefit is creating a mental model that helps you reason about the problem, the solution and the code, and having to worry about over- and underflows should be left to those who optimize, not be the default position for an author to be in.

Size in general can't be negative, so I want my type system to reflect that. In the absence of a specialized type I pick something that has this property, a UInt, but now there is a risk of accidentally using it in the wrong context, because it is a number after all, so I propose we fix that.

But why does Crystal return an Int while other languages like Rust return a UInt from vec.len? I feel like that does guarantee the length is non-negative.

There is no universal good answer here, you have to pick a compromise (or expose options to the user). Rust chose one set of positive outcomes, Crystal chose another.

Well, that makes sense.
But at a low level the reality is that if we want to write correct code, we have to either

  • limit the size to 2147483647. Maybe a type that has half of the Int32 range (https://github.com/crystal-lang/crystal/issues/2747) could be useful, but it would most likely have a performance impact that is undesirable for such a basic type as Hash.
  • work correctly with UInt32, taking care of overflows ourselves (as in my example).
  • convert it to Int64 for calculations (this will degrade performance on 32-bit architectures, possibly even on 64-bit ones); see the sketch after this list.
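A minimal sketch of the third option, assuming the size comes back as a UInt32: widening to Int64 before doing any math sidesteps both underflow and overflow for every possible 32-bit size value.

size = 3_000_000_000_u32   # larger than Int32::MAX, still fits in UInt32

# All arithmetic happens in Int64, so `size - 1` can never underflow
# and values above Int32::MAX are still represented exactly.
if size.to_i64 - 1 > 1000
  puts "hash too big"
end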

Rust and DLang have an unsigned size; Go and Kotlin (and perhaps all JVM-based languages) have a signed one.

Of course I understand the implications. Also, n / size will raise on a zero size, which is fine. My position is this: just as there is no number that can correctly represent the result of a division by zero, there should be no number that represents the result of 0_u - 1, even if mechanically it's simple to just wrap around zero.

This is not a low-level discussion on my part; keeping in mind that all of this has already been discussed a million times, this is an argument about safety, correctness and the static type system.

Let's imagine there is no such number and the compiler checks everything for us. How should this code

i = arr.size - 1
while i >= 0
  puts arr[i]
  i -= 1
end

behave? Should it iterate the array correctly, or does it have to be rewritten?

There are several distinct topics here, let me try and unpack them.

Compile time

  • If we make a specialized Size type, then lines like i = arr.size - 1 and i -= 1 should become compile errors, because there would be no Size#-() method.
  • Then the user should realize that there is either a need for an explicit cast, or a misconception in the code in general; in this example that would be the manual array iteration in the first place, instead of arr.reverse_each { |n| puts n } (see the sketch after this list).
  • This is a far-fetched idea, but the compiler could help the user even without modifications to the type system, for example by issuing a warning on the same i -= 1 line, suggesting it be changed to i -= 1 unless i.zero? or something similar.
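For completeness, here is the idiomatic rewrite mentioned in the second point; it drops the manual index entirely, so the signed/unsigned question never comes up (reverse_each is the standard method from Indexable, nothing hypothetical here):

arr = [1, 2, 3, 4]

# Walks the array from the last element to the first without any
# explicit index arithmetic; works fine for an empty array too.
arr.reverse_each { |n| puts n }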

Runtime

In any case there should be a runtime boundary check, which will have to safeguard against overflows and underflows, keeping the values true to their types. In this example, provided that arr.size is unsigned, it should raise something like an UnderflowError upon encountering this condition.
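As a point of reference, recent Crystal releases behave roughly along these lines: plain integer arithmetic is overflow-checked and raises OverflowError (there is no separate UnderflowError), while the &-prefixed operators opt in to wrap-around explicitly. A small sketch:

i = 0_u32

begin
  i = i - 1        # checked arithmetic: raises instead of wrapping
rescue ex : OverflowError
  puts "underflow caught: #{ex.class}"
end

i = i &- 1_u32     # wrapping operator: explicit opt-in to wrap-around
puts i             # => 4294967295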