Understanding the memory layout of union types

While doing some low-level work in Crystal, I started printing out some of my variables’
representation in binary. I am aware that union types are handled at runtime using a
32-bit type ID that prefixes the variable’s data in memory. When I was printing out
this data, however, I found a second 32-bit value after the type ID and before the
variable data. I have no idea what it is, and haven’t been able to find any information
about it.

Here’s an example that shows what I’m talking about:

# Dumps `byte_count` bytes starting at `pointer`, once in hex and once in binary.
def print_binary(pointer, byte_count)
  bytes = Bytes.new(pointer.unsafe_as(Pointer(UInt8)), byte_count)
  puts "hex: " + bytes.map { |byte| byte.to_s(16).rjust(2, '0').center(8, ' ') }.join(" ")
  puts "bin: " + bytes.map { |byte| byte.to_s(2).rjust(8, '0') }.join(" ")
end

puts "Because `val` is not a union type, there is no type id stored in memory for it."
val = 0x01234567_u32
print_binary(pointerof(val), 4)

puts

puts "Here, `val2` has a compile-time type of (UInt32 | UInt16), so a type id is stored"
val2 = [0x01234567_u32, 0xabcd_u16][0]
print_binary(pointerof(val2), 12)
puts "Type ID: 0x#{val2.crystal_type_id.to_s(base: 16)}"

When I run the program above on my computer, this is the output:

Because `val` is not a union type, there is no type id stored in memory for it.
hex:    67       45       23       01   
bin: 01100111 01000101 00100011 00000001

Here, `val2` has a compile-time type of (UInt32 | UInt16), so a type id is stored
hex:    a6       00       00       00       ba       55       00       00       67       45       23       01   
bin: 10100110 00000000 00000000 00000000 10111010 01010101 00000000 00000000 01100111 01000101 00100011 00000001
Type ID: 0xa6

As you can see, the first 32-bit word of val2 stores the type ID 0xa6 in little-endian order. The last word, 67 45 23 01, is the little-endian form of 0x01234567. Between the two, however, there is a word that looks random.

What exactly is this middle value, and what purpose does it serve?

Thanks!

I think that is padding introduced by LLVM so that 64-bit pointers are correctly aligned.

This is a bit old, but it has relevant information if you want to dig into memory representation: Internals - The Crystal Programming Language


Thanks so much! That explains a lot - it wasn’t deterministic, so it makes sense that it was just padding.

Yes, it’s done like that so that memory is aligned for the GC to find roots. We internally represent the value part of the union as a series of opaque Int64, and the type ID is an Int32, but that one is extended to Int64 so that the value part stays aligned on 8-byte boundaries.
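To make that layout concrete, here is a minimal sketch that reads the two meaningful fields of the union directly and skips the padding. It assumes a 64-bit little-endian target and takes its offsets (type ID at byte 0, value at byte 8) from the output earlier in the thread, not from any documented guarantee. The names base, type_id and payload are made up for illustration.

```crystal
# Sketch only: assumes a 64-bit, little-endian target. The offsets below are
# inferred from the dump earlier in the thread, not from a documented ABI.
val = [0x01234567_u32, 0xabcd_u16][0] # compile-time type: UInt32 | UInt16

base = pointerof(val).as(Pointer(UInt8))

# Bytes 0..3: the Int32 type ID.
type_id = base.as(Pointer(Int32)).value

# Bytes 4..7: padding. Nothing writes to it, so its contents are arbitrary.

# Bytes 8 and up: the value part. The runtime type here is UInt32.
payload = (base + 8).as(Pointer(UInt32)).value

puts "sizeof union: #{sizeof(typeof(val))}"
puts "type id:      0x#{type_id.to_s(16)}"
puts "reported id:  0x#{val.crystal_type_id.to_s(16)}"
puts "payload:      0x#{payload.to_s(16)}"
```

If the assumptions hold, the type ID read from memory should match crystal_type_id, and the payload should come back as the original UInt32. Bytes 4 to 7 are never inspected, which is why their contents can vary between runs without affecting the program.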
