Comparison and hashing of Windows `Path`s rely on `Char#downcase` for case insensitivity. This is accurate to Unicode, but not to filesystems like NTFS, which use a per-volume upcase table. This table:
- is outdated;
- covers every UTF-16 code unit but treats surrogates as opaque characters that map to themselves, so there are no case mappings outside the BMP;
- can change between OS versions;
- is not guaranteed to be the same even on identical OS versions.
Refer to "Playing with case-insensitive file names" on My DFIR Blog for more details.
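As a small illustration of the BMP point, the sketch below uses a Deseret letter pair (my own choice of example, not something taken from the upcase table). `Char#downcase` knows the mapping, while in UTF-16 the character is just two surrogate units, which a per-code-unit table would leave untouched:

```crystal
# U+10400 DESERET CAPITAL LETTER LONG I and its lowercase form U+10428.
# This pair is an illustrative choice; any cased letter outside the BMP would do.
upper = '\u{10400}'
lower = '\u{10428}'

# Unicode case mapping, as used by Char#downcase, relates the two characters.
puts upper.downcase == lower # true

# In UTF-16 the same character is a surrogate pair. A table indexed by
# single code units only ever sees these two surrogates.
units = upper.to_s.to_utf16
puts units.size                                # 2
puts units.all? { |u| 0xD800 <= u <= 0xDFFF } # true
```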
That means two Windows `Path`s that compare equal may represent different paths on the system:
```crystal
LJ       = Path.windows("casing\\ǈ.txt") # U+01C8 LATIN CAPITAL LETTER L WITH SMALL LETTER J
LJ_UPPER = Path.windows("casing\\Ǉ.txt") # U+01C7 LATIN CAPITAL LETTER LJ

Dir.mkdir_p("casing")
File.touch(LJ)

contents = Dir.glob("casing/*").map { |path| Path.windows(path) }
contents # => [Path["casing\\ǈ.txt"]]

LJ == LJ_UPPER                # => true
contents.includes?(LJ_UPPER)  # => true
File.exists?(LJ)              # => true
File.exists?(LJ_UPPER)        # => false
```
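The equality above comes down to the two characters sharing a lowercase form. A minimal check at the `Char` level, independent of any filesystem, looks like this:

```crystal
lj_title = '\u01C8' # LATIN CAPITAL LETTER L WITH SMALL LETTER J
lj_upper = '\u01C7' # LATIN CAPITAL LETTER LJ

# Both characters downcase to U+01C9 LATIN SMALL LETTER LJ, so any
# comparison built on Char#downcase treats them as the same character.
puts lj_title.downcase == lj_upper.downcase # true
puts lj_title.downcase.ord.to_s(16)         # 1c9
```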
First we need to find a way to use this NTFS upcase table. `CharUpperBuffW` is locale-independent, and on my system it appears to match exactly what the upcase table produces. (I couldn't find a way to use `RtlUpcaseUnicodeChar`.) Then we can find all offending `Char` pairs as below:
```crystal
@[Link("user32")]
lib LibC
  fun CharUpperBuffW(lpsz : LPWSTR, cchLength : DWORD) : DWORD
end

struct Char
  # Start with the identity mapping for every UTF-16 code unit,
  # then let user32 upcase the whole buffer in place.
  @@user32_upcase = Pointer(UInt16).malloc(0x10000, &.to_u16)
  LibC.CharUpperBuffW(@@user32_upcase, 0x10000)

  def ntfs_upcase
    # The table only covers the BMP; everything else maps to itself.
    return self unless ord <= 0xFFFF
    @@user32_upcase[ord].unsafe_chr
  end
end

# Yields every unordered pair of distinct BMP code points, skipping surrogates.
def every_char_pair
  (0x0000_u16..0xFFFE_u16).each do |ord1|
    next if 0xD800 <= ord1 <= 0xDFFF
    ch1 = ord1.unsafe_chr
    (ord1 + 1..0xFFFF_u16).each do |ord2|
      next if 0xD800 <= ord2 <= 0xDFFF
      ch2 = ord2.unsafe_chr
      yield ord1, ord2, ch1, ch2
    end
  end
end

puts "different physical, same `Path`"
every_char_pair do |ord1, ord2, ch1, ch2|
  if ch1.ntfs_upcase != ch2.ntfs_upcase && ch1.downcase == ch2.downcase
    printf "%04X\t%04X\t%s\t%s\n", ord1, ord2, ch1, ch2
  end
end
puts

puts "same physical, different `Path`"
every_char_pair do |ord1, ord2, ch1, ch2|
  if ch1.ntfs_upcase == ch2.ntfs_upcase && ch1.downcase != ch2.downcase
    printf "%04X\t%04X\t%s\t%s\n", ord1, ord2, ch1, ch2
  end
end
puts
```
The 209 pairs of the first kind are here. I found none of the second kind, but with a corrupted upcase table that would also be possible.
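For comparison, here is a rough sketch of what a table-driven name comparison could look like. The helper `same_name_by_table?` is hypothetical and not part of Crystal's standard library, and the table built below is only a stand-in (identity plus ASCII upcasing), not the real NTFS table:

```crystal
# Hypothetical helper: compares two names unit by unit in UTF-16,
# mapping each code unit through the given 65536-entry upcase table.
def same_name_by_table?(a : String, b : String, table : Slice(UInt16)) : Bool
  units_a = a.to_utf16
  units_b = b.to_utf16
  return false unless units_a.size == units_b.size
  units_a.each_with_index do |unit, i|
    return false unless table[unit] == table[units_b[i]]
  end
  true
end

# Stand-in table for illustration only: every code unit maps to itself,
# except ASCII a-z, which map to A-Z. A real table would come from the volume.
table = Slice(UInt16).new(0x10000) { |i| i.to_u16 }
('a'..'z').each { |ch| table[ch.ord] = ch.upcase.ord.to_u16 }

puts same_name_by_table?("Foo.txt", "FOO.TXT", table)       # true
puts same_name_by_table?("\u01C8.txt", "\u01C7.txt", table) # false with this stand-in table
```

The second result differs from what `Char#downcase` gives for the same pair, which is the kind of mismatch the listing above searches for.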
There seems to be no authoritative source for the contents of the upcase table, but more importantly the table simply doesn't exist on filesystems like Ext4, and yet a Crystal program under Linux can continue to construct Windows `Path`s. I honestly don't know if there is anything we can do about this.