Windows `Path` case mapping is inaccurate

Comparison and hashing of Windows Paths rely on Char#downcase for case insensitivity. This is accurate to Unicode, but not to filesystems like NTFS, which use a per-volume upcase table. This table:

  • is outdated;
  • contains an entry for every UTF-16 code unit, but treats surrogates as opaque characters that map to themselves, so it has no case mappings outside the BMP;
  • can change between OS versions;
  • is not guaranteed to be the same even on identical OS versions.
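
The second point matters because Unicode does define case pairs outside the BMP. A minimal sketch, assuming the table behaves as described above; the Deseret letters are only an illustrative choice and are not taken from the findings in this post:

```crystal
upper = '\u{10400}' # DESERET CAPITAL LETTER LONG I
lower = '\u{10428}' # DESERET SMALL LETTER LONG I

# Unicode pairs these two, so Char#downcase folds them together.
puts upper.downcase == lower # prints true

# In UTF-16 each is a surrogate pair. A table indexed by single code
# units that maps surrogates to themselves cannot relate the two.
p upper.to_s.to_utf16 # high surrogate 0xD801, low surrogate 0xDC00
p lower.to_s.to_utf16 # high surrogate 0xD801, low surrogate 0xDC28
```

So a pair like this would compare equal as Windows Paths while a table of that shape keeps the two names distinct.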

Refer to "Playing with case-insensitive file names" on My DFIR Blog for more details.

That means two Windows Paths that compare equal may represent different paths on the system:

LJ = Path.windows("casing\\ǈ.txt")       # U+01C8 LATIN CAPITAL LETTER L WITH SMALL LETTER J
LJ_UPPER = Path.windows("casing\\Ǉ.txt") # U+01C7 LATIN CAPITAL LETTER LJ

Dir.mkdir_p("casing")
File.touch(LJ)

contents = Dir.glob("casing/*").map { |path| Path.windows(path) }

contents                     # => [Path["casing\\ǈ.txt"]]
LJ == LJ_UPPER               # => true
contents.includes?(LJ_UPPER) # => true
File.exists?(LJ)             # => true
File.exists?(LJ_UPPER)       # => false
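
The equality in this example comes from the Unicode case mapping alone, so that half can be reproduced on any platform without touching the file system. A minimal sketch; the variable names are illustrative:

```crystal
lj_title = '\u01C8' # LATIN CAPITAL LETTER L WITH SMALL LETTER J
lj_upper = '\u01C7' # LATIN CAPITAL LETTER LJ

# Both map to U+01C9 LATIN SMALL LETTER LJ when downcased.
puts lj_title.downcase == lj_upper.downcase # prints true

# The same mapping is what makes the two Windows paths compare and hash alike.
a = Path.windows("casing\\\u01C8.txt")
b = Path.windows("casing\\\u01C7.txt")
puts a == b
puts a.hash == b.hash
```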

First we need to find a way to use this NTFS upcase table. CharUpperBuffW is locale-independent, and on my system it appears to match exactly what the upcase table produces. (I couldn’t find a way to use RtlUpcaseUnicodeChar.) Then we can find all offending Char pairs as below:

@[Link("user32")]
lib LibC
  fun CharUpperBuffW(lpsz : LPWSTR, cchLength : DWORD) : DWORD
end

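struct Char
  # One entry per UTF-16 code unit, initialized to the identity mapping,
  # then upcased in place by CharUpperBuffW.
  @@user32_upcase = Pointer(UInt16).malloc(0x10000, &.to_u16)
  LibC.CharUpperBuffW(@@user32_upcase, 0x10000)

  # Approximates the NTFS upcase table; characters outside the BMP map to themselves.
  def ntfs_upcase
    return self if ord > 0xFFFF
    @@user32_upcase[ord].unsafe_chr
  end
end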

def every_char_pair
  (0x0000_u16..0xFFFE_u16).each do |ord1|
    next if 0xD800 <= ord1 <= 0xDFFF
    ch1 = ord1.unsafe_chr

    (ord1 + 1..0xFFFF_u16).each do |ord2|
      next if 0xD800 <= ord2 <= 0xDFFF
      ch2 = ord2.unsafe_chr

      yield ord1, ord2, ch1, ch2
    end
  end
end

puts "different physical, same `Path`"
every_char_pair do |ord1, ord2, ch1, ch2|
  if ch1.ntfs_upcase != ch2.ntfs_upcase && ch1.downcase == ch2.downcase
    printf "%04X\t%04X\t%s\t%s\n", ord1, ord2, ch1, ch2
  end
end
puts

puts "same physical, different `Path`"
every_char_pair do |ord1, ord2, ch1, ch2|
  if ch1.ntfs_upcase == ch2.ntfs_upcase && ch1.downcase != ch2.downcase
    printf "%04X\t%04X\t%s\t%s\n", ord1, ord2, ch1, ch2
  end
end
puts

The 209 pairs of the first kind are here. I found none of the second kind, but with a corrupted upcase table that would also be possible.

There seems to be no authoritative source for the contents of the upcase table. More importantly, the table simply doesn’t exist on filesystems like Ext4, yet a Crystal program under Linux can still construct Windows Paths. I honestly don’t know if there is anything we can do about this.

Probably not. The properties of file system paths are inherently file system specific.

It’s just a very rough approximation that paths are usually case-sensitive on POSIX systems and insensitive on Windows.
But you can have case-insensitive file systems on POSIX systems and case-sensitive file systems on Windows, and that is before considering that case-insensitive file systems differ in their comparison rules.
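
Since only the file system knows its own rules, one option for code that needs a definite answer is to ask it whether two spellings resolve to the same file instead of comparing Path values. A minimal sketch using `File.info?` and `File::Info#same_file?` from the standard library; the helper name `same_entry?` is hypothetical, and it can only answer for files that already exist:

```crystal
require "file_utils"

# Hypothetical helper: true only if both spellings exist and the file
# system reports them as the same file.
def same_entry?(a : Path | String, b : Path | String) : Bool
  info_a = File.info?(a)
  info_b = File.info?(b)
  return false unless info_a && info_b
  info_a.same_file?(info_b)
end

dir = File.join(Dir.tempdir, "casing-demo-#{rand(1_000_000)}")
Dir.mkdir_p(dir)
lower = File.join(dir, "readme.txt")
upper = File.join(dir, "README.TXT")
File.touch(lower)

puts same_entry?(lower, lower) # prints true
puts same_entry?(lower, upper) # result depends on the file system
FileUtils.rm_rf(dir)
```

This does not help with hashing or with paths that do not exist yet, so it complements rather than replaces the comparison on Path.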

So it’s completely impossible to get this done right. We can’t win that fight.

I think it would be worth looking at other relevant implementations to see what they’re doing and if we can copy whatever has proven to be somewhat reasonable.
I suppose the current approach is pretty good actually. It’s very simple and unsophisticated. Thus it’s very obviously incomplete and insufficient for completely accurate path handling.
The more special cases we throw in, the more the expectation grows that it’s always correct.
