Wikipedia is wrong on the technical details. E.g. the filesystem accepts any seq...

Const-me · on June 11, 2019

From your comment I was replying to:

> "UTF-16" on Windows usually means UCS-2, so you risk losing information if you reencode.

On Windows, you normally call this API to convert UTF-8 to UTF-16: https://docs.microsoft.com/en-us/windows/desktop/api/stringa... As you see, the documentation says it converts to UTF-16, not UCS-2, so no information is lost re-encoding.
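To illustrate the UTF-16 versus UCS-2 distinction, here is a minimal sketch using Python's built-in codecs as a stand-in for the Windows conversion API linked above (it does not call that API). A code point outside the Basic Multilingual Plane becomes a surrogate pair in UTF-16 and survives the round trip:

```python
import struct

# U+1F600 lies outside the Basic Multilingual Plane, so UCS-2 could not hold it.
text = "\U0001F600"

utf8_bytes = text.encode("utf-8")
# Re-encode UTF-8 -> UTF-16 (little-endian, no BOM), the layout of WCHAR strings.
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")

# View the result as 16-bit code units: a high/low surrogate pair.
units = struct.unpack("<%dH" % (len(utf16_bytes) // 2), utf16_bytes)

# Convert back to UTF-8 and compare with the starting bytes.
round_trip = utf16_bytes.decode("utf-16-le").encode("utf-8")
```

Here `units` holds two code units rather than one, which is what a true UTF-16 encoder produces and a UCS-2 one cannot.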

And the article you’ve linked says “file system treats path and file names as an opaque sequence of WCHARs.” This means no information is lost in the kernel, either.

Indeed, the kernel neither validates nor normalizes these WCHARs, but should it? I would be very surprised if I ask an OS kernel to create a file, and it silently changed the name doing some Unicode normalization.

The Linux kernel doesn’t do that either: https://www.kernel.org/doc/html/latest/admin-guide/ext4.html says “the file name provided by userspace is a byte-per-byte match to what is actually written in the disk”.

ChrisSD · on June 11, 2019

I'm sorry if I was unclear but my point was that when you receive a string from the Windows API you cannot make any assumptions about it being valid UTF-16. Therefore converting it to UTF-8 is potentially lossy. So if you then convert it back from UTF-8 to UTF-16 and feed it to the WinAPI you'll get unexpected results. Which is why I feel converting back and forth all the time is risky.
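A minimal sketch of that failure mode, using Python's codecs rather than the WinAPI. The helper name `units_to_bytes` and the sample code units are made up for illustration. An unpaired surrogate is a perfectly legal WCHAR but not valid UTF-16, so a strict conversion fails and a lenient one substitutes U+FFFD:

```python
def units_to_bytes(units):
    """Pack 16-bit code units into little-endian bytes (hypothetical helper)."""
    return b"".join(u.to_bytes(2, "little") for u in units)

# "a", a lone high surrogate (0xD800), "b": an arbitrary WCHAR sequence.
raw = units_to_bytes([0x0061, 0xD800, 0x0062])

# A strict UTF-16 decode rejects the unpaired surrogate.
try:
    raw.decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decode replaces it with U+FFFD, so the original unit is gone.
lossy = raw.decode("utf-16-le", errors="replace")

# Going through UTF-8 and back to UTF-16 no longer reproduces the input.
round_tripped = lossy.encode("utf-8").decode("utf-8").encode("utf-16-le")
```

After this, `round_tripped` differs from `raw`, so a name rebuilt this way would no longer refer to the original file.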

This is one reason why the WTF-8[0] encoding was created: it is a UTF-8-like encoding that can also represent ill-formed UTF-16, i.e. unpaired surrogates.

[0] https://simonsapin.github.io/wtf-8/
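For a rough feel of the idea, here is a sketch using Python's `surrogatepass` error handler. This is not a WTF-8 implementation, only an approximation: for an unpaired surrogate that arrives via a UTF-16 decode it should yield the same three-byte form, but it does not enforce WTF-8's rule that paired surrogates must never be encoded separately.

```python
# "a", a lone high surrogate (0xD800), "b" as little-endian 16-bit units.
raw = b"".join(u.to_bytes(2, "little") for u in [0x0061, 0xD800, 0x0062])

# Keep the unpaired surrogate as a code point instead of rejecting it.
text = raw.decode("utf-16-le", errors="surrogatepass")

# Encode it with the UTF-8 bit layout: the surrogate takes three bytes.
wtf8_like = text.encode("utf-8", errors="surrogatepass")

# Reverse both steps to get back to 16-bit units.
back = (
    wtf8_like.decode("utf-8", errors="surrogatepass")
    .encode("utf-16-le", errors="surrogatepass")
)
```

Unlike the lossy conversion, `back` matches `raw` byte for byte here, which is the property that makes this kind of encoding safe for round-tripping names received from the OS.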

ygra · on June 11, 2019

> I would be very surprised if I ask an OS kernel to create a file, and it silently changed the name doing some Unicode normalization.

Doesn't OS X do that? AFAIK file names are in NFD there.

loeg · on June 11, 2019

Yes, Mac normalizes and decomposes. It's weird.
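To show what decomposition does to a name, a small sketch with Python's `unicodedata`. This applies standard NFD; whether a given Mac filesystem matches it exactly is an assumption not checked here.

```python
import unicodedata

# "é" as a single precomposed code point (U+00E9).
composed = "\u00e9"

# NFD splits it into "e" followed by a combining acute accent (U+0301).
decomposed = unicodedata.normalize("NFD", composed)

# The two forms render the same but are different byte sequences.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
```

On a filesystem that stores names byte for byte, these would be two distinct file names. On one that normalizes, the name read back can differ from the name that was written.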