Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Wikipedia is wrong on the technical details.

E.g. the filesystem accepts any sequence of WCHARs, whether or not they're valid UTF-16: https://docs.microsoft.com/en-us/windows/desktop/FileIO/nami...

> the file system treats path and file names as an opaque sequence of WCHARs.

The same is true more generally, there's no validation so anything goes.



From your comment I was replying to:

> "UTF-16" on Windows usually means UCS-2, so you risk losing information if you reencode.

On Windows, you normally call this API to convert UTF-8 to UTF-16: https://docs.microsoft.com/en-us/windows/desktop/api/stringa... As you see, the documentation says it converts to UTF-16, not UCS-2, so no information is lost re-encoding.

And the article you’ve linked says “file system treats path and file names as an opaque sequence of WCHARs.” This means no information is lost in the kernel, either.

Indeed, kernel doesn’t validate nor normalize these WCHARs, but should it? I would be very surprised if I ask an OS kernel to create a file, and it silently changed the name doing some Unicode normalization.

Linux kernel doesn’t do that either, https://www.kernel.org/doc/html/latest/admin-guide/ext4.html says “the file name provided by userspace is a byte-per-byte match to what is actually written in the disk”


I'm sorry if I was unclear but my point was that when you receive a string from the Windows API you cannot make any assumptions about it being valid UTF-16. Therefore converting it to UTF-8 is potentially lossy. So if you then convert it back from UTF-8 to UTF-16 and feed it to the WinAPI you'll get unexpected results. Which is why I feel converting back and forth all the time is risky.

This is one reason why the WTF-8[0] encoding was created as a UTF-8 like encoding that supports invalid unicode.

[0] https://simonsapin.github.io/wtf-8/


> I would be very surprised if I ask an OS kernel to create a file, and it silently changed the name doing some Unicode normalization.

Doesn't OS X do that? AFAIK files names are in NFD there.


Yes, Mac normalizes and decomposes. It's weird.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: