Unicode programming, with examples in C (begriffs.com)
175 points by mmoez on June 11, 2019 | 58 comments


> This is called Normalization Form Canonical Composition (NFC).

Is there something like a "Round Midnight and No Coffee Form" where the programmer just renders the text to check whether the output of each set of codepoints matches pixel for pixel?



I did a co-op at IBM in the 1980s, and this is how we tested and verified the newest IBM PC hardware (at that point, IIRC, it would have been around the 286 / PCjr / OS/2 / smaller-form-factor IBM PC designs). We created tests on current systems with manual keyboard entry, recorded all the keystrokes, timing, screen buffers, and output files at intervals, then ran the same tests on new hardware by playing back the recorded input and comparing the new output to the saved runs.


The closest thing Unicode has seems to be NFKC, but last I checked it still didn't correctly handle Greek and Cyrillic aliases of Latin characters, never mind anything more obscure.


Do you have examples of this? I wonder if it's implementation specific or due to the NFKC method itself?


The issue does not have anything to do with normalization per se, but stems from the fact that there are Unicode characters that are semantically different but nevertheless look exactly the same in most fonts. For example 'a' (U+0061 LATIN SMALL LETTER A) vs. 'а' (U+0430 CYRILLIC SMALL LETTER A).
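A minimal ICU sketch (assuming the unorm2 C API from <unicode/unorm2.h>) shows that even NFKC normalization leaves the two homoglyphs distinct:

    #include <stdio.h>
    #include <unicode/unorm2.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        const UNormalizer2 *nfkc = unorm2_getNFKCInstance(&status);

        UChar latin[]    = { 0x0061, 0 };  /* 'a' LATIN SMALL LETTER A    */
        UChar cyrillic[] = { 0x0430, 0 };  /* 'а' CYRILLIC SMALL LETTER A */
        UChar n1[8], n2[8];

        unorm2_normalize(nfkc, latin,    -1, n1, 8, &status);
        unorm2_normalize(nfkc, cyrillic, -1, n2, 8, &status);
        if (U_FAILURE(status)) return 1;

        /* Prints "different": normalization never unifies homoglyphs. */
        puts(u_strcmp(n1, n2) == 0 ? "same" : "different");
        return 0;
    }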


It's probably also futile to keep homoglyph tables without the context of the font being used, as upright and italic Cyrillic т are the same letter, just rendered in different styles (and in some fonts they resemble Latin T and m respectively). Unicode mentions homoglyphs briefly in UTR #36, but overall I feel that it's not really their job to solve that issue, as visually similar characters already exist in ASCII and other limited character sets, and every context where those things are a problem probably needs to be evaluated differently, with different mitigations.


Btw: In C99 you don't have to declare all your variables at the top. You can also use // comments.


We embedded devs might be able to start consistently using C99 in about fifty years or so. Hopefully.

Lots of development kits barely support C89.


Ironically, most legacy C environments are commercial in nature. Open source tools seem to have a lot better track record in this regard.


Couldn't pre-processors be used to add such features? (I don't know anything about embedded development.)


The C preprocessor would not be enough; you would need a full C99-to-C89 compiler that parses to an AST.


But the ICU headers require C99 anyway.


As it says in the article:

> The examples in this article conform to the C89 standard, but we specify C99 in the Makefile because the ICU header files use C99-style (//) comments.


I found this quite interesting. Perhaps the closest I’ve come to grasping the core concepts, even though I’ve done a fair bit of work around the edges previously.


I'm just confused that the libicu shared libs are ~25 MB. Why is that?


I think it’s because the library, when standing alone, would need to include the entire Unicode database. What I wonder is, when you install the library from a distribution’s package manager, does it package that data separately (so that it can hopefully be used by related software)?


Probably not. I would guess, from working with similar “static-data-heavy” libraries, that much of the data is “burned into” the libicu in the form of lookup functions that execute native-code representations of decision trees to give their responses. This data is meant to be “used by related software” by just linking libicu and asking it about the data.

Though, this does mean that in language runtimes that don’t want to pull in native libraries (because the runtime is trying to ensure some guarantee like soft-realtime or fault-tolerance), libicu can’t be linked, so the runtime actually has to do the same thing libicu does, if it wants to have efficient Unicode lookup to the level that people expect: ship a copy of the database files from Unicode.org with the source, and convert them into source-code representations of decision-tree functions.

The language I’m most familiar with, Elixir, employs this strategy to support Unicode; see e.g. https://github.com/elixir-lang/elixir/blob/master/lib/elixir...


I've seen many distros provide a unicode-data package, containing the human-readable text files from the Unicode Character Database. These are sometimes handy to have around, but I'd be surprised if much software used them directly.


Only libicudata is big.

Reason — tons of data. Contrary to what other commenters said, it’s not (only) Unicode properties (they are tiny), but a lot more: from rules for spelling out numbers, to transliteration, collation, and word/sentence segmentation (some of which are absolutely non-trivial and sometimes require dictionaries of special cases).
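Collation alone is locale-dependent; a minimal sketch with ICU's C collation API (<unicode/ucol.h>; the locale names here are just examples):

    #include <stdio.h>
    #include <unicode/ucol.h>

    int main(void)
    {
        /* In Swedish, "ö" sorts after "z"; in German it sorts near "o".
         * Data for rules like this is part of what makes libicudata big. */
        UErrorCode status = U_ZERO_ERROR;
        UCollator *sv = ucol_open("sv_SE", &status);
        UCollator *de = ucol_open("de_DE", &status);
        if (U_FAILURE(status)) return 1;

        UChar z[] = { 0x007A, 0 };  /* "z" */
        UChar o[] = { 0x00F6, 0 };  /* "ö" */

        printf("sv_SE: z %s ö\n",
               ucol_strcoll(sv, z, -1, o, -1) == UCOL_LESS ? "<" : ">");
        printf("de_DE: z %s ö\n",
               ucol_strcoll(de, z, -1, o, -1) == UCOL_LESS ? "<" : ">");

        ucol_close(sv);
        ucol_close(de);
        return 0;
    }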


> word/sentence segmentation (some of which are absolutely non-trivial and sometimes require dictionaries of special cases)

Not just special cases: for Thai, you need a dictionary to do segmentation at all.


Because it has to store all properties of all valid codepoints 0-0x10FFFF. It does it via perfect hashes for fastest lookup, not via space-saving 3-level arrays as most others do. I described various implementation strategies here: http://perl11.org/blog/foldcase.html


That article contained this text and code:

Many developers believe that a case-insensitive comparison is achieved by mapping both strings being compared to either upper- or lowercase and then comparing the resulting bytes. The existence of functions such as ‘strcasecmp’ in some C libraries, for example, or common examples in programming books reinforces this belief:

    if (strcmp(toupper(foo),toupper(bar))==0) { // a typical caseless comparison
which I guess should be C, but makes no sense at all. The standard functions toupper() and tolower() operate on single characters, not strings. Modifying entire strings in place and returning them also seems odd.

Also, the text leading up to the code talks about strcasecmp(), but the code doesn't use it, and it claims the existence of strcasecmp() proves that people like to smash the case of strings before comparing them. Of course, strcasecmp() is the exact opposite: it just does a case-insensitive comparison and doesn't say anything about how that is achieved.

Very confusing.
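For reference, a rough sketch of what an ASCII-only caseless comparison (essentially what strcasecmp does) actually looks like; this per-character folding is exactly the approach that doesn't generalize to Unicode, where case mapping can change string length and depends on locale:

    #include <ctype.h>

    /* Roughly what strcasecmp does: fold each byte with tolower() and
     * compare, without ever building uppercased or lowercased copies
     * of the strings. Correct for ASCII only. */
    static int ascii_casecmp(const char *a, const char *b)
    {
        while (*a && *b) {
            int ca = tolower((unsigned char)*a++);
            int cb = tolower((unsigned char)*b++);
            if (ca != cb)
                return ca - cb;
        }
        return tolower((unsigned char)*a) - tolower((unsigned char)*b);
    }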


It is not defined in any standard. wcscmp, wcsfc and wcsnorm are missing. They should follow the Unicode rules.

And not be locale run-time dependent, only config-time.

And then there's this wchar_t turkey in the standard, which no one needs at all. We need a u8* API only, nothing else.

The next C standard deliberately did nothing on all the big open issues. Not even constexpr, which is broken in GCC.

It's not confusing, it's just a huge mess.


Benchmarked? For one lookup, or for repeated lookups?

Hashes have terrible cache locality. Unicode itself has locality, with the Greek characters generally separate from the Chinese characters and so on. The tree-based and array-based methods take advantage of this locality.


Just guessing, but based on statistics of web pages in Asian languages, most text consists mostly of the lower code points, no matter the language. So hash lookups end up being heavily biased towards small subsets of the data. And I wouldn't be surprised if the cache sizes of modern processors conspire to accelerate this pretty lopsided distribution of accesses considerably.


I’ve always wondered whether, in the context of segmenting/laying out entire Unicode documents (or large streams where you’re willing to buffer kilobytes at a time, like browser page rendering), there’d be an efficiency win for Unicode processing, in:

1. detecting (either heuristically, or using in-band metadata like HTML “lang”) the set of languages in use in the document; and then

2. rewriting the internal representation of the received document/stream-chunk from “an array of codepoints” to “an array of pairs {language ID, offset within a language-specific tokens table}.”

In other words, one could—with knowledge of which languages are in use in a document—denormalize the codepoints that are considered valid members of multiple languages’ alphabet/ideograph sets, into separate tokens for each language they appear in.

Each such token would “inherit” all the properties of the original Unicode codepoint it is a proxy for, but would only have to actually encode such properties as actually matter in the language it’s a token of.

And, as well, each language would be able to set defaults for the properties of its tokens, such that the tokens would only have to encode the exceptions to the defaults; or there could even be language-specific functions for decoding each property, such that languages could Huffman-compress together the particular properties that apply to them, given known frequencies of those properties among its tokens, making it cheaper to decode properties of commonly-encountered tokens, at the expense of decoding time for rarely-encountered tokens.

And, of course, this would give each language’s tokens data locality, such that the CPU could keep only the data (or embodied decision trees) in cache, for the languages that it’s actually using.

Since each token would know what its codepoint is, you could map this back to regular Unicode (e.g. UTF-8) when serializing it.

(Yes, I’m sort of talking about reimplementing code pages. But 1. they’d be code pages as materialized views of Unicode, and 2. you’d never expose the code-page representation to the world, only using it in your own text system.)
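A rough sketch of the internal representation described above (all the names here are hypothetical):

    #include <stdint.h>

    /* Hypothetical internal token: instead of a bare codepoint, each
     * element records which detected language it belongs to and an
     * offset into that language's token table. */
    struct lang_token {
        uint16_t lang_id;       /* index into the document's language list */
        uint16_t token_offset;  /* offset into that language's token table */
    };

    /* Per-language table entry: keeps the original codepoint (so the
     * text can be serialized back to UTF-8) plus only the property
     * bits that deviate from the language's defaults. */
    struct lang_token_info {
        uint32_t codepoint;
        uint32_t property_exceptions;
    };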


I don't know why ICU did it that way. libunistring did it a bit better, but they also are too big and not performant enough to power coreutils.

The best approach is currently a hybrid of 3-level arrays and a bsearch in a small list of exceptions. This is about 10x smaller and has the same performance. The properties can be boolean, int or string, so there's no one-size-fits-all solution.
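A minimal sketch of that hybrid (the table contents here are placeholders; real tables are machine-generated from the Unicode Character Database):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Three-level array: the high bits of a codepoint pick a block,
     * the middle bits pick a leaf, the low bits pick the value. */
    static const uint16_t stage1[0x11]    = { 0 };  /* indexed by cp >> 16         */
    static const uint16_t stage2[1 * 256] = { 0 };  /* indexed by (cp >> 8) & 0xFF */
    static const uint8_t  stage3[1 * 256] = { 0 };  /* indexed by cp & 0xFF        */

    /* Small sorted list of exceptional codepoints, searched first. */
    struct exception { uint32_t cp; uint8_t value; };
    static const struct exception exceptions[] = {
        { 0x00DF, 1 },  /* e.g. U+00DF LATIN SMALL LETTER SHARP S */
    };

    static int cmp_cp(const void *key, const void *elem)
    {
        uint32_t cp = *(const uint32_t *)key;
        const struct exception *e = elem;
        return (cp > e->cp) - (cp < e->cp);
    }

    static uint8_t lookup_property(uint32_t cp)
    {
        const struct exception *e = bsearch(&cp, exceptions,
            sizeof exceptions / sizeof *exceptions, sizeof *exceptions, cmp_cp);
        if (e)
            return e->value;
        return stage3[stage2[stage1[cp >> 16] * 256 + ((cp >> 8) & 0xFF)]
                      * 256 + (cp & 0xFF)];
    }

    int main(void)
    {
        printf("%d\n", (int)lookup_property(0x00DF));  /* hits the exception list */
        return 0;
    }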


Your article mentions:

> Unicode is pretty established, some use it with the wchar_t API in POSIX, some more as non-POSIX via external non-standardized utf-8 libraries

Just wanted to note that wchar_t is not POSIX per se, but comes from the C standard. It also suffers from various problems, see

https://begriffs.com/posts/2019-01-19-inside-c-standard-lib....


Money quote: "Puh-leaze, if your program can’t handle Medieval Irish carvings then I want nothing to do with it."


> Reading lines into internal UTF-16 representation

Fail.

> It’s unwise to use UTF-32 to store strings in memory. In this encoding it’s true that every code unit can hold a full codepoint.

wchar_t is 32 bits on a number of platforms such as GNU/Linux, MacOS and Solaris. It behooves you to use that, and all the associated library functionality, rather than roll your own.


Curiously, the paragraph before the line "it's unwise to use UTF-32" ends with "Use the encoding preferred by your library and convert to/from UTF-8 at the edges of the program."

And that is the best advice there is. If you have a choice, use UTF-8, otherwise use whatever your libraries use.

Unless you have very special needs, forget about UTF-16.


> Unless you have very special needs, forget about UTF-16.

I don’t think programming for Windows, Android, or iOS, or programming in / interoperating with Java, JavaScript, or .NET, qualifies as “very special needs”.


Even on Windows it's best to keep your text in UTF-8 and convert it to and from UTF-16 when interacting with win32 APIs. Java, dotNet and JavaScript are the worst of all worlds because you're both stuck with wide characters (in their native string types) and have the intricacies of UTF-16 to consider. I guess the advice might have been better phrased as "Unless you're forced to, or have very special needs, stay away from UTF-16".
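A minimal sketch of what that boundary conversion looks like (MultiByteToWideChar / WideCharToMultiByte with CP_UTF8; fixed buffer sizes and trimmed error handling, just to show the shape):

    #include <windows.h>

    /* Convert UTF-8 to UTF-16 right before calling a W API... */
    static void set_window_title_utf8(HWND hwnd, const char *title_utf8)
    {
        wchar_t title_utf16[256];
        if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                title_utf8, -1, title_utf16, 256) > 0)
            SetWindowTextW(hwnd, title_utf16);
    }

    /* ...and convert results back to UTF-8 at the same boundary. */
    static void get_window_title_utf8(HWND hwnd, char *buf, int bufsize)
    {
        wchar_t title_utf16[256];
        GetWindowTextW(hwnd, title_utf16, 256);
        WideCharToMultiByte(CP_UTF8, 0, title_utf16, -1,
                            buf, bufsize, NULL, NULL);
    }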


> it's best to keep your text in UTF-8 and convert it to and from UTF-16 when interacting with win32 APIs

It’s extra source code to write and then support, extra machine code to execute, and likely extra memory to malloc/free. Too slow, which in my book automatically means “not best”.

> Java, dotNet and JavaScript are the worst of all worlds because you're both stuck with wide characters (in their native string types) and have the intricacies of UTF-16 to consider.

Just normal UTF-16, like in WinAPI and many other popular languages, frameworks and libraries. E.g. Qt is used a lot in the wild.

> the advice might have been better phrased as

It says exactly the opposite, “Use the encoding preferred by your library and convert to/from UTF-8 at the edges of the program.”


Spot on. When coding against "raw" win32 API (or NT kernel APIs and perhaps rare native usermode NT API), using UTF-16 is the only way to keep your sanity. Converting strings back and forth between UTF-8 and UTF-16 in that kind of case is just senseless waste of CPU cycles.

One API call might take multiple strings and each conversion often means memory allocation and freeing — something you usually try to avoid as much as possible if it's something that's going to run most of the time the system is powered on.

The situation can be different in cross-platform code. In those cases, UTF-8 is a preferable abstraction.

Just don't use it for filenames. Filenames are just bags of bytes on at least Windows (well, 16-bit WCHARs, but the idea is the same) and Linux, and considering them anything else is not a great idea.


"Too" slow depends on a lot of factors.


When you’re writing code that you’re 100% sure won’t ever become a performance bottleneck, you still care about development time. Very often, unless it’s throwaway code, also about the cost of support.

Writing any code at all when that code is not needed is always too slow; that holds regardless of any technical factors.


Very little code in this world is needed. Much of it is, however, useful.

The person you replied to obviously isn't advocating for something they find useless.

Perhaps you could have instead asked "Why do you recommend doing this? I don't understand the benefit." But instead, you decided that they're advocating to do something useless for no reason.


> you decided that they're advocating to do something useless for no reason.

No, I decided they’re advocating to do something harmful for no reason.

They're advocating to waste hardware resources (as a developer I don’t like doing that) and waste development time (as a manager I don’t like when developers do that). But worst of all, using UTF-8 on Windows and converting to/from UTF-16 at the WinAPI boundary is a source of bugs: the kernel doesn’t guarantee that the bytes you get from these APIs are valid UTF-16; quite the opposite, it guarantees to treat them as an opaque chunk of 16-bit words.

UTF-8 has its place even on Windows, e.g. it makes sense for some network services, and even for in-RAM data when you know it’ll be 99% English (so it saves resources) and that data never hits WinAPI. But as soon as you’re consuming WinAPI, COM, UWP, the Windows shell, or any other native stuff, UTF-8 is just not good.


That very much depends on what you're doing. Constantly reencoding between UTF-16 and UTF-8 would be pointless. Not to mention that "UTF-16" on Windows usually means UCS-2, so you risk losing information if you reencode.

But if your application's strings are mostly independent of the WinAPI then sure, use UTF-8 and only convert when absolutely necessary.


> "UTF-16" on Windows usually means UCS-2

Wikipedia says it's UTF-16 since Windows 2000: https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows

Supporting Windows NT4 or Windows 95 in 2019 is what I would call "very special needs".


UTF-16 is only a convention for file names on Windows. This is why we have things like WTF-8: https://simonsapin.github.io/wtf-8/


It works exactly the same way on Linux. Neither the kernel nor the file system changes the bytes passed from userspace to the kernel, regardless of whether they are valid UTF-8 or not.

Pass an invalid UTF-8 file name, and those exact bytes will be written to the drive. https://www.kernel.org/doc/html/latest/admin-guide/ext4.html says “the file name provided by userspace is a byte-per-byte match to what is actually written in the disk”

Also try this test: https://gist.github.com/Const-me/dcdc40b206fe41ba200fa46b2e1... Runs just fine on my system.
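For example, a minimal test along those lines (not the linked gist, just a sketch): create a file whose name is not valid UTF-8 and watch the kernel accept it unchanged.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* 0xFF can never appear in valid UTF-8, yet the kernel accepts
         * the name as an opaque byte string. */
        const char *name = "not-utf8-\xFF\xFE";
        int fd = open(name, O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }
        close(fd);
        puts("created a file with a non-UTF-8 name");
        return 0;
    }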


It's not exactly the same on Linux, because Linux doesn't have duplicate pairs of system calls for one-byte character strings and wide strings. Linux system calls are all char strings: null-terminated arrays of bytes. It's a very clear model. Any interpretation of path-names as multi-byte character set data is up to user space.


> Linux doesn't have duplicate pairs of system calls for one-byte character strings and wide strings.

Neither does Windows. These “DoSomethingA” APIs aren’t system calls; they’re translated into Unicode-only NtDoSomething system calls, implemented in the kernel by the OS or by kernel-mode drivers as ZwDoSomething.

Windows system calls all operate on null-terminated arrays of 16-bit integers. It's a very clear model. Any interpretation of path names as characters is up to user space.


I don't know what you're talking about. You said Windows uses UTF-16 and pointed to Wikipedia. I'm only pointing out that that's only true by convention. Windows, even today, does not require that its file names be UTF-16.

Whether Linux analogously does the same or not (indeed it does) isn't something I was contesting.


The file system / object manager is only one part of the whole, though. Object names and namespaces in general will have that restriction, but in user-space there's a lot of Unicode that's treated as text, not a bag of code units. And those things are UTF-16.


Wikipedia is wrong on the technical details.

E.g. the filesystem accepts any sequence of WCHARs, whether or not they're valid UTF-16: https://docs.microsoft.com/en-us/windows/desktop/FileIO/nami...

> the file system treats path and file names as an opaque sequence of WCHARs.

The same is true more generally, there's no validation so anything goes.


From your comment I was replying to:

> "UTF-16" on Windows usually means UCS-2, so you risk losing information if you reencode.

On Windows, you normally call this API to convert UTF-8 to UTF-16: https://docs.microsoft.com/en-us/windows/desktop/api/stringa... As you see, the documentation says it converts to UTF-16, not UCS-2, so no information is lost re-encoding.

And the article you’ve linked says “file system treats path and file names as an opaque sequence of WCHARs.” This means no information is lost in the kernel, either.

Indeed, the kernel doesn’t validate or normalize these WCHARs, but should it? I would be very surprised if I asked an OS kernel to create a file and it silently changed the name by doing some Unicode normalization.

The Linux kernel doesn’t do that either: https://www.kernel.org/doc/html/latest/admin-guide/ext4.html says “the file name provided by userspace is a byte-per-byte match to what is actually written in the disk”


I'm sorry if I was unclear but my point was that when you receive a string from the Windows API you cannot make any assumptions about it being valid UTF-16. Therefore converting it to UTF-8 is potentially lossy. So if you then convert it back from UTF-8 to UTF-16 and feed it to the WinAPI you'll get unexpected results. Which is why I feel converting back and forth all the time is risky.

This is one reason why the WTF-8[0] encoding was created: a UTF-8-like encoding that supports invalid Unicode.

[0] https://simonsapin.github.io/wtf-8/


> I would be very surprised if I ask an OS kernel to create a file, and it silently changed the name doing some Unicode normalization.

Doesn't OS X do that? AFAIK files names are in NFD there.


Yes, Mac normalizes and decomposes. It's weird.


Most of that is abstracted away by "use what your library uses".

I can't remember if I ever ran into an issue with Java because it used UTF-16.

If you look at the example code of the OP link where it reads a line from a file, you only see UTF-16 mentioned in a comment.

At first glance, you only see a UChar* being filled.

https://begriffs.com/posts/2019-05-23-unicode-icu.html#readi...


I know, and I was replying to the comment saying that UTF-16 is something that’s very rarely needed.

Personally, when working with strings in RAM, I have a slight preference for UTF-16, for two reasons:

1. When handling non-Western languages in UTF-8, branch prediction fails all the time. Spaces and punctuation use 1 byte/character, everything else 2-3 bytes/character in UTF-8. With UTF-16 it’s 99% 2 bytes/character and surrogate pairs are very rare, i.e. simple sequential non-vectorized code is likely to be faster for UTF-16.

2. When handling east Asian languages, UTF-16 uses less RAM, these languages use 3 bytes/character in UTF-8, 2 bytes/character in UTF-16.

But that’s only a slight preference. In 99% of cases I use whatever strings are native on the platform, or whatever requires the minimum amount of work to integrate. When doing native Linux development this often means UTF-8; on Windows it’s UTF-16.
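To make point 2 concrete, a small ICU sketch (the sample string is arbitrary):

    #include <stdio.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        /* "日本語": 3 characters, 6 bytes as UTF-16, 9 bytes as UTF-8. */
        UChar utf16[] = { 0x65E5, 0x672C, 0x8A9E, 0 };
        char utf8[32];
        int32_t utf8_len = 0;
        UErrorCode status = U_ZERO_ERROR;

        u_strToUTF8(utf8, (int32_t)sizeof utf8, &utf8_len, utf16, -1, &status);
        if (U_FAILURE(status)) return 1;

        printf("UTF-16: %d bytes, UTF-8: %d bytes\n",
               (int)(2 * u_strlen(utf16)), (int)utf8_len);
        return 0;
    }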


1. sounds interesting. Do you have numbers on an example?


This is the correct answer. There's no need for UTF-16 unless you're fixing up code that uses UCS-2, and UTF-32 doesn't buy you anything other than bloat. In all cases you have to deal with graphemes that consist of multiple codepoints, so even UTF-32 is effectively a variable-length encoding.

UTF-8 is reasonably easy to deal with and very interoperable.
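For instance, "é" written as U+0065 U+0301 is two codepoints but one user-perceived character in any encoding. A minimal ICU sketch with the grapheme break iterator:

    #include <stdio.h>
    #include <unicode/ubrk.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        /* "é" as 'e' + combining acute: 2 codepoints, 1 grapheme cluster. */
        UChar text[] = { 0x0065, 0x0301, 0 };
        UErrorCode status = U_ZERO_ERROR;
        UBreakIterator *bi = ubrk_open(UBRK_CHARACTER, "en_US",
                                       text, u_strlen(text), &status);
        if (U_FAILURE(status)) return 1;

        int graphemes = 0;
        while (ubrk_next(bi) != UBRK_DONE)
            graphemes++;
        ubrk_close(bi);

        printf("%d codepoints, %d grapheme cluster(s)\n",
               (int)u_strlen(text), graphemes);
        return 0;
    }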


UTF-16 is (still) the default and indeed the only way to do "Unicode" through most of the Windows API.



