
Because it has to store all properties of all valid codepoints 0-0x10FFFF. It does it via perfect hashes for fastest lookup, not via space-saving 3-level arrays as most others do. I described various implementation strategies here: http://perl11.org/blog/foldcase.html
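
To make that concrete, here is a toy sketch of a perfect-hash property lookup. The key set, the table contents, and the cp % 7 hash are all invented for illustration; a real generator picks a collision-free hash offline over the full key set, but the lookup still boils down to one hash plus one probe:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy perfect hash over {A, a, GREEK ALPHA, GREEK SMALL ALPHA}:
       cp % 7 happens to be collision-free for this key set, so a lookup
       is a single table probe; the stored codepoint confirms a hit. */
    struct entry { uint32_t cp; uint32_t simple_lower; };

    static const struct entry table[7] = {
        [0] = { 0x03B1, 0x03B1 }, /* GREEK SMALL LETTER ALPHA */
        [2] = { 0x0041, 0x0061 }, /* LATIN CAPITAL LETTER A -> a */
        [3] = { 0x0391, 0x03B1 }, /* GREEK CAPITAL LETTER ALPHA -> alpha */
        [6] = { 0x0061, 0x0061 }, /* LATIN SMALL LETTER A */
    };

    static uint32_t simple_lower(uint32_t cp)
    {
        const struct entry *e = &table[cp % 7];
        return e->cp == cp ? e->simple_lower : cp; /* miss: not in the key set */
    }

    int main(void)
    {
        printf("U+%04X -> U+%04X\n", 0x0391u, (unsigned)simple_lower(0x0391));
        return 0;
    }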


That article contained this text and code:

Many developers believe that a case-insensitive comparison is achieved by mapping both strings being compared to either upper- or lowercase and then comparing the resulting bytes. The existence of functions such as ‘strcasecmp’ in some C libraries, for example, or common examples in programming books reinforces this belief:

    if (strcmp(toupper(foo),toupper(bar))==0) { // a typical caseless comparison
which I guess should be C, but makes no sense at all. The standard functions toupper() and tolower() operate on single characters, not strings. Modifying entire strings in place and returning them also seems odd.
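
For reference, a caseless comparison folds case one character at a time while comparing; roughly like this sketch (single-byte/ASCII only, and not any particular libc's strcasecmp source):

    #include <ctype.h>

    /* Fold case per character during the comparison; no string is ever
       rewritten. This is roughly what a strcasecmp-style function does
       in the "C" locale -- real Unicode case folding needs far more. */
    static int caseless_cmp(const char *a, const char *b)
    {
        while (*a && tolower((unsigned char)*a) == tolower((unsigned char)*b)) {
            a++;
            b++;
        }
        return tolower((unsigned char)*a) - tolower((unsigned char)*b);
    }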

Also, the text leading up to the code talks about strcasecmp(), but the code doesn't use it, and it claims the existence of strcasecmp() proves that people like to smash the case of strings before comparing them. Of course, strcasecmp() is the exact opposite: it just does a case-insensitive comparison and doesn't say anything about how that is achieved.

Very confusing.


It is not defined in any standard. wcscmp, wcsfc and wcsnorm are missing. They should follow the Unicode rules.

And they should not depend on the locale at run time, only at config time.

And then there's this wchar_t turkey in the standard, which no one needs at all. We need a u8* API only, nothing else.

The next C standard deliberately did nothing about any of the big open issues. Not even constexpr, which is broken in gcc.

It's not confusing, it's just a huge mess.


Benchmarked? For one lookup, or for repeated lookups?

Hashes have terrible cache locality. Unicode itself has locality, with the Greek characters generally separate from the Chinese characters and so on. The tree-based and array-based methods take advantage of this locality.


Just guessing, but based on statistics of web pages in Asian languages, most text consists mostly of the lower code points, no matter the language. So hash lookups end up being heavily biased towards small subsets of the data. And I wouldn't be surprised if the cache sizes of modern processors conspire to accelerate this lopsided distribution of accesses considerably.


I’ve always wondered whether, in the context of segmenting/layout-ing entire Unicode documents (or large streams where you’re willing to buffer kilobytes at a time, like browser page rendering), there’d be an efficiency win for Unicode processing, in:

1. detecting (either heuristically, or using in-band metadata like HTML “lang”) the set of languages in use in the document; and then

2. rewriting the internal representation of the received document/stream-chunk from “an array of codepoints” to “an array of pairs {language ID, offset within a language-specific tokens table}.”

In other words, one could—with knowledge of which languages are in use in a document—denormalize the codepoints that are considered valid members of multiple languages’ alphabet/ideograph sets, into separate tokens for each language they appear in.

Each such token would “inherit” all the properties of the original Unicode codepoint it is a proxy for, but would only have to encode the properties that actually matter in the language it’s a token of.

As well, each language would be able to set defaults for the properties of its tokens, so that the tokens would only have to encode exceptions to those defaults. There could even be language-specific functions for decoding each property, so that a language could Huffman-compress together the particular properties that apply to it, given the known frequencies of those properties among its tokens, making it cheaper to decode the properties of commonly encountered tokens at the expense of decoding time for rarely encountered ones.

And, of course, this would give each language’s tokens data locality, so that the CPU could keep in cache only the data (or embodied decision trees) for the languages that are actually in use.

Since each token would know what its codepoint is, you could map this back to regular Unicode (e.g. UTF-8) when serializing it.

(Yes, I’m sort of talking about reimplementing code pages. But 1. they’d be code pages as materialized views of Unicode, and 2. you’d never expose the code-page representation to the world, only using it in your own text system.)
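
Roughly, the internal representation I'm imagining would look something like this (the names, field widths and flag bits are invented purely for illustration, not taken from any real text system):

    #include <stdint.h>
    #include <stdio.h>

    enum lang { LANG_LATIN, LANG_GREEK, LANG_COUNT };

    /* One element of the internal document representation: which
       language-specific table to consult, and the index into it. */
    struct token {
        uint8_t  lang;
        uint16_t offset;
    };

    /* Per-language token tables: each entry keeps the original codepoint
       (so serializing back to UTF-8 stays cheap) plus only the property
       bits that matter for that language; language-wide defaults would
       cover the common case. */
    struct lang_token {
        uint32_t codepoint;
        uint8_t  flags;   /* e.g. bit 0 = uppercase; invented encoding */
    };

    static const struct lang_token latin_tokens[] = {
        { 0x0041, 0x01 }, { 0x0061, 0x00 },  /* A, a */
    };
    static const struct lang_token greek_tokens[] = {
        { 0x0391, 0x01 }, { 0x03B1, 0x00 },  /* Alpha, alpha */
    };
    static const struct lang_token *lang_tables[LANG_COUNT] = {
        [LANG_LATIN] = latin_tokens,
        [LANG_GREEK] = greek_tokens,
    };

    /* Mapping a token back to its Unicode codepoint is a table read. */
    static uint32_t token_codepoint(struct token t)
    {
        return lang_tables[t.lang][t.offset].codepoint;
    }

    int main(void)
    {
        const struct token doc[] = { { LANG_GREEK, 0 }, { LANG_LATIN, 1 } };
        for (int i = 0; i < 2; i++)
            printf("U+%04X\n", (unsigned)token_codepoint(doc[i]));
        return 0;
    }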


I don't know why ICU did it that way. libunistring did it a bit better, but it, too, is too big and not performant enough to power coreutils.

The best approach is currently a hybrid of 3-level arrays and a bsearch in a small list of exceptions. This is about 10x smaller and has the same performance. The properties can be booleans, ints or strings, so there's no one-size-fits-all solution.
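
As a rough skeleton of that hybrid (the shifts, table shapes and exception entries below are made up for illustration; a real generator would emit the actual tables):

    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    #define SHIFT1 12
    #define SHIFT2 6
    #define MASK2  0x3F
    #define MASK3  0x3F

    /* Three-level trie, here degenerate: all tables are zero, so every
       codepoint maps through block 0 and leaf 0 to the default value 0.
       A real table would only have distinct blocks/leaves where the
       property data actually differs, which is where the size win is. */
    static const uint16_t stage1[0x110000 >> SHIFT1];     /* codepoint -> block */
    static const uint16_t stage2[1 << (SHIFT1 - SHIFT2)]; /* block     -> leaf  */
    static const uint8_t  stage3[1 << SHIFT2];            /* leaf      -> value */

    static uint8_t trie_lookup(uint32_t cp)
    {
        uint16_t block = stage1[cp >> SHIFT1];
        uint16_t leaf  = stage2[(block << (SHIFT1 - SHIFT2)) + ((cp >> SHIFT2) & MASK2)];
        return stage3[(leaf << SHIFT2) + (cp & MASK3)];
    }

    /* Exceptions: the few codepoints whose data doesn't fit the trie
       value, e.g. one-to-many case foldings, kept sorted for bsearch(). */
    struct exception { uint32_t cp; const char *folded_utf8; };

    static const struct exception exceptions[] = {
        { 0x00DF, "ss" },        /* U+00DF folds to "ss" */
        { 0x0130, "i\xCC\x87" }, /* U+0130 folds to "i" + COMBINING DOT ABOVE */
    };

    static int cmp_exc(const void *k, const void *e)
    {
        uint32_t cp  = *(const uint32_t *)k;
        uint32_t ecp = ((const struct exception *)e)->cp;
        return (cp > ecp) - (cp < ecp);
    }

    static const struct exception *find_exception(uint32_t cp)
    {
        return bsearch(&cp, exceptions,
                       sizeof exceptions / sizeof exceptions[0],
                       sizeof exceptions[0], cmp_exc);
    }

    int main(void)
    {
        printf("default property for U+0041: %u\n", (unsigned)trie_lookup(0x41));
        printf("U+00DF has a folding exception: %s\n",
               find_exception(0xDF) ? "yes" : "no");
        return 0;
    }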


Your article mentions:

> Unicode is pretty established, some use it with the wchar_t API in POSIX, some more as non-POSIX via external non-standardized utf-8 libraries

Just wanted to note that wchar_t is not POSIX per se, but comes from the C standard. It also suffers from various problems; see

https://begriffs.com/posts/2019-01-19-inside-c-standard-lib....



