Because it has to store all properties of all valid codepoints 0-0x10FFFF. It does it via perfect hashes for the fastest lookups, not via space-saving 3-level arrays as most others do.
I described various implementation strategies here: http://perl11.org/blog/foldcase.html
Many developers believe that a case-insensitive comparison is achieved by mapping both strings being compared to either upper- or lowercase and then comparing the resulting bytes. The existence of functions such as ‘strcasecmp’ in some C libraries, as well as common examples in programming books, reinforces this belief:
if (strcmp(toupper(foo),toupper(bar))==0) { // a typical caseless comparison
which I guess is meant to be C, but makes no sense at all. The standard functions toupper() and tolower() operate on single characters, not strings. Modifying entire strings in place and returning them also seems odd.
Also, the text leading up to the code talks about strcasecmp(), but the code doesn't use it, and it claims the existence of strcasecmp() proves that people like to smash the case of strings before comparing them. Of course, strcasecmp() is the exact opposite: it just does a case-insensitive comparison and says nothing about how that is achieved.
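For what it's worth, the character-wise comparison that strcasecmp() actually performs looks more like the sketch below. This is my own illustration for plain ASCII strings, not any particular libc's code, and it is not Unicode case folding:

    #include <ctype.h>

    /* Byte-wise caseless comparison, roughly in the spirit of strcasecmp()
     * for ASCII.  It compares one unsigned char at a time and knows nothing
     * about multi-byte encodings or one-to-many mappings like sharp s. */
    static int ascii_caseless_cmp(const char *a, const char *b)
    {
        for (;; a++, b++) {
            int ca = tolower((unsigned char)*a);
            int cb = tolower((unsigned char)*b);
            if (ca != cb)
                return ca - cb;
            if (ca == '\0')      /* both strings ended at the same place */
                return 0;
        }
    }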
Benchmarked? For one lookup, or for repeated lookups?
Hashes have terrible cache locality. Unicode itself has locality, with the Greek characters generally separate from the Chinese characters and so on. The tree-based and array-based methods take advantage of this locality.
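To make the locality point concrete, here is a toy two-stage table lookup of the kind those methods use (real tables have another level and are packed much more tightly; the property values below are made up for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* Toy two-stage ("trie"-style) property lookup.  Codepoints are split
     * into a page index and an offset within the page; every block with no
     * interesting data shares page 0, so a run of text in one script keeps
     * hitting the same few cache lines. */
    #define PAGE_SHIFT 8
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

    /* Stage 2: per-page property bytes.  Page 0 is the shared default. */
    static const uint8_t page_data[2][1u << PAGE_SHIFT] = {
        { 0 },           /* default: property 0 for everything */
        { [0x91] = 1 },  /* made-up property for U+0391, GREEK CAPITAL LETTER ALPHA */
    };

    /* Stage 1: which page each 256-codepoint block uses (0 = default). */
    static const uint8_t page_index[0x110000 >> PAGE_SHIFT] = {
        [0x0391 >> PAGE_SHIFT] = 1,
    };

    static uint8_t codepoint_property(uint32_t cp)
    {
        const uint8_t *page = page_data[page_index[cp >> PAGE_SHIFT]];
        return page[cp & PAGE_MASK];
    }

    int main(void)
    {
        printf("U+0391 -> %d, U+4E2D -> %d\n",
               codepoint_property(0x0391), codepoint_property(0x4E2D));
        return 0;
    }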
Just guessing, but based on statistics of web pages in Asian languages, most text consists mostly of the lower code points, no matter the language. So hash lookups end up heavily biased towards a small subset of the data. And I wouldn't be surprised if the cache sizes of modern processors conspire to accelerate this lopsided distribution of accesses considerably.
I’ve always wondered whether, in the context of segmenting or laying out entire Unicode documents (or large streams where you’re willing to buffer kilobytes at a time, like browser page rendering), there’d be an efficiency win for Unicode processing in:
1. detecting (either heuristically, or using in-band metadata like HTML “lang”) the set of languages in use in the document; and then
2. rewriting the internal representation of the received document/stream-chunk from “an array of codepoints” to “an array of pairs {language ID, offset within a language-specific tokens table}.”
In other words, one could—with knowledge of which languages are in use in a document—denormalize the codepoints that are considered valid members of multiple languages’ alphabet/ideograph sets, into separate tokens for each language they appear in.
Each such token would “inherit” all the properties of the original Unicode codepoint it is a proxy for, but would only have to actually encode the properties that matter in the language it’s a token of.
And, as well, each language would be able to set defaults for the properties of its tokens, such that the tokens would only have to encode the exceptions to those defaults. There could even be language-specific functions for decoding each property, such that a language could Huffman-compress together the particular properties that apply to it, given the known frequencies of those properties among its tokens, making it cheaper to decode the properties of commonly encountered tokens at the expense of decoding time for rarely encountered ones.
And, of course, this would give each language’s tokens data locality, such that the CPU could keep in cache only the data (or embodied decision trees) for the languages it’s actually using.
Since each token would know what its codepoint is, you could map this back to regular Unicode (e.g. UTF-8) when serializing it.
(Yes, I’m sort of talking about reimplementing code pages. But 1. they’d be code pages as materialized views of Unicode, and 2. you’d never expose the code-page representation to the world, only using it in your own text system.)
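A rough C sketch of what that internal representation might look like; every name here is hypothetical, and nothing below comes from any real library:

    #include <stdint.h>

    /* One element of the rewritten buffer: a (language, token) pair
     * instead of a raw codepoint. */
    typedef struct {
        uint8_t  lang;    /* index into a table of languages in the document */
        uint16_t token;   /* offset into that language's token tables */
    } lang_token;

    /* Per-language tables: only the properties that language needs,
     * plus the mapping back to real codepoints for serialization. */
    typedef struct {
        const uint32_t *token_to_codepoint;  /* for writing UTF-8 back out */
        const uint8_t  *token_properties;    /* NULL if every token uses the default */
        uint8_t         default_properties;
    } lang_table_entry;

    static uint8_t token_property(const lang_table_entry *langs, lang_token t)
    {
        const lang_table_entry *e = &langs[t.lang];
        return e->token_properties ? e->token_properties[t.token]
                                   : e->default_properties;
    }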
I don't know why ICU did it that way. libunistring did it a bit better, but it's also too big and not performant enough to power coreutils.
The best approach is currently a hybrid of 3-level arrays and a bsearch in a small list of exceptions. This is about 10x smaller and has the same performance. The properties can be boolean, int, or string, so there's no one-size-fits-all solution.
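The exception-list half of such a hybrid can be as simple as a sorted array plus bsearch(); the common-case values come out of the packed arrays (much like the two-stage sketch above, with one more level). The entries and values below are invented for illustration:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Codepoints whose property value doesn't fit the packed encoding go
     * into a small sorted exception list that is binary-searched. */
    typedef struct {
        uint32_t cp;
        int32_t  value;
    } exception_entry;

    static const exception_entry exceptions[] = {
        { 0x0130, -199 },  /* made-up values, sorted by codepoint */
        { 0x1E9E, 7 },
        { 0x2126, 42 },
    };

    static int cmp_exception(const void *key, const void *elem)
    {
        uint32_t cp = *(const uint32_t *)key;
        const exception_entry *e = elem;
        return (cp > e->cp) - (cp < e->cp);
    }

    static int lookup_exception(uint32_t cp, int32_t *value)
    {
        const exception_entry *e =
            bsearch(&cp, exceptions,
                    sizeof exceptions / sizeof exceptions[0],
                    sizeof exceptions[0], cmp_exception);
        if (!e)
            return 0;
        *value = e->value;
        return 1;
    }

    int main(void)
    {
        int32_t v;
        if (lookup_exception(0x2126, &v))
            printf("U+2126 has exception value %d\n", v);
        return 0;
    }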