
This is using Punycode encoding, see https://en.m.wikipedia.org/wiki/Emoji_domain.


Why is this even a thing?


There are a bunch of different human writing systems. All of them are weird because they were invented by humans, most of them are very weird indeed because they were invented by humans a long time ago and then gradually mutated.

The Latin system is the one you're using here. It's very popular. Most humans in the integrated tribes are somewhat familiar with it‡. It has twenty six "letters" and then twenty six more "capital letters" which look different but mean almost the same thing for some reason, and then a bunch more symbols that aren't quite letters although some (apostrophe, ampersand) have a better claim than others. But other popular systems include Han, which has a shitload of logograms, and Cyrillic and Greek which have different letters than Latin and different rules about how letters work.

Anyway, the people who invented DNS only or primarily used the Latin system and they weren't much into capital letters. So, their system doesn't treat capital letters as different and only has one set of twenty six Latin letters, ten digits, a dash and an underscore for some reason.
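
A rough sketch of that rule (the stricter hostname flavour, which doesn't even allow the underscore), in Python:

    import re

    # The classic "LDH" hostname rule: letters, digits and hyphens only,
    # no leading or trailing hyphen, 1-63 characters per label.
    LDH_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

    print(bool(LDH_LABEL.match("example")))   # True
    print(bool(LDH_LABEL.match("mañana")))    # False: ñ isn't in the set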

But, lots of people who do NOT have Latin as the primary or only writing system found this annoying to work with. They wanted to use their writing system with DNS especially once the Web came along and popularized the Internet.

Punycode is a way to use some reserved nonsense-looking Latin text in any DNS label to mean that actually this DNS name should be displayed as some Unicode text. Unicode encodes all popular human writing systems (and lots of unpopular ones) fairly well, so this sorts the problem out. Specifically Punycode reserves Latin names that start with xn-- for this purpose. By only caring about display this avoids changing anything in the technical underpinnings. Only user-facing code needed to change, every other layer is unaltered.
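
To illustrate, here's a sketch using Python's built-in codecs (the real IDNA pipeline also normalises the text first, but the idea is the same):

    # "mañana" becomes the ASCII label "xn--maana-pta", which is what actually
    # goes into DNS; browsers just display the Unicode form to the user.
    label = "mañana"
    ace = "xn--" + label.encode("punycode").decode("ascii")
    print(ace)                                  # xn--maana-pta

    # And back again, for display:
    print(ace.removeprefix("xn--").encode("ascii").decode("punycode"))  # mañana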

The rules about IDNs say that a registry (such as .com) should have rules to ensure the names in that registry are unambiguous and meaningful. But in practice the purpose of the .com registry in particular is to make as much money as possible regardless of the consequences. So you can register any name you please and they won't stop you even if it's deliberately confusing.

‡ None of the extant unintegrated tribes have writing. This probably doesn't mean anything important.


Perhaps ironically, URLs were never meant for human consumption in the first place. You were meant to "travel" to various sites via search indices, etc. (Think Google.)

Viewed in that light, restricting DNS names to ASCII as a way to reduce bugs, security issues, etc., makes a lot of sense.


Search engines happened remarkably late. Nobody understood how easy search engines would be.

People thought you would follow links from directory pages linked from your "home page", hence the house symbol still seen in browsers. Yahoo! is a leftover of an attempt at a directory.


Indeed--this is what was meant by "indices".

It's kind of like XML. It was never meant to be seen by human eyes, except in a few debugging scenarios. Unfortunately, that intention was ignored, and now we have a usability disaster. (At least in the case of XML, it can just die.)


I agree with what you said about XML. But in its stead we have a lot of JSON, which in several respects is even worse.

The worst part of it is that it doesn't have a stable definition for numbers, making it impossible to guarantee you're getting the same value back if you encode and then decode a number. Reliably preserving the data you serialise using it should be a primary feature for an encoding format. JSON can't even do that.
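
For example, a Python sketch (parse_int=float just simulates a parser that, like JavaScript's, stores every JSON number as an IEEE-754 double):

    import json

    big = 2**63 - 1           # an ordinary 64-bit integer
    text = json.dumps(big)    # perfectly valid JSON: "9223372036854775807"

    # A decoder that keeps every number as a double can't hand it back:
    decoded = json.loads(text, parse_int=float)
    print(decoded == big)     # False: the value silently changed in transit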


Why do you need a stable serialization for an unstable data type? Use a string if you want stability.
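
For instance, a sketch of that workaround in Python:

    import json

    big = 2**63 - 1
    # Ship the value as a JSON string; any parser hands it back verbatim,
    # and the receiver converts it to an integer explicitly.
    text = json.dumps({"id": str(big)})
    print(int(json.loads(text)["id"]) == big)   # True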


The point is that a 64-bit integer is stable in the language I'm using (which is most languages).

My opinion is that a serialisation format that explicitly makes something as fundamental as the representation of numbers unspecified is not useful as a data representation format.


Oh man, do you remember aol keywords?


And who can forget the original "me too"? :-)


It's another reminder of how much internet infrastructure predates the Unicode standard (along with other equally wacky hacks, like email's UTF-7). URLs weren't restricted to ASCII versus Unicode for any sort of bug reduction/security issue/whatnot, they were restricted to (a subset of) ASCII simply because Unicode did not exist at the time to even be considered.


Unicode (1987-1991) predates URLs (1990-1994) but not DNS (1983)


The most critical Unicode date for this sort of discussion is the standardization of UTF-8 (1993), because HTTP (1989) piggy-backed on Telnet (1969), an 8-bit text channel.


> You were meant to "travel"

You were meant to surf the web ^^.


"Surfing" came later afair it's traced back to some Australian librarian who had a surf board on here mousepad while looking for an analogy for an article she wrote. [Citation needed]


I won't judge URLs, but we are talking about DNS and hostnames here, which predate URLs and were meant to give human-readable names to hosts.


Citation needed that URLs were never meant to be seen. Every web browser had a location bar.


I vaguely recall that the double-hyphen used to be disallowed in domain names, reserving them for future definition. "xn--" becoming that future definition.

I'm having trouble confirming this as the double-hyphen is now very much allowed, thanks to IDN.


FYI, xn stands for eXtended Names.


A+ comment except for this part:

> so this sorts the problem out.


That’s unnecessarily patronizing.


I thought it was a well thought-out and clearly written answer to a question that comes across as dismissive.


Yes, I phrased it that dismissively because I already knew the answer and obviously don't agree with the whole idea, but it's a good answer for people who don't know the relationship between emoji and character encodings.


Nothing patronising about it. I find it interesting, and educative.


Just like the comments that say "ASCII ought to be enough for everybody"?


Imagine if DNS consisted only of Arabic characters. Imagine how grumpy you'd be that you couldn't go to newyork.gov.us for your local government website, and had to go to نيويورك instead.

Assuming you can't read Arabic, how on earth would you even recognise that address, let alone remember how to type it.


Well, I'd probably had to learn something like this: https://en.wikipedia.org/wiki/Arabic_chat_alphabet

I always assumed the non-latin-alphabet people know their way around this because they already used the internet before unicode-domains.


> I always assumed the non-latin-alphabet people know their way around this because they already used the internet before unicode-domains.

But that's not a very good reason to continue it (besides which, there are new people coming to the Internet every day).


Those (Arabizi and other romanizations) are in wide use now specifically because the Latin alphabet was the de facto standard in the computer world. In the hypothetical above, that would not be the case, and you would have to use actual Arabic script.


In Japan there's a trend to use phone numbers as domain names, probably for this very reason.


Even then those numbers are western. I guess it's easy enough to learn - I'm useless at languages but learnt to recognise Arabic and Urdu numbers. At least almost every culture in the world has a base 10 system for encoding numbers -- learning Gujarati numbers in addition to your native number system is trivial.


Interesting. I agree that Latin isn't sufficient, but they could have just extended the list of allowed characters with a few other defined alphabets. No need to include every(?) Unicode code point just because it's the technically more elegant solution.


Allowing only a few alphabets doesn't solve much (Cyrillic "а" still looks like latin "a" in most fonts) but opens new cans of worms: how do you update this list, why isn't alphabet x of important minority y included, etc.
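
A quick sketch of that (the look-alike label is hypothetical, purely for illustration):

    # Cyrillic "а" (U+0430) and Latin "a" (U+0061) are distinct code points
    # that render identically in most fonts.
    latin, cyrillic = "a", "\u0430"
    print(latin == cyrillic)                    # False
    print(hex(ord(latin)), hex(ord(cyrillic)))  # 0x61 0x430

    # A label mixing the two still gets its own, different xn-- form:
    lookalike = "p\u0430ypal"                   # looks like "paypal" in many fonts
    print("xn--" + lookalike.encode("punycode").decode("ascii"))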

Much easier to just allow everything in the technical standards and let domain registries set reasonable standards that make sense for that tld. Sometimes that works well (.de for example has a list of 93 allowed unicode characters [1] which covers everything a German might plausibly use from German or neighboring country's alphabets), but some registries just don't seem to care much (e.g. .com).

1: https://www.denic.de/en/know-how/idn-domains/idn-character-l...


DNS was already 8-bit clean. IDNA was chosen for a complex set of compatibility, management, and user interface reasons.


I have a few years of rusty undergrad Arabic fading in my past, and my take is that non-native speakers would struggle with an ad hoc transliteration system like this (imagine a non-native English-speaker trying to read 133tspeak...)


Punycode? It's to allow non-ASCII characters in domain names without breaking compatibility with all the standards and software that specify that domain names only use ASCII characters.


Unicode and Punycode were originally designed for letting computers handle the wide variety of human writing systems, common ones like Chinese and Arabic, as well as rarer ones. Much more recently, the emoji characters were added to that same Unicode system, so any encoding system these days that's international-friendly also ends up supporting emoji as well.

So Punycode wasn't specifically designed to work with emoji, it just happens to work with emoji because emoji are part of Unicode.
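
A small sketch of that point: the raw Punycode codec doesn't care whether a code point is an accented letter, a Han character or an emoji, they're all just numbers above 127.

    # The same mechanism handles an accented Latin label, a Japanese label
    # and an emoji label; nothing emoji-specific is needed.
    for label in ["münchen", "東京", "❤"]:
        print(label, "->", "xn--" + label.encode("punycode").decode("ascii"))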


It turns out sometimes people might just want to register mañana.com or 魔法少女まどかマギカ.jp or whatever.


In a lot of languages it is good to be able to render something other than us-ascii. In many places the Latin alphabet without diacritics will be well understood and no problem, but that isn't true everywhere.


Do you mean Punycode, or this joke?


Punycode is the joke


Limiting DNS names to 26 Latin characters and the 10 Arabic digits is the joke. Off the top of my head I can't come up with a single language other than English that doesn't use additional letters (or at least additional modifiers on some letters). Punycode is the most sane solution to an insane situation.


DNS should have been UTF-8 from the beginning; that would have been a good solution. As things stand, slapping punycode on it is a terrible solution to a dumb problem.

But then again, when you invent something, you decide how it works, and the world could just re-invent a more international version of DNS but chose not to.

Also, ASCII is still probably the best charset to use, if you have to choose one; it can represent most possible sounds in some way and symbols represent single sounds (as opposed to, for example, chinese characters). It's very widely used (as opposed to, for example, the greek alphabet).

So yes, limiting DNS names to 26 latin characters and 10 arabic digits was probably the best option at the time.


Even English has some borrowed expressions, like "à la", that are more correctly written with non-ASCII characters, so not even English is really safe, if you want to be a bit fancy about it.


Swahili, Hawaiian, Italian, that's just off the top of my head, there are many others.

Hawaiian uses ' as a proper letter, granted.


Doesn't Italian use accented vowels in some cases?


Ah, yes of course it does. Silly me.

None of this detracts from the greater point, being that we need a way for all writing systems to somehow squeeze down into the subset of ASCII supported by legacy protocols.


Well, Latin works, but it's not that popular outside the Vatican nowadays :)


DNS is the joke, punycode is one of the punchlines. ;-)



