It took me a while to get the joke because I didn't recognize that white ghost on a white background. I take this as another example of why these non-ASCII URLs are a bad idea for security.
The use of color here is quite different from a regular character, so measuring by the same yardstick makes no sense. This is a blob of color, not just some thin lines.
> these bizarre characters
At least on Twitter.com, they're not characters, they're SVG images.
At first, I was thinking that I could see the difference quite clearly between the two colors and I had no clue what you were talking about. Then I hit the "swap colors" button and happened to notice that there was a bunch of text on the left side that I hadn't even seen...
Why? If you’re talking about the homograph attack, AFAIK both Firefox and Chrome use whitelisting, which for example prevents you from mixing the Cyrillic “a” with Latin letters.
There are a bunch of different human writing systems. All of them are weird because they were invented by humans, most of them are very weird indeed because they were invented by humans a long time ago and then gradually mutated.
The Latin system is the one you're using here. It's very popular. Most humans in the integrated tribes are somewhat familiar with it‡. It has twenty-six "letters" and then twenty-six more "capital letters" which look different but mean almost the same thing for some reason, and then a bunch more symbols that aren't quite letters although some (apostrophe, ampersand) have a better claim than others. But other popular systems include Han, which has a shitload of logograms, and Cyrillic and Greek, which have different letters than Latin and different rules about how letters work.
Anyway, the people who invented DNS only or primarily used the Latin system and they weren't much into capital letters. So, their system doesn't treat capital letters as different and only has one set of twenty six Latin letters, ten digits, a dash and an underscore for some reason.
But, lots of people who do NOT have Latin as the primary or only writing system found this annoying to work with. They wanted to use their writing system with DNS especially once the Web came along and popularized the Internet.
Punycode is a way to use some reserved nonsense-looking Latin text in any DNS label to mean that actually this DNS name should be displayed as some Unicode text. Unicode encodes all popular human writing systems (and lots of unpopular ones) fairly well, so this sorts the problem out. Specifically, Punycode reserves Latin names that start with xn-- for this purpose. By caring only about display, this avoids changing anything in the technical underpinnings: only user-facing code needed to change, every other layer is unaltered.
The rules about IDNs say that a registry (such as .com) should have rules to ensure the names in that registry are unambiguous and meaningful. But in practice the purpose of the .com registry in particular is to make as much money as possible regardless of the consequences. So you can register any name you please and they won't stop you even if it's deliberately confusing.
‡ None of the extant unintegrated tribes have writing. This probably doesn't mean anything important.
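To make the xn-- mechanics described above concrete, here's a minimal Python sketch using the standard library's 'punycode' and 'idna' codecs; the domain name is just a made-up example.

    label = "bücher"

    # Punycode proper only squeezes the Unicode down into ASCII...
    print(label.encode("punycode"))      # b'bcher-kva'

    # ...the idna codec additionally adds the reserved xn-- prefix,
    # working label by label on a full host name.
    host = "bücher.example".encode("idna")
    print(host)                          # b'xn--bcher-kva.example'

    # Decoding back is purely a display-layer concern; everything
    # below the user-facing code only ever sees the ASCII form.
    print(host.decode("idna"))           # bücher.example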
Perhaps ironically, URLs were never meant for human consumption in the first place. You were meant to "travel" to various sites via search indices, etc. (Think Google.)
Viewed in that light, restricting DNS names to ASCII as a way to reduce bugs, security issues, etc., makes a lot of sense.
Search engines happened remarkably late. Nobody understood how easy search engines would be.
People thought you would follow links from directory pages linked from your "home page", hence the house symbol still seen in browsers. Yahoo! is a leftover of an attempt at a directory.
It's kind of like XML. It was never meant to be seen by human eyes, except in a few debugging scenarios. Unfortunately, that intention was ignored, and now we have a usability disaster. (At least in the case of XML, it can just die.)
I agree with what you said about XML. But in its stead we have a lot of JSON, which in several respects is even worse.
The worst part of it is that it doesn't have a stable definition for numbers, making it impossible to guarantee you're getting the same value back if you encode and then decode a number. Reliably preserving the data you serialise using it should be a primary feature for an encoding format. JSON can't even do that.
The point is that a 64-bit integer is stable in the language I'm using (which is most languages).
My opinion is that a serialisation format that explicitly makes something as fundamental as the representation of numbers unspecified is not useful as a data representation format.
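To illustrate the round-trip problem with a small Python sketch: here parse_int=float merely simulates a decoder that, like JavaScript, stores every JSON number as an IEEE double.

    import json

    # JSON puts no bound on the size or precision of a number, so a
    # 64-bit integer one encoder writes out faithfully can be mangled
    # by a decoder that maps all numbers to doubles.
    original = 2**63 - 1                              # 9223372036854775807
    text = json.dumps(original)
    roundtripped = json.loads(text, parse_int=float)
    print(int(roundtripped) == original)              # False: precision lost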
It's another reminder how much internet infrastructure predates the Unicode standard (along with other equally wacky hacks like email's UTF-7). URLs weren't restricted to ASCII versus Unicode for any sort of bug reduction/security issue/whatnot, they were restricted to (a subset of) ASCII simply because Unicode did not exist at the time to even be considered.
The most critical Unicode date for this sort of discussion is the standardization of UTF-8 (1993), because HTTP (1989) piggy-backed on Telnet (1969), an 8-bit text channel.
"Surfing" came later afair it's traced back to some Australian librarian who had a surf board on here mousepad while looking for an analogy for an article she wrote. [Citation needed]
I vaguely recall that the double-hyphen used to be disallowed in domain names, reserving them for future definition. "xn--" becoming that future definition.
I'm having trouble confirming this as the double-hyphen is now very much allowed, thanks to IDN.
Yes, I wrote it this dismissively because I already knew the answer and obviously don't agree with the whole idea, but it's a good answer for people who don't know the relationship between emojis and character encodings.
Imagine if DNS consisted only of Arabic characters. Imagine how grumpy you'd be that you couldn't go to newyork.gov.us for your local government website and had to go to نيويورك instead.
Assuming you can't read Arabic, how on earth would you even recognise that address, let alone remember how to type it?
Those (Arabizi and other romanizations) are in wide use now specifically because the Latin alphabet was the de facto standard in the computer world. In the hypothetical above, that is not likely the case, and you would have to use actual Arabic script.
Even then, those numbers are Western. I guess it's easy enough to learn - I'm useless at languages but learnt to recognise Arabic and Urdu numbers. At least almost every culture in the world has a base-10 system for encoding numbers -- learning Gujarati numbers in addition to your native number system is trivial.
Interesting. I agree that Latin isn't sufficient, but they could have just extended the list of allowed characters with a few other defined alphabets. No need to include every(?) Unicode code point just because it's the technically more elegant solution.
Allowing only a few alphabets doesn't solve much (Cyrillic "а" still looks like Latin "a" in most fonts) but opens new cans of worms: how do you update this list, why isn't alphabet x of important minority y included, etc.
Much easier to just allow everything in the technical standards and let domain registries set reasonable standards that make sense for that tld. Sometimes that works well (.de for example has a list of 93 allowed unicode characters [1] which covers everything a German might plausibly use from German or neighboring country's alphabets), but some registries just don't seem to care much (e.g. .com).
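For anyone who wants to see the Cyrillic/Latin ambiguity mentioned above for themselves, a quick Python check of the two code points (the glyphs usually look identical, the identifiers don't):

    import unicodedata

    for ch in ("a", "а"):        # the second one is the Cyrillic letter
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0061  LATIN SMALL LETTER A
    # U+0430  CYRILLIC SMALL LETTER A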
I have a few years of rusty undergrad Arabic fading in my past, and my take is that non-native speakers would struggle with an ad hoc transliteration system like this (imagine a non-native English-speaker trying to read 133tspeak...)
Punycode? It's to allow non-ASCII characters in domain names without breaking compatibility with all the standards and software that specify that domain names only use ASCII characters.
Unicode and Punycode were originally designed for letting computers handle the wide variety of human writing systems, common ones like Chinese and Arabic, as well as rarer ones. Much more recently, the emoji characters were added to that same Unicode system, so any encoding system these days that's international-friendly also ends up supporting emoji as well.
So Punycode wasn't specifically designed to work with emoji, it just happens to work with emoji because emoji are part of Unicode.
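A tiny illustration of that last point (the ghost emoji and label here are arbitrary): since emoji are ordinary Unicode code points, the same Punycode encoding handles them without any special casing.

    # Build the ASCII-compatible form of an emoji label by hand:
    # raw Punycode output plus the reserved "xn--" prefix.
    label = "👻"
    ace = "xn--" + label.encode("punycode").decode("ascii")
    print(ace)    # the xn-- form of the label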
In a lot of languages it is good to be able to render something other than us-ascii. In many places the Latin alphabet without diacritics will be well understood and no problem, but that isn't true everywhere.
Limiting DNS names to 26 Latin characters and the 10 Arabic digits is the joke. Off the top of my head, I can't come up with a single language other than English that doesn't use additional letters (or at least additional modifiers on some letters). Punycode is the most sane solution to an insane situation.
DNS should have been UTF-8 from the beginning; that would have been a good solution. As things stand, slapping punycode on it is a terrible solution to a dumb problem.
But then again, when you invent something, you decide how it works, and the world could just re-invent a more international version of DNS but chose not to.
Also, ASCII is still probably the best charset to use, if you have to choose one; it can represent most possible sounds in some way, and its symbols represent single sounds (as opposed to, for example, Chinese characters). It's very widely used (as opposed to, for example, the Greek alphabet).
So yes, limiting DNS names to 26 Latin characters and 10 Arabic digits was probably the best option at the time.
Even English has some borrowed expressions like "à la" that are more correctly written with non-ASCII characters, so not even English is really safe, if you want to be a bit fancy about it.
None of this detracts from the greater point, being that we need a way for all writing systems to somehow squeeze down into the subset of ASCII supported by legacy protocols.
I hate to be a spoilsport, but this is a good reminder that https://en.wikipedia.org/wiki/IDN_homograph_attack is still possible in some domains and with some browsers and CLI tools, although some of the easier tricks to detect have been mitigated.
if you run pihole or a local dnsmasq/unbound it should be possible to mitigate it by sinkholing any unicode domains, e.g. with dnsmasq (requires a patch https://github.com/spacedingo/dnsmasq-regexp_2.76) you can do this:
address=/:xn--*:/0.0.0.0
does anyone know if something like this is possible with unbound?
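If you end up rolling your own filter instead, the test itself is simple, since every IDN arrives on the wire as an ASCII label carrying the xn-- prefix. A rough Python sketch of that check (the function name is made up):

    def has_punycode_label(hostname: str) -> bool:
        """True if any DNS label of hostname uses the IDNA ACE prefix."""
        return any(label.lower().startswith("xn--")
                   for label in hostname.rstrip(".").split("."))

    print(has_punycode_label("xn--bcher-kva.example"))   # True -> sinkhole it
    print(has_punycode_label("example.com"))             # False -> pass through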
I have, but I live in Norway, where we have æøå in the standard alphabet. I suspect it’s more common still in Asian countries, because at least in Norwegian, there are standard ascii replacements for all the extra letters, å = aa, ø = oe, æ = ae
I know http://www.xn--sknetrafiken-ucb.se, but like many it just redirects to an ASCII version. (Does it look weird seeing "Skane" when you know it ought to be "Skaane"?)
Similarly for a power company, http://xn--rsted-uua.dk, just a redirect, but they do use it on adverts and my electricity bill.
In countries that use diacritical characters, umlauts etc., unicode domains are somewhat common (but are often just an alias for a non-unicode primary domain).
I've yet to see a legit unicode domain, and my country doesn't speak English as a first language.
To tell you the truth IDN domains feel like a failure, a gimmick. Their biggest market probably was meant to be countries that don't use the Latin alphabet, and they've failed spectacularly.
If you use Firefox, for your own security, set network.IDN_show_punycode to true.
it used to be possible to block this with a patched dnsmasq that allows setting a regex, but the fork is not maintained and merging the patches to upstream is also not much fun.
I predict that blacklisting unicode domains will become a "best practice" for security, but that eventually their use will become normal and accepted when the security issues have been worked out (perhaps by something as simple as using a different background color for unicode characters in the browser's location bar).
1. Use a browser extension that throws a warning on all unicode domains (maybe even with unicode highlighting). Drawback: Needs to be done per-device.
2. Let your pihole MitM all https traffic with a certificate you do NOT trust (maybe create one per domain, so you can add it to the trusted list); if the connection is over http, upgrade it to https (if the server doesn't speak https, proxy it). Drawback: It's much more complicated, and if your bank happens to be called e.g. "Bank of Zürich" you still need to take a look at the IDN to determine if you're on the right website (or add an exception).
A modification of the second idea: run two dnsmasq servers, one that does the resolving and listens on the loopback interface, and another listening on 53/udp with no-resolv, a whitelist of IDNs, and filtering rules to pass normal domains and block other punycode ones.
As far as I know, no, it’s not possible to do in Unbound since it doesn’t support regex or wildcards on part of a domain (except whole parts as in DNS itself e.g. *.foo.example).
I came up with another solution that works separately from unbound/dnsmasq, using NFQUEUE in the Linux kernel. I'm basically processing the DNS packets in user-land :) ... bit of a hack but it made for a great afternoon https://news.ycombinator.com/item?id=22003933
I wanted to make a new website using emojis instead of "www" as a joke about the number of syllables. ("Angry Face Angry Face Angry Face" takes the same amount of time to say as "www".)
Browsers kept insisting on showing this as xn--b38haa.crankybill.com, so I went with "grr" instead.
Interesting: for me, both Chrome and Firefox seem to show the Punycode encoding after I enter the emoji in the URL.
Do browsers always show the Punycode encoding, or do they show the encoded glyphs only in some scenarios? I can't find examples of Punycode in the wild used by normal websites.
I believe the config of which glyphs to show depends on the TLD. There is a hardcoded list of which character ranges are acceptable per TLD, and if any characters are outside those ranges, the xn-- form is shown instead.
As with so many things domain-name related, what is or is not valid varies by and is determined by the registry. The biggest registries (.com, .net, .org as three examples) generally have a lot of restrictions on IDNs, whereas many countries can afford to allow just about the full gamut of Unicode if they wish.
Hmm, I wonder if that's going to be the next battle field for URLs: Facebook will try to register its logo as an emoji, and you'd just need to go to http://[f] to open their site.
There already is an emoji for apple (the fruit, not the company). Oh the horrors. I should start an emoji NIC!
These would be less than convenient to type, but perhaps as we go more and more towards a non-typing web where a walled-garden start page and predefined links lead to the most popular sites with a click, these URLs will become fashionable. I think if so, this will herald the impending death of the human-read and typed URL in favor of start page links and search results.
You don't have to search. Just hit the apple emoji. If you use it frequently, it'll probably be on the front. If not, it takes two seconds to hit the category it's in and then press it.
Hmm, so I have to press ctrl+cmd+space, then scroll through the picker or type at least "gh" to find that particular emoji. Pretty once or twice I guess.