It took me a while to get the joke because I didn't recognize that white ghost on a white background. I take this as another example of why these non-ASCII URLs are a bad idea for security.
The use of color here is quite different from a regular character, so measuring by the same yardstick makes no sense. This is a blob of color, not just some thin lines.
> these bizarre characters
At least on Twitter.com, they're not characters, they're SVG images.
At first, I was thinking that I could see the difference quite clearly between the two colors and I had no clue what you were talking about. Then I hit the "swap colors" button and happened to notice that there was a bunch of text on the left side that I hadn't even seen...
Why? If you’re talking about the homograph attack, AFAIK both Firefox and Chrome use whitelisting, which for example prevents you from mixing the Cyrillic “a” with Latin letters.
There are a bunch of different human writing systems. All of them are weird because they were invented by humans, most of them are very weird indeed because they were invented by humans a long time ago and then gradually mutated.
The Latin system is the one you're using here. It's very popular. Most humans in the integrated tribes are somewhat familiar with it‡. It has twenty-six "letters" and then twenty-six more "capital letters" which look different but mean almost the same thing for some reason, and then a bunch more symbols that aren't quite letters although some (apostrophe, ampersand) have a better claim than others. But other popular systems include Han, which has a shitload of logograms, and Cyrillic and Greek, which have different letters than Latin and different rules about how letters work.
Anyway, the people who invented DNS only or primarily used the Latin system and they weren't much into capital letters. So, their system doesn't treat capital letters as different and only has one set of twenty six Latin letters, ten digits, a dash and an underscore for some reason.
But, lots of people who do NOT have Latin as the primary or only writing system found this annoying to work with. They wanted to use their writing system with DNS especially once the Web came along and popularized the Internet.
Punycode is a way to use some reserved nonsense-looking Latin text in any DNS label to mean that actually this DNS name should be displayed as some Unicode text. Unicode encodes all popular human writing systems (and lots of unpopular ones) fairly well, so this sorts the problem out. Specifically, Punycode reserves Latin names that start with xn-- for this purpose. By caring only about display, this avoids changing anything in the technical underpinnings: only user-facing code needed to change, every other layer is unaltered.
The rules about IDNs say that a registry (such as .com) should have rules to ensure the names in that registry are unambiguous and meaningful. But in practice the purpose of the .com registry in particular is to make as much money as possible regardless of the consequences. So you can register any name you please and they won't stop you even if it's deliberately confusing.
‡ None of the extant unintegrated tribes have writing. This probably doesn't mean anything important.
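To make the xn-- mechanics described above concrete, here's a minimal Python sketch using the standard library's 'punycode' and 'idna' codecs; the domain name is just a made-up example.

    label = "bücher"

    # Punycode proper only squeezes the Unicode down into ASCII...
    print(label.encode("punycode"))      # b'bcher-kva'

    # ...the idna codec additionally adds the reserved xn-- prefix,
    # working label by label on a full host name.
    host = "bücher.example".encode("idna")
    print(host)                          # b'xn--bcher-kva.example'

    # Decoding back is purely a display-layer concern; everything
    # below the user-facing code only ever sees the ASCII form.
    print(host.decode("idna"))           # bücher.example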
Perhaps ironically, URLs were never meant for human consumption in the first place. You were meant to "travel" to various sites via search indices, etc. (Think Google.)
Viewed in that light, restricting DNS names to ASCII as a way to reduce bugs, security issues, etc., makes a lot of sense.
Search engines happened remarkably late. Nobody understood how easy search engines would be.
People thought you would follow links from directory pages linked from your "home page", hence the house symbol still seen in browsers. Yahoo! is a leftover of an attempt at a directory.
It's kind of like XML. It was never meant to be seen by human eyes, except in a few debugging scenarios. Unfortunately, that intention was ignored, and now we have a usability disaster. (At least in the case of XML, it can just die.)
I agree with what you said about XML. But in its stead we have a lot of JSON, which in several respects is even worse.
The worst part of it is that it doesn't have a stable definition for numbers, making it impossible to guarantee you're getting the same value back if you encode and then decode a number. Reliably preserving the data you serialise using it should be a primary feature for an encoding format. JSON can't even do that.
The point is that a 64-bit integer is stable in the language I'm using (which is most languages).
My opinion is that a serialisation format that explicitly makes something as fundamental as the representation of numbers unspecified is not useful as a data representation format.
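To illustrate the round-trip problem with a small Python sketch: here parse_int=float merely simulates a decoder that, like JavaScript, stores every JSON number as an IEEE double.

    import json

    # JSON puts no bound on the size or precision of a number, so a
    # 64-bit integer one encoder writes out faithfully can be mangled
    # by a decoder that maps all numbers to doubles.
    original = 2**63 - 1                              # 9223372036854775807
    text = json.dumps(original)
    roundtripped = json.loads(text, parse_int=float)
    print(int(roundtripped) == original)              # False: precision lost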
It's another reminder how much internet infrastructure predates the Unicode standard (along with other equally wacky hacks like email's UTF-7). URLs weren't restricted to ASCII versus Unicode for any sort of bug reduction/security issue/whatnot, they were restricted to (a subset of) ASCII simply because Unicode did not exist at the time to even be considered.
The most critical Unicode date for this sort of discussion is the standardization of UTF-8 (1993), because HTTP (1989) piggy-backed on Telnet (1969), an 8-bit text channel.
"Surfing" came later afair it's traced back to some Australian librarian who had a surf board on here mousepad while looking for an analogy for an article she wrote. [Citation needed]
I vaguely recall that the double-hyphen used to be disallowed in domain names, reserving them for future definition. "xn--" becoming that future definition.
I'm having trouble confirming this as the double-hyphen is now very much allowed, thanks to IDN.
Yes, I wrote it this dismissively because I already knew the answer and obviously don't agree with the whole idea, but it's a good answer for people who don't know the relationship between emojis and character encodings.
Imagine if DNS consisted only of Arabic characters. Imagine how grumpy you'd be that you couldn't go to newyork.gov.us for your local government website and had to go to نيويورك instead.
Assuming you can't read Arabic, how on earth would you even recognise that address, let alone remember how to type it?
Those (Arabizi and other romanizations) are in wide use now specifically because the Latin alphabet was the de facto standard in the computer world. In the hypothetical above, that is not likely the case, and you would have to use actual Arabic script.
Even then, those numbers are Western. I guess it's easy enough to learn - I'm useless at languages but learnt to recognise Arabic and Urdu numbers. At least almost every culture in the world has a base-10 system for encoding numbers -- learning Gujarati numbers in addition to your native number system is trivial.
Interesting. I agree that Latin isn't sufficient, but they could have just extended the list of allowed characters with a few other defined alphabets. No need to include every(?) Unicode code point just because it's the technically more elegant solution.
Allowing only a few alphabets doesn't solve much (Cyrillic "а" still looks like Latin "a" in most fonts) but opens new cans of worms: how do you update this list, why isn't alphabet x of important minority y included, etc.
Much easier to just allow everything in the technical standards and let domain registries set reasonable standards that make sense for that tld. Sometimes that works well (.de for example has a list of 93 allowed unicode characters [1] which covers everything a German might plausibly use from German or neighboring country's alphabets), but some registries just don't seem to care much (e.g. .com).
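For anyone who wants to see the Cyrillic/Latin ambiguity mentioned above for themselves, a quick Python check of the two code points (the glyphs usually look identical, the identifiers don't):

    import unicodedata

    for ch in ("a", "а"):        # the second one is the Cyrillic letter
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0061  LATIN SMALL LETTER A
    # U+0430  CYRILLIC SMALL LETTER A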
I have a few years of rusty undergrad Arabic fading in my past, and my take is that non-native speakers would struggle with an ad hoc transliteration system like this (imagine a non-native English-speaker trying to read 133tspeak...)
Punycode? It's to allow non-ASCII characters in domain names without breaking compatibility with all the standards and software that specify that domain names only use ASCII characters.
Unicode and Punycode were originally designed for letting computers handle the wide variety of human writing systems, common ones like Chinese and Arabic, as well as rarer ones. Much more recently, the emoji characters were added to that same Unicode system, so any encoding system these days that's international-friendly also ends up supporting emoji as well.
So Punycode wasn't specifically designed to work with emoji, it just happens to work with emoji because emoji are part of Unicode.
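A tiny illustration of that last point (the ghost emoji and label here are arbitrary): since emoji are ordinary Unicode code points, the same Punycode encoding handles them without any special casing.

    # Build the ASCII-compatible form of an emoji label by hand:
    # raw Punycode output plus the reserved "xn--" prefix.
    label = "👻"
    ace = "xn--" + label.encode("punycode").decode("ascii")
    print(ace)    # the xn-- form of the label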
In a lot of languages it is good to be able to render something other than us-ascii. In many places the Latin alphabet without diacritics will be well understood and no problem, but that isn't true everywhere.
Limiting DNS names to 26 Latin characters and the 10 Arabic digits is the joke. Off the top of my head, I can't come up with a single language other than English that doesn't use additional letters (or at least additional modifiers on some letters). Punycode is the most sane solution to an insane situation.
DNS should have been UTF-8 from the beginning; that would have been a good solution. As things stand, slapping punycode on it is a terrible solution to a dumb problem.
But then again, when you invent something, you decide how it works, and the world could just re-invent a more international version of DNS but chose not to.
Also, ASCII is still probably the best charset to use, if you have to choose one; it can represent most possible sounds in some way, and its symbols represent single sounds (as opposed to, for example, Chinese characters). It's very widely used (as opposed to, for example, the Greek alphabet).
So yes, limiting DNS names to 26 Latin characters and 10 Arabic digits was probably the best option at the time.
Even English has some borrowed expressions like "à la" that are more correctly written with non-ASCII characters, so not even English is really safe, if you want to be a bit fancy about it.
None of this detracts from the greater point, being that we need a way for all writing systems to somehow squeeze down into the subset of ASCII supported by legacy protocols.
I hate to be a spoilsport, but this is a good reminder that https://en.wikipedia.org/wiki/IDN_homograph_attack is still possible in some domains and with some browsers and CLI tools, although some of the easier tricks to detect have been mitigated.
if you run pihole or a local dnsmasq/unbound it should be possible to mitigate it by sinkholing any unicode domains, e.g. with dnsmasq (requires a patch https://github.com/spacedingo/dnsmasq-regexp_2.76) you can do this:
address=/:xn--*:/0.0.0.0
does anyone know if something like this is possible with unbound?
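If you end up rolling your own filter instead, the test itself is simple, since every IDN arrives on the wire as an ASCII label carrying the xn-- prefix. A rough Python sketch of that check (the function name is made up):

    def has_punycode_label(hostname: str) -> bool:
        """True if any DNS label of hostname uses the IDNA ACE prefix."""
        return any(label.lower().startswith("xn--")
                   for label in hostname.rstrip(".").split("."))

    print(has_punycode_label("xn--bcher-kva.example"))   # True -> sinkhole it
    print(has_punycode_label("example.com"))             # False -> pass through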
I have, but I live in Norway, where we have æøå in the standard alphabet. I suspect it’s more common still in Asian countries, because at least in Norwegian, there are standard ascii replacements for all the extra letters, å = aa, ø = oe, æ = ae
I know http://www.xn--sknetrafiken-ucb.se, but like many it just redirects to an ASCII version. (Does it look weird seeing "Skane" when you know it ought to be "Skaane"?)
Similarly for a power company, http://xn--rsted-uua.dk, just a redirect, but they do use it on adverts and my electricity bill.
In countries that use diacritical characters, umlauts etc., unicode domains are somewhat common (but are often just an alias for a non-unicode primary domain).
I've yet to see a legit unicode domain, and my country doesn't speak English as a first language.
To tell you the truth IDN domains feel like a failure, a gimmick. Their biggest market probably was meant to be countries that don't use the Latin alphabet, and they've failed spectacularly.
If you use Firefox, for your own security, set network.IDN_show_punycode to true.
it used to be possible to block this with a patched dnsmasq that allows setting a regex, but the fork is not maintained and merging the patches to upstream is also not much fun.
I predict that blacklisting unicode domains will become a "best practice" for security, but that eventually their use will become normal and accepted when the security issues have been worked out (perhaps by something as simple as using a different background color for unicode characters in the browser's location bar).
1. Use a browser extension that throws a warning on all unicode domains (maybe even with unicode highlighting). Drawback: Needs to be done per-device.
2. Let your pihole MitM all https traffic with a certificate you do NOT trust (maybe create one per domain, so you can add it to the trusted list); if the connection is over http, upgrade it to https (if the server doesn't speak https, proxy it). Drawback: It's much more complicated, and if your bank happens to be called e.g. "Bank of Zürich" you still need to take a look at the IDN to determine if you're on the right website (or add an exception).
A modification of the second idea: run two dnsmasq servers, one that does the resolving and listens on the loopback interface, and another listening on 53/udp with no-resolv, a whitelist of IDNs, and filtering rules to pass normal domains and block other punycode ones.
As far as I know, no, it’s not possible to do in Unbound since it doesn’t support regex or wildcards on part of a domain (except whole parts as in DNS itself e.g. *.foo.example).
I came up with another solution that works separately from unbound/dnsmasq, using NFQUEUE in the Linux kernel. I'm basically processing the DNS packets in user-land :) ... bit of a hack but it made for a great afternoon https://news.ycombinator.com/item?id=22003933
I wanted to make a new website using emojis instead of "www" as a joke about the number of syllables. ("Angry Face Angry Face Angry Face" takes the same amount of time to say as "www".)
Browsers kept insisting on showing this as xn--b38haa.crankybill.com, so I went with "grr" instead.
Interesting: for me, both Chrome and Firefox seem to show the Punycode encoding after I enter the emoji in the URL.
Do browsers always show the Punycode encoding, or do they show the encoded glyphs only in some scenarios? I can't find examples of Punycode in the wild used by normal websites.
I believe the config of which glyphs to show depends on the TLD. There is a hardcoded list of which character ranges are acceptable per TLD, and if any characters are outside those ranges, the xn-- form is shown instead.
As with so many things domain-name related, what is or is not valid varies by and is determined by the registry. The biggest registries (.com, .net, .org as three examples) generally have a lot of restrictions on IDNs, whereas many countries can afford to allow just about the full gamut of Unicode if they wish.
Hmm, I wonder if that's going to be the next battle field for URLs: Facebook will try to register its logo as an emoji, and you'd just need to go to http://[f] to open their site.
There already is an emoji for apple (the fruit, not the company). Oh the horrors. I should start an emoji NIC!
These would be less than convenient to type, but perhaps as we go more and more towards a non-typing web where a walled-garden start page and predefined links lead to the most popular sites with a click, these URLs will become fashionable. I think if so, this will herald the impending death of the human-read and typed URL in favor of start page links and search results.
You don't have to search. Just hit the apple emoji. If you use it frequently, it'll probably be on the front. If not, it takes two seconds to hit the category it's in and then press it.
Hmm, so I have to press ctrl+cmd+space, then scroll through the picker or type at least "gh" to find that particular emoji. Pretty once or twice I guess.