Now that the first part has given us the background on how human-language characters are encoded for computers, we’ll look at how that affects Internet standards as we try to internationalize them. We’ll start with how we browse the web.
Stick a URL into your web browser, and off you go, loading a web site that could be anywhere, in any language. And, indeed, it’s easy to find web pages in any character set, and your browser will render most of them just fine: HTML (HyperText Markup Language) and HTTP (Hypertext Transfer Protocol) are fairly recent standards (from 1990-ish), and are set up to handle all sorts of character sets.
But what about the URL itself? URLs are also fairly new things, and parts of them are already set up for sending all sorts of characters. There are two problems, though:
- The encoding is exposed in the URL, so the raw URL looks really ugly, and doesn’t actually show the international characters.
- The domain name, which is part of the URL, was still limited to US-ASCII, for compatibility with other protocols.
The first of those can be sorted out by the browser — it can accept input from any keyboard and display any weird characters it likes, and then translate URLs that it gets into pretty characters... but still send the ugly, encoded ones on the wire. There’s a bit of an interoperability issue there, but not too much — the worst that can happen is that your browser might sometimes get one wrong, and display an odd URL or send a wrong one.
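To make the “ugly on the wire, pretty on the screen” split concrete, here is a minimal Python sketch using the standard library’s `urllib.parse`. The path `/wiki/Zürich` is a made-up example, not one from any real site, and a real browser does considerably more than this; the sketch only shows the round trip between the two forms.

```python
from urllib.parse import quote, unquote

# Hypothetical URL path containing a non-ASCII character (an assumption
# for illustration only).
pretty = "/wiki/Zürich"

# On the wire: the character is converted to UTF-8 bytes and each
# non-ASCII byte is written as %XX.
wire = quote(pretty)
print(wire)            # /wiki/Z%C3%BCrich

# On the screen: a browser can decode the escapes for display.
print(unquote(wire))   # /wiki/Zürich
```

The interoperability issue mentioned above shows up when the two ends disagree about which character encoding the `%XX` bytes represent; this sketch assumes UTF-8 on both sides.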
The second problem needed a new standard of its own. The specification for it, RFC 3490, describes itself this way:
“Until now, there has been no standard method for domain names to use characters outside the ASCII repertoire. This document defines internationalized domain names (IDNs) and a mechanism called Internationalizing Domain Names in Applications (IDNA) for handling them in a standard fashion. IDNs use characters drawn from a large repertoire (Unicode), but IDNA allows the non-ASCII characters to be represented using only the ASCII characters already allowed in so-called host names today. This backward-compatible representation is required in existing protocols like DNS, so that IDNs can be introduced with no changes to the existing infrastructure. IDNA is only meant for processing domain names, not free text.”
With IDNs, the raw domain names have the same sort of ugly look as the rest of the internationalized URLs, but, again, the browsers can mitigate that by presenting them nicely.
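To see what that ASCII-only form of a domain name looks like, here is a small Python sketch. It relies on Python’s built-in `idna` codec, which implements the RFC 3490 version of IDNA; the domain `bücher.example` is just an illustrative name, not one taken from this article.

```python
# Illustrative internationalized domain name (an assumption for this sketch).
name = "bücher.example"

# The ASCII-compatible form that DNS and other existing protocols see.
ascii_form = name.encode("idna")
print(ascii_form)                 # b'xn--bcher-kva.example'

# The pretty form a browser can show to the user instead.
print(ascii_form.decode("idna"))  # bücher.example
```

Only the label containing non-ASCII characters gets the `xn--` prefix; the plain `example` label passes through unchanged, which is why existing infrastructure does not have to change.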
Unfortunately, that causes its own problems, and there are several. One major one is that by hiding the fact that there are special characters in the domain name, the browser may actually be helping “bad guys”. Consider someone trying to fool you into giving up your PayPal user name and password. You might or might not be fooled by a bogus web site called, say, paypal-secure-login.com. But would one or more of these fool you into thinking you were at the real paypal.com web site?:
Those are all domain names that look similar to paypal.com, but each one has a non-ASCII character in it. Depending upon the font you’re using and how closely you look at the URL, you might not notice the substitution (in some fonts, of course, the difference will be very noticeable). What’s even worse in this regard is that there are zero-width characters defined in Unicode, and such characters could be placed between any two visible characters to make an entirely different domain name that’s absolutely indistinguishable, visually, from the original.
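One way to see through this kind of disguise is to ask the computer, rather than your eyes, which characters are really there. The sketch below is a rough illustration, not how any particular browser defends against spoofing: `suspicious_chars` is a hypothetical helper, and the two look-alike names are made-up examples written with escape sequences so the substitution is visible in the source.

```python
import unicodedata

def suspicious_chars(domain):
    """List every non-ASCII character in a domain as (code point, name)."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for ch in domain
        if ord(ch) > 127
    ]

# The genuine, all-ASCII name: nothing to report.
print(suspicious_chars("paypal.com"))
# []

# Hypothetical look-alike: the first "a" is a Cyrillic letter.
print(suspicious_chars("p\u0430ypal.com"))
# [('U+0430', 'CYRILLIC SMALL LETTER A')]

# Hypothetical look-alike: an invisible character hidden in the middle.
print(suspicious_chars("pay\u200bpal.com"))
# [('U+200B', 'ZERO WIDTH SPACE')]
```

Flagging every non-ASCII character is far too blunt for real use, since it would flag every legitimate IDN as well; it is only meant to show that the differences are easy for software to detect even when they are invisible to a person.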
Next time, we’ll look at internationalization of email.