Friday, December 14, 2007


I18n, la deuxième partie

Now that we have the background from the first part about how human-language characters are encoded for computers, we’ll look at how that affects Internet standards as we try to internationalize them. We’ll start by looking at how we browse the web.

Stick a URL into your web browser, and off you go, loading a web site that could be anywhere, in any language. And, indeed, it’s easy to find web pages in any character set, and your browser will render most of them just fine: HTML (Hypertext Markup Language) and HTTP (Hypertext Transfer Protocol) are fairly recent standards (from 1990-ish), and are set up to handle all sorts of character sets.
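
The character set of a page travels with the page itself, typically in the HTTP Content-Type header (and in a matching meta tag in the HTML). As a small illustration of that mechanism, here’s a Python sketch of how a client might pick the charset out of the header before decoding the body; the URL is just a placeholder, not a site from this post:

    # Sketch: learn a page's character set from the HTTP Content-Type header
    # (e.g. "text/html; charset=UTF-8") and decode the body with it.
    # "example.com" is only a placeholder host.
    from urllib.request import urlopen

    with urlopen("http://example.com/") as response:
        charset = response.headers.get_content_charset() or "iso-8859-1"
        page = response.read().decode(charset)
    print(charset)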

But what about the URL itself? URLs are also fairly new things, and parts of them are already set up for sending all sorts of characters. There’re two problems, though:

  1. The encoding is exposed in the URL, so the raw URL looks really ugly, and doesn’t actually show the international characters.
  2. The domain name, which is part of the URL, was still limited to US-ASCII, for compatibility with other protocols.

The first of those can be sorted out by the browser — it can accept input from any keyboard and display any weird characters it likes, and then translate URLs that it gets into pretty characters... but still send the ugly, encoded ones on the wire. There’s a bit of an interoperability issue there, but not too much — the worst that can happen is that your browser might sometimes get one wrong, and display an odd URL or send a wrong one.
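
To make that “ugly, encoded” form concrete: on the wire, non-ASCII characters in a URL are carried as their UTF-8 bytes, each escaped as %XX. A minimal Python sketch (the path is purely illustrative):

    # Sketch: the percent-encoded form of a non-ASCII URL path, and the
    # "pretty" form a browser can show. The path itself is made up.
    from urllib.parse import quote, unquote

    pretty = "/café/menü"
    wire = quote(pretty)       # "/caf%C3%A9/men%C3%BC" -- what gets sent
    print(wire)
    print(unquote(wire))       # back to "/café/menü" for display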

But the second is more of a problem, and the IETF addressed that with the Internationalized Domain Names work. The abstract of RFC 3490:

  Until now, there has been no standard method for domain names to use characters outside the ASCII repertoire. This document defines internationalized domain names (IDNs) and a mechanism called Internationalizing Domain Names in Applications (IDNA) for handling them in a standard fashion. IDNs use characters drawn from a large repertoire (Unicode), but IDNA allows the non-ASCII characters to be represented using only the ASCII characters already allowed in so-called host names today. This backward-compatible representation is required in existing protocols like DNS, so that IDNs can be introduced with no changes to the existing infrastructure. IDNA is only meant for processing domain names, not free text.

With IDNs, the raw domain names have the same sort of ugly look as the rest of the internationalized URLs, but, again, the browsers can mitigate that by presenting them nicely.
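
For the curious, here’s what that mapping looks like in practice. Python’s standard “idna” codec implements the RFC 3490 conversion between the Unicode form and the ASCII-compatible “xn--” form that DNS actually carries; the domain below is a stock illustration, not a real registration I’m vouching for:

    # Sketch: the IDNA (RFC 3490) round trip between a Unicode domain name
    # and the ASCII "xn--" form that goes into DNS queries.
    # "bücher.example" is just a textbook-style example name.
    unicode_name = "bücher.example"
    ascii_name = unicode_name.encode("idna")
    print(ascii_name)                 # b'xn--bcher-kva.example'
    print(ascii_name.decode("idna"))  # 'bücher.example'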

Unfortunately, that causes its own problems, and there are several. One major one is that by hiding the fact that there are special characters in the domain name, the browser may actually be helping “bad guys”. Consider someone trying to fool you into giving up your PayPal user name and password. You might or might not be fooled by a bogus web site called, say, paypal-secure-login.com. But would one or more of these fool you into thinking you were at the real paypal.com web site?:

pаypаl.com
pąypąl.com
paУpal.com
paỳpal.com
payṗal.com
paypaĺ.com

Those are all domain names that look similar to paypal.com, but each one has a non-ASCII character in it. Depending upon the font you’re using and how closely you look at the URL, you might not notice the substitution (in some fonts, of course, the difference will be very noticeable). What’s even worse in this regard is that there are zero-width characters defined in Unicode, and such characters could be placed between any two visible characters to make an entirely different domain name that’s absolutely indistinguishable, visually, from the original.
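
One way to look past the glyphs is to dump each character’s code point and Unicode name, which makes a Cyrillic “а” standing in for a Latin “a” obvious. A quick Python sketch, using the first lookalike from the list above:

    # Sketch: show the code point and Unicode name of every character in a
    # domain name, so lookalike substitutions become visible.
    import unicodedata

    name = "p\u0430yp\u0430l.com"   # the first lookalike above: U+0430, twice
    for ch in name:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")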

Next time, we’ll look at internationalization of email.

2 comments:

Anonymous said...

So this post got me thinking - how would I detect that "pаypаl.com" is not all ASCII, but contains \u0430 instead of ASCII 'a'? I was able to do so by copying the string, saving it to a file, then opening the file with my programmer's editor and showing the hex, but that's way too many steps for casual use.

Do you know of any tools that allow you to look past the glyphs and see the Unicode codepoints? Such a tool wouldn't be too hard to write, so I may just go ahead and do so, but that's not a general solution.

Barry Leiba said...

Yeah, that's the problem.

There've been proposals that browsers should warn the user (perhaps with some sort of highlighting in the address field) when the domain name is in mixed scripts. The trouble with the idea is that studies have already shown that users generally neither understand nor pay attention to such cues.

Maybe a better idea would be to insist that domain names be in a single script, not a mixed one. That might be a problem too — we certainly want to allow foreign characters to be followed by ".com", and there might be other legitimate reasons for mixing scripts.
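
To illustrate the single-script idea (my own sketch, not a check that any browser or the IETF has specified): one rough approximation is to take the first word of each character’s Unicode name as its script, judge each dot-separated label on its own, and flag labels that mix scripts:

    # Rough sketch of a mixed-script flag: approximate a character's script by
    # the first word of its Unicode name (LATIN, CYRILLIC, ...) and warn when
    # a single label mixes scripts. Labels are judged separately, so a
    # non-Latin name followed by ".com" is not flagged. Illustration only.
    import unicodedata

    def scripts_in(label):
        return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

    def looks_mixed(domain):
        return any(len(scripts_in(label)) > 1 for label in domain.split("."))

    print(looks_mixed("paypal.com"))            # False
    print(looks_mixed("p\u0430yp\u0430l.com"))  # True (Latin + Cyrillic)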

In the technical plenary session at the IETF’s 68th meeting this March, there were some presentations about the state of internationalization and the work ahead. See http://www3.ietf.org/proceedings/07mar/plenaryt.html and look for links to the slides at the bottom of the page.