Saturday, December 15, 2007


I18n, la troisième partie

In the second part of this series, we talked about internationalization of URLs and domain names. Let’s shift now to another Internet application: email. Almost 20 years ago, RFC 1049 defined an email header field called Content-Type, which took the first step toward internationalization of email. From the document’s abstract: “The ability to recognize this field and invoke the appropriate display process accordingly will, however, improve the readability of messages, and allow the exchange of messages containing mathematical symbols, or foreign language characters.”

In 1992, the first version of the Multipurpose Internet Mail Extensions (MIME) standard expanded on that, and led to widespread adoption of internationalization of email message bodies. At the same time, RFC 1342 showed us a way to do the same for message headers (such as the subject line of a message). That early work has since been updated; for reference, the current versions of the relevant standards are the first three parts of the multi-part MIME technical specification: RFC 2045, RFC 2046, and RFC 2047.

But MIME suffers from the same problem that we noted earlier in this series: the raw information is ugly. During the early days of MIME, that caused some deployment difficulty. If I sent you a message that used “funny” character encodings, and your mail program didn’t understand how to decode them, what you saw was strewn with cruft, at best, and was completely unintelligible, at worst. To reduce the occasions for the worst case, MIME included a transition mechanism, multipart/alternative.
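To make the “ugly” part concrete, here is a minimal sketch using Python’s standard email.header module. It shows roughly what a non-ASCII display name looks like on the wire once it has been encoded in the RFC 2047 style, and how a MIME-aware mail program turns it back into readable text. The function names are mine, chosen for illustration, and the exact encoded form (base64 versus quoted-printable) is left to the library.

```python
from email.header import Header, decode_header, make_header


def encode_display_name(name: str) -> str:
    """Return the ASCII-only, RFC 2047-style form of a display name."""
    # Header picks an encoded-word representation for the UTF-8 text.
    return Header(name, charset="utf-8").encode()


def decode_display_name(raw: str) -> str:
    """Turn encoded words back into readable text, as a mail program would."""
    return str(make_header(decode_header(raw)))


if __name__ == "__main__":
    raw = encode_display_name("Hélène Brûlée")
    print("On the wire:", raw)                        # what an old mail program shows
    print("Displayed:  ", decode_display_name(raw))   # what a MIME-aware one shows
```

A mail program that predates MIME would show the first line as-is, which is exactly the cruft described above.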

The good part, though, is that MIME proved so useful and popular that it was widely adopted fairly quickly, and we soon came to a situation where all but the oldest and crustiest plain-text email programs can handle non-English character sets in both the body and the subject line.

But not in the email address.

Email addresses are stuck firmly in US-ASCII, for a variety of protocol-defined reasons. And since the average computer user has become more like my mother than like me (not a technology specialist, but someone who uses a computer as a tool, much like a toaster or a screwdriver), it’s not really acceptable to continue turning email addresses into ASCII. As it stands, we wind up with these sorts of things:

Hélène Brûlée <helene.brulee@example.fr>
Jürgen Kölsch <juergen.koelsch@example.de>
Алла Ленина <alla.lenina@example.ru>
אדם צוי <adam.tzvi@example.il>

All of these people would likely prefer to write their email addresses in their native languages, as they have done with the human-readable names.[1] And it’s worse in languages like Chinese, where a great deal is lost in transliteration.
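The asymmetry between the two halves of an address can be sketched in a few lines of Python. A domain name has an ASCII-compatible form (the IDNA encoding that Python exposes as the built-in "idna" codec), while the part to the left of the @ has no such fallback. The function name and the sample address with a non-ASCII domain are hypothetical, made up for this illustration.

```python
def describe_address(address: str) -> dict:
    """Report which parts of an address fit in US-ASCII.

    Illustrative only: this is not a full address parser.
    """
    local, _, domain = address.rpartition("@")
    return {
        "local_is_ascii": local.isascii(),
        "domain_is_ascii": domain.isascii(),
        # The domain can always be mapped to an ASCII-compatible form;
        # there is no equivalent mapping for the local part.
        "domain_ace": domain.encode("idna").decode("ascii"),
    }


if __name__ == "__main__":
    for addr in ("juergen.koelsch@example.de", "jürgen@kölsch.example"):
        print(addr, describe_address(addr))
```

For the second address, the domain comes back in its “xn--” form, but nothing comparable exists for “jürgen”, which is why people end up transliterating.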

Along comes Email Address Internationalization (EAI), an IETF working group that aims to take the first steps toward correcting this. When the working group was started, it was clear that it wouldn’t be easy, that some trial and error would be needed, and that there might be some false steps. So the charter specifies a series of experimental protocol extensions, not to be put on the standards track until some experimentation has been done and the results have been evaluated and reported.

Of course, a number of protocols need to be extended to handle non-ASCII email addresses, and that has to be done carefully. But beyond that, the most difficult part is coping with what happens when email has to pass through a server that hasn’t yet implemented the new mechanisms. That’s defined in the “downgrading” document (the current specifications can be found on this page), along with material in the “utf8headers” document that provides alternative information for some cases.
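The decision point that makes downgrading necessary can be sketched as follows. This is only an illustration of where the problem arises, not the procedure from the EAI documents: the extension keyword, the function name, and the two action labels are assumptions made for the example, and the actual downgrading rules are considerably more involved.

```python
# Assumed keyword for the experimental SMTP extension; check the EAI
# documents for the name a real server would advertise.
EAI_EXTENSION = "UTF8SMTP"


def choose_action(addresses, server_extensions) -> str:
    """Decide whether a message can go to the next server unchanged.

    addresses: the envelope addresses (sender and recipients) as strings.
    server_extensions: the extension keywords the next server advertised.
    Returns "send" or "downgrade" (hypothetical labels).
    """
    needs_eai = any(not addr.isascii() for addr in addresses)
    advertised = {ext.upper() for ext in server_extensions}
    if not needs_eai or EAI_EXTENSION in advertised:
        return "send"
    # The next hop cannot carry non-ASCII addresses, so the message has to
    # be rewritten into something it can handle (or be rejected).
    return "downgrade"


if __name__ == "__main__":
    print(choose_action(["hélène.brûlée@example.fr"], {"8BITMIME"}))
    print(choose_action(["hélène.brûlée@example.fr"], {"8BITMIME", "UTF8SMTP"}))
```

All-ASCII mail never hits the second branch, which is what lets old and new servers coexist during the transition.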

The major difficulty here is the transition period, during which many Internet servers (most of them, at first) will not understand the new specifications and won’t be able to handle the “funny” characters. And the key, as it was with MIME, is to consider the transition carefully and provide as much assistance as possible to make it go smoothly. Nothing on the Internet changes overnight, so it’s critical to maintain interoperability over protocol transitions.

The EAI documents are almost ready, and experimentation with the protocol changes has begun. The next step is to get reports on the experiments and see where we need to go to put this stuff on the standards track.

Maybe in a few years, if there’s quick adoption of the EAI changes, Hélène and Jürgen and Alla and Adam will be able to get the sorts of email addresses that they’ve wanted from the start.
 


[1] In a raw email message, the human-readable names would actually be encoded according to RFC 2047, and would look awful... but the email programs would display them nicely, as shown here.
