Thursday, December 13, 2007

I18n, la première partie

I’ve mentioned “internationalization” in some of my posts about the IETF and Internet standards, and I thought I’d write a post about what it means and what the issues are. Let’s see how long it gets, and whether I should split it into two. First, some setup, somewhat oversimplified.

Computers, of course, don’t know anything about letters or words, in the sense that we do — the shapes of the characters. They just know bits, binary digits, ones and zeros. And so we have to represent written characters as sequences of bits somehow, for the computer. The way we chose to do this initially, since it was all done in the U.S., was with a scheme called ASCII (American Standard Code for Information Interchange).

ASCII was developed in times when every bit was precious, so the original ASCII was a seven-bit code, allowing it to represent 128 characters. The first 32 are “control characters”, codes that caused the old teletype machines to do things like move to a new line and ring the bell. The rest include 26 upper-case letters, 26 lower-case letters, the ten digits, and assorted punctuation marks.
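
To make that concrete, here’s a small sketch in Python (Python chosen just for illustration; the numeric mapping is the point, not the language), showing a few characters, their ASCII codes and seven-bit patterns, plus a couple of the control characters:

    # A few printable characters: each one's code and its seven-bit pattern.
    for ch in "Az0!":
        print(ch, ord(ch), format(ord(ch), "07b"))

    print(repr(chr(10)))   # code 10 is LF, the "move to a new line" control character
    print(repr(chr(7)))    # code 7 is BEL, the one that rang the teletype's bell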

In order to add the sorts of accented characters that appear in many European languages, extended codes were developed, using eight bits instead of seven. US-ASCII itself is still a seven-bit code (though it’s normally stored in eight-bit bytes, with the extra bit set to zero), and the International Organization for Standardization (ISO) devised a number of eight-bit encodings, the ISO 8859 family, that variously include accented characters, Greek characters, and Cyrillic characters, along with other encodings for Asian languages.
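
As an illustration of how those eight-bit encodings differ (again a Python sketch, using the encoding names Python registers for the ISO 8859 family), the very same byte means a different character depending on which encoding the reader assumes:

    # One byte, several meanings, depending on the assumed ISO 8859 encoding.
    b = bytes([0xE9])
    print(b.decode("iso8859-1"))   # é  (Latin-1, western European)
    print(b.decode("iso8859-7"))   # ι  (Greek)
    print(b.decode("iso8859-5"))   # щ  (Cyrillic)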

For Chinese and other East Asian scripts, eight bits (256 characters) are not sufficient. Double-byte character sets were devised for those, allowing for more than 65,000 characters. That morphed into more general multi-byte character sets, and eventually into Unicode, an extensible and open-ended mechanism for encoding characters, and the associated Unicode Transformation Formats (UTF). Unicode is maintained and coordinated by the Unicode Consortium.
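
Here’s a rough Python illustration: Unicode assigns each character a number (a code point), and the UTFs are just different ways of turning those numbers into bytes:

    s = "中文"                         # two Chinese characters
    print([hex(ord(c)) for c in s])    # code points: ['0x4e2d', '0x6587']
    print(s.encode("utf-8"))           # six bytes, three per character
    print(s.encode("utf-16-be"))       # four bytes, two per character
    print("é".encode("utf-8"))         # even an accented Latin letter takes two bytes in UTF-8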

Individual computer programs can choose whatever character encodings they need in order to represent the languages they want to support. But if they have to exchange information with other programs — the sort of thing that Internet standards are set up to do — they have to agree on the encoding(s) to use.
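
A quick sketch of what happens when they don’t agree: the bytes arrive intact, but the receiving program decodes them with the wrong assumption, and the text comes out garbled:

    data = "café".encode("utf-8")      # the sender uses UTF-8: b'caf\xc3\xa9'
    print(data.decode("utf-8"))        # a receiver that agrees sees: café
    print(data.decode("iso8859-1"))    # one that assumes Latin-1 sees: cafÃ©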

Unfortunately for modern computing needs, that agreement was always US-ASCII, in the old days. And much of what was done in the old days persists today, leaving us with the task of pushing, prodding, and tweaking, gradually replacing the old encodings with newer ones that allow us to represent the many languages of modern computer users.

Very gradually.

There are some places where we can go on using US-ASCII and not worry about it — some computer-to-computer protocols are written in human-readable form for convenience in development and debugging, and it’s OK to leave those as they are. But there are many places where it’s important to be able to use accented letters, along with other scripts such as Greek, Cyrillic, Hebrew, Arabic, Devanagari (used for Hindi), and Chinese, and for those we have to make changes to the standards.
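
As a sketch of the first kind of case (with SMTP picked only as a familiar example of a human-readable protocol), command lines like these fit comfortably in plain US-ASCII, and encoding them strictly as ASCII will complain loudly if anything else sneaks in:

    # These protocol commands are pure ASCII by design;
    # encode("ascii") would raise UnicodeEncodeError if they weren't.
    for line in ["HELO mail.example.org", "MAIL FROM:<user@example.org>", "QUIT"]:
        print(line.encode("ascii"))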

We call that change process “internationalization” — enabling the protocols and data formats to carry international information, not just information in English. We abbreviate it “i18n”, because there are 18 letters between the “i” and the “n”, and because we like to abbreviate things.
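
(And just to check the arithmetic, a trivial Python one-liner:)

    print(len("internationalization") - 2)   # 18 letters between the "i" and the "n"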

In another post, I’ll talk about what we need to change and why, and why it’s hard to do.
