The Cupertino Effect Posted Mon, 10 Mar 2008

I recently wrote about spellcheckers and profanity. Of course, spellcheckers are the site of many other notable revealing errors.

One well-known class of errors is called the Cupertino Effect. The effect is named after an error caused by the fact that some early spellchecker wordlists contained the hyphenated co-operation but not cooperation (both are correct while the former is less common). The ultimate effect, due to the fact that spellchecking algorithms treat hyphenated words as separate words, was that several spellcheckers would suggest Cupertino as a substitute for the "misspelled" cooperation. As the lone suggestion, some people "corrected" cooperation to Cupertino in haste. The weblog Language Log noticed that quite a few people made the mistake in official documents from the UN, EU, NATO and more! These included the following examples found in real documents:

Within the GEIT BG the Cupertino with our Italian comrades proved to be very fruitful. (NATO Stabilisation Force, "Atlas raises the world," 14 May 2003)

Could you tell us how far such policy can go under the euro zone, and specifically where the limits of this Cupertino would be? (European Central Bank press conference, 3 Nov. 1998)

While Language Log authors were incredulous about the idea that there might be spellchecking dictionaries that contain the word Cupertino and not the unhyphenated co-operation, a reader sent in this screenshot from Microsoft Outlook Express circa 1996 using a Microsoft word list from Houghton Mifflin Company. Sure enough, they'd found the culprit.

Cupertino spellchecker screenshot.

Of course, the Cupertino effect is by no means limited to the word cooperation. The Oxford University Press also points out how the Cupertino Effect can rear its head when foreign words and proper nouns are involved. This lead to Reuters referring to Pakistan's Muttahida Quami Movement as the Muttonhead Quail Movement and to Rocky Mountain News naming Leucadia National as La-De-Da National instead. To top that off, Language Log found examples of confusion that led to discussion of copulation which make Cupertino look entirely excusable:

The Western Balkan countries confirmed their intention to further liberalise trade amongst each other. They requested that they be included in the pan-european system of diagonal copulation, which would benefit trade and economic development. (International Organization for Migration, Foreign Ministers Meeting, 22 Nov. 2004)

Of course, the Cupertino Effect is possible every time any spellchecking correction is suggested and the top result is incorrect. As a result, many common misspellings open the door to humorous errors. In a follow-up post, Language Log pointed out if one leaves the "i" off "identified", Microsoft Word 97 will give exactly one suggestion: denitrified which describes the the state of having nitrogen removed. That has led newspapers to report that, "Police denitrified the youths and seized the paintball guns." Which seems unlikely. Similarly, if you leave out the "c" from acquainted, spellcheckers frequently suggest aquatinted as a substitute. As the Oxford University Press blogs pointed out, folks who want to get aquatinted do not often want to be etched with nitric acid!

You can find parallels to the Cupertino effect in the Bucklame Effect I discussed previously. Many of the take-away lessons are the same. Spellcheckers make it easier to say some things correctly and place an additional cost on others. The effect our communication may be subtle but it's real. For example, a spelling mistake might be less forgivable in an era of spellcheckers. Like many communication technologies spellcheckers are normally invisible in the documents they create; nobody is reminded of spellcheckers by a perfectly spelled document. It is only through errors like the Cupertino effect that spellcheckers are revealed.

Further, these nonsensicle suggestions are made only because of the particular way that spellcheckers are built. Microsoft's Natural Language team is apparently working on "contextual" spellcheckers that will be smart enough to guess that you probably don't mean "Cupertino" when you mean cooperation. Of course other errors will remain and new ones will be introduced.

Mojibake Posted Mon, 25 Feb 2008

One of my favorite Japanese words is mojibake (文字化け) which literally translates as "character changing." The term is used to describe an error experienced frequently by computers users who read and write non-Latin scripts -- like Japanese. When readers of non-Latin scripts open a document, email, web page, or some other text, text is sometimes displayed mangled and unreadable. Japanese speakers refer to the resulting garbage as "mojibake." Here's a great example from the mojibake article in Wikipedia (the image is supposed to be in Japanese and to display the the Mojibake article itself).

The UTF-8-encoded Japanese Wikipedia article for mojibake, as
displayed in the Windows-1252 ('ISO-8859-1') encoding.

The problem has been so widespread in Japanese that webpages would often place small images in the top corners of pages that say "mojibake." If a user cannot read the content on the page, the image links to pages which will try to fix the problem for the user.

From a more technical perspective, mojibake might be better described as, "incorrect character decoding," and it hints at a largely hidden part of the way our computers handle text that we usually take for granted.

Of course, computers don't understand Latin or Japanese characters. Instead they operate on bits and bytes -- ones and zeros that represent numbers. In order to input or or output text, computer scientists created mappings of letters and characters to numbers represented by bits and bytes. These mappings end up forming a sequence of characters or letters in a particular order often called a character set. To display two letters, a computers might ask for the fifth and tenth characters from a particular set. These character sets are codes; they map numbers (i.e., positions in the list) to letters just as Morse code maps dots and dashes to letters. Letters can be converted to numbers by a computer for storage and then converted back to be redisplayed. The process is called character encoding and decoding and it happens every time a computer inputs or outputs text.

While there may be some natural orderings, (e.g., A through Z), there are many ways to encode or map a set of letters and numbers (e.g., Should one put numbers before letters in the set? Should capital and lowercase letters be interspersed?). The most important computer character encoding is a ASCII which was first defined in 1963 and is the de facto standard for almost all modern computers. It defines 128 characters including the letters and numbers used in English. But ASCII says nothing about how one should encode accented characters in Latin, scientific symbols, or the characters in any other scripts -- they are simply not in the list of letters and numbers ASCII provides and no mapping is available. Users of ASCII can only use the characters in the set.

Left with computers unable to represent their languages, many non-English speakers have added to and improved on ASCII to create new encodings -- different mappings of bits and bytes to different sets of letters. Japanese text can frequently be found in encodings with obscure technical names likes EUC-JP, ISO-2022-JP, Shift_JIS, and UTF-8. It's not important to understand how they differ -- although I'll come back to this in a future blog post. It's merely important to realize that these each represents different ways to map a set of bits and bytes into letters, numbers, and punctuation.

For example The set of bytes that says "文字化け" (the word for "mojibake" in Japanese) encoded in UTF-8 would show up as "��絖�����" in EUC-JP, "������������" in ISO-2022-JP, and "文字化け" in ISO-8859-1. Each of the strings above is a valid decoding of identical data -- the same ones and zeros. But of course, only the first is correct and comprehensible by a human. Although the others are are displaying the same data, the data is unreadable by humans because it is decoded according to a different character sets mapping! This is mojibake.

For every scrap of text that a computer shows to or takes from a human, the computer needs to keep track of the encoding the data is in. Every web browser must know the encoding of the page it is receiving and the encoding that it will be displayed to the user in. If the data sent is a different format than the one that will be displayed, the computer must convert the text from one encoding to another. Although we don't notice it. Encoding metadata is passed along with almost every webpage we read and every email we send. Data is being converted between encodings millions of times each day. We don't even notice that text is encoded -- until it doesn't decode properly.

Mojibake makes this usually invisible process extremely visible and provides an opportunity to understand that our text is coded -- and how. Encoding introduces important limitations -- it limits our expression to the things that are listed in pre-defined character sets. Until the creation of an encoding called Unicode, one couldn't mix Japanese and Thai in the same document; while there were encodings for both, there were no character sets that encoded the letters for both. Apparently, in Chinese, there are older more obscure characters that no computers can encode yet. Computer users simply can't write these letters on computers. I've seen computers users in Ethiopia emailing each other in English because support for Amharic encodings at the time was so poor and uneven! All of these limits, and many more, are part and parcel of our character encoding systems. They become visible only when the usually invisible process of character encoding is thrust into view. Mojibake provides one such opportunity.

Creating Kanji Posted Tue, 15 Jan 2008

Errors reveal characteristics of the languages we use and the technologies we use to communicate them -- everything from scripts and letter forms (which while very fundamental to written communication are technologies nonetheless) to the computer software we use to create and communicate text.

I've spent the last few weeks in Japan. In the process, I've learned a bit about the Japanese language; no small part of this through errors. Here's one error that taught me quite a lot. The sentence is shown in Japanese and then followed by a translation into English:

今年から貝が胃に棲み始めました。
This year, a clam started living in my stomach.

Needless to say perhaps, this was an error. It was supposed to say:

今年から海外に住み始めました。
This year, I started living abroad.

When the sentences are translated into romaji (i.e., Japanese written in an Roman script) the similarity becomes much more clear to readers that don't understand Japanese:

Kotoshikara kaiga ini sumihajimemashita.
Kotoshikara kaigaini sumihajimemashita.

Kotoshikara means "since this year." Sumihajimemashita means, "has started living." The word kaigaini means "abroad" or "overseas." Kaiga ini (two words) means "clam in stomach." When written phonetically in romaji, the only difference in the two sentences lie in the introduction of a word-break in the middle of "kaigaini." Written out in Japanese, the sentences are quite different; even without understanding, one can see that more than a few of the characters in the sentences differ.

In English word spacing plays an essential role in making written language understandable. Japanese, however, is normally written without spaces between words.

This isn't a problem in Japanese because the Japanese script uses a combination of logograms -- called kanji -- and phonetic characters -- called hiragana and katakana or simply kana -- to delimit words and to describe structure. The result, to Japanese readers, is unambiguous. Phonetically and without spaces, the two sentences are identical in either kana or romaji:

ことしからかいがいにすみはじめました。
Kotoshikarakaigainisumihajimemashita.

In purely phonetic form, the sentence is ambiguous. Using kanji, as shown in the opening examples, this ambiguity is removed. While phonetically identical, "kaigaini" (abroad) and "kaiga ini" (clam in stomach) are very different when kanji is used; they are written "海外に" and "貝が胃に" respectively and are not easily confusable by Japanese readers.

This error, and many others like it, stems from the way that Japanese text is input into computers. Because there are more than 4,000 kanji in frequent use in Japan, there simply are not enough keys on a keyboard to input kanji directly. Instead, text in Japanese is input into computers phonetically (i.e., in kana) without spaces or explicit word boundaries. Once the kana is input, users then transform the phonetic representation of their sentence or phrase into a version using the appropriate kanji logograms. To do so, Japanese computer users employ special software that contains a database of mappings of kana to kanji. In the process, this software makes educated guesses about where word boundaries are. Usually, computers guess correctly. When computers get it wrong, users need to go back and tweak the conversion by hand or select from other options in a list. Sometimes, when users are in a rush, they use an incorrect kana to kanji conversion. It would be obvious to any Japanese computer users that this is precisely what happened in the sentence above.

This type of error has few parallels in English but is extremely common in Japanese writing. The effects, like this one, are often confusing or hilarious. For a Japanese reader, this error reveals the kana to kanji mapping system and the computer software that implements it -- nobody would make such a mistake with a pen and paper. For a person less familiar with Japanese, the error reveals a number of technical particularities about the Japanese writing system and, in the process, about the ways in Japanese differs from other languages they might speak.