language – Revealing Errors

October 17, 2009 (updated January 10, 2013)

Transparency

I caught this revealing error on the always entertaining Photoshop Disasters and thought it was too good to resist pointing out here:

The picture, of course, is a bag of Tao brand jasmine rice for sale in Germany. The error is pretty obvious if you understand a little German: the phrase transparentes sichtfeld literally means transparent field of view. In this case, the phrase is a note written by the graphic designer of the rice bag’s packaging that was never meant to be read by a consumer. The phrase is supposed to indicate to someone involved in the bag’s manufacture that the pink background on which the text is written is supposed to remain unprinted (i.e., as transparent plastic) so that customers get a view directly onto the rice inside the bag.

The error, of course, is that the pink background and the text were never removed. This was possible, in part, because the pink background doesn’t look horribly out of place on the bag. A more important factor, however, is the fact that the person printing the bag and bagging the rice almost certainly didn’t speak German.

In this sense, this bears a lot of similarity with some errors I’ve written up before — e.g., the Welsh autoresponder and the Translate server error restaurant. And as in those cases, there are takeaways here about all the things we take for granted when communicating using technology — things we often don’t realize until language barriers make errors like this thrust hidden processes into view.

This error revealed a bit of the processes through which these bags of rice are produced and a little bit about the people and the division of labor that helped bring it to us. Ironically, this error is revealing precisely through the way that the bag fails to reveal its contents.

April 12, 2009 (updated January 10, 2013)

Quorum of the Twelve Apostates

A number of people (including the New York Times) wrote about a costly error at Brigham Young University last week that was originally reported by the Utah Valley Daily Herald. The error itself was subtle. First, it is important to realize that Brigham Young is a private university owned by the Church of Jesus Christ of Latter-day Saints (i.e., the Mormon Church or LDS for short). The front of the Daily Universe, the BYU university newspaper, featured a photograph of a group of men who form one of the most important governing bodies in the LDS church with the heading, “Quorum of the Twelve Apostates.”

Quorum of the Twelve Apostates

The caption should have said the “Quorum of the Twelve Apostles” which is the name of the governing body in question. An apostle, of course, is a messenger or ambassador although the term is most often used to refer to Jesus’ twelve closest disciples. The term apostle is used in LDS to refer to a special high rank of priest within the church. An apostate is something else entirely; the term refers to a person who is disloyal and unfaithful to a cause — particularly to a religion.

Because the caption labeled the highest priests in the church as disloyal and unfaithful, thousands of copies of the paper (18,500 by one report) were pulled from newsstands around campus. New editions of the paper with a corrected caption were printed to replace them at what must have been enormous cost to BYU and the Daily Universe.

The source of the error, says the university’s spokesperson, was in a spellchecker. Working under a tight deadline, the person spell-checking the captions ran across a misspelled version of “apostles” in the text. In a rush, they clicked the first term in the suggestion list which, unfortunately, happened to be a similarly spelled near-antonym of the word they wanted.

From a technical perspective, this error is a version of the Cupertino effect although the impact was much more strongly felt than most examples of Cupertino. Like Cupertino, BYU’s small disaster can teach us a whole lot about the power and effect of technological affordances. The spell-checking algorithm made it easier for the Daily Universe’s copy editor to write “apostate” than it was to write “apostle” and, as a result, they did exactly that. A system with different affordances would have had different effects.

The affordances in our technological systems are constantly pushing us toward certain choices and actions over others. In an important way, the things we produce and say and the ways we communicate are the product of these affordances. Through errors like BYU’s, we get a glimpse of these usually-hidden affordances in everyday technologies.

February 21, 2009 (updated January 19, 2016)

The Case of the Welsh Autoresponder

Last year, I talked about some of the dangers of machine translation that resulted in a Chinese restaurant advertised as “Translate Server Error” and another restaurant serving “Stir Fried Wikipedia.” This article from the BBC a couple months ago shows that embarrassing translation errors are hardly limited to either China or to machine translation systems.

Mistranslated Welsh road sign

The English half of the sign is printed correctly and says, “No entry for heavy goods vehicles. Residential site only.” Clearly enough, the point of the sign is to prohibit truck drivers from entering a residential neighborhood.

Since the sign was posted in Swansea, Wales, the bottom half of the sign is written in Welsh. The translation of the Welsh is, “I am not in the office at the moment. Send any work to be translated.”

It’s not too hard to piece together what happened. The bottom half of the sign was supposed to be a translation of the English. Unfortunately, the person ordering the sign didn’t speak Welsh. When they sent it off to be translated, they received a quick response, in Welsh, from an email autoresponder explaining that the email’s intended recipient was temporarily away and would be back soon.

Unfortunately, the representative of the Swansea council thought that the autoresponse message (which is, coincidentally, about the right length) was the translation. And onto the sign it went. The autoresponse system was clearly, and widely, revealed by the blunder.

One thing we can learn from this mishap is simply to be wary of hidden intermediaries. Our communication systems are long and complex; every message passes through dozens of computers with a possibility of error, interception, surveillance, or manipulation at every step. Although the representative of the Swansea council thought they were getting a human translation, they, in fact, never talked to a human at all. Because the Swansea council didn’t expect a computerized autoresponse, they didn’t consider that the response was not sent by the recipient.
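One hedged illustration of how such a hidden intermediary can be made visible to software: RFC 3834 defines an `Auto-Submitted` header that automatic responders may add to their replies. The sketch below assumes the autoresponder sets that header, which many do not, and nothing in the story says whether this one did. The `is_auto_reply` helper, the addresses, and the message body are invented for illustration.

```python
from email import message_from_string

def is_auto_reply(raw_message):
    """Return True if the message declares itself automatic via Auto-Submitted."""
    msg = message_from_string(raw_message)
    # RFC 3834: any value other than "no" marks an automatically generated message.
    value = msg.get("Auto-Submitted", "no").strip().lower()
    return value != "no"

# Hypothetical auto-reply; the body stands in for text the recipient cannot read.
auto_reply = (
    "From: translator@example.org\n"
    "Subject: Out of office\n"
    "Auto-Submitted: auto-replied\n"
    "\n"
    "(out-of-office text in Welsh)\n"
)

# Hypothetical reply written by a person; no Auto-Submitted header is present.
human_reply = (
    "From: translator@example.org\n"
    "Subject: Re: sign translation\n"
    "\n"
    "(the requested translation)\n"
)

print(is_auto_reply(auto_reply))   # True
print(is_auto_reply(human_reply))  # False
```

A check like this only helps when the sending system cooperates; a reader who cannot understand the reply still has no way to verify its content.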

Another important lesson, and one also present in the Chinese examples, is that software needs to give users responses in the language they are interacting in to be interpreted correctly. In the translation context where users plan to use, but may not understand, their program’s output, this is often impossible. That’s why when a person has someone, or some system, translate into a language they do not speak, they open themselves up to these types of errors. If a user does not understand the output of a system they are using, they are put completely at the whim of that system. The fact that we usually do understand our technology’s output provides a set of “sanity checks” that can keep this power in check. We are so susceptible to translation errors because these checks are necessarily removed.

October 26, 2008 (updated January 10, 2013)

Beef Panties

Many of the gems from the newspaper correction blog Regret the Error qualify as revealing errors. One particularly entertaining example was this Reuters syndicated wire story on the recall of beef whose opening paragraph explained that (emphasis mine):

Quaker Maid Meats Inc. on Tuesday said it would voluntarily recall 94,400 pounds of frozen ground beef panties that may be contaminated with E. coli.

ABC News Beef Panties Article

Of course the article was talking about beef patties, not beef panties.

This error can be blamed, at least in part, on a spellchecker. I talked about spellcheckers before when I discussed the Cupertino effect which happens when someone spells a word correctly but is prompted to change it to an incorrect word because the spellchecker does not contain the correct word in its dictionary. The Cupertino effect explains why the New Zealand Herald ran a story with Saddam Hussein’s name rendered as Saddam Hussies and Reuters ran a story referring to Pakistan’s Muttahida Quami Movement as the Muttonhead Quail Movement.

What’s going on in the beef panties example seems to be a little different and more subtle. Both “patties” and “panties” are correctly spelled words that are one letter apart. The typo that changes patties to panties is, unlike swapping Cupertino in for cooperation, an easy one for a human to make. Single letter typos in the middle of a word are easy to make and easy to overlook.

As nearly all word processing programs have come to include spellcheckers, writers have become accustomed to them. We look for the red squiggly lines underneath words indicating a typo and, if we don’t see it, we assume we’ve got things right. We do so because this is usually a correct assumption: spelling errors or typos that result in them are the most common type of error that writers make.

In a sense though, the presence of spellcheckers has made one class of misspellings (those that result in correctly spelled but incorrect words) more likely than before. By making most errors easier to catch, spellcheckers lead us to spend less time proofreading and, in the process, make a smaller class of errors (in this case, swapped words) more likely than it used to be. The result is errors like “beef panties.”
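A minimal sketch of why a dictionary lookup misses this class of error. The tiny `WORDLIST` and the `flag_misspellings` helper are hypothetical and invented for illustration; real spellcheckers are more elaborate, but a plain membership test behaves the same way for real-word typos.

```python
import re

# Hypothetical, tiny wordlist standing in for a spellchecker's dictionary.
WORDLIST = {"frozen", "ground", "beef", "patties", "panties", "recall"}

def flag_misspellings(text):
    """Return the words a dictionary lookup would underline in red."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in WORDLIST]

# A typo that produces a non-word is caught.
print(flag_misspellings("frozen ground beef pattys"))   # ['pattys']

# A typo that produces another real word passes silently.
print(flag_misspellings("frozen ground beef panties"))  # []
```

The second call returns an empty list, which is exactly the "no red squiggly lines" signal that writers have learned to trust.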

Although we’re not always aware of them, the affordances of technology change the way we work. We proofread differently when we have a spellchecker to aid us. In a way, the presence of a successful error-catching technology makes certain types of errors more likely.

One could make an analogy with the arguments made against some security systems. There’s a strong argument in the security community that creation of a bad security system can actually make people less safe. If one creates a new high-tech electronic passport validator, border agents might stop checking the pictures as closely or asking tough questions of the person in front of them. If the system is easy to game, it can end up making the border less safe.

Error-checking systems eliminate many errors. In doing so, they can create affordances that make others more likely! If the error checking system is good enough, we might stop looking for errors as closely as we did before and more errors of the type that are not caught will slip through.

July 21, 2008 (updated January 10, 2013)

Lost in Machine Translation

While I’ve been traveling over the last week or so, loads of people sent me a link to this wonderful image of a sign in China reading “Translate Server Error” which has been written up all over the place. Thanks everyone!

Billboard saying

It’s pretty easy to imagine the chain of events that led to this revealing error. The sign is describing a restaurant (the Chinese text, 餐厅, means “dining hall”). In the process of making the sign, the producers tried to translate Chinese text into English with a machine translation system. The translation software did not work and produced the error message, “Translate Server Error.” Unfortunately, because the software’s user didn’t know English, they thought that the error message was the translation and the error text went onto the sign.

This class of error is extremely widespread. When users employ machine translation systems, it’s because they want to communicate to people with whom they do not have a language in common. What that means is that the users of these systems are often in no position to understand the output (or input, depending on which way the translation is going) of such systems and have to trust the translation technology and its designers to get things right.

Here’s another one of my favorite examples that shows a Chinese menu selling stir-fried Wikipedia.

Billboard saying

It’s not entirely clear how this error came about but it seems likely that someone did a search for the Chinese word for a type of edible fungus and its translation into English. The most relevant and accurate page very well might have been an article on the fungus on Wikipedia. Unfamiliar with Wikipedia, the user then confused the name of the article with the name of the website. There have been several distinct sightings of “wikipedia” on Chinese menus.

There are a few errors revealed in these examples. Of course, there are errors in the use of language and the broken translation server itself. Machine translation tools are powerful intermediaries that determine (often with very little accountability) the content of one’s messages. The authors of the translation software might design their tool to favor certain terminology and word choices over others or to silently censor certain messages. When the software is generating reasonable sounding translations, the authors and readers of machine translated texts are usually unaware of the ways in which messages are being changed. By revealing the presence of a translation system or process, this power is hinted at.

Of course, one might be able to recognize a machine translation system simply by the roughness and nature of a translation. In this particular case, the server itself came explicitly into view; it was mentioned by name! In that sense, the most serious failure was not that the translation server failed or that Wikipedia was used incorrectly, but rather that each system failed to communicate the basic fact that there was an error in the first place.
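One way to read this lesson in code: a failure reported through the same channel as the result is indistinguishable from a result to someone who cannot read it. The sketch below is hypothetical. The two `translate_*` functions and the `TranslationError` class are invented for illustration and do not correspond to any real translation service. It contrasts returning an error message as a string with raising an exception that the caller cannot mistake for a translation.

```python
class TranslationError(Exception):
    """Raised when the (hypothetical) translation backend is unavailable."""

def translate_in_band(text, server_up):
    # Failure travels in the same channel as success: a string either way.
    if not server_up:
        return "Translate server error"
    return f"<translation of {text}>"

def translate_out_of_band(text, server_up):
    # Failure is signalled separately, so it cannot be mistaken for output.
    if not server_up:
        raise TranslationError("translation backend unavailable")
    return f"<translation of {text}>"

# In-band failure: to a user who cannot read English, this is just "the English".
sign_text = translate_in_band("餐厅", server_up=False)
print(sign_text)

# Out-of-band failure: the caller is forced to notice that no translation exists.
try:
    sign_text = translate_out_of_band("餐厅", server_up=False)
except TranslationError:
    sign_text = None
print(sign_text)
```

Even an exception only helps if the interface presents it in a language the user understands, which is the harder problem the examples above point to.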

June 30, 2008 (updated January 10, 2013)

Tyson Homosexual

Thanks to everyone who pointed me to the flub below. It was reported all over the place today.

Screenshot showing Tyson Homosexual instead of Tyson Gay

The error occurred on One News Now, a news website run by the conservative Christian American Family Association. The site provides Christian conservative news and commentary. One of the things they do, apparently, is offer a version of the standard Associated Press news feed. Rather than just republishing it, they run a computer program that cleans up the language so it more accurately reflects their values and choice of terminology.

The error is a pretty straightforward variant of the clbuttic effect — a run-away filter trying to clean up text by replacing offensive terms with theoretically more appropriate ones. Among other substitutions, AFA/ONN replaced the term “gay” with “homosexual.” In this case, they changed the name of champion sprinter and U.S. Olympic hopeful Tyson Gay to “Tyson Homosexual.” In fact, they did it quite a few times as you can see in the screenshot below.

Screenshot showing Tyson Homosexual instead of Tyson Gay.
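The substitution can be sketched in a few lines. This is a guess at the kind of rule involved, not AFA's actual code: the `REPLACEMENTS` table and the `clean_up` function are hypothetical. Whether the real filter matched whole words is unknown; the sketch assumes it did, to show that even a boundary-aware rule rewrites a surname, because a pattern cannot tell a name from an adjective.

```python
import re

# Assumed replacement table; the site's real rules are not public.
REPLACEMENTS = {"gay": "homosexual"}

def clean_up(text):
    """Replace whole words from REPLACEMENTS, keeping initial capitalization."""
    def swap(match):
        word = match.group(0)
        replacement = REPLACEMENTS[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement

    pattern = r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b"
    return re.sub(pattern, swap, text, flags=re.IGNORECASE)

print(clean_up("Tyson Gay easily won his semifinal."))
# Tyson Homosexual easily won his semifinal.
```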

Now, from a technical perspective, the technology this error reveals is identical to the clbuttic mistake. What’s different, however, is the values that the error reveals.

AFA doesn’t advertise the fact that it changes words in its AP stories; it just does it. Most of its readers probably never know the difference or realize that the terminology of the stories they read is being intentionally manipulated. AFA prefers the term “homosexual,” which sounds clinical, to “gay” which sounds much less serious. Their substitution, and the error it created, reflects a set of values that AFA and ONN have about the terminology around homosexuality.

It’s possible that the AFA/ONN readers already know about AFA’s values. Even so, this error provides an important reminder and shows, quite clearly, the importance that AFA gives to terminology. It reveals their values and some of the actions they are willing to take to protect them.

June 18, 2008 (updated January 10, 2013)

Clbuttic

Revealing errors are often most powerful when they reveal the presence of or details about a technology’s designer. One of my favorite ~~clbuttes~~ classes of revealing errors consists of those that go one step further and reveal the values of the designers of systems. I’ve touched on these twice before in my post about T9 input systems and when I talked about profanity in wordlists.

Another wonderful example surfaced in this humorous anecdote about what was supposed to be an invisible anti-profanity system that instead filled a website with nonsensical terms like “clbuttic.”

Basically, the script in question tried to look through user input and to swap out instances of profanity with less offensive synonyms. For example, “ass” might become “butt”, “shit” might become “poop” or “feces”, and so on. To work correctly, the script should have looked for instances of profanity between word boundaries — i.e., profanity surrounded on both sides by spaces or punctuation. The script in question did not.

The result was hilarious. Not only was “ass” changed to “butt,” but any word that contained the letters “ass” was transformed as well! The word “classic” was mangled as “clbuttic.”
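The difference between the broken script and a correct one can be sketched as follows. The original script is not available, so this is an illustration under assumptions: the `SUBSTITUTIONS` table and both function names are hypothetical.

```python
import re

# Hypothetical substitution table, following the example in the text.
SUBSTITUTIONS = {"ass": "butt"}

def naive_filter(text):
    """Replace every occurrence of the letters, even inside longer words."""
    for bad, mild in SUBSTITUTIONS.items():
        text = text.replace(bad, mild)
    return text

def boundary_filter(text):
    """Replace only whole words, using the regex word-boundary anchor."""
    for bad, mild in SUBSTITUTIONS.items():
        text = re.sub(rf"\b{re.escape(bad)}\b", mild, text)
    return text

print(naive_filter("a classic assassination"))     # a clbuttic buttbuttination
print(boundary_filter("a classic assassination"))  # a classic assassination
print(boundary_filter("kick ass!"))                # kick butt!
```

The word-boundary version leaves "classic" alone while still rewriting the standalone word.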

The mistake was an easy one to make. In fact, other programmers made the same mistake and searches for “clbuttic” turn up thousands of instances of the term on dozens of independent websites. Searching around, one can find references to a mbuttive music quiz, a mbuttive multiplayer online game, references to how the average consumer is a pbutterby, a transit pbuttenger executed by Singapore, Fermin Toro Jimenez (Ambbuttador of Venezuela), the correct way to deal with an buttailant armed with a banana, and much, much more.

You can even find a reference to how Hinckley tried to buttbuttinate Ronald Reagan!

Each error reveals the presence of an anti-profanity script; obviously, no human would accidentally misspell or mistake the words in question in any other situation! In each case, the existence of a designer and an often hidden intermediary is revealed. What’s perhaps more shocking than this error is the fact that most programmers won’t make this mistake when implementing similar systems. On thousands of websites, our posts and messages and interactions are “cleaned-up” and edited without our consent or knowledge. As a matter of routine, our words are silently and invisibly changed by these systems. Few of us, and even fewer of our readers, ever know the difference. While switching “ass” to “butt” may be harmless enough, it’s a stark reminder of the power that technology gives the designers of technical systems to force their own values on their users and to frame, and perhaps to substantively change, the messages that their technologies communicate.

March 10, 2008 (updated December 26, 2014)

The Cupertino Effect

I recently wrote about spellcheckers and profanity. Of course, spellcheckers are the site of many other notable revealing errors.

One well-known class of errors is called the Cupertino Effect. The effect is named after an error caused by the fact that some early spellchecker wordlists contained the hyphenated co-operation but not cooperation (both are correct while the former is less common). The ultimate effect, due to the fact that spellchecking algorithms treat hyphenated words as separate words, was that several spellcheckers would suggest Cupertino as a substitute for the “misspelled” cooperation. As the lone suggestion, some people “corrected” cooperation to Cupertino in haste. The weblog Language Log noticed that quite a few people made the mistake in official documents from the UN, EU, NATO and more! These included the following examples found in real documents:

Within the GEIT BG the Cupertino with our Italian comrades proved to be very fruitful. (NATO Stabilisation Force, “Atlas raises the world,” 14 May 2003)

Could you tell us how far such policy can go under the euro zone, and specifically where the limits of this Cupertino would be? (European Central Bank press conference, 3 Nov. 1998)

While Language Log authors were incredulous about the idea that there might be spellchecking dictionaries that contain the word Cupertino and not the unhyphenated cooperation, a reader sent in this screenshot from Microsoft Outlook Express circa 1996 using a Microsoft word list from Houghton Mifflin Company. Sure enough, they’d found the culprit.

Cupertino spellchecker screenshot.
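The mechanism can be sketched with Python's standard `difflib` module. This is not how any real spellchecker ranks suggestions; it is an illustration under assumptions. The four-word `WORDLIST` is invented, and the rule that hyphenated entries are never offered as single-word suggestions is a simplification of the behavior described above.

```python
import difflib

# Hypothetical wordlist: it has the hyphenated spelling and a place name,
# but not the unhyphenated "cooperation".
WORDLIST = ["co-operation", "Cupertino", "meeting", "fruitful"]

def suggest(word, wordlist):
    """Return replacement suggestions for a word missing from the wordlist."""
    if word in wordlist:
        return []  # known word: nothing to suggest
    # Simplifying assumption: hyphenated entries are stored as separate parts
    # and so are never offered as a whole-word suggestion.
    candidates = {w.lower(): w for w in wordlist if "-" not in w}
    matches = difflib.get_close_matches(word.lower(), list(candidates), n=3)
    return [candidates[m] for m in matches]

print(suggest("cooperation", WORDLIST))   # the correctly spelled word is flagged
print(suggest("co-operation", WORDLIST))  # the hyphenated form is accepted
```

With this wordlist, the only entry similar enough to be offered for "cooperation" is the place name, so a writer in a hurry has exactly one button to click.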

Of course, the Cupertino effect is by no means limited to the word cooperation. The Oxford University Press also points out how the Cupertino Effect can rear its head when foreign words and proper nouns are involved. This led to Reuters referring to Pakistan’s Muttahida Quami Movement as the Muttonhead Quail Movement and to Rocky Mountain News naming Leucadia National as La-De-Da National instead. To top that off, Language Log found examples of confusion that led to discussion of copulation which make Cupertino look entirely excusable:

The Western Balkan countries confirmed their intention to further liberalise trade amongst each other. They requested that they be included in the pan-european system of diagonal copulation, which would benefit trade and economic development. (International Organization for Migration, Foreign Ministers Meeting, 22 Nov. 2004)

Of course, the Cupertino Effect is possible every time any spellchecking correction is suggested and the top result is incorrect. As a result, many common misspellings open the door to humorous errors. In a follow-up post, Language Log pointed out that if one leaves the “i” off “identified”, Microsoft Word 97 will give exactly one suggestion: denitrified which describes the state of having nitrogen removed. That has led newspapers to report that, “Police denitrified the youths and seized the paintball guns.” Which seems unlikely. Similarly, if you leave out the “c” from acquainted, spellcheckers frequently suggest aquatinted as a substitute. As the Oxford University Press blogs pointed out, folks who want to get aquatinted do not often want to be etched with nitric acid!

You can find parallels to the Cupertino effect in the Bucklame Effect I discussed previously. Many of the take-away lessons are the same. Spellcheckers make it easier to say some things correctly and place an additional cost on others. The effect on our communication may be subtle but it’s real. For example, a spelling mistake might be less forgivable in an era of spellcheckers. Like many communication technologies, spellcheckers are normally invisible in the documents they create; nobody is reminded of spellcheckers by a perfectly spelled document. It is only through errors like the Cupertino effect that spellcheckers are revealed.

Further, these nonsensical suggestions are made only because of the particular way that spellcheckers are built. Microsoft’s Natural Language team is apparently working on “contextual” spellcheckers that will be smart enough to guess that you probably don’t mean “Cupertino” when you mean cooperation. Of course other errors will remain and new ones will be introduced.

February 25, 2008 (updated January 10, 2013)

Mojibake

One of my favorite Japanese words is mojibake (文字化け) which literally translates as “character changing.” The term is used to describe an error experienced frequently by computer users who read and write non-Latin scripts like Japanese. When readers of non-Latin scripts open a document, email, web page, or some other text, the text is sometimes displayed mangled and unreadable. Japanese speakers refer to the resulting garbage as “mojibake.” Here’s a great example from the mojibake article in Wikipedia (the image is supposed to be in Japanese and to display the Mojibake article itself).

The problem has been so widespread in Japanese that webpages would often place small images in the top corners of pages that say “mojibake.” If a user cannot read the content on the page, the image links to pages which will try to fix the problem for the user.

From a more technical perspective, mojibake might be better described as, “incorrect character decoding,” and it hints at a largely hidden part of the way our computers handle text that we usually take for granted.

Of course, computers don’t understand Latin or Japanese characters. Instead they operate on bits and bytes: ones and zeros that represent numbers. In order to input or output text, computer scientists created mappings of letters and characters to numbers represented by bits and bytes. These mappings end up forming a sequence of characters or letters in a particular order often called a character set. To display two letters, a computer might ask for the fifth and tenth characters from a particular set. These character sets are codes; they map numbers (i.e., positions in the list) to letters just as Morse code maps dots and dashes to letters. Letters can be converted to numbers by a computer for storage and then converted back to be redisplayed. The process is called character encoding and decoding and it happens every time a computer inputs or outputs text.
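The idea of a character set as an ordered list can be sketched with a toy example. The 27-entry `CHARSET` below is invented for illustration and is not any real encoding; the `encode` and `decode` helpers are likewise hypothetical.

```python
# A toy character set: each character's position in the list is its code.
CHARSET = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ ")

def encode(text):
    """Map each character to its position in CHARSET."""
    return [CHARSET.index(ch) for ch in text]

def decode(codes):
    """Map each position back to its character."""
    return "".join(CHARSET[n] for n in codes)

codes = encode("HELLO")
print(codes)          # [7, 4, 11, 11, 14]
print(decode(codes))  # HELLO
```

Decoding the same numbers against a list in a different order would produce different letters, which is the seed of the problem described in this post.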

While there may be some natural orderings, (e.g., A through Z), there are many ways to encode or map a set of letters and numbers (e.g., Should one put numbers before letters in the set? Should capital and lowercase letters be interspersed?). The most important computer character encoding is ASCII which was first defined in 1963 and is the de facto standard for almost all modern computers. It defines 128 characters including the letters and numbers used in English. But ASCII says nothing about how one should encode accented characters in Latin, scientific symbols, or the characters in any other scripts; they are simply not in the list of letters and numbers ASCII provides and no mapping is available. Users of ASCII can only use the characters in the set.
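Python's built-in `ascii` codec makes this limit easy to see. The `ascii_encodable` helper below is a hypothetical name introduced for this sketch; the codec and the `UnicodeEncodeError` exception are part of the standard library.

```python
def ascii_encodable(text):
    """Return True if every character has a position in ASCII's 128-entry table."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character has no mapping in ASCII.
        return False

print(ascii_encodable("mojibake"))  # True
print(ascii_encodable("café"))      # False: the accented letter is not in the set
print(ascii_encodable("文字化け"))   # False
```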

Left with computers unable to represent their languages, many non-English speakers have added to and improved on ASCII to create new encodings: different mappings of bits and bytes to different sets of letters. Japanese text can frequently be found in encodings with obscure technical names like EUC-JP, ISO-2022-JP, Shift_JIS, and UTF-8. It’s not important to understand how they differ, although I’ll come back to this in a future blog post. It’s merely important to realize that each of these represents a different way to map a set of bits and bytes into letters, numbers, and punctuation.

For example, the set of bytes that says “文字化け” (the word for “mojibake” in Japanese) encoded in UTF-8 would show up as “��絖��” in EUC-JP, “��” in ISO-2022-JP, and “æ–‡å—åŒ–ã‘” in ISO-8859-1. Each of the strings above is a decoding of identical data, the same ones and zeros. But of course, only the first is correct and comprehensible by a human. Although the others are displaying the same data, the data is unreadable by humans because it is decoded according to a different character set’s mapping! This is mojibake.
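The experiment is easy to repeat with Python's standard codecs. This is a sketch rather than a reproduction of the exact strings quoted above: the characters printed depend on the codec implementation and on how the terminal renders invalid or control bytes, so no expected output is shown for the mis-decoded lines.

```python
original = "文字化け"
data = original.encode("utf-8")  # the same bytes are used for every decoding below

for codec in ("utf-8", "euc_jp", "iso2022_jp", "latin-1"):
    # errors="replace" substitutes U+FFFD for byte sequences the codec rejects.
    print(codec, "->", data.decode(codec, errors="replace"))

# Mis-decoding with latin-1 loses nothing, because every byte maps to some
# character; reversing the wrong step recovers the text.
garbled = data.decode("latin-1")
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

The final lines show that this kind of mojibake is a presentation problem: the bytes are intact and only the choice of mapping is wrong.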

For every scrap of text that a computer shows to or takes from a human, the computer needs to keep track of the encoding the data is in. Every web browser must know the encoding of the page it is receiving and the encoding that it will be displayed to the user in. If the data sent is in a different format than the one that will be displayed, the computer must convert the text from one encoding to another. Although we don’t notice it, encoding metadata is passed along with almost every webpage we read and every email we send. Data is being converted between encodings millions of times each day. We don’t even notice that text is encoded, until it doesn’t decode properly.
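Conversion between encodings is a decode followed by an encode, and it only works when the declared source encoding is the true one. A minimal sketch, with a hypothetical `convert` helper:

```python
def convert(data, source_encoding, target_encoding):
    """Decode bytes using the declared source encoding, then re-encode them."""
    return data.decode(source_encoding).encode(target_encoding)

shift_jis_bytes = "文字化け".encode("shift_jis")
utf8_bytes = convert(shift_jis_bytes, "shift_jis", "utf-8")

print(shift_jis_bytes != utf8_bytes)  # True: different bytes on the wire
print(utf8_bytes.decode("utf-8"))     # the same text: 文字化け
```

If the metadata names the wrong source encoding, the first step either fails or silently produces mojibake, which the second step then faithfully preserves.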

Mojibake makes this usually invisible process extremely visible and provides an opportunity to understand that our text is coded, and how. Encoding introduces important limitations: it limits our expression to the things that are listed in pre-defined character sets. Until the creation of an encoding called Unicode, one couldn’t mix Japanese and Thai in the same document; while there were encodings for both, there were no character sets that encoded the letters for both. Apparently, in Chinese, there are older, more obscure characters that no computers can encode yet. Computer users simply can’t write these letters on computers. I’ve seen computer users in Ethiopia emailing each other in English because support for Amharic encodings at the time was so poor and uneven! All of these limits, and many more, are part and parcel of our character encoding systems. They become visible only when the usually invisible process of character encoding is thrust into view. Mojibake provides one such opportunity.

January 15, 2008 (updated January 10, 2013)

Creating Kanji

Errors reveal characteristics of the languages we use and the technologies we use to communicate them — everything from scripts and letter forms (which while very fundamental to written communication are technologies nonetheless) to the computer software we use to create and communicate text.

I’ve spent the last few weeks in Japan. In the process, I’ve learned a bit about the Japanese language; no small part of this through errors. Here’s one error that taught me quite a lot. The sentence is shown in Japanese and then followed by a translation into English:

今年から貝が胃に棲み始めました。
This year, a clam started living in my stomach.

Needless to say perhaps, this was an error. It was supposed to say:

今年から海外に住み始めました。
This year, I started living abroad.

When the sentences are translated into romaji (i.e., Japanese written in a Roman script) the similarity becomes much more clear to readers who don’t understand Japanese:

Kotoshikara kaiga ini sumihajimemashita.
Kotoshikara kaigaini sumihajimemashita.

Kotoshikara means “since this year.” Sumihajimemashita means, “has started living.” The word kaigaini means “abroad” or “overseas.” Kaiga ini (two words) means “clam in stomach.” When written phonetically in romaji, the only difference in the two sentences lies in the introduction of a word-break in the middle of “kaigaini.” Written out in Japanese, the sentences are quite different; even without understanding, one can see that more than a few of the characters in the sentences differ.

In English, word spacing plays an essential role in making written language understandable. Japanese, however, is normally written without spaces between words.

This isn’t a problem in Japanese because the Japanese script uses a combination of logograms — called kanji — and phonetic characters — called hiragana and katakana or simply kana — to delimit words and to describe structure. The result, to Japanese readers, is unambiguous. Phonetically and without spaces, the two sentences are identical in either kana or romaji:

ことしからかいがいにすみはじめました。
Kotoshikarakaigainisumihajimemashita.

In purely phonetic form, the sentence is ambiguous. Using kanji, as shown in the opening examples, this ambiguity is removed. While phonetically identical, “kaigaini” (abroad) and “kaiga ini” (clam in stomach) are very different when kanji is used; they are written “海外に” and “貝が胃に” respectively and are not easily confusable by Japanese readers.

This error, and many others like it, stems from the way that Japanese text is input into computers. Because there are more than 4,000 kanji in frequent use in Japan, there simply are not enough keys on a keyboard to input kanji directly. Instead, text in Japanese is input into computers phonetically (i.e., in kana) without spaces or explicit word boundaries. Once the kana is input, users then transform the phonetic representation of their sentence or phrase into a version using the appropriate kanji logograms. To do so, Japanese computer users employ special software that contains a database of mappings of kana to kanji. In the process, this software makes educated guesses about where word boundaries are. Usually, computers guess correctly. When computers get it wrong, users need to go back and tweak the conversion by hand or select from other options in a list. Sometimes, when users are in a rush, they use an incorrect kana to kanji conversion. It would be obvious to any Japanese computer user that this is precisely what happened in the sentence above.
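The word-boundary guessing can be sketched with a toy converter. Real input methods use far larger databases and rank candidates statistically; the five-entry `DICTIONARY` and the `conversions` function below are hypothetical and only enumerate the possible splits of the ambiguous phrase.

```python
# Hypothetical, tiny kana-to-kanji dictionary for the phrase discussed above.
DICTIONARY = {
    "かいがい": "海外",  # kaigai: abroad
    "かい": "貝",        # kai: clam
    "が": "が",          # ga: subject particle
    "い": "胃",          # i: stomach
    "に": "に",          # ni: particle
}

def conversions(kana):
    """Return every way to split the kana into dictionary words, as kanji text."""
    if not kana:
        return [""]
    results = []
    for end in range(1, len(kana) + 1):
        prefix = kana[:end]
        if prefix in DICTIONARY:
            # Each valid prefix is a guess at a word boundary; recurse on the rest.
            for rest in conversions(kana[end:]):
                results.append(DICTIONARY[prefix] + rest)
    return results

print(conversions("かいがいに"))  # ['貝が胃に', '海外に']
```

Both candidates are valid readings of the same keystrokes, so the software has to pick one and the user has to notice when it picks wrong.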

This type of error has few parallels in English but is extremely common in Japanese writing. The effects, like this one, are often confusing or hilarious. For a Japanese reader, this error reveals the kana to kanji mapping system and the computer software that implements it; nobody would make such a mistake with a pen and paper. For a person less familiar with Japanese, the error reveals a number of technical particularities about the Japanese writing system and, in the process, about the ways in which Japanese differs from other languages they might speak.