Medireview

Medireview is a reference to what has become a classic revealing error. The error was noticed in 2001 and 2002 when people started seeing a series of implausibly misspelled words on a wide variety of websites. In particular, website authors were frequently swapping the nonsense word medireview for medieval. Eventually, the errors were traced back to Yahoo: each webpage containing medireview had been sent as an attachment over Yahoo’s free email system.

The explanation of this error shares a lot in common with previous discussions of the difficulty of searching for AppleScript on Apple’s website and my recent description of the term clbuttic. Medireview was caused, yet again, by an overzealous filter. Like the AppleScript error, the filter was an attempt to defeat cross-site scripting. Nefarious users were sending HTML attachments that, when clicked, might run scripts and cause bad things to happen — for example, they might gain access to passwords or data without a user’s permission or knowledge. To protect its users, Yahoo scanned through all HTML attachments and simply replaced any references to “problematic” terms frequently used in cross-site scripting. Yahoo made the following changes to HTML attachments — each line shows a term that can be used to invoke a script and the “safe” synonym it was replaced with:

  • javascript → java-script
  • jscript → j-script
  • vbscript → vb-script
  • livescript → live-script
  • eval → review
  • mocha → espresso
  • expression → statement

This caused problems because, as in the Clbuttic error, Yahoo didn’t check for word boundaries. This meant that in any word containing eval (for example), the offending substring was changed to review. The term evaluate was rendered reviewuate. The term medieval was rendered medireview.
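Here is a minimal sketch, in Python, of the kind of naive substring replacement that produces these mangled words. The substitution table comes from the list above; the function name and everything else are my own illustration, not Yahoo’s actual code:

    # A naive filter that swaps "dangerous" terms for "safe" ones without
    # checking word boundaries -- the same mistake that produced "medireview."
    SUBSTITUTIONS = {
        "javascript": "java-script",
        "jscript": "j-script",
        "vbscript": "vb-script",
        "livescript": "live-script",
        "eval": "review",
        "mocha": "espresso",
        "expression": "statement",
    }

    def naive_filter(text):
        for dangerous, safe in SUBSTITUTIONS.items():
            text = text.replace(dangerous, safe)  # replaces the substring anywhere
        return text

    print(naive_filter("medieval"))  # -> "medireview"
    print(naive_filter("evaluate"))  # -> "reviewuate"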

Of course, neither sender nor receiver knew that their attachments had been changed! Many people emailed webpages or HTML fragments which, complete with errors introduced by Yahoo, were then put online. The Indian newspaper The Hindu published an article referring to “medireview Mughal emperors of India.” Hundreds of others made similar mistakes.

The flawed script was in effect on Yahoo’s email system from at least March 2001 through July 2002, when the error was reported by the BBC, New Scientist, and others.

Like a growing number of errors that I’ve covered here, this error pointed out the presence and power of an often hidden intermediary. The person who controls the technology one uses to write, send, and read email has power over one’s messages. This error forced some users of Yahoo’s system to consider this power and to make a choice about their continued use of the system. Quite a few stopped using Yahoo after this news hit the press. Others turned to technologies like public-key cryptography to help themselves and their correspondents verify the integrity of future messages and to detect accidental or intentional corruption.

Revealing Errors Talks

I’ve recently given two talks on Revealing Errors at LUG Radio Live USA and at Penguicon. I’ll be giving at least two more in the near future: today (June 18, 2008 — sorry for the late notice!) at MIT for Boston Linux Unix and at OSCON in Portland, Oregon on July 25.

There’s a post on Copyrighteous (my personal blog) with more details and links from my talks page to notes and slides.

So far, they’ve been successful and lots of fun. I travel frequently and am interested in doing more. Contact me if you’d be interested in organizing something.

Clbuttic

Revealing errors are often most powerful when they reveal the presence of, or details about, a technology’s designer. One of my favorite clbuttes of revealing errors are those that go one step further and reveal the values of the designers of systems. I’ve touched on these twice before in my post about T9 input systems and when I talked about profanity in wordlists.

Another wonderful example surfaced in this humorous anecdote about what was supposed to be an invisible anti-profanity system that instead filled a website with nonsensical terms like “clbuttic.”

Basically, the script in question tried to look through user input and to swap out instances of profanity with less offensive synonyms. For example, “ass” might become “butt”, “shit” might become “poop” or “feces”, and so on. To work correctly, the script should have looked for instances of profanity between word boundaries — i.e., profanity surrounded on both sides by spaces or punctuation. The script in question did not.

The result was hilarious. Not only was “ass” changed to “butt,” but any word that contained the letters “ass” was transformed as well! The word “classic” was mangled into “clbuttic.”
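Here is a minimal sketch of the difference in Python; the word pair and the function names are illustrative, not the actual script:

    import re

    # Naive approach: replaces the substring anywhere, even inside other words.
    def naive_clean(text):
        return text.replace("ass", "butt")

    # Boundary-aware approach: replaces "ass" only when it stands alone as a word.
    def boundary_clean(text):
        return re.sub(r"\bass\b", "butt", text)

    print(naive_clean("what a classic, you ass"))
    # -> "what a clbuttic, you butt"
    print(boundary_clean("what a classic, you ass"))
    # -> "what a classic, you butt"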

The mistake was an easy one to make. In fact, other programmers made the same mistake and searches for “clbuttic” turn up thousands of instances of the term on dozens of independent websites. Searching around, one can find references to a mbuttive music quiz, a mbuttive multiplayer online game, references to how the average consumer is a pbutterby, a transit pbuttenger executed by Singapore, Fermin Toro Jimenez (Ambbuttador of Venezuela), the correct way to deal with an buttailant armed with a banana, and much, much more.

You can even find a reference to how Hinckley tried to buttbuttinate Ronald Reagan!

Each error reveals the presence of an anti-profanity script; obviously, no human would accidentally misspell or mistake the words in question in any other situation! In each case, the existence of a designer and an often hidden intermediary is revealed. What’s perhaps more shocking than this error is the fact that most programmers won’t make this mistake when implementing similar systems. On thousands of websites, our posts and messages and interactions are “cleaned-up” and edited without our consent or knowledge. As a matter of routine, our words are silently and invisibly changed by these systems. Few of us, and even fewer of our readers, ever know the difference. While switching “ass” to “butt” may be harmless enough, it’s a stark reminder of the power that technology gives the designers of technical systems to force their own values on their users and to frame — and perhaps to substantively change — the messages that their technologies communicate.

Okami Watermark

The blog Photoshop Disasters recently wrote a story about a small fiasco regarding cover art for the popular video game Okami.

Okami was originally released for the Sony PlayStation 2 (PS2) in 2006. The developer of the game, Clover Studios, closed up shop several months later. Here is the cover art for the PS2 game, which is indicative of the game’s unique sumi-e-inspired art.

Original Okami Cover

Despite Clover’s failure, Okami won many awards and was a commercial success. It was ported (i.e., made to run on a different platform) to the Nintendo Wii by a video game production house called Ready at Dawn and by the PS2 version’s distributor, Capcom. The Wii version was released in April 2008. Here is the cover art for that version:

Okami Cover for Wii version

People looking closely at the cover of the Wii game noticed something strange right near the wolf’s mouth. Here’s a highlight with the area circled.

Watermark highlight

The blurry symbol near Okami’s mouth was a watermark — an artifact intentionally added to an image to denote the source of the picture and often to prevent others from taking undue credit. In fact, it was the logo for IGN — a very large video game website and portal. As part of writing reviews, IGN frequently takes screenshots of games, watermarks them, and posts them on their website.

Sure enough, a little bit of digging on the IGN website revealed this high-resolution image from the cover, complete with IGN watermark in the appropriate place. Apparently, a designer working for Capcom had found it easier to use the images posted by IGN than to go and get the original art from the game itself.

Source image from the IGN website

This error revealed quite a bit about the process and constraints that the cover designers for the Wii version were working under. Rather than getting original source images — which Capcom presumably owned — they found it easier to take them from an Internet-available source. Through the error, the usually invisible process, people, and technologies involved in this type of artwork preparation were revealed.

Embarrassed by the whole affair, Capcom offered to replace the covers with non-watermarked ones — free of charge.

Sıkısınca

Last week saw the popularization of some older news about a misunderstanding, prompted by an error caused by technological limitations of mobile phones, that resulted in two deaths and three imprisonments. The whole sad story took place in Turkey. You can read the original story in the Turkish language Hürriyet.

Basically, Emine and her husband Ramazan Çalçoban had recently separated and were feuding daily on their mobile phones and over SMS text messages. At one point, Ramazan sent a message saying, “you change the subject every time you get backed into a corner.” The word for “backed into a corner” is sıkısınca. Notice the lack of dots on the i’s in the word. The very similar sikisince — spelled with dots — means “getting fucked.” Ramazan’s mobile phone could not produce the “closed” dotless ı, so he wrote the word with dots and sent it anyway. Reading quickly, Emine misinterpreted the message, thinking that Ramazan was saying, “you change the subject every time they are fucking you.” Emine showed the message to her father and sisters who, outraged that Ramazan was calling Emine a whore, attacked Ramazan with knives when he showed up at the house later. In the fight, Ramazan stabbed Emine back, and she later bled to death. Ramazan committed suicide in jail and Emine’s father and sisters were all arrested.

This is certainly the gravest example of a revealing error I’ve looked at yet and it stands as an example of the degree to which tiny technological constraints can have profound unanticipated consequences. In this case, the lack of technological support for characters used in Turkish resulted in the creation of text that was deeply, even fatally, ambiguous.

Of course, many messages sent with SMS, email, or chat systems are ambiguous. Emoticons are an example of a tool that society has created to disambiguate phrases in text-based chatting and their popularity can tell us a lot about what purely word-based chatting fails to convey easily. For example, a particular emoticon might be employed to help convey sarcasm in a statement that would have been obvious through tone of voice. One can think of verbal communication as happening over many channels (e.g., voice, facial expressions, posture, words, etc). Text-based communication technologies provide certain new channels that may be good at conveying certain types of messages but not others. Emoticons, and accents or diacritical marks for that matter, are an attempt to concisely provide missing information that might be taken for granted in spoken conversation.

Any communication technology conveys certain things better than others. Each provides a set of channels that convey some types of messages but not others. The result of a shift to a new technology is often lost channels and an increase in ambiguity.

In spoken Turkish, the open and closed i sounds are easily distinguishable. In written communication, however, things become more difficult. Some writing systems are better at conveying these differences than others. Written Hebrew, for example, historically contained no vowels at all! And yet the consequences of not conveying these differences can be profound. As a result, Turkish speakers frequently use diacritics and the open and closed i notation to disambiguate phrases like the one at the center of this saga. Unfortunately, that notation is not always available to communicators. Notably, it was not available on Ramazan’s mobile phone.
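In character-encoding terms, the dotted and dotless lowercase i are two entirely distinct letters. Here is a minimal Python sketch of that distinction; it only illustrates the two characters and says nothing about the actual character set of Ramazan’s phone:

    import unicodedata

    # The dotted and dotless lowercase i have different code points and names.
    for ch in "iı":
        print(hex(ord(ch)), unicodedata.name(ch))
    # -> 0x69 LATIN SMALL LETTER I
    # -> 0x131 LATIN SMALL LETTER DOTLESS I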

People in Turkey have ways of coping with the lack of accents and diacritical marks. For example, some people choose to write sıkısınca as SIKISINCA because the capital I in the Roman alphabet has no dot. Emoticons are similar in that they were created by users to work around a system’s limited ability to convey certain messages and to disambiguate others. In these ways and others, users of technologies find creative ways of working with and around the limitations and affordances imposed on them.

With time, though, the users of emoticons and all-caps Turkish words stop seeing and thinking about the limitations that these tactics expose in their technology. In fact, it is only through errors that these limitations become visible again. While we cannot undo the damage done to Ramazan, Emine, and her family, we can “learn from their errors” and reflect on the ways that the limits imposed by our communication technologies frame and affect our communications and our lives.

Interpolation

One set of errors that almost everyone has seen — even if they don’t know it — involves the failure of a very common process in computer programming called interpolation. While they look quite different, both of the following errors — each taken from the Daily WTF’s Error’d series — represent errors whose source would be obvious to most computer programmers.

You Saved a total of {@Total-Tkt-Discount} off list prices.

The term interpolation, of course, is not unique to programmers. It is a much older term that was historically used to describe errors in hand-copied documents. Interpolation in a manuscript refers to text not written by an original author that was inserted over time — either through nefarious adulteration or just by accident. As texts were copied by hand, this type of error ended up happening quite frequently! In its article on manuscript interpolation, Wikipedia describes one way that these errors occurred:

If a scribe made an error when copying a text and omitted some lines, he would have tended to include the omitted material in the margin. However, margin notes made by readers are present in almost all manuscripts. Therefore a different scribe seeking to produce a copy of the manuscript perhaps many years later could find it very difficult to determine whether a margin note was an omission made by the previous scribe (which should be included in the text), or simply a note made by a reader (which should be ignored or kept in the margin).

But while manuscript interpolation described a type of error, interpolation in computer programming refers to a type of text swapping that is fully intentional.

Computer interpolation happens when computers create customized and contextualized messages — and they do so constantly. Whereas a newspaper or a book will be the same for each of its readers, computers create custom pages designed for each user — you see these all the time as most messages that computers print are, in some way, dynamic. In many cases, these dynamic messages are created through a process called string or variable interpolation. For those who are unfamiliar with the process, an explanation of the errors above can reveal the details.

In the first example, the receipt read (emphasis mine):

You Saved a total of {@Total-Tkt-Discount} off list prices.

In fact, the computer is supposed to swap out the phrase {@Total-Tkt-Discount} for the value of a variable called Total-Tkt-Discount. The {@SOMETHING} syntax is one programming language’s way of telling the computer, “take the variable called SOMETHING and use its value in this string in place of everything between (and including) the curly braces.” Of course, something didn’t quite work right and the unprocessed — or uninterpolated — text was spit out instead. With this error, the computer program that is supposed to be computing our ticket price was revealed. Additionally, we have a glimpse into the program, its variable names, and even its programming language.
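Here is a minimal sketch of the idea in Python; the template syntax and variable name below merely mimic the receipt above, and the language and names used by the real system are unknown:

    # The template the programmer writes, with a placeholder for the discount.
    template = "You Saved a total of {total_tkt_discount} off list prices."

    # When interpolation runs, the placeholder is swapped for the variable's value.
    print(template.format(total_tkt_discount="$12.50"))
    # -> "You Saved a total of $12.50 off list prices."

    # When the interpolation step never runs, the customer sees the raw template.
    print(template)
    # -> "You Saved a total of {total_tkt_discount} off list prices."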

The second error from a (not very helpful) dialog box in Mozilla Firefox is a more complicated but fundamentally similar example (emphasis mine):

The file “#3” is of type #2 (#1), and #4 does not know how to handle this file type.

The numbers, in this case, reflect a series of variables. The dialog is supposed to be passed a list of values including the file name (#3), the file type (#2 and #1), and the name of the program that is trying to open it (#4). These values are supposed to be swapped in for the placeholders — interpolated — before any user sees the message. Again, something went wrong here and a user was presented with the empty template that only the programmer and the program are ever supposed to see.
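The same sketch works for numbered placeholders. Here is a hedged Python illustration; Firefox’s actual localization machinery works differently, and the helper function and sample values below are my own:

    # A template with positional placeholders like the Firefox dialog above.
    template = ('The file "#3" is of type #2 (#1), and #4 does not know how to '
                'handle this file type.')

    values = {"#1": "PDF", "#2": "application/pdf", "#3": "report.pdf", "#4": "Firefox"}

    def interpolate(text, values):
        # Swap each numbered placeholder for its value; longer keys first so
        # that a hypothetical "#12" would never be clobbered by "#1".
        for key in sorted(values, key=len, reverse=True):
            text = text.replace(key, values[key])
        return text

    print(interpolate(template, values))
    # -> 'The file "report.pdf" is of type application/pdf (PDF), and Firefox
    #     does not know how to handle this file type.'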

Nearly every message a computer or a computerized system presents to us is processed and interpolated in this way. In this sense, computer programs act as powerful intermediaries processing and displaying data. Perhaps more importantly, interpolation reveals just how limited computers’ expression really is. These messages are no more complicated than simple fill-in-the-blank forms. Simple as they may be, they are entirely typical of the way that computers communicate with us.

From a user’s perspective, it’s easy to imagine sophisticated systems creating and presenting highly dynamic messages to us — or to simply not think about it at all. In reality, few computer programs’ ability to communicate with us is more sophisticated than a game of Mad Libs. The simplicity of these systems, the limitations that they impose on what computers can and can’t say, and the limitations they place on what we can and can’t say with computers, are revealed through these simple, common interpolation errors. To understand all of this, we need only recognize these errors and reflect on what they might reveal.

The Cupertino Effect

I recently wrote about spellcheckers and profanity. Of course, spellcheckers are the site of many other notable revealing errors.

One well-known class of errors is called the Cupertino Effect. The effect is named after an error caused by the fact that some early spellchecker wordlists contained the hyphenated co-operation but not cooperation (both are correct, though the former is less common). Because spellchecking algorithms treat hyphenated words as separate words, several spellcheckers would suggest Cupertino as a substitute for the “misspelled” cooperation. Because it was the lone suggestion, some people hastily “corrected” cooperation to Cupertino. The weblog Language Log noticed that quite a few people made the mistake in official documents from the UN, EU, NATO and more! These included the following examples found in real documents:

Within the GEIT BG the Cupertino with our Italian comrades proved to be very fruitful. (NATO Stabilisation Force, “Atlas raises the world,” 14 May 2003)

Could you tell us how far such policy can go under the euro zone, and specifically where the limits of this Cupertino would be? (European Central Bank press conference, 3 Nov. 1998)

While Language Log authors were incredulous about the idea that there might be spellchecking dictionaries that contain the word Cupertino but not the unhyphenated cooperation, a reader sent in this screenshot from Microsoft Outlook Express circa 1996, using a Microsoft word list from the Houghton Mifflin Company. Sure enough, they’d found the culprit.

Cupertino spellchecker screenshot.

Of course, the Cupertino effect is by no means limited to the word cooperation. The Oxford University Press also points out how the Cupertino Effect can rear its head when foreign words and proper nouns are involved. This led to Reuters referring to Pakistan’s Muttahida Quami Movement as the Muttonhead Quail Movement and to the Rocky Mountain News naming Leucadia National as La-De-Da National instead. To top that off, Language Log found examples of confusion that led to discussion of copulation, which make Cupertino look entirely excusable:

The Western Balkan countries confirmed their intention to further liberalise trade amongst each other. They requested that they be included in the pan-european system of diagonal copulation, which would benefit trade and economic development. (International Organization for Migration, Foreign Ministers Meeting, 22 Nov. 2004)

Of course, the Cupertino Effect is possible any time a spellchecking correction is suggested and the top result is incorrect. As a result, many common misspellings open the door to humorous errors. In a follow-up post, Language Log pointed out that if one leaves the “i” off “identified,” Microsoft Word 97 will give exactly one suggestion: denitrified, which describes the state of having had nitrogen removed. That has led newspapers to report that “Police denitrified the youths and seized the paintball guns.” Which seems unlikely. Similarly, if you leave the “c” out of acquainted, spellcheckers frequently suggest aquatinted as a substitute. As the Oxford University Press blog pointed out, folks who want to get aquatinted do not often want to be etched with nitric acid!
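Here is a minimal sketch of how a lone, wrong suggestion can emerge, using Python’s difflib as a stand-in for a real spellchecker’s suggestion algorithm; the tiny wordlist is invented for illustration and everything is lowercased to keep the comparison simple:

    import difflib

    # A toy dictionary standing in for a wordlist that, like the one described
    # above, contains the proper noun but not the unhyphenated "cooperation".
    wordlist = ["cupertino", "computer", "california"]

    word = "cooperation"
    if word not in wordlist:
        # difflib's similarity measure is only a stand-in for a real
        # spellchecker's algorithm, but it shows how the only "close" word
        # on offer can be an unrelated proper noun.
        print(difflib.get_close_matches(word, wordlist, n=1))
        # -> ['cupertino']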

You can find parallels to the Cupertino effect in the Bucklame Effect I discussed previously. Many of the take-away lessons are the same. Spellcheckers make it easier to say some things correctly and place an additional cost on others. The effect on our communication may be subtle but it’s real. For example, a spelling mistake might be less forgivable in an era of spellcheckers. Like many communication technologies, spellcheckers are normally invisible in the documents they help create; nobody is reminded of a spellchecker by a perfectly spelled document. It is only through errors like the Cupertino effect that spellcheckers are revealed.

Further, these nonsensical suggestions are made only because of the particular way that spellcheckers are built. Microsoft’s Natural Language team is apparently working on “contextual” spellcheckers that will be smart enough to guess that you probably don’t mean “Cupertino” when you type cooperation. Of course, other errors will remain and new ones will be introduced.

Mojibake

One of my favorite Japanese words is mojibake (文字化け), which literally translates as “character changing.” The term is used to describe an error experienced frequently by computer users who read and write non-Latin scripts — like Japanese. When readers of non-Latin scripts open a document, email, web page, or some other text, the text is sometimes displayed mangled and unreadable. Japanese speakers refer to the resulting garbage as “mojibake.” Here’s a great example from the mojibake article on Wikipedia (the image is supposed to be in Japanese and to display the Mojibake article itself).

The UTF-8-encoded Japanese Wikipedia article for mojibake, as displayed in the Windows-1252 ('ISO-8859-1') encoding.

The problem has been so widespread for Japanese that webpages would often place small images reading “mojibake” in the top corners of pages. If a user cannot read the content on the page, the image links to pages which will try to fix the problem for the user.

From a more technical perspective, mojibake might be better described as, “incorrect character decoding,” and it hints at a largely hidden part of the way our computers handle text that we usually take for granted.

Of course, computers don’t understand Latin or Japanese characters. Instead they operate on bits and bytes — ones and zeros that represent numbers. In order to input or output text, computer scientists created mappings of letters and characters to numbers represented by bits and bytes. These mappings end up forming a sequence of characters or letters in a particular order, often called a character set. To display two letters, a computer might ask for the fifth and tenth characters from a particular set. These character sets are codes; they map numbers (i.e., positions in the list) to letters just as Morse code maps dots and dashes to letters. Letters can be converted to numbers by a computer for storage and then converted back to be redisplayed. The process is called character encoding and decoding and it happens every time a computer inputs or outputs text.

While there may be some natural orderings (e.g., A through Z), there are many ways to encode or map a set of letters and numbers (e.g., should one put numbers before letters in the set? Should capital and lowercase letters be interspersed?). The most important computer character encoding is ASCII, which was first defined in 1963 and is the de facto standard for almost all modern computers. It defines 128 characters including the letters and numbers used in English. But ASCII says nothing about how one should encode accented characters in Latin scripts, scientific symbols, or the characters in any other script — they are simply not in the list of letters and numbers ASCII provides and no mapping is available. Users of ASCII can only use the characters in the set.
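A minimal sketch of this mapping in Python, using ASCII numbering for these particular characters (a simplified illustration rather than a full account of any encoding):

    # In ASCII (and in Unicode, which matches it for these characters),
    # "A" is character number 65 and "z" is character number 122.
    print(ord("A"), chr(65))     # -> 65 A
    print(ord("z"))              # -> 122

    # Text is stored as numbers and decoded back into letters.
    data = "Hello".encode("ascii")
    print(list(data))            # -> [72, 101, 108, 108, 111]
    print(data.decode("ascii"))  # -> "Hello"

    # Characters outside ASCII's 128-character set simply cannot be encoded;
    # "文".encode("ascii") raises a UnicodeEncodeError.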

Left with computers unable to represent their languages, many non-English speakers have added to and improved on ASCII to create new encodings — different mappings of bits and bytes to different sets of letters. Japanese text can frequently be found in encodings with obscure technical names like EUC-JP, ISO-2022-JP, Shift_JIS, and UTF-8. It’s not important to understand how they differ — although I’ll come back to this in a future blog post. It’s merely important to realize that each of these represents a different way to map a set of bits and bytes onto letters, numbers, and punctuation.

For example, the set of bytes that says “文字化け” (the word for “mojibake” in Japanese) encoded in UTF-8 would show up as “��絖�����” in EUC-JP, “������������” in ISO-2022-JP, and “文字化け” in ISO-8859-1. Each of the strings above is a valid decoding of identical data — the same ones and zeros. But of course, only the first is correct and comprehensible by a human. Although the others are displaying the same data, the data is unreadable by humans because it is decoded according to a different character set’s mapping! This is mojibake.
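This kind of mismatch is easy to reproduce. Here is a minimal Python sketch; the exact garbage produced depends on the decoder and on how it handles undefined bytes:

    # Encode the Japanese word for mojibake as UTF-8 bytes...
    data = "文字化け".encode("utf-8")

    # ...then decode those same bytes correctly and incorrectly.
    print(data.decode("utf-8"))
    # -> 文字化け (the correct reading)
    print(data.decode("windows-1252", errors="replace"))
    # -> something like "æ–‡å­—åŒ–ã‘" -- the same ones and zeros, read with
    #    the wrong mapping, come out as unreadable garbage.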

For every scrap of text that a computer shows to or takes from a human, the computer needs to keep track of the encoding the data is in. Every web browser must know the encoding of the page it is receiving and the encoding in which it will be displayed to the user. If the data sent is in a different format than the one that will be displayed, the computer must convert the text from one encoding to another. Although we don’t notice it, encoding metadata is passed along with almost every webpage we read and every email we send. Data is converted between encodings millions of times each day. We don’t even notice that text is encoded — until it doesn’t decode properly.

Mojibake makes this usually invisible process extremely visible and provides an opportunity to understand that our text is coded — and how. Encoding introduces important limitations — it limits our expression to the things that are listed in pre-defined character sets. Until the creation of an encoding called Unicode, one couldn’t mix Japanese and Thai in the same document; while there were encodings for each, there was no character set that encoded the letters of both. Apparently, there are older, more obscure Chinese characters that no computer can encode yet. Computer users simply can’t write these characters on computers. I’ve seen computer users in Ethiopia emailing each other in English because support for Amharic encodings at the time was so poor and uneven! All of these limits, and many more, are part and parcel of our character encoding systems. They become visible only when the usually invisible process of character encoding is thrust into view. Mojibake provides one such opportunity.

Bad Signs

I caught another revealing crash screen over on The Daily WTF.

Travelex Crash Screen

Although the folks at WTF did not draw attention to the fact, a close examination revealed that the dialog box on the crashed screen is rotated 90 degrees.

If you step back and look at the sign, it makes sense. The folks at Travelex wanted a tall poster-sized electronic bulletin board to display currency information and promotions. Unfortunately, long screens are rare and LCD screens of unusual sizes are extremely expensive. Travelex appears to have done the very sensible thing of taking a readily available and low-cost wide-screen LCD television, turning it on its side, and hooking it up to a computer.

Of course, screens have tops and bottoms. To display correctly on a sideways screen, a computer needs to be configured to display information sideways — a non-trivial task on many systems. If you look at the Windows “Start” menu and taskbar along the right side (i.e., the bottom) of the screen and at the shape of the dialog, it seems that Travelex simply didn’t bother. They used the screen to display images, or sequences of images, and found it easy enough to simply rotate each of the images to be displayed by 90 degrees as well. They simply showed a full-screen slide-show of sideways images on their sideways screen. And no user ever noticed until the system crashed.

It’s a neat trick that many users might find useful but most would not think to do. Although they might after seeing this crash!

A close-up of the screen reveals even more.

Travelex Crash Screen Closeup

Apparently, the dialog has popped up because the computer running the sign has a virus! Viruses are usually acquired through user interaction with a computer (e.g., opening a bad attachment) or through the Internet. It seems likely that the computer is plugged into the Internet — perhaps the slide-show is updated automatically — or that the image is being displayed from a computer used to do other things. In any case, it’s a worrying “sign” from a financial services company.

Picture of a Process

I enjoyed seeing this image in an article in The Register.

finger shown in Google book

The picture is a screenshot from Google Books viewing a page from an 1855 issue of The Gentleman’s Magazine. The latex-clad fingers belong to one of the people whose job it is to scan the books for Google’s book project.

Information technologies often hide the processes that bring us the information we interact with. Revealing errors give a picture of what these processes look like or involve. In an extremely literal way, this error shows us just such a picture.

We can learn quite a lot from this image. For example, since the fingers are not pressed against glass, we might conclude that Google is not using a traditional flatbed scanner. Instead, it is likely that they are using a system similar to the one that the Internet Archive has built, which is designed specifically for scanning books.

But perhaps the most important thing that this error reveals is something we know, but often take for granted — the human involved in the process.

The decision on where to automate a process, and where to leave it up to a human, is sometimes a very complicated one. Human involvement in a process can prevent and catch many types of errors but can cause new ones. Both choices introduce risks and benefits. For example, keeping a human in a bank’s transaction process makes it possible to catch obvious errors and to detect suspicious use that a computer without “common sense” might miss. On the other hand, a human banker might commit fraud to try to enrich themselves with others’ money — something a machine would never do.

In our interactions with technological systems, we rarely reflect on the fact, and the ways, that the presence of humans helps determine a technology’s behavior, quality, and reliability, and the nature and degree of trust that we place in it.

In our interactions with complex processes through simple and abstract user interfaces, it is often only through errors — distinctly human errors, if not usually quite as clearly human as this one — that information workers’ important presence is revealed.