Mojibake

One of my favorite Japanese words is mojibake (文字化け), which literally translates as “character changing.” The term describes an error experienced frequently by computer users who read and write non-Latin scripts, like Japanese. When readers of non-Latin scripts open a document, email, web page, or some other text, it is sometimes displayed mangled and unreadable. Japanese speakers refer to the resulting garbage as “mojibake.” Here’s a great example from the mojibake article in Wikipedia (the image is supposed to be in Japanese and to display the Mojibake article itself).

The UTF-8-encoded Japanese Wikipedia article for mojibake, as displayed in the Windows-1252 ('ISO-8859-1') encoding.

The problem has been so widespread in Japan that webpages would often place small images in their top corners that say “mojibake.” If a user cannot read the content of a page, the image links to pages that will try to fix the problem for the user.

From a more technical perspective, mojibake might be better described as “incorrect character decoding,” and it hints at a largely hidden part of the way our computers handle text that we usually take for granted.

Of course, computers don’t understand Latin or Japanese characters. Instead, they operate on bits and bytes: ones and zeros that represent numbers. In order to input or output text, computer scientists created mappings of letters and characters to numbers represented by bits and bytes. These mappings form a sequence of characters or letters in a particular order, often called a character set. To display two letters, a computer might ask for the fifth and tenth characters from a particular set. These character sets are codes; they map numbers (i.e., positions in the list) to letters just as Morse code maps dots and dashes to letters. Letters can be converted to numbers by a computer for storage and then converted back to be redisplayed. This process is called character encoding and decoding, and it happens every time a computer inputs or outputs text.
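Here’s a minimal sketch, in Python, of this mapping at work:

```python
# Characters are stored as numbers. ord() and chr() expose the mapping
# between a character and its position in the character set.
print(ord("A"))   # 65: the letter "A" is stored as the number 65
print(chr(65))    # "A": the number 65 maps back to the letter

# Encoding turns text into bytes for storage or transmission;
# decoding turns the same bytes back into text.
data = "hello".encode("ascii")   # b'hello': five bytes, one per letter
print(data.decode("ascii"))      # hello
```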

While there may be some natural orderings (e.g., A through Z), there are many ways to encode or map a set of letters and numbers (e.g., should one put numbers before letters in the set? Should capital and lowercase letters be interspersed?). The most important computer character encoding is ASCII, which was first defined in 1963 and is the de facto standard for almost all modern computers. It defines 128 characters, including the letters and numbers used in English. But ASCII says nothing about how one should encode accented characters in Latin scripts, scientific symbols, or the characters of any other script; they are simply not in the list of letters and numbers ASCII provides, and no mapping is available. Users of ASCII can only use the characters in the set.
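You can run into this limit directly. In Python, a character outside ASCII’s 128-item list simply cannot be encoded:

```python
print("cafe".encode("ascii"))    # b'cafe': every character is in the set

try:
    "café".encode("ascii")       # "é" has no position in ASCII's list
except UnicodeEncodeError as err:
    print(err)                   # 'ascii' codec can't encode character ...
```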

Left with computers unable to represent their languages, many non-English speakers have added to and improved on ASCII to create new encodings: different mappings of bits and bytes to different sets of letters. Japanese text can frequently be found in encodings with obscure technical names like EUC-JP, ISO-2022-JP, Shift_JIS, and UTF-8. It’s not important to understand how they differ (although I’ll come back to this in a future blog post). It’s merely important to realize that each of these represents a different way to map a set of bits and bytes into letters, numbers, and punctuation.

For example, the set of bytes that says “文字化け” (the word for “mojibake” in Japanese) when encoded in UTF-8 would show up as “��絖�����” in EUC-JP, “������������” in ISO-2022-JP, and “文字化け” in ISO-8859-1. Each of the strings above is a valid decoding of identical data: the same ones and zeros. But of course, only the first is correct and comprehensible by a human. Although the others display the same data, that data is unreadable by humans because it is decoded according to a different character set’s mapping! This is mojibake.
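You can reproduce mojibake yourself. Here’s a minimal sketch in Python that decodes one set of UTF-8 bytes with three different mappings (the exact garbage you see will vary with the fonts and error handling involved):

```python
text = "文字化け"                # "mojibake," written in Japanese
data = text.encode("utf-8")      # the same ones and zeros in every case

print(data.decode("utf-8"))                     # 文字化け: the correct decoding
print(data.decode("latin-1"))                   # mangled Latin-script garbage
print(data.decode("euc_jp", errors="replace"))  # invalid bytes become "�"
```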

For every scrap of text that a computer shows to or takes from a human, the computer needs to keep track of the encoding the data is in. Every web browser must know the encoding of the page it is receiving and the encoding in which it will be displayed to the user. If the data arrives in a different encoding than the one it will be displayed in, the computer must convert the text from one encoding to another. Although we don’t notice it, encoding metadata is passed along with almost every webpage we read and every email we send, and data is converted between encodings millions of times each day. We don’t even notice that text is encoded until it doesn’t decode properly.
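Here’s a rough sketch, again in Python, of the conversion a browser or mail client performs; the Shift_JIS charset here is just one example of a declared encoding:

```python
# Bytes arrive tagged with a charset, e.g. in the HTTP header
# "Content-Type: text/html; charset=Shift_JIS". The receiving program
# decodes with the declared mapping, then re-encodes for the local system.
incoming = "こんにちは".encode("shift_jis")  # the sender's bytes, in Shift_JIS
text = incoming.decode("shift_jis")          # decode using the declared charset
outgoing = text.encode("utf-8")              # re-encode for a UTF-8 system
print(text)                                  # こんにちは: readable end to end
```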

Mojibake makes this usually invisible process extremely visible and provides an opportunity to understand that our text is coded, and how. Encoding introduces important limitations: it limits our expression to the things listed in pre-defined character sets. Until the creation of an encoding called Unicode, one couldn’t mix Japanese and Thai in the same document; while there were encodings for each, there was no character set that encoded the letters of both. Apparently, there are older, more obscure Chinese characters that no computer can encode yet; computer users simply can’t write these letters on computers. I’ve seen computer users in Ethiopia emailing each other in English because support for Amharic encodings at the time was so poor and uneven! All of these limits, and many more, are part and parcel of our character encoding systems. They become visible only when the usually invisible process of character encoding is thrust into view. Mojibake provides one such opportunity.
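A short sketch of the difference Unicode makes (the strings here are arbitrary examples):

```python
# Unicode gives every character, across scripts, a place in a single set,
# so one string can mix writing systems that older encodings kept apart.
mixed = "日本語 and ไทย"                      # Japanese, Latin, and Thai together
print(mixed.encode("utf-8").decode("utf-8"))  # round-trips cleanly

# A Japanese-only encoding has no Thai characters in its set, so
# mixed.encode("shift_jis") raises UnicodeEncodeError.
```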

Bad Signs

I caught another revealing crash screen over on The Daily WTF.

Travelex Crash Screen

Although the folks at WTF did not draw attention to the fact, a close examination reveals that the dialog box on the crashed screen is rotated 90 degrees.

If you step back and look at the sign, it makes sense. The folks at Travelex wanted a tall, poster-sized electronic bulletin board to display currency information and promotions. Unfortunately, tall screens are rare, and LCD screens of unusual sizes are extremely expensive. Travelex appears to have done the very sensible thing of taking a readily available and low-cost wide-screen LCD television, turning it on its side, and hooking it up to a computer.

Of course, screens have tops and bottoms. To display correctly on a sideways screen, a computer needs to be configured to display information sideways, a non-trivial task on many systems. If you look at the Windows “Start” menu and taskbar along the right side (i.e., the bottom) of the screen and at the shape of the dialog, it seems that Travelex simply didn’t bother. They used the screen to display images, or sequences of images, and found it easy enough to simply rotate each of the images 90 degrees as well. They showed a full-screen slide-show of sideways images on their sideways screen. And no user ever noticed until the system crashed.

It’s a neat trick that many users might find useful but most would not think to do. Although they might after seeing this crash!

A close-up of the screen reveals even more.

Travelex Crash Screen Closeup

Apparently, the dialog has popped up because the computer running the sign has a virus! Viruses are usually acquired through user interaction with a computer (e.g., opening a bad attachment) or through the Internet. It seems likely that the computer is connected to the Internet (perhaps the slide-show is updated automatically) or that the image is being displayed from a computer used for other things as well. In any case, it’s a worrying “sign” from a financial services company.

Picture of a Process

I enjoyed seeing this image in an article in The Register.

finger shown in Google book

The picture is a screenshot from Google Books showing a page from an 1855 issue of The Gentleman’s Magazine. The latex-clad fingers belong to one of the people whose job it is to scan books for Google’s book project.

Information technologies often hide the processes that bring us the information we interact with. Revealing errors give a picture of what these processes look like or involve. In an extremely literal way, this error shows us just such a picture.

We can learn quite a lot from this image. For example, since the fingers are not pressed against glass, we might conclude that Google is not using a traditional flatbed scanner. Instead, it is likely that they are using a system similar to the one the Internet Archive has built, designed specifically for scanning books.

But perhaps the most important thing that this error reveals is something we know, but often take for granted — the human involved in the process.

The decision of where to automate a process, and where to leave it up to a human, is sometimes a very complicated one. Human involvement in a process can prevent and catch many types of errors but can cause new ones. Both choices introduce risks and benefits. For example, a human overseeing an automated bank transaction system may catch obvious errors and detect suspicious use that a computer without “common sense” would miss. On the other hand, a human banker might commit fraud to try to enrich themselves with others’ money, something a machine would never do.

In our interactions with technological systems, we rarely reflect on the fact, and the ways, that the presence of humans in these systems is important in determining the behavior, quality, and reliability of a technology, and the nature and degree of trust we place in it.

In our interactions with complex processes through simple and abstract user interfaces, it is often only through errors — distinctly human errors, if not usually quite as clearly human as this one — that information workers’ important presence is revealed.