Wordlists and Profanity

Revealing errors show us that a technology’s failure to deliver a message can tell us a lot. In this way, there’s an intriguing analogy to be drawn between revealing errors and censorship.

Censorship doesn’t usually keep people from saying or writing something — it just keeps them from communicating it. When censorship is effective, however, an audience doesn’t realize that any speech ever occurred or that any censorship has happened — they simply don’t know something and, perhaps more importantly, don’t know that they don’t know. As with invisible technologies, a censored community might never realize that their information and interactions with the world are being shaped by someone else’s design.

I was once in a cafe with a large SMS/text message “board.” Patrons could send an SMS to a particular number and it would be displayed on a flat-panel television mounted on the wall where everyone in the restaurant could read it. I tested to see if there was a content filter and, sure enough, any message that contained a four-letter word was silently dropped; it simply never showed up on the screen. As the censored party, I could tell from my message’s failure to appear on the board that there was a censor. Further testing, and my success in posting messages with creatively spelled profanity, numbers in place of letters, and crude ASCII drawings, revealed the censor to be a piece of software with a blacklist of terms; no human charged with blocking profanity would have allowed “sh1t” through. Through the whole process, the other patrons in the cafe remained none the wiser; they never realized that the blocked messages had been sent.
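A censor like this one can be sketched in a few lines. What follows is a hypothetical reconstruction in Python (the cafe’s actual blacklist and matching rules are unknown to me), but it shows why a creative spelling like “sh1t” sails straight through an exact-match wordlist:

    # A minimal sketch of a wordlist-based message filter (hypothetical;
    # the cafe's real blacklist and matching rules are unknown).
    BLACKLIST = {"shit", "fuck"}  # assumed entries

    def allow_message(message):
        """Silently drop any message containing a blacklisted word."""
        words = (word.strip(".,!?") for word in message.lower().split())
        return not any(word in BLACKLIST for word in words)

    print(allow_message("well, shit"))  # False: silently dropped
    print(allow_message("well, sh1t"))  # True: the digit defeats the exact match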

This desire to create barriers to profanity is widespread in communication technologies. For example, consider the number of times you have been prompted by a spellchecker to review and “fix” a swear word. Offensive as they may be, “fuck” and “shit” are correctly spelled English words. It seems highly unlikely that they were excluded from the spell-checker’s wordlist because the compiler forgot them. They were excluded, quite simply, because they were deemed obscene or inappropriate. While intentional, these words’ omission results in the false identification of all cursing as misspelling — errors we’ve grown so accustomed to that they hardly seem like errors at all!
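The mechanics behind this are simple: a spell-checker has no notion of “correct” beyond membership in its wordlist, so a deliberately omitted word is flagged exactly as a typo would be. A minimal sketch, using a tiny invented wordlist:

    # Why omission reads as misspelling: a spell-checker only knows
    # "correct" words by their presence in its list (tiny invented list).
    WORDLIST = {"ship", "shin", "shot"}

    def check(word):
        return "ok" if word.lower() in WORDLIST else "misspelled?"

    print(check("ship"))  # ok
    print(check("shit"))  # misspelled? -- correctly spelled, deliberately absent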

Now, unlike a book or website that impressionable children might read, a spell-checking wordlist is not something anyone can be expected to sit down and read; nobody will stumble across a four-letter word there. These words are excluded simply because spell-checker makers think we shouldn’t use them. The result is that every user who writes a four-letter word must add that word, by hand, to their “personal” dictionary — they must take explicit credit for using the term. The hope, perhaps, is that we’ll be reminded to use a different, more acceptable word. Every time this happens, the paternalism of the wordlist compiler is revealed.

Connecting back to my recent post on predictive text, here’s a very funny video of Armstrong and Miller lampooning the omission of four-letter words from predictive text databases, an omission that makes it more difficult to input profanity on mobile phones (e.g., are you sure you did not mean “shiv” and “ducking”?). You can also download the video in OGG Theora if you have trouble watching it in Flash.

There’s a great line in there: “Our job … is to offer people not the words that they do use but the words that they should use.”

Most of the errors described on this blog reveal the design of technical systems. While the errors in this case do not stem from technical decisions, they reveal a set of equally human choices. Perhaps more interestingly, the errors themselves are fully intended! The goal of swear-word omission is, in part, the moment of reflection that a revealing error introduces. In that moment, the censors hope, we might reflect on the “problems” in our coarse choice of language and consider communicating differently.

These technologies don’t keep us from swearing any more than other technology designers can control our actions — we usually have the option of using or designing different technologies. But every technology offers affordances that make certain things easier and others more difficult. This may or may not be intended, but it’s always important. Through errors like those made by our prudish spell-checkers and predictive text input systems, some of these affordances, and their sources, are revealed.

Bucklame and Predictive Text Input

I recently heard that “Bucklame,” apparently a nickname for New Zealand’s largest city, Auckland, has its source in a technical error that is dear to my heart. The nickname seems to stem from the fact that many mobile phones’ predictive text input software will suggest the term “Bucklame” when a user tries to input “Auckland” — the latter of which was apparently not in the software’s list of valid words.

In my initial article on revealing errors, I wrote a little about the technology at the source of this error: Tegic’s (now Nuance’s) T9 predictive text technology, a common way for users of mobile phones with a normal keypad (9-12 keys) to quickly type text messages drawing on 50+ letters, numbers, and symbols. Here is how I described the system:

Tegic’s popular T9 software allows users to type in words by pressing the number associated with each letter of each word in quick succession. T9 uses a database to pick the most likely word that maps to that sequence of numbers. While the system allows for quick input of words and phrases on a phone keypad, it also allows for the creation of new types of errors. A user trying to type me might accidentally write of because both words are mapped to the combination of 6 and 3 and because of is a more common word in English. T9 might confuse snow and pony while no human, and no other input method, would.

Mappings of number-sequences to words are based on a database that offers words in order of relative frequency. These word frequency lists are built from a corpus of text in the target language pre-programmed into the phone. These corpora, at least initially, were not based on the words people use to communicate over SMS but on more readily available data sources (e.g., email, memos, or fiction). This leads to a problem common to many systems built on shaky probabilistic models: what is likely in one context may not be as likely in another. For example, while “but” is an extremely common English word, it might be much less common in SMS, where more complex sentence structures are often eschewed due to economy of space (160-character messages) and laborious data entry. The word “pony” might be more common than “snow” in some situations but it’s certainly not in my usage!
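To make this concrete, here is a toy model of the lookup. The keypad mapping is the standard one; the word list and its frequency ordering are invented for illustration (real T9 databases are proprietary and vastly larger):

    # Toy T9 lookup: map each word to its keypad digit sequence and
    # return candidates in frequency order (word list is invented).
    KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
            "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
    LETTER_TO_DIGIT = {c: d for d, letters in KEYS.items() for c in letters}

    # Most frequent first (assumed ordering).
    WORDS_BY_FREQUENCY = ["of", "me", "snow", "pony"]

    def digits(word):
        return "".join(LETTER_TO_DIGIT[c] for c in word.lower())

    def candidates(sequence):
        return [w for w in WORDS_BY_FREQUENCY if digits(w) == sequence]

    print(candidates("63"))    # ['of', 'me']: "of" wins, being more common
    print(candidates("7669"))  # ['snow', 'pony']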

Of course, proper nouns, of which there are many, are often excluded from these systems as well. Since the T9 system does not “know” the word “Auckland,” the nonsensical compound “bucklame” seemed an appropriate mapping for the same number-sequence. Apparently, people liked the error so much that they kept using it and, with time perhaps, it stopped being an error at all.
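The collision is easy to verify: on a standard phone keypad, “Auckland” and “bucklame” map to exactly the same eight-digit sequence. A quick check in Python:

    # Verify the Auckland/bucklame collision on a standard phone keypad.
    KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
            "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
    LETTER_TO_DIGIT = {c: d for d, letters in KEYS.items() for c in letters}

    def digits(word):
        return "".join(LETTER_TO_DIGIT[c] for c in word.lower())

    print(digits("auckland"))  # 28255263
    print(digits("bucklame"))  # 28255263: an identical key sequence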

As users move to systems with keyboards like Blackberries, Treos, Sidekicks, and iPhones (which use a dual-mode system), these errors become impossible. As a result, the presence of these types of errors (e.g., a swapped “me” and “of”) can tell communicators quite a lot about the type of device they are communicating with.

Creating Kanji

Errors reveal characteristics of the languages we use and the technologies we use to communicate them — everything from scripts and letter forms (which, while fundamental to written communication, are technologies nonetheless) to the computer software we use to create and communicate text.

I’ve spent the last few weeks in Japan. In the process, I’ve learned a bit about the Japanese language, in no small part through errors. Here’s one error that taught me quite a lot. The sentence is shown first in Japanese and then followed by an English translation:

今年から貝が胃に棲み始めました。
This year, a clam started living in my stomach.

Perhaps needless to say, this was an error. It was supposed to say:

今年から海外に住み始めました。
This year, I started living abroad.

When the sentences are transliterated into romaji (i.e., Japanese written in Roman script), the similarity becomes much clearer to readers who don’t understand Japanese:

Kotoshikara kaiga ini sumihajimemashita.
Kotoshikara kaigaini sumihajimemashita.

Kotoshikara means “since this year.” Sumihajimemashita means “has started living.” The word kaigaini means “abroad” or “overseas.” Kaiga ini (two words) means “clam in stomach.” When written phonetically in romaji, the only difference between the two sentences lies in the introduction of a word break in the middle of “kaigaini.” Written out in Japanese, the sentences are quite different; even without understanding the language, one can see that more than a few of the characters in the two sentences differ.

In English, word spacing plays an essential role in making written language understandable. Japanese, however, is normally written without spaces between words.

This isn’t a problem in Japanese because the Japanese script uses a combination of logograms — called kanji — and phonetic characters — called hiragana and katakana or simply kana — to delimit words and to describe structure. The result, to Japanese readers, is unambiguous. Phonetically and without spaces, the two sentences are identical in either kana or romaji:

ことしからかいがいにすみはじめました。
Kotoshikarakaigainisumihajimemashita.

In purely phonetic form, the sentence is ambiguous. Using kanji, as shown in the opening examples, this ambiguity is removed. While phonetically identical, “kaigaini” (abroad) and “kaiga ini” (clam in stomach) are very different when kanji is used; they are written “海外に” and “貝が胃に” respectively and are not easily confusable by Japanese readers.

This error, and many others like it, stems from the way that Japanese text is input into computers. Because there are more than 4,000 kanji in frequent use in Japan, there simply are not enough keys on a keyboard to input kanji directly. Instead, Japanese text is input phonetically (i.e., in kana) without spaces or explicit word boundaries. Once the kana is input, users transform the phonetic representation of their sentence or phrase into a version using the appropriate kanji logograms. To do so, Japanese computer users employ special software that contains a database of mappings from kana to kanji. In the process, this software makes educated guesses about where word boundaries are. Usually, the computer guesses correctly. When it gets things wrong, users need to go back and tweak the conversion by hand or select from other options in a list. Sometimes, when users are in a rush, they accept an incorrect kana-to-kanji conversion. It would be obvious to any Japanese computer user that this is precisely what happened in the sentence above.
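A greatly simplified sketch can show the ambiguity the conversion software faces. Real input methods rely on large lexicons and statistical models to rank candidates; here a tiny invented dictionary is enough to produce both readings of the example phrase:

    # Toy kana-to-kanji converter: find every way to segment a phonetic
    # string using a dictionary (tiny invented dictionary; real input
    # methods use large lexicons and statistics to rank candidates).
    DICTIONARY = {
        "かいがいに": "海外に",  # kaigaini: "abroad"
        "かいが": "貝が",        # kai ga: "a clam" (with subject particle)
        "いに": "胃に",          # i ni: "in (the) stomach"
    }

    def conversions(kana):
        """Return every full segmentation of kana into dictionary words."""
        if not kana:
            return [""]
        results = []
        for i in range(1, len(kana) + 1):
            head = kana[:i]
            if head in DICTIONARY:
                for rest in conversions(kana[i:]):
                    results.append(DICTIONARY[head] + rest)
        return results

    print(conversions("かいがいに"))  # ['貝が胃に', '海外に']: both readings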

This type of error has few parallels in English but is extremely common in Japanese writing. The effects, like this one, are often confusing or hilarious. For a Japanese reader, this error reveals the kana-to-kanji mapping system and the computer software that implements it — nobody would make such a mistake with pen and paper. For a person less familiar with Japanese, the error reveals a number of technical particularities of the Japanese writing system and, in the process, some of the ways in which Japanese differs from other languages they might speak.