Saturday, May 13, 2006


Semantic Illiteracy

“Analyzing sets of possible values” is an important task in the design of information models and I emphasize it in courses and projects (it is also a section heading in Chapter 12 of my Document Engineering book). Specifying constraints on content is an essential part of specifying what something means, and enforcing those constraints is critical to interoperability.

But just as people often assign bad names to things or concepts, they often fail to analyze possible values, they specify them incompletely, or they just get it wrong. You could say that they suffer from semantic illiteracy.

A few weeks ago, in the 10 April 2006 New Yorker’s “Talk of the Town” column, David Owen wrote about the drop-down list for honorifics in the sign-up form for the Skywards frequent-flier program of Emirates airline. In addition to the usual Mr / Mrs / Ms / Miss / Dr, the article said that there are at least another hundred of them. Some of them are translations of the usual ones (Frau, Señor, etc.), but most are infrequent or even exotic, such as Dowager, King, Midshipman, Shriman, Swami, Sultan, The Very Reverend, Vice Admiral, and Viscount. These examples illustrate that honorifics can encode gender, age, occupation, organizational status, and cultural values. Languages and cultures vary a great deal in how they represent this information.

You might wonder whether Kings, Sultans, and Vice Admirals would ever fly on commercial airlines rather than on their private jets, but as a document engineer that’s not what interests me the most. So I made a brief tour of airlines to see how they handle honorifics and it was interesting.

I started with Emirates… and was astonished to see that instead of the long list described in the New Yorker article, the application form drop-down contained only Mr, Ms, Mrs, Miss, and Master. Maybe the Emirates webmaster was embarrassed by the article and changed the list. That might be the end of the story, except that my next stop on the airline tour was British Airways, and there I found the exact list described in the article. BA might even be the originator of the list, because that would explain all the British military, peerage, and Church of England honorifics on it.

I then checked United, the airline I fly most often and with which I have amassed over half a million miles in the frequent flier program. I am registered there as "Dr Robert J Glushko", but the plastic card they gave me says "Robert J Glushko". United’s drop-down for honorifics lists seven choices in this order: Mr, Ms, Mrs, Miss, Dr, Hon, Prof. I am also a member of US Air’s program, which uses check boxes instead of a drop-down menu on its registration form. US Air’s choices are Mr, Mrs, Ms, Dr, which is a shorter list (maybe fewer professors fly US Air) and, more interestingly, puts Mrs before Ms, the opposite of United’s order. Can we infer anything about how United and US Air treat women as employees or customers?

Air France offers forms in French, where the honorific drop-down is a short list (M, Melle, Mme), and also in English (Mr, Miss, Mrs) – the US Air ordering for the two titles for women. An interesting twist for Air France is that you get to these different forms by choosing a country, not a language, and when you choose Canada the form defaults to French.

Singapore Air, which I’ve only flown a couple of times but really enjoyed, has an odd drop-down in its registration form. Its list of choices is Mr, Ms, Mrs, Dr, Miss, Master, Madam, Others -- but choosing Others doesn’t ask or allow you to specify what other title you go by. Does this mean that if I chose Others my plastic card would say "Others Robert J Glushko" -- I feel like joining just to find out.

Asian cultures are very big on honorifics, so I expected Japan Air to have a comprehensive list of honorifics. I was astonished to see none at all, just the simple First Name and Last Name. This was on the English-language site, and since I don’t read Japanese, I couldn’t check what they do on the Japanese site. But maybe the Japan Air forms designers are better document engineers than those working for the other airlines and they understand how tricky the semantics are here.

The New Yorker article tries to say this in a clever way:

Attempts at exhaustiveness are inherently self-defeating; the longer a list, the more conspicuous its lacunae.

This isn’t the best advice. You should definitely be exhaustive in situations where there are standard code sets like ISO 3166 (country codes) or ISO 4217 (currency codes). And when you can’t be exhaustive because of the distribution’s "long tail," I’d recommend that you be sensitive to the frequency of the values, cut off the low-frequency tail, and provide an OTHER category.
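The recommendation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 5% cutoff, and the sample counts are all hypothetical and do not come from any airline's data. It keeps the values that occur often enough, drops the tail, and appends an OTHER entry that still preserves what the person actually typed.

```python
from collections import Counter

OTHER = "OTHER"  # catch-all entry for values in the low-frequency tail


def build_code_list(observed, min_share=0.05):
    """Keep values whose share of observations is at least min_share
    (most frequent first), then append the OTHER category.
    The 5% default is an arbitrary, illustrative cutoff."""
    counts = Counter(observed)
    total = sum(counts.values())
    kept = [value for value, n in counts.most_common() if n / total >= min_share]
    return kept + [OTHER]


def encode(value, code_list):
    """Map a raw value to a code-list entry. Values outside the list
    become OTHER, with the raw text kept alongside so it is not lost."""
    if value != OTHER and value in code_list:
        return value, None
    return OTHER, value


# Hypothetical sample of honorifics entered by 100 applicants.
sample = ["Mr"] * 50 + ["Ms"] * 30 + ["Mrs"] * 12 + ["Dr"] * 6 + ["Viscount", "Swami"]
honorifics = build_code_list(sample, min_share=0.05)

print(honorifics)
print(encode("Dr", honorifics))
print(encode("Viscount", honorifics))
```

Returning the raw text together with OTHER is a deliberate choice in this sketch: it avoids the situation where choosing the catch-all gives the form no way to record which title the person actually uses.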

But I think there’s a larger message here. The heterogeneity of approaches here for what might seem to be a straightforward information modeling task shows that many people just don’t realize how difficult it is to be precise about what something means. We emphasize "computer literacy" (desktop applications and web surfing) but I’ve never heard anyone fret about how poorly people name and define the things and concepts that their computer applications capture and process for them, which seems more important to me. We need "semantic literacy" or maybe even "ontological literacy" but maybe we don’t teach it because it is too hard to explain what they mean.

-Bob Glushko

Dr Glushko -

There's actually been a LOT of hand-wringing over the years about how things are named/labeled in applications.

In a simpler time it used to be called simply "naming standards" or "naming conventions."

I first became interested in the topic at an insurance company with a 64 page masterfile (this was prior to widespread use of commercial DBMS products).

Allegedly, in the process of preparing to move from this custom-built flat file to some to-be-determined DBMS (DB2 eventually, I believe), they had discovered some 70 different names for the central business concept "policy number."

"Naming conventions" is still a very contentious issue, one problem being that it's fundamentally left up to the programmer whether or not to use "good names."

I'm sure you can find in your personal technical library some beginner's language manual which lists the various technical constraints for naming things (length, acceptable character set, allowable separator, special meanings, etc.)... and then the seemingly universal: "use meaningful names." End of discussion. On to the next topic.

The question of course is: Meaningful to whom? Me now? Or the 25th maintenance programmer who has to read this code in 15 years?

FACTOIDS to consider...

There are currently multiple robust industrial strength RDBMS engines typically available for free to developers.

There are 500+ languages in active use. [This is per Capers Jones's function point studies.]

Essentially every language has its own unique quirks on how it labels/names things. Case sensitive or not, lower case, UPPER CASE ONLY, underscores, dashes, camelCase, maximum 8 or 27 or 43 or 256 characters long (and everything in between).

Allegedly the "typical application" is written in 6+ languages.

There are NO (zero, bupkus, nada, none, zippo, zilch) tools that help with automating this complex task.

In the profession of document management (surely as old as the Pyramids), software is not considered to be a document.

There are some 500,000+ formal English words each typically with multiple formal meanings. This is not counting acronyms & abbreviations.

I personally believe there is some support for the claim that the core conceptual language of an enterprise is in the 1,500 to 6,000 concepts range. [Counting thusly: H2O {water, ice, steam, fog, rain, hail, snow, cloud} counts as a single concept.]

... please draw your own conclusions....

- David Eddy
Your way of expressing thoughts inspires me a lot..