Sunday, September 10, 2006

 

XML Still Isn’t "Self-Describing"

A few weeks ago I wrote a post titled "XML isn't self-describing," and I thought I was done with that topic. But a comment on that post (thanks, it is nice to know that some people who are not my students are reading what I'm writing) has made me want to return to it. And for my students taking "Information Organization and Retrieval" from me this semester at the UC Berkeley School of Information, I'll refer back to an article we talked about to reinforce some lessons with this response.

The commenter said:

Don't you think that it is possible within "known" problems to create self-describing information with XML?

For example, a known problem such as ordering dinner at a restaurant or a personal address book might be very capably handled by XML. There might be variations, but I think we could agree on a dozen tags that were obvious within the context. I might not be familiar with the <holdpickles/> tag, but intuitively I would understand given my familiarity with the problem.

This is a very typical rebuttal: it acknowledges that XML isn't self-describing in most cases, but argues that in "familiar" domains there is enough agreement about what words mean for it to be so. It seems "obvious" that people agree on the vocabulary for things like restaurants and addresses, but it just isn't true.
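To make the point concrete, here is a minimal Python sketch of what goes wrong when two people encode the same restaurant order with tags that each of them finds "obvious." Only the holdpickles tag comes from the comment above; the nopickles tag, the order and burger elements, and the function name are my own hypothetical choices for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the same order, written by two different people.
ORDER_A = "<order><burger><holdpickles/></burger></order>"
ORDER_B = "<order><burger><nopickles/></burger></order>"


def wants_pickles_held(order_xml: str) -> bool:
    """Return True only if the order uses the one tag this consumer knows about."""
    root = ET.fromstring(order_xml)
    return root.find(".//holdpickles") is not None


print(wants_pickles_held(ORDER_A))  # True
print(wants_pickles_held(ORDER_B))  # False: same intent, different name
```

A human reader would probably guess that both orders mean the same thing, but the program has no such intuition. Nothing in the markup itself says that the two tags are equivalent; that knowledge has to come from an agreement outside the document.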

There is hard scientific evidence from experimental studies of "statistical semantics" that there is little agreement on the words that people assign as names to common objects or processes. The classic paper, reporting a number of compelling experiments, is "The Vocabulary Problem in Human-System Communication" by George Furnas, Tom Landauer, Louis Gomez, and Sue Dumais in the Communications of the ACM way back in 1987.

In my IO & IR class last Thursday we replicated the basic results with a few things that I grabbed from my office on the way to my lecture (a dollar bill, a coffee cup, a pocket knife, and a photo of me and my wife). I asked students to think of the best one- or two-word descriptions for these familiar things. For some of them there were four or five different ones (e.g., dollar, bill, buck, greenback; cup, mug, coffee cup, drink; and so on).

This result is pretty surprising to people. When you ask a few dozen people, you get even more different terms suggested as the "best" or "most natural" name for a common object. But because any given person can only think of a few names as "intuitive" fits, they can't imagine that there could be such diversity, and so they greatly overestimate the amount of agreement about names -- and they conclude that the name of something is a reliable description of it.
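One simple way to put a number on this kind of disagreement is to ask: if you pick two respondents at random, how likely is it that they chose the same name? The Python sketch below computes that probability. The four names are the ones my students offered for the dollar bill, but the tallies are invented for illustration (I did not record the actual counts), and the function name is my own.

```python
from collections import Counter


def agreement_probability(names):
    """Probability that two distinct, randomly chosen respondents gave the same name."""
    n = len(names)
    if n < 2:
        raise ValueError("need at least two responses")
    counts = Counter(names)
    # Ordered pairs of different respondents who chose the same name.
    same_pairs = sum(c * (c - 1) for c in counts.values())
    return same_pairs / (n * (n - 1))


# Hypothetical tallies for ten respondents naming the dollar bill.
responses = ["dollar"] * 4 + ["bill"] * 3 + ["buck"] * 2 + ["greenback"] * 1

print(round(agreement_probability(responses), 3))  # 0.222
```

Even with one name clearly in the lead, two people in this made-up sample would agree less than a quarter of the time. Each of them would still feel that their own choice was the intuitive one.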

So the only thing about names being "self-describing" is that they describe the meaning of something to oneself. Just not necessarily for anyone else.

-Bob Glushko






Comments:
While I agree with you that XML isn't self-describing, your experiment doesn't prove that. In fact, had you taken those items and had one person name them, I'm guessing that everyone or almost everyone in the room would have recognized the items from the names given to them.

A better experiment might be to have the class write a bibliographic citation for a book, and see how many different formats there are, or something of that nature.
 
Touché, Bob. Your argument is convincing, and the conclusion follows logically from the ambiguity of the written/spoken word. I agree the notion of a look-up dictionary to resolve ambiguity in language might be useful. However, because many words are both synonymous yet distinct, I'm afraid we eventually run into the same problem.
 
 