Sunday, August 27, 2006


Sorry Pluto… And Some Thoughts About Categorization

Because semantics and categorization are key themes in document engineering and in the courses I teach, I've been flabbergasted by much of the recent reaction to the "demotion" of Pluto from planethood to the inferior status of "dwarf planet" by the International Astronomical Union. The IAU recently passed a resolution that defined a planet within our solar system as:

A celestial body that is (a) in orbit around a star, (b) has sufficient mass for its self-gravity to overcome rigid body forces so that it assumes a hydrostatic equilibrium (nearly round) shape, and (c) has cleared the neighbourhood around its orbit.

Because Pluto doesn't satisfy the third requirement, it no longer is classified as a planet. This has generated a great deal of news and caused lots of people to get upset. A typical headline is "Pluto's demotion has schools spinning" -- elementary school science teachers just don't know what to teach. And the widow of the astronomer who discovered Pluto in 1930 says she is all shook up.

But "Pluto's place is safe with astrologers" and that's crucially important to me because apparently Pluto is the most important planet for Scorpios and I am a Scorpio.

These stories are pathetically amusing and that alone makes them interesting, but what really intrigues me is how they illustrate how people understand categories. "Planet" is a category with profound historical and cultural importance, and because of the IAU resolution, we get to witness a very clear and sudden shift in how that category is defined.

For millennia we earthlings have had a notion of planet as a "wandering" celestial object, but because we only knew of planets in our own solar system, we could define "planet" by enumeration. Very few categories can be understood that way, that is, by making an exhaustive list of their members. But once we acknowledge the existence of planets outside our solar system, the set of planets becomes unbounded, and the lack of a definition becomes apparent. Then we can have arguments about the definitions, and hence biases of one kind or another get built into the categorization.

The popular reaction to this new way of understanding "planet" by intensional definition (a rule that members must satisfy) rather than by enumeration suggests that many people are living with the delusion that there is an objective reality in which categories and definitions are objective and unchanging. And it is scary to read that the IAU members gave serious discussion to the likely impact their new definition would have on elementary school science. I thought progress in science, scientific revolutions, creative destruction and all that meant that we should look forward and not worry about the “installed base” of people with a sixth-grade science education.
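The difference between the two ways of defining a category can be sketched in a few lines of Python. This is only an illustration: the `Body` class and its three boolean fields are hypothetical stand-ins for the three clauses of the resolution, not anything the IAU publishes.

```python
from dataclasses import dataclass

# Definition by enumeration: the category is nothing more than its member list.
PLANETS_BY_ENUMERATION = frozenset({
    "Mercury", "Venus", "Earth", "Mars", "Jupiter",
    "Saturn", "Uranus", "Neptune", "Pluto",
})

@dataclass(frozen=True)
class Body:
    """Hypothetical record; one flag per clause (a), (b), (c) of the definition."""
    name: str
    orbits_star: bool
    nearly_round: bool
    cleared_neighbourhood: bool

def is_planet(body: Body) -> bool:
    """Definition by rule: membership is decided by testing properties."""
    return body.orbits_star and body.nearly_round and body.cleared_neighbourhood

earth = Body("Earth", True, True, True)
pluto = Body("Pluto", True, True, False)  # fails the third requirement

print(pluto.name in PLANETS_BY_ENUMERATION, is_planet(pluto))
```

The enumerated set can only be extended by editing the list, while the rule applies to any body we can describe, which is exactly why the arguments move to the wording of the rule.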

-Bob Glushko

Saturday, August 12, 2006


XML Isn't "Self-Describing"

I am so sick and tired of reading that XML is “self-describing.” It isn't. I could link to 100 web articles or blog posts that proclaim that it is, and even the popular "Learning XML" book by Eric Ray that I've used to teach XML says it is ("Creating self-describing data" is on the book cover). But I was working with XML for 10 years even before it existed, back when it was a 4-letter word (that's a joke about SGML that I credit to Bob DuCharme), and it wasn't self-describing then and never will be. And that's a feature, not a bug.

Let me try to be charitable, and assume that when people say XML is "self-describing" they really mean "compared to something else that clearly isn't." So the least "self-describing" information consists of just a stream of alphanumeric characters represented in some text format, as they might be on a punch card. This delimiter-less encoding doesn't even make explicit the tokenization of the characters into meaningful values, so there isn't even any "self" to which any description could be assigned:


The information here has been encoded in a position-sensitive way, and it turns out that there are three different information components that occupy fixed-length fields in the text stream. But we can't begin to describe the information here unless we have some mapping of positions to values. The possibility of description emerges when we separate the values with commas or some other delimiter character, which tells us what information components must be described:


But commas as delimiters provide no clues about what the components mean, do not enable any association of one component to another, and do not enable one component to be contained within another. A text encoding syntax that uses multiple delimiters like EDIFACT is a step closer to self-description, because it can implicitly represent structural or semantic hierarchy among components. XML goes one step further with the syntactic mechanisms of paired text labels to distinguish the information components in a stream of text and quotes to associate one bit of information as an attribute of another. So an XML encoding of this text stream might be:

<xxx yyy="4567">850</xxx>

The <, >, and " characters distinguish the information being described from the "markup" that is part of its description. This syntax allows more flexibility in the encoding of the text stream (without positional encoding, we no longer have to assume that the values are of fixed length). But the information isn't described by these syntactic markup mechanisms, and that's all that XML per se is contributing so far.
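The progression from positional encoding to delimiters to markup can be sketched in Python. This is a minimal illustration under stated assumptions: the values 850 and 4567 come from the XML example above, while the date value, the field widths, and the `LAYOUT` table are invented for the sketch.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical position-encoded record: 850 and 4567 are from the XML example;
# the trailing date and the field widths are assumptions.
FIXED = "850456720060812"
LAYOUT = [(0, 3), (3, 7), (7, 15)]  # out-of-band map of positions to fields

def parse_fixed(record, layout):
    """Slice a positional record; impossible without the external layout."""
    return [record[start:end] for start, end in layout]

def parse_delimited(record):
    """Split on commas; the delimiters alone identify the components."""
    return next(csv.reader(io.StringIO(record)))

def parse_xml(record):
    """Return the tag, attributes, and text that the markup sets apart."""
    element = ET.fromstring(record)
    return element.tag, dict(element.attrib), element.text

fixed_values = parse_fixed(FIXED, LAYOUT)
delimited_values = parse_delimited("850,4567,20060812")
xml_values = parse_xml('<xxx yyy="4567">850</xxx>')

print(fixed_values)
print(delimited_values)
print(xml_values)
```

Each parser recovers the same values with less outside help than the one before it, yet none of them says what 850 or 4567 means.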

I suppose that it is the text labels inside of XML's syntactic delimiters that cause most people to think that XML is self-describing. But these tags aren't part of XML, so it isn't XML that is doing the work. And what do these "tags" really contribute anyway? Instead of xxx, yyy, and zzz, I might have encoded the text stream this way:

<TransactionType reference="4567">850</TransactionType>

Using text labels in a language we "understand" might give us a warm feeling that we are describing the text content, but the tags really don’t do that.

Choosing the terms used for tags or naming anything is often a difficult and contentious activity. Everyone naturally creates names that make sense to them, but even when describing exactly the same thing, chances are very good that two people will choose different names for it. And they will often use the same name or tag for different things.

"TransactionType" and "reference" and "Date" might suggest something about the meaning of the content, but "suggesting something" is not enough to make it self-describing. To someone familiar with the ANSI X12 EDI standard, a "TransactionType" with a value of 850 is a Purchase Order, but most people wouldn't have any idea that I used this interpretation to make up this example. Does "Date" mean the date of the purchase order or the date I wrote about it? What about a <Price> tag -- does this tag describe the retail, wholesale, discounted, or FOB Sydney price? Does it describe the price for a single item, a dozen, or a pallet-full? What's the currency? The tag by itself can't possibly distinguish between these different descriptions, so it doesn't make the information self-describing.
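The Price problem can be made concrete with a small Python sketch. Everything in it is hypothetical: the two suppliers, the `PriceSemantics` record, and the `SEMANTICS` table stand in for whatever schema or documentation two trading partners would have to agree on outside the XML itself.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass(frozen=True)
class PriceSemantics:
    """Out-of-band facts that the <Price> tag alone does not convey."""
    kind: str       # retail, wholesale, discounted, ...
    currency: str   # currency code for the amount
    quantity: int   # number of items the amount covers

# Hypothetical agreements with two suppliers, kept outside the document.
SEMANTICS = {
    "supplier_a": PriceSemantics("retail", "USD", 1),
    "supplier_b": PriceSemantics("wholesale", "AUD", 12),
}

def unit_price(xml_text, source):
    """Interpret a <Price> element using the sender's documented semantics."""
    amount = float(ET.fromstring(xml_text).text)
    meaning = SEMANTICS[source]
    return amount / meaning.quantity, meaning.currency, meaning.kind

same_markup = "<Price>24.00</Price>"
from_a = unit_price(same_markup, "supplier_a")
from_b = unit_price(same_markup, "supplier_b")
print(from_a)
print(from_b)
```

The markup is byte-for-byte identical in both calls; only the lookup table, which lives outside the document, tells the two meanings apart.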

To be self-describing, the XML syntax and tags would have to simultaneously convey the specific information they mark up, all the semantic nuance needed to distinguish among synonyms or related concepts, and all the rules that govern relationships to other content – all without any additional information. If XML syntax and tags could magically do that by themselves we wouldn't need schemas or any documentation or other metadata. So as we said at the end of our geometry proofs, Q.E.D.


What do you suppose the "rating" and "weight" tags mean (from Google's recommendations to users of its "Google Base" service)?


Make a guess. You will probably be wrong. The tags aren't "self-describing" enough.

-Bob Glushko

Sunday, August 06, 2006


Deconstructing the Paperwork Burden

Every year the "Paperwork Reduction Act" requires that the Office of Management and Budget provide Congress with an estimate of how much time Americans spend filling out government forms. The goal is to "minimize the burden that responding to these collections imposes on the public, while maximizing their public benefit." The 2005 estimate was 8.4 billion hours, so for the 230 million people over age 16 (I assume that kids don’t fill out a lot of government forms) that's an average of about 36 hours per person. IRS tax forms account for 80% of this estimate, or roughly 30 hours a year.
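The per-person arithmetic is easy to check. The Python sketch below uses only the three figures quoted above; the variable names are mine.

```python
# Figures quoted in the post.
TOTAL_HOURS = 8.4e9   # 2005 estimate of hours spent on government forms
PEOPLE = 230e6        # people over age 16
IRS_SHARE = 0.80      # share of the burden attributed to IRS tax forms

hours_per_person = TOTAL_HOURS / PEOPLE
irs_hours_per_person = hours_per_person * IRS_SHARE

print(round(hours_per_person, 1))
print(round(irs_hours_per_person, 1))
```

The results land between 36 and 37 hours overall and between 29 and 30 hours for tax forms, which matches the rounded figures in the text.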

Even if we factor out homeless people, transients, people in prisons, and others who probably don't fill out a lot of government forms, the average goes up somewhat, but it still doesn't seem like such a big time burden. And since I know I spend a lot more time than that dealing with income tax recordkeeping and reporting, most people must be getting off relatively easy.

Nevertheless, a recent review of the OMB's report conducted by the US Government Accountability Office got some news coverage when it came out, and a lot of it wasn't very kind. There were lots of headlines like "Drowning in Paper" or "Buried under paperwork."

I didn't think that this kind of bashing was justified, so I read the complete GAO report, presented to Congress by Linda Koontz, and I was very impressed by it. Most government agencies seem to be taking the PRA very seriously. In particular, the PRA as amended in 1995 presents a checklist against which information collection should be evaluated (included as Table 1 of the GAO report), and OMB provides careful guidance on the why, when, and how of data collection in forms, surveys, and questionnaires.

Most people would identify two kinds of causes for the paperwork burden:

1) Government action. Most of the burden correlates with the extent of data collection mandated by law or regulation, and this is what most critics focus on.

2) Design of the forms or other data collection instruments. The burden is increased by forms that are poorly designed, either from a physical or presentational standpoint (confusing layout, tiny fonts, not enough space) or from a more conceptual or architectural standpoint (ambiguous language, choices or enumerations that don't span the entire range or that are incompatible).

But recent thinking about estimating the paperwork burden is surprisingly sophisticated. Rather than analyzing the burden form-by-form, the IRS is developing a methodology that considers characteristics of taxpayers and the context in which they complete their tax returns. These include the way in which they gather and manage relevant information, any software they use, and whether they work with others (like accountants or tax preparation services). This more complex methodology will make information standards, interoperability, and other document engineering considerations more prominent in the analysis to yield more accurate estimates of the compliance burden and suggest better ways to reduce it. The IRS is also making significant investments in XML and online filing.

And consider one more thing. At least the government is trying to minimize the burden that collecting information imposes on its "customers" and is very transparent in making the case for what it does. The private sector has no such goals, and consciously misrepresents or obscures its efforts to wring every last bit of information from consumers. For example, when you buy something, did you know that your warranty rights are not in the least bit undermined if you don't fill out product registration cards and the detailed surveys that usually are part of them?

-Bob Glushko
