June 10, 2006

Four (at least) issues in text and taxonomy federation

For any issue in text analytics, there are a lot of smart people who understand it very well (and, in particular, better than I do). But I suspect that even many of the discipline’s better thinkers have failed to grasp just what will be involved in taking text analytic technologies to the next level of usefulness and adoption. One example is what I see as the requirements for an ontology management system; I’m not aware of anybody else right now who has as expansive a view of this as I do. In this post, I’m going to address another issue whose complexity is in my opinion under-appreciated – federation. More specifically, I’ll argue that the various aspects of the federation task are a big part of what makes ontology management so complex.

The thing is – there are at least four different senses of “federation” in the text sphere, all of them important, most of them still technically unsolved. Namely:

1. Federated UIs.

2. Federated relevancy rankings.

3. Federated taxonomies (part 1) – synonym lists.

4. Federated taxonomies (part 2) – structure.

“Federated UI” is not a common term; rather, people just talk about “portals,” or maybe even “knowledge management.” Cool. Anyhow, in its classic form a portal has a lot of different kinds of information presented or linked to on different parts of a screen, much of it textual. The best portals are rich enough to become unholy messes without some kind of context-sensitivity, where the context is most commonly a specific subject, the role of the user, or both. Clearly, there’s a taxonomy at work here, but typically this is not where the serious linguistic stresses lie.

Much less has been accomplished in the area of federated relevancy rankings. Suppose you have several lists of hits on a search, obtained from several corpuses. You can pretend they’re all from the same corpus and re-rank, but then hits from a single corpus will usually dominate the top of the list. Or you can deliberately interleave hits from different corpuses. And – well, that’s about as sophisticated as it gets.
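
To make those two options concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the corpus names, the document ids, the raw scores, and the function names `pooled_rerank` and `round_robin`. It is not how any particular search product does it.

```python
from itertools import zip_longest


def pooled_rerank(result_lists):
    """Pretend all hits come from one corpus: pool them and sort by raw score."""
    pooled = [hit for hits in result_lists.values() for hit in hits]
    return sorted(pooled, key=lambda hit: hit[1], reverse=True)


def round_robin(result_lists):
    """Interleave: take the best remaining hit from each corpus in turn."""
    ranked = [
        sorted(hits, key=lambda hit: hit[1], reverse=True)
        for hits in result_lists.values()
    ]
    merged = []
    for tier in zip_longest(*ranked):
        merged.extend(hit for hit in tier if hit is not None)
    return merged


# Hypothetical hits as (document id, raw engine score).
# The two engines happen to score on very different scales.
results = {
    "wiki": [("wiki-1", 14.2), ("wiki-2", 11.7), ("wiki-3", 9.9)],
    "email": [("mail-1", 0.91), ("mail-2", 0.40)],
}

pooled = [doc for doc, _ in pooled_rerank(results)]
interleaved = [doc for doc, _ in round_robin(results)]

print(pooled)       # ['wiki-1', 'wiki-2', 'wiki-3', 'mail-1', 'mail-2']
print(interleaved)  # ['wiki-1', 'mail-1', 'wiki-2', 'mail-2', 'wiki-3']
```

In this toy data the pooled list puts every wiki hit ahead of every email hit, simply because the raw scores are not comparable. The interleaved list avoids that, but only by ignoring the scores across corpuses altogether.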

Why is this problem so hard? Well, let’s consider a closely related problem. We have multiple criteria by which to rank relevancy: Boolean results, hand-built metadata, link popularity, etc. How do we combine those into a single score? I’ll give you a two-word answer: “Not easily.” Sure, every search engine does it somehow. Web search engines even do a halfway-decent job of it. But in enterprise search, no general solution has yet been found.
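
One naive way to see the difficulty is a weighted sum of normalized signals, sketched below in Python. This is a swapped-in textbook technique, not a method the post endorses. The signal names, document ids, values, and weights are all hypothetical, as are the helpers `min_max`, `combine`, and `rank`.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant signal carries no information."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine(evidence, weights):
    """evidence: {signal: {doc: raw value}} -> {doc: weighted combined score}."""
    docs = sorted({doc for scores in evidence.values() for doc in scores})
    combined = {doc: 0.0 for doc in docs}
    for signal, scores in evidence.items():
        raw = [scores.get(doc, 0.0) for doc in docs]
        for doc, norm in zip(docs, min_max(raw)):
            combined[doc] += weights[signal] * norm
    return combined


def rank(evidence, weights):
    """Documents ordered from highest to lowest combined score."""
    combined = combine(evidence, weights)
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical relevancy evidence for three documents.
EVIDENCE = {
    "boolean_match": {"a": 1, "b": 1, "c": 0},
    "metadata": {"a": 0, "b": 1, "c": 1},
    "link_popularity": {"a": 120, "b": 3, "c": 40},
}

match_heavy = {"boolean_match": 0.5, "metadata": 0.3, "link_popularity": 0.2}
link_heavy = {"boolean_match": 0.2, "metadata": 0.1, "link_popularity": 0.7}

print(rank(EVIDENCE, match_heavy))  # ['b', 'a', 'c']
print(rank(EVIDENCE, link_heavy))   # ['a', 'c', 'b']
```

The arithmetic is trivial. The ranking, though, flips completely depending on the weights, and nothing in the data says which weights are right.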

I could go on about this point at near-infinite length, not least because it was a big part of what I worked on at Elucidate. I shall mercifully refrain. However, please note that the two versions I gave are essentially equivalent problems. Different corpuses will provide different kinds of hits, which are hits for different reasons. Thus, reconciling them requires a way to weigh different kinds of relevancy evidence. Conversely, if you really could do a great job of reconciling different kinds of relevancy evidence, “dump them all into one pot and rerank them” would in most cases be an effective solution.

How would a strong ontology management system help with reconciling different kinds of relevancy evidence? Honestly, I’m not sure how much it would. But I do know that, for many particular kinds of relevancy metric, an ontology is crucial to determining an accurate score.

Uh, I’m assuming that for anybody who’s bothered to read this far, that last point is pretty obvious. If it isn’t, please leave a comment on this post, and I’ll expand on it further.

Finally, there are at least two major kinds of issue in the federation of taxonomies themselves – adding synonyms at existing nodes, and extending the structure itself. The former might at first blush seem to be straightforward: “Oh, good! We have more sources of sometimes-useful synonyms. Let’s throw them into the mix, and our recall will be improved.” Well, yes and no. Unless the synonym list is context-sensitive, you open yourself up to all sorts of unfortunate possibilities. So once again, the ontology management/federation challenge here is non-trivial.
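
A small Python sketch of the difference between context-free and context-sensitive synonym expansion follows. The `SYNONYMS` table, its context labels, and the `expand` function are hypothetical, made up purely for illustration.

```python
# Hypothetical federated synonym entries: (term, context) -> synonyms.
SYNONYMS = {
    ("bug", "software"): {"defect", "fault"},
    ("bug", "entomology"): {"insect"},
}


def expand(term, context=None, synonyms=SYNONYMS):
    """Return the term plus its synonyms.

    With context=None every source's synonyms are thrown into the mix;
    with a context, only the entries filed under that context are used.
    """
    expanded = {term}
    for (head, ctx), alternatives in synonyms.items():
        if head == term and (context is None or ctx == context):
            expanded |= alternatives
    return expanded


print(sorted(expand("bug")))              # ['bug', 'defect', 'fault', 'insect']
print(sorted(expand("bug", "software")))  # ['bug', 'defect', 'fault']
```

The context-free expansion does improve recall, but a search of a bug tracker would now also match documents about insects. Deciding which context applies to a given query is the part this sketch leaves out.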

And if you want to federate multiple versions of a taxonomy with somewhat different structures – well, that’s so obviously a hard problem (at least in the general case) that I won’t make this post yet longer by spelling the reasons out.
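
For just one taste of the problem, the Python sketch below finds terms that two taxonomies both contain but file under different parents. The two taxonomies, their node names, and the `parent_conflicts` helper are hypothetical, and this covers only the simplest kind of structural disagreement.

```python
def parent_conflicts(tax_a, tax_b):
    """Each taxonomy maps child -> parent.

    Return the nodes present in both that sit under different parents,
    as {node: (parent in tax_a, parent in tax_b)}.
    """
    shared = sorted(tax_a.keys() & tax_b.keys())
    return {
        node: (tax_a[node], tax_b[node])
        for node in shared
        if tax_a[node] != tax_b[node]
    }


# Hypothetical taxonomies maintained by two different departments.
sales_taxonomy = {
    "laptop": "hardware",
    "tablet": "hardware",
    "e-reader": "tablet",
}
support_taxonomy = {
    "laptop": "computers",
    "tablet": "mobile devices",
    "e-reader": "tablet",
}

conflicts = parent_conflicts(sales_taxonomy, support_taxonomy)
print(conflicts)
# {'laptop': ('hardware', 'computers'), 'tablet': ('hardware', 'mobile devices')}
```

Detecting the disagreement is the easy part. Deciding whether “hardware” and “computers” are the same node, nested nodes, or overlapping ones is where the real work would start.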


