December 12th, 2005 Curt Monash
I argue a lot with relational purists. On the whole they’re smart people, but they do have their blind spots.
One of the biggest is in the area of text. They fail to see how text data management is fundamentally different from tabular data management. Here’s a little article explaining why text doesn’t fit well into the relational model.
Posted in Search and text storage | 1 Comment »
December 11th, 2005 Curt Monash
In previous posts I argued that what’s holding the text technology industry back is the lack of a viable ontology management system. The obvious objection to such a suggestion is: Who would use it? There is no business process for ontology management, even less than there is for “knowledge management,” and for that matter less than there was for “knowledge engineering” during the expert systems bubble of the 1980s. Enterprises do not have anything like a “chief ontologist.” Indeed, that job title sounds like a joke — a touchy-feely liberal-artsy nonstarter.
The only way a successful product category of ontology management systems can emerge is if the products are usable by ordinary IT personnel. Vendor-supplied product training can be required, of course. Some day there can be certifications, and maybe a single class in a computer science curriculum. But almost nobody is going to buy a product whose use requires a master’s degree in library science or “ontology management.”
So here are some very high-level requirements I think an ontology management system needs to meet.
1. Basic knowledge representation has to be flexible. It has to accommodate semantic net kinds of relationships (is_an_instance_of, is_a_subcategory_of). It also has to accommodate machine learning/statistical kinds of evidence (both positive and negative evidence).
2. There has to be strong layering/versioning. Pieces of the ontology will come from the vendor. Pieces will come from frequently-updated machine-learning exercises against an enterprise’s own corpus(es). Pieces will be added by hand, through a collaboration between IT and (at first) power users. It will have to be possible to reverse any of those pieces out, to apply different pieces for different specific applications, and so on.
3. There need to be standard, open ways for different kinds of applications to use the ontologies. UIMA could be a starting point.
4. The product needs to be industrial-strength – reliable, scalable, secure, sufficiently easy to administer, available on a sufficient range of platforms, and compliant with general standards (not just the text-specific ones).
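To make requirements 1 and 2 a little more concrete, here is a minimal Python sketch of one way they might fit together. Everything in it is an illustrative assumption rather than a description of any existing product: the `Fact`, `Layer`, and `LayeredOntology` names, the layer names, and the "jaguar" data are all hypothetical. Each fact is either a semantic-net relationship or a piece of weighted evidence (positive or negative), and facts live in named layers that can be reversed out or selected per application.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """One assertion: a semantic-net edge plus a signed evidence weight."""
    subject: str
    relation: str        # e.g. "is_an_instance_of", "is_a_subcategory_of"
    obj: str
    weight: float = 1.0  # positive = supporting evidence, negative = evidence against


@dataclass
class Layer:
    """A named, independently removable piece of the ontology."""
    name: str            # hypothetical names: "vendor", "machine_learned", "hand_edited"
    facts: list = field(default_factory=list)


class LayeredOntology:
    def __init__(self):
        self.layers = {}  # layer name -> Layer, kept in insertion order

    def add_layer(self, layer):
        self.layers[layer.name] = layer

    def remove_layer(self, name):
        """Reverse a whole layer out without touching the others."""
        self.layers.pop(name, None)

    def view(self, layer_names=None):
        """Return the facts visible to an application using the chosen layers."""
        names = list(self.layers) if layer_names is None else layer_names
        return [f for n in names if n in self.layers for f in self.layers[n].facts]

    def score(self, subject, relation, obj, layer_names=None):
        """Sum the evidence for one assertion across the chosen layers."""
        return sum(
            f.weight
            for f in self.view(layer_names)
            if (f.subject, f.relation, f.obj) == (subject, relation, obj)
        )


# Hypothetical data: a vendor-supplied fact, plus machine-learned evidence
# from an enterprise corpus that partly contradicts it.
onto = LayeredOntology()
onto.add_layer(Layer("vendor", [Fact("jaguar", "is_a_subcategory_of", "cat")]))
onto.add_layer(Layer("machine_learned", [
    Fact("jaguar", "is_an_instance_of", "car_brand", 0.5),
    Fact("jaguar", "is_a_subcategory_of", "cat", -0.25),
]))

combined = onto.score("jaguar", "is_a_subcategory_of", "cat")
vendor_only = onto.score("jaguar", "is_a_subcategory_of", "cat", ["vendor"])
```

In this sketch, one application could pass `["vendor"]` to see only the vendor layer while another uses every layer, and calling `remove_layer("machine_learned")` backs out a bad learning run without disturbing anything added by hand.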
Obviously, these requirements are nontrivial to achieve. But if some vendor does do a good job on them, the payoff could be huge. Dominance of the enterprise text technologies market – which would be a greatly expanded market – is at stake.
I think it will happen.
Posted in Ontologies and context identification | 4 Comments »
December 11th, 2005 Curt Monash
The text technologies market should be booming, but actually is in disarray. How, then, do I think it should be fixed? I think the key problem can be summed up like this:
There’s a product category that is a key component of the technology, without which it won’t come close to living up to its potential benefits. But there’s widespread and justified concern over its commercial viability. Hence, the industry cowers in niches where it can indeed eke out some success despite products that fall far short of their true potential.
The product category I have in mind, for lack of a better name, is an ontology management system. No category of text technology can work really well without some kind of semantic understanding. Automated clustering is very important for informing this understanding in a cost-effective way, but such clustering is not a complete solution – hence the relative disappointment of Autonomy, the utter failure of Excite, and so on. Rather, there has to be some kind of concept ontology that can be used to inform disambiguation. It doesn’t matter whether the application category is search, text mining, command/control, or anything else; semantic disambiguation is almost always necessary for the most precise, user-satisfying results. Maybe it’s enough to have a thesaurus – i.e., a list of synonyms. Maybe it’s enough to define “concepts” by simple vectors of word likelihoods. But you have to have something, or your search results will be cluttered, your information retrieval won’t fetch what you want it to, your text mining will have wide error bars, and your free-speech understanders will come back with a whole lot of “I’m sorry; I didn’t understand that.”
This isn’t just my opinion. Look at Inquira. Look at text mining products from SPSS and many others. Look at Oracle’s original text indexing technology and also at its Triplehop acquisition. For that matter, look at Sybase’s AnswersAnywhere, in which the concept network is really just an object model, in the full running-application sense of “object.” Comparing text to some sort of thesaurus or concept representation is central to enterprise text technology applications (and increasingly to web search as well).
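As a rough illustration of what defining “concepts” by simple vectors of word likelihoods could look like, here is a small Python sketch. The concept names, the words, the likelihood values, and the `UNSEEN` smoothing constant are all made up for the example; the scoring is a plain naive-Bayes-style sum of log-likelihoods, which is one simple choice among many and not a description of how any vendor named above actually works.

```python
import math

# Hypothetical concept vectors: for each sense of "jaguar", an assumed
# likelihood of seeing each word in the surrounding text.
CONCEPTS = {
    "jaguar_animal": {"jungle": 0.30, "prey": 0.25, "cat": 0.25, "speed": 0.10},
    "jaguar_car": {"engine": 0.30, "dealer": 0.25, "speed": 0.20, "luxury": 0.15},
}

# Assumed small likelihood for words a concept has never been seen with,
# so that one unknown word does not zero out a concept entirely.
UNSEEN = 0.01


def disambiguate(context_words, concepts=CONCEPTS):
    """Pick the concept whose word likelihoods best explain the context."""
    scores = {
        name: sum(math.log(likelihoods.get(w, UNSEEN)) for w in context_words)
        for name, likelihoods in concepts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


car_sense, _ = disambiguate(["engine", "speed", "dealer"])
animal_sense, _ = disambiguate(["jungle", "prey"])
```

A word such as “speed” is plausible under both senses, so it shifts the scores only slightly. The words that one concept rates highly and the other has never seen are what settle the choice, which is also why a bare synonym list alone tends to leave results cluttered.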
Could one “ontology management system,” whatever that is, service multiple types of text applications? Of course it could. The ideal ontology would consist mainly of four aspects:
1. A conceptual part that’s language-independent.
2. A general language-dependent part.
3. A sensitivity to different kinds of text – language is used differently when spoken, for instance, than it is in edited newspaper articles.
4. An enterprise-specific part. For example, a company has product names, and competitors with product names, and those names have abbreviations, and so on.
Relatively little of that is application-specific; for any given enterprise, a single ontology should meet most or all of its application needs.
Coming up: The legitimate barriers to the creation of an ontology management system market, and ideas about how to overcome them.
Posted in Enterprise search, Natural language and speech recognition, Ontologies and context identification, Search and text storage, Speech recognition, Text mining | 5 Comments »
December 9th, 2005 Curt Monash
The text technologies market should be huge and thriving. Actually, however, it’s in disarray. Multiple generations of enterprise search vendors have floundered, with the Autonomy/Verity merger being basically a combination of the weak. The RDBMS vendors came up with decent hybrid tabular/text offerings, and almost nobody cared. (Admittedly, part of the reason for that is that the best offering was Oracle’s, and Oracle almost always screws up its ancillary businesses.) Email searchability has been ridiculously bad since, well, since the invention of email. And speech technology has floundered for decades, with most of the survivors now rolled into the new version of Nuance.
Commercial text mining is indeed booming, but not to an extent that erases the overall picture of gloom. It’s at most a several hundred million dollar business, and one that’s highly fragmented. For example, at a conference on IT in life sciences not that long ago, two things became evident. First, the text mining companies were making huge, intellectually fascinating, life-saving contributions to medical research. Second, more than ten vendors were divvying up what was only around a $10 million market.
If text technology is going to achieve the prominence and prosperity it deserves, something dramatic has to change.
Posted in Enterprise search, Natural language and speech recognition, Search and text storage, Speech recognition, Text mining | 2 Comments »
December 9th, 2005 Curt Monash
From a number of standpoints, the market for enterprise technologies that explicitly* manage text SHOULD be huge. Consider:
1. The market for consumer text search is huge — think of Google.
2. The market for implicit* management of text is huge. Email management is a significant fraction of the IT budget, if you factor in the predominance of email in the use of networks and PCs. Now regulations are compelling email to be stored and managed at great expense. People spend hours per day working on email, word processors, etc.
3. The text mining market has recently boomed, and good ROI appears to be the norm.
*My implicit vs. explicit distinction here is meant to distinguish technologies that manage text as a database BLOB or some other opaque blob vs. technologies that take account of the fact that it is text, which contains words, phrases, synonyms, and so on.
If text technologies could live up to researchers’ dreams, typical knowledge workers would save hours per week and in many cases hours per day. The benefits would rival at least those of the whole PC/office productivity/messaging set of technologies. Thus, at least in theory, the market potential for these technologies is enormous.
Posted in Search and text storage, Text mining | 3 Comments »