Enterprise search

Analysis of enterprise-specific search technology (as opposed to general web search).

July 29, 2006

Web search and enterprise search are coming together

Web search and enterprise search are in many ways fundamentally different problems. The biggest problem in web search is screening out pages that deliberately pretend to be relevant to a search. The second biggest problem is picking out the crème de la crème from a long list of essentially good hits. In enterprise search, on the other hand, the biggest problem is finding a single document, or single fact, that is lonely at best, and if you’re unlucky doesn’t exist in the corpus at all. Document structures are also completely different, as are linking structures and almost every other input to the ranking algorithms except the raw words themselves.

Even so, the businesses and technologies of web and enterprise search are beginning to combine. Read more

July 29, 2006

Convera aka Excalibur aka ConQuest

Once upon a time, more than a decade before the founding of Autonomy, a New Mexico inventor had the idea for a generic pattern recognition tool. He implemented it on a PC add-in board that, if I recall correctly, plugged into the Apple II. This was the genesis of the company Excalibur Technologies.

Read more

July 23, 2006

Text mining for compliance and legal discovery

One theme that keeps recurring in my talks with text mining and other text analytics/text technology companies is compliance. Ditto legal discovery, which is closely related. Most of the focus seems to be on three kinds of data:

  1. Vehicle defect evidence. The TREAD Act is of course the big driver here (no pun intended).
  2. Drug side effect evidence. The FDA is pushing that one.
  3. Email/correspondence archives. Text search/filtering/clustering/mining whatever is now a standard part of legal discovery.

Read more

July 23, 2006

Update: Autonomy/Verity merger

I had a couple of very interesting calls with Autonomy last week. One message I got was that they do not want to be pigeonholed in search, which they think on the whole is a primitive way of dealing with “unstructured information.” Nonetheless, my first post based on those calls will indeed focus on text indexing and search. You see, I wrote quite skeptically about the Autonomy/Verity merger when it was announced, and I’d like to amend that with an updated opinion. Autonomy’s claims can be summarized in part by the following: Read more

July 11, 2006

Google’s internal text-based project/knowledge management

Slashdot turned up an amazing article in Baseline on Google’s infrastructure. There’s lots of gee-whiz stuff in there about server farms, petabytes of disk packed into a standard shipping container so as to allow the setup of more server farms around the globe, and so on. But even more interesting to me was another point, about Google’s internal use of its own technology. In at least one case – a hybrid of project and knowledge management – Google really seems to be doing what other firms only dream about as futures. Here’s the relevant excerpt:

Read more

December 11, 2005

The text technologies market 3: Here’s what’s missing

The text technologies market should be booming, but actually is in disarray. How, then, do I think it should be fixed? I think the key problem can be summed up like this:

There’s a product category that is a key component of the technology, without which it won’t live up to nearly its potential benefits. But there’s widespread and justified concern over its commercial viability. Hence, the industry cowers in niches where it can indeed eke out some success despite products that fall far short of their true potential.

The product category I have in mind, for lack of a better name, is an ontology management system. No category of text technology can work really well without some kind of semantic understanding. Automated clustering is very important for informing this understanding in a cost-effective way, but such clustering is not a complete solution – hence the relative disappointment of Autonomy, the utter failure of Excite, and so on. Rather, there has to be some kind of concept ontology that can be used to inform disambiguation. It doesn’t matter whether the application category is search, text mining, command/control, or anything else; semantic disambiguation is almost always necessary for the most precise, user-satisfying results. Maybe it’s enough to have a thesaurus – i.e., a list of synonyms. Maybe it’s enough to define “concepts” by simple vectors of word likelihoods. But you have to have something, or your search results will be cluttered, your information retrieval won’t fetch what you want it to, your text mining will have wide error bars, and your free-speech understanders will come back with a whole lot of “I’m sorry; I didn’t understand that.”
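To make the two lightweight options above concrete, here is a minimal sketch of a synonym thesaurus used for query expansion and of "concepts" defined as vectors of word likelihoods used for disambiguation. The thesaurus entries, concept names, weights, and function names are all made up for illustration; no vendor's product is claimed to work this way.

```python
from collections import Counter

# Hypothetical thesaurus: each term maps to a set of synonyms.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "defect": {"fault", "flaw"},
}

# Hypothetical concepts, each defined by a simple vector of word likelihoods.
CONCEPTS = {
    "jaguar_animal": {"jaguar": 0.4, "jungle": 0.3, "prey": 0.3},
    "jaguar_car": {"jaguar": 0.4, "engine": 0.3, "dealer": 0.3},
}


def expand_query(terms):
    """Return the query terms plus any synonyms the thesaurus lists for them."""
    expanded = set(terms)
    for term in terms:
        expanded |= THESAURUS.get(term, set())
    return expanded


def best_concept(text):
    """Pick the concept whose word likelihoods best match the words in the text."""
    counts = Counter(text.lower().split())
    scores = {
        name: sum(weight * counts[word] for word, weight in vector.items())
        for name, vector in CONCEPTS.items()
    }
    return max(scores, key=scores.get)
```

With these toy definitions, `expand_query(["car"])` also matches documents that say "automobile" or "vehicle", and `best_concept` uses the surrounding words to decide which sense of "jaguar" a passage is about. A real system would need far richer data, which is exactly the ontology management problem.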

This isn’t just my opinion. Look at Inquira. Look at text mining products from SPSS and many others. Look at Oracle’s original text indexing technology and also at its Triplehop acquisition. For that matter, look at Sybase’s AnswersAnywhere, in which the concept network is really just an object model, in the full running-application sense of “object.” Comparing text to some sort of thesaurus or concept representation is central to enterprise text technology applications (and increasingly to web search as well).

Could one “ontology management system,” whatever that is, service multiple types of text applications? Of course it could. The ideal ontology would consist mainly of four aspects:

1. A conceptual part that’s language-independent.
2. A general language-dependent part.
3. A sensitivity to different kinds of text – language is used differently when spoken, for instance, than it is in edited newspaper articles.
4. An enterprise-specific part. For example, a company has product names, and competitors with product names, and those names have abbreviations, and so on.

Relatively little of that is application-specific; for any given enterprise, a single ontology should meet most or all of its application needs.
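One rough way to picture those four aspects is as layers of a single lookup structure that any text application could share. The sketch below is only an illustration under that assumption; the class, field names, lookup order, and sample entries (including the product name "x100") are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # 1. Language-independent part: concept id -> parent concept id (or None).
    concepts: dict = field(default_factory=dict)
    # 2. General language-dependent part: (language, term) -> concept id.
    lexicon: dict = field(default_factory=dict)
    # 3. Kind-of-text part: (kind, term) -> concept id, e.g. spoken usage.
    register: dict = field(default_factory=dict)
    # 4. Enterprise-specific part: term -> concept id (product names, etc.).
    enterprise: dict = field(default_factory=dict)

    def resolve(self, term, language="en", kind=None):
        """Map a term to a concept id, most specific layer first."""
        term = term.lower()
        if term in self.enterprise:
            return self.enterprise[term]
        if kind is not None and (kind, term) in self.register:
            return self.register[(kind, term)]
        return self.lexicon.get((language, term))


# Hypothetical sample data for illustration only.
onto = Ontology(
    concepts={"vehicle": None, "product:x100": "vehicle"},
    lexicon={("en", "car"): "vehicle", ("de", "auto"): "vehicle"},
    register={("spoken", "ride"): "vehicle"},
    enterprise={"x100": "product:x100"},
)
```

Search, text mining, and speech applications could all call the same `resolve` method, which is the sense in which little of the ontology needs to be application-specific.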

Coming up: The legitimate barriers to the creation of an ontology management system market, and ideas about how to overcome them.

December 9, 2005

The text technologies market 2: It’s actually in disarray

The text technologies market should be huge and thriving. Actually, however, it’s in disarray. Multiple generations of enterprise search vendors have floundered, with the Autonomy/Verity merger being basically a combination of the weak. The RDBMS vendors came up with decent hybrid tabular/text offerings, and almost nobody cared. (Admittedly, part of the reason for that is that the best offering was Oracle’s, and Oracle almost always screws up its ancillary businesses.) Email searchability has been ridiculously bad since, well, since the invention of email. And speech technology has floundered for decades, with most of the survivors now rolled into the new version of Nuance.

Commercial text mining is indeed booming, but not to an extent that erases the overall picture of gloom. It’s at most a several hundred million dollar business, and one that’s highly fragmented. For example, at a conference on IT in life sciences not that long ago, two things became evident. First, the text mining companies were making huge, intellectually fascinating, life-saving contributions to medical research. Second, more than ten vendors were divvying up what was only around a $10 million market.

If text technology is going to achieve the prominence and prosperity it deserves, something dramatic has to change.

November 4, 2005

Autonomy + Verity — so what?

On some levels, the Autonomy/Verity merger makes total sense. The text search industry now has an unquestionably dominant vendor of shelfware. Somewhat less snarkily, I could say that it has a dominant OEM vendor of search technology. And while Verity’s management team has never recovered from the dizzying cycles of turnover in the 1990s, Autonomy’s obviously was quite effective. However, I see no obvious reason to believe that the combined company will actually ship good products, or ones that lead to fundamentally greater adoption for enterprise search than the fairly marginal role it plays today.

Verity and Autonomy represent different philosophies of text search — Boolean vs. concept-based, basically. Neither works very well on its own, whether in the enterprise or on the web, with concept-based being the weaker of the two. That’s why Altavista et al. failed, and Excite failed yet more completely. It’s why Verity’s text search is generally more respected, and has more hardcore users, than Autonomy’s. (Being a vastly older company than Autonomy helps a lot too, of course.)

I hope that the merged company will soon introduce some new and/or synthesized approaches to search, significantly improving the overall quality of available products. If anybody has the resources and motivation, it will. The recent boom in text data mining, and the general increase of seriousness about ontologies, at least raises the possibility that concept-oriented search will evolve into something significantly useful. But I’m not holding my breath.
