Convera – Text Technologies

Government-specific search fails to impress

Curt Monash — Wed, 31 Jan 2007 18:31:33 +0000

According to Steven Arnold, FirstGov – which has been renamed USASearch.gov — is by far the most effective US government-specific search engine. But there’s something odd about it; whatever the query, it’s determined to give no more than a little over 100 results. Queries for which I’ve noted results in this quantity range include Bush (and this covers all family members), Cheney (ditto), Kennedy (ditto), Condaleeza, Scalia, Coolidge, Red Sox, big dig, Burlingame, Redmond, Pluto, ethanol, spotted owl, and topology. The only ones I’ve found so far coming out above that results range – perhaps inevitably — are death (137) and taxes (177).

Only when I forced the issue with really narrow terms did I get fewer results – e.g., drosophila (95), cichlid (82), Haluk Ozkaynak (Linda’s good-guy ex-husband – 80), or cohomology (53). Wait. Let me amend that; Under Secretary of State Paula Dobriansky only rates 43 hits, fewer than our dorm mate Patricia Buckles (105).*

*Come to think of it, the last time I saw either of those ladies was on a visit to Washington in 1981, when I stayed at Patty’s apartment and got an after-hours White House tour from Paula. I regret falling out of touch with both of them, especially Patty, who was a dear friend. But I digress …

Whatever the peculiarity in its level of recall, USASearch at least seems to do a good job at giving relevant results first (e.g., official bio pages of people). The same can’t be said of Convera’s Govmine, even when a search is restricted to .gov pages only. That site still needs a lot of work.

Enterprise-specific web search: High-end web search/mining appliances?

Curt Monash — Mon, 23 Oct 2006 00:27:59 +0000

OK. I have a vision of one way search could evolve, which I think deserves consideration on at least a “concept-car” basis. This is all speculative; I haven’t discussed it at length with the vendors who’d need to make it happen, nor checked the technical assumptions carefully myself. So I could well be wrong. Indeed, I’ve at least half-changed my mind multiple times this weekend, just in the drafting of this post. Oh yeah, I’m also mixing several subjects together here too. All-in-all, this is not my crispest post …

Anyhow, the core idea is that large enterprises spider and index a subset of the Web, and use that for most of their employees’ web search needs. Key benefits would include:

Filtering out spam hits. This is obviously important for search, and in some cases could help with public-web text mining as well. It should be OK to be more aggressive on spam-site filtering in an enterprise-specific index than it is in general web search.
Filtering out malicious/undesirable downloads of various sorts. I’m thinking mainly of malware/spyware here, but of course it can also be used for netnannying porn-prevention and the like as well. Again, this is more easily done for the enterprise market than for the search world at large. (I anyway think that Google could blow Websense out of the water any time they wanted to – except, of course, for the not-so-small matter of not being seen as participating in the censorship business — but that’s a separate discussion.)
Capturing employees’ search strings. This could be useful for various purposes, including discerning their interests, and building the corporate ontology for internal web search.
Freshness control. If there’s a site you really care about, you can make sure it’s re-indexed frequently.
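
To make the spam-filtering and freshness-control ideas above a little more concrete, here is a minimal sketch of what a per-enterprise crawl policy might look like. Everything in it is an assumption for illustration: the `CrawlPolicy` class, the `due_for_reindex` helper, the host names, and the refresh intervals are hypothetical, not any vendor's actual product.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlPolicy:
    """Hypothetical enterprise-specific crawl policy (illustrative only)."""
    # Hosts the enterprise never wants indexed (spam, malware, etc.).
    blocked_hosts: set = field(default_factory=set)
    # Per-host re-index interval in seconds, for sites the enterprise cares about.
    refresh_seconds: dict = field(default_factory=dict)
    # Fallback interval for every other host: one week.
    default_refresh: int = 7 * 24 * 3600

    def allows(self, url):
        """True unless the URL's host is on the blocklist."""
        return urlparse(url).hostname not in self.blocked_hosts

    def is_stale(self, url, last_indexed, now):
        """True if the page was last indexed at least one interval ago."""
        host = urlparse(url).hostname
        interval = self.refresh_seconds.get(host, self.default_refresh)
        return now - last_indexed >= interval


def due_for_reindex(policy, last_indexed_by_url, now):
    """Return the allowed URLs whose index entries are stale, sorted."""
    return sorted(
        url
        for url, last_indexed in last_indexed_by_url.items()
        if policy.allows(url) and policy.is_stale(url, last_indexed, now)
    )
```

An enterprise could afford a much longer blocklist than a public search engine, since a false positive only costs its own employees a fallback trip to public search.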

FAST, Convera, Google, and Microsoft all have the potential to introduce such a product package, although I’m not aware of any specific initiatives that exactly match what I have in mind. The closest may be Convera, which is providing a standard vertical-market-specific sub-Web designed for its government intelligence/law-enforcement customers. (I’ve forgotten whether this is on-site or on a SaaS basis.)

Of course, the idea has drawbacks, but I don’t see any of them as killers. Possible pitfalls include:

Copyright issues because indexing = copying? No more than for Google itself.
Reluctance to let a competitor spider you? That would be only for a few pages at most.
Lack of full Web coverage? There always are public search engines as a fallback. Indeed, it might be interesting to show results from sanitized search and public search side by side, something which also would help with credibility and adoption.

One other thing about this hypothetical product – it might well be in appliance format. (Or it might be SaaS, just as general Web search currently is. Indeed, in some ways that’s a more likely possibility.) Here are some thoughts on both sides of the appliance issue:

1. There’s a clear technology trend at the high end of the relational data warehousing world: Massively parallel (MPP) “shared-nothing” systems are winning over the symmetric multi-processing “shared-everything” systems that dominate the OLTP RDBMS world. (I’ve written about that at length over on DBMS2, e.g. in this post.)

Text search for large corpuses also seems to work best on MPP shared-nothing systems. This is famously the Google architecture, and Inktomi’s (i.e., Yahoo’s) as well. It’s also the FAST approach.

Doing text in MPP makes theoretical sense too; just partition the documents by node, search each node, and aggregate the results. Indeed, text search is an even clearer fit for MPP than relational query, since you can ALWAYS use a local index, with none of those pesky relational joins. At least, that’s true of the querying part. In actual spidering, you do have the problem of shipping URLs from one node to another; the same might be true of link information and other search algorithm inputs.
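
The partition/search/aggregate pattern just described can be sketched in a few lines. This is a toy under stated assumptions, not how Google, Inktomi, or FAST actually do it: the `Shard` and `Cluster` classes are hypothetical, documents are assigned to nodes by a CRC of their ID, and scoring is raw term frequency. That scoring choice matters, because it keeps each node's scores directly comparable; a ranking that uses corpus-wide statistics would need those statistics shared across nodes.

```python
import heapq
import zlib
from collections import Counter, defaultdict


class Shard:
    """One node: holds a local inverted index over its own documents only."""

    def __init__(self):
        # term -> Counter mapping doc_id to term frequency
        self.index = defaultdict(Counter)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.index[term][doc_id] += 1

    def search(self, query, k):
        """Return this node's top-k hits as (score, doc_id) pairs."""
        scores = Counter()
        for term in query.lower().split():
            for doc_id, tf in self.index.get(term, {}).items():
                scores[doc_id] += tf
        return heapq.nlargest(k, ((s, d) for d, s in scores.items()))


class Cluster:
    """Shared-nothing cluster: partition by document, scatter, then gather."""

    def __init__(self, n_shards):
        self.shards = [Shard() for _ in range(n_shards)]

    def add(self, doc_id, text):
        # Deterministic partitioning: each document lives on exactly one node.
        node = zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)
        self.shards[node].add(doc_id, text)

    def search(self, query, k=10):
        # Scatter: every node answers from its local index, no joins needed.
        partials = [hit for shard in self.shards for hit in shard.search(query, k)]
        # Gather: merge the per-node top-k lists into a global top-k.
        return [(doc_id, score) for score, doc_id in heapq.nlargest(k, partials)]
```

Because each document is scored entirely on the node that holds it, the merged result is the same whether the cluster has one node or many; only the per-node work shrinks.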

2. The relational world gives very mixed signals as to how much these MPP/shared-nothing products benefit from specialized hardware. There’s certainly a consensus that customers at least want preconfigured hardware. But beyond that it gets confusing. Teradata has specialized networking. IBM is going to preconfigured generic hardware, and also mumbles about a specialized data filtering chip. Netezza has a very custom system based on FPGAs. DATallegro loves its custom InfiniBand networking add-ons, but mumbles about going to standard hardware. Greenplum and Kognitio are on standard hardware, although Greenplum focuses on a quasi-appliance through Sun. (If my memory is correct, Greenplum actually boasts at least one text-indexing customer, namely O’Reilly.)

Of that group, Kognitio (formerly White Cross) actually has software the most analogous to text indexing, in that they rely on compressed bitmaps for their data access, vs. the hashes and table scans of the other contenders. So their experience might be the most relevant here. Kognitio started out as a custom hardware vendor, but now runs on utterly standard blades. That said, they’re getting interest from their customers in having them somehow prepackage standard hardware, much as IBM is with its BCUs (Base Configuration Units).
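
To show why bitmap-based data access resembles text indexing, here is a rough sketch of a bitmap index. It is an illustration only and says nothing about Kognitio's actual implementation: the `BitmapIndex` class is hypothetical, and it uses plain Python integers as uncompressed bitsets, where a real system would compress them.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy bitmap index: one bitmap per value, one bit per row (or document)."""

    def __init__(self):
        # value -> int used as a bitset; bit i is set if row i contains the value
        self.bitmaps = defaultdict(int)

    def add(self, row_id, values):
        for value in values:
            self.bitmaps[value] |= 1 << row_id

    def rows_with_all(self, values):
        """Return the row IDs containing every given value (bitwise AND)."""
        values = list(values)
        if not values:
            return []
        result = -1  # all bits set, so the first AND yields that bitmap
        for value in values:
            result &= self.bitmaps.get(value, 0)
        return [i for i in range(result.bit_length()) if (result >> i) & 1]
```

Swap "value" for "term" and "row" for "document" and this is a simple inverted index answering a conjunctive query, which is the sense in which the two problems are analogous.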

And on the theoretical side: If you look at the specific problems solved by proprietary data warehousing hardware, such as speedy moving around of intermediate query results, they don’t seem to have strong analogues in the text search case.

3. But that’s just actual querying. There’s also the matter of spidering/indexing. And that has strong aspects of stream processing, data communications, maybe security, etc. – and those kinds of things do often call for appliances, and sometimes even special-purpose chips (more precisely, specially-programmed FPGAs).

So there you have it. The preponderance of evidence suggests appliances are the way to go, and also that the market would welcome an appliance package. But it’s hardly conclusive.

As for the related question of whether purely in-house enterprise search should be done on appliances – well, the same considerations apply. There’s good market receptivity to low-end search appliances and high-end data warehousing appliances, so there’s at least a plausibility argument that high-end enterprise search should be done on appliances as well. But it’s not an open-and-shut case.

Analyst reports about enterprise search

Curt Monash — Sat, 29 Jul 2006 12:16:33 +0000

Gartner and Forrester have high opinions of FAST. Not coincidentally, you can download both those firms’ recent search industry survey reports from almost any page of www.fastsearch.com. Of the two, Forrester’s is both better and more recent.

Summarizing brutally, the big firms’ consensus seems to be:

FAST and Autonomy are the clear leaders.
Endeca has great technology and is coming on strong.
Everybody else is a niche player, at least for now.
Convera is in deep yogurt.

Forrester is particularly harsh on Convera. Presumably this has much to do with the fact that Convera did not cooperate well with the survey process. I shall not speculate as to which way the causality runs there – but I should note that Convera was quite cooperative with my research last week.

Web search and enterprise search are coming together

Curt Monash — Sat, 29 Jul 2006 12:13:12 +0000

Web search and enterprise search are in many ways fundamentally different problems. The biggest problem in web search is screening out pages that deliberately pretend to be relevant to a search. The second biggest problem is picking out the crème de la crème from a long list of essentially good hits. In enterprise search, on the other hand, the biggest problem is finding a single document, or single fact, that is lonely at best, and if you’re unlucky doesn’t exist in the corpus at all. Document structures are also completely different, as are linking structures and almost every other input to the ranking algorithms except the raw words themselves.

Even so, the businesses and technologies of web and enterprise search are beginning to combine. Google’s attack on the low end of enterprise search is well-known, of course, as are Microsoft’s increasingly well-publicized ambitions. But enterprise search companies are also reaching out to the Web. Convera has gotten the most press for this strategy, offering focused web search to the same customers (mainly intelligence and law enforcement agencies) that bought its enterprise product RetrievalWare. This is a great fit for Convera, both in customers (a lot of what those agencies have done all along is filter news information) and technology (their key differentiator is their detailed taxonomies, and those can help in any kind of search).

But it’s not just Convera. FAST of course sold alltheweb.com, which is now owned by Yahoo, and FAST is barred by a non-compete agreement from getting back into the web search business. Even so, it is spidering and analyzing and perhaps filtering billions of web pages, and offering the results as a service to its enterprise customers. These customers then have a huge leg up in deciding which pages to spider themselves and index with FAST’s enterprise technology, and they have access to FAST’s metadata banks to help with the ranking of those pages once spidered. Clever!

I think that Autonomy is doing something along these lines too, but I’m devoid of any details.

EDIT: Actually, Convera later sold its search technology to FAST, and started OEMing FAST’s technology instead. Microsoft now inherits that relationship with its acquisition of FAST.

Convera aka Excalibur aka ConQuest

Curt Monash — Sat, 29 Jul 2006 12:10:48 +0000

Once upon a time, more than a decade before the founding of Autonomy, a New Mexico inventor had the idea for a generic pattern recognition tool. He implemented it on a PC add-in board that, if I recall correctly, plugged into the Apple II. This was the genesis of the company Excalibur Technologies.

The Excalibur operation eventually moved north of San Diego, CA. And the company acquired ConQuest, makers of RetrievalWare, one of the original government-focused text search companies. And Allen & Company became major backers (presumably before the acquisition, but I don’t actually recall). There was some excitement in the mid-1990s, when extensible RDBMS were coming out, and at least two of Informix, IBM, and Oracle (I forget which two) seemed to be introducing Excalibur-based extensions. That fizzled, however. Later there was a merger with an Intel image-retrieval operation, and a name change to Convera. That, it seems, was spectacularly unsuccessful, although I must admit that I wasn’t paying attention and hence missed, as it were, the spectacle.

Now the company offers RetrievalWare, augmented by some pattern-matching technology – e.g., what they think is a better form of fuzzy word tokenization, and some color/shape/texture image matching as well. They also have introduced a web search product. (This is confusingly called Excalibur, but they told me last week that a much-needed rebranding is underway.) Maybe this strategy will be the one that finally works out for them.