October 22, 2006

Enterprise-specific web search: High-end web search/mining appliances?

OK. I have a vision of one way search could evolve, which I think deserves consideration on at least a “concept-car” basis. This is all speculative; I haven’t discussed it at length with the vendors who’d need to make it happen, nor checked the technical assumptions carefully myself. So I could well be wrong. Indeed, I’ve at least half-changed my mind multiple times this weekend, just in the drafting of this post. Oh yeah, I’m also mixing several subjects together here. All in all, this is not my crispest post …

Anyhow, the core idea is that large enterprises spider and index a subset of the Web, and use that for most of their employees’ web search needs. Key benefits would include:

FAST, Convera, Google, and Microsoft all have the potential to introduce such a product package, although I’m not aware of any specific initiatives that exactly match what I have in mind. The closest may be Convera, which is providing a standard vertical-market-specific sub-Web designed for its government intelligence/law-enforcement customers. (I’ve forgotten whether this is on-site or on a SaaS basis.)

Of course, the idea has drawbacks, but I don’t see any of them as killers. Possible pitfalls include:

One other thing about this hypothetical product – it might well be in appliance format. (Or it might be SaaS, just as general Web search currently is. Indeed, in some ways that’s a more likely possibility.) Here are some thoughts on both sides of the appliance issue:

1. There’s a clear technology trend at the high end of the relational data warehousing world: Massively parallel (MPP) “shared-nothing” systems are winning over the symmetric multi-processing (SMP) “shared-everything” systems that dominate the OLTP RDBMS world. (I’ve written about that at length over on DBMS2, e.g. in this post.)

Text search for large corpuses also seems to work best on MPP shared-nothing systems. This is famously the Google architecture, and Inktomi’s (i.e., Yahoo’s) as well. It’s also the FAST approach.

Doing text in MPP makes theoretical sense too; just partition the documents by node, search each node, and aggregate the results. Indeed, text search is an even clearer fit for MPP than relational query, since you can ALWAYS use a local index, with none of those pesky relational joins. At least, that’s true of the querying part. In actual spidering, you do have the problem of shipping URLs from one node to another; the same might be true of link information and other search algorithm inputs.
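To make the "partition, search each node, aggregate" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how Google, Inktomi, or FAST actually do it: the `Node` and `Cluster` classes, the CRC32-based partitioning, and the raw term-frequency scoring are all hypothetical choices made for brevity.

```python
import heapq
import zlib
from collections import Counter, defaultdict


class Node:
    """One shared-nothing node holding its own documents and local index."""

    def __init__(self):
        self.index = defaultdict(Counter)  # term -> {doc_id: term frequency}

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.index[term][doc_id] += 1

    def search(self, query, k):
        # Score using only the local index; nothing is fetched from other nodes.
        scores = Counter()
        for term in query.lower().split():
            for doc_id, tf in self.index.get(term, {}).items():
                scores[doc_id] += tf
        return heapq.nlargest(k, ((s, d) for d, s in scores.items()))


class Cluster:
    def __init__(self, n_nodes):
        self.nodes = [Node() for _ in range(n_nodes)]

    def add(self, doc_id, text):
        # Partition documents by node (assumed scheme: CRC32 of the document ID).
        node_no = zlib.crc32(doc_id.encode("utf-8")) % len(self.nodes)
        self.nodes[node_no].add(doc_id, text)

    def search(self, query, k=3):
        # Scatter the query to every node, then merge the local top-k lists.
        partials = [hit for node in self.nodes for hit in node.search(query, k)]
        return heapq.nlargest(k, partials)
```

Because each node returns its own top `k` hits, merging those lists yields the same top `k` as a single big index would. One simplification to keep in mind: the scoring here needs no collection-wide statistics, whereas measures such as inverse document frequency would require a little cross-node information.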

2. The relational world gives very mixed signals as to how much these MPP/shared-nothing products benefit from specialized hardware. There’s certainly a consensus that customers at least want preconfigured hardware. But beyond that it gets confusing. Teradata has specialized networking. IBM is going to preconfigured generic hardware, and also mumbles about a specialized data filtering chip. Netezza has a very custom system based on FPGAs. DATAllegro loves its custom InfiniBand networking add-ons, but mumbles about going to standard hardware. Greenplum and Kognitio are on standard hardware, although Greenplum focuses on a quasi-appliance through Sun. (If my memory is correct, Greenplum actually boasts at least one text-indexing customer, namely O’Reilly.)

Of that group, Kognitio (formerly White Cross) actually has the software most analogous to text indexing, in that they rely on compressed bitmaps for their data access, vs. the hashes and table scans of the other contenders. So their experience might be the most relevant here. Kognitio started out as a custom hardware vendor, but now runs on utterly standard blades. That said, they’re getting interest from their customers in having them somehow prepackage standard hardware, much as IBM is with its BCUs (Base Configuration Units).
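For readers wondering why bitmaps resemble text indexing, here is a rough Python sketch of the analogy. It says nothing about Kognitio's actual implementation: the function names are hypothetical, Python integers stand in for the bitmaps, and compression is left out entirely.

```python
from collections import defaultdict


def build_bitmaps(docs):
    """Map each term to an int whose bit i is set when docs[i] contains it."""
    bitmaps = defaultdict(int)
    for i, text in enumerate(docs):
        for term in set(text.lower().split()):
            bitmaps[term] |= 1 << i
    return bitmaps


def match_all(bitmaps, terms, n_docs):
    """Return positions of documents containing every term (bitwise AND)."""
    result = (1 << n_docs) - 1  # start with all documents selected
    for term in terms:
        result &= bitmaps.get(term, 0)  # an unknown term matches nothing
    return [i for i in range(n_docs) if (result >> i) & 1]
```

A term's bitmap plays the same role as a posting list in an inverted index, and a multi-term query becomes a bitwise AND, much as a multi-column filter does in a bitmap-based warehouse.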

And on the theoretical side: If you look at the specific problems solved by proprietary data warehousing hardware, such as speedy moving around of intermediate query results, they don’t seem to have strong analogues in the text search case.

3. But that’s just actual querying. There’s also the matter of spidering/indexing. And that has strong aspects of stream processing, data communications, maybe security, etc. – and those kinds of things do often call for appliances, and sometimes even special-purpose chips (more precisely, specially-programmed FPGAs).

So there you have it. The preponderance of evidence suggests appliances are the way to go, and also that the market would welcome an appliance package. But it’s hardly conclusive.

As for the related question of whether purely in-house enterprise search should be done on appliances – well, the same considerations apply. There’s good market receptivity to low-end search appliances and high-end data warehousing appliances, so there’s at least a plausibility argument that high-end enterprise search should be done on appliances as well. But it’s not an open-and-shut case.

Comments

One Response to “Enterprise-specific web search: High-end web search/mining appliances?”

  1. Richard L. Brandt on October 25th, 2006 4:08 pm

    Interesting argument. I do not see why it should not happen. Google, and I assume others, already has an enterprise search appliance, although it’s currently designed only to search through the enterprise’s own data. http://www.google.com/enterprise/gsa/onebox.html

    But how hard would it be to expand that appliance to do filtered searching on the Internet as well? There would be big advantages for companies that do not want employees to waste time on non-work searches.
