OK. I have a vision of one way search could evolve, which I think deserves consideration on at least a “concept-car” basis. This is all speculative; I haven’t discussed it at length with the vendors who’d need to make it happen, nor checked the technical assumptions carefully myself. So I could well be wrong. Indeed, I’ve at least half-changed my mind multiple times this weekend, just in the drafting of this post. Oh yeah, I’m also mixing several subjects together here. All in all, this is not my crispest post …
Anyhow, the core idea is that large enterprises spider and index a subset of the Web, and use that for most of their employees’ web search needs. Key benefits would include:
- Filtering out spam hits. This is obviously important for search, and in some cases could help with public-web text mining as well. It should be OK to be more aggressive on spam-site filtering in an enterprise-specific index than it is in general web search.
- Filtering out malicious/undesirable downloads of various sorts. I’m thinking mainly of malware/spyware here, but of course it could also be used for netnanny-style porn prevention and the like. Again, this is more easily done for the enterprise market than for the search world at large. (In any case, I think Google could blow Websense out of the water any time they wanted to – except, of course, for the not-so-small matter of not being seen as participating in the censorship business – but that’s a separate discussion.)
- Capturing employees’ search strings. This could be useful for various purposes, including discerning their interests, and building the corporate ontology for internal web search.
- Freshness control. If there’s a site you really care about, you can make sure it’s re-indexed frequently.
FAST, Convera, Google, and Microsoft all have the potential to introduce such a product package, although I’m not aware of any specific initiatives that exactly match what I have in mind. The closest may be Convera, which is providing a standard vertical-market-specific sub-Web designed for its government intelligence/law-enforcement customers. (I’ve forgotten whether this is on-site or on a SaaS basis.)
Of course, the idea has drawbacks, but I don’t see any of them as killers. Possible pitfalls include:
- Copyright issues because indexing = copying? No more than for Google itself.
- Reluctance to let a competitor spider you? That would be only a few pages at most.
- Lack of full Web coverage? There are always public search engines as a fallback. Indeed, it might be interesting to show results from sanitized search and public search side by side, which would also help with credibility and adoption.
One other thing about this hypothetical product – it might well be in appliance format. (Or it might be SaaS, just as general Web search currently is. Indeed, in some ways that’s a more likely possibility.) Here are some thoughts on both sides of the appliance issue:
1. There’s a clear technology trend at the high end of the relational data warehousing world: Massively parallel (MPP) “shared-nothing” systems are winning over the symmetric multi-processing “shared-everything” systems that dominate the OLTP RDBMS world. (I’ve written about that at length over on DBMS2, e.g. in this post.)
Text search for large corpuses also seems to work best on MPP shared-nothing systems. This is famously the Google architecture, and Inktomi’s (i.e., Yahoo’s) as well. It’s also the FAST approach.
Doing text in MPP makes theoretical sense too; just partition the documents by node, search each node, and aggregate the results. Indeed, text search is an even clearer fit for MPP than relational query, since you can ALWAYS use a local index, with none of those pesky relational joins. At least, that’s true of the querying part. In actual spidering, you do have the problem of shipping URLs from one node to another; the same might be true of link information and other search algorithm inputs.
2. The relational world gives very mixed signals as to how much these MPP/shared-nothing products benefit from specialized hardware. There’s certainly a consensus that customers at least want preconfigured hardware. But beyond that it gets confusing. Teradata has specialized networking. IBM is going to preconfigured generic hardware, and also mumbles about a specialized data filtering chip. Netezza has a very custom system based on FPGAs. DATAllegro loves its custom InfiniBand networking add-ons, but mumbles about going to standard hardware. Greenplum and Kognitio are on standard hardware, although Greenplum focuses on a quasi-appliance through Sun. (If my memory is correct, Greenplum actually boasts at least one text-indexing customer, namely O’Reilly.)
Of that group, Kognitio (formerly White Cross) actually has the software most analogous to text indexing, in that they rely on compressed bitmaps for their data access, vs. the hashes and table scans of the other contenders. So their experience might be the most relevant here. Kognitio started out as a custom hardware vendor, but now runs on utterly standard blades. That said, their customers are showing interest in having them somehow prepackage standard hardware, much as IBM does with its BCUs (Base Configuration Units).
And on the theoretical side: If you look at the specific problems solved by proprietary data warehousing hardware, such as speedy moving around of intermediate query results, they don’t seem to have strong analogues in the text search case.
3. But that’s just actual querying. There’s also the matter of spidering/indexing. And that has strong aspects of stream processing, data communications, maybe security, etc. – and those kinds of things do often call for appliances, and sometimes even special-purpose chips (more precisely, specially-programmed FPGAs).
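To make the querying half of point 1 a bit more concrete, here is a minimal Python sketch of “partition the documents by node, search each node, and aggregate the results.” Everything in it is my own illustration rather than any vendor’s actual design: the `Shard` and `Cluster` classes are hypothetical, each “node” is just an in-memory object, documents are assigned to nodes by a CRC32 hash of their ID, and the score is a raw term-frequency count chosen only because it needs no cross-node information.

```python
import heapq
import zlib
from collections import Counter, defaultdict


class Shard:
    """One hypothetical node: its own documents and a purely local index."""

    def __init__(self):
        # term -> Counter mapping doc_id to term frequency
        self.index = defaultdict(Counter)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.index[term][doc_id] += 1

    def search(self, query, k):
        # Score using only this shard's index; no other node is consulted.
        scores = Counter()
        for term in query.lower().split():
            for doc_id, tf in self.index.get(term, {}).items():
                scores[doc_id] += tf
        # Local top k as (-score, doc_id), so ties break on doc_id.
        return heapq.nsmallest(k, ((-s, d) for d, s in scores.items()))


class Cluster:
    """Scatter a query to every shard, then merge the partial top-k lists."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # Deterministic partitioning: each document lives on exactly one shard.
        n = zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)
        self.shards[n].add(doc_id, text)

    def search(self, query, k=3):
        partials = [hit for shard in self.shards
                    for hit in shard.search(query, k)]
        # The global top k is always contained in the union of local top k's.
        return [(doc_id, -neg) for neg, doc_id in heapq.nsmallest(k, partials)]


cluster = Cluster(num_shards=3)
for doc_id, text in {
    "a": "enterprise search appliance",
    "b": "search search engine",
    "c": "data warehouse appliance",
    "d": "web search spam",
}.items():
    cluster.add(doc_id, text)

print(cluster.search("search appliance"))
```

The only data that crosses node boundaries here is the query going out and a short list of hits coming back, which is the “no pesky joins” property. The sketch sidesteps the caveat from point 1, though: a score that depends on corpus-wide inputs (link information, say) would need some of that data shipped between nodes.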
So there you have it. The preponderance of evidence suggests appliances are the way to go, and also that the market would welcome an appliance package. But it’s hardly conclusive.
As for the related question of whether purely in-house enterprise search should be done on appliances – well, the same considerations apply. There’s good market receptivity to low-end search appliances and high-end data warehousing appliances, so there’s at least a plausibility argument that high-end enterprise search should be done on appliances as well. But it’s not an open-and-shut case.