Eric Lai wrote in this week’s Computerworld about “Why is enterprise search harder than Google Web search?” Highlights included:
- He described enterprise search as consisting mainly of a search box plus faceted searching, with maybe some automated tagging as well.
- He observed that off-page factors such as PageRank don’t work nearly as well in an enterprise as they do on the Web, and that manual tagging by enterprise users falls far short of closing the gap.
- He stumbled a bit when comparing and contrasting search engines with “structured” DBMS.
- He basically endorsed the worldview of Ali Riaz, late of FAST, now of Attivio.
On the whole, that’s not bad. If this were an easy subject to write about, I’d have explained it a lot more clearly in the past myself. OK. Let me get off my duff and give it a whirl now.
Actually, when writing, I generally stay on my duff. At least, that’s true if I’m guessing correctly what a “duff” is. And this is not just a vaguely humorous digression — it’s also an example of why information retrieval is so hard if you only have the text itself to go by.
With that said, here are some notes on web search, enterprise search, single-site search, and database management.
- Web search has a huge problem that enterprise search doesn’t — adversarial information retrieval. Enterprise document creators generally don’t try to game search results.
- Single-site search sometimes works very well, and sometimes works laughably badly. (I often use regular Google after a site-search engine offers up a long list of bug reports and minor upgrade notes, instead of product overviews.) For an egregious example see Oracle.com, where search seems to have gotten even worse than it was a couple of years ago.
- The difference in almost every case, I think, is whether or not the site owner has done a good job of manual tagging. That would explain why switching search engines can make a site worse if you don’t rebuild the tags; I suspect this is what happened in the Oracle case, after some acquisition.
- Full structured search technology can be the difference between “pretty good search” and an e-commerce site that really rocks, but it doesn’t work at all without a lot of manual tagging.
- When you have a revenue-generating e-commerce site, it’s easy to justify the work of manual tagging. When you have a marketing site, it can make sense. But inside an enterprise, the tagging isn’t going to happen much.
- Documents in an enterprise lie in a broad range of disparate corpuses. There are spreadsheets, PowerPoint presentations, structured-field database entries, free-text-field database entries, email, instant messages, and so on. And there also are many different corpuses of traditional text documents. A large-enterprise search engine needs tools to dial up or dial down the relevance of different corpuses, both in general and for a specific search.
- Each corpus may also have its own kinds of metadata that are helpful in ranking and summarizing search results.
- As Guy Creese already pointed out in a comment to Eric’s article, security comes into play in enterprise search. I think of that as different users seeing different corpuses.
- Link analysis doesn’t work inside enterprises. Indeed, most of the documents aren’t in HTML and don’t link to each other.
- In web and enterprise search alike, you’re often satisfied by a page that doesn’t help you directly, if it points you at a better resource. But the details of the two cases differ. In enterprises, the better resource is usually a person — i.e., a document author — that you can contact directly. (This is pretty much the main part of knowledge management that actually works.)
- I wrote two years ago about a key missing ingredient to enterprise search technology — an ontology management system. Unfortunately, I’m not aware of significant progress in that direction subsequently.
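To make the points about corpus weighting and security a bit more concrete, here is a minimal sketch in Python. It is not how any particular product works. The `Hit` type, the `rank` function, the weight values, and the corpus names are all hypothetical, invented purely for illustration. The idea is just that each corpus gets a default weight, a specific search can boost or dampen a corpus further, and a user only ever sees hits from corpuses that user is allowed to see.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    corpus: str        # e.g. "email", "presentations" (hypothetical names)
    text_score: float  # raw relevance score from the text engine


def rank(hits, corpus_weights, visible_corpuses, query_boosts=None):
    """Rank hits using per-corpus weights and a per-user visibility filter.

    corpus_weights   -- default dial for each corpus (missing means 1.0)
    visible_corpuses -- set of corpuses this user may see (security)
    query_boosts     -- optional extra dial for this one search
    """
    query_boosts = query_boosts or {}
    scored = []
    for hit in hits:
        # Security: drop anything from a corpus the user cannot see.
        if hit.corpus not in visible_corpuses:
            continue
        weight = corpus_weights.get(hit.corpus, 1.0)
        weight *= query_boosts.get(hit.corpus, 1.0)
        scored.append((hit.text_score * weight, hit))
    # Highest adjusted score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in scored]


hits = [
    Hit("memo-1", "email", 0.9),
    Hit("deck-7", "presentations", 0.6),
    Hit("hr-3", "hr_records", 0.95),
]
weights = {"email": 0.5, "presentations": 1.0}
allowed = {"email", "presentations"}

default_order = [h.doc_id for h in rank(hits, weights, allowed)]
boosted_order = [
    h.doc_id for h in rank(hits, weights, allowed, query_boosts={"email": 3.0})
]
```

In this toy setup email is dialed down in general, so the presentation outranks the memo by default, while a search that boosts email reverses the order. The HR record never appears for this user, whatever its score.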
Finally, I’d like to lay out a few points about the integration of text search and database management.
- Text search became an option in object/relational database management systems a little over a decade ago. The basic paradigm is that a document is stored in a single field, and a full-text index on it is integrated into the RDBMS. SQL syntax was extended to include text operators. Oracle, IBM, and Informix all did this cleanly; Microsoft found a workaround. Although all these vendors were very disappointing in the quality and performance of their search, these engines get a lot of use in application areas where it’s obviously beneficial to integrate search and relational queries.
- Search engines — both standalone and DBMS-integrated — have in recent years gained the capability to query any kind of database field. Why not? They typically can handle 1-200+ other document types as well. Anyhow, business intelligence based on that capability is part of the FAST and now Attivio stories.
- The fundamental technology of full-text indexing is similar to a number of things that go on in the relational/tabular/structured worlds, such as ADABAS’s inverted lists, or anything in the bitmapped and columnar areas. As just one example, SAP’s columnar BI Accelerator is directly built on TREX technology.
- Mark Logic makes a good case that if you’re going to integrate search and DBMS, XML may be the way to go rather than relational.
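For readers who haven't looked under the hood, the inverted-list idea behind full-text indexing can be sketched in a few lines of Python. This is a toy under simplifying assumptions (whitespace tokenization, lowercase matching, AND semantics, no ranking), not a description of any vendor's engine. The function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One posting list per term; a term that was never indexed matches nothing.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "quarterly sales report",
    2: "sales forecast for europe",
    3: "engineering report",
}
index = build_inverted_index(docs)
matches = search(index, "sales report")
```

Each entry in `index` is a list of row or document identifiers keyed by a value, which is why the structure looks so familiar to anybody who has worked with inverted lists or bitmap indexes on the structured side.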