I just stumbled across a brilliant summary of evolution in text search technology, written four years ago. It’s equally valid today (which in itself says something). I found it on the Prism Legal blog, but the actual author is Sharon Flank. My own comments are interspersed in bold.
“There are several underlying important developments over the last decade or so:
- Incorporating user feedback to refine search results, usually indirectly rather than explicitly, making results better through machine learning. [Amazon.com is the most-often cited example of this with it’s “if you like A, you’ll also like B.”] [CAM] Technically, that’s not a search example, but the general point is correct even so.
- Assessments based on usage or referral. This is what makes Google so useful and popular. This approach gives higher rankings if other web sites point to a target or if that target gets a lot of hits.
- Various approaches to using taxonomies. The better applications use taxonomies as a navigation guide but don’t force it or require administrators to implement it. Vivisimo.com is an example of interesting, automated clustering approach. [CAM] “Faceted search” seems to be the buzzword here. It’s a big part of what I call “structured search.” But taxonomy use is probably more trivial in search than it is in, say, text mining.
- Better handling of phrases. Google automatically parses phrases and deals with search terms as phrases. This now seems natural but in the AltaVista days, you couldn’t tell a Venetian blind from a blind Venetian [example courtesy of Prof. George Miller, Princeton Univ. – too good not to cite].
- Context-sensitive search is now an emerging trend. Systems track what users have previously searched for and infer interest in the same domain to refine search result. So if you look for “line” and a system knows you’ve just looked for “tacklebox,” then it infers you mean “fishing line.” Or if you search for bagels and the system knows you are in 20009, it tells you that you can buy them at Comet Liquors (which happens to sell bagels). [CAM] That happens a lot with ad serving. But I’m not convinced it hit actual search until Google’s personal search kicked off, and that was quite recent.
“More generally in natural language processing, the statistical and linguistic approaches are converging in a new way: use massive amounts of data (i.e. the Web) to get statistical answers to deep linguistic questions, like “How do we figure out what the most likely referent is for the pronoun ‘they’?” Or “How do we determine the correct sense for ambiguous words?” These things aren’t in search engines yet, but you can expect to see more “intelligent” features coming out of this approach.
“Looking at this list, you can see that the conceptual changes (breakthroughs?), with the exception of better phrase handling, are primarily focused around Web searches. When dealing with one-of-a-kind document collections behind the corporate firewall, many of these developments turn out not to add much to older approaches. So, at least for enterprise search, I too remain partial to some of the older products you mention, though I am disappointed that most of the old-time vendors have not updated their approaches beyond adding taxonomy support.” [CAM] Yep, web search and enterprise search are very different things.
The original blog post did have one error — Sharon’s PhD isn’t in Computational Linguistics, but rather Slavic Linguistics, as I recently noted in my post about text analytics careers for humanities majors.