November 11, 2006

Text mining and search, joined at the hip

Most people in the text analytics market realize that text mining and search are somewhat related. But I don’t think they often stop to contemplate just how close the relationship is, could be, or someday probably will become. Here’s part of what I mean:

  1. Text mining powers search. The biggest text mining outfits in the world, possibly excepting the US intelligence community, are surely Google, Yahoo, and perhaps Microsoft.
  2. Search powers text mining. Restricting the corpus of documents to mine, even via a keyword search, makes tons of sense. That’s one of the good ideas in Attensity 4.
  3. Text mining and search are powered by the same underlying technologies. For starters, there’s all the tokenization, extraction, etc. that vendors in both areas license from Inxight and its competitors. Beyond that, I think there’s a future play in integrated taxonomy management that will rearrange the text analytics market landscape.
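To make point 2 concrete, here is a minimal Python sketch of search feeding text mining: a keyword search narrows the corpus, and a very simple mining step (term counting) runs only on the hits. Everything in it is an assumption for illustration, including the toy `DOCS` corpus, the `STOPWORDS` list, and the helper names `keyword_search` and `mine_terms`. It does not reflect how Attensity 4 or any other product actually works.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real system would pull documents from an index.
DOCS = [
    "The battery drains fast and the screen cracked.",
    "Great screen, but the battery died after a week.",
    "Shipping was slow; the box arrived damaged.",
    "Battery life is fine, shipping was quick.",
]

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "and", "a", "but", "was", "is", "after"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_search(docs, keyword):
    """Search step: keep only documents containing the keyword as a token."""
    kw = keyword.lower()
    return [d for d in docs if kw in tokenize(d)]


def mine_terms(docs, exclude=()):
    """Mining step: count non-stopword terms across the given documents."""
    counts = Counter()
    for d in docs:
        counts.update(
            t for t in tokenize(d) if t not in STOPWORDS and t not in exclude
        )
    return counts


# Restrict the corpus with a search, then mine only the restricted subset.
subset = keyword_search(DOCS, "battery")
top_terms = mine_terms(subset, exclude={"battery"})
print(top_terms.most_common(3))
```

The shared `tokenize` function is a small nod to point 3 as well: in this sketch the search step and the mining step sit on the same underlying tokenization.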

So who does “get it” about the search/text mining connection? The UIMA folks at IBM probably do. Inxight surely does. Attensity seemingly does, and so do most large search engine vendors (FAST and the public guys for sure; I’m not so certain about Autonomy and Convera). A small company whose CEO just called me yesterday does. I think I do.

But I’m not sure that the smaller text mining and search outfits, or the small text-oriented parts of large enterprise software vendors, have gotten the message at all yet …


3 Responses to “Text mining and search, joined at the hip”

  1. Patrick Herron on November 11th, 2006 3:52 pm

    You got me, that’s for sure.

    I understand that search and text mining are related and overlap. Part of what’s commonly called text mining, namely information extraction (IE), powers some advanced features of search engines. Sure enough. But search engines take keywordese input, do some sort of finding operation, and return relevant documents. That’s the purpose of information retrieval. The most promising, interesting, and innovative applications of text mining are not information extraction engines but rather systems that operate and return results at the level of synthesis, not at the document level or even the extraction level. Such applications take extracted elements and put them together in order to generate new information. They use inductive logic or some set of domain rules (taxonomic rules, if you like) to create information that did not previously exist. That’s not information-finding/search but information-generating. Some positive examples include Arrowsmith, the Robot Scientist, BioPubMiner, and pieces of Etzioni’s Machine Reading (MR) research. While many of the technologies that power search are present in good text mining applications, text mining applications should do a whole lot more. I mean, look at how token-driven Google is. Search regurgitates. Text mining should in effect *learn*.

    Numerous credible sources consider IE applications to be text mining, or more oxymoronically, “knowledge discovery.” Hearst’s widely-read essays, or Weiss et al.’s definitive text mining text, place IE under the text mining umbrella. But IE is really just search taken one more step, from returning documents to returning document elements. Yes, search engines like Google already return both documents and extracted elements from them. But these search engines don’t appear to have semantics built into them.

    If anything, the relationship started out very tight and has loosened over the years. Text mining will likely change names, as it already appears to be doing, which will make the separation somewhat harder to interpret.

  2. Paolo Cavone on March 16th, 2007 12:10 pm

    I agree with Patrick. Text mining should (or must!) in effect *learn* (and store semantic structures).

    …Yes search engines like Google already return both documents and extracted elements from them. But these search engines don’t appear to have semantics built into them.

    until the semantic structures are (too much) detected from the anchors of the backlink…

  3. Text Technologies » Blog Archive » The text mining vendors continue to lack constructive vision on December 18th, 2007 10:39 pm

    […] active in text search, except to some extent in the custom-publishing vertical, despite the huge reliance of search vendors on text mining technologies. They aren’t getting traction in the archiving/compliance area. There don’t seem to be […]
