Text mining and search, joined at the hip
Most people in the text analytics market realize that text mining and search are somewhat related. But I don’t think they often stop to contemplate just how close the relationship is, could be, or someday probably will become. Here’s part of what I mean:
- Text mining powers search. The biggest text mining outfits in the world, possibly excepting the US intelligence community, are surely Google, Yahoo, and perhaps Microsoft.
- Search powers text mining. Restricting the corpus of documents to mine, even via a keyword search, makes tons of sense. That’s one of the good ideas in Attensity 4.
- Text mining and search are powered by the same underlying technologies. For starters, there’s all the tokenization, extraction, etc. that vendors in both areas license from Inxight and its competitors. Beyond that, I think there’s a future play in integrated taxonomy management that will rearrange the text analytics market landscape.
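To make the second and third points concrete, here is a minimal sketch in Python. It is an illustration under my own assumptions, not how Attensity, Inxight, or any other vendor actually does it: the function names (`tokenize`, `keyword_search`, `top_terms`), the sample documents, and the stopword list are all hypothetical. A keyword search first narrows the corpus, a simple term count then "mines" the subset, and both steps share one tokenizer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokenizer shared by the search and mining steps."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_search(corpus, query):
    """Return the documents that contain every query term (boolean AND)."""
    terms = set(tokenize(query))
    return [doc for doc in corpus if terms <= set(tokenize(doc))]

def top_terms(docs, stopwords, n=3):
    """Return the n most frequent non-stopword tokens across docs."""
    counts = Counter(
        tok for doc in docs for tok in tokenize(doc) if tok not in stopwords
    )
    return counts.most_common(n)

# Hypothetical sample data, for illustration only.
CORPUS = [
    "The battery drains fast and the battery gets hot",
    "Great screen, but the battery died after a week",
    "Shipping was slow and the box arrived damaged",
]
STOPWORDS = {"the", "and", "a", "but", "was", "after"}

# Search restricts the corpus; mining then runs only on the hits.
subset = keyword_search(CORPUS, "battery")
print(top_terms(subset, STOPWORDS, n=3))
```

Real products use far richer extraction than a term count, but the shape is the same: the search step decides what gets mined, and both steps depend on the same tokenization.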
So who does “get it” about the search/text mining connection? The UIMA folks at IBM probably do. Inxight surely does. Attensity seemingly does, and so do most large search engine vendors (FAST and the public guys for sure; I’m not so certain about Autonomy and Convera). A small company whose CEO just called me yesterday does. I think I do.
But I’m not sure that the smaller text mining and search outfits, or the small text-oriented parts of large enterprise software vendors, have gotten the message at all yet …
Comments
3 Responses to “Text mining and search, joined at the hip”
You got me, that’s for sure.
I understand that search and text mining are related and overlap. Part of what’s commonly called text mining, namely information extraction (IE), powers some advanced features of search engines. Sure enough. But search engines take keywordese input, do some sort of finding operation, and return relevant documents. That’s the purpose of information retrieval. The most promising, interesting, and innovative applications of text mining are not information extraction engines but systems that operate and return results at the level of synthesis, rather than at the document level or even the extraction level. Such applications take extracted elements and put them together in order to generate new information. They use inductive logic or some set of domain rules (taxonomic rules, if you like) to create information that did not previously exist. That’s not information-finding/search but information-generating. Some positive examples include Arrowsmith, the Robot Scientist, BioPubMiner, and pieces of Etzioni’s Machine Reading (MR) research. While many of the technologies that power search are present in good text mining applications, text mining applications should do a whole lot more. I mean, look at how token-driven Google is. Search regurgitates. Text mining should in effect *learn*.
Numerous credible sources consider IE applications to be text mining, or more oxymoronically, “knowledge discovery.” Hearst’s widely read essays, or Weiss et al.’s definitive text mining text, place IE under the text mining umbrella. But IE is really just search taken one more step, from returning documents to returning document elements. Yes, search engines like Google already return both documents and extracted elements from them. But these search engines don’t appear to have semantics built into them.
If anything, the relationship started out very tight and has loosened over the years. Text mining will likely change names, as it already appears to be doing, which will make the separation somewhat harder to interpret.
I agree with Patrick. Text mining should (or must!) in effect *learn* (and store semantic structures).
[quote]
…Yes search engines like Google already return both documents and extracted elements from them. But these search engines don’t appear to have semantics built into them.
[/quote]
At least until the semantic structures get inferred (perhaps too heavily) from the anchor text of backlinks…
[…] active in text search, except to some extent in the custom-publishing vertical, despite the huge reliance of search vendors on text mining technologies. They aren’t getting traction in the archiving/compliance area. There don’t seem to be […]