Two of the clearest and most charismatic speakers in the text mining business are Attensity cofounders Todd Wakefield and David Bean. Last year, Todd’s Text Mining Summit speech gave an excellent overview of the various application areas in which text mining was being adopted; vestiges of that material may be found in a blog post I made at the time, and on Attensity’s web site. This time, David’s Text Analytics Summit speech was basically a pitch for Attensity’s latest product release – and it was a pitch well worth hearing.
The basic story is that selective fact extraction from text is a knowledge-engineering-intensive process. You need to determine which facts to extract, and then determine how to extract those particular kinds of facts. So Attensity has a better idea: extract all facts, not just some, and dump them into a “fact relationship network” (FRN). The FRN comprises two relational tables, one for facts and one for relationships, suitable for copying to a Teradata machine. Attensity calls this “exhaustive extraction.”
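To make the two-table idea concrete, here is a minimal sketch in Python using SQLite. The table and column names are hypothetical illustrations of the general design, not Attensity's actual schema, and the sample rows are invented:

```python
# A toy "fact relationship network": one table of extracted facts,
# one table of relationships between them. Schema and data are
# hypothetical -- this only illustrates the two-table shape.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facts (
    fact_id   INTEGER PRIMARY KEY,
    doc_id    TEXT,      -- source document
    subject   TEXT,
    predicate TEXT,
    object    TEXT
);
CREATE TABLE relationships (
    rel_id    INTEGER PRIMARY KEY,
    from_fact INTEGER REFERENCES facts(fact_id),
    to_fact   INTEGER REFERENCES facts(fact_id),
    rel_type  TEXT       -- e.g. causal, temporal
);
""")

# Exhaustive extraction would populate both tables for every clause,
# deferring the question of which facts matter to later queries.
conn.executemany(
    "INSERT INTO facts (doc_id, subject, predicate, object) VALUES (?, ?, ?, ?)",
    [("doc1", "customer", "reported", "battery failure"),
     ("doc1", "battery failure", "caused", "return")],
)
conn.execute(
    "INSERT INTO relationships (from_fact, to_fact, rel_type) VALUES (1, 2, 'causal')"
)

# Later, ad hoc querying: pick out the facts you decide you care about.
rows = conn.execute(
    "SELECT subject, predicate, object FROM facts WHERE predicate = 'caused'"
).fetchall()
print(rows)  # [('battery failure', 'caused', 'return')]
```

The point of the design is that the expensive extraction pass runs once, while deciding which facts are interesting becomes an ordinary SQL question asked afterward.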
To some extent, exhaustive extraction amounts to what in the math biz is called restating the problem.
- Old version: You need to determine which kinds of facts to get out of the documents, and what those facts might look like.
- New version: Same two challenges, but now vis-à-vis the FRN.
Still, this approach would seem to offer some nice advantages. Separating the initial extraction from later lexicography is pure goodness, for all the reasons that modularity is generally good. The same goes for separating the initial extraction from later decisions as to just what information it is you care about anyway. And generally, this approach should help in applications where somebody might say, in David’s phrase, “I don’t know what I’m looking for, but I’ll know it when I see it.”