I had a fascinating talk with Jay Henderson of ClearForest on Friday. I have more research to do before I know what I really think, but there is already plenty to post about.
ClearForest is one of the two companies whose names come up for fact extraction applications, probably even a little ahead of Attensity. ClearForest's flagship account is the GM deal they did with IBM, which kicked off the whole warranty report mining boom. Procter & Gamble is no slouch of a customer either. They’re involved enough in anti-terrorism that, when I asked Jay if he knew who Cogito was, he said “Of course.” And apparently one of their techie founders is the guy who coined the term “text mining” in the first place.
In short, ClearForest specializes in highly accurate fact extraction, principally as an input to text mining. But if I understood correctly, their top market right now isn’t text mining at all; rather, it’s custom publishing, often in partnership with Marklogic*. That makes sense; after all, an extracted “fact” is most valuable if you have access to the context in which it was originally asserted.
* This seems to provide validation for all the stuff Dave Kellogg, CEO of Marklogic, has been saying.
So how do they do this? It sounds as if it’s the same way we tried to do it at my failed startup Elucidate: one clever rule at a time. (Well, actually, it’s both rules and lexicons/lists, combined as needed; doing rules without synonyms makes no sense at all.) Only they seem to have been smarter about it somehow. Object orientation and inheritance play a big role (modularity is good!). They also ship the knowledge needed to extract a bunch of different object types (around 200 in all), ready “out of the box.” (Definitely needed.) This is the kind of thing that Attensity is trying to obviate with “exhaustive extraction,” but as I said before, that in many ways merely restates the problem, rather than actually solving it.
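To make the "rules plus lexicons, with inheritance" idea concrete, here is a minimal sketch in Python. It is purely my own illustration, not ClearForest's actual technology or rule language: the class names, the lexicons, and the regular-expression approach are all assumptions chosen to show how a base rule can be specialized just by swapping in a different synonym list.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    kind: str      # type of fact, e.g. "acquisition"
    subject: str   # who did it
    obj: str       # to whom or what


class Rule:
    """Hypothetical base rule: '<Name> <verb from lexicon> <Name>'."""
    kind = "generic"
    lexicon = {}  # canonical term -> list of surface synonyms

    def pattern(self):
        # Flatten the synonym lists, longest first so multi-word
        # phrases such as "took over" are tried before shorter ones.
        verbs = sorted(
            (v for synonyms in self.lexicon.values() for v in synonyms),
            key=len,
            reverse=True,
        )
        verb_alt = "|".join(re.escape(v) for v in verbs)
        # A very crude stand-in for name recognition: capitalized words.
        name = r"[A-Z]\w*(?: [A-Z]\w*)*"
        return re.compile(
            rf"\b(?P<subj>{name}) (?:{verb_alt}) (?P<obj>{name})"
        )

    def extract(self, text):
        return [
            Fact(self.kind, m["subj"], m["obj"])
            for m in self.pattern().finditer(text)
        ]


class AcquisitionRule(Rule):
    # Inherits all matching logic; only the lexicon and label change.
    kind = "acquisition"
    lexicon = {"acquire": ["acquired", "bought", "purchased", "took over"]}


class HireRule(Rule):
    kind = "hire"
    lexicon = {"hire": ["hired", "appointed", "named"]}


def extract_all(text, rules):
    """Run every rule over the text and collect the facts, rule by rule."""
    facts = []
    for rule in rules:
        facts.extend(rule.extract(text))
    return facts
```

Calling `extract_all(text, [AcquisitionRule(), HireRule()])` on a few sentences returns a flat list of `Fact` records. The point of the sketch is the shape: adding a new kind of fact means adding a small subclass and a lexicon, while the matching logic stays in one place. A real system would obviously need far better name recognition than "capitalized words."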
Continuing on their modularity kick, they’re proud of their SOA. Indeed, they’re proud of their scalability and general IT-friendliness. The one place modularity seems to fail a bit is in their analytic tools, which are not as embeddable in portal-based dashboards as one might ideally like.