And for my sixth text mining post this weekend, here are some highlights of the Clarabridge technology story. (Sorry if it sounds clipped, but I’m a bit burned out …)
- Like Attensity, Clarabridge practices exhaustive extraction.* That is, they do linguistics against documents, extract all sorts of entities and relationships among the entities from each document, and dump the results into a relational database.
- Unlike Attensity, which uses a simple normalized relational schema, Clarabridge dumps the extracted data into a star schema. (The Clarabridge folks are from MicroStrategy, which – surely not coincidentally – also favors star schemas.)
- For now, the linguistic part of the analysis is within a sentence, or else based on proximity, or (this sounded minor) based on the whole document. But actual anaphora resolution is coming soon.
- The other big thing that goes into Clarabridge’s star schema is a category hierarchy, which has two aspects. One is categories fixed in advance. When I asked how many, CTO Justin Langseth cited an example range of 10-400. I.e., it varies widely. In principle, these are established by line-of-business folks at Clarabridge customers, but I’d venture to guess that professional services play a significant role as well.
- The other kind of categories – subcategories to the first group – are created automagically at data load time via document clustering. Indeed, they’re called “clusters.” These are available for drilldown via business intelligence tools.
- Obviously it is good practice to have dashboards and scheduled reports depend only on the fixed categories, not the clusters.
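To make the star schema and the two-level category hierarchy a bit more concrete, here is a minimal sketch using Python's built-in `sqlite3`. Everything in it is an assumption for illustration: the table and column names (`fact_extraction`, `dim_category`, `kind`, `parent_id`), the sample rows, and the choice to store facts at the cluster level are all made up, and none of it reflects Clarabridge's actual schema. It just shows why a dashboard can safely roll up to the fixed categories while the clusters stay available for drilldown.

```python
import sqlite3


def build_star_schema(conn):
    """Create a toy star schema: one fact table surrounded by dimensions."""
    conn.executescript("""
    CREATE TABLE dim_document (doc_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE dim_entity (
        entity_id INTEGER PRIMARY KEY, entity_type TEXT, entity_text TEXT);
    -- Hypothetical hierarchy: 'fixed' categories are defined in advance,
    -- 'cluster' rows are created at load time and hang off a fixed parent.
    CREATE TABLE dim_category (
        category_id INTEGER PRIMARY KEY,
        name TEXT,
        kind TEXT CHECK (kind IN ('fixed', 'cluster')),
        parent_id INTEGER REFERENCES dim_category(category_id));
    CREATE TABLE fact_extraction (
        doc_id INTEGER REFERENCES dim_document(doc_id),
        category_id INTEGER REFERENCES dim_category(category_id),
        entity_id INTEGER REFERENCES dim_entity(entity_id),
        mention_count INTEGER);
    """)


def load_sample(conn):
    """Insert a few made-up rows standing in for extraction output."""
    conn.executemany("INSERT INTO dim_document VALUES (?, ?)",
                     [(1, "call note A"), (2, "call note B"), (3, "survey C")])
    conn.executemany("INSERT INTO dim_entity VALUES (?, ?, ?)",
                     [(1, "product", "router"), (2, "issue", "late fee")])
    conn.executemany("INSERT INTO dim_category VALUES (?, ?, ?, ?)", [
        (1, "Billing", "fixed", None),
        (2, "Service", "fixed", None),
        (10, "cluster_0", "cluster", 1),
        (11, "cluster_1", "cluster", 1),
        (12, "cluster_2", "cluster", 2),
    ])
    conn.executemany("INSERT INTO fact_extraction VALUES (?, ?, ?, ?)",
                     [(1, 10, 2, 2), (2, 11, 2, 1), (2, 11, 1, 1), (3, 12, 1, 3)])


def mentions_by_fixed_category(conn):
    """Roll facts up to the fixed categories (the dashboard-safe view)."""
    return conn.execute("""
        SELECT p.name, SUM(f.mention_count)
        FROM fact_extraction f
        JOIN dim_category c ON f.category_id = c.category_id
        JOIN dim_category p
          ON p.category_id = COALESCE(c.parent_id, c.category_id)
        GROUP BY p.name ORDER BY p.name
    """).fetchall()


def drill_down(conn, fixed_name):
    """List the load-time clusters underneath one fixed category."""
    return conn.execute("""
        SELECT c.name, SUM(f.mention_count)
        FROM fact_extraction f
        JOIN dim_category c ON f.category_id = c.category_id
        JOIN dim_category p ON c.parent_id = p.category_id
        WHERE p.name = ? AND c.kind = 'cluster'
        GROUP BY c.name ORDER BY c.name
    """, (fixed_name,)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_star_schema(conn)
    load_sample(conn)
    print(mentions_by_fixed_category(conn))
    print(drill_down(conn, "Billing"))
```

In this sketch, a reload that re-clusters the documents could rename or reshuffle the `cluster` rows without changing the totals from `mentions_by_fixed_category`, which is the reason for keeping scheduled reports on the fixed level.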
*I should note that Clarabridge understandably bristles a bit at my use of this Attensity-introduced term to describe what they do too. If Clarabridge wants to start talking about, say, “comprehensive extraction,” I’ll consider adopting that term as well. But for now I’m going with what’s most widely used.