Comprehensive or exhaustive extraction – Text Technologies

Maybe text mining SHOULD be playing a bigger role in data warehousing

Curt Monash — Fri, 24 Oct 2008 04:39:36 +0000

When I chatted last week with David Bean of Attensity, I commented to him on a paradox:

Many people think text information is important to analyze, but even so data warehouses don’t seem to wind up holding very much of it.

My working theory explaining this has two parts, both of which purport to show why text data generally doesn’t fit well into BI or data mining systems. One is that it’s just too messy and inconsistently organized. The other is that text corpuses generally don’t contain enough information.

Now, I know that these theories aren’t wholly true, for I know of counterexamples. E.g., while I haven’t written it up yet, I did a call confirming that a recently published SPSS text/tabular integrated data mining story is quite real. Still, it has felt for a while as if truth lies in those directions.

Anyhow, David offered one useful number range:

If you do exhaustive extraction on a text corpus, you wind up with 10-20X as much tabular data as you had in text format in the first place. (Comparing total bytes to total bytes.)

So how big are those corpuses? I think most text mining installations usually have at least 10s of thousands of documents or verbatims to play with. Special cases aside, the upper bound seems to usually be about two orders of magnitude higher. And most text-mined documents probably tend to be short, as they commonly are just people’s reports on a single product/service experience – perhaps 1 KB or so, give or take a factor of 2-3? So we’re probably looking at 10s of megabytes of text at the low end, and a few gigabytes at the high end, before applying David’s 10-20X multiplier.

Hmm – that IS enough data for respectable data warehousing …

Obviously, special cases like national intelligence or very broad-scale web surveys could run larger, as per the biggest Marklogic databases. Medline runs larger too.
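The sizing estimate in this post is easy to multiply out. Here is a minimal sketch in Python that does so, using only the rough assumptions stated above (document counts, roughly 1 KB per document, the 10-20X multiplier). The function names are invented for illustration, and the specific points picked inside each range are my own choices.

```python
# Back-of-envelope corpus sizing. All inputs are the rough assumptions
# from the post, not measured values.

KB = 1_000  # decimal units are precise enough for an estimate like this


def corpus_bytes(num_docs: int, avg_doc_bytes: int) -> int:
    """Raw text volume of a corpus, in bytes."""
    return num_docs * avg_doc_bytes


def extracted_range(text_bytes: int, low: int = 10, high: int = 20) -> tuple[int, int]:
    """Tabular volume after exhaustive extraction, using the 10-20X multiplier."""
    return text_bytes * low, text_bytes * high


def human(n: float) -> str:
    """Format a byte count with a decimal unit suffix."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1000:
            return f"{n:g} {unit}"
        n /= 1000
    return f"{n:g} PB"


# Low end: tens of thousands of ~1 KB verbatims (10,000 chosen as the floor).
low_text = corpus_bytes(10_000, 1 * KB)
# High end: two orders of magnitude more documents, at the upper end of the
# "factor of 2-3" allowance on document size.
high_text = corpus_bytes(1_000_000, 3 * KB)

for label, text in (("low end", low_text), ("high end", high_text)):
    lo, hi = extracted_range(text)
    print(f"{label}: {human(text)} of text -> {human(lo)} to {human(hi)} tabular")
```

Changing the document counts or the average size shows how sensitive the conclusion is to each guess.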

5 ideas for how to pick between Attensity and Clarabridge

Curt Monash — Tue, 10 Jun 2008 23:43:51 +0000

Jim D. of UPS asked in the comment thread to the recent Attensity update post how one should decide between Attensity and Clarabridge. I wrote an answer, and then decided to just split it out in a separate post. Here are five ideas about how to pick between Attensity and Clarabridge for the kind of Voice of the Customer/Market application both companies are focusing on.

1. Attensity is an older company than Clarabridge, and is good at more things. Is Clarabridge really good at everything you want them to be?

2. In particular, Attensity has more overall sophistication at linguistic extraction. Do any of the differences matter to you?

3. Both companies are working hard on ease of use, for multiple kinds of user (business user tweaking linguistic rules, IT user, etc.). Whose approach and feature set do you like better?

4. Usually, buying one of these products involves some professional services. Whose organization do you like better?

5. Attensity’s default database schema for its exhaustive extraction is pretty flat and normalized, as befits a happy Teradata partner. Clarabridge’s is more of a star schema, as befits a bunch of ex-Microstrategy guys. Either can be straightforwardly translated into the other, so you may not care — but do you?

Clarabridge does SaaS, sees Inxight

Curt Monash — Wed, 14 Nov 2007 18:11:28 +0000

I just had a quick chat with text mining vendor Clarabridge’s CEO Sid Banerjee. Naturally, I asked the standard “So who are you seeing in the marketplace the most?” question. Attensity is unsurprisingly #1. What’s new, however, is that Inxight – heretofore not a text mining presence vs. commercially-focused Clarabridge – has begun to show up a bit this quarter, via the Business Objects sales force. Sid was of course dismissive of their current level of technological readiness and integration – but at least BOBJ/Inxight is showing up now.

The most interesting point was text mining SaaS (Software as a Service). When Clarabridge first put out its “We offer SaaS now!” announcement, I yawned. But Sid tells me that about half of Clarabridge’s deals now are actually SaaS. The way the SaaS technology works is pretty simple. The customer gathers together text into a staging database – typically daily or weekly – and it gets sucked into a Clarabridge-managed Clarabridge installation in some high-end SaaS data center. If there’s a desire to join the results of the text analysis with some tabular data from the client’s data warehouse, the needed columns get sent over as well. And then Clarabridge does its thing.

It has always been the case that business intelligence was an IT systems software technology that often wound up being sold on an application basis to end-user departments. Clarabridge very much fits that model. And while it used to be the case that BI adoption was pretty simple, that’s increasingly not the case, which is one reason SaaS is appealing. So this all makes a lot of sense.

Even so, I was surprised to hear that SaaS had so quickly become half of Clarabridge’s business. Wow.

Since Clarabridge touts Cognos as an important partner, and Cognos is being bought by IBM, I also asked Sid about UIMA. He basically responded that UIMA was unlikely to become relevant to Clarabridge any time soon, because the way Clarabridge interfaces with other software is SQL. Up to a point, that makes great sense to me. But if we buy into the comprehensive/exhaustive extraction story — as Clarabridge does — then the day should and will come when serious linguistic processing gets done on text after it is extracted into a relational database. And if that happens, then all of a sudden SQL won’t be the only interface integrating text analytics with BI.

The Clarabridge approach to text mining

Curt Monash — Sun, 07 Oct 2007 00:14:23 +0000

And for my sixth text mining post this weekend, here are some highlights of the Clarabridge technology story. (Sorry if it sounds clipped, but I’m a bit burned out …)

Like Attensity, Clarabridge practices exhaustive extraction.* That is, they do linguistics against documents, extract all sorts of entities and relationships among the entities from each document, and dump the results into a relational database.
Unlike Attensity, which uses a simple normalized relational schema, Clarabridge dumps the extracted data into a star schema. (The Clarabridge folks are from Microstrategy, which – surely not coincidentally – also favors star schemas.)
For now, the linguistic part of the analysis is within a sentence, or else based on proximity, or (this sounded minor) based on the whole document. But actual anaphora resolution is coming soon.
The other big thing that goes into Clarabridge’s star schema is a category hierarchy, which has two aspects. One is categories fixed in advance. When I asked how many, CTO Justin Langseth cited an example range of 10-400. I.e., it varies widely. In principle, these are established by line-of-business folks at Clarabridge customers, but I’d venture to guess that professional services play a significant role as well.
The other kind of categories – subcategories to the first group – are created automagically at data load time via document clustering. Indeed, they’re called “clusters.” These are available for drilldown via business intelligence tools.
Obviously it is good practice to have dashboards and scheduled reports depend only on the fixed categories, not the clusters.

*I should note that Clarabridge understandably bristles a bit at my use of this Attensity-introduced term to describe what they do too. If Clarabridge wants to start talking about, say, “comprehensive extraction,” I’ll consider adopting that term as well. But for now I’m going with what’s most widely used.
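To make the fixed-category vs. cluster distinction concrete, here is a small sketch of what such a star schema could look like, using Python's built-in sqlite3 module. Every table name, column name, and data row is invented for illustration; this is not Clarabridge's actual schema. The point it tries to show is why a dashboard would want to roll clusters up to their fixed parents: clusters are generated at load time, so presumably they can shift from one load to the next.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by dimension tables.
# Names and rows are assumptions made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_document (
    document_id INTEGER PRIMARY KEY,
    source      TEXT NOT NULL              -- e.g. survey, email, call notes
);
CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    kind        TEXT NOT NULL CHECK (kind IN ('fixed', 'cluster')),
    parent_id   INTEGER REFERENCES dim_category(category_id)  -- NULL for fixed
);
CREATE TABLE fact_extraction (
    document_id INTEGER NOT NULL REFERENCES dim_document(document_id),
    category_id INTEGER NOT NULL REFERENCES dim_category(category_id),
    entity      TEXT NOT NULL,
    relation    TEXT NOT NULL
);
""")

conn.executemany("INSERT INTO dim_document VALUES (?, ?)",
                 [(1, "survey"), (2, "email"), (3, "survey")])
conn.executemany("INSERT INTO dim_category VALUES (?, ?, ?, ?)", [
    (1, "Room quality", "fixed", None),
    (2, "Staff", "fixed", None),
    (10, "shower temperature", "cluster", 1),   # created at load time
    (11, "front desk wait", "cluster", 2),      # created at load time
])
conn.executemany("INSERT INTO fact_extraction VALUES (?, ?, ?, ?)", [
    (1, 10, "shower", "cold"),
    (2, 10, "water", "cold"),
    (2, 1, "bed", "comfortable"),
    (3, 11, "check-in", "slow"),
])

# Dashboard-style report: roll every cluster up to its fixed parent, so the
# report keeps its shape even if clustering produces different subcategories.
fixed_report = conn.execute("""
    SELECT COALESCE(parent.name, c.name) AS fixed_category, COUNT(*) AS n
    FROM fact_extraction f
    JOIN dim_category c ON c.category_id = f.category_id
    LEFT JOIN dim_category parent ON parent.category_id = c.parent_id
    GROUP BY fixed_category
    ORDER BY fixed_category
""").fetchall()

# Ad hoc drilldown into the clusters under one fixed category.
drilldown = conn.execute("""
    SELECT c.name, COUNT(*) AS n
    FROM fact_extraction f
    JOIN dim_category c ON c.category_id = f.category_id
    WHERE c.kind = 'cluster' AND c.parent_id = 1
    GROUP BY c.name
""").fetchall()

print(fixed_report)
print(drilldown)
```

The scheduled report touches only the fixed level, while the drilldown query is the kind of thing a BI tool would issue interactively.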


When to use exhaustive extraction

Curt Monash — Sat, 06 Oct 2007 00:54:52 +0000

I’ve been emailing and/or talking with both Clarabridge and Attensity this week. Since they’re the two big proponents of exhaustive extraction, I naturally asked whether there are any cases where exhaustive extraction should not be used. In Clarabridge’s case, it turns out exhaustive extraction is the default, and no customer has ever turned this default off. However, their current high end is several million documents* per year. They suspect that in some current projects with much higher volumes the default may finally be turned off.

*Actually, the word Clarabridge CTO Justin Langseth used was “verbatim.” But that’s essentially a synonym for document, only with the connotation that these documents will probably be people’s statements (think warranty cards, customer surveys, email, call center notes, etc.), with all that implies for their grammar, structure (or lack thereof), and so on.

I didn’t push Attensity for an answer that clear. What they said was simply that all their capabilities were integrated together, so everybody uses exhaustive extraction. I imagine they’d say something similar, but it seems I should follow up a little bit further …

David Bean of Attensity explains sentiment and other qualifiers

Curt Monash — Sat, 06 Oct 2007 00:36:38 +0000

David Bean of Attensity is rightly one of the most popular explainers of text mining, for his clarity and personality alike. I shot a question to him about how Attensity’s exhaustive extraction strategy handled sentiment and so on. He responded with an email that contains the best overall explanation of sentiment analysis in text mining I’ve seen anywhere. Naturally, this is rolled into an Attensity-specific worldview and sales pitch — but so what?

Our exhaustive extraction approach doesn’t compromise detection of qualifiers* because we recognize the qualifications while we have access to the complete linguistic information of the input. Much of that information is later stripped away, since it’s way more information than a user would want. We make sure we project qualifications like you mention in the final representations. In fact, we’ve put a lot of effort into recognizing “voicing,” i.e. distinguishing among negations, conditional statements, and variations in the degree of sentiment.

Examples will help here:

(1) I want to return the espresso machine. (intention to return)

(2) I plan on returning the espresso machine. (intention to return)

(3) I won’t return the espresso machine. (negation – not a return)

(4) I returned no espresso machines. (negation – not a return)

(5) I failed to return the espresso machine. (negation – not a return)

(6) If you don’t return my phone call, I will return the espresso machine. (conditional – threat to return)

(7) I’ve returned espresso machines twice already. (recurrence – repeated returns)

(8) I tried to return the espresso machine. (attempt to return, negation – not a return)

(9) I failed to return the espresso machine. (failed attempt, negation – not a return)

(10) I refuse to return the espresso machine. (negation – not a return)

(11) I need to return the espresso machine now/asap. (urgency)

(12) I’m unhappy. (unhappy, duh)

(13) I’m really unhappy. (augmented unhappiness)

(14) The tires were over-inflated. (augmented inflation…works on non-sentiment qualities too)

(15) The breakfast was under-cooked. (diminished)

(16) The water in the shower this morning was way too cold. (augmented coldness)

(17) I will speak to the customer about returning the espresso machine. (indefinite – not a return, yet)

If we’re using our Fact Relationship Network style of extraction to look at these sentences, those voicing variations get represented on the mode* (typically), so you’d see output like:

return (intent)
return (not)
return (if/then)
return (again)
return (urgency)
happy (not)
happy (not, augmented)
cooked (diminished)
cold (augmented)

*Editor’s note: “Mode” means, in effect, “behavior or action.” It’s not a typo for “node.”

Post-extraction, any of these voicings can be used to roll up several FRN extractions into a collection that makes sense to the business, e.g. “water | cold (augmented)” and “water | hot (not).” What makes all that possible is that the core engine has access to a great deal of linguistic information before it turns the extraction into a specific type of representation like an FRN. Such linguistic information includes the notions of negating verbs (failed to), double negatives, negative quantifiers that transfer their negation to the verb (no animals were harmed…), adverbial prepositional phrases (I returned the espresso machine in a fit of rage.) and so on. We think that’s a big deal – it lets us get a true count of, in these examples, product returns – not the returns of phone calls, or the threatened returns, the intentional returns, or the non-returns. We used this kind of distinctive power to show a retailer how they could identify customers who were threatening to return products, thereby detecting a set of product returns that could be saved (before they ended up costing the retailer $$$).
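As an editorial aside, the counting and roll-up David describes can be sketched in a few lines of Python. This is only a toy under stated assumptions: the Extraction class, the voicing labels, and the hand-written extraction records are all made up here to mirror the examples in his email. Nothing below reflects how Attensity's engine or its FRN is actually implemented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """One extracted fact: a subject, a mode, and the voicings on the mode."""
    subject: str
    mode: str
    voicings: frozenset = frozenset()

    @property
    def key(self):
        return (self.subject, self.mode, self.voicings)


# Hand-written stand-ins for extractor output on a few example sentences.
# A real engine would derive these from the text; these are assumptions.
extractions = [
    Extraction("espresso machine", "return", frozenset({"intent"})),   # (1)
    Extraction("espresso machine", "return", frozenset({"not"})),      # (3)
    Extraction("phone call", "return", frozenset({"not"})),            # (6), first clause
    Extraction("espresso machine", "return", frozenset({"if/then"})),  # (6), second clause
    Extraction("espresso machine", "return", frozenset({"again"})),    # (7)
    Extraction("water", "cold", frozenset({"augmented"})),             # (16)
    Extraction("water", "hot", frozenset({"not"})),
]

# Voicings under which the return has not actually happened.
NOT_AN_ACTUAL_RETURN = {"not", "intent", "if/then"}


def is_actual_return(e: Extraction, product: str) -> bool:
    return (e.mode == "return" and e.subject == product
            and not (e.voicings & NOT_AN_ACTUAL_RETURN))


def is_threatened_return(e: Extraction, product: str) -> bool:
    return e.mode == "return" and e.subject == product and "if/then" in e.voicings


# Business-level roll-up: several (subject, mode, voicings) patterns
# map to one label chosen after extraction.
ROLLUP = {
    ("water", "cold", frozenset({"augmented"})): "water temperature complaint",
    ("water", "hot", frozenset({"not"})): "water temperature complaint",
}

actual = sum(is_actual_return(e, "espresso machine") for e in extractions)
threatened = sum(is_threatened_return(e, "espresso machine") for e in extractions)
rolled_up = Counter(ROLLUP[e.key] for e in extractions if e.key in ROLLUP)

print(actual, threatened, dict(rolled_up))
```

A naive count of the word "return" would find five hits in this sample; filtering on subject and voicing leaves one actual product return and one threatened return.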

Clarabridge takes on Attensity

Curt Monash — Tue, 27 Mar 2007 00:36:38 +0000

Text mining newbie Clarabridge gave me the all-too-customary “Please let us brief you, but then don’t write about it for a while” routine. Now that it’s OK to post, what I’m up for offering is a few salient points in bullet form.

The closest analogy to what Clarabridge does is Attensity’s new(ish) strategy – extract “facts” from documents and dump them into a relational database management system. In particular, Clarabridge and Attensity alike make the case “Our categorization is more flexible because it’s applied only after the extraction happens.”
Clarabridge’s sweet spot is extracting user opinions from short documents. E.g., the customer use cases they talk about are customer feedback forms, public blog postings, etc. about A. hotels and B. consumer software products.
Clarabridge has a strong business intelligence mentality, describing the product as “ETL for unstructured data.” But then, it’s spun out of a BI consultancy that itself was founded by Microstrategy veterans.
Clarabridge uses a different database schema than Attensity. Attensity’s fact-relationship network (FRN) is basically just two thin, long tables. Clarabridge, however, uses a Microstrategy-like star schema, in which different kinds of things that you can tokenize correspond to different dimensions.

Frankly, if somebody wants an alternative to the Attensity/Teradata/Business Objects partnership they could do worse than talk with Clarabridge.

Attensity, extractive exhaustion, and the FRN

Curt Monash — Sun, 25 Jun 2006 02:40:27 +0000

Two of the clearest and most charismatic speakers in the text mining business are Attensity cofounders Todd Wakefield and David Bean. Last year, Todd’s Text Mining Summit speech gave an excellent overview of the various application areas in which text mining was being adopted; vestiges of that material may be found in a blog post I made at the time, and on Attensity’s web site. This time, David’s Text Analytics Summit speech was basically a pitch for Attensity’s latest product release – and it was a pitch well worth hearing.

The basic story is that selective fact extraction from text is a knowledge-engineering-intensive process. You need to determine which facts to extract, and then determine how to extract those particular kinds of facts. So Attensity has a better idea; it will extract all facts, not just some, and dump them in a “fact relationship network” (FRN). The FRN is two relational tables, one for facts and one for relationships, suitable for copying to a Teradata machine. Attensity calls this “exhaustive extraction.”

To some extent, exhaustive extraction amounts to what in the math biz is called restating the problem.

Old version: You need to determine which kinds of facts to get out of the documents, and what those facts might look like.
New version: Same two challenges, but now vis-à-vis the FRN.

Still, this approach would seem to offer some nice advantages. Separating the initial extraction from later lexicography is pure goodness, for all the reasons that modularity is generally good. The same goes for separating the initial extraction from later decisions as to just what information it is you care about anyway. And generally, this approach should help in applications where somebody might say, in David’s phrase, “I don’t know what I’m looking for, but I’ll know it when I see it.”
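To illustrate the modularity argument, here is a rough sketch of a two-table store in the spirit of the FRN, again using Python's sqlite3 module. The only thing taken from Attensity's description is that there is one table for facts and one for relationships. The column names, the sample rows, and the count_pairs helper are hypothetical. The sketch loads everything once, then applies a lexicon that is decided on afterward.

```python
import sqlite3

# Two thin tables, one for facts and one for relationships between facts.
# Column names are assumptions; the source does not describe them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facts (
    fact_id     INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    text        TEXT NOT NULL              -- the extracted term or phrase
);
CREATE TABLE relationships (
    from_fact INTEGER NOT NULL REFERENCES facts(fact_id),
    to_fact   INTEGER NOT NULL REFERENCES facts(fact_id),
    role      TEXT NOT NULL                -- how the two facts are connected
);
""")

# Step 1: extract everything up front, with no decision yet about what matters.
conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", [
    (1, 1, "return"), (2, 1, "espresso machine"),
    (3, 2, "leak"), (4, 2, "coffee maker"),
    (5, 3, "send back"), (6, 3, "coffee maker"),
])
conn.executemany("INSERT INTO relationships VALUES (?, ?, ?)", [
    (1, 2, "object"),
    (3, 4, "subject"),
    (5, 6, "object"),
])


def count_pairs(conn, actions, things):
    """Count documents where one of `actions` is related to one of `things`."""
    a_marks = ",".join("?" * len(actions))
    t_marks = ",".join("?" * len(things))
    sql = f"""
        SELECT COUNT(DISTINCT a.document_id)
        FROM relationships r
        JOIN facts a ON a.fact_id = r.from_fact
        JOIN facts t ON t.fact_id = r.to_fact
        WHERE a.text IN ({a_marks}) AND t.text IN ({t_marks})
    """
    return conn.execute(sql, [*actions, *things]).fetchone()[0]


# Step 2, later: the lexicon is chosen and revised at query time,
# without re-running extraction.
narrow = count_pairs(conn, ["return"], ["espresso machine"])
broad = count_pairs(conn, ["return", "send back"],
                    ["espresso machine", "coffee maker"])
print(narrow, broad)
```

Widening the lexicon changes the answer without touching the stored extraction, which is the separation the post credits to this design.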