Sentiment analysis – Text Technologies

Notes, links, and comments, October 24, 2010

Curt Monash — Sun, 24 Oct 2010 08:58:25 +0000

Time for a notes/links/comments post just for Text Technologies:

TechCrunch got sold, GigaOm raised money, and VentureBeat/MediaBeat provided a good starting link for both those stories and more. Since TechCrunch and GigaOm are/were both private, financial details are murky, but:
- TechCrunch is variously reported as having revenue in the $6-10 million range, probably mainly from events. (If you believe that they sell ~3000 total tickets at ~$2000 each to two annual versions of TechCrunch Disrupt, that makes sense.)
- GigaOm reports >10,000 subscribers to its (sort of) market research service GigaOm Pro, at $199, apparently concentrated on the vendor side.
John Gruber straightforwardly posts both ad rates and circulation for his blog. It’s a simple $5000/week for readership that exceeds mine by >1 order of magnitude.
The New Yorker points out Gawker Media may not yet have crossed $20 million in revenue.
An “ASCAP for news” seems to finally be on the way.
Business Week/Bloomberg notices a trend that social-media/Voice of the Customer/Voice of the Market text analytics firms are getting acquired by bigger marketing-oriented firms. Seth Grimes, however, argues that the same trend is already passé.
TechCrunch accused the Wall Street Journal of killing a story about sister company MySpace, then quickly running it after TechCrunch caught them.
LinkedIn has a really cool-looking tech blog. One recent post describes LinkedIn’s approach to socially-informed search. I read about it in a thoughtful post on Daniel Tunkelang’s blog.
Bill Simmons took 3843 words to explain the story of a two-word tweet — “moss Vikings.” Somewhere in there are a few interesting ruminations about media in the current age.
Some notes and links that actually belong here instead went up on DBMS 2 a few weeks ago.
About half of what I write about liberty and privacy is highly relevant to the subjects of this blog, including almost all of today’s post.

Maybe text mining SHOULD be playing a bigger role in data warehousing

Curt Monash — Fri, 24 Oct 2008 04:39:36 +0000

When I chatted last week with David Bean of Attensity, I commented to him on a paradox:

Many people think text information is important to analyze, but even so data warehouses don’t seem to wind up holding very much of it.

My working theory explaining this has two parts, both of which purport to show why text data generally doesn’t fit well into BI or data mining systems. One is that it’s just too messy and inconsistently organized. The other is that text corpuses generally don’t contain enough information.

Now, I know that these theories aren’t wholly true, for I know of counterexamples. E.g., while I haven’t written it up yet, I did a call confirming that a recently published SPSS text/tabular integrated data mining story is quite real. Still, it has felt for a while as if truth lies in those directions.

Anyhow, David offered one useful number range:

If you do exhaustive extraction on a text corpus, you wind up with 10-20X as much tabular data as you had in text format in the first place. (Comparing total bytes to total bytes.)

So how big are those corpuses? I think most text mining installations usually have at least 10s of thousands of documents or verbatims to play with. Special cases aside, the upper bound seems to usually be about two orders of magnitude higher. And most text-mined documents probably tend to be short, as they commonly are just people’s reports on a single product/service experience – perhaps 1 KB or so, give or take a factor of 2-3? So we’re probably looking at 10 gigabytes of text at the low end, and a few terabytes at the high end, before applying David’s 10-20X multiplier.

Hmm – that IS enough data for respectable data warehousing …

Obviously, special cases like national intelligence or very broad-scale web surveys could run larger, as per the biggest MarkLogic databases. Medline runs larger too.

Low-latency text mining in the investment market

Curt Monash — Fri, 19 Sep 2008 09:15:58 +0000

I’m not at Gartner’s Event Processing conference, but there seem to be some interesting posts and articles coming out of it. Seth Grimes has one on Reuters’ integration of text mining and event processing, including sentiment analysis. Well worth reading. Lots more detail than I’ve ever posted on similar applications.

Lexalytics has merged with part of Infonic

Curt Monash — Thu, 07 Aug 2008 19:59:01 +0000

As reported on the Lexalytics blog, sentiment analysis specialist Lexalytics has merged with the text analytics division of Infonic to form Lexalytics Limited. The deal seems to have a screwy financial structure — which Seth Grimes made a valiant effort to decipher (I think from vacation, poor guy) — as is common when companies much too small to be public wind up trading publicly anyway.

If you think sentiment analysis technology can detect idiom, I have a bridge I’d like to sell you

Curt Monash — Fri, 20 Jun 2008 11:40:52 +0000

Text mining tools are just WONDERFUL at detecting idiom, sarcasm, and figurative speech … Yeah, right. I asked Lexalytics CEO Jeff Catlin whether his tool could do that kind of thing, and he looked at me like I’d just grown a third ear.

Actually, he didn’t. But just like every other sentiment analysis vendor I encountered at the Text Analytics Summit or spoke to beforehand, he made it clear that his tool could only handle straightforward, literal expressions of opinion. Idiom, irony, sarcasm, metaphor, et al. are beyond the current reach of the technology.

Aren’t you just thrilled that I shared that earth-shattering news with you?

TEMIS tidbits

Curt Monash — Tue, 17 Jun 2008 05:27:59 +0000

The usual TEMIS execs didn’t make the trip to the Text Analytics Summit this year. But cofounder Alessandro Zanasi did come, and I chatted with him for a bit. Alessandro is also author of a recent book on text mining, and pretty much a one-man Italian operation for France-based TEMIS. Despite his nominal 100:1 manpower disadvantage vs. Italian national-champion text analytics vendor Expert System S.p.A., Alessandro proudly rattled off four different Italian accounts he’d won vs. Expert System, all of them apparently in the government area.

Beyond that, Alessandro denies all the rumors that have grown out of TEMIS being hard to reach recently. He reports that pharma is still TEMIS’s big market, but stresses that this covers a range of apps, from research to Voice of the Market. I do get the sense that TEMIS’s sentiment extraction capabilities are less sophisticated than some of the other vendors’ — but the other vendors I’m thinking of are pretty focused on English, SPSS aside. If you need sentiment analysis in non-English languages — e.g., French or Italian — TEMIS should definitely be on your vendor shortlist.

Intro to Lexalytics

Curt Monash — Tue, 17 Jun 2008 04:49:28 +0000

I chatted with Lexalytics CEO Jeff Catlin at the Text Analytics Summit today. Lexalytics is a 14-person company, which represents a doubling over last year. Jeff thinks Lexalytics is on track this year to double again.

Lexalytics’ main business is OEMing sentiment extraction, e.g. to the many blog-analysis/reputation-management (i.e., Voice of the Market) companies that recently started up and in some cases have been bought by big market analysis firms. Lexalytics can and sometimes does extract the more basic stuff as well, but sentiment analysis is the heart of its business. A partial customer list can be found on the Lexalytics site. Lexalytics extracts in the English language only.

One feature Lexalytics is proud of is that it doesn’t just assess sentiment from a phrase; it also gives a confidence (“evidence”) weighting. In such a fuzzy area as sentiment, I think that’s a good idea.
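To make the idea of pairing a sentiment call with an evidence weighting concrete, here is a minimal sketch. It is not Lexalytics' method, which isn't described here; the lexicon, its weights, and the `score_phrase` function are all hypothetical, and the "evidence" figure is simply the share of sentiment weight that agrees with the overall call.

```python
# Toy lexicon; the words and weights are made up for illustration.
LEXICON = {
    "great": 1.0, "love": 1.0, "comfortable": 0.5,
    "crummy": -1.0, "terrible": -1.0, "slow": -0.5,
}

def score_phrase(phrase):
    """Return (polarity, evidence) for a phrase.

    polarity is "positive", "negative", or "neutral".
    evidence runs from 0.0 to 1.0: the net sentiment weight divided by
    the total sentiment weight found, so mixed signals score low.
    """
    words = [w.strip(".,!?").lower() for w in phrase.split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    total = sum(hits)
    if not hits or total == 0:
        return "neutral", 0.0
    magnitude = sum(abs(h) for h in hits)
    polarity = "positive" if total > 0 else "negative"
    return polarity, abs(total) / magnitude

unanimous = score_phrase("I love this great phone")
mixed = score_phrase("great screen, terrible and slow software")
```

In this sketch a phrase whose sentiment words all point the same way gets full evidence, while a phrase with conflicting words still gets a polarity but a much lower weighting, which is the kind of signal a downstream app could use to discount shaky calls.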

Lexalytics has a demo site, PoliticalTrends.info. The links on the left show some of the charts and reports they offer. But the bar charts in the middle inadvertently show the limitations of an approach that overweights some kinds of linguistic analysis at the expense of others. As I write this, the top 5 “Breaking themes in the last 3 days” are

- last week
- court decision
- web site
- nuclear program
- front page

I think that particular part of the app might work better if a little more restriction were placed on what is or isn’t counted as a “theme.”
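One crude way to picture such a restriction is a stoplist filter over candidate phrases. The sketch below is purely illustrative: `GENERIC_WORDS` and `is_theme` are hypothetical names, the word list is hand-picked for this one example, and nothing here reflects how the demo site actually works.

```python
# Hypothetical list of words too generic to carry a theme on their own.
GENERIC_WORDS = {"last", "next", "week", "day", "year",
                 "front", "page", "web", "site"}

def is_theme(phrase):
    """Keep a phrase only if at least one word is outside the generic list."""
    return any(word not in GENERIC_WORDS for word in phrase.lower().split())

candidates = ["last week", "court decision", "web site",
              "nuclear program", "front page"]
themes = [phrase for phrase in candidates if is_theme(phrase)]
```

Applied to the five phrases above, this keeps only "court decision" and "nuclear program". A real system would presumably need something better than a hand-built list, but even this much restriction removes the obvious noise.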

Voice of the Customer/Market is indeed where the action is

Curt Monash — Tue, 17 Jun 2008 04:03:04 +0000

I was at the Text Analytics Summit yesterday. After the sessions and theoretically* before the drinks, there was a group of subject- or industry-specific “roundtables.” The three best-attended roundtables by far — each with at least 20% of the total roundtable attendees — were on “Voice of the Market”, “Competitive Intelligence”, and “Sentiment Analysis”. (Yes, those are in practice pretty close to being the same thing.) Thus, over half of the show attendees who voted with their feet on a particular subject area of interest picked one in the customer/marketing area.

*In reality, the bar opened early, and I took a Sam Adams into the roundtable room.

Now, it’s possible this reflected a certain vendor bias. Most of the show’s attendees are either vendors or users whose attendance the vendors pay for, and many of the rest are prospects the vendors encourage to come. The show’s program is also heavily influenced by what the vendors think is important. Still, this is confirming evidence that the text mining industry’s center of gravity has shifted emphatically to the CRM area.

How much linguistic sophistication is needed in Voice of the Customer/Market applications?

Curt Monash — Wed, 11 Jun 2008 11:54:15 +0000

According to Attensity CTO David Bean:

- Voice of the Customer/Market applications require less linguistic sophistication than other text mining applications.
- Hence, Voice of the Customer/Market apps are easier to get running than other text mining applications, which he conjectures is a big part of the reason for burgeoning sales.

I’m guessing most text mining vendors would agree with those views, although they might not agree with his elaborations, which include:

- Attensity’s knowledge extraction technology is more sophisticated than Clarabridge’s or most other competitors’.
- In particular, Clarabridge’s extraction is little more than bag-of-words.
- There’s a good match between companies he thinks have less-sophisticated extraction (e.g., Clarabridge, SAS, SPSS) and companies whose text mining sales are heavily concentrated in Voice of the Customer/Market applications.

So the question arises: Just how much linguistic sophistication is needed in these market-trend-oriented text mining applications?

I actually got onto this subject not just because of what David said, but also via a conversation an hour earlier with Brooke Aker of Expert System, who proposed linguistic sophistication as a key reason for beating the competition (which, however, didn’t include Attensity or Clarabridge) at two accounts. The point Brooke was stressing is that it’s important to be able to extract multiple facts or indicators of sentiment from the same sentence. E.g., “I just had a crummy Chevy, but at least the seats were comfortable” is both a negative indicator about Chevrolet and a positive indicator about Chevrolet’s seats. Attensity captures both of those too, and I think Clarabridge would as well. (If you do comprehensive/exhaustive extraction, you extract — well, you should extract comprehensively.)
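As a rough illustration of what "multiple indicators from the same sentence" means, the sketch below splits the Chevy sentence at "but" and pairs each clause's target with its polarity word. This is a toy, not any vendor's approach: the `POLARITY` and `TARGETS` tables and the `extract_indicators` function are hypothetical, and real products rely on far richer linguistic analysis than keyword lookup.

```python
import re

# Hypothetical lookup tables covering just this one example.
POLARITY = {"crummy": "negative", "comfortable": "positive"}
TARGETS = {"chevy": "Chevrolet", "seats": "Chevrolet seats"}

def extract_indicators(sentence):
    """Return one (target, polarity) pair per clause that contains both."""
    indicators = []
    # Treat "but" as a clause boundary so each side is scored separately.
    for clause in re.split(r",?\s+but\s+", sentence.lower()):
        words = re.findall(r"[a-z]+", clause)
        target = next((TARGETS[w] for w in words if w in TARGETS), None)
        polarity = next((POLARITY[w] for w in words if w in POLARITY), None)
        if target and polarity:
            indicators.append((target, polarity))
    return indicators

result = extract_indicators(
    "I just had a crummy Chevy, but at least the seats were comfortable"
)
```

Scoring the sentence as a single unit would have let the two opinions cancel out or blur together. Scoring per clause yields a negative indicator for Chevrolet and a positive one for the seats.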

Anyhow, my first-best answer to the question I posed is:

- Sentiment analysis is hard, at least in venues where you have to deal with slang, metaphor, or irony (the real biggie). The more sophisticated, the better.
- Otherwise, the linguistics of customer/marketing applications is pretty straightforward. Just put together the right list of wacky synonyms, and you’re good to go.

But what do you think?

David Bean of Attensity explains sentiment and other qualifiers

Curt Monash — Sat, 06 Oct 2007 00:36:38 +0000

David Bean of Attensity is rightly one of the most popular explainers of text mining, for his clarity and personality alike. I shot a question to him about how Attensity’s exhaustive extraction strategy handled sentiment and so on. He responded with an email that contains the best overall explanation of sentiment analysis in text mining I’ve seen anywhere. Naturally, this is rolled into an Attensity-specific worldview and sales pitch — but so what?

Our exhaustive extraction approach doesn’t compromise detection of qualifiers* because we recognize the qualifications while we have access to the complete linguistic information of the input. Much of that information is later stripped away, since it’s way more information than a user would want. We make sure we project qualifications like you mention in the final representations. In fact, we’ve put a lot of effort into recognizing “voicing,” i.e. distinguishing among negations, conditional statements, and variations in the degree of sentiment.

Examples will help here:

(1) I want to return the espresso machine. (intention to return)

(2) I plan on returning the espresso machine. (intention to return)

(3) I won’t return the espresso machine. (negation – not a return)

(4) I returned no espresso machines. (negation – not a return)

(5) I failed to return the espresso machine. (negation – not a return)

(6) If you don’t return my phone call, I will return the espresso machine. (conditional – threat to return)

(7) I’ve returned espresso machines twice already. (recurrence – repeated returns)

(8) I tried to return the espresso machine. (attempt to return, negation – not a return)

(9) I failed to return the espresso machine. (failed attempt, negation – not a return)

(10) I refuse to return the espresso machine. (negation – not a return)

(11) I need to return the espresso machine now/asap. (urgency)

(12) I’m unhappy. (unhappy, duh)

(13) I’m really unhappy. (augmented unhappiness)

(14) The tires were over-inflated. (augmented inflation…works on non-sentiment qualities too)

(15) The breakfast was under-cooked. (diminished)

(16) The water in the shower this morning was way too cold. (augmented coldness)

(17) I will speak to the customer about returning the espresso machine. (indefinite – not a return, yet)

If we’re using our Fact Relationship Network style of extraction to look at these sentences, those voicing variations get represented on the mode* (typically), so you’d see output like:

return (intent)
return (not)
return (if/then)
return (again)
return (urgency)
happy (not)
happy (not, augmented)
cooked (diminished)
cold (augmented)

*Editor’s note: “Mode” means, in effect, “behavior or action.” It’s not a typo for “node.”

Post-extraction, any of these voicings can be used to roll up several FRN extractions into a collection that makes sense to the business, e.g. “water | cold (augmented)” and “water | hot (not).” What makes all that possible is that the core engine has access to a great deal of linguistic information before it turns the extraction into a specific type of representation like an FRN. Such linguistic information includes the notions of negating verbs (failed to), double negatives, negative quantifiers that transfer their negation to the verb (no animals were harmed…), adverbial prepositional phrases (I returned the espresso machine in a fit of rage.) and so on. We think that’s a big deal – it lets us get a true count of, in these examples, product returns – not the returns of phone calls, or the threatened returns, the intentional returns, or the non-returns. We used this kind of distinctive power to show a retailer how they could identify customers who were threatening to return products, thereby detecting a set of product returns that could be saved (before they ended up costing the retailer $$$).
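To show why voicing matters for getting a true count, here is a deliberately simple sketch that tags a few of the espresso-machine sentences and then counts only the actual returns. It is not Attensity's engine, which per the email works from full linguistic information; the keyword `RULES`, the `voicing` function, and the "actual" label are all assumptions made for illustration.

```python
import re
from collections import Counter

# Ordered (pattern, voicing) rules; the first match wins.
# These keyword patterns are a stand-in for real linguistic analysis.
RULES = [
    (r"^if\b", "if/then"),
    (r"\b(won't|failed to|refuse to|tried to)\b|\bno\b", "not"),
    (r"\b(now|asap)\b", "urgency"),
    (r"\b(want to|plan on)\b", "intent"),
    (r"\b(twice|again)\b", "again"),
]

def voicing(sentence):
    """Label a sentence about a return with a voicing, or "actual" if none applies."""
    text = sentence.lower()
    for pattern, label in RULES:
        if re.search(pattern, text):
            return label
    return "actual"

sentences = [
    "I want to return the espresso machine.",
    "I won't return the espresso machine.",
    "I returned no espresso machines.",
    "If you don't return my phone call, I will return the espresso machine.",
    "I've returned espresso machines twice already.",
    "I need to return the espresso machine now.",
    "I returned the espresso machine in a fit of rage.",
]

labels = [voicing(s) for s in sentences]
counts = Counter(labels)
# Only plain and repeated returns count as returns that really happened.
true_returns = counts["actual"] + counts["again"]
```

A naive keyword search for "return" would count all seven sentences. With voicing attached, only two of them describe returns that actually took place, and the "if/then" one can be routed to whoever handles customers threatening a return.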