Attensity – Text Technologies

The state of the art in text analytics applications

Curt Monash — Thu, 02 Dec 2010 02:06:54 +0000

Text analytics application areas typically fall into one or more of three broad, often overlapping domains:

Understanding the opinions of customers, prospects, or other groups. This can be based on any combination of documents the user organization controls (email, surveys, warranty reports, call center logs, etc.) or public-domain documents such as blogs, forum posts, and tweets. The former is usually called Voice of the Customer (VotC), while the latter is Voice of the Market (VotM).
Detecting and identifying problems. This can happen across many domains — VotC, VotM, diagnosing equipment malfunctions, identifying bad guys (from terrorists to fraudsters), or even getting early warnings of infectious disease outbreaks.
Aiding text search, custom publishing, and other electronic document-shuffling use cases, often via document augmentation.

For several years, I’ve been distressed at the lack of progress in text analytics or, as it used to be called, text mining. Yes, the rise of sentiment analysis has been impressive, and higher volumes of text data are being processed than were before. But otherwise, there’s been a lot of the same old, same old. Most actual deployed applications of text analytics or text mining go something like this:

A bunch of documents are analyzed to ascertain the ideas expressed in them.
A count is made as to how many times each idea turns up.
The application user notices any surprisingly large numbers and, as a result, pays attention to the corresponding ideas.
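The three steps above can be sketched in a few lines. This is only an illustration of the baseline pattern, not any vendor's method: the concept lexicon, the function names, and the spike thresholds are all invented for the example, and real products do far more than single-word matching.

```python
import re
from collections import Counter

# Hypothetical concept lexicon: canonical idea -> single-word triggers.
CONCEPTS = {
    "battery": {"battery", "charger", "charging"},
    "billing": {"bill", "invoice", "overcharged"},
}

def tag_concepts(text):
    """Return the set of concepts with at least one trigger word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {concept for concept, triggers in CONCEPTS.items() if words & triggers}

def count_concepts(documents):
    """Count how many documents mention each concept."""
    counts = Counter()
    for doc in documents:
        counts.update(tag_concepts(doc))
    return counts

def surprising(current, baseline, min_ratio=3.0, min_count=3):
    """Concepts whose document count grew at least min_ratio over the baseline."""
    flagged = []
    for concept, n in current.items():
        prior = baseline.get(concept, 0)
        # Require a minimum count so tiny numbers do not trigger a flag.
        if n >= min_count and n >= min_ratio * max(prior, 1):
            flagged.append(concept)
    return sorted(flagged)
```

A user would call `count_concepts` on two periods' worth of documents and pass both results to `surprising`. The minimum-count guard is one crude way to avoid reacting to noise in very small numbers.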

Often, it seems desirable to integrate text analytics with business intelligence and/or predictive analytics tools that operate on tabular data. Even so, such integration is most commonly weak or nonexistent. Apart from the usual reasons for silos of automation, I blame this lack on a mismatch in precision, among other reasons. A 500% increase in mentions of a subject could be simple coincidence, or the result of a single identifiable press article. In comparison, a 5% increase in a conventional business metric might be much more important.

But in fairness, the text analytics innovation picture hasn’t been quite as bleak as what I’ve been painting so far. While standalone, passively-reported text analytics is indeed the baseline, there are some interesting exceptions. For example:

I once confirmed that SPSS customer Cablecom’s statistical models for churn and the like absolutely included text data; Cablecom even assigned different weights to the same apparent level of emotion depending on whether the text was in German, French, or Italian. Vertica recently told me of a Vertica/Hadoop customer doing something similar, except for the multilingual aspect. And the end of a 2008 SAS-based paper makes similar claims.
There long* have been some examples of fact extraction that don’t really fit into my three buckets above. For example, researchers mine collections of articles to try to determine biochemical or biological pathways that would not be apparent from examining single research studies alone.
It also has long* been the case that some bad-guy-finding applications — especially in the anti-terrorism area — used text analytics to populate state-of-the-art graph-oriented data analysis tools.
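To make the churn example above concrete, here is a minimal sketch of a model that mixes tabular fields with one text-derived feature and weights that feature by language. Everything in it is an assumption made for illustration: the field names, the coefficients, and the per-language weights are invented, and nothing here reflects Cablecom's actual model.

```python
import math

# Hypothetical per-language weights for the same measured emotion level.
# The values are invented for illustration only.
LANGUAGE_WEIGHT = {"de": 1.0, "fr": 0.8, "it": 0.6}

def churn_probability(months_as_customer, support_calls, emotion_score, language):
    """Logistic churn score combining tabular fields with a text feature.

    emotion_score is assumed to lie in [0, 1], higher meaning angrier text.
    All coefficients below are made up for the sketch.
    """
    text_feature = LANGUAGE_WEIGHT[language] * emotion_score
    z = -1.0 - 0.05 * months_as_customer + 0.4 * support_calls + 2.0 * text_feature
    return 1.0 / (1.0 + math.exp(-z))
```

With these invented weights, the same emotion score raises the churn estimate more for German text than for Italian text. In practice such weights would be fitted from data, not set by hand.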

*When it comes to text analytics, “long” means “at least for the past several years.”

In more recent examples:

Greenplum built a document recommender for law firms that does hard-core statistical analysis to determine which .1% of a document set lawyers might actually want to see, and which then learns from users’ feedback after they respond to initial result sets.
Information extracted from investment news gets included into automated trading algorithms. This was unusual technology a couple of years ago, but is more common today.
After a series of mergers, Attensity now uses marketing-oriented text analytics in at least three different ways:
- Attensity text analytics feeds marketing dashboards just as it always did.
- Attensity text analytics triggers alerts, as I wish dashboards and business intelligence tools more often did, the false positives problem notwithstanding.
- Attensity text analytics triggers concrete workflows, for example routing specific social media hits for priority response.
And in one example that did not actually get into production, a very large social networking company correlated word usage (e.g., choice among different synonyms) against user characteristics such as age and gender.
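The difference between feeding a dashboard, triggering an alert, and triggering a workflow, as in the Attensity uses listed above, can be sketched as a simple routing rule. The `Hit` fields, queue names, and thresholds are hypothetical and chosen only for illustration; they are not Attensity's.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """A social media post after text analytics; all fields are hypothetical."""
    text: str
    topic: str
    sentiment: float  # -1.0 (very negative) to 1.0 (very positive)
    author_followers: int

def route(hit, negative_threshold=-0.5, influencer_followers=10_000):
    """Pick a destination for the hit: dashboard, alert, or priority response."""
    is_negative = hit.sentiment <= negative_threshold
    if is_negative and hit.author_followers >= influencer_followers:
        return "priority_response"  # concrete workflow: someone must reply
    if is_negative:
        return "alert"  # notify, but no workflow is started
    return "dashboard"  # passively reported only
```

Loosening `negative_threshold` produces more alerts and more false positives, which is the trade-off mentioned in the list.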

Finally there are some applications that, while fitting the standard template, just strike me as getting to unusually sophisticated levels of analysis. For example, Vertica told me of another Vertica/Hadoop case where VotM document analysis is carried out to the level of observing which order brand names appear in, and adjusting that for whether or not it was just an alphabetical list.
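A rough sketch of the brand-order idea, under stated assumptions: the brand list is invented, and "adjusting" for alphabetical lists is simplified here to discarding any mention order that happens to be alphabetical. The actual customer analysis was presumably more nuanced.

```python
import re

BRANDS = ["Ford", "Honda", "Toyota"]  # hypothetical brand list

def brand_order(text):
    """Return the known brands in order of first appearance in the text."""
    positions = []
    for brand in BRANDS:
        match = re.search(r"\b" + re.escape(brand) + r"\b", text, re.IGNORECASE)
        if match:
            positions.append((match.start(), brand))
    return [brand for _, brand in sorted(positions)]

def first_mentioned(text):
    """First-mentioned brand, or None when the order carries no clear signal.

    Order is treated as uninformative if fewer than two brands appear, or if
    the brands appear in alphabetical order (possibly just a sorted list).
    """
    order = brand_order(text)
    if len(order) < 2 or order == sorted(order):
        return None
    return order[0]
```

Discarding every alphabetical ordering throws away some genuine signal, since two brands fall in alphabetical order by chance about half the time. A fuller treatment would weight such cases down instead of dropping them.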

I suspect text analytics is about to become more interesting again.

Related links

The enabling technology for text/tabular data integration has existed for years.
In 2006, I listed major application areas for data mining/predictive analytics. That list overlaps pretty closely with the similar list for text mining/text analytics.
Before being acquired by IBM, SPSS boasted a rather large text mining user base.

The new Attensity — deal overview

Curt Monash — Mon, 20 Apr 2009 07:14:09 +0000

A new Attensity Group has been created in a complex set of maneuvers. So far as I understand or guess, elements of the deal include:

The Attensity Group is being formed by the merger of three companies: Attensity, empolis, and Living-e. Frankly, I’d never heard of either empolis or Living-e until this merger. (In case you ever have to resort to the Wayback Machine, empolis’ URL was http://www.empolis.com/home.html and Living-e’s was http://www.living-e.com/us/index.php)
Existing investors (employees aside) have largely been bought out. Most of the stock is owned by Aeris, an investment vehicle for SAP co-founder Klaus Tschira. Living-e already was a Tschira investment.
Inxight managers have been brought in to run the whole thing. Specifically, Ian Bonner will be CEO, and Ian Hersey will be EVP of Products and Technology.
The former CEOs of Attensity and empolis will run the Americas and EMEA regions, under the Attensity and empolis names respectively, apparently with their prior sales organizations more or less intact.
A former CEO of Living-e will be their boss, but also run “Special Projects”, which adds up to a very odd title indeed: “Senior Vice President of Operations and Strategic Projects, Attensity Group”
The former CTOs of Attensity and empolis are CTOs of system software (“Natural Language Processing”) and application software respectively. This gets Attensity’s total CTO count up to 3, a level I’ve previously seen only at Teradata. I haven’t talked with David Bean yet, but his colleagues insist that he’s excited about his new role.
This whole deal has been underway since at least late last year. For example, Ian Bonner has been involved for that long. empolis and Living-e announced the pooling of their sales forces back in February.
Technically, the merger isn’t complete, as Living-e is a public company and not all of its shares have been acquired yet. (But they will be Real Soon Now.)
Attensity, of course, was a venture-backed private company, with tired investors. empolis was owned by Bertelsmann, and was itself a roll-up of several smaller text analytics companies.

I was told on the phone that empolis was doing something like €30-40 million in revenue. Attensity and Living-e were under $10 million each. That surprises me a bit, as I thought Attensity was in that range on commercial business alone, and was doing more than $10 million counting its government accounts.

It turns out that if I had been paying attention to the news filters I could have seen this coming. Specifically, a March 16, 2009 story said:

German media giant Bertelsmann has confirmed the sale of its software development unit empolis to data management holding Attensity empolis Europe.

Attensity empolis Europe is based in Switzerland and part of the holding company founded by former IBM manager Ian Bonner.

A Bertelsmann spokesperson told Handelsblatt that empolis, which develops software for semantic analysis, did not fit in strategically anymore. empolis was part of Bertelsmann subsidiary Arvato employing 200 people with revenues of around €30m.

Attensity’s acquisition of empolis adds to the recent takeover of Living-e which was acquired in December 2008 for a symbolic price of one euro after its former majority shareholder Klaus Tschira, one of the SAP founders, was not willing to invest more money.

Tschira, however, is still intent on investing in Attensity empolis which was part of the agreement on Living-e. His portfolio includes holdings in 26 companies with combined revenues of €200m.

I don’t immediately know how to reconcile the apparent contradictions between that and the information above.

I plan to post with technology/business thoughts when I have a chance.

Enterprise IT experts on Twitter

Curt Monash — Sat, 03 Jan 2009 03:12:49 +0000

It was my birthday yesterday (New Year’s Day), and I remarked on Twitter that I seemed to be getting more automated greetings from message boards and the like than I was getting from real people.* Naturally, a number of folks set out to redress the imbalance :), specifically J A di Paolantonio, Rob Paller, Neil Raden, Claudia Imhoff, Gareth Horton, Donald Farmer, IdaRose Sylvester, and Seth Grimes.

*In retrospect that was a silly comment, made soon after midnight while humans were generally either partying or asleep. But it’s the set-up for the rest of this post.

Sheer self-indulgence aside — “Happy Birthday To Me!!” — I see something blogworthy in that. Indeed, it reflects the emergence over the past 6 months or so of one particular Twitter community. Takeaways include:

1. The responders weren’t a randomly selected subset from among those of my 1304 Twitter followers online when I tweeted. Every person who responded is an industry analyst, a BI expert, or both.

Yes Virginia, there are some enterprise IT folks on Twitter.

2. Members of the community seem to follow each other’s tweetstreams in their entirety. Many of their tweets are in direct reply to or otherwise inspired by each other. Indeed, based on the timing, I suspect a lot more folks were inspired by Neil Raden’s message to me than by my original post.

3. Unlike me, these other folks seem to keep their followee lists small enough to engage with. Following 100 or so people is not uncommon. By way of contrast, I follow 1682 people, which means that despite considerable care about who I follow, I wind up almost never actually checking what the tweetstream contains. (Instead, I usually just tweet something and react to the @replies.)

I no doubt like the charming Claudia Imhoff at least as well as she likes me. Even so, if there were a group of tweets about her birthday, I might well miss it — especially at first — just because I follow too many people to keep up. More on that point in another post (coming soon).

4. Twitter is really just another venue for the evolution of an already-extant community. The independent BI analysts tend to travel as a pack anyway, to venues such as TDWI and Teradata Partners conferences, or to local get-togethers they hold in Colorado.

5. But Twitter does help that community evolve. I’ve really been brought into the club via Twitter. For example, the conversations that led to my teaching at the next TDWI Conference grew out of an email from Wayne Eckerson to the effect “Hi. I follow you on Twitter, and generally read your stuff. Can you help with a particular hardcore DBMS technology question I’ve run into?”

6. Twitter connections are useful. Twitter has made it easier for me to have offline conversations with Claudia, Wayne et al. My user-focused consulting services will be much richer for that.

Six months ago I felt that Twitter was dominated by the “new-age” tech folks: search engine optimizers, podcasters, social media consultants, Web 2.0 gurus and the like. But in one particular enterprise area, business intelligence, traditional IT folks are active as well. Perhaps similar communities will emerge in other areas of IT too.

Maybe text mining SHOULD be playing a bigger role in data warehousing

Curt Monash — Fri, 24 Oct 2008 04:39:36 +0000

When I chatted last week with David Bean of Attensity, I commented to him on a paradox:

Many people think text information is important to analyze, but even so data warehouses don’t seem to wind up holding very much of it.

My working theory explaining this has two parts, both of which purport to show why text data generally doesn’t fit well into BI or data mining systems. One is that it’s just too messy and inconsistently organized. The other is that text corpuses generally don’t contain enough information.

Now, I know that these theories aren’t wholly true, for I know of counterexamples. E.g., while I haven’t written it up yet, I did a call confirming that a recently published SPSS text/tabular integrated data mining story is quite real. Still, it has felt for a while as if truth lies in those directions.

Anyhow, David offered one useful number range:

If you do exhaustive extraction on a text corpus, you wind up with 10-20X as much tabular data as you had in text format in the first place. (Comparing total bytes to total bytes.)

So how big are those corpuses? I think most text mining installations usually have at least 10s of thousands of documents or verbatims to play with. Special cases aside, the upper bound seems to usually be about two orders of magnitude higher. And most text-mined documents probably tend to be short, as they commonly are just people’s reports on a single product/service experience – perhaps 1 KB or so, give or take a factor of 2-3? So we’re probably looking at 10 gigabytes of text at the low end, and a few terabytes at the high end, before applying David’s 10-20X multiplier.

Hmm – that IS enough data for respectable data warehousing …

Obviously, special cases like national intelligence or very broad-scale web surveys could run larger, as per the biggest Marklogic databases. Medline runs larger too.
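The corpus-size estimate in this post is a product of three inputs (document count, average document size, and the extraction multiplier), so it is easy to parameterize. The sketch below assumes decimal units (1 KB = 1,000 bytes). The function names are invented for the example, and the only figure taken from the post is the 10-20X multiplier.

```python
def corpus_bytes(documents, avg_doc_bytes):
    """Raw text volume for a corpus, in bytes."""
    return documents * avg_doc_bytes

def extracted_bytes(raw_bytes, multiplier_low=10, multiplier_high=20):
    """Low and high estimates of tabular volume after exhaustive extraction."""
    return raw_bytes * multiplier_low, raw_bytes * multiplier_high

def human(n_bytes):
    """Format a byte count using decimal units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1000:
            return f"{n_bytes:g} {unit}"
        n_bytes /= 1000
    return f"{n_bytes:g} PB"
```

For instance, ten million documents of 1 KB each come to about 10 GB of raw text, or 100 to 200 GB of extracted tabular data. Each factor of ten in document count or average document size moves the result by the same factor, so those two inputs dominate the estimate far more than the choice between 10X and 20X.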

Attensity update

Curt Monash — Fri, 24 Oct 2008 04:29:24 +0000

I had a brief chat with the Attensity guys at their Teradata Partners Conference booth – mainly CTO David Bean, although he did buck one question to sales chief Jeff Johnson. The business trends story remained the same as it was in June: The sweet spot for new sales remains Voice of the Customer/Voice of the Market, while on-premise/SaaS new-name accounts are split around 50-50 (by number, not revenue).

David’s thoughts as to why the SaaS share isn’t even higher – as it seems to be for Clarabridge* – centered on the point that some customers want to blend internal and external data, and may not want to ship the internal part out to a SaaS provider. Besides, if it’s tabular data, I suspect Attensity isn’t the right place to ship it anyway.

*Speaking of Clarabridge, CEO Sid Banerjee recently posted a thoughtful company update in this comment thread.

When I challenged him on ease of use, David said that Attensity is readying a Microstrategy-based offering, which is obviously meant to compete with Clarabridge and any of its perceived advantages head-on.

The layered messaging marketing model as applied to Attensity

Curt Monash — Mon, 08 Sep 2008 06:52:15 +0000

My general layered messaging theory survived its first test against an IT vendor example – Netezza. Let’s try another, in this case a company that’s not a Monash Research client.

Attensity is a text mining vendor with a lot of cool technology. Like other text mining vendors, it’s had mixed market success at best. However, sales activity suggests that Attensity recently put together its strongest marketing story ever, specifically in its new Voice of the Customer / Voice of the Market (VotC/VotM) focus.

Attensity Voice of the Market messaging stack

Know what real consumers think about your products/services, how they react to your marketing, and what stories are being told about you
The only way to listen in on actual consumer conversations. Humans can’t begin to do this.
Mine the Web to find out what’s being said about you; easy SaaS install
See – here are real, usable results
Extraction of the essence from any kind of text, as exhibited via proofs-of-concept

That’s a good story. The technology works. Prospects can see that it works. The benefits are self-evident, because the technology gives unique access to highly desirable information. (Obviously, you can’t have employees sit at their screens and try to read the whole Web on your behalf.) The cost, time to installation, and so on are attractive. All is good.

Let’s now compare that to what probably was Attensity’s prior commercial focus, warranty analysis, for products like automobiles, other vehicles, and consumer electronics. In this market, the story was something like:

Attensity warranty messaging stack

Faster, more accurate warning of product problems
Human reading of the warranty claims is too slow or costly
Mine your warranty claims to see why your products break
See – here are real, usable results
Extraction of the essence from warranty claims, as exhibited via proofs-of-concept

That worked up to a point, which is a big part of why Attensity remained in business. But in fact, there were relatively few customers for whom the assertion “Human reading of the warranty claims is too slow or costly” was true. So relatively few sales on that basis were ever made.

Now, as a market-success-prediction tool, this kind of analysis may seem like overkill. In essence, all I’ve done is reiterate:

Text mining has shown slow growth because too few customers had internal corpuses large enough to need it.
If you’re mining the whole Web, however, your corpus is enormous.

But this analysis has another point. There’s a text mining industry consensus saying, more or less:

The text mining industry used to be too focused on the minutiae of technology and especially semantics, but now we’ve seen the light and are selling straight to business users who don’t really care about how the stuff works.

As with most views held by a broad consensus of smart people, that one contains a lot of truth. But it’s missing a next act. Whether or not Attensity, Clarabridge, and TEMIS get acquired soon – as most industry participants seem to expect – it seems inevitable that there will be large, technology-rich contenders in the text mining market. SAP/Business Objects/Inxight? Oracle/somebody? The enterprise search players? Dow Jones/Factiva? One way or another, there will eventually be big companies in the text mining market. Attensity (and the same goes for Clarabridge) isn’t doing much these days to position itself in advance of such an onslaught.

Anyhow, whatever you think of my market-evolution views, it sure seems as if the layered-messaging template works in this example as well.

Attensity update updated

Curt Monash — Tue, 17 Jun 2008 03:41:13 +0000

I chatted a bit with Attensity’s CTO David Bean and sales VP Jeff Johnson yesterday at the Text Analytics Summit. Jeff confirmed what his colleagues had already told me: most of the action is now in Voice of the Customer/Market, he expects a very strong June quarter, etc. But one thing I posted last week wasn’t quite right. Hosted implementations (i.e., SaaS) haven’t yet reached the 50% level at Attensity. However, they are indeed growing fast, and they’re all (or almost all) in the Voice of the Customer/Market area.

How much linguistic sophistication is needed in Voice of the Customer/Market applications?

Curt Monash — Wed, 11 Jun 2008 11:54:15 +0000

According to Attensity CTO David Bean:

Voice of the Customer/Market applications require less linguistic sophistication than other text mining applications.
Hence, Voice of the Customer/Market apps are easier to get running than other text mining applications, which he conjectures is a big part of the reason for burgeoning sales.

I’m guessing most text mining vendors would agree with those views, although they might not agree with his elaborations, which include:

Attensity’s knowledge extraction technology is more sophisticated than Clarabridge’s or most other competitors’.
In particular, Clarabridge’s extraction is little more than bag-of-words.
There’s a good match between companies he thinks have less-sophisticated extraction (e.g., Clarabridge, SAS, SPSS) and companies whose text mining sales are heavily concentrated in Voice of the Customer/Market applications.

So the question arises: Just how much linguistic sophistication is needed in these market-trend-oriented text mining applications?

I actually got onto this subject not just because of what David said, but also via a conversation an hour earlier with Brooke Aker of Expert System, who proposed linguistic sophistication as a key reason for beating the competition (which, however, didn’t include Attensity or Clarabridge) at two accounts. The point Brooke was stressing is that it’s important to be able to extract multiple facts or indicators of sentiment from the same sentence. E.g., “I just had a crummy Chevy, but at least the seats were comfortable” is both a negative indicator about Chevrolet and a positive indicator about Chevrolet’s seats. Attensity captures both of those too, and I think Clarabridge would as well. (If you do comprehensive/exhaustive extraction, you extract ... well, you should extract comprehensively.)
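The Chevy sentence above shows why sentence-level scoring is not enough. A toy sketch of clause-level extraction follows; it is not how Expert System, Attensity, or Clarabridge actually work. The lexicons are tiny and invented, clause splitting is naive, and the mapping of "seats" to "Chevrolet seats" is hard-coded where a real system would have to resolve it from context.

```python
import re

# Tiny hypothetical lexicons; real systems use far richer resources.
NEGATIVE = {"crummy", "awful", "broken"}
POSITIVE = {"comfortable", "great", "reliable"}
# Hard-coded target mapping (an assumption; no real coreference resolution).
TARGETS = {"chevy": "Chevrolet", "seats": "Chevrolet seats"}

def extract_sentiments(sentence):
    """Return (target, polarity) pairs, at most one per rough clause."""
    results = []
    # Split on commas and contrastive conjunctions to get rough clauses.
    for clause in re.split(r",|\bbut\b|\balthough\b", sentence.lower()):
        words = re.findall(r"[a-z]+", clause)
        target = next((TARGETS[w] for w in words if w in TARGETS), None)
        if target is None:
            continue
        if any(w in NEGATIVE for w in words):
            results.append((target, "negative"))
        elif any(w in POSITIVE for w in words):
            results.append((target, "positive"))
    return results
```

On the example sentence this yields one negative indicator for Chevrolet and one positive indicator for Chevrolet seats, where a whole-sentence score would blur the two. Slang, metaphor, and irony are well beyond a sketch like this.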

Anyhow, my first-best answer to the question I posed is:

Sentiment analysis is hard, at least in venues where you have to deal with slang, metaphor, or irony (the real biggie). The more sophisticated, the better.
Otherwise, the linguistics of customer/marketing applications is pretty straightforward. Just put together the right list of wacky synonyms, and you’re good to go.

But what do you think?

5 ideas for how to pick between Attensity and Clarabridge

Curt Monash — Tue, 10 Jun 2008 23:43:51 +0000

Jim D. of UPS asked in the comment thread to the recent Attensity update post how one should decide between Attensity and Clarabridge. I wrote an answer, and then decided to just split it out in a separate post. Here are five ideas about how to pick between Attensity and Clarabridge for the kind of Voice of the Customer/Market application both companies are focusing on.

1. Attensity is an older company than Clarabridge, and is good at more things. Is Clarabridge really good at everything you want them to be?

2. In particular, Attensity has more overall sophistication at linguistic extraction. Do any of the differences matter to you?

3. Both companies are working hard on ease of use, for multiple kinds of user (business user tweaking linguistic rules, IT user, etc.). Whose approach and feature set do you like better?

4. Usually, buying one of these products involves some professional services. Whose organization do you like better?

5. Attensity’s default database schema for its exhaustive extraction is pretty flat and normalized, as befits a happy Teradata partner. Clarabridge’s is more of a star schema, as befits a bunch of ex-Microstrategy guys. Either can be straightforwardly translated into the other, so you may not care — but do you?
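To illustrate why translating between a flat layout and a star schema is straightforward (point 5), here is a minimal sketch. Neither vendor's real schema is shown: the three-column row shape `(doc_id, entity, sentiment)` and the function names are assumptions made purely for the example.

```python
def to_star(flat_rows):
    """Split flat (doc_id, entity, sentiment) rows into dimensions and facts."""
    entity_dim, sentiment_dim, facts = {}, {}, []
    for doc_id, entity, sentiment in flat_rows:
        # Assign a new surrogate key the first time each value is seen.
        entity_key = entity_dim.setdefault(entity, len(entity_dim) + 1)
        sentiment_key = sentiment_dim.setdefault(sentiment, len(sentiment_dim) + 1)
        facts.append((doc_id, entity_key, sentiment_key))
    return entity_dim, sentiment_dim, facts

def to_flat(entity_dim, sentiment_dim, facts):
    """Join the fact table back to its dimensions to recover the flat rows."""
    entity_by_key = {key: value for value, key in entity_dim.items()}
    sentiment_by_key = {key: value for value, key in sentiment_dim.items()}
    return [(d, entity_by_key[e], sentiment_by_key[s]) for d, e, s in facts]
```

The round trip is lossless, so in this simplified picture the choice between the two layouts comes down to which one your query tools and DBMS handle better.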

Is text analytics a good technology career path for humanities majors?

Curt Monash — Tue, 10 Jun 2008 11:26:30 +0000

One of the major dilemmas facing a group of people we all know is: How can humanities majors make money? Sure, they can become lawyers. And they can join the tech industry and write documentation. But what else?

Well, what about text analytics? Much of what I know about natural language processing (NLP) I learned from my friend Sharon Flank, who I met when she was a Slavic Linguistics PhD student at Harvard. My partner in first figuring out search engines — and later in running Elucidate — was my wife Linda Barlow, a 15-times-published novelist who’s also taught English at the college level. And Olivier Jouve’s education is in paleontology, although whether or not that’s a humanity is a sort of borderline definitional issue.

So I ask you all: Is text analytics a fruitful area for humanities majors to find lucrative careers? All insight would be appreciated. If the news is good enough, I’ll do my part in publicizing it to university placement offices and the like.

I’ve started out by asking Attensity (David Bean) and Clarabridge (Sid Banerjee). Attensity turns out to hire humanities students most years, both as full-time employees and interns. Linguistics students are the top priority, but language students and other language-friendly types are of interest as well. David is even involved in trying to set up a computational linguistics certification program at the university where he teaches part-time. And Clarabridge, the much younger company of the two, has over the past year used humanities majors quite successfully as well, for multiple aspects of ontology-building.