Text mining – Text Technologies

The state of the art in text analytics applications

Curt Monash — Thu, 02 Dec 2010 02:06:54 +0000

Text analytics application areas typically fall into one or more of three broad, often overlapping domains:

Understanding the opinions of customers, prospects, or other groups. This can be based on any combination of documents the user organization controls (email, surveys, warranty reports, call center logs, etc.) or public-domain documents such as blogs, forum posts, and tweets. The former is usually called Voice of the Customer (VotC), while the latter is Voice of the Market (VotM).
Detecting and identifying problems. This can happen across many domains — VotC, VotM, diagnosing equipment malfunctions, identifying bad guys (from terrorists to fraudsters), or even getting early warnings of infectious disease outbreaks.
Aiding text search, custom publishing, and other electronic document-shuffling use cases, often via document augmentation.

For several years, I’ve been distressed at the lack of progress in text analytics or, as it used to be called, text mining. Yes, the rise of sentiment analysis has been impressive, and higher volumes of text data are being processed than were before. But otherwise, there’s been a lot of the same old, same old. Most actual deployed applications of text analytics or text mining go something like this:

A bunch of documents are analyzed to ascertain the ideas expressed in them.
A count is made as to how many times each idea turns up.
The application user notices any surprisingly large numbers, and as a result pays attention to the corresponding ideas.
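
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's method: the keyword-to-idea map `IDEA_KEYWORDS`, the function names, the sample documents, and the spike threshold are all hypothetical, and real products use far richer linguistic extraction than substring matching.

```python
from collections import Counter

# Hypothetical keyword-to-idea map; real tools use linguistic extraction instead.
IDEA_KEYWORDS = {
    "billing": ["invoice", "overcharged", "bill"],
    "battery": ["battery", "drains"],
}

def count_ideas(documents):
    """Steps 1 and 2: find which ideas each document mentions and count them."""
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        for idea, keywords in IDEA_KEYWORDS.items():
            if any(k in text for k in keywords):
                counts[idea] += 1
    return counts

def surprising(current, baseline, ratio=3.0):
    """Step 3: flag ideas whose count is at least `ratio` times the baseline."""
    return sorted(
        idea for idea, n in current.items()
        if n >= ratio * max(baseline.get(idea, 0), 1)
    )

# Made-up sample data for two reporting periods.
last_week = ["My bill was fine.", "Battery died quickly."]
this_week = [
    "I was overcharged on my invoice.",
    "The bill is wrong again.",
    "Wrong bill!",
    "Battery drains fast.",
]

print(surprising(count_ideas(this_week), count_ideas(last_week)))  # ['billing']
```

Note that the final step is still a human one: the code only surfaces the large number, and somebody has to decide whether it matters.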

Often, it seems desirable to integrate text analytics with business intelligence and/or predictive analytics tools that operate on tabular data. Even so, such integration is most commonly weak or nonexistent. Apart from the usual reasons for silos of automation, I blame this lack on a mismatch in precision, among other reasons. A 500% increase in mentions of a subject could be simple coincidence, or the result of a single identifiable press article. In comparison, a 5% increase in a conventional business metric might be much more important.

But in fairness, the text analytics innovation picture hasn’t been quite as bleak as what I’ve been painting so far. While standalone, passively-reported text analytics is indeed the baseline, there are some interesting exceptions. For example:

I once confirmed that SPSS customer Cablecom’s statistical models for churn and the like absolutely included text data; Cablecom even assigned different weights to the same apparent level of emotion depending on whether the text was in German, French, or Italian. Vertica recently told me of a Vertica/Hadoop customer doing something similar, except for the multilingual aspect. And the end of a 2008 SAS-based paper makes similar claims.
There long* have been some examples of fact extraction that don’t really fit into my three buckets above. For example, researchers mine collections of articles to try to determine biochemical or biological pathways that would not be apparent from examining single research studies alone.
It also has long* been the case that some bad-guy-finding applications — especially in the anti-terrorism area — used text analytics to populate state-of-the-art graph-oriented data analysis tools.

*When it comes to text analytics, “long” means “at least for the past several years.”

In more recent examples:

Greenplum built a document recommender for law firms that does hard-core statistical analysis to determine which 0.1% of a document set lawyers might actually want to see, and which then learns from users’ feedback after they respond to initial result sets.
Information extracted from investment news gets included into automated trading algorithms. This was unusual technology a couple of years ago, but is more common today.
After a series of mergers, Attensity now uses marketing-oriented text analytics in at least three different ways:
- Attensity text analytics feeds marketing dashboards just as it always did.
- Attensity text analytics triggers alerts, as I wish dashboards and business intelligence tools more often did, the false positives problem notwithstanding.
- Attensity text analytics triggers concrete workflows, for example routing specific social media hits for priority response.
And in one example that did not actually get into production, a very large social networking company correlated word usage (e.g., choice among different synonyms) against user characteristics such as age and gender.

Finally there are some applications that, while fitting the standard template, just strike me as getting to unusually sophisticated levels of analysis. For example, Vertica told me of another Vertica/Hadoop case where VotM document analysis is carried out to the level of observing which order brand names appear in, and adjusting that for whether or not it was just an alphabetical list.
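
A rough sketch of what such an adjustment might look like follows. It is an illustration only, not the customer's actual method: the brand list, the function names, and the crude rule of discarding any document whose brands appear in alphabetical order are all assumptions.

```python
from collections import Counter

BRANDS = ["Acme", "Bolt", "Zenith"]  # hypothetical brand names

def brand_order(text):
    """Return known brands in order of first appearance in the text."""
    lowered = text.lower()
    positions = [(lowered.find(b.lower()), b) for b in BRANDS]
    return [b for pos, b in sorted(positions) if pos >= 0]

def is_alphabetical(brands):
    """True when two or more brands appear in plain alphabetical order."""
    return len(brands) > 1 and brands == sorted(brands)

def first_mention_counts(documents):
    """Count which brand is named first, skipping alphabetical listings."""
    counts = Counter()
    for doc in documents:
        order = brand_order(doc)
        if order and not is_alphabetical(order):
            counts[order[0]] += 1
    return counts
```

The rule is deliberately blunt: a writer who happens to name two brands in alphabetical order is thrown out along with the genuine lists, so a production version would presumably need a statistical correction rather than a hard filter.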

I suspect text analytics is about to become more interesting again.

Related links

The enabling technology for text/tabular data integration has existed for years.
In 2006, I listed major application areas for data mining/predictive analytics. That list overlaps pretty closely with the similar list for text mining/text analytics.
Before being acquired by IBM, SPSS boasted a rather large text mining user base.

Notes, links, and comments, October 24, 2010

Curt Monash — Sun, 24 Oct 2010 08:58:25 +0000

Time for a notes/links/comments post just for Text Technologies:

TechCrunch got sold, GigaOm raised money, and VentureBeat/MediaBeat provided a good starting link for both those stories and more. Since TechCrunch and GigaOm are/were both private, financial details are murky, but:
- TechCrunch is variously reported as having revenue in the $6-10 million range, probably mainly from events. (If you believe that they sell ~3000 total tickets at ~$2000 each to two annual versions of TechCrunch Disrupt, that makes sense.)
- GigaOm reports >10,000 subscribers to market research service (sort of) GigaOm Pro, at $199, apparently concentrated on the vendor side.
John Gruber straightforwardly posts both ad rates and circulation for his blog. It’s a simple $5000/week for readership that exceeds mine by >1 order of magnitude.
The New Yorker points out Gawker Media may not yet have crossed $20 million in revenue.
An “ASCAP for news” seems to finally be on the way.
Business Week/Bloomberg notices a trend that social-media/Voice of the Customer/Voice of the Market text analytics firms are getting acquired by bigger marketing-oriented firms. Seth Grimes, however, argues that the same trend is already passé.
TechCrunch accused the Wall Street Journal of killing a story about sister company MySpace, then quickly running it after TechCrunch caught them.
LinkedIn has a really cool-looking tech blog. One recent post describes LinkedIn’s approach to socially-informed search. I read about it in a thoughtful post on Daniel Tunkelang’s blog.
Bill Simmons took 3843 words to explain the story of a two-word tweet — “moss Vikings.” Somewhere in there are a few interesting ruminations about media in the current age.
Some notes and links that actually belong here instead went up on DBMS 2 a few weeks ago.
About half of what I write about liberty and privacy is highly relevant to the subjects of this blog, including almost all of today’s post.

Maybe text mining SHOULD be playing a bigger role in data warehousing

Curt Monash — Fri, 24 Oct 2008 04:39:36 +0000

When I chatted last week with David Bean of Attensity, I commented to him on a paradox:

Many people think text information is important to analyze, but even so data warehouses don’t seem to wind up holding very much of it.

My working theory explaining this has two parts, both of which purport to show why text data generally doesn’t fit well into BI or data mining systems. One is that it’s just too messy and inconsistently organized. The other is that text corpuses generally don’t contain enough information.

Now, I know that these theories aren’t wholly true, for I know of counterexamples. E.g., while I haven’t written it up yet, I did a call confirming that a recently published SPSS text/tabular integrated data mining story is quite real. Still, it has felt for a while as if truth lies in those directions.

Anyhow, David offered one useful number range:

If you do exhaustive extraction on a text corpus, you wind up with 10-20X as much tabular data as you had in text format in the first place. (Comparing total bytes to total bytes.)

So how big are those corpuses? I think most text mining installations usually have at least 10s of thousands of documents or verbatims to play with. Special cases aside, the upper bound seems to usually be about two orders of magnitude higher. And most text-mined documents probably tend to be short, as they commonly are just people’s reports on a single product/service experience – perhaps 1 KB or so, give or take a factor of 2-3? So we’re probably looking at 10 gigabytes of text at the low end, and a few terabytes at the high end, before applying David’s 10-20X multiplier.
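
To make the effect of the multiplier concrete, here is a small worked calculation in Python. It takes the raw-text estimates above as given (10 gigabytes at the low end; "a few terabytes" is assumed here to mean 3 TB) and applies the 10-20X range. The function name and the use of decimal units are illustrative choices.

```python
GB = 10**9   # decimal units, for simplicity
TB = 10**12

def tabular_range(text_bytes, low_mult=10, high_mult=20):
    """Return (min, max) tabular bytes after exhaustive extraction."""
    return text_bytes * low_mult, text_bytes * high_mult

low_end = tabular_range(10 * GB)   # 10 GB of raw text
high_end = tabular_range(3 * TB)   # "a few terabytes", assumed to be 3 TB

print(f"low end:  {low_end[0] / GB:.0f}-{low_end[1] / GB:.0f} GB")
print(f"high end: {high_end[0] / TB:.0f}-{high_end[1] / TB:.0f} TB")
```

Under those assumptions, the extracted tabular data comes to roughly 100-200 GB at the low end and 30-60 TB at the high end.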

Hmm – that IS enough data for respectable data warehousing …

Obviously, special cases like national intelligence or very broad-scale web surveys could run larger, as per the biggest Marklogic databases. Medline runs larger too.

Attensity update

Curt Monash — Fri, 24 Oct 2008 04:29:24 +0000

I had a brief chat with the Attensity guys at their Teradata Partners Conference booth – mainly CTO David Bean, although he did buck one question to sales chief Jeff Johnson. The business trends story remained the same as it was in June: The sweet spot for new sales remains Voice of the Customer/Voice of the Market, while on-premise/SaaS new-name accounts are split around 50-50 (by number, not revenue).

David’s thoughts as to why the SaaS share isn’t even higher – as it seems to be for Clarabridge* – centered on the point that some customers want to blend internal and external data, and may not want to ship the internal part out to a SaaS provider. Besides, if it’s tabular data, I suspect Attensity isn’t the right place to ship it anyway.

*Speaking of Clarabridge, CEO Sid Banerjee recently posted a thoughtful company update in this comment thread.

When I challenged him on ease of use, David said that Attensity is readying a Microstrategy-based offering, which is obviously meant to compete with Clarabridge and any of its perceived advantages head-on.

Low-latency text mining in the investment market

Curt Monash — Fri, 19 Sep 2008 09:15:58 +0000

I’m not at Gartner’s Event Processing conference, but there seem to be some interesting posts and articles coming out of it. Seth Grimes has one on Reuters’ integration of text mining and event processing, including sentiment analysis. Well worth reading. Lots more detail than I’ve ever posted on similar applications.

The layered messaging marketing model as applied to Attensity

Curt Monash — Mon, 08 Sep 2008 06:52:15 +0000

My general layered messaging theory survived its first test against an IT vendor example – Netezza. Let’s try another, in this case a company that’s not a Monash Research client.

Attensity is a text mining vendor with a lot of cool technology. Like other text mining vendors, it’s had mixed market success at best. However, sales activity suggests that Attensity recently put together its strongest marketing story ever, specifically in its new Voice of the Customer / Voice of the Market (VotC/VotM) focus.

Attensity Voice of the Market messaging stack

Know what real consumers think about your products/services, how they react to your marketing, and what stories are being told about you
The only way to listen in on actual consumer conversations. Humans can’t begin to do this.
Mine the Web to find out what’s being said about you; easy SaaS install
See – here are real, usable results
Extraction of the essence from any kind of text, as exhibited via proofs-of-concept

That’s a good story. The technology works. Prospects can see that it works. The benefits are self-evident, because the technology gives unique access to highly desirable information. (Obviously, you can’t have employees sit at their screens and try to read the whole Web on your behalf.) The cost, time to installation, and so on are attractive. All is good.

Let’s now compare that to what probably was Attensity’s prior commercial focus, warranty analysis, for products like automobiles, other vehicles, and consumer electronics. In this market, the story was something like:

Attensity warranty messaging stack

Faster, more accurate warning of product problems
Human reading of the warranty claims is too slow or costly
Mine your warranty claims to see why your products break
See – here are real, usable results
Extraction of the essence from warranty claims, as exhibited via proofs-of-concept

That worked up to a point, which is a big part of why Attensity remained in business. But in fact, there were relatively few customers for whom the assertion “Human reading of the warranty claims is too slow or costly” was true. So relatively few sales on that basis were ever made.

Now, as a market-success-prediction tool, this kind of analysis may seem like overkill. In essence, all I’ve done is reiterate:

Text mining has shown slow growth because too few customers had internal corpuses large enough to need it.
If you’re mining the whole Web, however, your corpus is enormous.

But this analysis has another point. There’s a text mining industry consensus saying, more or less:

The text mining industry used to be too focused on the minutiae of technology and especially semantics, but now we’ve seen the light and are selling straight to business users who don’t really care about how the stuff works.

As with most views held by a broad consensus of smart people, that one contains a lot of truth. But it’s missing a next act. Whether or not Attensity, Clarabridge, and TEMIS get acquired soon – as most industry participants seem to expect – it seems inevitable that there will be large, technology-rich contenders in the text mining market. SAP/Business Objects/Inxight? Oracle/somebody? The enterprise search players? Dow Jones/Factiva? One way or another, there will eventually be big companies in the text mining market. Attensity (and the same goes for Clarabridge) isn’t doing much these days to position itself in advance of such an onslaught.

Anyhow, whatever you think of my market-evolution views, it sure seems as if the layered-messaging template works in this example as well.

Lexalytics has merged with part of Infonic

Curt Monash — Thu, 07 Aug 2008 19:59:01 +0000

As reported on the Lexalytics blog, sentiment analysis specialist Lexalytics has merged with the text analytics division of Infonic to form Lexalytics Limited. The deal seems to have a screwy financial structure — which Seth Grimes made a valiant effort to decipher (I think from vacation, poor guy) — as is common when companies much too small to be public wind up trading publicly anyway.

If you think sentiment analysis technology can detect idiom, I have a bridge I’d like to sell you

Curt Monash — Fri, 20 Jun 2008 11:40:52 +0000

Text mining tools are just WONDERFUL at detecting idiom, sarcasm, and figurative speech … Yeah, right. I asked Lexalytics CEO Jeff Catlin whether his tool could do that kind of thing, and he looked at me like I’d just grown a third ear.

Actually, he didn’t. But just like every other sentiment analysis vendor I encountered at the Text Analytics Summit or spoke to beforehand, he made it clear that his tool could only handle straightforward, literal expressions of opinion. Idiom, irony, sarcasm, metaphor, et al. are beyond the current reach of the technology.

Aren’t you just thrilled that I shared that earth-shattering news with you?

6 trends that could shake up the text analytics market

Curt Monash — Thu, 19 Jun 2008 08:33:31 +0000

My last two posts were based on the introductory slide to my talk The Text Analytics Marketplace: Competitive landscape and trends. I’ll now jump straight ahead to the talk’s conclusion.

Text analytics vendors participate in the same trends as other software and technology vendors. For example, relational business intelligence and data warehousing products are increasingly being sold to departmental buyers. Those buyers place particularly high value on ease of installation. And golly gee whiz, both parts of that are also true in text mining.

But beyond such general trends, I’ve identified six developments that I think could radically transform the text analytics market landscape. Indeed, they could invalidate the neat little eight-bucket categorization I laid out in the prior post. Each is highly likely to occur, although in some cases the timing remains greatly in doubt.

These six market-transforming trends are:

Web/enterprise/messaging integration
BI integration
Universal message retention
Portable personal profiles
Electronic health records
Voice command & control

I’ll explain briefly.

1. Google and Microsoft are two of the three leaders in web search. Now that Microsoft has bought FAST, they are also two of the leaders in enterprise search. They are also two of the leaders in hosted email. Ditto instant messaging. So there’s a good chance these various disciplines will converge.

2. There are a number of ways text analytics and traditional analytics can be, and are being, integrated:

Enterprise search and business intelligence are akin; both involve digging information out of the data you already have.
Text mining is naturally integrated with business intelligence and/or data mining.
There’s a trend toward using text search to dig up business intelligence documents such as specific reports, spreadsheets, etc.

To date the last of these is focused on reports that already exist, rather than queries that could be run on the fly, but I hope and trust the technology will be extended over time. Natural language queries have merit anyway; I’d like to see the search box be extended in functionality to a true data-retrieval command line.

3. One of the big purchase drivers of storage, search, and clustering technology is mandates to preserve information and make it available to auditors, regulators, and/or people who want to sue you. Email in particular is changing from being ephemeral to becoming part of the permanent record. Well, if the information is being retained anyway, then maybe it’s time to see how to get useful insight from it.

Right now, a company’s overall text archives aren’t being leveraged in the same way data warehouses are. That will change.

4. For over a decade, online companies have fought to exploit the fact that users were registered with their sites or services, but not with others. Huge amounts of investment money were wasted in the dot-com bubble because people thought “registered users” was a significant metric, or that ISP subscribers could be directed to proprietary content. Enormous valuations are being assigned to Facebook and LinkedIn on similar theories today.

But as site owners and other marketers get ever more aggressive about exploiting user-specific information, users will get ever more sophisticated about controlling it. The obvious solution is for each internet user to control a sophisticated database of their contact information, presence information, actions, preferences, and writings, and to be very selective about which online services are allowed to see which portions of the data. I think that will come about some day, but I don’t know when. When it does, text analytics will be affected in a variety of interesting ways.

5. Electronic health records are almost unique in IT. What other enterprise app can you think of for which relational DBMS aren’t the default underpinning? (InterSystems’ object-oriented DBMS Caché has huge share in the clinical records market.) Normal tabular data, text, images, sensor output streams – health records have it all. What’s more, the health records area is coming upon some very interesting times in the area of data sharing, at least in the US.

Just as retailing went from being an IT backwater (through the mid-1980s), to a sophisticated user of database technology (1990s), to the leader of the internet revolution (rise of e-commerce), I think health care is due to take a leadership role in IT advances. And when it does, search, text mining, and voice recognition will all play important roles.

6. Most people reading this far have probably watched Star Trek. Well, what is keeping us from being able to command computers in a Star Trek fashion? Not really that much. Sure, there are some big missing pieces. We need a mapping from commands to the specific applications that would carry them out. We also need a more structured kind of analytic middle tier so that there’s something to map questions to. But those are solvable problems. And by the way – when everybody wears headphones, voice commands emanating from the next cubicle are no longer the big annoyance they would be today. Mobile/small devices only add to the business case for voice recognition advances.

When voice becomes a primary mode of human/device communication, “text” analytics will be affected in any number of ways.

Related links:

The introductory post in this series
19 possible Microsoft/Yahoo synergies, many of them related to text technology convergence, e.g. between web search and enterprise search
The compelling case for letting Google handle your enterprise email
An old post on why BI vendors flocked to integrate with Google OneBox
A proposal to refactor social networks
An old post in which I outlined some of the criteria for Profiles 2.0
Why text technologies are going to recombine (in A World of Bytes)

The Text Analytics Marketplace: Competitive landscape and trends

Curt Monash — Thu, 19 Jun 2008 07:35:39 +0000

As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are off-shoots of what used to be called “information retrieval”:

1. Web search

2. Public-facing site search

3. Enterprise search and knowledge management

4. Custom publishing

5. Text mining and extraction

Three are more standalone:

6. Spam filtering

7. Voice recognition

8. Machine translation

This list comes from a talk I gave Monday at the Text Analytics Summit called The Text Analytics Marketplace: Competitive landscape and trends. In half an hour, I covered the first five areas (in Sue Feldman’s word, at a “gallop”). The slide deck has been uploaded to the link below. I plan to break out the material from the talk into a series of blog posts over the next few (or perhaps not-so-few) weeks.

Slides:

The Text Analytics Marketplace: Competitive landscape and trends

Other posts based on those slides:

Three specialized markets for text analytics (based on Slide 2)
6 trends that could shake up the text analytics market (based on Slide 19)
Why search technologies are going to recombine (in A World of Bytes, based on Slide 19)