IBM and UIMA
Analysis of IBM’s efforts in text analytics, especially its UIMA interoperability technology and proposed standard. Related subjects include:
- (in DBMS2) IBM in the database, middleware, and analytic technology markets
- (in The Monash Report) Operational and strategic issues for IBM
- (in Software Memories) Historical notes on IBM
The newsletter/column excerpted below was originally published in 1998. Some of the specific references are obviously very dated. But the general points about the requirements for successful natural language computer interfaces still hold true. Less progress has been made in the intervening decade-plus than I would have hoped, but some recent efforts — especially in the area of search-over-business-intelligence — are at least mildly encouraging. Emphasis added.
Natural language computer interfaces were introduced commercially about 15 years ago*. They failed miserably.
*I.e., the early 1980s
For example, Artificial Intelligence Corporation’s Intellect was a natural language DBMS query/reporting/charting tool. It was actually a pretty good product. But it’s infamous among industry insiders as the product for which IBM, in one of its first software licensing deals, got about 1700 trial installations — and less than a 1% sales close rate. Even its successor, Linguistic Technologies’ English Wizard*, doesn’t seem to be attracting many customers, despite consistently good product reviews.
*These days (i.e., in 2009) it’s owned by Progress and called EasyAsk. It still doesn’t seem to be selling well.
Another example was HAL, the natural language command interface to 1-2-3. HAL is the product that first made Bill Gross (subsequently the founder of Knowledge Adventure and idealab!) and his brother Larry famous. However, it achieved no success*, and was quickly dropped from Lotus’ product line.
*I loved the product personally. But I was sadly alone.
In retrospect, it’s obvious why natural language interfaces failed. First of all, they offered little advantage over the forms-and-menus paradigm that dominated enterprise computing in both the online-character-based and client-server-GUI eras. If you couldn’t meet an application need with forms and menus, you couldn’t meet it with natural language either.
Marie Wallace of IBM wrote back in response to my post on LanguageWare. In particular, it seems I got the LanguageWare/UIMA relationship wrong. Marie’s email was long and thoughtful enough that, rather than just pointing her at the comment thread, I asked for permission to repost it. Here goes:
Thanks for your mention of LanguageWare on your blog, albeit a skeptical one. I totally understand your scepticism, as there is so much talk about text analytics these days and everyone believes they have solved the problem. I guess I can only hope that our approach will indeed prove to be different and offer some new and interesting perspectives.
The key differentiation in our approach is that we have completely decoupled the language model from the code that runs the analysis. This has been generalized to a set of data-driven algorithms that apply across many languages so that you can have an approach that makes the solution hugely and rapidly customizable (without having to change code). It is this flexibility that we believe is core to realizing multi-lingual and multi-domain text analysis applications in a real-world scenario. This customization environment is available for download from Alphaworks, http://www.alphaworks.ibm.com/tech/lrw, and we would love to get feedback from your community.
On your point about performance, we actually consider UIMA one of our greatest performance optimizations and core to our design. The point about one-pass is that we never go back over the same piece of text twice at the same “level” and take a very careful approach when defining our UIMA Annotators. Certain layers of language processing just don’t make sense to split up due to their interconnectedness and therefore we create our UIMA annotators according to where they sit in the overall processing layers. That’s the key point.
Anyway those are my thoughts, and thanks again for the mention. It’s really great to see these topics being discussed in an open and challenging forum.
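To make the decoupling idea in Marie’s email concrete, here is a minimal Python sketch. It is not LanguageWare or UIMA code, and I haven’t seen how IBM actually implements this. The `LANGUAGE_MODELS` table, the `Annotation` type, and the `annotate` function are all hypothetical names invented for illustration. The only point is that the engine stays fixed while the language- or domain-specific knowledge lives in data, and that the text is scanned once for all labels rather than once per label.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    label: str
    start: int
    end: int
    text: str


# Hypothetical "language models": plain data that can be edited or extended
# without touching the engine code below.
LANGUAGE_MODELS = {
    "en": {"COMPANY": ["IBM", "Lotus"], "PRODUCT": ["UIMA", "LanguageWare"]},
    "fr": {"COMPANY": ["IBM"], "PRODUCT": ["UIMA"]},
}


def annotate(text, model):
    """Generic engine: a single left-to-right scan that finds every label's terms."""
    # Map each lowercased surface form to its label.
    label_of = {
        term.lower(): label for label, terms in model.items() for term in terms
    }
    if not label_of:
        return []
    # One alternation pattern for all terms, longest first so that longer
    # terms win over shorter ones that share a prefix.
    alternatives = "|".join(
        re.escape(term) for term in sorted(label_of, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return [
        Annotation(label_of[m.group(0).lower()], m.start(), m.end(), m.group(0))
        for m in pattern.finditer(text)
    ]


result = annotate("IBM built LanguageWare on UIMA.", LANGUAGE_MODELS["en"])
for ann in result:
    print(ann)
```

Supporting another language or domain in this toy version means adding an entry to `LANGUAGE_MODELS`; `annotate` itself does not change.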
Marie Wallace of IBM wrote in from Ireland to call my attention to LanguageWare, IBM’s latest try at natural language processing (NLP). Obviously, IBM has been down this road multiple times before, from ViaVoice (dictation software that got beat out by Dragon NaturallySpeaking) to Penelope (research project that seemingly went on for as long as Odysseus was away from Ithaca; rumor has it that the principals eventually decamped to Microsoft, and continued to not produce commercial technology there).
In a 2006 white paper, IBM claimed that “just 4 years from now, the world’s information base will be doubling in size every 11 hours.” This week, that statistic was passed on — utterly deadpan — by the Industry Standard and Stephen Arnold. Arnold’s post actually reads as if he takes the figure seriously.
Now, I’ll confess to not having seen the argument in favor of that statistic. But color me skeptical that, by any measure of “information”, it will grow by a factor of more than 2^730 in a year, or 2^7300 in a decade …
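For anyone who wants to check that arithmetic, here is a minimal Python sketch. The function name is my own, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year


def doublings_per_year(hours_per_doubling):
    """How many times a quantity doubles in one year at the given doubling period."""
    return HOURS_PER_YEAR / hours_per_doubling


per_year = doublings_per_year(11)
per_decade = 10 * per_year

# The growth factors themselves are 2**per_year and 2**per_decade;
# only the exponents are printed here.
print(f"doublings per year:   {per_year:.0f}")
print(f"doublings per decade: {per_decade:.0f}")
```

An exponent of 730 corresponds to doubling every 12 hours, so with an 11-hour doubling period the growth is indeed "more than" 2^730 per year.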
Late last year, there was a little flap about who invented the phrase business intelligence. Credit turns out to go to an IBM researcher named H. P. Luhn, as per this 1958 paper. Well, I finally took a look at the paper, after Jeff Jones of IBM sent over another copy. And guess what? It’s all about text analytics. Specifically, it’s about what we might now call a combination of classification and knowledge management.
Half a century later, the industry is finally poised to deliver on that vision.
I just had a quick chat with text mining vendor Clarabridge’s CEO Sid Banerjee. Naturally, I asked the standard “So who are you seeing in the marketplace the most?” question. Attensity is unsurprisingly #1. What’s new, however, is that Inxight – heretofore not a text mining presence vs. commercially-focused Clarabridge – has begun to show up a bit this quarter, via the Business Objects sales force. Sid was of course dismissive of their current level of technological readiness and integration – but at least BOBJ/Inxight is showing up now.
The most interesting point was text mining SaaS (Software as a Service). When Clarabridge first put out its “We offer SaaS now!” announcement, I yawned. But Sid tells me that about half of Clarabridge’s deals now are actually SaaS. The way the SaaS technology works is pretty simple. The customer gathers together text into a staging database – typically daily or weekly – and it gets sucked into a Clarabridge-managed Clarabridge installation in some high-end SaaS data center. If there’s a desire to join the results of the text analysis with some tabular data from the client’s data warehouse, the needed columns get sent over as well. And then Clarabridge does its thing.
Today’s big news is IBM’s $5 billion acquisition of Cognos. Part of the analyst conference call was two customer examples of how the companies had worked together in the past — and one of those two had a lot of “integration of structured and unstructured data.” The application sounded more like a 360-degree customer view, retrieving text documents alongside relational records, than it did like hardcore text analytics. Even so, it illustrates a trend that I was seeing even before BOBJ’s buy of Inxight, namely an increasing focus in the business intelligence world on at least the trappings of text analytics.
CEO Eric Bregand of Temis recently checked in by email with an update on text mining market activity. Highlights of Eric’s views include:
- Yep, Voice Of The Customer is hot, in “many markets”; Eric specifically mentioned banking, car, energy, food, and retail. He further sees IBM backing VotC as text’s “killer app.” (Note: Temis has a history of partnering with IBM, most notably via its unusually strong commitment to UIMA.)
- Specifically, THE hot topics in the European market these days are competitive intelligence and sentiment analysis. (Note: I’ve always thought Temis got serious about competitive analysis a little earlier than most other text mining vendors did.)
- Life sciences is an ever growing focus for Temis.
- I confused him a bit with how I phrased my question about custom publishing and Temis’ Mark Logic partnership. But he did express favorable views of the market, specifically in the area of integrating text mining and native XML database management, and even volunteered that nStein appears to be doing well.
Due to various transatlantic communication glitches, I’d never had a serious briefing with text mining vendor TEMIS until yesterday, when I finally connected with CEO Eric Bregand. So here’s a quick TEMIS overview; I’ll discuss what they actually do in a separate post.
- TEMIS has 50 people; 3 main businesses and a couple of secondary ones; two larger offices in France; and smaller offices in Germany and the US. As would be expected, TEMIS’ customer base is concentrated in Continental Europe. The US exceptions seem concentrated in the life sciences vertical (not coincidentally, the US office is outside Philadelphia).
- Like Inxight, TEMIS is at least partly a spin-off from Xerox’s text analytics efforts. Indeed, its Grenoble office was acquired from Xerox. Unlike Inxight, TEMIS doesn’t seriously pursue OEM business, but a couple of exceptions have occurred (Eric mentioned Convera and Documentum).
Most people in the text analytics market realize that text mining and search are somewhat related. But I don’t think they often stop to contemplate just how close the relationship is, could be, or someday probably will become. Here’s part of what I mean:
- Text mining powers search. The biggest text mining outfits in the world, possibly excepting the US intelligence community, are surely Google, Yahoo, and perhaps Microsoft.
- Search powers text mining. Restricting the corpus of documents to mine, even via a keyword search, makes tons of sense. That’s one of the good ideas in Attensity 4.
- Text mining and search are powered by the same underlying technologies. For starters, there’s all the tokenization, extraction, etc. that vendors in both areas license from Inxight and its competitors. Beyond that, I think there’s a future play in integrated taxonomy management that will rearrange the text analytics market landscape.
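The "search powers text mining" point in the list above can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not a description of how Attensity 4 or any other product works. The sample documents, the stopword list, and the `keyword_search` and `top_terms` helpers are all invented for the example.

```python
import re
from collections import Counter

# Invented sample corpus.
DOCS = [
    "The dealer fixed the brakes quickly and the service was friendly.",
    "Brakes squeal after the recall repair; dealer service was slow.",
    "Great fuel economy on the highway.",
]

STOPWORDS = {"the", "and", "was", "on", "after", "a"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_search(docs, keyword):
    """Search step: keep only the documents that contain the keyword."""
    return [doc for doc in docs if keyword.lower() in tokenize(doc)]


def top_terms(docs, n=3):
    """Mining step (toy): most frequent non-stopword terms in the given documents."""
    counts = Counter(
        token for doc in docs for token in tokenize(doc) if token not in STOPWORDS
    )
    return counts.most_common(n)


subset = keyword_search(DOCS, "brakes")
print(len(subset), "documents matched")
print(top_terms(subset))
```

The mining step only ever sees the documents the search step let through, which is the whole idea: a cheap filter first, the expensive analysis second.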