Open source text analytics

Discussion of text analytic software or ontologies that are offered through some version of open source licensing. Related subjects include:

Lucene
UIMA
(in DBMS2) Open source database management systems

December 12, 2007

Attivio tries to do it all

When Andrew McKay was at FAST, I grumped about his search/BI integration story. Now that he’s trying to do the same thing at a startup called Attivio, it sounds more plausible.

Attivio is having a house party and product rollout in the latter part of January, and details are scarce in the meantime. But here are some highlights.

Attivio was founded in August. It has 21 people and 1 VC. The VC has invested >$6 million and committed >$12 million total.
Attivio has ambitious plans for a fully integrated data management/real-time BI stack. It’s currently called the “Active Intelligence Engine.”

Categories: Attivio, BI integration, Investment research and trading, Lucene, Open source text analytics

November 11, 2006

Text mining and search, joined at the hip

Most people in the text analytics market realize that text mining and search are somewhat related. But I don’t think they often stop to contemplate just how close the relationship is, could be, or someday probably will become. Here’s part of what I mean:

Text mining powers search. The biggest text mining outfits in the world, possibly excepting the US intelligence community, are surely Google, Yahoo, and perhaps Microsoft.
Search powers text mining. Restricting the corpus of documents to mine, even via a keyword search, makes tons of sense. That’s one of the good ideas in Attensity 4.
Text mining and search are powered by the same underlying technologies. For starters, there’s all the tokenization, extraction, etc. that vendors in both areas license from Inxight and its competitors. Beyond that, I think there’s a future play in integrated taxonomy management that will rearrange the text analytics market landscape.
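
To make the second and third points concrete, here is a minimal Python sketch in which a keyword search narrows the corpus before any mining is done, and both steps share one tokenizer. The function names, the sample documents, and the stopword list are all invented for illustration; this is not how Attensity or any other vendor mentioned here implements it.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer shared by the search and mining steps."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_search(corpus, keyword):
    """Return only the documents that contain the keyword as a token."""
    keyword = keyword.lower()
    return [doc for doc in corpus if keyword in tokenize(doc)]


def mine_terms(docs, stopwords=frozenset()):
    """Count term frequencies across the given documents (a stand-in for real text mining)."""
    counts = Counter()
    for doc in docs:
        counts.update(t for t in tokenize(doc) if t not in stopwords)
    return counts


# Hypothetical sample corpus.
corpus = [
    "The battery drains quickly and the battery overheats",
    "Great screen, but the battery life is poor",
    "Shipping was slow",
]

# Search first, then mine only the matching documents.
subset = keyword_search(corpus, "battery")
term_counts = mine_terms(subset, stopwords={"the", "and", "but", "is", "was"})
```

Only the two battery-related documents reach the mining step, so the shipping complaint never enters the term counts.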

Categories: Attensity, Business Objects and Inxight, Enterprise search, FAST, Google, IBM and UIMA, Ontologies, Open source text analytics, Search engines, Text mining

July 27, 2006

UIMA data point

While talking with Attensity today about much else, I asked them about UIMA. What they said is not inconsistent with what I heard from IBM itself. According to Attensity:

A year ago almost no customers cared about UIMA.
Now UIMA is regularly showing up on government RFPs.
Private sector interest in UIMA is still very limited.

Categories: Open source text analytics, Text mining

Lead UIMA architect Dave Ferrucci speaks about adoption

Dave Ferrucci, lead architect for UIMA, shared some detailed views with me about UIMA adoption. With his permission, they are reproduced below. UIMA is still not getting a lot of attention from commercial text analytics vendors, but ultimately I think it will prevail, if only because nobody cares enough to start a war of dueling alternative standards.* So it’s something you should educate yourself about as it progresses.

*And IBM plans to convince me ASAP that even that assessment is too negative, which it well may be. Stay tuned.

So to sum up — 1. We seem to have fair amount of traction with the UIMA framework by communities that are very interested in plug-n-play with components from other providers. This includes the government, life sciences and research communities. 2. The UIMA standard, as opposed to the specific Java Framework implementation, developed under an SDO will broaden the opportunity and strengthen the case of adoption of UIMA as a standard for text and multi-modal analytics that allows interoperability across different frameworks and applications. It would of course be the case that the Java UIMA Framework would comply to the standard.

The complete email follows.

Categories: About this blog, IBM and UIMA, Open source text analytics

July 17, 2006

Should ontology management be open sourced?

I’ve argued previously that enterprises need serious ontologies, and that the lack of them is holding back growth in multiple areas of text technology: search, text mining and knowledge extraction, various forms of speech recognition, and so on. The core point was:

The ideal ontology would consist mainly of four aspects:

1. A conceptual part that’s language-independent.
2. A general language-dependent part.
3. A sensitivity to different kinds of text – language is used differently when spoken, for instance, than it is in edited newspaper articles.
4. An enterprise-specific part. For example, a company has product names, it has competitors with product names, those names have abbreviations, and so on.
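
One way to picture those four parts is as layers that a term lookup walks from most to least specific. The Python sketch below is only that, a picture: the class, its field names, the concept ids, and the sample terms are all hypothetical, and real ontology management involves far more than dictionary lookups.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredOntology:
    """Hypothetical container for the four parts listed above."""
    # 1. Language-independent concepts: concept id -> description
    concepts: dict = field(default_factory=dict)
    # 2. General language-dependent terms: term -> concept id
    general_terms: dict = field(default_factory=dict)
    # 3. Terms specific to a kind of text: kind -> {term -> concept id}
    register_terms: dict = field(default_factory=dict)
    # 4. Enterprise-specific terms (product names, abbreviations): term -> concept id
    enterprise_terms: dict = field(default_factory=dict)

    def resolve(self, term, register=None):
        """Map a term to a concept id, checking the most specific layer first."""
        key = term.lower()
        layers = [
            self.enterprise_terms,
            self.register_terms.get(register, {}),
            self.general_terms,
        ]
        for layer in layers:
            if key in layer:
                return layer[key]
        return None


# Invented sample data for illustration only.
ontology = LayeredOntology(
    concepts={"C1": "relational database product", "C2": "greeting"},
    general_terms={"database": "C1", "hello": "C2"},
    register_terms={"spoken": {"hiya": "C2"}},
    enterprise_terms={"acmedb": "C1", "adb": "C1"},
)
```

Here `ontology.resolve("ADB")` finds the product abbreviation in the enterprise layer, while `ontology.resolve("hiya", register="spoken")` matches only when the caller says the text is spoken.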

Categories: About this blog, Ontologies, Open source text analytics
