<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Text Technologies &#187; Open source text analytics</title>
	<atom:link href="http://www.texttechnologies.com/category/open-source-text-analytics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.texttechnologies.com</link>
	<description>Understanding technology ... in both senses of the phrase</description>
	<lastBuildDate>Wed, 18 Jan 2012 17:02:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Attivio tries to do it all</title>
		<link>http://www.texttechnologies.com/2007/12/12/attivio-tries-to-do-it-all/</link>
		<comments>http://www.texttechnologies.com/2007/12/12/attivio-tries-to-do-it-all/#comments</comments>
		<pubDate>Wed, 12 Dec 2007 04:38:55 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Attivio]]></category>
		<category><![CDATA[BI integration]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source text analytics]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2007/12/12/attivio-tries-to-do-it-all/</guid>
		<description><![CDATA[When Andrew McKay was at FAST, I grumped about his search/BI integration story. Now that he&#8217;s trying to do the same thing at a startup called Attivio, it sounds more plausible. Attivio is having a house party and product rollout in the latter part of January, and details are scarce in the meantime. But [...]]]></description>
			<content:encoded><![CDATA[<p>When Andrew McKay was at FAST, I grumped about his <a href="http://www.texttechnologies.com/2007/02/01/what%e2%80%99s-interesting-about-the-fast-venture-in-bi/" >search/BI integration story</a>.   Now that he&#8217;s trying to do the same thing at a startup called <a href="http://www.attivio.com">Attivio</a>, it sounds more plausible.</p>
<p>Attivio is having a house party and product rollout in the latter part of January, and details are scarce in the meantime.  But here are some highlights.</p>
<ul>
<li>Attivio was founded in August.  It has 21 people and 1 VC.  The VC has invested &gt;$6 million and committed &gt;$12 million total.</li>
<li>Attivio has ambitious plans for a fully integrated data management/real-time BI stack.  It&#8217;s currently called the &#8220;Active Intelligence Engine.&#8221;<span id="more-151"></span></li>
<li>The data management part combines tabular, text, and XML data.  The tabular part is some kind of bitmap.  The text part is fairly traditional, and based on Lucene.</li>
<li>One point of this architecture is that one can more or less seamlessly join different kinds of data.</li>
<li>Another point is surely that &#8212; with everything being more or less like a column or bitmap &#8212; memory management and administration are manageable issues.</li>
<li>Despite containing all these wonders, the code is under 10 megs total.  At least right now.  But then &#8212; how much code can one write in a few months?</li>
<li>Andrew didn&#8217;t want me to repeat everything he said about target markets, but clearly Wall Street is one of the top possibilities.</li>
</ul>
<p>Stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2007/12/12/attivio-tries-to-do-it-all/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Text mining and search, joined at the hip</title>
		<link>http://www.texttechnologies.com/2006/11/11/text-mining-and-search-joined-at-the-hip/</link>
		<comments>http://www.texttechnologies.com/2006/11/11/text-mining-and-search-joined-at-the-hip/#comments</comments>
		<pubDate>Sat, 11 Nov 2006 08:14:25 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Attensity]]></category>
		<category><![CDATA[Business Objects and Inxight]]></category>
		<category><![CDATA[Enterprise search]]></category>
		<category><![CDATA[FAST]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[IBM and UIMA]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Open source text analytics]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Text mining]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2006/11/11/text-mining-and-search-joined-at-the-hip/</guid>
		<description><![CDATA[Most people in the text analytics market realize that text mining and search are somewhat related. But I don’t think they often stop to contemplate just how close the relationship is, could be, or someday probably will become. Here’s part of what I mean: Text mining powers search. The biggest text mining outfits in the [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal">Most people in the text analytics market realize that text mining and search are somewhat related.  But I don’t think they often stop to contemplate just how close the relationship is, could be, or someday probably will become.  Here’s part of what I mean:</p>
<ol>
<li class="MsoNormal"><strong>Text mining powers search.</strong> The biggest text mining outfits in the world, possibly excepting the US intelligence community, are surely Google, Yahoo, and perhaps Microsoft.</li>
<li class="MsoNormal"><strong>Search powers text mining.</strong> Restricting the corpus of documents to mine, even via a keyword search, makes tons of sense. That’s one of the good ideas in Attensity 4.</li>
<li class="MsoNormal"><strong>Text mining and search are powered by the same underlying technologies.</strong> For starters, there’s all the tokenization, extraction, etc. that vendors in both areas license from Inxight and its competitors. Beyond that, I think there’s a future play in <a href="http://www.texttechnologies.com/2005/12/11/the-text-technologies-market-3-heres-whats-missing/" >integrated taxonomy management</a> that will rearrange the text analytics market landscape.</li>
</ol>
<p><span id="more-59"></span></p>
<p class="MsoNormal">So who does “get it” about the search/text mining connection?  The UIMA folks at IBM probably do.  Inxight surely does.  Attensity seemingly does, and so do most large search engine vendors (FAST and the public guys for sure; I’m not so certain about Autonomy and Convera).  A small company whose CEO just called me yesterday does.  I think I do.</p>
<p class="MsoNormal">But I’m not sure that the smaller text mining and search outfits – or the small text-oriented parts of large enterprise software vendors – have gotten the message at all yet …</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2006/11/11/text-mining-and-search-joined-at-the-hip/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>UIMA data point</title>
		<link>http://www.texttechnologies.com/2006/07/27/uima-data-point/</link>
		<comments>http://www.texttechnologies.com/2006/07/27/uima-data-point/#comments</comments>
		<pubDate>Thu, 27 Jul 2006 09:44:46 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Open source text analytics]]></category>
		<category><![CDATA[Text mining]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2006/07/27/uima-data-point/</guid>
		<description><![CDATA[While talking with Attensity today about much else, I asked them about UIMA. What they said is not inconsistent with what I heard from IBM itself. According to Attensity: A year ago almost no customers cared about UIMA. Now UIMA is regularly showing up on government RFPs. Private sector interest in UIMA is still very [...]]]></description>
			<content:encoded><![CDATA[<p>While talking with Attensity today about much else, I asked them about UIMA.  What they said is not inconsistent with what I heard <a href="http://www.texttechnologies.com/2006/07/19/lead-uima-architect-dave-ferrucci-speaks-about-adoption/" >from IBM itself</a>.  According to Attensity:</p>
<ul>
<li>A year ago almost no customers cared about UIMA.</li>
<li>Now UIMA is regularly showing up on government RFPs.</li>
<li>Private sector interest in UIMA is still very limited.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2006/07/27/uima-data-point/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lead UIMA architect Dave Ferrucci speaks about adoption</title>
		<link>http://www.texttechnologies.com/2006/07/19/lead-uima-architect-dave-ferrucci-speaks-about-adoption/</link>
		<comments>http://www.texttechnologies.com/2006/07/19/lead-uima-architect-dave-ferrucci-speaks-about-adoption/#comments</comments>
		<pubDate>Wed, 19 Jul 2006 17:00:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[About this blog]]></category>
		<category><![CDATA[IBM and UIMA]]></category>
		<category><![CDATA[Open source text analytics]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2006/07/19/lead-uima-architect-dave-ferrucci-speaks-about-adoption/</guid>
		<description><![CDATA[Dave Ferrucci, lead architect for UIMA, shared some detailed views with me about UIMA adoption. With his permission, they are reproduced below. UIMA is still not getting a lot of attention from commercial text analytics vendors, but ultimately I think it will prevail, if just because nobody cares enough to start a war of dueling [...]]]></description>
			<content:encoded><![CDATA[<p>Dave Ferrucci, lead architect for UIMA, shared some detailed views with me about UIMA adoption.  With his permission, they are reproduced below.  UIMA is still not getting a lot of attention from commercial text analytics vendors, but ultimately I think it will prevail, if just because nobody cares enough to start a war of dueling alternative standards.*  So it&#8217;s something you should educate yourself about as it progresses.</p>
<p><em>*And IBM plans to convince me ASAP that even that assessment is too negative, which it well may be.  Stay tuned. </em></p>
<blockquote><p>So to sum up &#8212; 1. We seem to have fair amount of traction  with the UIMA framework by communities that are very interested in plug-n-play  with components from other providers. This includes the government, life  sciences and research communities. 2. The UIMA standard, as opposed to the  specific Java Framework implementation, developed under an SDO will broaden the  opportunity and strengthen the case of adoption of UIMA as a standard for text  and multi-modal analytics that allows interoperability across different  frameworks and applications. It would of course be the case that the Java UIMA  Framework would comply to the standard.</p></blockquote>
<p>The complete email follows.<br />
<span id="more-30"></span></p>
<blockquote><p>Curt,<br />
Hi. While, we can&#8217;t really speak for the vendors, the  adoption story is ongoing and to fully appreciate it I think it best to consider  it in a bit more depth.</p>
<p>First is adoption of the UIMA Java Framework  which we have posted in binary form (as part of an SDK) on the IBM alphaworks  site in late 2004 (http://www.alphaworks.ibm.com/tech/uima) and then early this  year posted the source on source forge (http://uima-framework.sourceforge.net/).   On alphaworks we get a rough average of a couple hundred downloads/month from  government, academia and industry. On sourceforge we also get a similar average  although it seems to be tapering off of late.  What ALL these folks are doing  with the framework, we do not know. The forum on alphaworks is moderately  active; there hasn&#8217;t been as much activity on the source forum so far.  We see a lot of use of  the UIMA SDK (which includes the Java Framework) by government, universities and  research institutions/programs that are not in the business of selling a  specific application but rather in the business of creating/customizing their  own solutions. From these communities we see more activity on the alphaworks SDK  forum, requests for talks, tutorials and white papers and involvement in large  collaborative projects using UIMA. This makes sense to us. This is where we  expect to see early adoption of the framework. Traditionally these communities  do not see their value-add or core competency in developing infrastructure.  Rather they want to spend their time on the analytics, task models, integration  and solution level stuff. They are also more likely to experiment with 3rd party  analytics because they are not focused on competing at the component level, but  rather on solving their core problems, often in a collaborative environment.  Building adoption here for an interoperability framework, I think, is a good  first step.</p>
<p>It appears that text analysis vendors tend to build on their own internal frameworks, which have been in production for some time and are  intimately tied to their applications. Also, they may tend to consider their  analytics a cornerstone of their competitive advantage and therefore may not be  prone to share them or suggest that someone else&#8217;s are better. Switching over  to a different, externally provided, pluggable framework may not be a top priority. It is reasonable that over time vendors may adopt the UIMA Java  framework as part of their internal implementation, but that depends on  technical issues surrounding the cost/performance trade-offs relative to  maintaining their current implementations and their interest in reusing 3rd  party analytics. Vendors may be more immediately motivated to partner with IBM  in the creation and/or use of a standard (which I will say more about below).  Their hopes are to use the standard to better enable opportunities for strategic  partnerships and to find more channels for their technology.  The standard  enables, for example, network or service-level interoperability across  frameworks and applications not necessarily requiring a deeper implementation  commitment.</p>
<p>Second is adoption of a UIMA standard for interoperability  that accommodates different implementations/applications. We are considering the  creation of an external working group under a Standards Development Organization  (SDO), to define a standard specification for UIMA. This is independent of the  open-source Java Framework implementation and specifies the data representations  and abstract interfaces for defining compliant data and for communicating  between text (and multi-modal) applications over, for example, a network  protocol (e.g., SOAP). It will specify how to encode analysis data, how to  publish network services that do annotation etc. We expect that this will be an  attractive adoption point for communities that want to interoperate but not  necessarily change their internal implementation and development environments.  Perhaps the larger portion of text analysis vendor community fall into this  camp.</p>
<p>So to sum up &#8212; 1. We seem to have fair amount of traction  with the UIMA framework by communities that are very interested in plug-n-play  with components from other providers. This includes the government, life  sciences and research communities. 2. The UIMA standard, as opposed to the  specific Java Framework implementation, developed under an SDO will broaden the  opportunity and strengthen the case of adoption of UIMA as a standard for text  and multi-modal analytics that allows interoperability across different  frameworks and applications. It would of course be the case that the Java UIMA  Framework would comply to the standard.</p>
<p>Regards,</p>
<p>Dave</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2006/07/19/lead-uima-architect-dave-ferrucci-speaks-about-adoption/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Should ontology management be open sourced?</title>
		<link>http://www.texttechnologies.com/2006/07/17/should-ontology-management-be-open-sourced/</link>
		<comments>http://www.texttechnologies.com/2006/07/17/should-ontology-management-be-open-sourced/#comments</comments>
		<pubDate>Mon, 17 Jul 2006 08:49:38 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[About this blog]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Open source text analytics]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2006/07/17/should-ontology-management-be-open-sourced/</guid>
		<description><![CDATA[I’ve argued previously that enterprises need serious ontologies, and that this lack is holding back growth in multiple areas of text technology – search, text mining and knowledge extraction, various forms of speech recognition, and so on. The core point was: The ideal ontology would consist mainly of four aspects: 1. A conceptual part that’s [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal">I’ve argued previously that <a href="http://www.texttechnologies.com/2005/12/11/the-text-technologies-market-3-heres-whats-missing/" >enterprises need serious ontologies</a>, and that <a href="http://www.texttechnologies.com/2005/12/11/the-text-technologies-market-4-requirements-for-an-industry-altering-ontology-management-system/" >this lack is holding back growth in multiple areas of text technology</a> – search, text mining and knowledge extraction, various forms of speech recognition, and so on.  The core point was:</p>
<blockquote><p>The ideal ontology would consist mainly of four aspects:</p>
<p>1. A conceptual part that’s language-independent.<br />
2. A general language-dependent part.<br />
3. A sensitivity to different kinds of text – language is used differently when spoken, for instance, than it is in edited newspaper articles.<br />
4. An enterprise-specific part. For example, a company has product names, it has competitors with product names, those names have abbreviations, and so on.</p></blockquote>
<p><span id="more-29"></span></p>
<p class="MsoNormal">There are actually two different requirements – the enterprise-independent ontology, and the software to manage and use it in an enterprise-specific way.  But while I continue to believe that this dual product category will emerge, my faith has wavered somewhat.  The big vendors don’t “get it,” and the little ones lack the resources even if they do see the opportunity.</p>
<p class="MsoNormal">I discussed this with David Thede of dtSearch last week, and he raised an interesting question:  Should this be an open source project?   Some initial responses follow; also, I’d be very interested to know what the rest of the community thinks.</p>
<p class="MsoNormal"><strong>Peripheral parts of the software clearly can be drawn from the open source community.  </strong> IBM has open-sourced UIMA, which seems like a perfectly good framework for modularity, integration, interoperability, and so on.  Development tools can probably be based on Eclipse.  Etc. Open source has plenty of applicability these days.</p>
<p class="MsoNormal"><strong>The core software needs a profit-motivated vendor. </strong> Open-source, to date, has been much more about implementing alternate versions of known technology than it has been about difficult first-time invention.  And there’s a lot of invention still to be done here, especially in the area of <a href="http://www.texttechnologies.com/2006/06/10/four-at-least-issues-in-text-and-taxonomy-federation/" >taxonomy federation</a>.  Could there be an open source business model in which the vendor gives away the code and sells services?  Sure.  If nothing else, Monash’s Second Law of Commercial Semantics states “Where there are ontologies, there is consulting.”  But somebody has to <em>own</em> the product, in every sense of “own,” or it never will see the light of day.</p>
<p class="MsoNormal"><strong>An enterprise-independent ontology ideally should be open-sourced.   Probably, the leader of this effort should be the supplier of the ontology management software.</strong> WordNet is more or less public domain.  Various taxonomies of proper nouns and industry jargon can also be found.  Individual contributors can naturally provide new small pieces.   So there’s a lot of reason to think that public domain/open source is naturally the way to go for an ontology.</p>
<p class="MsoNormal">That said, a big problem comes to mind – how do you get everyone to agree on what the structure will be?   But isn’t that the same kind of problem that open source software development projects solve all the time?  I think so.</p>
<p class="MsoNormal">But unfortunately it’s also the kind of problem that standards committees botch all the time.  From that, I conclude that a public-domain ontology will need strong central leadership.  Who’s qualified and incentivized to provide that leadership?  I can’t think of anybody better suited than whoever emerges to seize the billion-dollar opportunity for ontology management software.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2006/07/17/should-ontology-management-be-open-sourced/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

