<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/rss2full.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/itemcontent.css" type="text/css" media="screen"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Text Technologies</title>
	
	<link>http://www.texttechnologies.com</link>
	<description>Understanding technology ... in both senses of the phrase</description>
	<pubDate>Thu, 20 Nov 2008 17:22:22 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/TextTechnologies" type="application/rss+xml" /><item>
		<title>More website weirdness</title>
		<link>http://feeds.feedburner.com/~r/TextTechnologies/~3/459119368/</link>
		<comments>http://www.texttechnologies.com/2008/11/19/more-website-weirdness/#comments</comments>
		<pubDate>Thu, 20 Nov 2008 03:27:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[ClearForest/Reuters]]></category>

		<category><![CDATA[Custom publishing]]></category>

		<category><![CDATA[Mark Logic]]></category>

		<category><![CDATA[Search engines]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=298</guid>
		<description><![CDATA[Here&#8217;s something longer-lasting and weirder than Vertica&#8217;s &#8220;We sell turkeys&#8221; theme: Mark Logic, whose product is used primarily to help enterprises make their content more accessible, doesn&#8217;t have a search engine on its own website.*
*Or if it does, it&#8217;s VERY well-hidden. I looked at the home page and site map alike.
I wanted to refresh my [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s something longer-lasting and weirder than <a href="http://www.dbms2.com/2008/11/18/silly-website-tricks/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.dbms2.com');">Vertica&#8217;s &#8220;We sell turkeys&#8221; theme</a>: Mark Logic, whose product is used primarily to help enterprises make their content more accessible, doesn&#8217;t have a search engine on its own website.*<span id="more-298"></span></p>
<p><em>*Or if it does, it&#8217;s VERY well-hidden. I looked at the home page and site map alike.</em></p>
<p>I wanted to refresh my memory as to Mark Logic&#8217;s history of working with specific text mining vendors, beyond what&#8217;s on the official <a href="http://www.marklogic.com/partners/open-enrichment-framework.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.marklogic.com');">partner page</a>. No luck.  Normally when site search is inadequate, one goes to Google.   But that&#8217;s problematic too.  Marklogic.com pages come up pretty low on Google&#8217;s search results, suggesting that:</p>
<ol>
<li>Mark Logic doesn&#8217;t put a lot of effort into SEO (or else doesn&#8217;t do it very well).</li>
<li>One can&#8217;t be confident all the site&#8217;s significant pages are findable by Google.</li>
</ol>
<p>Looking to other companies&#8217; sites for clues isn&#8217;t conclusive either.  E.g., <a href="http://clearforest.com/Partners/PartnerDetails.asp?id=11" onclick="javascript:pageTracker._trackPageview('/outbound/article/clearforest.com');">ClearForest lists Mark Logic as a partner</a>, but Mark Logic doesn&#8217;t return the compliment.  (If memory serves, Mark Logic and ClearForest have worked together on both national security deals and custom publishing deals &#8212; but don&#8217;t hold me to that.)</p>
<p>When it comes to making its own information conveniently available, Mark Logic is quite the unshod cobbler.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/11/19/more-website-weirdness/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.texttechnologies.com/2008/11/19/more-website-weirdness/</feedburner:origLink></item>
		<item>
		<title>The silly fuss over Obama’s use of YouTube</title>
		<link>http://feeds.feedburner.com/~r/TextTechnologies/~3/455371426/</link>
		<comments>http://www.texttechnologies.com/2008/11/16/the-silly-fuss-over-obamas-use-of-youtube/#comments</comments>
		<pubDate>Sun, 16 Nov 2008 23:48:40 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Google]]></category>

		<category><![CDATA[Social software and online media]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=297</guid>
		<description><![CDATA[President-Elect Barack Obama is posting videos on YouTube.  Clearly, his use of relatively cutting-edge communications technology is a Good Thing. It&#8217;s also unsurprising, given the sophistication and importance of video in the recent presidential campaign.
However, various commentators &#8212; even ones as smart as Dan Farber &#8212; see something wrong with the use of YouTube [...]]]></description>
			<content:encoded><![CDATA[<p>President-Elect Barack Obama is posting <a href="http://www.change.gov/newsroom/entry/your_weekly_address_from_the_president_elect/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.change.gov');">videos on YouTube</a>.  Clearly, his use of relatively cutting-edge communications technology is a Good Thing. It&#8217;s also unsurprising, given the <a href="http://www.networkworld.com/community/node/34751" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.networkworld.com');">sophistication and importance of video</a> in the recent presidential campaign.</p>
<p>However, various commentators &#8212; even ones as smart as <a href="http://news.cnet.com/8301-13953_3-10098174-80.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/news.cnet.com');">Dan Farber</a> &#8212; see something wrong with the use of YouTube for this purpose.  I think that&#8217;s silly.<span id="more-297"></span> If YouTube and its competitors are happy to provide the bandwidth for free, I see no reason why the transition team or government should pay for it.  Nor should ads be a concern.  After all, the president&#8217;s weekly radio address has long been provided to ad-supported radio channels.</p>
<p>Of course, I think official videos should be available from multiple sources. No site needs a monopoly.  And indeed, with YouTube being banned in some countries, there&#8217;s a clear &#8220;greater reach&#8221; reason for multiple sourcing.  (The American President doesn&#8217;t just speak to Americans.) But as Dan points out, the Obama video is indeed available through multiple major websites.</p>
<p>Now, I&#8217;m not saying that the government shouldn&#8217;t have sites of its own where it hosts videos of speeches by the Assistant Secretary of Transportation. But if YouTube or some other firm provides bandwidth in return for being noticed as doing same, I only see one kind of possible harm:</p>
<p><strong>Perhaps the implicit advertisement/endorsement they&#8217;re getting is of greater value than the bandwidth being provided.</strong></p>
<p>Fine. There are two ways to deal with that:</p>
<p>1.  Require YouTube to remove its logo from the version of the video being put up on official sites (the first link above shows how that&#8217;s not happening now).</p>
<p>2.  Auction off the right to be the primary video provider, or something like that.</p>
<p>Either way, I fail to see the big deal.  YouTube and Google are great American companies. Government does things for companies all the time.  Until a competitor comes up with a clear description of how it&#8217;s being hurt by this mild favoritism &#8212; and I haven&#8217;t heard of any yet &#8212; there are many bigger <a href="http://www.networkworld.com/community/node/34946" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.networkworld.com');">problems for Obama&#8217;s technology experts to solve</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/11/16/the-silly-fuss-over-obamas-use-of-youtube/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.texttechnologies.com/2008/11/16/the-silly-fuss-over-obamas-use-of-youtube/</feedburner:origLink></item>
		<item>
		<title>Are denial-of-insight attacks a threat to search logs and/or VOTC/VOTM apps?</title>
		<link>http://feeds.feedburner.com/~r/TextTechnologies/~3/450415652/</link>
		<comments>http://www.texttechnologies.com/2008/11/12/denial-of-insight-attacks/#comments</comments>
		<pubDate>Wed, 12 Nov 2008 07:45:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Competitive intelligence]]></category>

		<category><![CDATA[Search engines]]></category>

		<category><![CDATA[Spam and antispam]]></category>

		<category><![CDATA[Voice of the Customer]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=295</guid>
		<description><![CDATA[TechTaxi points out that it&#8217;s at least theoretically possible to, by polluting the Web, pollute somebody&#8217;s web-wide information gathering.  (Hat tip to Daniel Tunkelang.)  They further assert this is a relatively near-term threat.
The theory can&#8217;t be denied. What&#8217;s more, bad actors have other motives to pollute the Web.  For example, if they [...]]]></description>
			<content:encoded><![CDATA[<p>TechTaxi <a href="http://techtaxi.blogspot.com/2006/04/denial-of-insight-attacks-could.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/techtaxi.blogspot.com');">points out</a> that it&#8217;s at least theoretically possible to, by polluting the Web, pollute somebody&#8217;s web-wide information gathering.  (Hat tip to <a href="http://thenoisychannel.com/2008/11/11/big-google-can-be-benign/" onclick="javascript:pageTracker._trackPageview('/outbound/article/thenoisychannel.com');">Daniel Tunkelang</a>.)  They further assert this is a relatively near-term threat.</p>
<p>The theory can&#8217;t be denied. What&#8217;s more, bad actors have other motives to pollute the Web.  For example, if they plant favorable automated comments about their own products or unfavorable ones about the competition&#8217;s, <a href="http://www.texttechnologies.com/2008/06/17/voice-of-the-customermarket-indeed-where-the-action-is/" >Voice of the Customer/Market</a> applications will naturally be confused.  And if automated reputation-checkers get more prominent, there will be a <em>major</em> incentive to game them, just as there has been for Google&#8217;s PageRank.  So VOTC/VOTM market research tools could be polluted as a side effect.</p>
<p>Similarly, if somebody wants to test your e-commerce site by throwing a ton of searches at it, your search logs will lose value.</p>
<p>But disinformation of competitors for the sake of disinformation? Or, as the article suggests, vandalism/extortion? Off the top of my head, I&#8217;m not thinking of a serious near-term threat scenario.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/11/12/denial-of-insight-attacks/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.texttechnologies.com/2008/11/12/denial-of-insight-attacks/</feedburner:origLink></item>
		<item>
		<title>The Google flu search story is pretty interesting</title>
		<link>http://feeds.feedburner.com/~r/TextTechnologies/~3/450110433/</link>
		<comments>http://www.texttechnologies.com/2008/11/11/the-google-flu-search-story-is-pretty-interesting/#comments</comments>
		<pubDate>Wed, 12 Nov 2008 00:13:00 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Google]]></category>

		<category><![CDATA[Search engines]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=294</guid>
		<description><![CDATA[Google reports that it is tracking flu outbreaks via search.  Actually, that&#8217;s a misnomer. Google is not tracking articles written about flu; HealthMap et al. do that.  Rather, this Google project is tracking search queries about flu-related subjects.  They have graphs suggesting a strong correlation between flu-related searches and actual cases of [...]]]></description>
			<content:encoded><![CDATA[<p>Google reports that it is <a href="http://googleblog.blogspot.com/2008/11/tracking-flu-trends.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/googleblog.blogspot.com');">tracking flu outbreaks via search</a>.  Actually, that&#8217;s a misnomer. Google is not tracking <em>articles</em> written about flu; <a href="http://www.healthmap.org/en" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.healthmap.org');">HealthMap</a> et al. do that.  Rather, this Google project is tracking <em>search queries</em> about flu-related subjects.  They have <a href="http://www.google.org/about/flutrends/how.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.google.org');">graphs</a> suggesting a strong correlation between flu-related searches and actual cases of flu, notwithstanding that many searches on &#8220;flu&#8221; would be for, say, &#8220;flu shot.&#8221;  The key point is that Google tracks where searches come from, and hence detects which geographical areas are suffering flu outbreaks.  And it does this 1-2 weeks faster than the alternative method, which is physicians reporting to the Centers for Disease Control (CDC).*<span id="more-294"></span></p>
<p><em>*Which makes perfect sense when you think about how long it takes to actually get a doctor&#8217;s appointment &#8212; or, all kidding aside, even how long it takes to decide it&#8217;s necessary to go to the doctor.</em></p>
<p>Google, quite credibly, claims that these results are based on aggregated data rather than personally identifiable information.  Even so, it heralds a day in which Google observes which groups of users &#8212; geographically organized or otherwise &#8212; care particularly about certain subjects, and tailors news, ads, or search results accordingly &#8212; if that day isn&#8217;t already here.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/11/11/the-google-flu-search-story-is-pretty-interesting/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.texttechnologies.com/2008/11/11/the-google-flu-search-story-is-pretty-interesting/</feedburner:origLink></item>
		<item>
		<title>Lukewarm review of Yahoo mobile search</title>
		<link>http://feeds.feedburner.com/~r/TextTechnologies/~3/450055187/</link>
		<comments>http://www.texttechnologies.com/2008/11/11/review-yahoo-mobile-search/#comments</comments>
		<pubDate>Tue, 11 Nov 2008 23:01:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Language recognition]]></category>

		<category><![CDATA[Search engines]]></category>

		<category><![CDATA[Specialized search]]></category>

		<category><![CDATA[Speech recognition]]></category>

		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=293</guid>
		<description><![CDATA[Stephen Shankland reviewed Yahoo&#8217;s mobile voice search, which works by taking voice input and returning results onscreen (in his case on his Blackberry Pearl).  He found:

There are plenty of times when voice is a more convenient form of input than typing.
Voice recognition was good but far from perfect.
Editing search strings was annoyingly difficult.
Search results [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://news.cnet.com/8301-1023_3-10092659-93.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/news.cnet.com');">Stephen Shankland</a> reviewed Yahoo&#8217;s mobile voice search, which works by taking voice input and returning results onscreen (in his case on his Blackberry Pearl).  He found:</p>
<ul>
<li>There are plenty of times when voice is a more convenient form of input than typing.</li>
<li>Voice recognition was good but far from perfect.</li>
<li>Editing search strings was annoyingly difficult.</li>
<li>Search results themselves aren&#8217;t 100% perfect.</li>
</ul>
<p>No big surprises there. <img src='http://www.texttechnologies.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/11/11/review-yahoo-mobile-search/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.texttechnologies.com/2008/11/11/review-yahoo-mobile-search/</feedburner:origLink></item>
		<item>
		<title>Google and the Authors Guild establish an ASCAP for books</title>
		<link>http://feeds.feedburner.com/~r/TextTechnologies/~3/435275813/</link>
		<comments>http://www.texttechnologies.com/2008/10/28/google-and-the-authors-guild-establish-an-ascap-for-books/#comments</comments>
		<pubDate>Wed, 29 Oct 2008 00:40:25 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Google]]></category>

		<category><![CDATA[Search engines]]></category>

		<category><![CDATA[Social software and online media]]></category>

		<category><![CDATA[Specialized search]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=290</guid>
		<description><![CDATA[Most of the coverage of the Google/Authors Guild settlement today seems to focus on Google&#8217;s side of things.  But I think the authors&#8217; side is much more important. This deal paves the way for traditional publishers to become quaint and useless &#8212; and not a moment too soon.
Below are some quotes &#8212; fair use!! [...]]]></description>
			<content:encoded><![CDATA[<p>Most of the coverage of the Google/Authors Guild settlement today seems to focus on Google&#8217;s side of things.  But I think the authors&#8217; side is much more important. This deal paves the way for <strong>traditional publishers to become quaint and useless</strong> &#8212; and not a moment too soon.</p>
<p>Below are some quotes &#8212; fair use!! <img src='http://www.texttechnologies.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> &#8212; from the Authors Guild official statement on the deal (emphasis mine): <span id="more-290"></span></p>
<blockquote><p>Our proposal to Google back in May 2006 was simple:  while we don’t approve of your unauthorized scanning of our books and displaying snippets for profit, if you’re willing to do something far more ambitious and useful, and you’re willing to cut authors in for their fair share, then it would be our pleasure to work with you.</p>
<p>&#8230;</p>
<p>The payments will flow through the Book Rights Registry, a new independent entity that can be thought of as <strong>the writers’ equivalent of ASCAP. </strong> Much as ASCAP tracks the uses of songs and collects royalties for songwriters and musicians, the Registry will serve the interests of authors and others who own the rights to books appearing online as a result of this settlement. The Registry will be controlled by a board of authors and publishers; as part of the settlement, Google will pay $34.5 million to get the Registry up and running, notify rightsholders of the settlement, and process claims.</p>
<p>Readers are also big winners under the settlement of Authors Guild v. Google.  Readers will be able to browse from their own computers an enormous collection of books.  We hope this will encourage some readers to buy full online access to some of the books.  Readers wanting to view books online in their entirety for free need only reacquaint themselves with their participating local public library: <strong> every public library building is entitled to a free, view-only license to the collection. </strong> College students working on term papers will be able to point their computers to resources other than Wikipedia, if they’re so inclined:<strong> students at subscribing institutions will be able to read and print out any books in the collection.</strong></p></blockquote>
<p>This is what writers have been &#8212; or at least should have been &#8212; awaiting for over a decade, ever since it became clear that the Web would transform media and publishing.  With print-on-demand plus an online book registry, authors get complete access (starting in the US, at least) to readers, paying and otherwise.</p>
<p>So let&#8217;s review what publishers are good for.  In some order, their role is:</p>
<ul>
<li>Marketing through branding/imprimatur.</li>
<li>Marketing through advertising, book tours, and the like.</li>
<li>Marketing/sales/distribution through the physical supply chain.</li>
<li>Editing, art, and other actual contributions to the quality of the product.</li>
</ul>
<p>It&#8217;s a publishing industry open secret that advertising and such like are pretty useless, doing more for the egos of all concerned than they do for actual book sales.  Amazon is obsoleting most physical book stores, airport locations (for impulse purchases) and the like perhaps excepted.</p>
<p>As for branding/imprimatur: The backing of a major publisher can be worth a few thousand hard copy sales to libraries.  But where else does it matter? I was going to suggest that it might in the academic world. But then I looked over at the math books on my shelves, a number of which are bound in familiar Springer Verlag yellow and white. Is there one book there I wouldn&#8217;t own if it weren&#8217;t a Springer Verlag publication? Probably not.  I suspect that your initial reputation boost in academic publishing comes from your institution and peers much more than from an actual academic press.</p>
<p>So for the most part, book publishers and music publishers are left with just one marketing function &#8212; getting the ball rolling.  Published products ultimately sell through word of mouth, but if you don&#8217;t start out with listeners or readers, how can the word of mouth build? The answer, of course, will increasingly be online promotion. If your natural audience is scattered around the country or the world, without being concentrated in one particular geographic location, online marketing is the obvious way to go. And in this world of search engines, YouTube, blogs, and the like, ever more channels for marketing are opening up.</p>
<p>Other than promotion and aggregation (the latter applying more to news/blog publishers than books/music), I do see one other role for publishers &#8212; actually creating product.  Movies/TV and video games are both far bigger businesses than book publishing, and in both cases products are produced by large teams of people. Music is generally produced by small teams of people. And by the way, books can spin off from other kinds of entertainment (I suspect that a large fraction of all science fiction book sales at this point are Star Wars/Star Trek/etc.). Or vice-versa &#8212; some day the economic model for trying an ambitious new comic book project may factor in hoped-for movie and other spin-offs from the get-go.</p>
<p>But traditional publishers aren&#8217;t generally set up to do anything that ambitious. As for lesser but still important functions &#8212; editing, art direction, etc. &#8212; who needs large firms for that?  The music industry seems to get by just fine with small recording studios and independent record producers; book producers could spring up just as well.  Indeed, they exist already, but suffer from the problem that nobody wants to pay them much because book sales overall are so weak.</p>
<p>If I were a best-selling author, I&#8217;d hire away my favorite editor, put her on the payroll directly, and then send her and my agent to squeeze a few extra percentage points out of the big publishers, more than covering the cost of her paycheck.  At least, I would if I were a best-selling author and also had my current personality.  Now, I&#8217;m not, and if I were that kind of author I&#8217;d probably be quite fixated on being left alone to write, and might find it easier to leave money on the table than to take that kind of business responsibility myself. That kind of consideration is probably a big reason why traditional publishers are allowed to stay in business.</p>
<p>But the times sure are a-changin&#8217;.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/10/28/google-and-the-authors-guild-establish-an-ascap-for-books/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.texttechnologies.com/2008/10/28/google-and-the-authors-guild-establish-an-ascap-for-books/</feedburner:origLink></item>
		<item>
		<title>Maybe text mining SHOULD be playing a bigger role in data warehousing</title>
		<link>http://feeds.feedburner.com/~r/TextTechnologies/~3/430336205/</link>
		<comments>http://www.texttechnologies.com/2008/10/24/text-mining-data-warehousin/#comments</comments>
		<pubDate>Fri, 24 Oct 2008 04:39:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Attensity]]></category>

		<category><![CDATA[Comprehensive or exhaustive extraction]]></category>

		<category><![CDATA[Sentiment analysis]]></category>

		<category><![CDATA[Text mining]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=289</guid>
		<description><![CDATA[When I chatted last week with David Bean of Attensity, I commented to him on a paradox: 
Many people think text information is important to analyze, but even so data warehouses don&#8217;t seem to wind up holding very much of it. 
My working theory explaining this has two parts, both of which purport to show [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;"><span style="font-style: normal;">When <a href="http://www.texttechnologies.com/2008/10/24/attensity-update-2/" >I chatted last week with </a></span><a href="http://www.texttechnologies.com/2008/10/24/attensity-update-2/" >David Bean of Attensity</a>, <span style="font-style: normal;">I commented to him on a paradox: </span></p>
<p style="margin-bottom: 0in;"><strong><span style="font-style: normal;">Many people think text information is important to analyze, but even so data warehouses don&#8217;t seem to wind up holding very much of it. </span></strong></p>
<p style="margin-bottom: 0in;"><span id="more-289"></span><span style="font-style: normal;">My working theory explaining this has two parts, both of which purport to show why text data generally doesn&#8217;t fit well into BI or data mining systems. One is that it&#8217;s just too messy and inconsistently organized.  The other </span><span style="font-style: normal;"><span>is that text corpuses generally don&#8217;t contain enough information.</span></span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;"><span>Now, I know that these theories aren&#8217;t wholly true, for I know of counterexamples.  E.g., while I haven&#8217;t written it up yet, I did a call confirming that a recently published </span></span><a href="http://www.spss.com/press/template_view.cfm?PR_ID=1059" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.spss.com');"><span>SPSS text/tabular integrated data mining story</span></a><span style="font-style: normal;"><span> is quite real.  Still, it has felt for a while as if truth lies in those directions.</span></span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;"><span>Anyhow, David offered one useful number range:</span></span></p>
<p><span style="font-style: normal;"><strong>If you do exhaustive extraction on a text corpus, you wind up with 10-20X as much tabular data as you had in text format in the first place.</strong></span><span style="font-style: normal;"><span> (Comparing total bytes to total bytes.)</span></span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;"><span>So how big are those corpuses? I think most text mining installations usually have at least 10s of thousands of documents or verbatims to play with.  Special cases aside, the upper bound seems to usually be about two orders of magnitude higher. And most text-mined documents probably tend to be short, as they commonly are just people&#8217;s reports on a single product/service experience – perhaps 1 KB or so, give or take a factor of 2-3?  So we&#8217;re probably looking at 10 gigabytes of text at the low end, and a few terabytes at the high end, before applying David&#8217;s 10-20X multiplier.</span></span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;"><span>Hmm – that IS enough data for respectable data warehousing &#8230;</span></span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;"><span>Obviously, special cases like national intelligence or very broad-scale web surveys could run larger, as per <a href="http://www.dbms2.com/2008/10/05/marklogic-architecture-deep-dive/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.dbms2.com');">the biggest Marklogic databases</a>.  Medline runs larger too.</span></span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/10/24/text-mining-data-warehousin/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.texttechnologies.com/2008/10/24/text-mining-data-warehousin/</feedburner:origLink></item>
		<item>
		<title>Attensity update</title>
		<link>http://feeds.feedburner.com/~r/TextTechnologies/~3/430331440/</link>
		<comments>http://www.texttechnologies.com/2008/10/24/attensity-update-2/#comments</comments>
		<pubDate>Fri, 24 Oct 2008 04:29:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Application areas]]></category>

		<category><![CDATA[Attensity]]></category>

		<category><![CDATA[Clarabridge]]></category>

		<category><![CDATA[Competitive intelligence]]></category>

		<category><![CDATA[Software as a Service (SaaS)]]></category>

		<category><![CDATA[Text mining]]></category>

		<category><![CDATA[Text mining SaaS]]></category>

		<category><![CDATA[Voice of the Customer]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=288</guid>
		<description><![CDATA[I had a brief chat with the Attensity guys at their Teradata Partners Conference booth – mainly CTO David Bean, although he did buck one question to sales chief Jeff Johnson.  The business trends story remained the same as it was in June:  The sweet spot for new sales remains Voice of the [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I had a brief chat with the Attensity guys at their Teradata Partners Conference booth – mainly CTO David Bean, although he did buck one question to sales chief Jeff Johnson.  The business trends story remained the same as it was in <a href="http://www.texttechnologies.com/2008/06/16/attensity-update-updated/" >June</a>:  The sweet spot for new sales remains Voice of the Customer/Voice of the Market, while on-premise/SaaS new-name accounts are split around 50-50 (by number, not revenue).</p>
<p style="margin-bottom: 0in;">David&#8217;s thoughts as to why the SaaS share isn&#8217;t even higher – as it seems to be for <a href="http://www.texttechnologies.com/2008/06/04/clarabridge-is-now-all-about-text-mining-saas/" >Clarabridge</a>* – centered on the point that some customers want to blend internal and external data, and may not want to ship the internal part out to a SaaS provider.  Besides, if it&#8217;s tabular data, I suspect Attensity isn&#8217;t the right place to ship it anyway.</p>
<p style="margin-bottom: 0in;"><em>*Speaking of Clarabridge, CEO Sid Banerjee recently posted a thoughtful company update in <a href="http://www.texttechnologies.com/2008/09/08/attensit-layered-messaging-marketing-model/" >this comment thread.</a></em></p>
<p style="margin-bottom: 0in;">When I challenged him on ease of use, David said that <strong>Attensity is readying a MicroStrategy-based offering,</strong> which is obviously meant to compete with Clarabridge and any of its perceived advantages head-on.</p>
<img src="http://feeds.feedburner.com/~r/TextTechnologies/~4/430331440" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/10/24/attensity-update-2/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.texttechnologies.com/2008/10/24/attensity-update-2/</feedburner:origLink></item>
		<item>
		<title>Lynda Moulton prefers enterprise search products that get up and running quickly</title>
		<link>http://feeds.feedburner.com/~r/TextTechnologies/~3/418234751/</link>
		<comments>http://www.texttechnologies.com/2008/10/11/lynda-moulton-on-enterprise-search-2/#comments</comments>
		<pubDate>Sun, 12 Oct 2008 02:46:07 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Coveo]]></category>

		<category><![CDATA[Enterprise search]]></category>

		<category><![CDATA[FAST]]></category>

		<category><![CDATA[Microsoft]]></category>

		<category><![CDATA[Search engines]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=287</guid>
		<description><![CDATA[Lynda Moulton, to put it mildly, disagrees with the Gartner Magic Quadrant analysis of enterprise search.  Her preferred approach is captured in:
Coveo, Exalead, ISYS, Recommind, Vivisimo, and X1 are a few of a select group that are making a mark in their respective niches, as products ready for action with a short implementation cycle [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://gilbane.com/search_blog/2008/10/what_determines_a_leader_in_th.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/gilbane.com');">Lynda Moulton</a>, to put it mildly, disagrees with the Gartner Magic Quadrant analysis of enterprise search.  Her preferred approach is captured in:</p>
<blockquote><p>Coveo, Exalead, ISYS, Recommind, Vivisimo, and X1 are a few of a select group that are making a mark in their respective niches, as products ready for action with a short implementation cycle (weeks or months, not years).</p></blockquote>
<p>By way of contrast, Lynda opines:</p>
<blockquote><p>Autonomy and Endeca continue to bring value to very large projects in large companies but are not plug-and-play solutions, by any means. Oracle, IBM, and Microsoft offer search solutions of a very different type with a heavy vendor or third-party service requirement. Google Search Appliance has a much larger installed base than any of these but needs serious tuning and customization to make it suitable to enterprise needs.</p></blockquote>
<p>In particular, her views about FAST (now Microsoft) are scathing.</p>
<img src="http://feeds.feedburner.com/~r/TextTechnologies/~4/418234751" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/10/11/lynda-moulton-on-enterprise-search-2/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.texttechnologies.com/2008/10/11/lynda-moulton-on-enterprise-search-2/</feedburner:origLink></item>
		<item>
		<title>More on Languageware</title>
		<link>http://feeds.feedburner.com/~r/TextTechnologies/~3/416676209/</link>
		<comments>http://www.texttechnologies.com/2008/10/10/more-on-languageware/#comments</comments>
		<pubDate>Fri, 10 Oct 2008 10:38:29 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[IBM and UIMA]]></category>

		<category><![CDATA[Language recognition]]></category>

		<category><![CDATA[Natural language processing (NLP)]]></category>

		<category><![CDATA[Languageware]]></category>

		<category><![CDATA[uima]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=286</guid>
		<description><![CDATA[Marie Wallace of IBM wrote back in response to my post on Languageware.  In particular, it seems I got the Languageware/UIMA relationship wrong.  Marie&#8217;s email was long and thoughtful enough that, rather than just pointing her at the comment thread, I asked for permission to repost it.  Here goes:
Thanks for your mention [...]]]></description>
			<content:encoded><![CDATA[<p>Marie Wallace of IBM wrote back in response to my post on Languageware.  In particular, it seems I got the Languageware/UIMA relationship wrong.  Marie&#8217;s email was long and thoughtful enough that, rather than just pointing her at the comment thread, I asked for permission to repost it.  Here goes:</p>
<blockquote><p>Thanks for your mention of LanguageWare on your blog, albeit a skeptical one :-) I totally understand your skepticism, as there is so much talk about text analytics these days and everyone believes they have solved the problem. I guess I can only hope that our approach will indeed prove to be different and offer some new and interesting perspectives.</p>
<p>The key differentiation in our approach is that we have completely decoupled the language model from the code that runs the analysis. This has been generalized to a set of data-driven algorithms that apply across many languages so that you can have an approach that makes the solution hugely and rapidly customizable (without having to change code). It is this flexibility that we believe is core to realizing multi-lingual and multi-domain text analysis applications in a real-world scenario. This customization environment is available for download from Alphaworks, <a href="http://www.alphaworks.ibm.com/tech/lrw" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.alphaworks.ibm.com');">http://www.alphaworks.ibm.com/tech/lrw</a>, and we would love to get feedback from your community.</p>
<p>On your point about performance, we actually consider UIMA one of our greatest performance optimizations and core to our design. The point about one-pass is that we never go back over the same piece of text twice at the same &#8220;level&#8221; and take a very careful approach when defining our UIMA Annotators. Certain layers of language processing just don&#8217;t make sense to split up due to their interconnectedness and therefore we create our UIMA annotators according to where they sit in the overall processing layers. That&#8217;s the key point.</p>
<p>Anyway those are my thoughts, and thanks again for the mention. It&#8217;s really great to see these topics being discussed in an open and challenging forum.</p></blockquote>
<img src="http://feeds.feedburner.com/~r/TextTechnologies/~4/416676209" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/10/10/more-on-languageware/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.texttechnologies.com/2008/10/10/more-on-languageware/</feedburner:origLink></item>
	</channel>
</rss>
