<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Text Technologies &#187; Categorization and filtering</title>
	<atom:link href="http://www.texttechnologies.com/category/categorization-filtering/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.texttechnologies.com</link>
	<description>Understanding technology ... in both senses of the phrase</description>
	<lastBuildDate>Sat, 05 Jun 2010 04:23:24 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Data marts in the world of text</title>
		<link>http://www.texttechnologies.com/2009/09/20/data-marts-in-the-world-of-text/</link>
		<comments>http://www.texttechnologies.com/2009/09/20/data-marts-in-the-world-of-text/#comments</comments>
		<pubDate>Sun, 20 Sep 2009 09:08:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Enterprise search]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Specialized search]]></category>
		<category><![CDATA[Structured search]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=334</guid>
		<description><![CDATA[CMS/search (Content Management System) expert Alan Pelz-Sharpe recently decried &#8220;Shadow IT&#8221;, by which he seems to mean departmental proliferation of data stores outside the control of the IT department. In other words, he&#8217;s talking about data marts, only for documents rather than tabular data.
Notwithstanding the manifest virtues of centralization, there are numerous reasons you might [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">CMS/search (Content Management System) expert Alan Pelz-Sharpe recently <a href="http://www.intelligententerprise.com/blog/archives/2009/08/shadow_it_and_e.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.intelligententerprise.com');">decried &#8220;Shadow IT&#8221;</a>, by which he seems to mean departmental proliferation of data stores outside the control of the IT department. In other words, he&#8217;s talking about data marts, only for documents rather than tabular data.</p>
<p style="margin-bottom: 0in;">Notwithstanding the manifest virtues of centralization, there are numerous reasons you might want data marts,  in the tabular and document worlds alike.  For example:</p>
<ul>
<li><strong>Price/performance.</strong> Your main/central data manager might be too expensive to support additional large specialized databases. Or different databases and applications might have sufficiently different profiles so as to get great price/performance from different kinds of data managers. This is particularly prevalent in the relational world, where column stores, sequentially-oriented row stores, and random I/O-oriented row stores each have compelling use cases.</li>
<li><strong>Different SLAs</strong> (Service-Level Agreements). Similarly, different applications may 	have very different requirements for uptime, response time, and the 	like.  (In the relational world, think of operational data stores.)</li>
<li><strong>Different security 	requirements.</strong> Different subsets of the data may need different 	levels of security. This is particularly prevalent in the document 	world, where security problems are not as well-solved as in the 	tabular arena, and where it&#8217;s common for a search engine to index 	across different corpuses with radically different levels of 	sensitivity.</li>
<li><strong>Integrated application and user 	interfaces.</strong> In the relational world, there&#8217;s a pretty clean 	separation between data management and interface logic; most serious 	business intelligence tools can talk to most DBMS. The document 	world is quite different. Some search engines bundle, for example, 	various kinds of faceted or parameterized search interfaces. What&#8217;s 	more, in public-facing search, a major differentiator is the 	facilities that the product offers for skewing search results.</li>
<li><strong>Different text applications 	require different thesauruses or taxonomy management systems</strong>. 	Ideally, those should all be integrated &#8212; but <a href="../2005/12/11/the-text-technologies-market-3-heres-whats-missing/">the 	requisite technology still doesn&#8217;t exist</a>.</li>
</ul>
<p style="margin-bottom: 0in;">Bottom line: <strong>Text data marts, much like relational data marts, are almost surely here to stay.</strong></p>
<p style="margin-bottom: 0in;"><em><strong>Related link</strong></em></p>
<ul>
<li>
<p style="margin-bottom: 0in;"><a href="http://www.dbms2.com/2009/06/08/the-future-of-data-marts/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.dbms2.com');">The 	future of data marts</a></p>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2009/09/20/data-marts-in-the-world-of-text/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Yet more NoFollow whining</title>
		<link>http://www.texttechnologies.com/2009/03/07/yet-more-nofollow-whining/</link>
		<comments>http://www.texttechnologies.com/2009/03/07/yet-more-nofollow-whining/#comments</comments>
		<pubDate>Sat, 07 Mar 2009 04:58:15 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[Online marketing]]></category>
		<category><![CDATA[Search engine optimization (SEO)]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Spam and antispam]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=307</guid>
		<description><![CDATA[Andy Beal has a blog post up to the effect that NoFollow is a bad thing.  (Edit: Andy points out in the comment thread that his opposition to NoFollow isn&#8217;t as absolute as I was suggesting.)  Other SEO types are promoting this as if it were some kind of important cause.  I think that&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>Andy Beal has a blog post up to the effect that <a href="http://www.marketingpilgrim.com/2009/03/google-twitter-ditch-nofollow.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.marketingpilgrim.com');">NoFollow is a bad thing</a>. <em> (Edit: Andy points out in the comment thread that his opposition to NoFollow isn&#8217;t as absolute as I was suggesting.) </em> Other SEO types are promoting this as if it were some kind of important cause.  I think that&#8217;s nuts, and <a href="http://www.monashreport.com/2007/01/23/nofollow-does-matter-a-lot/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.monashreport.com');">NoFollow is a huge spam-reducer</a>.</p>
<p>The weakness of Andy&#8217;s argument is illustrated by the one and only scenario he posits in support of his crusade:</p>
<blockquote><p>The result is that a blog post added to a brand new site may well have just broken the story about the capture of Bin Laden (we wish!)–and a link to said post may have been Tweeted and re-tweeted–but Google won’t discover or index that post until it finds a “followed” link. Likely from a trusted site in Google’s index and likely hours, if not days, after it was first shared on Twitter.</p></blockquote>
<p>Helloooo &#8212; if I post something here, it is indexed at least in Google blog search immediately. (As in, within a minute or so.) Ping, crawl, pop &#8212; there it is.  The only remotely valid version of Andy&#8217;s complaint is that it might take some hours for Google&#8217;s main index to update &#8212; but even there, there&#8217;s a News listing at the top.  This simply is not a problem.</p>
<p>Now, I think it would be personally great for me if all the links to my sites from Wikipedia and Twitter and the comment threads of major blogs pointed back with &#8220;link juice.&#8221; On the other hand, even with NoFollow out there, my sites come up high in Google&#8217;s rankings for all sorts of keywords, driving a lot of their readership.  I imagine the same is true for most other sites containing fairly unique content that people find interesting enough to link to.</p>
<p>So other than making it harder to engage in deceptive SEO, I fail to see what problems NoFollow is causing.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2009/03/07/yet-more-nofollow-whining/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Where &#8220;semantic&#8221; technology is or isn&#8217;t important</title>
		<link>http://www.texttechnologies.com/2008/12/29/where-semantic-technology-is-or-isnt-important/</link>
		<comments>http://www.texttechnologies.com/2008/12/29/where-semantic-technology-is-or-isnt-important/#comments</comments>
		<pubDate>Tue, 30 Dec 2008 00:59:55 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Enterprise search]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Specialized search]]></category>
		<category><![CDATA[Structured search]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=301</guid>
		<description><![CDATA[At Lynda Moulton&#8217;s behest, I spoke a couple of times recently on the subject of where &#8220;semantic&#8221; technology is or isn&#8217;t likely to be important.  One was at the Gilbane conference in early December.  The slides were based on my previously posted deck for a June talk I gave on a text analytics market overview. [...]]]></description>
			<content:encoded><![CDATA[<p>At Lynda Moulton&#8217;s behest, I spoke a couple of times recently on the subject of where &#8220;semantic&#8221; technology is or isn&#8217;t likely to be important.  One was at the Gilbane conference in early December.  The slides were based on my previously posted deck for a June talk I gave on a <a href="http://www.texttechnologies.com/2008/06/19/text-analytics-marketplace-competitive-landscape-trends/" >text analytics market overview</a>. The actual Gilbane slides may be found <a href="http://www.monash.com/uploads/Gilbane-December-2008.ppt" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.monash.com');">here</a>.</p>
<p>My opinions about the applicability of semantic technology include:</p>
<ul>
<li>The big bucks in web search are for &#8220;transactional&#8221; web search, and semantics isn&#8217;t the issue there. <em>(Slides 3-4)</em></li>
<li>When UIs finally go beyond the simple search box &#8212; e.g. to clusters/facets or to voice &#8212; semantics should have a role to play. <em>(Slide 5)</em></li>
<li>Public-facing site search depends &#8212; more than any other area of text analytics &#8212; on hand-tagging. <em>(Slide 7)</em></li>
<li>&#8220;Enterprise&#8221; search that searches specialized external databases could benefit from semantic technologies. <em>(Slide 8)</em></li>
<li>True enterprise search could benefit from semantic technologies in multiple ways, but has other problems as well. <em>(Slides 10-11)</em></li>
<li>Semantics &#8212; specifically extraction &#8212; is central to custom publishing. <em>(Slide 12 &#8212; upon review I regret using the word &#8220;sophisticated&#8221;)</em></li>
<li>Semantics is central to text mining. <em>(Slide 18)</em></li>
<li>Semantics could play a big role in all sorts of exciting future developments. <em>(Slide 19)</em></li>
</ul>
<p>So what would your list be like?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/12/29/where-semantic-technology-is-or-isnt-important/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Are denial-of-insight attacks a threat to search logs and/or VOTC/VOTM apps?</title>
		<link>http://www.texttechnologies.com/2008/11/12/denial-of-insight-attacks/</link>
		<comments>http://www.texttechnologies.com/2008/11/12/denial-of-insight-attacks/#comments</comments>
		<pubDate>Wed, 12 Nov 2008 07:45:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Competitive intelligence]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Spam and antispam]]></category>
		<category><![CDATA[Voice of the Customer]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=295</guid>
		<description><![CDATA[TechTaxi points out that it&#8217;s at least theoretically possible to, by polluting the Web, pollute somebody&#8217;s web-wide information gathering.  (Hat tip to Daniel Tunkelang.)  They further assert this is a relatively near-term threat.
The theory can&#8217;t be denied. What&#8217;s more, bad actors have other motives to pollute the Web.  For example, if they [...]]]></description>
			<content:encoded><![CDATA[<p>TechTaxi <a href="http://techtaxi.blogspot.com/2006/04/denial-of-insight-attacks-could.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/techtaxi.blogspot.com');">points out</a> that it&#8217;s at least theoretically possible to, by polluting the Web, pollute somebody&#8217;s web-wide information gathering.  (Hat tip to <a href="http://thenoisychannel.com/2008/11/11/big-google-can-be-benign/" onclick="javascript:pageTracker._trackPageview('/outbound/article/thenoisychannel.com');">Daniel Tunkelang</a>.)  They further assert this is a relatively near-term threat.</p>
<p>The theory can&#8217;t be denied. What&#8217;s more, bad actors have other motives to pollute the Web.  For example, if they plant favorable automated comments about their own products or unfavorable ones about the competition&#8217;s, <a href="http://www.texttechnologies.com/2008/06/17/voice-of-the-customermarket-indeed-where-the-action-is/">Voice of the Customer/Market</a> applications will naturally be confused.  And if automated reputation-checkers get more prominent, there will be a <em>major</em> incentive to game them, just as there has been for Google&#8217;s PageRank.  So VOTC/VOTM market research tools could be polluted as a side effect.</p>
<p>Similarly, if somebody wants to test your e-commerce site by throwing a ton of searches at it, your search logs will lose value.</p>
<p>But disinformation of competitors for the sake of disinformation? Or, as the article suggests, vandalism/extortion? Off the top of my head, I&#8217;m not thinking of a serious near-term threat scenario.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/11/12/denial-of-insight-attacks/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The phrase &#8220;business intelligence&#8221; was COINED for text analytics</title>
		<link>http://www.texttechnologies.com/2008/07/11/the-phrase-business-intelligence-was-coined-for-text-analytics/</link>
		<comments>http://www.texttechnologies.com/2008/07/11/the-phrase-business-intelligence-was-coined-for-text-analytics/#comments</comments>
		<pubDate>Fri, 11 Jul 2008 07:31:00 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[BI integration]]></category>
		<category><![CDATA[Categorization and filtering]]></category>
		<category><![CDATA[IBM and UIMA]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[knowledge management]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=264</guid>
		<description><![CDATA[Late last year, there was a little flap about who invented the phrase business intelligence.  Credit turns out to go to an IBM researcher named H. P. Luhn, as per this 1958 paper.  Well, I finally took a look at the paper, after Jeff Jones of IBM sent over another copy.  And [...]]]></description>
			<content:encoded><![CDATA[<p>Late last year, there was <a href="http://www.softwarememories.com/2007/12/02/disputed-history-of-the-term-business-intelligence/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.softwarememories.com');">a little flap about who invented the phrase <em>business intelligence</em></a>.  Credit turns out to go to an IBM researcher named H. P. Luhn, as per <a href="http://www.research.ibm.com/journal/rd/024/ibmrd0204H.pdf" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.research.ibm.com');">this 1958 paper</a>.  Well, I finally took a look at the paper, after Jeff Jones of IBM sent over another copy.  And guess what?  It&#8217;s all about text analytics.  Specifically, it&#8217;s about what we might now call a combination of classification and knowledge management.</p>
<p>Half a century later, the industry is finally poised to deliver on that vision.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/07/11/the-phrase-business-intelligence-was-coined-for-text-analytics/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>3 specialized markets for text analytics</title>
		<link>http://www.texttechnologies.com/2008/06/19/3-specialized-markets-for-text-analytics/</link>
		<comments>http://www.texttechnologies.com/2008/06/19/3-specialized-markets-for-text-analytics/#comments</comments>
		<pubDate>Thu, 19 Jun 2008 07:44:09 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Language recognition]]></category>
		<category><![CDATA[Natural language processing (NLP)]]></category>
		<category><![CDATA[Spam and antispam]]></category>
		<category><![CDATA[Speech recognition]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=250</guid>
		<description><![CDATA[In the previous post, I offered a list of eight linguistics-based market segments, and a slide deck surveying them.  And I promised a series of follow-up posts based on the slides.
Let me begin by explaining what I mean by some of that list (taken from Slide 2), starting from the bottom.

Machine translation is a [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;"><span style="font-style: normal;"><span>In the <a href="http://www.texttechnologies.com/2008/06/19/text-analytics-marketplace-competitive-landscape-trends/#more-249" >previous post</a>, I offered a list of eight linguistics-based market segments, and a <a href="http://www.monash.com/Text-analytics-markets-June-2008.ppt" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.monash.com');">slide deck</a> surveying them.  And I promised a series of follow-up posts based on the slides.</span></span><span id="more-250"></span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;"><span>Let me begin by explaining what I mean by some of that list (taken from Slide 2), starting from the bottom.</span></span></p>
<ul>
<li><span style="font-style: normal;"><span><strong>Machine translation</strong> is a small business, with small specialized vendors. Lernout &amp; Hauspie attempted to combine it with voice recognition in a complex financial play, but that collapsed in a miasma of stock fraud. The remnants turned into what became Nuance Communications.</span></span></li>
<li><span style="font-style: normal;"><span>Nuance is a roll-up of most of the important independent <strong>voice recognition </strong>vendors. So far voice recognition has worked best in two areas: “Hands-free” computer use/dictation, and IVR (interactive voice response). While both are important, neither is exactly a mainstream enterprise computer software business. So voice recognition is not closely integrated with the other market segments.</span></span></li>
<li><strong>“</strong><span style="font-style: normal;"><span><strong>Natural language processing”</strong> other than voice recognition isn&#8217;t much of a business at this time (with apologies to Progress EasyAsk). It doesn&#8217;t make the list at all.</span></span></li>
<li><span style="font-style: normal;"><span><strong>Spam filtering</strong> is obviously a major business, whether or not it is getting combined into more general security and/or messaging product suites. Antispam vendors actually perform a lot of machine learning, much like text miners do. But the types of rules they wind up with are quite different. And their hardest problems aren&#8217;t linguistic ones, usually, as the spammers have gone beyond text to, e.g., words depicted in graphical images. Besides, even where linguistics are involved, it&#8217;s a very different problem to identify words used by bad guys trying to spoof you (and the rest of the world) than it is to understand your particular users.</span></span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;"><span>Why and to what extent I see the other five as separate markets was explained in connection with the subsequent 17 slides.</span></span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/06/19/3-specialized-markets-for-text-analytics/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Text Analytics Marketplace: Competitive landscape and trends</title>
		<link>http://www.texttechnologies.com/2008/06/19/text-analytics-marketplace-competitive-landscape-trends/</link>
		<comments>http://www.texttechnologies.com/2008/06/19/text-analytics-marketplace-competitive-landscape-trends/#comments</comments>
		<pubDate>Thu, 19 Jun 2008 07:35:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Audio and video search]]></category>
		<category><![CDATA[BI integration]]></category>
		<category><![CDATA[Custom publishing]]></category>
		<category><![CDATA[Enterprise search]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Natural language processing (NLP)]]></category>
		<category><![CDATA[Nuance]]></category>
		<category><![CDATA[Progress and EasyAsk]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Social software and online media]]></category>
		<category><![CDATA[Spam and antispam]]></category>
		<category><![CDATA[Speech recognition]]></category>
		<category><![CDATA[Structured search]]></category>
		<category><![CDATA[Text Analytics Summit]]></category>
		<category><![CDATA[Text mining]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=249</guid>
		<description><![CDATA[As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are off-shoots of what used to be called “information retrieval”:
1.  Web search
2.  Public-facing site search
3.  Enterprise search and knowledge management
4.  Custom publishing
5.  Text mining and extraction
Three are more standalone:
6.  Spam filtering
7. [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are off-shoots of what used to be called “information retrieval”:</p>
<p style="margin-bottom: 0in; font-style: normal; padding-left: 30px;">1.  Web search</p>
<p style="margin-bottom: 0in; font-style: normal; padding-left: 30px;">2.  Public-facing site search</p>
<p style="margin-bottom: 0in; font-style: normal; padding-left: 30px;">3.  Enterprise search and knowledge management</p>
<p style="margin-bottom: 0in; font-style: normal; padding-left: 30px;">4.  Custom publishing</p>
<p style="padding-left: 30px;">5.  Text mining and extraction</p>
<p style="margin-bottom: 0in; font-style: normal;">Three are more standalone:</p>
<p style="margin-bottom: 0in; font-style: normal; padding-left: 30px;">6.  Spam filtering</p>
<p style="margin-bottom: 0in; font-style: normal; padding-left: 30px;">7.  Voice recognition</p>
<p style="margin-bottom: 0in; font-style: normal; padding-left: 30px;">8.  Machine translation</p>
<p><span id="more-249"></span></p>
<p style="margin-bottom: 0in;">This list comes from a talk I gave Monday at the Text Analytics Summit called <em>The Text Analytics Marketplace: Competitive landscape and trends. </em>In half an hour, I covered the first five areas (in Sue Feldman&#8217;s word, at a “gallop”). The slide deck has been uploaded to the link below.  <span style="font-style: normal;"><span>I plan to break out the material from the talk into a series of blog posts over the next few (or perhaps not-so-few) weeks. </span></span></p>
<p style="margin-bottom: 0in;"><em><strong>Slides:</strong></em></p>
<ul>
<li><a href="http://www.monash.com/Text-analytics-markets-June-2008.ppt" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.monash.com');"><span>The Text Analytics Marketplace: Competitive landscape and trends</span></a></li>
</ul>
<p style="margin-bottom: 0in;"><strong><em>Other posts based on those slides:</em></strong></p>
<ul>
<li><span><a href="http://www.texttechnologies.com/2008/06/19/3-specialized-markets-for-text-analytics/" >Three specialized markets for text analytics</a> (based on Slide 2)</span></li>
<li><span><a href="http://www.texttechnologies.com/2008/06/19/6-trends-that-could-shake-up-the-text-analytics-market/" >6 trends that could shake up the text analytics market</a> (based on Slide 19)</span></li>
<li><span><a href="(in A World of Bytes)">Why search technologies are going to recombine</a> (in <em>A World of Bytes</em>, based on Slide 19)<br />
</span></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/06/19/text-analytics-marketplace-competitive-landscape-trends/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>How text search has evolved over the past 15 years</title>
		<link>http://www.texttechnologies.com/2008/06/15/how-text-search-has-evolved-over-the-past-15-years/</link>
		<comments>http://www.texttechnologies.com/2008/06/15/how-text-search-has-evolved-over-the-past-15-years/#comments</comments>
		<pubDate>Sun, 15 Jun 2008 07:26:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Enterprise search]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Structured search]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=239</guid>
		<description><![CDATA[I just stumbled across a brilliant summary of evolution in text search technology, written four years ago.  It&#8217;s equally valid today (which in itself says something).  I found it on the Prism Legal blog, but the actual author is Sharon Flank.  My own comments are interspersed in bold.
“There are several underlying important [...]]]></description>
			<content:encoded><![CDATA[<p>I just stumbled across a brilliant summary of evolution in text search technology, written four years ago.  It&#8217;s equally valid today (which in itself says something).  I found it on the <a href="http://www.prismlegal.com/wordpress/index.php?m=200407#post-190" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.prismlegal.com');">Prism Legal</a> blog, but the actual author is Sharon Flank.  My own comments are interspersed in bold.<span id="more-239"></span></p>
<blockquote><p>“There are several underlying important developments over the last decade or so:</p>
<ul>
<li>Incorporating user feedback to refine search results, usually indirectly rather than explicitly, making results better through machine learning. [Amazon.com is the most-often cited example of this with it’s “if you like A, you’ll also like B.”]  <strong>[CAM] Technically, that&#8217;s not a search example, but the general point is correct even so.</strong></li>
<li>Assessments based on usage or referral. This is what makes Google so useful and popular. This approach gives higher rankings if other web sites point to a target or if that target gets a lot of hits.</li>
<li>Various approaches to using taxonomies. The better applications use taxonomies as a navigation guide but don’t force it or require administrators to implement it. Vivisimo.com is an example of interesting, automated clustering approach. <strong>[CAM] &#8220;Faceted search&#8221; seems to be the buzzword here. It&#8217;s a big part of what I call &#8220;structured search.&#8221; But taxonomy use is probably more trivial in search than it is in, say, text mining.</strong></li>
<li>Better handling of phrases. Google automatically parses phrases and deals with search terms as phrases. This now seems natural but in the AltaVista days, you couldn’t tell a Venetian blind from a blind Venetian [example courtesy of Prof. George Miller, Princeton Univ. - too good not to cite].</li>
<li>Context-sensitive search is now an emerging trend. Systems track what users have previously searched for and infer interest in the same domain to refine search result. So if you look for “line” and a system knows you’ve just looked for “tacklebox,” then it infers you mean “fishing line.” Or if you search for bagels and the system knows you are in 20009, it tells you that you can buy them at Comet Liquors (which happens to sell bagels).  <strong>[CAM] That happens a lot with ad serving.  But I&#8217;m not convinced it hit actual search until Google&#8217;s personal search kicked off, and that was quite recent.</strong></li>
</ul>
<p>“More generally in natural language processing, the statistical and linguistic approaches are converging in a new way: use massive amounts of data (i.e. the Web) to get statistical answers to deep linguistic questions, like “How do we figure out what the most likely referent is for the pronoun ‘they’?” Or “How do we determine the correct sense for ambiguous words?” These things aren’t in search engines yet, but you can expect to see more “intelligent” features coming out of this approach.</p>
<p>“Looking at this list, you can see that the conceptual changes (breakthroughs?), with the exception of better phrase handling, are primarily focused around Web searches. When dealing with one-of-a-kind document collections behind the corporate firewall, many of these developments turn out not to add much to older approaches. So, at least for enterprise search, I too remain partial to some of the older products you mention, though I am disappointed that most of the old-time vendors have not updated their approaches beyond adding taxonomy support.” <strong>[CAM] Yep, web search and enterprise search are <a href="http://www.texttechnologies.com/2008/01/14/enterprise-search-versus-web-search/" >very different things</a>.</strong></p>
</blockquote>
<p>The original blog post did have one error &#8212; Sharon&#8217;s PhD isn&#8217;t in Computational Linguistics, but rather Slavic Linguistics, as I recently noted in my post about <a href="http://www.texttechnologies.com/2008/06/10/text-analytics-technology-jobs-humanities-majors/" >text analytics careers for humanities majors</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/06/15/how-text-search-has-evolved-over-the-past-15-years/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Expert System S.p.A. update</title>
		<link>http://www.texttechnologies.com/2008/06/11/expert-system-s-p/</link>
		<comments>http://www.texttechnologies.com/2008/06/11/expert-system-s-p/#comments</comments>
		<pubDate>Wed, 11 Jun 2008 11:12:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Competitive intelligence]]></category>
		<category><![CDATA[Coveo]]></category>
		<category><![CDATA[Expert System S.p.A.]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Text mining]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=237</guid>
		<description><![CDATA[I chatted with Brooke Aker, the new CEO of Expert System&#8217;s US subsidiary, for quite a while last week.  Unfortunately, we had some cell phone problems, and email followup hasn&#8217;t gone well, so I&#8217;m hazy on a few details.  But here are some highlights, as best I understood them.

Expert System now has 145 [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I chatted with Brooke Aker, the new CEO of Expert System&#8217;s US subsidiary, for quite a while last week.  Unfortunately, we had some cell phone problems, and email followup hasn&#8217;t gone well, so I&#8217;m hazy on a few details.  But here are some highlights, as best I understood them.<span id="more-237"></span></p>
<ul>
<li><strong>Expert System now has 145 employees.</strong></li>
<li><strong>2 of the employees are in the US</strong> (plus at least one more full-time equivalent on a contract basis). <strong>Brooke believes the US operation will eventually be the biggest part of the company.</strong></li>
<li><strong>Expert System has sold its market intelligence SaaS offering to two global auto manufacturers. </strong><span>Competitors were Nielsen BuzzMetrics, somebody whose name sounded like “flexilytics” (I presume that would be Lexalytics  <em>Edit:  But see Lexalytics CEO Jeff Catlin&#8217;s comment below</em>), and somebody whose name sounded like “Truecast” (I haven&#8217;t yet guessed who that is).</span></li>
<li><span>If I understood correctly, Expert System acquired that product by picking up Brooke&#8217;s tiny company <a href="http://www.acuitysoftware.com/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.acuitysoftware.com');">Acuity Software</a>.  Acuity was/is a user of Expert System&#8217;s technology, having replaced Coveo&#8217;s with it so as to get better semantics.</span></li>
<li><span>Brooke is </span><strong>optimistic about Expert System&#8217;s prospects in the intelligence market. </strong><span> New semantic networks in Arabic and English (joining one Expert System already had in Italian) are a big part of the reason.  Brooke says the intelligence community is now actively interested in technology that&#8217;s been validated by the commercial market, on the theory it&#8217;s apt to be more complete than research/government-only products.  Expert System is also working on a semantic network in another undisclosed Middle Eastern language; Brooke stoically refrained from confirming the blindingly obvious guess that this would be Farsi.</span></li>
<li><span>Expert System&#8217;s third effort in the US market, coming soon, will be a semantic ad platform.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span>Once again, however, I made it through an Expert System briefing without gaining a real understanding of its technology.  I gather they&#8217;re proud of their in-memory data structure for their semantic network, but I haven&#8217;t a clue (beyond the obvious guesses) as to what that data structure is.  Similarly, Brooke said that a distinguishing feature of Expert System&#8217;s semantic network is that words have lots of attributes, which are the same thing as categories, and supplied a list of the 11 top-level categories:  <em>Objects, animals, plants, people, concepts, places, time, natural phenomena, state, quantity, group.</em> But it&#8217;s easy to come up with a lot of things that don&#8217;t seem to fit that list very well (especially events, such as numerous different word-senses of “strike”), so absent further elucidation I didn&#8217;t find that particularly instructive either.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/06/11/expert-system-s-p/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Google seems to have rehabilitated us</title>
		<link>http://www.texttechnologies.com/2008/05/08/google-seems-to-have-rehabilitated-us/</link>
		<comments>http://www.texttechnologies.com/2008/05/08/google-seems-to-have-rehabilitated-us/#comments</comments>
		<pubDate>Thu, 08 May 2008 09:16:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[Search engine optimization (SEO)]]></category>
		<category><![CDATA[Spam and antispam]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=218</guid>
		<description><![CDATA[As previously noted, we were de-indexed by Google, due to the injection of a whole lot of spammy hidden links.  We&#8217;re back now, after about two weeks, even on the blog (this one) where there was no official de-indexing notice and hence no way to apply for re-consideration.   And thus we once [...]]]></description>
			<content:encoded><![CDATA[<p>As previously noted, we were <a href="http://www.texttechnologies.com/2008/04/25/drive-by-google-de-listing/" >de-indexed by Google</a>, due to the injection of a whole lot of spammy hidden links.  We&#8217;re back now, after about two weeks, even on the blog (this one) where there was no official de-indexing notice and hence no way to apply for re-consideration.   And thus we once again have high rankings for search terms such as <em>Netezza, DATAllegro, Clarabridge, </em>and <em>Attivio.</em></p>
<p>We&#8217;re designing a new blog theme &#8212; the current one is just an emergency stopgap &#8212; that will (among myriad more important virtues) be more SEO-friendly.  I&#8217;ll be curious to see whether that makes much actual difference from a search ranking standpoint.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/05/08/google-seems-to-have-rehabilitated-us/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
