<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Text Technologies &#187; Ontologies</title>
	<atom:link href="http://www.texttechnologies.com/category/categorization-filtering/ontology-taxonomy/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.texttechnologies.com</link>
	<description>Understanding technology ... in both senses of the phrase</description>
	<lastBuildDate>Wed, 18 Jan 2012 17:02:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Data marts in the world of text</title>
		<link>http://www.texttechnologies.com/2009/09/20/data-marts-in-the-world-of-text/</link>
		<comments>http://www.texttechnologies.com/2009/09/20/data-marts-in-the-world-of-text/#comments</comments>
		<pubDate>Sun, 20 Sep 2009 09:08:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Enterprise search]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Specialized search]]></category>
		<category><![CDATA[Structured search]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=334</guid>
		<description><![CDATA[CMS/search (Content Management System) expert Alan Pelz-Sharpe recently decried &#8220;Shadow IT&#8221;, by which he seems to mean departmental proliferation of data stores outside the control of the IT department. In other words, he&#8217;s talking about data marts, only for documents rather than tabular data. Notwithstanding the manifest virtues of centralization, there are numerous reasons you [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">CMS/search (Content Management System) expert Alan Pelz-Sharpe recently <a href="http://www.intelligententerprise.com/blog/archives/2009/08/shadow_it_and_e.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.intelligententerprise.com');">decried &#8220;Shadow IT&#8221;</a>, by which he seems to mean departmental proliferation of data stores outside the control of the IT department. In other words, he&#8217;s talking about data marts, only for documents rather than tabular data.</p>
<p style="margin-bottom: 0in;">Notwithstanding the manifest virtues of centralization, there are numerous reasons you might want data marts,  in the tabular and document worlds alike.  For example:</p>
<ul>
<li><strong>Price/performance.</strong> Your main/central data manager might be too expensive to support additional large specialized databases. Or different databases and applications might have different enough profiles to get great price/performance from different kinds of data managers. This is particularly prevalent in the relational world, where column stores, sequentially-oriented row stores, and random I/O-oriented row stores each have compelling use cases.</li>
<li><strong>Different SLAs</strong> (Service-Level Agreements). Similarly, different applications may 	have very different requirements for uptime, response time, and the 	like.  (In the relational world, think of operational data stores.)</li>
<li><strong>Different security 	requirements.</strong> Different subsets of the data may need different 	levels of security. This is particularly prevalent in the document 	world, where security problems are not as well-solved as in the 	tabular arena, and where it&#8217;s common for a search engine to index 	across different corpuses with radically different levels of 	sensitivity.</li>
<li><strong>Integrated application and user 	interfaces.</strong> In the relational world, there&#8217;s a pretty clean 	separation between data management and interface logic; most serious 	business intelligence tools can talk to most DBMS. The document 	world is quite different. Some search engines bundle, for example, 	various kinds of faceted or parameterized search interfaces. What&#8217;s 	more, in public-facing search, a major differentiator is the 	facilities that the product offers for skewing search results.</li>
<li><strong>Different text applications 	require different thesauruses or taxonomy management systems</strong>. 	Ideally, those should all be integrated &#8212; but <a href="../2005/12/11/the-text-technologies-market-3-heres-whats-missing/">the 	requisite technology still doesn&#8217;t exist</a>.</li>
</ul>
<p style="margin-bottom: 0in;">Bottom line: <strong>Text data marts, much like relational data marts, are almost surely here to stay.</strong></p>
<p style="margin-bottom: 0in;"><em><strong>Related link</strong></em></p>
<ul>
<li>
<p style="margin-bottom: 0in;"><a href="http://www.dbms2.com/2009/06/08/the-future-of-data-marts/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.dbms2.com');">The 	future of data marts</a></p>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2009/09/20/data-marts-in-the-world-of-text/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Where &#8220;semantic&#8221; technology is or isn&#8217;t important</title>
		<link>http://www.texttechnologies.com/2008/12/29/where-semantic-technology-is-or-isnt-important/</link>
		<comments>http://www.texttechnologies.com/2008/12/29/where-semantic-technology-is-or-isnt-important/#comments</comments>
		<pubDate>Tue, 30 Dec 2008 00:59:55 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Enterprise search]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Specialized search]]></category>
		<category><![CDATA[Structured search]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=301</guid>
		<description><![CDATA[At Lynda Moulton&#8217;s behest, I spoke a couple of times recently on the subject of where &#8220;semantic&#8221; technology is or isn&#8217;t likely to be important.  One was at the Gilbane conference in early December.  The slides were based on my previously posted deck for a June talk I gave on a text analytics market overview. [...]]]></description>
			<content:encoded><![CDATA[<p>At Lynda Moulton&#8217;s behest, I spoke a couple of times recently on the subject of where &#8220;semantic&#8221; technology is or isn&#8217;t likely to be important.  One was at the Gilbane conference in early December.  The slides were based on my previously posted deck for a June talk I gave on a <a href="http://www.texttechnologies.com/2008/06/19/text-analytics-marketplace-competitive-landscape-trends/" >text analytics market overview</a>. The actual Gilbane slides may be found <a href="http://www.monash.com/uploads/Gilbane-December-2008.ppt" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.monash.com');">here</a>.</p>
<p>My opinions about the applicability of semantic technology include:</p>
<ul>
<li>The big bucks in web search are for &#8220;transactional&#8221; web search, and semantics isn&#8217;t the issue there. <em>(Slides 3-4)</em></li>
<li>When UIs finally go beyond the simple search box &#8212; e.g. to clusters/facets or to voice &#8212; semantics should have a role to play. <em>(Slide 5)</em></li>
<li>Public-facing site search depends &#8212; more than any other area of text analytics &#8212; on hand-tagging. <em>(Slide 7)</em></li>
<li>&#8220;Enterprise&#8221; search that searches specialized external databases could benefit from semantic technologies. <em>(Slide 8)</em></li>
<li>True enterprise search could benefit from semantic technologies in multiple ways, but has other problems as well. <em>(Slides 10-11)</em></li>
<li>Semantics &#8212; specifically extraction &#8212; is central to custom publishing. <em>(Slide 12 &#8212; upon review I regret using the word &#8220;sophisticated&#8221;)</em></li>
<li>Semantics is central to text mining. <em>(Slide 18)</em></li>
<li>Semantics could play a big role in all sorts of exciting future developments. <em>(Slide 19)</em></li>
</ul>
<p>So what would your list be like?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/12/29/where-semantic-technology-is-or-isnt-important/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>How text search has evolved over the past 15 years</title>
		<link>http://www.texttechnologies.com/2008/06/15/how-text-search-has-evolved-over-the-past-15-years/</link>
		<comments>http://www.texttechnologies.com/2008/06/15/how-text-search-has-evolved-over-the-past-15-years/#comments</comments>
		<pubDate>Sun, 15 Jun 2008 07:26:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Enterprise search]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Structured search]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=239</guid>
		<description><![CDATA[I just stumbled across a brilliant summary of evolution in text search technology, written four years ago. It&#8217;s equally valid today (which in itself says something). I found it on the Prism Legal blog, but the actual author is Sharon Flank. My own comments are interspersed in bold. “There are several underlying important developments over [...]]]></description>
			<content:encoded><![CDATA[<p>I just stumbled across a brilliant summary of evolution in text search technology, written four years ago.  It&#8217;s equally valid today (which in itself says something).  I found it on the <a href="http://www.prismlegal.com/wordpress/index.php?m=200407#post-190" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.prismlegal.com');">Prism Legal</a> blog, but the actual author is Sharon Flank.  My own comments are interspersed in bold.<span id="more-239"></span></p>
<blockquote><p>“There are several underlying important developments over the last decade or so:</p>
<ul>
<li>Incorporating user feedback to refine search results, usually indirectly rather than explicitly, making results better through machine learning. [Amazon.com is the most-often cited example of this with its “if you like A, you’ll also like B.”]  <strong>[CAM] Technically, that&#8217;s not a search example, but the general point is correct even so.</strong></li>
<li>Assessments based on usage or referral. This is what makes Google so useful and popular. This approach gives higher rankings if other web sites point to a target or if that target gets a lot of hits.</li>
<li>Various approaches to using taxonomies. The better applications use taxonomies as a navigation guide but don’t force it or require administrators to implement it. Vivisimo.com is an example of an interesting, automated clustering approach. <strong>[CAM] &#8220;Faceted search&#8221; seems to be the buzzword here. It&#8217;s a big part of what I call &#8220;structured search.&#8221; But taxonomy use is probably more trivial in search than it is in, say, text mining.</strong></li>
<li>Better handling of phrases. Google automatically parses phrases and deals with search terms as phrases. This now seems natural but in the AltaVista days, you couldn’t tell a Venetian blind from a blind Venetian [example courtesy of Prof. George Miller, Princeton Univ. - too good not to cite].</li>
<li>Context-sensitive search is now an emerging trend. Systems track what users have previously searched for and infer interest in the same domain to refine search results. So if you look for “line” and a system knows you’ve just looked for “tacklebox,” then it infers you mean “fishing line.” Or if you search for bagels and the system knows you are in 20009, it tells you that you can buy them at Comet Liquors (which happens to sell bagels).  <strong>[CAM] That happens a lot with ad serving.  But I&#8217;m not convinced it hit actual search until Google&#8217;s personal search kicked off, and that was quite recent.</strong></li>
</ul>
<p>“More generally in natural language processing, the statistical and linguistic approaches are converging in a new way: use massive amounts of data (i.e. the Web) to get statistical answers to deep linguistic questions, like “How do we figure out what the most likely referent is for the pronoun ‘they’?” Or “How do we determine the correct sense for ambiguous words?” These things aren’t in search engines yet, but you can expect to see more “intelligent” features coming out of this approach.</p>
<p>“Looking at this list, you can see that the conceptual changes (breakthroughs?), with the exception of better phrase handling, are primarily focused around Web searches. When dealing with one-of-a-kind document collections behind the corporate firewall, many of these developments turn out not to add much to older approaches. So, at least for enterprise search, I too remain partial to some of the older products you mention, though I am disappointed that most of the old-time vendors have not updated their approaches beyond adding taxonomy support.” <strong>[CAM] Yep, web search and enterprise search are <a href="http://www.texttechnologies.com/2008/01/14/enterprise-search-versus-web-search/" >very different things</a>.</strong></p>
</blockquote>
<p>The original blog post did have one error &#8212; Sharon&#8217;s PhD isn&#8217;t in Computational Linguistics, but rather Slavic Linguistics, as I recently noted in my post about <a href="http://www.texttechnologies.com/2008/06/10/text-analytics-technology-jobs-humanities-majors/" >text analytics careers for humanities majors</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/06/15/how-text-search-has-evolved-over-the-past-15-years/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Expert System S.p.A. update</title>
		<link>http://www.texttechnologies.com/2008/06/11/expert-system-s-p/</link>
		<comments>http://www.texttechnologies.com/2008/06/11/expert-system-s-p/#comments</comments>
		<pubDate>Wed, 11 Jun 2008 11:12:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Competitive intelligence]]></category>
		<category><![CDATA[Coveo]]></category>
		<category><![CDATA[Expert System S.p.A.]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Text mining]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/?p=237</guid>
		<description><![CDATA[I chatted with Brooke Aker, the new CEO of Expert System&#8217;s US subsidiary, for quite a while last week. Unfortunately, we had some cell phone problems, and email followup hasn&#8217;t gone well, so I&#8217;m hazy on a few details. But here are some highlights, as best I understood them. Expert System now has 145 employees. [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I chatted with Brooke Aker, the new CEO of Expert System&#8217;s US subsidiary, for quite a while last week.  Unfortunately, we had some cell phone problems, and email followup hasn&#8217;t gone well, so I&#8217;m hazy on a few details.  But here are some highlights, as best I understood them.<span id="more-237"></span></p>
<ul>
<li><strong>Expert System now has 145 	employees.</strong></li>
<li><strong>2 of the employees are in the US</strong> (plus at least one more full-time equivalent on a contract basis). 	<strong>Brooke believes the US operation will eventually be the biggest 	part of the company.</strong></li>
<li><strong>Expert System has sold its market intelligence SaaS offering to two global auto manufacturers. </strong><span>Competitors were Nielsen BuzzMetrics, somebody whose name sounded like “flexilytics” (I presume that would be Lexalytics  <em>Edit:  But see Lexalytics CEO Jeff Catlin&#8217;s comment below</em>), and somebody whose name sounded like “Truecast” (I haven&#8217;t yet guessed who that is).</span></li>
<li><span>If 	I understood correctly, Expert System acquired that product by 	picking up Brooke&#8217;s tiny company <a href="http://www.acuitysoftware.com/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.acuitysoftware.com');">Acuity 	Software</a>.  Acuity was/is a user of Expert System&#8217;s technology, 	having replaced Coveo&#8217;s with it so as to get better semantics.</span></li>
<li><span>Brooke 	is </span><strong>optimistic about Expert System&#8217;s prospects in the 	intelligence market. </strong><span> New 	semantic networks in Arabic and English (joining one Expert System 	already had in Italian) are a big part of the reason.  Brooke says 	the intelligence community is now actively interested in technology 	that&#8217;s been validated by the commercial market, on the theory it&#8217;s 	apt to be more complete than research/government-only products.  	Expert System is also working on a semantic network in another 	undisclosed Middle Eastern language; Brooke stoically refrained from 	confirming the blindingly obvious guess that this would be Farsi.</span></li>
<li><span>Expert 	System&#8217;s third effort in the US market, coming soon, will be a 	semantic ad platform.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span>Once again, however, I made it through an Expert System briefing without gaining a real understanding of its technology.  I gather they&#8217;re proud of their in-memory data structure for their semantic network, but I haven&#8217;t a clue (beyond the obvious guesses) as to what that data structure is.  Similarly, Brooke said that a distinguishing feature of Expert System&#8217;s semantic network is that words have lots of attributes, which are the same thing as categories, and supplied a list of the 11 top-level categories:  <em>Objects, animals, plants, people, concepts, places, time, natural phenomena, state, quantity, group.</em> But it&#8217;s easy to come up with a lot of things that don&#8217;t seem to fit that list very well (especially events, such as numerous different word-senses of “strike”), so absent further elucidation I didn&#8217;t find that particularly instructive either.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/06/11/expert-system-s-p/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The biggest text analytics company you probably never heard of</title>
		<link>http://www.texttechnologies.com/2008/01/31/expert-system-s-p-a/</link>
		<comments>http://www.texttechnologies.com/2008/01/31/expert-system-s-p-a/#comments</comments>
		<pubDate>Thu, 31 Jan 2008 14:05:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Competitive intelligence]]></category>
		<category><![CDATA[Enterprise search]]></category>
		<category><![CDATA[Expert System S.p.A.]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Text mining]]></category>
		<category><![CDATA[Expert System]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2008/01/31/expert-system-s-p-a/</guid>
		<description><![CDATA[I caught up with Expert System S.p.A. last week. They turn out to be doing $10 million in text technology annual revenue. That alone is surprising (sadly), but what&#8217;s really remarkable is that they did it almost entirely in the Italian market. As you might guess, that figure includes a little bit of everything, from [...]]]></description>
			<content:encoded><![CDATA[<p>I caught up with <a href="http://www.expertsystem.net" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.expertsystem.net');">Expert System S.p.A.</a> last week.  They turn out to be doing $10 million in text technology annual revenue.  That alone is surprising (sadly), but what&#8217;s really remarkable is that they did it almost entirely in the Italian market.  As you might guess, that figure includes a little bit of everything, from search engines to Italian language filters for Microsoft Office to text mining.  But only $3 ½ million of Expert System&#8217;s revenue is from the government (and I think that includes civilian agencies), and under 30% is professional services, so on the whole it seems like a pretty real accomplishment.  Oh yes – Expert System says it&#8217;s entirely self-funded.</p>
<p style="margin-bottom: 0in">As of last year, Expert System also has English-language products, and a couple of minor OEM sales in the US (for mobile search and semantic web applications).  German- and Arabic-language products are in beta test.  The company says that its market focus going forward is national security – surely the reason for the Arabic – and competitive intelligence.  It envisions selling through partners such as system integrators, although I think that makes more sense for the government market than it does vis-a-vis civilian companies.  In February the company is introducing a market intelligence product focused on sentiment analysis.</p>
<p style="margin-bottom: 0in">Expert System is a bit of a throwback, in that it talks lovingly of the semantic network that informs its products. <span id="more-175"></span> This semantic net was assembled in the usual way – start with WordNet, add a huge number of proper nouns, license a bunch of domain-specific dictionaries, and handcraft further as individual customers require it. In English the whole thing has 300,000 nodes and 1.2 million relationships.</p>
<p style="margin-bottom: 0in">Expert System insists that there&#8217;s a secret sauce in how the semantic net is organized, to optimize performance. But I haven&#8217;t gotten the slightest hint of what that magic data structure is &#8212;  despite having asked more than once – and so have to reserve judgment on that part.</p>
<p style="margin-bottom: 0in">On the search side, Expert System sounds fairly rich in terms of deciding relevancy, going beyond Term Frequency/Inverse Document Frequency + synonyms. Terms are also assigned importance by their grammatical roles, such as whether they&#8217;re sentence subjects, sentence objects, in paragraph topic sentences, and so on.  Of course, the whole concept of exploiting grammatical structure is a bit old-school.  Specifically, it presupposes you&#8217;re looking at decently grammatical documents in the first place, which can be a dubious assumption in the era of “im in ur cuzztom3r bazz eatin ur r3venue$z.”</p>
<p style="margin-bottom: 0in">The most interesting application Expert System told me about was one for Pirelli Tire, scanning the web for prices Pirelli products were sold at, to detect gray market activity.  Web-crawling for text analytics and price detection are both major activities – e.g., <a href="http://www.texttechnologies.com/2007/12/07/ql2-web-text-extraction-and-more/" >QL2 is active in both</a> – but this is the first I&#8217;ve heard of the areas being so tightly integrated.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/01/31/expert-system-s-p-a/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Clarabridge approach to text mining</title>
		<link>http://www.texttechnologies.com/2007/10/06/the-clarabridge-approach-to-text-mining/</link>
		<comments>http://www.texttechnologies.com/2007/10/06/the-clarabridge-approach-to-text-mining/#comments</comments>
		<pubDate>Sun, 07 Oct 2007 00:14:23 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[BI integration]]></category>
		<category><![CDATA[Clarabridge]]></category>
		<category><![CDATA[Comprehensive or exhaustive extraction]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Text mining]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2007/10/06/the-clarabridge-approach-to-text-mining/</guid>
		<description><![CDATA[And for my sixth text mining post this weekend, here are some highlights of the Clarabridge technology story. (Sorry if it sounds clipped, but I&#8217;m a bit burned out &#8230;) Like Attensity, Clarabridge practices exhaustive extraction.* That is, they do linguistics against documents, extract all sorts of entities and relationships among the entities from each [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in">And for my sixth text mining post this weekend, here are some highlights of the Clarabridge technology story.  (Sorry if it sounds clipped, but I&#8217;m a bit burned out &#8230;)</p>
<ul>
<li>Like Attensity, Clarabridge practices <em>exhaustive extraction.*  </em>That is, they do linguistics against documents, extract all sorts of entities and relationships among the entities from each document, and dump the results into a relational database.</li>
<li>Unlike Attensity, which uses <a href="http://www.texttechnologies.com/2006/06/24/attensity-extractive-exhaustion-and-the-frn/" >a simple normalized relational schema</a>, Clarabridge dumps the extracted data into a star schema.  (The Clarabridge folks are from MicroStrategy, which – surely not coincidentally – also favors star schemas.)<span id="more-132"></span></li>
<li>For now, the linguistic part of the analysis is within a sentence, or else based on proximity, or (this sounded minor) based on the whole document.   But actual <em><a href="http://en.wikipedia.org/wiki/Anaphora_(linguistics)" onclick="javascript:pageTracker._trackPageview('/outbound/article/en.wikipedia.org');">anaphora</a> resolution</em> is coming soon.</li>
<li>The other big thing that goes into Clarabridge&#8217;s star schema is a category hierarchy, which has two aspects.  One is categories fixed in advance.  When I asked how many, CTO Justin Langseth cited an example range of 10-400.  I.e., it varies widely.  In principle, these are established by line-of-business folks at Clarabridge customers, but I&#8217;d venture to guess that professional services play a significant role as well.</li>
<li>The other kind of categories – subcategories to the first group – are created automagically at data load time via document clustering.  Indeed, they&#8217;re called “clusters.” These are available for drilldown via business intelligence tools.</li>
<li>Obviously it is good practice to have dashboards and scheduled reports depend only on the fixed categories, not the clusters.</li>
</ul>
<p><em>*I should note that Clarabridge understandably bristles a bit at my use of this Attensity-introduced term to describe what they do too. If Clarabridge wants to start talking about, say, “comprehensive extraction,” I&#8217;ll consider adopting that term as well. But for now I&#8217;m going with what&#8217;s most widely used.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2007/10/06/the-clarabridge-approach-to-text-mining/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Wise Crowds of Long-Tailed Ants, or something like that</title>
		<link>http://www.texttechnologies.com/2007/04/30/baynote-buzzwords/</link>
		<comments>http://www.texttechnologies.com/2007/04/30/baynote-buzzwords/#comments</comments>
		<pubDate>Tue, 01 May 2007 02:03:16 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Baynote]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Search engine optimization (SEO)]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Social software and online media]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Specialized search]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2007/04/30/baynote-buzzwords/</guid>
		<description><![CDATA[Baynote sells a recommendation engine whose motto appears to be “popularity implies accuracy.” While that leads to some interesting technological ideas (below), Baynote carries that principle to an unfortunate extreme in its marketing, which is jam-packed with inaccurate buzzspeak. While most of that is focused on a few trendy meme-oriented books, the low point of [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal">Baynote sells a recommendation engine whose motto appears to be “popularity implies accuracy.” While that leads to some interesting technological ideas (below), Baynote carries that principle to an unfortunate extreme in its marketing, which is jam-packed with inaccurate buzzspeak. While most of that is focused on a few trendy meme-oriented books, the low point of my briefing today was probably the insistence against pushback that “95%” of Google’s results depend on “PageRank.” (I think what Baynote really meant is “all off-page factors combined,” but anyhow I sure didn’t get the sense that accuracy was an important metric for them in setting their briefing strategy. And by the way, one reason I repeat the company’s name rather than referring to Baynote by a pronoun is that on-page factors DO matter in search engine rankings.)</p>
<p class="MsoNormal">That said, here’s the essence of Baynote’s story, as best I could figure it out. <span id="more-105"></span></p>
<ul style="margin-top: 0in" type="disc">
<li class="MsoNormal">Baynote’s secret sauce is a set of 20+ behavioral metrics to identify whether, if somebody clicks on a page, they are SATISFIED with the content.</li>
<li class="MsoNormal">Based on that, Baynote provides a “content recommendation” engine. (For now, the distinction between “content” and “web page” is not important, but the concepts are in my opinion diverging over time.) This is manifested in two forms (a typical installation uses both). One is just a list of recommendations. The other is in a search engine – “social search” with an “implicit folksonomy” &#8212; and its results pages. Both sit on web pages as boxes/widgets.</li>
<li class="MsoNormal">Baynote’s first markets were online support and eMarketing. The company is now rolling out eCommerce as well. I didn’t get clarity about what was different in the nature of the recommendations, if anything, that underlies any small separation between these apps. (Baynote was clear about saying that the differences were indeed small.)</li>
<li class="MsoNormal">The whole thing is SaaS, built on a LAMP stack. MySQL 4.something seems to suffice, which makes sense given that Baynote’s system is not handling any significant transactions directly. That said, I didn’t push to understand what it means for a search engine to be built on MySQL. This wasn’t the kind of conversation in which one could elicit substantive detail.</li>
<li class="MsoNormal">Baynote claims that a sample size of as few as 7-10 visitors liking a particular piece of content suffices to provide a good basis for predicting who else will like it. I’m not in a position to assess the credibility or, more to the point, limitations of this claim.</li>
<li class="MsoNormal">Baynote has the philosophy that they try to watch a user’s behavior on a site and map that to a “context.” I like that approach.</li>
<li class="MsoNormal">The company cites tested stats of 20% net lift (revenue increase), with 50% of sales being touched by its recommendations. Those numbers don’t sound terribly impressive, perhaps unless they’re truly additive to those provided by, say, Endeca, which is an announced partner.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2007/04/30/baynote-buzzwords/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>So THAT&#8217;S why Andrew Orlowski still has a job (Part 2)</title>
		<link>http://www.texttechnologies.com/2007/03/26/andrew-orlowski-berners-lee-spam-semantic-web/</link>
		<comments>http://www.texttechnologies.com/2007/03/26/andrew-orlowski-berners-lee-spam-semantic-web/#comments</comments>
		<pubDate>Tue, 27 Mar 2007 01:55:32 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Spam and antispam]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2007/03/26/andrew-orlowski-berners-lee-spam-semantic-web/</guid>
		<description><![CDATA[Andrew Orlowski is an over-the-top jerk, and a pretty sloppy reporter and analyst to boot. But he occasionally makes a good point even so. In the most recent instance, he confronted Tim Berners-Lee. As the article makes clear, Berners-Lee reacted badly to Orlowski, reflecting an attitude that is probably shared by 99% of the people [...]]]></description>
			<content:encoded><![CDATA[<p>Andrew Orlowski is an over-the-top jerk, and a pretty <a href="http://www.monashreport.com/2006/03/22/goodmail-esther-dyson-andrew-orlowski-etc/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.monashreport.com');">sloppy reporter and analyst</a> to boot.  But he occasionally <a href="http://www.monashreport.com/2006/07/03/so-thats-why-andrew-orlowski-still-has-a-job/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.monashreport.com');">makes a good point</a> even so.  In the most recent instance, he <a href="http://www.theregister.co.uk/2007/03/23/tim_berners_lee_postal/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.theregister.co.uk');">confronted Tim Berners-Lee</a>.  As the article makes clear, Berners-Lee reacted badly to Orlowski, reflecting an attitude that is probably shared by 99% of the people who encounter the guy, and in the future will probably be adopted by sentient computers as well.  Even so, Orlowski&#8217;s underlying point is valid:  <strong>If the Semantic Web is going to be any more spam-free than the current Web, nobody has adequately explained why.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2007/03/26/andrew-orlowski-berners-lee-spam-semantic-web/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>InQuira’s and Mercado’s approaches to structured search</title>
		<link>http://www.texttechnologies.com/2007/02/15/inquira-mercado-structured-search/</link>
		<comments>http://www.texttechnologies.com/2007/02/15/inquira-mercado-structured-search/#comments</comments>
		<pubDate>Thu, 15 Feb 2007 06:52:28 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[InQuira]]></category>
		<category><![CDATA[Mercado]]></category>
		<category><![CDATA[Natural language processing (NLP)]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Search engines]]></category>
		<category><![CDATA[Structured search]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2007/02/15/inquira-mercado-structured-search/</guid>
		<description><![CDATA[InQuira and Mercado both have broadened their marketing pitches beyond their traditional specialties of structured search for e-commerce. Even so, it’s well worth talking about those search technologies, which offer features and precision that you just don’t get from generic search engines. There’s a lot going on in these rather cool products. In broad outline, [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal"><a href="http://www.inquira.com" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.inquira.com');">InQuira</a> and <a href="http://www.mercado.com" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.mercado.com');">Mercado</a> both have broadened their marketing pitches beyond their traditional specialties of structured search for e-commerce.  Even so, it’s well worth talking about those search technologies, which offer features and precision that you just don’t get from generic search engines.  There’s a lot going on in these rather cool products.</p>
<p class="MsoNormal">In broad outline, Mercado and InQuira each combine three basic search approaches:</p>
<ul>
<li>Generic text indexing.</li>
<li>Augmentation via an ontology.</li>
<li>A rules engine that helps the site owner determine which results and responses are shown under various circumstances.</li>
</ul>
<p class="MsoNormal">Of the two, InQuira seems to have the more sophisticated ontology.  Indeed, the not-wholly-absurd claim is that InQuira does natural-language processing (NLP).  Both vendors incorporate user information in deciding which search results to show, in ways that may be harbingers of what generic search engines like Google and Yahoo will do down the road. <span id="more-83"></span></p>
<p class="MsoNormal">InQuira has all three standard levels of an ontology – generic, vertical, and customer-specific.  They readily admit to being an instantiation of Monash’s Second Law of Commercial Semantics:  <em>Where there’s an ontology, there’s consulting.</em> Indeed, professional services are almost 40% of InQuira’s revenue (which was almost $20 million last year).  Beyond the ontology, they incorporate surfing and profile evidence to disambiguate users’ interests.</p>
<p class="MsoNormal">Let’s pause a moment to reflect on structured search and parts of speech.  Obviously, when somebody’s shopping, it’s very important to interpret nouns.  But adjectives are important too.  If a customer expresses interest in a “gold” car, the website had better tell her which “Metallic Champagne” vehicles are available.  And on a retail site it’s rather important to know the difference between a “dress shirt” and a “shirt dress,” a test About.com’s ad-serving software currently <a href="http://fashion.about.com/cs/glossary/g/bldefshirtdress.htm" onclick="javascript:pageTracker._trackPageview('/outbound/article/fashion.about.com');">fails</a>.</p>
<p class="MsoNormal">The major differentiating feature of InQuira’s NLP/search technology is to take this further, and also think about verbs.  More precisely, the focus is on “intents,” sometimes called “intent categories” instead – i.e., actions the customer is trying to undertake.  These are defined in a kind of rules engine, which is separate from the semantic net used to represent the noun/adjective ontology.</p>
<p class="MsoNormal">Given that they’re defined by rulesets, there are a fair number of these intents.  Back in June, 2005, InQuira told me they had packaged the linguistic knowledge for 100 “intents” for cell phone service companies, and were covering 72-72% of total inquiries that way.  The most popular intents accounted for 10-12% of inquiries each.</p>
<p class="MsoNormal">Mercado doesn’t do “intents,” and I don’t think the ontology is as sophisticated either, but otherwise its search story is a lot like InQuira’s.   Both companies, for example, offer rules-engine capabilities for displaying various page elements – i.e., portlets &#8212; alongside the actual search results (e.g., for upsell).  Both also let web site owners tweak search results, according to what they want to sell or what they think the customer is most likely to buy, or provide some sort of near matches when the exact search isn’t a good enough match to actual inventory.</p>
<p class="MsoNormal">Mercado argues that its rules-based technology is particularly powerful, because of a capability they call RBT, for <em>result(s)-based triggers. </em> The idea is that rules can fire based on any characteristics of the search results themselves.  Particularly important inputs seem to be the size and estimated precision of the raw result set.</p>
<p class="MsoNormal">I’ve hinted above that generic search can be beaten by more specialized technologies.  I absolutely believe that, with one caveat.  Whatever happens under the covers, word-based interaction with computers may well always have a generic interface – search boxes today, voice increasingly in the future.  It’s what happens after the initial disambiguation that will be specialized according to – well, according to both the user’s and the server owner’s <em>intent.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2007/02/15/inquira-mercado-structured-search/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Is DMOZ the cure to Wikipedia&#8217;s spam problem?</title>
		<link>http://www.texttechnologies.com/2007/02/07/dmoz-cure-wikipedia-spam-problem/</link>
		<comments>http://www.texttechnologies.com/2007/02/07/dmoz-cure-wikipedia-spam-problem/#comments</comments>
		<pubDate>Thu, 08 Feb 2007 01:48:07 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Categorization and filtering]]></category>
		<category><![CDATA[Directories]]></category>
		<category><![CDATA[ODP and DMOZ]]></category>
		<category><![CDATA[Ontologies]]></category>
		<category><![CDATA[Spam and antispam]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2007/02/07/dmoz-cure-wikipedia-spam-problem/</guid>
		<description><![CDATA[Joost de Valk makes an interesting suggestion, namely that Wikipedia should drop all external links other than to DMOZ, and rely on DMOZ as the outside link directory. As a division of labor, it makes perfect sense. However, it&#8217;s a total non-starter until at least two problems are solved. First, DMOZ has to be much more [...]]]></description>
			<content:encoded><![CDATA[<p>Joost de Valk makes an interesting suggestion, namely that <a href="http://www.joostdevalk.nl/blog/dmoz-and-wikipedia-how-it-should-work/" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.joostdevalk.nl');">Wikipedia should drop all external links other than to DMOZ</a>, and rely on DMOZ as the outside link directory.  As a division of labor, it makes perfect sense.  However, it&#8217;s a total non-starter until at least two problems are solved.<span id="more-82"></span> First, DMOZ has to be much more current and comprehensive.  I don&#8217;t think that can be done to the level Joost envisions without a multi-tiered site selection system &#8212; part anyone-can-vote social media, with a controlled group of editors able to preempt or override the mass selections.  Reading his post, I gather he recognized that point, or had similar thoughts.</p>
<p>But there&#8217;s a second problem as well &#8212; mapping Wikipedia subjects to DMOZ categories.  How&#8217;s that supposed to work?  For most Wikipedia subjects, there&#8217;s no obvious single match in the DMOZ ontology.  And it&#8217;s more than just a matter of the categories not existing <em>yet;</em> I don&#8217;t think they <em>can</em> exist until the DMOZ hierarchy becomes much more interconnected.</p>
<p>I think it would be great if ODP/DMOZ were enhanced to (A) accommodate public input and (B) have a multifaceted ontology.  But until there&#8217;s a DMOZ 2.0, I don&#8217;t see how Joost&#8217;s idea could work.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/Wikipedia" onclick="javascript:pageTracker._trackPageview('/outbound/article/technorati.com');" rel="tag">Wikipedia</a>, <a href="http://technorati.com/tag/DMOZ" onclick="javascript:pageTracker._trackPageview('/outbound/article/technorati.com');" rel="tag"> DMOZ</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2007/02/07/dmoz-cure-wikipedia-spam-problem/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

