<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Text Technologies &#187; MuseGlobal</title>
	<atom:link href="http://www.texttechnologies.com/category/vendors/museglobal/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.texttechnologies.com</link>
	<description>Understanding technology ... in both senses of the phrase</description>
	<lastBuildDate>Wed, 18 Jan 2012 17:02:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>MuseGlobal – ETL for text, sort of</title>
		<link>http://www.texttechnologies.com/2008/03/15/museglobal/</link>
		<comments>http://www.texttechnologies.com/2008/03/15/museglobal/#comments</comments>
		<pubDate>Sat, 15 Mar 2008 11:16:16 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[MuseGlobal]]></category>

		<guid isPermaLink="false">http://www.texttechnologies.com/2008/03/15/museglobal/</guid>
		<description><![CDATA[Lynda Moulton introduced me to MuseGlobal, and specifically CEO Kate Noerr, last month. MuseGlobal sort of does ETL (Extract/Transform/Load) for text, although they prefer to call it Gather/Transform/Deliver. In any case, each of the three parts of the process is rather different for text than it is for traditional data warehousing. To wit: Gathering happens [...]]]></description>
			<content:encoded><![CDATA[<p>Lynda Moulton introduced me to MuseGlobal, and specifically CEO Kate Noerr, last month. MuseGlobal sort of does ETL (Extract/Transform/Load) for text, although they prefer to call it Gather/Transform/Deliver. In any case, each of the three parts of the process is rather different for text than it is for traditional data warehousing. To wit: <span id="more-202"></span></p>
<ul>
<li>
<p style="margin-bottom: 0in"><em>Gathering</em> happens from a variety of repositories, which often are individually unstructured. Different repositories may use different file formats, manage different kinds of metadata, and require different kinds of authentication to access. A significant part of MuseGlobal&#8217;s value-add seems to be knowing how to get at data from hundreds or thousands of third-party sources, from both a technology and a licensing standpoint.</p>
</li>
<li>
<p style="margin-bottom: 0in"><em>Transformation</em> amounts to extracting document attributes and subject matter, and slathering on a whole lot of XML tags accordingly. More precisely, documents are mapped into a grand cosmic XML schema with 2200 or so nodes, and from there the desired output (usually XML) is produced. MuseGlobal has built its own extraction technology, but will work with Inxight or Temis for finer-level extraction as needed. They do some sentiment analysis, but that doesn&#8217;t seem to be a strong point. Extracting dates seems to be a strength of theirs, as is recognizing duplicate documents, both of which would be important in news-oriented applications.</p>
</li>
<li>
<p style="margin-bottom: 0in"><em>Delivery</em> is, in database terms, record-at-a-time, not set-at-a-time. (Here I&#8217;m basically equating “document” and “record”.) And there are relatively few records, each of which typically undergoes a whole lot of transformation. So most ETL issues of latency and throughput don&#8217;t carry over to MuseGlobal&#8217;s use cases. Even the batch/real-time distinction is somewhat moot.</p>
</li>
</ul>
<p style="margin-bottom: 0in">The original market for MuseGlobal&#8217;s technology was scientific and professional publishers, but they&#8217;ve gotten into more general news and blog document handling as well. They also sell to enterprises. Mark Logic and Endeca are among their partners; in each case, MuseGlobal does the preprocessing. Indeed, MuseGlobal usually sells on an OEM basis, with exceptions coming most commonly when they want to establish references in a new market segment.</p>
<p style="margin-bottom: 0in">As to MuseGlobal&#8217;s own dates and other facts: The company was founded in 2001, but technology development had been going on for 2-3 years prior.  Headcount is around 70, and growing by 20% or so this year.  The company is headquartered in San Francisco, with support in Salt Lake City. It has been self-funded all the way.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.texttechnologies.com/2008/03/15/museglobal/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

