August 26th, 2006 Curt Monash
After even more glitches than usual with their content management system, Computerworld finally posted the second part of my series on enterprise text technology architectures. I already posted the main points of the column here several weeks ago, but of course the column includes further material. In particular, I draw an analogy between text technologies and business intelligence, inspired in part by various direct ties between the two disciplines. Dave Kellogg makes a similar point, focused on general market development.
Just how accurate the analogy turns out to be will depend in large part, I think, on whether search engines (analogous to data warehouses) become the foundation of text-heavy functionality. The jury is still out on that.
Posted in BI integration, Search and text storage | No Comments »
August 26th, 2006 Curt Monash
I talked again with Mark Logic, makers of MarkLogic Server, and they continue to have an interesting story. Basically, their technology is better search/retrieval through XML. The retrieval part is where their major differentiation lies. Accordingly, their initial market focus (they’re up to 46 customers now, including lots of big names) is on custom publishing. And by the way, they’re a good partner for fact-extraction companies, at least in the case of ClearForest.
Here, as best I understand, is the story of the custom publishing business.
Read the rest of this entry »
Posted in ClearForest and Reuters, Mark Logic, Search and text storage, Specialized search engines | 2 Comments »
August 17th, 2006 Curt Monash
I had a call with Business Objects, mainly about their overall EIM/ETL product line (Enterprise Information Management, a superset of Extract/Transform/Load). But I took the opportunity to ask about their deal with Attensity. (Attensity themselves posted more about the relationship, including some detailed links, here.) It actually sounds pretty real. They also mentioned that there seem to be a bunch of startups proposing search as a substitute for data warehousing, much as FAST sometimes likes to.
Read the rest of this entry »
Posted in Attensity, BI integration, Search and text storage, Text mining | 1 Comment »
August 12th, 2006 Curt Monash
I previously noted that Attensity seemed to be putting a lot of emphasis on a partnership with Business Objects and Teradata, although due to vacations I’ve still failed to get anybody from Business Objects to give me their view of the relationship’s importance. Now Greenplum tells me that O’Reilly is using their system to support text mining (apparently via homegrown technology), although I wasn’t too clear on the details. I also got the sense Greenplum is doing more in text mining, but the details of that completely escaped me.
It’s just a couple of data points, but I feel a trend here.
Posted in BI integration, Text mining | 2 Comments »
August 4th, 2006 Curt Monash
I’m a huge fan of the idea that companies should deliberately capture as much information as possible for analysis. In the case of text, since I personally hate structured survey forms, I believe that free-form surveys have the potential to capture a lot more information than traditionally Procrustean abominations do. SPSS indicated that there’s indeed some activity in this regard.
I found another example. Read the rest of this entry »
Posted in ClearForest and Reuters, SPSS, Text mining | No Comments »
August 3rd, 2006 Curt Monash
My August Computerworld column starts where July’s left off, and suggests principles for enterprise text technology architecture. This will not run Monday, August 7, as I was originally led to believe, but rather in my usual second-Monday slot, namely August 14. Thus, I finished it a week earlier than necessary, and I apologize to those of you I inconvenienced with the unnecessary rush to meet that deadline.
The principles I came up with are:
- Deploy search widely across the enterprise.
- It’s OK for your text data to be distributed across a range of silos.
- Integrate fact extraction/text mining aggressively into your predictive analytics and dashboards.
- Having a preferred enterprise text technology tool suite is nice, but accept that there will probably be lots of departmental exceptions.
- Reinvent your customer communication (and other) processes to exploit text technologies.
- Integrate your taxonomies.
I’ll provide a link when the column is actually posted.
Posted in Enterprise search, Ontologies and context identification, Search and text storage, Text mining | 1 Comment »
August 2nd, 2006 Curt Monash
FAST, aka Fast Search & Transfer (www.fastsearch.com), is a pretty interesting and important company. They have 3500 enterprise customers, a rapidly growing $100 million revenue run rate, and a quarter billion dollars in the bank. Their core business is of course enterprise search, where they boast great scalability, based on a Google-like grid architecture, which they fondly think is actually more efficient than Google’s. Beyond that, they’ve verticalized search, exploiting the modularity of their product line to better serve a variety of niche markets. And they’re active in elementary fact/entity extraction as well. Oh yes – they also have forms of guided navigation, taxonomy-awareness, and probably everything else one might think of as a checkmark item for a search or search-like product.
Read the rest of this entry »
Posted in Enterprise search, FAST, Google, Search and text storage | 1 Comment »
August 2nd, 2006 Curt Monash
I’ve had a couple of good talks with Andrew McKay of FAST recently. When discussing FAST’s scalability, he likes to use the word “petabytes.” I haven’t probed yet as to exactly which corpus(es) he’s referring to, but here’s a thought for comparison:
Google, if I recall correctly, caches a little over 100 KB/page (assuming, of course, that the page has at least that much text, which is not necessarily the case at all). And they were up into Carl Sagan range – i.e., “billions and billions” – before they stopped giving counts of how many pages they’d indexed.
10 billion times 100 KB is, indeed, a petabyte (10^10 pages times 10^5 bytes is 10^15 bytes). So, in the roughest of approximations, the Web is a petabyte-range corpus.
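For anyone who wants to rerun that back-of-envelope arithmetic with different guesses, here is a minimal sketch. It assumes decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes), and the function and constant names are purely illustrative.

```python
# Back-of-envelope check of the corpus-size estimate above.
# Assumption: decimal units (1 KB = 10**3 bytes, 1 PB = 10**15 bytes).
PAGES = 10 * 10**9             # "10 billion" pages, the round figure used in the post
BYTES_PER_PAGE = 100 * 10**3   # roughly 100 KB cached per page, as recalled in the post
BYTES_PER_PETABYTE = 10**15


def corpus_petabytes(pages: int, bytes_per_page: int) -> float:
    """Return the total corpus size in petabytes for the given page count and page size."""
    return pages * bytes_per_page / BYTES_PER_PETABYTE


print(corpus_petabytes(PAGES, BYTES_PER_PAGE))  # 1.0
```

Halving the per-page figure or doubling the page count moves the answer proportionally, so the "petabyte-range" conclusion holds as long as the inputs are within an order of magnitude or so of the guesses above.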
EDIT: Hah. I bet eBay and its 2-petabyte database is one of the examples Andrew is referring to …
Posted in FAST, Search and text storage | 4 Comments »