Discussion of how text analytics technologies are used for custom publishing and similar applications.
Here’s something longer-lasting and weirder than Vertica’s “We sell turkeys” theme: Mark Logic, whose product is used primarily to help enterprises make their content more accessible, doesn’t have a search engine on its own website.
As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are off-shoots of what used to be called “information retrieval”:
1. Web search
2. Public-facing site search
3. Enterprise search and knowledge management
4. Custom publishing
5. Text mining and extraction
Three are more standalone:
6. Spam filtering
7. Voice recognition
8. Machine translation
Mark Logic basically makes an XML DBMS – confusingly called Marklogic without a space – optimized for document processing (including text search). Mark Logic’s main market is custom publishing – assembling documents on the fly, whether based on search or some other starting point.
Airlines put Marklogic to an interesting use: They create “electronic flight bags.” Apparently, flight crews typically carry a whole satchel of documents (flight bags) onto a plane, the precise contents of which frequently vary. Marklogic lets these be automatically generated in electronic form.
Well, in recent news it turns out that a $1.4 billion B-1 bomber crashed because a known prudent take-off/maintenance procedure hadn’t been followed. (Something about heating the components to evaporate water that otherwise destroyed the electronics.) This plane-saving procedure had been discovered, but not propagated to all bases and maintenance crews responsible for the B-1. You think something like Marklogic might have helped?
I’m putting up two posts this morning on Mark Logic and its MarkLogic product family. The main one, over on DBMS2, outlines the technical architecture — focusing on MarkLogic as an XML database management system — and provides a bit of overall context. This post attempts to position MarkLogic against alternative kinds of text analytics engine.
For the most part, MarkLogic is indeed sold (and bought) for the storage, manipulation, and retrieval of text. (One long-confidential exception to this rule is scheduled to be unveiled at the June user conference.) Most applications seem to fit a custom publishing/enhanced search paradigm:
- Serve it up in chunks, typically via a sophisticated search interface.
Differences vs. conventional search engines include:
- Documents are indexed on the fly, and available for query immediately upon ingestion.
- MarkLogic is a real, ACID-compliant DBMS. So everything else – such as a user tag or comment – is also available for immediate query. Mark Logic says customers are making a lot of use of this feature.
- MarkLogic has a real programming language – specifically XQuery. (Note: XQuery is a much fuller language than, say, standard SQL, with conditional logic, arithmetic, try/catch, and so on.)
- MarkLogic handles fielded information, document chunks, and whole documents in a completely integrated fashion. Truth be told, I don’t know exactly to what extent Autonomy or FAST do or don’t fall short of this standard, but it’s never seemed to be as much of a priority on their part as I’ve felt it should be.
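To make the "available for query immediately upon ingestion" point concrete, here is a toy Python sketch of an inverted index that is updated as part of ingestion rather than by a later batch job. It is only an illustration of the general idea; the `LiveIndex` class, its method names, and the sample documents are all hypothetical and say nothing about how MarkLogic is actually built.

```python
from collections import defaultdict


class LiveIndex:
    """Toy inverted index updated synchronously at ingest time (hypothetical)."""

    def __init__(self):
        self.docs = {}                    # doc_id -> original text
        self.postings = defaultdict(set)  # term -> set of doc_ids

    def ingest(self, doc_id, text):
        # Store the document and index its terms in the same step,
        # so there is no window in which it is stored but not searchable.
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        # Return matching doc_ids in a stable (sorted) order.
        return sorted(self.postings.get(term.lower(), set()))


index = LiveIndex()
index.ingest("manual-1", "Heat the components before take-off")
hits_before = index.search("comment")

# A user comment is just another ingested item, searchable right away.
index.ingest("comment-1", "User comment on the take-off procedure")
hits_after = index.search("comment")
takeoff_hits = index.search("take-off")
```

The user comment shows up in search results as soon as `ingest` returns, which is the behavior the first two bullets describe. A real DBMS would of course add transactions, durability, and far smarter tokenization.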
Mark Logic also claims huge advantages in corpus administration. Scalability seems good too; there’s a national-intelligence customer with a 200 terabyte database. And they’re proud of a feature called lexicons, although it seems so obvious to me that I’ve so far failed to muster what they’d probably regard as the proper level of excitement about it. (In SQL terms, it seems to be a combination of SELECT and COUNT DISTINCT, both of which are capabilities I’d think would be in XQuery anyway.)
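For what it's worth, here is how I read the lexicon idea, sketched in Python under my own assumptions: a list of the distinct values that occur in some field, each paired with a count of the documents containing it. The `build_lexicon` helper and the sample documents are made up for illustration; this is not Mark Logic's API or implementation.

```python
from collections import Counter

# Hypothetical sample corpus; field names and values are invented.
documents = [
    {"title": "Flight bag A", "tags": ["safety", "maintenance"]},
    {"title": "Flight bag B", "tags": ["safety"]},
    {"title": "Bulletin C", "tags": ["maintenance", "electronics"]},
]


def build_lexicon(docs, field):
    """Map each distinct value of `field` to the number of documents containing it."""
    counts = Counter()
    for doc in docs:
        # set() ensures a value repeated within one document is counted once.
        counts.update(set(doc.get(field, [])))
    return counts


tag_lexicon = build_lexicon(documents, "tags")
distinct_tags = sorted(tag_lexicon)  # the "distinct values" half of the analogy
```

Here `distinct_tags` plays the role of the distinct-value query and `tag_lexicon` supplies the counts. Presumably the interesting part in a real system is that these answers are precomputed and kept current, rather than derived by scanning the corpus at query time.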
CEO Eric Bregand of Temis recently checked in by email with an update on text mining market activity. Highlights of Eric’s views include:
- Yep, Voice Of The Customer is hot, in “many markets”; Eric specifically mentioned banking, car, energy, food, and retail. He further sees IBM backing VotC as text’s “killer app.” (Note: Temis has a history of partnering with IBM, most notably via its unusually strong commitment to UIMA.)
- Specifically, THE hot topics in the European market these days are competitive intelligence and sentiment analysis. (Note: I’ve always thought Temis got serious about competitive analysis a little earlier than most other text mining vendors did.)
- Life sciences is an ever growing focus for Temis.
- I confused him a bit with how I phrased my question about custom publishing and Temis’ Mark Logic partnership. But he did express favorable views of the market, specifically in the area of integrating text mining and native XML database management, and even volunteered that nStein appears to be doing well.
It was tough to judge user demand at the recent Text Analytics Summit because, well, very few users showed up. And frankly, I wasn’t as aggressive at pumping vendors for trends as I am some other times. That said, I have talked with most text analytics vendors recently,* and here are my impressions of what’s going on. Any contrary – or confirming! — opinions would be most welcome.
*Factiva is the most significant exception. Hint, hint.
If you think about it, text analytics is a “secret ingredient” in search, antispam, and data cleaning,* and this dominates all other uses of the technology. A significant minority of the research effort at companies that do any kind of text filtering is – duh — text analytics. Cold comfort for specialist text analytics vendors, to be sure, but that’s the way it is.
*I.e., part of the “T” in “ETL” (Extract/Transform/Load).
Text-analytics-enhanced custom publishing will surely at some point become a must-have for business and technical publishers. However, it appears that we’re not quite there yet, as large publishers make do with simple-minded search and the like. In what I suspect is a telling market commentary, there’s no headlong rush among vendors to dump text mining for custom publishing, notwithstanding the examples of nStein and (sort of) ClearForest. I don’t want to be overly negative – either my friends at Mark Logic are doing just fine or else they’re putting up a mighty brave front – but I don’t think the nonspecialist publishing market is there yet.
So far as I can tell, Attensity’s strategy when the company was originally founded was rather like ClearForest’s strategy today – and vice-versa. That said, here’s where they seem to stand at this time:
- Attensity wants to make text analytics very easy to integrate into business intelligence and data mining – at the moment, they’re not too focused on the differences between those two disciplines – and is trying to deliver the best possible fact extraction consistent with that charter.
- ClearForest wants to provide really great information extraction — to the limits of what can be done without excessive knowledge engineering – and is trying to integrate as well as possible with other technologies, the better to serve the customers who need what they offer.
I talked again with Mark Logic, makers of MarkLogic Server, and they continue to have an interesting story. Basically, their technology is better search/retrieval through XML. The retrieval part is where their major differentiation lies. Accordingly, their initial market focus (they’re up to 46 customers now, including lots of big names) is on custom publishing. And by the way, they’re a good partner for fact-extraction companies, at least in the case of ClearForest.
Here, as best I understand, is the story of the custom publishing business.