Analysis of ontologies, of the role they play in text analytics, and of technology and techniques to build and manage them.
Content management (CMS) and search expert Alan Pelz-Sharpe recently decried “Shadow IT”, by which he seems to mean the departmental proliferation of data stores outside the IT department’s control. In other words, he’s talking about data marts, only for documents rather than tabular data.
Notwithstanding the manifest virtues of centralization, there are numerous reasons you might want data marts, in the tabular and document worlds alike. For example:
- Price/performance. Your main/central data manager might be too expensive to support additional large specialized databases. Or different databases and applications might have sufficiently different profiles that they get the best price/performance from different kinds of data managers. This is particularly prevalent in the relational world, where column stores, sequentially-oriented row stores, and random-I/O-oriented row stores each have compelling use cases.
- Different SLAs (Service-Level Agreements). Similarly, different applications may have very different requirements for uptime, response time, and the like. (In the relational world, think of operational data stores.)
- Different security requirements. Different subsets of the data may need different levels of security. This is particularly prevalent in the document world, where security problems are not as well-solved as in the tabular arena, and where it’s common for a search engine to index across different corpuses with radically different levels of sensitivity.
- Integrated application and user interfaces. In the relational world, there’s a pretty clean separation between data management and interface logic; most serious business intelligence tools can talk to most DBMSs. The document world is quite different. Some search engines bundle, for example, various kinds of faceted or parameterized search interfaces. What’s more, in public-facing search, a major differentiator is the facilities a product offers for skewing search results.
- Different thesauruses or taxonomy management systems. Different text applications require different ones. Ideally, those should all be integrated, but the requisite technology still doesn’t exist.
Bottom line: Text data marts, much like relational data marts, are almost surely here to stay.
|Categories: Enterprise search, Ontologies, Search engines, Specialized search, Structured search||2 Comments|
At Lynda Moulton’s behest, I spoke a couple of times recently on the subject of where “semantic” technology is or isn’t likely to be important. One was at the Gilbane conference in early December. The slides were based on my previously posted deck for a June talk I gave on a text analytics market overview. The actual Gilbane slides may be found here.
My opinions about the applicability of semantic technology include:
- The big bucks in web search are for “transactional” web search, and semantics isn’t the issue there. (Slides 3-4)
- When UIs finally go beyond the simple search box — e.g. to clusters/facets or to voice — semantics should have a role to play. (Slide 5)
- Public-facing site search depends — more than any other area of text analytics — on hand-tagging. (Slide 7)
- “Enterprise” search that searches specialized external databases could benefit from semantic technologies. (Slide 8)
- True enterprise search could benefit from semantic technologies in multiple ways, but has other problems as well. (Slides 10-11)
- Semantics — specifically extraction — is central to custom publishing. (Slide 12 — upon review I regret using the word “sophisticated”)
- Semantics is central to text mining. (Slide 18)
- Semantics could play a big role in all sorts of exciting future developments. (Slide 19)
So what would your list be like?
|Categories: Enterprise search, Ontologies, Search engines, Specialized search, Structured search||5 Comments|
I just stumbled across a brilliant summary of evolution in text search technology, written four years ago. It’s equally valid today (which in itself says something). I found it on the Prism Legal blog, but the actual author is Sharon Flank. My own comments are interspersed in bold. Read more
I chatted with Brooke Aker, the new CEO of Expert System’s US subsidiary, for quite a while last week. Unfortunately, we had some cell phone problems, and email followup hasn’t gone well, so I’m hazy on a few details. But here are some highlights, as best I understood them. Read more
|Categories: Application areas, Competitive intelligence, Coveo, Expert System S.p.A., Ontologies, Text mining||2 Comments|
I caught up with Expert System S.p.A. last week. They turn out to be doing $10 million in annual text technology revenue. That alone is surprising (sadly), but what’s really remarkable is that they did it almost entirely in the Italian market. As you might guess, that figure includes a little bit of everything, from search engines to Italian language filters for Microsoft Office to text mining. But only $3.5 million of Expert System’s revenue is from the government (and I think that includes civilian agencies), and under 30% is professional services, so on the whole it seems like a pretty real accomplishment. Oh yes – Expert System says it’s entirely self-funded.
As of last year, Expert System also has English-language products, and a couple of minor OEM sales in the US (for mobile search and semantic web applications). German- and Arabic-language products are in beta test. The company says that its market focus going forward is national security – surely the reason for the Arabic – and competitive intelligence. It envisions selling through partners such as system integrators, although I think that makes more sense for the government market than it does vis-a-vis civilian companies. In February the company is introducing a market intelligence product focused on sentiment analysis.
Expert System is a bit of a throwback, in that it talks lovingly of the semantic network that informs its products. Read more
|Categories: Application areas, Competitive intelligence, Enterprise search, Expert System S.p.A., Ontologies, Search engines, Text mining||Leave a Comment|
And for my sixth text mining post this weekend, here are some highlights of the Clarabridge technology story. (Sorry if it sounds clipped, but I’m a bit burned out …)
- Like Attensity, Clarabridge practices exhaustive extraction.* That is, they do linguistics against documents, extract all sorts of entities and relationships among the entities from each document, and dump the results into a relational database.
- Unlike Attensity, which uses a simple normalized relational schema, Clarabridge dumps the extracted data into a star schema. (The Clarabridge folks are from Microstrategy, which – surely not coincidentally – also favors star schemas.) Read more
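To make the schema contrast concrete, here is a minimal sketch of what a star schema for exhaustively extracted facts might look like. All table and column names are my own invention for illustration, not Clarabridge’s actual schema: the idea is simply a central fact table of extracted (subject, relation, object) triples, joined out to dimension tables.

```python
import sqlite3

# Illustrative star schema for extracted entities and relationships.
# Names are hypothetical; the pattern, not the schema, is the point.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_document (doc_id INTEGER PRIMARY KEY, source TEXT, published DATE);
CREATE TABLE dim_entity   (entity_id INTEGER PRIMARY KEY, name TEXT, entity_type TEXT);
CREATE TABLE dim_relation (rel_id INTEGER PRIMARY KEY, rel_type TEXT);

-- Central fact table: one row per extracted (subject, relation, object) triple.
CREATE TABLE fact_extraction (
    doc_id     INTEGER REFERENCES dim_document,
    subject_id INTEGER REFERENCES dim_entity,
    rel_id     INTEGER REFERENCES dim_relation,
    object_id  INTEGER REFERENCES dim_entity,
    sentiment  REAL
);
""")

cur.execute("INSERT INTO dim_document VALUES (1, 'press release', '2007-11-01')")
cur.executemany("INSERT INTO dim_entity VALUES (?, ?, ?)",
                [(1, 'Acme Corp', 'COMPANY'), (2, 'Widget 2000', 'PRODUCT')])
cur.execute("INSERT INTO dim_relation VALUES (1, 'LAUNCHES')")
cur.execute("INSERT INTO fact_extraction VALUES (1, 1, 1, 2, 0.8)")

# A BI tool slices the fact table by joining out to the dimensions.
row = cur.execute("""
    SELECT s.name, r.rel_type, o.name
    FROM fact_extraction f
    JOIN dim_entity   s ON f.subject_id = s.entity_id
    JOIN dim_relation r ON f.rel_id     = r.rel_id
    JOIN dim_entity   o ON f.object_id  = o.entity_id
""").fetchone()
print(row)  # ('Acme Corp', 'LAUNCHES', 'Widget 2000')
```

The appeal of this layout for a Microstrategy-bred team is obvious: any standard OLAP tool can aggregate over the fact table without knowing anything about text.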
|Categories: BI integration, Clarabridge, Comprehensive or exhaustive extraction, Ontologies, Text mining||2 Comments|
Baynote sells a recommendation engine whose motto appears to be “popularity implies accuracy.” While that leads to some interesting technological ideas (below), Baynote carries that principle to an unfortunate extreme in its marketing, which is jam-packed with inaccurate buzzspeak. While most of that is focused on a few trendy meme-oriented books, the low point of my briefing today was probably the insistence, against pushback, that “95%” of Google’s results depend on “PageRank.” (I think what Baynote really meant is “all off-page factors combined,” but anyhow I sure didn’t get the sense that accuracy was an important metric for them in setting their briefing strategy. And by the way, one reason I repeat the company’s name rather than referring to Baynote by a pronoun is that on-page factors DO matter in search engine rankings.)
That said, here’s the essence of Baynote’s story, as best I could figure it out. Read more
|Categories: Baynote, Google, Ontologies, Search engine optimization (SEO), Search engines, Social software and online media, Software as a Service (SaaS), Specialized search||4 Comments|
Andrew Orlowski is an over-the-top jerk, and a pretty sloppy reporter and analyst to boot. But he occasionally makes a good point even so. In the most recent instance, he confronted Tim Berners-Lee. As the article makes clear, Berners-Lee reacted badly to Orlowski, reflecting an attitude that is probably shared by 99% of the people who encounter the guy, and in the future will probably be adopted by sentient computers as well. Even so, Orlowski’s underlying point is valid: If the Semantic Web is going to be any more spam-free than the current Web, nobody has adequately explained why.
InQuira and Mercado both have broadened their marketing pitches beyond their traditional specialties of structured search for e-commerce. Even so, it’s well worth talking about those search technologies, which offer features and precision that you just don’t get from generic search engines. There’s a lot going on in these rather cool products.
In broad outline, Mercado and InQuira each combine three basic search approaches:
- Generic text indexing.
- Augmentation via an ontology.
- A rules engine that helps the site owner determine which results and responses are shown under various circumstances.
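A toy sketch of how those three layers might compose follows. Everything here is invented for illustration; it is not InQuira’s or Mercado’s actual machinery, just the shape of the pipeline: expand the query through an ontology, run an ordinary text match, then let owner-defined rules reorder what’s shown.

```python
# Hypothetical three-layer structured search: ontology expansion,
# generic text matching, then a merchandising-style rules engine.
SYNONYMS = {"laptop": {"notebook", "portable computer"}}   # ontology fragment

DOCS = {
    1: "notebook with 15-inch screen",
    2: "desktop tower, quiet fans",
    3: "notebook sleeve and accessories",
}

def expand(query):
    """Ontology augmentation: add synonyms to the raw query terms."""
    terms = set(query.lower().split())
    for t in list(terms):
        terms |= SYNONYMS.get(t, set())
    return terms

def text_search(terms):
    """Generic text indexing, reduced here to naive substring matching."""
    return [doc_id for doc_id, text in DOCS.items()
            if any(t in text for t in terms)]

def apply_rules(query, hits):
    """Rules engine: the site owner pins or boosts results per query."""
    rules = {"laptop": {"pin_first": 1}}    # e.g. promote a featured product
    rule = rules.get(query.lower())
    if rule and rule["pin_first"] in hits:
        hits.remove(rule["pin_first"])
        hits.insert(0, rule["pin_first"])
    return hits

results = apply_rules("laptop", text_search(expand("laptop")))
print(results)  # [1, 3] -- only the "notebook" documents, featured one first
```

Note that the bare query “laptop” matches nothing in the toy corpus; it is the ontology step that rescues recall, and the rules step that gives the site owner the last word on ordering.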
Of the two, InQuira seems to have the more sophisticated ontology. Indeed, the not-wholly-absurd claim is that InQuira does natural-language processing (NLP). Both vendors incorporate user information in deciding which search results to show, in ways that may be harbingers of what generic search engines like Google and Yahoo will do down the road. Read more
|Categories: InQuira, Mercado, Natural language processing (NLP), Ontologies, Search engines, Structured search||2 Comments|
Joost de Valk makes an interesting suggestion, namely that Wikipedia should drop all external links other than to DMOZ, and rely on DMOZ as the outside link directory. As division of labor, it makes perfect sense. However, it’s a total non-starter until at least two problems are solved. Read more
|Categories: Categorization and filtering, Directories, ODP and DMOZ, Ontologies, Spam and antispam||5 Comments|