Categorization and filtering

Analysis of technologies that focus on the categorization and filtering of documents and text. Related subjects include:

Any subcategory
Text mining
Search engines

September 20, 2009

Data marts in the world of text

CMS/search (Content Management System) expert Alan Pelz-Sharpe recently decried “Shadow IT”, by which he seems to mean departmental proliferation of data stores outside the control of the IT department. In other words, he’s talking about data marts, only for documents rather than tabular data.

Notwithstanding the manifest virtues of centralization, there are numerous reasons you might want data marts, in the tabular and document worlds alike. For example:

Price/performance. Your main/central data manager might be too expensive to support additional large specialized databases. Or different databases and applications might have sufficiently different profiles that they get the best price/performance from different kinds of data managers. This is particularly prevalent in the relational world, where column stores, sequentially-oriented row stores, and random I/O-oriented row stores each have compelling use cases.
Different SLAs (Service-Level Agreements). Similarly, different applications may have very different requirements for uptime, response time, and the like. (In the relational world, think of operational data stores.)
Different security requirements. Different subsets of the data may need different levels of security. This is particularly prevalent in the document world, where security problems are not as well-solved as in the tabular arena, and where it’s common for a search engine to index across different corpuses with radically different levels of sensitivity.
Integrated application and user interfaces. In the relational world, there’s a pretty clean separation between data management and interface logic; most serious business intelligence tools can talk to most DBMS. The document world is quite different. Some search engines bundle, for example, various kinds of faceted or parameterized search interfaces. What’s more, in public-facing search, a major differentiator is the facilities that the product offers for skewing search results.
Different text applications require different thesauruses or taxonomy management systems. Ideally, those should all be integrated — but the requisite technology still doesn’t exist.

Bottom line: Text data marts, much like relational data marts, are almost surely here to stay.

Related link

The future of data marts

Categories: Enterprise search, Ontologies, Search engines, Specialized search, Structured search

March 7, 2009

Yet more NoFollow whining

Andy Beal has a blog post up to the effect that NoFollow is a bad thing. (Edit: Andy points out in the comment thread that his opposition to NoFollow isn’t as absolute as I was suggesting.) Other SEO types are promoting this as if it were some kind of important cause. I think that’s nuts, and NoFollow is a huge spam-reducer.

The weakness of Andy’s argument is illustrated by the one and only scenario he posits in support of his crusade:

The result is that a blog post added to a brand new site may well have just broken the story about the capture of Bin Laden (we wish!)–and a link to said post may have been Tweeted and re-tweeted–but Google won’t discover or index that post until it finds a “followed” link. Likely from a trusted site in Google’s index and likely hours, if not days, after it was first shared on Twitter.

Helloooo — if I post something here, it is indexed at least in Google blog search immediately. (As in, within a minute or so.) Ping, crawl, pop — there it is. The only remotely valid version of Andy’s complaint is that it might take some hours for Google’s main index to update — but even there, there’s a News listing at the top. This simply is not a problem.

Now, I think it would be personally great for me if all the links to my sites from Wikipedia and Twitter and the comment threads of major blogs pointed back with “link juice.” On the other hand, even with NoFollow out there, my sites come up high in Google’s rankings for all sorts of keywords, driving a lot of their readership. I imagine the same is true for most other sites containing fairly unique content that people find interesting enough to link to.

So other than making it harder to engage in deceptive SEO, I fail to see what problems NoFollow is causing.
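
For readers unfamiliar with the mechanism: NoFollow is a rel="nofollow" token on a link, which asks search engines not to pass ranking credit through that link. The sketch below is illustrative only. It uses the Python standard library, the class and function names and the example.com URLs are made up for this example, and it makes no claim about how Google actually processes links. It simply separates followed links from NoFollowed ones in a fragment of HTML.

```python
from html.parser import HTMLParser


class LinkClassifier(HTMLParser):
    """Collects hrefs, split by whether the anchor carries rel="nofollow".

    Hypothetical helper for illustration; not any search engine's logic.
    """

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel is a space-separated list of tokens; compare case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def classify_links(html):
    """Return (followed, nofollowed) lists of hrefs found in the HTML."""
    parser = LinkClassifier()
    parser.feed(html)
    return parser.followed, parser.nofollowed


# Made-up sample: one ordinary link and one comment-thread style link.
sample = (
    '<p><a href="https://example.com/post">editorial link</a> '
    '<a rel="external nofollow" href="https://example.com/comment">'
    "comment link</a></p>"
)
followed, nofollowed = classify_links(sample)
```

Only the first link lands in the followed list; the second is the kind of link that comment threads, Wikipedia, and Twitter typically emit.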

Categories: Google, Online marketing, Search engine optimization (SEO), Search engines, Spam and antispam

December 29, 2008

Where “semantic” technology is or isn’t important

At Lynda Moulton’s behest, I spoke a couple of times recently on the subject of where “semantic” technology is or isn’t likely to be important. One was at the Gilbane conference in early December. The slides were based on my previously posted deck for a June talk I gave on a text analytics market overview. The actual Gilbane slides may be found here.

My opinions about the applicability of semantic technology include:

The big bucks in web search are for “transactional” web search, and semantics isn’t the issue there. (Slides 3-4)
When UIs finally go beyond the simple search box — e.g. to clusters/facets or to voice — semantics should have a role to play. (Slide 5)
Public-facing site search depends — more than any other area of text analytics — on hand-tagging. (Slide 7)
“Enterprise” search that searches specialized external databases could benefit from semantic technologies. (Slide 8)
True enterprise search could benefit from semantic technologies in multiple ways, but has other problems as well. (Slides 10-11)
Semantics — specifically extraction — is central to custom publishing. (Slide 12 — upon review I regret using the word “sophisticated”)
Semantics is central to text mining. (Slide 18)
Semantics could play a big role in all sorts of exciting future developments. (Slide 19)

So what would your list be like?

Categories: Enterprise search, Ontologies, Search engines, Specialized search, Structured search

November 12, 2008

Are denial-of-insight attacks a threat to search logs and/or VOTC/VOTM apps?

TechTaxi points out that it’s at least theoretically possible to, by polluting the Web, pollute somebody’s web-wide information gathering. (Hat tip to Daniel Tunkelang.) They further assert this is a relatively near-term threat.

The theory can’t be denied. What’s more, bad actors have other motives to pollute the Web. For example, if they plant favorable automated comments about their own products or unfavorable ones about the competition’s, Voice of the Customer/Market applications will naturally be confused. And if automated reputation-checkers get more prominent, there will be a major incentive to game them, just as there has been for Google’s PageRank. So VOTC/VOTM market research tools could be polluted as a side effect.

Similarly, if somebody wants to test your e-commerce site by throwing a ton of searches at it, your search logs will lose value.
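
To make the search-log point concrete, here is a minimal sketch of one possible mitigation: drop any client that issued an implausible number of searches before counting popular queries. Everything in it is an assumption for illustration, including the log format (pairs of client ID and query string), the function name, and the threshold of 100 searches per client. A real log would need time windows and better client identification.

```python
from collections import Counter


def top_queries(log, max_per_client=100, n=10):
    """Return (top n queries, set of excluded clients).

    log is an iterable of (client_id, query) pairs, a hypothetical format.
    Clients with more than max_per_client searches are treated as noise.
    """
    log = list(log)
    per_client = Counter(client for client, _ in log)
    noisy = {c for c, count in per_client.items() if count > max_per_client}
    # Normalize case and surrounding whitespace so trivially different
    # spellings of the same query are counted together.
    kept = Counter(
        query.strip().lower() for client, query in log if client not in noisy
    )
    return kept.most_common(n), noisy


# Made-up log: three ordinary searches plus 500 from one automated client.
sample_log = [
    ("u1", "Netezza"),
    ("u2", "netezza "),
    ("u3", "attivio"),
] + [("bot", "junk")] * 500

top, excluded = top_queries(sample_log, max_per_client=100, n=2)
```

Without the filter, "junk" would dominate the counts, which is exactly the loss of log value described above.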

But disinformation of competitors for the sake of disinformation? Or, as the article suggests, vandalism/extortion? Off the top of my head, I’m not thinking of a serious near-term threat scenario.

Categories: Competitive intelligence, Search engines, Spam and antispam, Voice of the Customer

July 11, 2008

The phrase “business intelligence” was COINED for text analytics

Late last year, there was a little flap about who invented the phrase business intelligence. Credit turns out to go to an IBM researcher named H. P. Luhn, as per this 1958 paper. Well, I finally took a look at the paper, after Jeff Jones of IBM sent over another copy. And guess what? It’s all about text analytics. Specifically, it’s about what we might now call a combination of classification and knowledge management.

Half a century later, the industry is finally poised to deliver on that vision.

Categories: BI integration, Categorization and filtering, IBM and UIMA

June 19, 2008

3 specialized markets for text analytics

In the previous post, I offered a list of eight linguistics-based market segments, and a slide deck surveying them. And I promised a series of follow-up posts based on the slides. Read more

Categories: Language recognition, Natural language processing (NLP), Spam and antispam, Speech recognition

June 19, 2008

The Text Analytics Marketplace: Competitive landscape and trends

As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are off-shoots of what used to be called “information retrieval”:

1. Web search

2. Public-facing site search

3. Enterprise search and knowledge management

4. Custom publishing

5. Text mining and extraction

Three are more standalone:

6. Spam filtering

7. Voice recognition

8. Machine translation

Categories: Audio and video search, BI integration, Custom publishing, Enterprise search, Google, Natural language processing (NLP), Nuance, Progress and EasyAsk, Search engines, Social software and online media, Spam and antispam, Speech recognition, Structured search, Text Analytics Summit, Text mining

June 15, 2008

How text search has evolved over the past 15 years

I just stumbled across a brilliant summary of evolution in text search technology, written four years ago. It’s equally valid today (which in itself says something). I found it on the Prism Legal blog, but the actual author is Sharon Flank. My own comments are interspersed in bold. Read more

Categories: Enterprise search, Ontologies, Search engines, Structured search

Expert System S.p.A. update

I chatted with Brooke Aker, the new CEO of Expert System’s US subsidiary, for quite a while last week. Unfortunately, we had some cell phone problems, and email followup hasn’t gone well, so I’m hazy on a few details. But here are some highlights, as best I understood them. Read more

Categories: Application areas, Competitive intelligence, Coveo, Expert System S.p.A., Ontologies, Text mining

May 8, 2008

Google seems to have rehabilitated us

As previously noted, we were de-indexed by Google, due to the injection of a whole lot of spammy hidden links. We’re back now, after about two weeks, even on the blog (this one) where there was no official de-indexing notice and hence no way to apply for re-consideration. And thus we once again have high rankings for search terms such as Netezza, DATAllegro, Clarabridge, and Attivio.

We’re designing a new blog theme — the current one is just an emergency stopgap — that will (among myriad more important virtues) be more SEO-friendly. I’ll be curious to see whether that makes much actual difference from a search ranking standpoint.

Categories: Google, Search engine optimization (SEO), Spam and antispam

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

