Categorization and filtering

Analysis of technologies that focus on the categorization and filtering of documents and text.

October 6, 2007

The Clarabridge approach to text mining

And for my sixth text mining post this weekend, here are some highlights of the Clarabridge technology story. (Sorry if it sounds clipped, but I’m a bit burned out …)

September 30, 2007

A tip for submitting to DMOZ — make your site description clear

I just picked out a few of the many unreviewed sites in my DMOZ categories to evaluate, and listed most of those I reviewed.

How did I choose which ones to screen? Mainly, I picked ones with focused descriptions, titles, and so on, that seemed likely to be listable based on that information (which is essentially all I see on the page where the submitted sites are linked). I correctly guessed that I’d be able to quickly understand what I was seeing, judge whether to list the site, and write the official site description.

August 31, 2007

A challenge to DMOZ bashers

Give or take a corrected typo, here’s a challenge to DMOZ bashers I just wrote in the flame war thread.

If you want to do something that is:

A. Correct
B. Credible
C. Potentially useful

just go find a specific category with terrible listings, and publicize the fact with overwhelmingly clear proof of your assessment.

If that’s not EASY for you to do … then maybe DMOZ isn’t so bad after all, eh?

In particular, I’d encourage you to post a version of the category that is clearly better than what is currently there.

August 31, 2007

DMOZ — yet another flame war

My latest thoughts about DMOZ and the ODP may be found in this blog comment thread.

The gist is:

Or something like that. As I said, it’s a flame war …

Anyhow, I’m flying off on a two-week snorkeling trip Saturday, and should be much mellower soon.

July 22, 2007

Text analytics marketplace trends

It was tough to judge user demand at the recent Text Analytics Summit because, well, very few users showed up. And frankly, I wasn’t as aggressive at pumping vendors for trends as I sometimes am. That said, I have talked with most text analytics vendors recently,* and here are my impressions of what’s going on. Any contrary (or confirming!) opinions would be most welcome.

*Factiva is the most significant exception. Hint, hint.

If you think about it, text analytics is a “secret ingredient” in search, antispam, and data cleaning,* and this dominates all other uses of the technology. A significant minority of the research effort at companies that do any kind of text filtering is (duh) text analytics. Cold comfort for specialist text analytics vendors, to be sure, but that’s the way it is.

*I.e., part of the “T” in “ETL” (Extract/Transform/Load).

Text-analytics-enhanced custom publishing will surely at some point become a must-have for business and technical publishers. However, it appears that we’re not quite there yet, as large publishers make do with simple-minded search and the like. In what I suspect is a telling market commentary, there’s no headlong rush among vendors to dump text mining for custom publishing, notwithstanding the examples of nStein and (sort of) ClearForest. I don’t want to be overly negative – either my friends at Mark Logic are doing just fine or else they’re putting up a mighty brave front – but I don’t think the nonspecialist publishing market is there yet.

June 6, 2007

I’ve decided to trust Akismet/Bad Behavior

Akismet recently upgraded so that you can see all the spam it’s holding, not just the last 150 messages. This made me a lot happier — but ironically I quickly gave up, and decided to trust Akismet without checking. Why? Well, Akismet sequesters 15 days of spam, and I currently have the following numbers of messages stashed away in it:

That’s over 800 spam per day across four blogs. And when I did check, I almost never found a false positive, except occasionally a trackback of my own.

More problematic is my e-mail. Eudora flags pretty much everything that isn’t from an established sender as spam. So along with my 300+ true spam, I get a number of false positives per day, some of which have turned into paying customer relationships. So THAT spam directory I do check carefully …
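To see why that kind of rule throws off false positives, here is a minimal sketch of a "flag anything from an unknown sender" filter. It is an illustration only, not Eudora's actual logic; the function name, the addresses, and the sample inbox are all hypothetical.

```python
from email.utils import parseaddr


def is_flagged_as_spam(from_header: str, known_senders: set[str]) -> bool:
    """Flag any message whose sender address is not already known.

    Hypothetical rule for illustration; it never looks at the message body.
    """
    _, address = parseaddr(from_header)
    return address.lower() not in known_senders


# Hypothetical data: one established sender, one new prospect, one real spammer.
known = {"colleague@example.com"}
inbox = [
    "Colleague <colleague@example.com>",
    "Prospective Client <buyer@example.org>",
    "cheap-pills@example.net",
]

flagged = [header for header in inbox if is_flagged_as_spam(header, known)]
print(flagged)
```

The prospect and the spammer land in the same bucket, because the rule only asks whether the sender is already known. That is the trade-off: almost nothing unwanted gets through, but every first-time correspondent is a false positive until somebody checks the spam folder.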

April 30, 2007

Wise Crowds of Long-Tailed Ants, or something like that

Baynote sells a recommendation engine whose motto appears to be “popularity implies accuracy.” While that leads to some interesting technological ideas (below), Baynote carries that principle to an unfortunate extreme in its marketing, which is jam-packed with inaccurate buzzspeak. While most of that is focused on a few trendy meme-oriented books, the low point of my briefing today was probably the insistence against pushback that “95%” of Google’s results depend on “PageRank.” (I think what Baynote really meant is “all off-page factors combined,” but anyhow I sure didn’t get the sense that accuracy was an important metric for them in setting their briefing strategy. And by the way, one reason I repeat the company’s name rather than referring to Baynote by a pronoun is that on-page factors DO matter in search engine rankings.)

That said, here’s the essence of Baynote’s story, as best I could figure it out.

April 17, 2007

For search, extreme network neutrality must not be compromised

In a recent post on the Monash Report, I drew a distinction between two aspects of the Internet: Jeffersonet and Edisonet. Jeffersonet deals in thoughts and ideas and research and scholarship and news and politics, and in commerce too. It’s what makes people so passionate about the Internet’s democracy-enhancing nature. It’s what needs to be protected by extreme network neutrality. And it’s modest enough in its bandwidth requirements that net neutrality is completely workable. (Edisonet, by way of contrast, comprises advanced applications in entertainment, teleconferencing, etc. that probably do require new capital investment and tiered pricing schemes.)

And if there’s one application that’s at the core of Jeffersonet, it’s search. No matter how much scary posturing telecom CEOs do – and no matter how profitable or monopolistic Google becomes – telecom carriers must never be allowed to show any preference among search engines! At least, that’s the case for text-centric search engines such as Google, Yahoo, and Microsoft run today. The reason is simple: The democratic part of the Internet only works so long as things can be found. And search will long be a huge part of how to find them. So search engine vendors must never be able to succeed based on a combination of good-enough results plus superior marketing and business development. They always have to be kept afraid of competition from engines that provide better actual search engine results.

March 26, 2007

So THAT’S why Andrew Orlowski still has a job (Part 2)

Andrew Orlowski is an over-the-top jerk, and a pretty sloppy reporter and analyst to boot. But he occasionally makes a good point even so. In the most recent instance, he confronted Tim Berners-Lee. As the article makes clear, Berners-Lee reacted badly to Orlowski, reflecting an attitude that is probably shared by 99% of the people who encounter the guy, and in the future will probably be adopted by sentient computers as well. Even so, Orlowski’s underlying point is valid: If the Semantic Web is going to be any more spam-free than the current Web, nobody has adequately explained why.

February 15, 2007

InQuira’s and Mercado’s approaches to structured search

InQuira and Mercado both have broadened their marketing pitches beyond their traditional specialties of structured search for e-commerce. Even so, it’s well worth talking about those search technologies, which offer features and precision that you just don’t get from generic search engines. There’s a lot going on in these rather cool products.

In broad outline, Mercado and InQuira each combine three basic search approaches:

Of the two, InQuira seems to have the more sophisticated ontology. Indeed, the not-wholly-absurd claim is that InQuira does natural-language processing (NLP). Both vendors incorporate user information in deciding which search results to show, in ways that may be harbingers of what generic search engines like Google and Yahoo will do down the road.
