Categorization and filtering

Analysis of technologies that focus on the categorization and filtering of documents and text. Related subjects include:

November 12, 2008

Are denial-of-insight attacks a threat to search logs and/or VOTC/VOTM apps?

TechTaxi points out that it’s at least theoretically possible to, by polluting the Web, pollute somebody’s web-wide information gathering. (Hat tip to Daniel Tunkelang.) They further assert this is a relatively near-term threat.

The theory can’t be denied. What’s more, bad actors have other motives to pollute the Web. For example, if they plant favorable automated comments about their own products or unfavorable about the competition’s, Voice of the Customer/Market applications will naturally be confused. And if automated reputation-checkers get more prominent, there will be a major incentive to game them, just as there has been for Google’s PageRank. So VOTC/VOTM market research tools could polluted as a side effect.

Similarly, if somebody wants to test your e-commerce site by throwing a ton of searches at it, your search logs will lose value.

But disinformation of competitors for the sake of disinformation? Or, as the article suggestions, vandalism/extortion? Off the top of my head, I’m not thinking of a serious near-term threat scenario.

July 11, 2008

The phrase “business intelligence” was COINED for text analytics

Late last year, there was a little flap about who invented the phrase business intelligence. Credit turns out to go to an IBM researcher named H. P. Luhn, as per this 1958 paper. Well, I finally took a look at the paper, after Jeff Jones of IBM sent over another copy. And guess what? It’s all about text analytics. Specifically, it’s about what we might now call a combination of classification and knowledge management.

Half a century later, the industry is finally poised to deliver on that vision.

June 19, 2008

3 specialized markets for text analytics

In the previous post, I offered a list of eight linguistics-based market segments, and a slide deck surveying them. And I promised a series of follow-up posts based on the slides.

Read more

June 19, 2008

The Text Analytics Marketplace: Competitive landscape and trends

As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are off-shoots of what used to be called “information retrieval”:

1. Web search

2. Public-facing site search

3. Enterprise search and knowledge management

4. Custom publishing

5. Text mining and extraction

Three are more standalone:

6. Spam filtering

7. Voice recognition

8. Machine translation

Read more

June 15, 2008

How text search has evolved over the past 15 years

I just stumbled across a brilliant summary of evolution in text search technology, written four years ago. It’s equally valid today (which in itself says something). I found it on the Prism Legal blog, but the actual author is Sharon Flank. My own comments are interspersed in bold. Read more

June 11, 2008

Expert System S.p.A. update

I chatted with Brooke Aker, the new CEO of Expert System’s US subsidiary, for quite a while last week. Unfortunately, we had some cell phone problems, and email followup hasn’t gone well, so I’m hazy on a few details. But here are some highlights, as best I understood them.

Read more

May 8, 2008

Google seems to have rehabilitated us

As previously noted, we were de-indexed by Google, due to the injection of a whole lot of spammy hidden links. We’re back now, after about two weeks, even on the blog (this one) where there was no official de-indexing notice and hence no way to apply for re-consideration. And thus we once again have high rankings for search terms such as Netezza, DATAllegro, Clarabridge, and Attivio.

We’re designing a new blog theme — the current one is just an emergency stopgap — that will (among myriad more important virtues) be more SEO-friendly. I’ll be curious to see whether that makes much actual difference from a search ranking standpoint.

April 25, 2008

Drive-by Google de-listing

As previously noted, we got hit with some hidden text, probably by SQL injection, and that lead to a Google de-listing. Of the three blogs affected by the attack, I got a de-indexing notice for one (DBMS2); another was de-listed without a notice (Text Technologies); and a third seems to have waltzed through still indexed (Software Memories). I also received a de-indexing notice for another site I have nothing to do with and indeed had never heard of before. Go figure …

We’ve now upgraded to Wordpress 2.5, which should close the vulnerability. (Thank you Melissa Bradshaw!) Fearing our old, buggy theme would degrade further, we upgraded to a new one, Biru, designed by Bob. There are some teething-pain stability issues, but if they don’t cause a reversion in the next day, I’ll apply to Google for re-inclusion. (Uh, does anybody have some boundaries around how long that’s likely to take?)

All these hours of aggravation because some criminal wanted a bit of SEO advantage …

March 4, 2008

Over 80 percent of blog posts are probably spam

Doug Caverly highlights a Matt Mullenweg quote indicating that about 1/4 of all the blogs ever on Wordpress.com were spam (aka splogs). Now, that’s probably a higher fraction than for the blogoverse overall, because:

But there’s one more factor. Splogs have much higher posting frequency than real ones. 10-20+ posts per day is not uncommon, and 50-100+ is not unheard of. That’s 5-10X the post frequency of even the more active human-written blogs. So let’s assume:

In that case, over 80% (and indeed probably over 90%) of all blog posts are made by machines rather than by human beings.

February 3, 2008

19 Microsoft/Yahoo synergies that could revolutionize the Internet

Many – perhaps most — commentators on Microsoft’s bid for Yahoo are thoroughly missing the point. The most interesting part of Microsoft’s bid for Yahoo isn’t the horse-race retrospective “How did they screw up so much as to need each other?” It’s not the incipient bidding war for Yahoo. And it’s certainly not the antitrust implications.

The Microsoft/Yahoo combination could revolutionize the Internet. I’m serious. The opportunities for huge synergies might just be enough to blast the merged companies out of their current uncreative, Innovator’s Dilemma funks. Search is open for radical transformation in user interface, universal search relevancy, Web/enterprise integration, and just about everything to do with advertising and monetization. Email stands to be utterly reinvented. Portals and business intelligence have only scratched the surface of their potential. And social networking is of course in its infancy.

Here’s an overview of where some synergies and opportunities for a combined Microsoft/Yahoo lie.

Read more

Next Page →

Feed including blog about text analytics, text mining, and text search Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Recent white paper

The Explosion in DBMS Choice

August, 2008

Recent webcast

What leading database vendors don't want you to know

Originally broadcast April 9, 2008

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.