Categorization and filtering

Analysis of technologies that focus on the categorization and filtering of documents and text.

April 25, 2008

Drive-by Google de-listing

As previously noted, we got hit with some hidden text, probably by SQL injection, and that led to a Google de-listing. Of the three blogs affected by the attack, I got a de-indexing notice for one (DBMS2); another was de-listed without a notice (Text Technologies); and a third seems to have waltzed through still indexed (Software Memories). I also received a de-indexing notice for another site I have nothing to do with and indeed had never heard of before. Go figure …

We’ve now upgraded to WordPress 2.5, which should close the vulnerability. (Thank you Melissa Bradshaw!) Fearing our old, buggy theme would degrade further, we upgraded to a new one, Biru, designed by Bob. There are some teething-pain stability issues, but if they don’t cause a reversion in the next day, I’ll apply to Google for re-inclusion. (Uh, does anybody have some boundaries around how long that’s likely to take?)

All these hours of aggravation because some criminal wanted a bit of SEO advantage …

March 4, 2008

Over 80 percent of blog posts are probably spam

Doug Caverly highlights a Matt Mullenweg quote indicating that about 1/4 of all the blogs ever on WordPress.com were spam (aka splogs). Now, that’s probably a higher fraction than for the blogoverse overall, because:

But there’s one more factor. Splogs have much higher posting frequency than real ones. 10-20+ posts per day is not uncommon, and 50-100+ is not unheard of. That’s 5-10X the post frequency of even the more active human-written blogs. So let’s assume:

In that case, over 80% (and indeed probably over 90%) of all blog posts are made by machines rather than by human beings.
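The arithmetic behind an estimate like this can be sketched roughly as follows. The specific assumptions the post goes on to use are not shown in this excerpt, so the function name `spam_post_share` and the input values below are hypothetical and purely illustrative.

```python
def spam_post_share(splog_fraction: float, posting_ratio: float) -> float:
    """Estimate the fraction of all blog posts that come from splogs.

    splog_fraction: share of blogs that are splogs, between 0 and 1.
    posting_ratio:  how many times more often a splog posts than a real blog.
    Both inputs are illustrative assumptions, not measured values.
    """
    if not 0.0 <= splog_fraction <= 1.0:
        raise ValueError("splog_fraction must be between 0 and 1")
    if posting_ratio < 0:
        raise ValueError("posting_ratio must be non-negative")

    # Normalize so that a real blog contributes 1 unit of posts.
    spam_posts = splog_fraction * posting_ratio
    human_posts = 1.0 - splog_fraction
    total = spam_posts + human_posts
    return spam_posts / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical inputs: 1/4 of blogs are splogs, at various posting ratios.
    for ratio in (5, 10, 12, 20, 50):
        share = spam_post_share(0.25, ratio)
        print(f"posting ratio {ratio:>2}x -> {share:.0%} of posts are spam")
```

With a quarter of blogs being splogs, this simple model crosses 80% once splogs post about 12 times as often as real blogs. Counting only active blogs, where splogs are likely a larger fraction, would push the share higher still.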

February 3, 2008

19 Microsoft/Yahoo synergies that could revolutionize the Internet

Many, perhaps most, commentators on Microsoft’s bid for Yahoo are thoroughly missing the point. The most interesting part of Microsoft’s bid for Yahoo isn’t the horse-race retrospective “How did they screw up so much as to need each other?” It’s not the incipient bidding war for Yahoo. And it’s certainly not the antitrust implications.

The Microsoft/Yahoo combination could revolutionize the Internet. I’m serious. The opportunities for huge synergies might just be enough to blast the merged companies out of their current uncreative, Innovator’s Dilemma funks. Search is open for radical transformation in user interface, universal search relevancy, Web/enterprise integration, and just about everything to do with advertising and monetization. Email stands to be utterly reinvented. Portals and business intelligence have only scratched the surface of their potential. And social networking is of course in its infancy.

Here’s an overview of where some synergies and opportunities for a combined Microsoft/Yahoo lie. Read more

January 31, 2008

The biggest text analytics company you probably never heard of

I caught up with Expert System S.p.A. last week. They turn out to be doing $10 million in text technology annual revenue. That alone is surprising (sadly), but what’s really remarkable is that they did it almost entirely in the Italian market. As you might guess, that figure includes a little bit of everything, from search engines to Italian language filters for Microsoft Office to text mining. But only $3½ million of Expert System’s revenue is from the government (and I think that includes civilian agencies), and under 30% is professional services, so on the whole it seems like a pretty real accomplishment. Oh yes – Expert System says it’s entirely self-funded.

As of last year, Expert System also has English-language products, and a couple of minor OEM sales in the US (for mobile search and semantic web applications). German- and Arabic-language products are in beta test. The company says that its market focus going forward is national security – surely the reason for the Arabic – and competitive intelligence. It envisions selling through partners such as system integrators, although I think that makes more sense for the government market than it does vis-a-vis civilian companies. In February the company is introducing a market intelligence product focused on sentiment analysis.

Expert System is a bit of a throwback, in that it talks lovingly of the semantic network that informs its products. Read more

January 26, 2008

Anatomy of spam blogs

A post that gives you a clear sense of how gobbledygook is automatically generated (from another knowledgeable black-hat SEO who can’t be bothered to get his permalink structure sensible ;) )

January 16, 2008

Automation secrets of black hat SEO

XMCP writes one of the better black hat SEO blogs. In a post last November, he laid out a ton of advice about automating black hat SEO. Personally, I don’t approve of doing black hat SEO. Still, it’s an intellectually interesting subject. What’s more, black hat SEOs create a large fraction of all websites, and certainly of all blog comments, links, and so on. So it’s interesting to track them.

Most interesting to me and probably to most readers here is the part that shows where black hat SEOs get their content: Read more

January 8, 2008

A very fast splogger

The first post ever on Strategic Messaging went up at 2:49 am. Within four hours, I had my first splog trackbacks, all from the same site. The strategicmessaging.com domain itself had just repropagated through DNS hours earlier, and had no incoming links other than Whois and the like.

Pretty impressive spamming. Not that it did him any good, of course, except insofar as he was stealing a bit of my content …

January 2, 2008

Restoring security and function to my mail and websites

OK. Here’s the story as I now know it.

  1. monash.com was hit by a massive mail-bomb Christmas Eve. My email and websites went down for a while as a consequence. What’s more, with a flooded mail queue, there were further mail problems through at least 12/28. Some mail bounced, and other mail that appeared to go through was lost forever. If you’ve mailed me since 12/24 and I haven’t answered, please send again.
  2. The mail-bomb paved the way for an injection of some malware. I started noticing possible trojans on monash.com 12/31. Melissa Bradshaw, my stellar web designer, noticed JavaScript that she hadn’t written, both on monash.com and dbms2.com. So far as we could tell, standard anti-malware client protections were sufficient to keep any trojans from being successfully downloaded to clients.
  3. My very attentive web hosting company, Dimension Servers, is rebuilding its Linux kernel accordingly. Scheduled downtime for all my sites and mail is midnight to 3:00 am Eastern tonight, but that’s obviously just a rough estimate. Company president Jonathan MacAllister telephoned me to tell me this personally, notwithstanding that his wife delivered a baby by emergency C-section today. (Wife and baby are OK!)
  4. Jonathan also told me that after the attack, he bought a Cisco appliance. Every web hosting company needs to do that, as appliances are much more efficient at dealing with overloading attacks than the servers themselves. Cisco was a brand choice pretty much dictated by his remote data center.
  5. David Ferris and Richi Jennings have convinced me to move monash.com email to Google’s free mail hosting service. This is what they’re doing with ferris.com mail and all of Richi’s domains as well. NO analysts are more reliable on email than David and Richi. And hosting is surely no exception, as David and I did a research project together some years ago uncovering the Critical Path sham.
  6. The net effect of that move will be that monash.com and dbms2.com have their email managed quite separately, so if you can’t get me at one, please try the other. Generally, if you don’t know me you should write to monash.com, and I’ll probably write back from dbms2.com.
  7. I’ll post about all this again after things seem to have worked out, possibly over on the Monash Report.

Happy New Year,

CAM

December 31, 2007

I’m getting mailbombed again

Shortly after my first reference to Shoemoney’s DMOZ issues (who did you think I meant with “shoe in his mouth”?), I got mailbombed big time. Things calmed down after a month or so, although I did change web hosting companies in the fallout.

Starting Christmas Eve — which coincidentally was shortly after a forum mention of various Shoemoney flaps, and of the first attack — I got hit again. And there was another wave right after Christmas. A fair amount of email was lost forever, possibly both professional and personal. My blogs also were down for a while, as were other sites on the same server. (And if you sent me any email over that time period, please resend it.)

It seems that I should move my email/MX record to a different service from the one that hosts my websites, perhaps one that has invested in technology to efficiently deflect DDoS attacks. (Or perhaps I should move one domain with it, if a traditional hosting deal seems best.) Does anybody have any recommendations of such services? Read more

November 29, 2007

An Occurrence at Owl Creek Bridge and other SEO spam explained

I average upwards of 100 spam comments per day per blog, very little of which actually gets through (although that very little is obviously enough to be quite annoying!). Recent research from Sunbelt explains part of what’s going on. (More here in Computerworld.) What’s going on is this:

1. Aggressive black-hat SEO is being done for all kinds of long-tail terms and phrases, by posting comment spam filled with little except links on those phrases. For example, one of the first spams I checked for this post consists simply of 10 links to the same .cn, with anchor text and subdomain name being the same keyphrase. Keyphrases included “an occurrence at owl creek bridge”, “allegheny assessment county tax”, and “am been hate i ive who who.” As this kind of spam came by, I’d been wondering why people bothered, since it didn’t seem terribly easy to monetize. Read more
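The pattern described above (a comment that is little more than a stack of links to one site) lends itself to a simple heuristic check. What follows is only a minimal sketch, not how any real spam filter works; the function names, thresholds, and the `spam.example` domain are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Matches <a ... href="URL" ...>anchor text</a>. A regex is a crude HTML
# parser, but it is enough for a sketch.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*\bhref\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def base_domain(url: str) -> str:
    """Return the last two hostname labels (naive; wrong for .co.uk etc.)."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def looks_like_link_spam(
    comment_html: str,
    min_links: int = 5,          # hypothetical threshold
    max_other_words: int = 10,   # hypothetical threshold
) -> bool:
    """Flag comments that are many links to one domain and little else."""
    anchors = ANCHOR_RE.findall(comment_html)
    if len(anchors) < min_links:
        return False
    domains = {base_domain(url) for url, _text in anchors}
    # Count the words left once the links themselves are stripped out.
    remainder = TAG_RE.sub(" ", ANCHOR_RE.sub(" ", comment_html))
    other_words = len(remainder.split())
    return len(domains) == 1 and other_words <= max_other_words


if __name__ == "__main__":
    spam = " ".join(
        f'<a href="http://owl-creek-bridge.spam.example/{i}">'
        "an occurrence at owl creek bridge</a>"
        for i in range(10)
    )
    ham = 'Nice post. See <a href="http://example.com/a">this</a> for more.'
    print(looks_like_link_spam(spam), looks_like_link_spam(ham))
```

A rule this blunt would be easy for a spammer to evade, so in practice it would be one signal among many.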
