Spam and antispam
Analysis of spam, both e-mail and web-based, and of technology that attempts to defeat it.
A very fast splogger
The first post ever on Strategic Messaging went up at 2:49 am. Within four hours, I had my first splog trackbacks, all from the same site. The strategicmessaging.com domain itself had just repropagated through DNS hours earlier, and had no incoming links other than Whois and the like.
Pretty impressive spamming. Not that it did him any good, of course, except insofar as he was stealing a bit of my content …
Restoring security and function to my mail and websites
OK. Here’s the story as I now know it.
- monash.com was hit by a massive mail-bomb Christmas Eve. My email and websites went down for a while as a consequence. What’s more, with a flooded mail queue, there were further mail problems through at least 12/28. Some mail bounced, and other mail that appeared to go through was lost forever. If you’ve mailed me since 12/24 and I haven’t answered, please send again.
- The mail-bomb paved the way for an injection of some malware. I started noticing possible trojans on monash.com 12/31. Melissa Bradshaw, my stellar web designer, noticed JavaScript that she hadn’t written, both on monash.com and dbms2.com. So far as we could tell, standard anti-malware client protections were sufficient to keep any trojans from being successfully downloaded to clients.
- My very attentive web hosting company, Dimension Servers, is rebuilding its Linux kernel accordingly. Scheduled downtime for all my sites and mail is midnight to 3:00 am Eastern tonight, but that’s obviously just a rough estimate. Company president Jonathan MacAllister telephoned me to tell me this personally, notwithstanding that his wife delivered a baby by emergency C-section today. (Wife and baby are OK!)
- Jonathan also told me that after the attack, he bought a Cisco appliance. Every web hosting company needs to do that, as appliances are much more efficient at dealing with overloading attacks than the servers themselves. Cisco was a brand choice pretty much dictated by his remote data center.
- David Ferris and Richi Jennings have convinced me to move monash.com email to Google’s free mail hosting service. This is what they’re doing with ferris.com mail and all of Richi’s domains as well. NO analysts are more reliable on email than David and Richi. And hosting is surely no exception, as David and I did a research project together some years ago uncovering the Critical Path sham.
- The net effect of that move will be that monash.com and dbms2.com have their email managed quite separately, so if you can’t get me at one, please try the other. Generally, if you don’t know me you should write to monash.com, and I’ll probably write back from dbms2.com.
- I’ll post about all this again after things seem to have worked out, possibly over on the Monash Report.
Happy New Year,
CAM
Categories: Spam and antispam | 2 Comments |
I’m getting mailbombed again
Shortly after my first reference to Shoemoney’s DMOZ issues (who did you think I meant with “shoe in his mouth”?), I got mailbombed big time. Things calmed down after a month or so, although I did change web hosting companies in the fallout.
Starting Christmas Eve — which coincidentally was shortly after a forum mention of various Shoemoney flaps, and of the first attack — I got hit again. And there was another wave right after Christmas. A fair amount of email was lost forever, possibly both professional and personal. My blogs also were down for a while, as were other sites on the same server. (And if you sent me any email over that time period, please resend it.)
It seems that I should move my email/MX record to a different service from the one that hosts my websites, perhaps one that has invested in technology to efficiently deflect DDoS attacks. (Or perhaps I should move one domain with it, if a traditional hosting deal seems best.) Does anybody have any recommendations of such services? Read more
An Occurrence at Owl Creek Bridge and other SEO spam explained
I average upwards of 100 spam comments per day per blog, very little of which actually gets through (although that very little is obviously enough to be quite annoying!). Recent research from Sunbelt explains part of what’s going on. (More here in Computerworld.) What’s going on is this:
1. Aggressive black-hat SEO is being done for all kinds of long-tail terms and phrases, by posting comment spam filled with little except links on those phrases. For example, one of the first spams I checked for this post consists simply of 10 links to the same .cn domain, with anchor text and subdomain name being the same keyphrase. Keyphrases included “an occurrence at owl creek bridge”, “allegheny assessment county tax”, and “am been hate i ive who who.” As this kind of spam came by, I’d been wondering why people bothered, since it didn’t seem terribly easy to monetize. Read more
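The pattern described above (a comment that is almost nothing but links, each with anchor text matching the subdomain it points to) is regular enough to sketch as a filter. What follows is only a rough illustration in Python, not how Sunbelt, Akismet, or any real antispam product works. The function names, the thresholds (`min_links`, `max_plain_words`), and the `example.cn` host are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects [href, anchor text] pairs plus the text outside any link."""

    def __init__(self):
        super().__init__()
        self.links = []        # one [href, anchor_text] entry per <a> tag
        self.plain_text = []   # text that is not inside an <a> tag
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append([dict(attrs).get("href") or "", ""])
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.links[-1][1] += data
        else:
            self.plain_text.append(data)


def normalize(phrase):
    """Lowercase and keep only letters and digits, so 'Owl Creek' == 'owl-creek'."""
    return "".join(ch for ch in phrase.lower() if ch.isalnum())


def looks_like_keyphrase_spam(comment_html, min_links=5, max_plain_words=10):
    """Hypothetical heuristic: many links, little other text, and anchor
    text that mostly matches the first label of the linked hostname."""
    parser = LinkCollector()
    parser.feed(comment_html)
    parser.close()
    if len(parser.links) < min_links:
        return False
    if len(" ".join(parser.plain_text).split()) > max_plain_words:
        return False
    matches = 0
    for href, text in parser.links:
        host = urlparse(href).hostname or ""
        subdomain = host.split(".")[0]
        if subdomain and normalize(subdomain) == normalize(text):
            matches += 1
    return matches >= len(parser.links) / 2


# Build a sample comment shaped like the spam described above (made-up host).
phrases = ["an occurrence at owl creek bridge", "allegheny assessment county tax"] * 5
sample_links = []
for phrase in phrases:
    slug = phrase.replace(" ", "-")
    sample_links.append('<a href="http://' + slug + '.example.cn/">' + phrase + "</a>")
sample_spam = " ".join(sample_links)

print(looks_like_keyphrase_spam(sample_spam))
```

A real filter would need far more than this, since spammers vary the pattern as soon as any one rule catches on. The point is only that this particular wave is mechanical enough to be spotted mechanically.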
Categories: Search engine optimization (SEO), Spam and antispam | 1 Comment |
Text analytics marketplace trends
It was tough to judge user demand at the recent Text Analytics Summit because, well, very few users showed up. And frankly, I wasn’t as aggressive at pumping vendors for trends as I am some other times. That said, I have talked with most text analytics vendors recently,* and here are my impressions of what’s going on. Any contrary – or confirming! — opinions would be most welcome.
*Factiva is the most significant exception. Hint, hint.
If you think about it, text analytics is a “secret ingredient” in search, antispam, and data cleaning,* and this dominates all other uses of the technology. A significant minority of the research effort at companies that do any kind of text filtering is – duh — text analytics. Cold comfort for specialist text analytics vendors, to be sure, but that’s the way it is.
*I.e., part of the “T” in “ETL” (Extract/Transform/Load).
Text-analytics-enhanced custom publishing will surely at some point become a must-have for business and technical publishers. However, it appears that we’re not quite there yet, as large publishers make do with simple-minded search and the like. In what I suspect is a telling market commentary, there’s no headlong rush among vendors to dump text mining for custom publishing, notwithstanding the examples of nStein and (sort of) ClearForest. I don’t want to be overly negative – either my friends at Mark Logic are doing just fine or else they’re putting up a mighty brave front – but I don’t think the nonspecialist publishing market is there yet. Read more
I’ve decided to trust Akismet/Bad Behavior
Akismet recently upgraded so that you can see all the spam it’s holding, not just the last 150 messages. This made me a lot happier — but ironically I quickly gave up, and decided to trust Akismet without checking. Why? Well, Akismet sequesters 15 days of spam, and I currently have the following numbers of messages stashed away in it:
- 2246 here on Text Technologies.
- 4427 on DBMS2.
- 816 on Software Memories.
- 5156 on the Monash Report.
That’s over 800 spam per day across four blogs. And when I did check, I almost never found a false positive, except occasionally a trackback of my own.
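For anyone who wants to check that figure, here is the arithmetic as a quick Python sketch. The variable names are just for illustration, and it assumes the Akismet queue really does cover a full 15 days.

```python
# Spam counts held by Akismet, as listed above.
held_spam = {
    "Text Technologies": 2246,
    "DBMS2": 4427,
    "Software Memories": 816,
    "Monash Report": 5156,
}
RETENTION_DAYS = 15  # Akismet sequesters 15 days of spam

total = sum(held_spam.values())
per_day = total / RETENTION_DAYS
print(f"{total} messages over {RETENTION_DAYS} days = {per_day:.0f} per day")
# prints "12645 messages over 15 days = 843 per day"
```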
More problematic is my e-mail. Eudora flags pretty much everything that isn’t from an established sender as spam. So along with my 300+ true spam, I get a number of false positives per day, some of which have turned into paying customer relationships. So THAT spam directory I do check carefully …
Categories: Blogosphere, Spam and antispam | Leave a Comment |
So THAT’S why Andrew Orlowski still has a job (Part 2)
Andrew Orlowski is an over-the-top jerk, and a pretty sloppy reporter and analyst to boot. But he occasionally makes a good point even so. In the most recent instance, he confronted Tim Berners-Lee. As the article makes clear, Berners-Lee reacted badly to Orlowski, reflecting an attitude that is probably shared by 99% of the people who encounter the guy, and in the future will probably be adopted by sentient computers as well. Even so, Orlowski’s underlying point is valid: If the Semantic Web is going to be any more spam-free than the current Web, nobody has adequately explained why.
Categories: Ontologies, Spam and antispam | 2 Comments |
Is DMOZ the cure to Wikipedia’s spam problem?
Joost de Valk makes an interesting suggestion, namely that Wikipedia should drop all external links other than to DMOZ, and rely on DMOZ as the outside link directory. As a division of labor, it makes perfect sense. However, it’s a total non-starter until at least two problems are solved. Read more
Categories: Categorization and filtering, Directories, ODP and DMOZ, Ontologies, Spam and antispam | 5 Comments |
Please switch to my back-up e-mail address
At least for the moment.
monash.com e-mail has been turned off by my hosting company, due to what they claim is a still on-going attack. My backup address, however — FirstnameLastname@domain.com, where domain = dbms2 — is working fine. And my e-mail client traditionally checks them at the same time. So I suggest switching, at least for the moment.
Both are through the same hosting company (Hostgator, which I aspire to replace in the immediate future, given that I also lost admin access to the blogs on two separate occasions this week, and given that support claims over half my e-mails are unreadably empty and hence suitable for being ignored, despite my never having had that problem elsewhere). Thus, for other kinds of problems there might be a single point of failure. But in this case, the dbms2 address is a working alternative to the standard one.
Categories: About this blog, Spam and antispam | Leave a Comment |
A great new (to me) phrase – “Adversarial Information Retrieval”
I’ve just discovered a great new phrase – adversarial information retrieval. It’s not really new, since papers are now being accepted for what will be the third annual conference on the subject. But it seems to have gained currency over the past few months.
Edit: The term seems to have been coined in 2000.
I think this area is really where the bulk of the research into public search engine algorithms goes. And that’s another way of saying that web and enterprise search are very different things.