Spam and antispam
Analysis of spam, both e-mail and web-based, and of technology that attempts to defeat it.
Andy Beal has a blog post up to the effect that NoFollow is a bad thing. (Edit: Andy points out in the comment thread that his opposition to NoFollow isn’t as absolute as I was suggesting.) Other SEO types are promoting this as if it were some kind of important cause. I think that’s nuts, and NoFollow is a huge spam-reducer.
The weakness of Andy’s argument is illustrated by the one and only scenario he posits in support of his crusade:
The result is that a blog post added to a brand new site may well have just broken the story about the capture of Bin Laden (we wish!)–and a link to said post may have been Tweeted and re-tweeted–but Google won’t discover or index that post until it finds a “followed” link. Likely from a trusted site in Google’s index and likely hours, if not days, after it was first shared on Twitter.
Helloooo — if I post something here, it is indexed at least in Google blog search immediately. (As in, within a minute or so.) Ping, crawl, pop — there it is. The only remotely valid version of Andy’s complaint is that it might take some hours for Google’s main index to update — but even there, there’s a News listing at the top. This simply is not a problem.
Now, I think it would be personally great for me if all the links to my sites from Wikipedia and Twitter and the comment threads of major blogs pointed back with “link juice.” On the other hand, even with NoFollow out there, my sites come up high in Google’s rankings for all sorts of keywords, driving a lot of their readership. I imagine the same is true for most other sites containing fairly unique content that people find interesting enough to link to.
So other than making it harder to engage in deceptive SEO, I fail to see what problems NoFollow is causing.
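Since the whole argument turns on what NoFollow actually does, here is a minimal sketch of how a ranking-aware crawler might separate a page's links: anchors whose `rel` attribute contains "nofollow" can still be discovered, but pass no "link juice." The example page and URLs are made up for illustration.

```python
from html.parser import HTMLParser

class LinkClassifier(HTMLParser):
    """Split a page's outbound links into followed vs. nofollow."""
    def __init__(self):
        super().__init__()
        self.followed, self.nofollowed = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        href = d.get("href")
        if not href:
            return
        rel = (d.get("rel") or "").lower().split()
        (self.nofollowed if "nofollow" in rel else self.followed).append(href)

page = """
<p>See <a href="http://example.com/story">the story</a> and a
<a href="http://spam.example/casino" rel="nofollow">comment-spam link</a>.</p>
"""
classifier = LinkClassifier()
classifier.feed(page)
print(classifier.followed)    # links that would pass ranking credit
print(classifier.nofollowed)  # links a ranking algorithm discounts
```

The point, per the above: a comment spammer's link still exists and is still clickable, but a search engine honoring NoFollow simply drops it into the second bucket, removing the SEO payoff.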
|Categories: Google, Online marketing, Search engine optimization (SEO), Search engines, Spam and antispam||2 Comments|
TechTaxi points out that it is at least theoretically possible, by polluting the Web, to pollute somebody’s web-wide information gathering. (Hat tip to Daniel Tunkelang.) They further assert this is a relatively near-term threat.
The theory can’t be denied. What’s more, bad actors have other motives to pollute the Web. For example, if they plant favorable automated comments about their own products or unfavorable ones about the competition’s, Voice of the Customer/Market applications will naturally be confused. And if automated reputation-checkers get more prominent, there will be a major incentive to game them, just as there has been for Google’s PageRank. So VOTC/VOTM market research tools could be polluted as a side effect.
Similarly, if somebody wants to test your e-commerce site by throwing a ton of searches at it, your search logs will lose value.
But disinformation of competitors for the sake of disinformation? Or, as the article suggests, vandalism/extortion? Off the top of my head, I’m not thinking of a serious near-term threat scenario.
|Categories: Competitive intelligence, Search engines, Spam and antispam, Voice of the Customer||2 Comments|
As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are offshoots of what used to be called “information retrieval”:
1. Web search
2. Public-facing site search
3. Enterprise search and knowledge management
4. Custom publishing
5. Text mining and extraction
Three are more standalone:
6. Spam filtering
7. Voice recognition
8. Machine translation
|Categories: Language recognition, Natural language processing (NLP), Spam and antispam, Speech recognition||2 Comments|
As previously noted, we were de-indexed by Google, due to the injection of a whole lot of spammy hidden links. We’re back now, after about two weeks, even on the blog (this one) where there was no official de-indexing notice and hence no way to apply for re-consideration. And thus we once again have high rankings for search terms such as Netezza, DATAllegro, Clarabridge, and Attivio.
We’re designing a new blog theme — the current one is just an emergency stopgap — that will (among myriad more important virtues) be more SEO-friendly. I’ll be curious to see whether that makes much actual difference from a search ranking standpoint.
As previously noted, we got hit with some hidden text, probably by SQL injection, and that led to a Google de-listing. Of the three blogs affected by the attack, I got a de-indexing notice for one (DBMS2); another was de-listed without a notice (Text Technologies); and a third seems to have waltzed through still indexed (Software Memories). I also received a de-indexing notice for another site I have nothing to do with and indeed had never heard of before. Go figure …
We’ve now upgraded to WordPress 2.5, which should close the vulnerability. (Thank you, Melissa Bradshaw!) Fearing our old, buggy theme would degrade further, we upgraded to a new one, Biru, designed by Bob. There are some teething-pain stability issues, but if they don’t cause a reversion in the next day, I’ll apply to Google for re-inclusion. (Uh, does anybody have a sense of how long that’s likely to take?)
All these hours of aggravation because some criminal wanted a bit of SEO advantage …
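For readers wondering what “hidden text” injections look like in practice, here is a minimal sketch of scanning a page for the simplest variant: links wrapped in an inline `display:none` style. Real injections hide links many other ways (CSS classes, off-screen positioning, tiny fonts), so this is illustrative, not a real scanner, and the spam URLs are invented.

```python
from html.parser import HTMLParser

class HiddenLinkFinder(HTMLParser):
    """Collect hrefs that sit inside an inline display:none container."""
    CONTAINERS = ("div", "span", "p")

    def __init__(self):
        super().__init__()
        self.stack = []          # one bool per open container: hidden or not
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        style = (d.get("style") or "").replace(" ", "").lower()
        hidden = "display:none" in style
        if tag == "a":
            href = d.get("href")
            if href and (hidden or any(self.stack)):
                self.hidden_links.append(href)
        elif tag in self.CONTAINERS:
            self.stack.append(hidden)

    def handle_endtag(self, tag):
        if tag in self.CONTAINERS and self.stack:
            self.stack.pop()

page = """
<p>Normal post text with a <a href="http://example.com/ok">real link</a>.</p>
<div style="display: none">
  <a href="http://spam.example/pills">cheap pills</a>
  <a href="http://spam.example/casino">casino</a>
</div>
"""
finder = HiddenLinkFinder()
finder.feed(page)
print(finder.hidden_links)
```

A human reader never sees those links, but a crawler that doesn’t render CSS counts them — which is exactly the SEO advantage the attacker was after.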
Doug Caverly highlights a Matt Mullenweg quote indicating that about 1/4 of all the blogs ever on WordPress.com were spam (aka splogs). Now, that’s probably a higher fraction than for the blogoverse overall, because:
- WordPress.com provides costless hosting; using your own domain costs money.
- Besides being free, WordPress.com hosting may provide a little “google juice”, which is the whole SEO point of spam blogging.
But there’s one more factor. Splogs have much higher posting frequency than real ones. 10-20+ posts per day is not uncommon, and 50-100+ is not unheard of. That’s 5-10X the post frequency of even the more active human-written blogs. So let’s assume:
- 10% of all blogs are spam.
- 10% of all blogs are actively written by humans.
- 80% of all blogs belong to humans, but are updated very infrequently if at all.
In that case, over 80% (and indeed probably over 90%) of all blog posts are made by machines rather than by human beings.
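That conclusion is easy to check with back-of-envelope arithmetic, using the figures above as assumptions: splogs posting roughly 10-20 times per day, active human blogs posting at 1/5 to 1/10 that rate, and dormant blogs rounding to zero.

```python
def machine_share(splog_rate, human_rate):
    """Fraction of daily blog posts that are machine-generated, given
    per-blog posting rates and the mix assumed above: 10% splogs,
    10% actively human-written, 80% effectively dormant."""
    splog_posts = 0.10 * splog_rate   # posts/day contributed by splogs
    human_posts = 0.10 * human_rate   # posts/day contributed by humans
    return splog_posts / (splog_posts + human_posts)

print(machine_share(10, 2))    # conservative end: just over 83%
print(machine_share(20, 1.5))  # aggressive end: about 93%
```

So even the conservative end of the assumed rates puts machine-written posts over 80% of the total, and the aggressive end pushes past 90%.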
|Categories: Blogosphere, Search engine optimization (SEO), Social software and online media, Spam and antispam||Leave a Comment|
Many, perhaps most, commentators on Microsoft’s bid for Yahoo are thoroughly missing the point. The most interesting part of the bid isn’t the horse-race retrospective “How did they screw up so much as to need each other?” It’s not the incipient bidding war for Yahoo. And it’s certainly not the antitrust implications.
The Microsoft/Yahoo combination could revolutionize the Internet. I’m serious. The opportunities for huge synergies might just be enough to blast the merged companies out of their current uncreative, Innovator’s Dilemma funks. Search is open for radical transformation in user interface, universal search relevancy, Web/enterprise integration, and just about everything to do with advertising and monetization. Email stands to be utterly reinvented. Portals and business intelligence have only scratched the surface of their potential. And social networking is of course in its infancy.
Here’s an overview of where some synergies and opportunities for a combined Microsoft/Yahoo lie. Read more
|Categories: Enterprise search, Google, Microsoft, Search engines, Social software and online media, Spam and antispam, Website filtering, Yahoo||16 Comments|
XMCP writes one of the better black hat SEO blogs. In a post last November, he laid out a ton of advice about automating black hat SEO. Personally, I don’t approve of doing black hat SEO. Still, it’s an intellectually interesting subject. What’s more, black hat SEOs create a large fraction of all websites, and certainly of all blog comments, links, and so on. So it’s interesting to track them.
Most interesting to me and probably to most readers here is the part that shows where black hat SEOs get their content: Read more