Spam and antispam

Analysis of spam, both e-mail and web-based, and of technology that attempts to defeat it.

March 7, 2009

Yet more NoFollow whining

Andy Beal has a blog post up to the effect that NoFollow is a bad thing. (Edit: Andy points out in the comment thread that his opposition to NoFollow isn’t as absolute as I was suggesting.) Other SEO types are promoting this is if it were some kind of important cause. I think that’s nuts, and NoFollow is a huge spam-reducer.

The weakness of Andy’s argument is illustrated by the one and only scenario he posits in support of his crusade:

The result is that a blog post added to a brand new site may well have just broken the story about the capture of Bin Laden (we wish!)–and a link to said post may have been Tweeted and re-tweeted–but Google won’t discover or index that post until it finds a “followed” link. Likely from a trusted site in Google’s index and likely hours, if not days, after it was first shared on Twitter.

Helloooo — if I post something here, it is indexed at least in Google blog search immediately. (As in, within a minute or so.) Ping, crawl, pop — there it is. The only remotely valid version of Andy’s complaint is that It might take some hours for Google’s main index to update — but even there there’s a News listing at the top. This simply is not a problem.

Now, I think it would be personally great for me if all the links to my sites from Wikipedia and Twitter and the comment threads of major blogs pointed back with “link juice.” On the other hand, even with NoFollow out there, my sites come up high in Google’s rankings for all sorts of keywords, driving a lot of their readership. I imagine the same is true for most other sites containing fairly unique content that people find interesting enough to link to.

So other than making it harder to engage in deceptive SEO, I fail to see what problems NoFollow is causing.

November 12, 2008

Are denial-of-insight attacks a threat to search logs and/or VOTC/VOTM apps?

TechTaxi points out that it’s at least theoretically possible to, by polluting the Web, pollute somebody’s web-wide information gathering. (Hat tip to Daniel Tunkelang.) They further assert this is a relatively near-term threat.

The theory can’t be denied. What’s more, bad actors have other motives to pollute the Web. For example, if they plant favorable automated comments about their own products or unfavorable about the competition’s, Voice of the Customer/Market applications will naturally be confused. And if automated reputation-checkers get more prominent, there will be a major incentive to game them, just as there has been for Google’s PageRank. So VOTC/VOTM market research tools could polluted as a side effect.

Similarly, if somebody wants to test your e-commerce site by throwing a ton of searches at it, your search logs will lose value.

But disinformation of competitors for the sake of disinformation? Or, as the article suggestions, vandalism/extortion? Off the top of my head, I’m not thinking of a serious near-term threat scenario.

June 19, 2008

3 specialized markets for text analytics

In the previous post, I offered a list of eight linguistics-based market segments, and a slide deck surveying them. And I promised a series of follow-up posts based on the slides. Read more

June 19, 2008

The Text Analytics Marketplace: Competitive landscape and trends

As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are off-shoots of what used to be called “information retrieval”:

1. Web search

2. Public-facing site search

3. Enterprise search and knowledge management

4. Custom publishing

5. Text mining and extraction

Three are more standalone:

6. Spam filtering

7. Voice recognition

8. Machine translation

Read more

May 8, 2008

Google seems to have rehabilitated us

As previously noted, we were de-indexed by Google, due to the injection of a whole lot of spammy hidden links. We’re back now, after about two weeks, even on the blog (this one) where there was no official de-indexing notice and hence no way to apply for re-consideration. And thus we once again have high rankings for search terms such as Netezza, DATAllegro, Clarabridge, and Attivio.

We’re designing a new blog theme — the current one is just an emergency stopgap — that will (among myriad more important virtues) be more SEO-friendly. I’ll be curious to see whether that makes much actual difference from a search ranking standpoint.

April 25, 2008

Drive-by Google de-listing

As previously noted, we got hit with some hidden text, probably by SQL injection, and that lead to a Google de-listing. Of the three blogs affected by the attack, I got a de-indexing notice for one (DBMS2); another was de-listed without a notice (Text Technologies); and a third seems to have waltzed through still indexed (Software Memories). I also received a de-indexing notice for another site I have nothing to do with and indeed had never heard of before. Go figure …

We’ve now upgraded to WordPress 2.5, which should close the vulnerability. (Thank you Melissa Bradshaw!) Fearing our old, buggy theme would degrade further, we upgraded to a new one, Biru, designed by Bob. There are some teething-pain stability issues, but if they don’t cause a reversion in the next day, I’ll apply to Google for re-inclusion. (Uh, does anybody have some boundaries around how long that’s likely to take?)

All these hours of aggravation because some criminal wanted a bit of SEO advantage …

March 4, 2008

Over 80 percent of blog posts are probably spam

Doug Caverly highlights a Matt Mullenweg quote indicating that about 1/4 of all the blogs ever on were spam (aka splogs). Now, that’s probably a higher fraction than for the blogoverse overall, because:

But there’s one more factor. Splogs have much higher posting frequency than real ones. 10-20+ posts per day is not uncommon, and 50-100+ is not unheard of. That’s 5-10X the post frequency of even the more active human-written blogs. So let’s assume:

In that case, over 80% (and indeed probably over 90%) of all blog posts are made by machines rather than by human beings.

February 3, 2008

19 Microsoft/Yahoo synergies that could revolutionize the Internet

Many – perhaps most — commentators on Microsoft’s bid for Yahoo are thoroughly missing the point. The most interesting part of Microsoft’s bid for Yahoo isn’t the horse-race retrospective “How did they screw up so much as to need each other?” It’s not the incipient bidding war for Yahoo. And it’s certainly not the antitrust implications.

The Microsoft/Yahoo combination could revolutionize the Internet. I’m serious. The opportunities for huge synergies might just be enough to blast the merged companies out of their current uncreative, Innovator’s Dilemma funks. Search is open for radical transformation in user interface, universal search relevancy, Web/enterprise integration, and just about everything to do with advertising and monetization. Email stands to be utterly reinvented. Portals and business intelligence have only scratched the surface of their potential. And social networking is of course in its infancy.

Here’s an overview of where some synergies and opportunities for a combined Microsoft/Yahoo lie. Read more

January 26, 2008

Anatomy of spam blogs

A post that gives you a clear sense of how gobbledydook is automatically generated (from another knowledgeable black-hat SEO who can’t be bothered to get his permalink structure sensible 😉 )

January 16, 2008

Automation secrets of black hat SEO

XMCP writes one of the better black hat SEO blogs. In a post last November, he laid out a ton of advice about automating black hat SEO. Personally, I don’t approve of doing black hat SEO. Still, it’s an intellectually interesting subject. What’s more, black hat SEOs create a large fraction of all websites, and certainly of all blog comments, links, and so on. So it’s interesting to track them.

Most interesting to me and probably to most readers here is the part that shows where black hat SEOs get their content: Read more

Next Page →

Feed including blog about text analytics, text mining, and text search Subscribe to the Monash Research feed via RSS or email:


Search our blogs and white papers

Warning: include(): php_network_getaddresses: getaddrinfo failed: Name or service not known in /home/texttechnologies/public_html/wp-content/themes/monash/static_sidebar.php on line 29

Warning: include( failed to open stream: php_network_getaddresses: getaddrinfo failed: Name or service not known in /home/texttechnologies/public_html/wp-content/themes/monash/static_sidebar.php on line 29

Warning: include(): Failed opening '' for inclusion (include_path='.:/usr/lib/php:/usr/local/lib/php') in /home/texttechnologies/public_html/wp-content/themes/monash/static_sidebar.php on line 29