Spam and antispam
Analysis of spam, both e-mail and web-based, and of technology that attempts to defeat it.
Are denial-of-insight attacks a threat to search logs and/or VOTC/VOTM apps?
TechTaxi points out that it’s at least theoretically possible to pollute somebody’s web-wide information gathering by polluting the Web. (Hat tip to Daniel Tunkelang.) They further assert this is a relatively near-term threat.
The theory can’t be denied. What’s more, bad actors have other motives to pollute the Web. For example, if they plant favorable automated comments about their own products or unfavorable ones about the competition’s, Voice of the Customer/Market applications will naturally be confused. And if automated reputation-checkers get more prominent, there will be a major incentive to game them, just as there has been for Google’s PageRank. So VOTC/VOTM market research tools could be polluted as a side effect.
Similarly, if somebody wants to test your e-commerce site by throwing a ton of searches at it, your search logs will lose value.
But disinformation of competitors for the sake of disinformation? Or, as the article suggests, vandalism/extortion? Off the top of my head, I’m not thinking of a serious near-term threat scenario.
| Categories: Competitive intelligence, Search engines, Spam and antispam, Voice of the Customer | 2 Comments |
3 specialized markets for text analytics
In the previous post, I offered a list of eight linguistics-based market segments, and a slide deck surveying them. And I promised a series of follow-up posts based on the slides.
| Categories: Language recognition, Natural language processing (NLP), Spam and antispam, Speech recognition | 2 Comments |
The Text Analytics Marketplace: Competitive landscape and trends
As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are off-shoots of what used to be called “information retrieval”:
1. Web search
2. Public-facing site search
3. Enterprise search and knowledge management
4. Custom publishing
5. Text mining and extraction
Three are more standalone:
6. Spam filtering
7. Voice recognition
8. Machine translation
Google seems to have rehabilitated us
As previously noted, we were de-indexed by Google, due to the injection of a whole lot of spammy hidden links. We’re back now, after about two weeks, even on the blog (this one) where there was no official de-indexing notice and hence no way to apply for re-consideration. And thus we once again have high rankings for search terms such as Netezza, DATAllegro, Clarabridge, and Attivio.
We’re designing a new blog theme — the current one is just an emergency stopgap — that will (among myriad more important virtues) be more SEO-friendly. I’ll be curious to see whether that makes much actual difference from a search ranking standpoint.
| Categories: Google, Search engine optimization (SEO), Spam and antispam | 1 Comment |
Drive-by Google de-listing
As previously noted, we got hit with some hidden text, probably by SQL injection, and that led to a Google de-listing. Of the three blogs affected by the attack, I got a de-indexing notice for one (DBMS2); another was de-listed without a notice (Text Technologies); and a third seems to have waltzed through still indexed (Software Memories). I also received a de-indexing notice for another site I have nothing to do with and indeed had never heard of before. Go figure …
We’ve now upgraded to Wordpress 2.5, which should close the vulnerability. (Thank you Melissa Bradshaw!) Fearing our old, buggy theme would degrade further, we upgraded to a new one, Biru, designed by Bob. There are some teething-pain stability issues, but if they don’t cause a reversion in the next day, I’ll apply to Google for re-inclusion. (Uh, does anybody have some boundaries around how long that’s likely to take?)
All these hours of aggravation because some criminal wanted a bit of SEO advantage …
| Categories: Google, Search engine optimization (SEO), Spam and antispam | 1 Comment |
Over 80 percent of blog posts are probably spam
Doug Caverly highlights a Matt Mullenweg quote indicating that about 1/4 of all the blogs ever on Wordpress.com were spam (aka splogs). Now, that’s probably a higher fraction than for the blogoverse overall, because:
- Wordpress.com provides costless hosting; using your own domain costs money.
- Besides being free, Wordpress.com hosting may provide a little “google juice”, which is the whole SEO point of spam blogging.
But there’s one more factor. Splogs have much higher posting frequency than real ones. 10-20+ posts per day is not uncommon, and 50-100+ is not unheard of. That’s 5-10X the post frequency of even the more active human-written blogs. So let’s assume:
- 10% of all blogs are spam.
- 10% of all blogs are actively written by humans.
- 80% of all blogs belong to humans, but are updated very infrequently if at all.
In that case, over 80% (and indeed probably over 90%) of all blog posts are made by machines rather than by human beings.
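The arithmetic behind that estimate can be sketched as follows. This is only a rough check under the assumptions above, plus two of my own that are labeled in the code: the infrequently updated blogs contribute essentially no posts, and an active human blog's posting rate is the unit of measure. The function name `spam_post_share` is made up for illustration.

```python
def spam_post_share(spam_blogs, active_blogs, dormant_blogs,
                    spam_rate, active_rate=1.0, dormant_rate=0.0):
    """Fraction of all posts that come from spam blogs.

    Blog counts are shares of all blogs; rates are posts per blog,
    measured relative to an active human blog (active_rate = 1.0).
    dormant_rate = 0.0 is an assumption: rarely updated blogs are
    treated as contributing no posts at all.
    """
    spam_posts = spam_blogs * spam_rate
    human_posts = active_blogs * active_rate + dormant_blogs * dormant_rate
    return spam_posts / (spam_posts + human_posts)


# 10% spam, 10% active human, 80% dormant, with splogs posting
# 5x and 10x as often as an active human blog.
for multiplier in (5, 10):
    share = spam_post_share(0.10, 0.10, 0.80, spam_rate=multiplier)
    print(f"{multiplier}x posting rate: {share:.1%} of posts are spam")
```

At a 5X posting rate the spam share is 5/6, a bit over 83%; at 10X it is 10/11, a bit under 91%. Any posts at all from the dormant 80% would pull both figures down somewhat.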
| Categories: Blogosphere, Search engine optimization (SEO), Social software and online media, Spam and antispam | Leave a Comment |
19 Microsoft/Yahoo synergies that could revolutionize the Internet
Many, perhaps most, commentators on Microsoft’s bid for Yahoo are thoroughly missing the point. The most interesting part of Microsoft’s bid for Yahoo isn’t the horse-race retrospective “How did they screw up so much as to need each other?” It’s not the incipient bidding war for Yahoo. And it’s certainly not the antitrust implications.
The Microsoft/Yahoo combination could revolutionize the Internet. I’m serious. The opportunities for huge synergies might just be enough to blast the merged companies out of their current uncreative, Innovator’s Dilemma funks. Search is open for radical transformation in user interface, universal search relevancy, Web/enterprise integration, and just about everything to do with advertising and monetization. Email stands to be utterly reinvented. Portals and business intelligence have only scratched the surface of their potential. And social networking is of course in its infancy.
Here’s an overview of where some synergies and opportunities for a combined Microsoft/Yahoo lie.
| Categories: Enterprise search, Google, Microsoft, Search engines, Social software and online media, Spam and antispam, Website filtering, Yahoo | 15 Comments |
Anatomy of spam blogs
A post that gives you a clear sense of how gobbledygook is automatically generated (from another knowledgeable black-hat SEO who can’t be bothered to make his permalink structure sensible)
Automation secrets of black hat SEO
XMCP writes one of the better black hat SEO blogs. In a post last November, he laid out a ton of advice about automating black hat SEO. Personally, I don’t approve of doing black hat SEO. Still, it’s an intellectually interesting subject. What’s more, black hat SEOs create a large fraction of all websites, and certainly of all blog comments, links, and so on. So it’s interesting to track them.
Most interesting to me and probably to most readers here is the part that shows where black hat SEOs get their content.
| Categories: Search engine optimization (SEO), Spam and antispam | 2 Comments |
A very fast splogger
The first post ever on Strategic Messaging went up at 2:49 am. Within four hours, I had my first splog trackbacks, all from the same site. The strategicmessaging.com domain itself had just repropagated through DNS hours earlier, and had no incoming links other than Whois and the like.
Pretty impressive spamming. Not that it did him any good, of course, except insofar as he was stealing a bit of my content …
