Spam and antispam – Text Technologies

Yet more NoFollow whining

Curt Monash — Sat, 07 Mar 2009 04:58:15 +0000

Andy Beal has a blog post up to the effect that NoFollow is a bad thing. (Edit: Andy points out in the comment thread that his opposition to NoFollow isn’t as absolute as I was suggesting.) Other SEO types are promoting this as if it were some kind of important cause. I think that’s nuts, and NoFollow is a huge spam-reducer.

The weakness of Andy’s argument is illustrated by the one and only scenario he posits in support of his crusade:

The result is that a blog post added to a brand new site may well have just broken the story about the capture of Bin Laden (we wish!)–and a link to said post may have been Tweeted and re-tweeted–but Google won’t discover or index that post until it finds a “followed” link. Likely from a trusted site in Google’s index and likely hours, if not days, after it was first shared on Twitter.

Helloooo. If I post something here, it is indexed at least in Google blog search immediately. (As in, within a minute or so.) Ping, crawl, pop: there it is. The only remotely valid version of Andy’s complaint is that it might take some hours for Google’s main index to update, but even then there’s a News listing at the top. This simply is not a problem.

Now, I think it would be personally great for me if all the links to my sites from Wikipedia and Twitter and the comment threads of major blogs pointed back with “link juice.” On the other hand, even with NoFollow out there, my sites come up high in Google’s rankings for all sorts of keywords, driving a lot of their readership. I imagine the same is true for most other sites containing fairly unique content that people find interesting enough to link to.

So other than making it harder to engage in deceptive SEO, I fail to see what problems NoFollow is causing.
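For anyone who hasn't looked at the markup: NoFollow is just a `rel="nofollow"` token on a link. Here is a minimal Python sketch of how a crawler might separate followed links from NoFollowed ones. The class and function names are made up for illustration, and what Google actually does with each group is its own business; this only shows the distinction in the HTML.

```python
from html.parser import HTMLParser


class LinkSorter(HTMLParser):
    """Collect hrefs, split by whether the link carries rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated token list, e.g. "external nofollow".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def sort_links(html):
    """Return (followed, nofollowed) lists of hrefs found in the HTML."""
    sorter = LinkSorter()
    sorter.feed(html)
    return sorter.followed, sorter.nofollowed


# Hypothetical page fragment: one editorial link, one comment-thread link.
sample = (
    '<p><a href="http://example.com/post">editorial link</a> '
    '<a rel="external nofollow" href="http://example.org/spam">comment link</a></p>'
)
followed, nofollowed = sort_links(sample)
```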

Are denial-of-insight attacks a threat to search logs and/or VOTC/VOTM apps?

Curt Monash — Wed, 12 Nov 2008 07:45:39 +0000

TechTaxi points out that it’s at least theoretically possible to, by polluting the Web, pollute somebody’s web-wide information gathering. (Hat tip to Daniel Tunkelang.) They further assert this is a relatively near-term threat.

The theory can’t be denied. What’s more, bad actors have other motives to pollute the Web. For example, if they plant favorable automated comments about their own products or unfavorable ones about the competition’s, Voice of the Customer/Market applications will naturally be confused. And if automated reputation-checkers get more prominent, there will be a major incentive to game them, just as there has been for Google’s PageRank. So VOTC/VOTM market research tools could be polluted as a side effect.

Similarly, if somebody wants to test your e-commerce site by throwing a ton of searches at it, your search logs will lose value.

But disinformation of competitors for the sake of disinformation? Or, as the article suggests, vandalism/extortion? Off the top of my head, I’m not thinking of a serious near-term threat scenario.

3 specialized markets for text analytics

Curt Monash — Thu, 19 Jun 2008 07:44:09 +0000

In the previous post, I offered a list of eight linguistics-based market segments, and a slide deck surveying them. And I promised a series of follow-up posts based on the slides.

Let me begin by explaining what I mean by some of that list (taken from Slide 2), starting from the bottom.

Machine translation is a small business, with small specialized vendors. Lernout & Hauspie attempted to combine it with voice recognition in a complex financial play, but that collapsed in a miasma of stock fraud. The remnants turned into what became Nuance Communications.
Nuance is a roll-up of most of the important independent voice recognition vendors. So far voice recognition has worked best in two areas: “Hands-free” computer use/dictation, and IVR (interactive voice response). While both are important, neither is exactly a mainstream enterprise computer software business. So voice recognition is not closely integrated with the other market segments.
“Natural language processing” other than voice recognition isn’t much of a business at this time (with apologies to Progress EasyAsk). It doesn’t make the list at all.
Spam filtering is obviously a major business, whether or not it is getting combined into more general security and/or messaging product suites. Antispam vendors actually perform a lot of machine learning, much like text miners do. But the types of rules they wind up with are quite different. And their hardest problems aren’t linguistic ones, usually, as the spammers have gone beyond text to, e.g., words depicted in graphical images. Besides, even where linguistics are involved, it’s a very different problem to identify words used by bad guys trying to spoof you (and the rest of the world) than it is to understand your particular users.

Why and to what extent I see the other five as separate markets was explained in connection with the subsequent 17 slides.

The Text Analytics Marketplace: Competitive landscape and trends

Curt Monash — Thu, 19 Jun 2008 07:35:39 +0000

As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are off-shoots of what used to be called “information retrieval”:

1. Web search

2. Public-facing site search

3. Enterprise search and knowledge management

4. Custom publishing

5. Text mining and extraction

Three are more standalone:

6. Spam filtering

7. Voice recognition

8. Machine translation

This list comes from a talk I gave Monday at the Text Analytics Summit called The Text Analytics Marketplace: Competitive landscape and trends. In half an hour, I covered the first five areas (in Sue Feldman’s word, at a “gallop”). The slide deck has been uploaded to the link below. I plan to break out the material from the talk into a series of blog posts over the next few (or perhaps not-so-few) weeks.

Slides:

The Text Analytics Marketplace: Competitive landscape and trends

Other posts based on those slides:

Three specialized markets for text analytics (based on Slide 2)
6 trends that could shake up the text analytics market (based on Slide 19)
Why search technologies are going to recombine (in A World of Bytes, based on Slide 19)

Google seems to have rehabilitated us

Curt Monash — Thu, 08 May 2008 09:16:18 +0000

As previously noted, we were de-indexed by Google, due to the injection of a whole lot of spammy hidden links. We’re back now, after about two weeks, even on the blog (this one) where there was no official de-indexing notice and hence no way to apply for re-consideration. And thus we once again have high rankings for search terms such as Netezza, DATAllegro, Clarabridge, and Attivio.

We’re designing a new blog theme — the current one is just an emergency stopgap — that will (among myriad more important virtues) be more SEO-friendly. I’ll be curious to see whether that makes much actual difference from a search ranking standpoint.

Drive-by Google de-listing

Curt Monash — Fri, 25 Apr 2008 05:05:15 +0000

As previously noted, we got hit with some hidden text, probably by SQL injection, and that led to a Google de-listing. Of the three blogs affected by the attack, I got a de-indexing notice for one (DBMS2); another was de-listed without a notice (Text Technologies); and a third seems to have waltzed through still indexed (Software Memories). I also received a de-indexing notice for another site I have nothing to do with and indeed had never heard of before. Go figure …

We’ve now upgraded to WordPress 2.5, which should close the vulnerability. (Thank you Melissa Bradshaw!) Fearing our old, buggy theme would degrade further, we upgraded to a new one, Biru, designed by Bob. There are some teething-pain stability issues, but if they don’t cause a reversion in the next day, I’ll apply to Google for re-inclusion. (Uh, does anybody have some boundaries around how long that’s likely to take?)

All these hours of aggravation because some criminal wanted a bit of SEO advantage …

Over 80 percent of blog posts are probably spam

Curt Monash — Tue, 04 Mar 2008 13:25:42 +0000

Doug Caverly highlights a Matt Mullenweg quote indicating that about 1/4 of all the blogs ever on WordPress.com were spam (aka splogs). Now, that’s probably a higher fraction than for the blogoverse overall, because:

WordPress.com provides costless hosting; using your own domain costs money.
Besides being free, WordPress.com hosting may provide a little “google juice”, which is the whole SEO point of spam blogging.

But there’s one more factor. Splogs have much higher posting frequency than real ones. 10-20+ posts per day is not uncommon, and 50-100+ is not unheard of. That’s 5-10X the post frequency of even the more active human-written blogs. So let’s assume:

10% of all blogs are spam.
10% of all blogs are actively written by humans.
80% of all blogs belong to humans, but are updated very infrequently if at all.

In that case, over 80% (and indeed probably over 90%) of all blog posts are made by machines rather than by human beings.
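The arithmetic behind that estimate can be sketched in a few lines of Python. The function name is made up for illustration, and the sketch assumes the dormant 80% of blogs contribute essentially no posts, so only the spam and active-human shares matter.

```python
def machine_post_share(spam_blogs, active_human_blogs, posts_ratio):
    """Fraction of all posts that come from splogs.

    spam_blogs, active_human_blogs: shares of all blogs (e.g. 0.10 each).
    posts_ratio: how many posts a splog makes per post on an active human blog.
    Dormant blogs are assumed to post approximately nothing.
    """
    spam_posts = spam_blogs * posts_ratio
    human_posts = active_human_blogs * 1.0
    return spam_posts / (spam_posts + human_posts)


# The assumptions above: 10% splogs, 10% active human blogs, 5-10X frequency.
low = machine_post_share(0.10, 0.10, 5)    # 5/6, about 83%
high = machine_post_share(0.10, 0.10, 10)  # 10/11, about 91%
```

So "over 80%" holds at the low end of the 5-10X range, and "over 90%" needs the ratio to be at or near 10X.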

19 Microsoft/Yahoo synergies that could revolutionize the Internet

Curt Monash — Sun, 03 Feb 2008 22:04:47 +0000

Many, perhaps most, commentators on Microsoft’s bid for Yahoo are thoroughly missing the point. The most interesting part of Microsoft’s bid for Yahoo isn’t the horse-race retrospective “How did they screw up so much as to need each other?” It’s not the incipient bidding war for Yahoo. And it’s certainly not the antitrust implications.

The Microsoft/Yahoo combination could revolutionize the Internet. I’m serious. The opportunities for huge synergies might just be enough to blast the merged companies out of their current uncreative, Innovator’s Dilemma funks. Search is open for radical transformation in user interface, universal search relevancy, Web/enterprise integration, and just about everything to do with advertising and monetization. Email stands to be utterly reinvented. Portals and business intelligence have only scratched the surface of their potential. And social networking is of course in its infancy.

Here’s an overview of where some synergies and opportunities for a combined Microsoft/Yahoo lie.

Search and contextual advertising

Query serving costs are variable, and some marketing costs are performance based. But there are major economies of scale in:

Web crawling. Those huge server farms are needed irrespective of query volume. It’s easier to compete in search overall when you can afford to do all the crawling you need.
Indexing. Ditto. (Recent discussion of Google MapReduce quantifies this processing effort a bit.)
Relevancy algorithm research. The challenge for relevancy algorithms keeps going up. Adversarial information retrieval is an ongoing struggle. Universal search and local search just multiply the challenge. Neither Microsoft nor Yahoo has consistently challenged Google’s search quality. A merged Microsoft/Yahoo, however, just might.
User interface research. Some day search results pages will change, offering more useful user drill-down. And mobile-device search is a whole different interface challenge, for input (e.g., voice) and output alike. This is one area where I think a merged Microsoft/Yahoo could easily make major contributions.
Advertising platform research. Unlike text search, which goes back to at least the 1980s, contextual advertising platforms were really introduced just in the current millennium. It’s still early in their life cycles, and a great deal of innovation is yet ahead, in all parts of the system. That’s true even on text-heavy Web pages, and it’s even truer on other platforms such as video and perhaps gaming. To see just how primitive the technology is right now, consider this: Google gets far more revenue per search than Yahoo or Microsoft, and there are only two reasonable explanations for the disparity – difference in the searchers/subjects, or technology. Surely to a large extent it’s the latter.
Hand assists to search. These are more important than you might first think. Google manually reviews a number of possibly-spammy sites, both to adjust their rankings directly (and those of sites in link networks with them) and to learn of needed algorithm tweaks. In the future, it’s easy to imagine user “voting” on sites becoming crucial to search in a variety of ways; while it may not identify the best sites, at least it will weed out spammy/bad ones. But whatever the system, people will try to game it, and human intervention will be needed accordingly. Again, there’s a lot of potential in this area to make the world – or at least the Web – a better place.
Marketing (partial). Marketing of search services seems to consist mainly of paying for placement, plus a whole lot of word of mouth. Neither of those is an obvious economies-of-scale cost center. But here’s the problem – Google is way ahead in the branding battle. Indeed, “to google” is a much-used verb. Microsoft, Yahoo, and/or Microsoft/Yahoo have a lot of branding ground to cover if – well, if they wish to recover. So if they ever do manage to build a product superior to Google’s, an expensive advertising/sponsorship campaign might turn out to be a really good idea.
Combining enterprise and web search. As I mentioned in my initial reaction to the Microsoft offer for Yahoo, FAST could be more important to the merged entity than is at first apparent. While relevancy ranking is a very different problem on the Web than in an enterprise, user interface issues are more similar. What’s more, there are potentially major benefits from truly integrating Web and enterprise search – again mainly on the UI side, but maybe in ontology leverage as well.

Email and antispam

Mail storage and serving costs, for the most part, are variable according to usage. Even so, there are important economies of scale in:

Antispam. Google, perhaps due to the Postini acquisition, is doing a great job of antispam right now. Yahoo, however, is a disaster in that regard, with much legitimate mail not getting through at all. And antispam is an arms race, with new development constantly needed.
General email software development. Antispam aside, online email software is still in sad shape. User interfaces, searching/filtering, and general stability are all problematic. Integration with client email software and other messaging is often even worse. Advertising potential is hard to monetize without unacceptable privacy violations. All told, there’s a lot of email software development ahead.
Marketing. If it were easy to market online email services other than by word of mouth, more marketing would probably be happening. If the challenge ever gets solved, the solution may be expensive.
Email integration with other messaging. As noted below, chat and social networking stand to be utterly transformed. What emerges will transform and perhaps even subsume email-as-we-know-it.
Email integration with search. One of the worst things about email is its primitive filtering, both when it arrives and when you’re looking for it later. Google has taken the lead on email/search integration, but this will be a long race that is currently still in the early laps.

Information portal and business intelligence

A few hundred thousand people rely on investment terminals such as Bloomberg or Reuters for their business news and general information. They’re pretty locked in. But the whole rest of the market is still up for grabs. Bill Gates’ “Information at your fingertips” speech was almost two decades ago, yet Microsoft is still not doing great as a provider of information or analytic tools (with the huge exception of Excel).

One obvious synergy is to deliver tame MSN-style traffic to the more established Yahoo portal. A second is to finally get serious about making SharePoint an integrated Web/enterprise portal. A third, less-obvious one – and an area I really need to write a lot more about soon – is the integration of business intelligence tools with public data sources.

Gaming, virtual worlds, identity, and social networking

Social networking and gaming are both evolving at ferocious speeds. Just think of Facebook, Twitter, Scrabulous, Second Life, or console games. Some major and almost inevitable future developments include:

Integration of instant messaging, group chat (IRC, Twitter), email, and perhaps other social networking, for both personal and enterprise uses. On both the client and server sides, there are good reasons for the functions to come together.
Subscriptions or other monetization strategies that cover a broad range of casual gaming, virtual world, and possibly other online recreational activities. Consoles, and standalone games with tens of hours of play value each, seem to work well as products. Other recreation categories need other monetization models. And by the way, massively multi-player online (MMO) games are on the upswing even in categories where standalone games are also viable.
Integrated identity. This is a huge subject, all the more as the number of services we want to participate in mushrooms. I think the technological part of the solution will wind up being XML-based (LDAP is in no way enough).

These are all big problems, where Microsoft and Yahoo actually gain from adding each other’s heft.

As long as the above list is – 19 items – it is far from complete. Please point out any you feel I overlooked. As for merger negotiations, antitrust, and eventual operational issues – I’ll leave those to another time. This post is long enough already.

Related:

Long Zheng runs through the Microsoft and Yahoo brands that would need to be combined.
Google fear-mongers about Evil Microsoft.
Charlene Li opines that Yahoo will fight the merger. (I think she may be underrating tired-founder syndrome.)
Bill Burnham thinks the deal would be very bad for M&A prices.
Edit: Follow-up re: implementation.
Edit: Follow-up re: deal terms and likelihood.

Anatomy of spam blogs

Curt Monash — Sat, 26 Jan 2008 23:25:54 +0000

A post that gives you a clear sense of how gobbledygook is automatically generated (from another knowledgeable black-hat SEO who can’t be bothered to make his permalink structure sensible).

Automation secrets of black hat SEO

Curt Monash — Wed, 16 Jan 2008 04:47:19 +0000

XMCP writes one of the better black hat SEO blogs. In a post last November, he laid out a ton of advice about automating black hat SEO. Personally, I don’t approve of doing black hat SEO. Still, it’s an intellectually interesting subject. What’s more, black hat SEOs create a large fraction of all websites, and certainly of all blog comments, links, and so on. So it’s interesting to track them.

Most interesting to me and probably to most readers here is the part that shows where black hat SEOs get their content:

Content Creation

Know your approach. You really have only 4-5 options

Direct Scraping, full data.

RSS Feeds

Content Generation/Markov Scripts

Manual, offshore labor

Make sure to have an easy way for those who do your writing to retrieve their assignments. Get a reliable crew that will check the buffer every day, and start pumping out the desired articles. Include an easy way for them to submit their work on a webpage.

If possible, have an automated payout system. Keep an automatic tally of their submitted articles, and have your script login to paypal and send them their payment. Be careful though, to avoid no payment, or god fobid duplicate payment.

Gibberish (Scrape/Cloaking sites)

No matter which way you choose do get your data, make sure it’s stored in a swiftly accessible database, and backed up consistently. Have it so all sites that are out there reference this database by domain, not IP. This way, if that server goes down, or is too distant from your most active web host, you can easily re-reroute the traffic to the backup database.

Have your content creation feature tie directly in to your keyword/topic database.

The idea behind those “Markov scripts” is that you

Obtain a large amount of genuine web content.
Derive frequencies with which any given phrase is followed by another.
Plug those frequencies into a Markov process that produces meaningless text.

Since the text is randomized and hence unique, it doesn’t trip the most obvious test for spam, namely a check for content duplicated from elsewhere. Further, because in some ways it resembles normal text, the black hat hopes it won’t trip any other spam tests either.
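To make the three steps concrete, here is a minimal Python sketch of a word-level Markov text generator, the kind of output an antispam filter is up against. It is my own illustration, not XMCP's script: I'm assuming the "phrase" is a fixed-length run of words (two by default), and the function names are made up.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the list of words seen right after it.

    Followers are stored with repeats, so picking one uniformly at random
    reproduces the observed follow-on frequencies.
    """
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, length=30, seed=None):
    """Random-walk the chain to emit `length` words of plausible-looking nonsense."""
    if not chain:
        raise ValueError("chain is empty; source text was too short")
    rng = random.Random(seed)
    states = list(chain)
    state = rng.choice(states)
    order = len(state)
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end (this run only occurred at the end of the source): restart.
            state = rng.choice(states)
            out.extend(state)
            continue
        out.append(rng.choice(followers))
        state = tuple(out[-order:])
    return " ".join(out[:length])


# Tiny stand-in corpus; a real one would be a large amount of genuine web text.
corpus = "the quick brown fox jumps over the lazy dog and the quick brown cat sleeps"
chain = build_chain(corpus)
text = generate(chain, length=20, seed=1)
```

Every short run of words in the output occurred somewhere in the source, which is why the result resembles normal text locally while meaning nothing overall.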

I basically believe that post, despite a couple of minor red flags (e.g., if he’s such an SEO expert, why is he using dynamic, numeric URLs in his own blog?). For one thing, the Slightly Shady SEO blog comes well-recommended in the SEO community. Besides, I’ve done a modest amount of reading on black hat subjects, and this indeed sounds like a legitimate first approximation to what’s really going on.