Website filtering – Text Technologies

19 Microsoft/Yahoo synergies that could revolutionize the Internet

Curt Monash — Sun, 03 Feb 2008 22:04:47 +0000

Many, perhaps most, commentators on Microsoft’s bid for Yahoo are thoroughly missing the point. The most interesting part of Microsoft’s bid for Yahoo isn’t the horse-race retrospective “How did they screw up so much as to need each other?” It’s not the incipient bidding war for Yahoo. And it’s certainly not the antitrust implications.

The Microsoft/Yahoo combination could revolutionize the Internet. I’m serious. The opportunities for huge synergies might just be enough to blast the merged companies out of their current uncreative, Innovator’s Dilemma funks. Search is open for radical transformation in user interface, universal search relevancy, Web/enterprise integration, and just about everything to do with advertising and monetization. Email stands to be utterly reinvented. Portals and business intelligence have only scratched the surface of their potential. And social networking is of course in its infancy.

Here’s an overview of where some synergies and opportunities for a combined Microsoft/Yahoo lie.

Search and contextual advertising

Query serving costs are variable, and some marketing costs are performance based. But there are major economies of scale in:

Web crawling. Those huge server farms are needed irrespective of query volume. It’s easier to compete in search overall when you can afford to do all the crawling you need.
Indexing. Ditto. (Recent discussion of Google MapReduce quantifies this processing effort a bit.)
Relevancy algorithm research. The challenge for relevancy algorithms keeps going up. Adversarial information retrieval is an ongoing struggle. Universal search and local search just multiply the challenge. Neither Microsoft nor Yahoo has consistently challenged Google’s search quality. A merged Microsoft/Yahoo, however, just might.
User interface research. Some day search results pages will change, offering more useful user drill-down. And mobile-device search is a whole different interface challenge, for input (e.g., voice) and output alike. This is one area where I think a merged Microsoft/Yahoo could easily make major contributions.
Advertising platform research. Unlike text search, which goes back to at least the 1980s, contextual advertising platforms were really introduced just in the current millennium. It’s still early in their life cycles, and a great deal of innovation is yet ahead, in all parts of the system. That’s true even on text-heavy Web pages, and it’s even truer on other platforms such as video and perhaps gaming. To see just how primitive the technology is right now, consider this: Google gets greatly more revenue per search than Yahoo or Microsoft, and there are only two reasonable explanations for the disparity – difference in the searchers/subjects, or technology. Surely to a large extent it’s the latter.
Hand assists to search. These are more important than you might first think. Google manually reviews a number of possibly-spammy sites, both to adjust their rankings directly (and those of sites in link networks with them) and to learn of needed algorithm tweaks. In the future, it’s easy to imagine user “voting” on sites becoming crucial to search in a variety of ways; while it may not identify the best sites, at least it will weed out spammy/bad ones. But whatever the system, people will try to game it, and human intervention will be needed accordingly. Again, there’s a lot of potential in this area to make the world – or at least the Web – a better place.
Marketing (partial). Marketing of search services seems to consist mainly of paying for placement, plus a whole lot of word of mouth. Neither of those is an obvious economies-of-scale cost center. But here’s the problem – Google is way ahead in the branding battle. Indeed, “to google” is a much-used verb. Microsoft, Yahoo, and/or Microsoft/Yahoo have a lot of branding ground to cover if – well, if they wish to recover. So if they ever do manage to achieve superior product to Google, an expensive advertising/sponsorship campaign might turn out to be a really good idea.
Combining enterprise and web search. As I mentioned in my initial reaction to the Microsoft offer for Yahoo, FAST could be more important to the merged entity than is at first apparent. While relevancy ranking is a very different problem on the Web than in an enterprise, user interface issues are more similar. What’s more, there are potentially major benefits from truly integrating Web and enterprise search – again mainly on the UI side, but maybe in ontology leverage as well.

Email and antispam

Mail storage and serving costs, for the most part, are variable according to usage. Even so, there are important economies of scale in:

Antispam. Google, perhaps due to the Postini acquisition, is doing a great job of antispam right now. Yahoo, however, is a disaster in that regard, with much legitimate mail not getting through at all. And antispam is an arms race, with new development constantly needed.
General email software development. Antispam aside, online email software is still in sad shape. User interfaces, searching/filtering, and general stability are all problematic. Integration with client email software and other messaging is often even worse. Advertising potential is hard to monetize without unacceptable privacy violations. All told, there’s a lot of email software development ahead.
Marketing. If it were easy to market online email services other than by word of mouth, more marketing would probably be happening. If the challenge ever gets solved, the solution may be expensive.
Email integration with other messaging. As noted below, chat and social networking stand to be utterly transformed. What emerges will transform and perhaps even subsume email-as-we-know-it.
Email integration with search. One of the worst things about email is its primitive filtering, both when it arrives and when you’re looking for it later. Google has taken the lead on email/search integration, but this will be a long race that is currently still in the early laps.

Information portal and business intelligence

A few hundred thousand people rely on investment terminals such as Bloomberg or Reuters for their business news and general information. They’re pretty locked in. But the whole rest of the market is still up for grabs. Bill Gates’ “Information at your fingertips” speech was nearly two decades ago, yet Microsoft is still not doing great as a provider of information or analytic tools (with the huge exception of Excel).

One obvious synergy is to deliver tame MSN-style traffic to the more established Yahoo portal. A second is to finally get serious about making SharePoint an integrated Web/enterprise portal. A third, less-obvious one – and an area I really need to write a lot more about soon – is the integration of business intelligence tools with public data sources.

Gaming, virtual worlds, identity, and social networking

Social networking and gaming are both evolving at ferocious speeds. Just think of Facebook, Twitter, Scrabulous, Second Life, or console games. Some major and almost inevitable future developments include:

Integration of instant messaging, group chat (IRC, Twitter), email, and perhaps other social networking, for both personal and enterprise uses. On both the client and server sides, there are good reasons for the functions to come together.
Subscriptions or other monetization strategies that cover a broad range of casual gaming, virtual world, and possibly other online recreational activities. Consoles, and standalone games with tens of hours of play value each, seem to work well as products. Other recreation categories need other monetization models. And by the way, massively multi-player online (MMO) games are on the upswing even in categories where standalone games are also viable.
Integrated identity. This is a huge subject, all the more as the number of services we want to participate in mushrooms. I think the technological part of the solution will wind up being XML-based (LDAP is in no way enough).

These are all big problems, where Microsoft and Yahoo actually gain from adding each other’s heft.

As long as the above list is – 19 items – it is far from complete. Please point out any you feel I overlooked. As for merger negotiations, antitrust, and eventual operational issues – I’ll leave those to another time. This post is long enough already.

Related:

Long Zheng runs through the Microsoft and Yahoo brands that would need to be combined.
Google fear-mongers about Evil Microsoft.
Charlene Li opines that Yahoo will fight the merger. (I think she may be underrating tired-founder syndrome.)
Bill Burnham thinks the deal would be very bad for M&A prices.
Edit: Follow-up re: implementation.
Edit: Follow-up re: deal terms and likelihood.

The Chinese censorship threat continues to ratchet up

Curt Monash — Tue, 30 Jan 2007 14:01:07 +0000

Ted Samsen of Infoworld is worried that the Chinese are attempting to ratchet up internet censorship yet further. Welcome to the club, buddy. This problem is a big one, and I don’t think it’s going to be addressed without vigorous action. In particular, I suspect that what is needed may be some major efforts in white-hat spamming. Lance Cottrell of Anonymizer has clever ideas along those lines for fighting censorship in the short term, but I think a bigger effort is needed as well.

Google, by the way, is caught in a tough spot and knows it.

Enterprise-specific web search: High-end web search/mining appliances?

Curt Monash — Mon, 23 Oct 2006 00:27:59 +0000

OK. I have a vision of one way search could evolve, which I think deserves consideration on at least a “concept-car” basis. This is all speculative; I haven’t discussed it at length with the vendors who’d need to make it happen, nor checked the technical assumptions carefully myself. So I could well be wrong. Indeed, I’ve at least half-changed my mind multiple times this weekend, just in the drafting of this post. Oh yeah, I’m also mixing several subjects together here too. All-in-all, this is not my crispest post …

Anyhow, the core idea is that large enterprises spider and index a subset of the Web, and use that for most of their employees’ web search needs. Key benefits would include:

Filtering out spam hits. This is obviously important for search, and in some cases could help with public-web text mining as well. It should be OK to be more aggressive on spam-site filtering in an enterprise-specific index than it is in general web search.
Filtering out malicious/undesirable downloads of various sorts. I’m thinking mainly of malware/spyware here, but of course it can also be used for netnannying porn-prevention and the like. Again, this is more easily done for the enterprise market than for the search world at large. (I anyway think that Google could blow Websense out of the water any time they wanted to – except, of course, for the not-so-small matter of not being seen as participating in the censorship business – but that’s a separate discussion.)
Capturing employees’ search strings. This could be useful for various purposes, including discerning their interests, and building the corporate ontology for internal web search.
Freshness control. If there’s a site you really care about, you can make sure it’s re-indexed frequently.
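
Two of the benefits above, aggressive site filtering and freshness control, can be sketched in a few lines. This is only an illustration of the idea, not any vendor's design: the class name `CrawlPolicy`, the host names, and the use of hours as the time unit are all assumptions made up for the example.

```python
import heapq
from urllib.parse import urlsplit


class CrawlPolicy:
    """Hypothetical per-enterprise crawl policy: a host blocklist plus
    per-host re-index intervals (hours are assumed as the time unit)."""

    def __init__(self, default_interval, blocked_hosts=(), intervals=None):
        self.default_interval = default_interval
        self.blocked_hosts = set(blocked_hosts)
        self.intervals = dict(intervals or {})
        self._queue = []  # min-heap of (next_due_time, url)

    def allowed(self, url):
        # An enterprise index can afford to drop whole hosts outright.
        return urlsplit(url).hostname not in self.blocked_hosts

    def interval_for(self, url):
        # Sites the enterprise really cares about get a shorter interval.
        return self.intervals.get(urlsplit(url).hostname, self.default_interval)

    def schedule(self, url, now):
        """Queue a URL for crawling; returns False if its host is blocked."""
        if not self.allowed(url):
            return False
        heapq.heappush(self._queue, (now, url))
        return True

    def due(self, now):
        """Return every URL due at or before `now`, then reschedule each one."""
        ready = []
        while self._queue and self._queue[0][0] <= now:
            _, url = heapq.heappop(self._queue)
            ready.append(url)
        for url in ready:
            heapq.heappush(self._queue, (now + self.interval_for(url), url))
        return ready


policy = CrawlPolicy(
    default_interval=24,
    blocked_hosts={"spam.example"},
    intervals={"news.example": 1},  # re-index this host hourly
)
policy.schedule("http://news.example/", now=0)
policy.schedule("http://blog.example/", now=0)
policy.schedule("http://spam.example/x", now=0)  # rejected by the blocklist
```

After the first crawl at hour 0, the news site comes due again at hour 1 while the blog waits until hour 24. Capturing search strings would sit on the query side and is left out of this sketch.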

FAST, Convera, Google, and Microsoft all have the potential to introduce such a product package, although I’m not aware of any specific initiatives that exactly match what I have in mind. The closest may be Convera, which is providing a standard vertical-market-specific sub-Web designed for its government intelligence/law-enforcement customers. (I’ve forgotten whether this is on-site or on a SaaS basis.)

Of course, the idea has drawbacks, but I don’t see any of them as killers. Possible pitfalls include:

Copyright issues because indexing = copying? No more than for Google itself.
Reluctance to let a competitor spider you? That would be only a few pages at most.
Lack of full Web coverage? There always are public search engines as a fallback. Indeed, it might be interesting to show results from sanitized search and public search side by side, something which also would help with credibility and adoption.

One other thing about this hypothetical product – it might well be in appliance format. (Or it might be SaaS, just as general Web search currently is. Indeed, in some ways that’s a more likely possibility.) Here are some thoughts on both sides of the appliance issue:

1. There’s a clear technology trend at the high end of the relational data warehousing world: Massively parallel (MPP) “shared-nothing” systems are winning over the symmetric multi-processing (SMP) “shared-everything” systems that dominate the OLTP RDBMS world. (I’ve written about that at length over on DBMS2, e.g. in this post.)

Text search for large corpuses also seems to work best on MPP shared-nothing systems. This is famously the Google architecture, and Inktomi’s (i.e., Yahoo’s) as well. It’s also the FAST approach.

Doing text in MPP makes theoretical sense too; just partition the documents by node, search each node, and aggregate the results. Indeed, text search is an even clearer fit for MPP than relational query, since you can ALWAYS use a local index, with none of those pesky relational joins. At least, that’s true of the querying part. In actual spidering, you do have the problem of shipping URLs from one node to another; the same might be true of link information and other search algorithm inputs.
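
The "partition, search each node, aggregate" pattern can be sketched as below. It is a toy under stated assumptions, not how Google, Yahoo, or FAST actually do it: the class names `Node` and `Cluster` are invented, documents are assigned to nodes by a CRC32 hash of their ID, the "nodes" are just in-process objects, and the score is simply the number of distinct query terms a document contains.

```python
import zlib
from collections import defaultdict


class Node:
    """One shared-nothing node: its own documents and a purely local index."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> IDs of local documents

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def search(self, query, k):
        # Toy score: how many distinct query terms the document contains.
        scores = defaultdict(int)
        for term in set(query.lower().split()):
            for doc_id in self.index.get(term, ()):
                scores[doc_id] += 1
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        return ranked[:k]


class Cluster:
    """Partitions documents across nodes and merges per-node results."""

    def __init__(self, n_nodes):
        self.nodes = [Node() for _ in range(n_nodes)]

    def add(self, doc_id, text):
        # Deterministic hash partitioning: each document lives on one node.
        shard = zlib.crc32(doc_id.encode("utf-8")) % len(self.nodes)
        self.nodes[shard].add(doc_id, text)

    def search(self, query, k=10):
        # Scatter: every node answers from its local index only.
        partial = []
        for node in self.nodes:
            partial.extend(node.search(query, k))
        # Gather: merge the local top-k lists into one global top-k.
        partial.sort(key=lambda item: (-item[1], item[0]))
        return partial[:k]


cluster = Cluster(n_nodes=3)
cluster.add("a", "enterprise search appliance")
cluster.add("b", "web search engine")
cluster.add("c", "data warehouse appliance")
top = cluster.search("search appliance", k=2)  # [("a", 2), ("b", 1)]
```

Because the score depends only on the document itself, each node's local top-k is enough to build the global top-k, with no join-like traffic between nodes. A scoring function that used corpus-wide statistics or link information would need those inputs shipped between nodes, which is the caveat raised above.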

2. The relational world gives very mixed signals as to how much these MPP/shared-nothing products benefit from specialized hardware. There’s certainly a consensus that customers at least want preconfigured hardware. But beyond that it gets confusing. Teradata has specialized networking. IBM is going to preconfigured generic hardware, and also mumbles about a specialized data filtering chip. Netezza has a very custom system based on FPGAs. DATallegro loves its custom Infiniband networking add-ons, but mumbles about going to standard hardware. Greenplum and Kognitio are on standard hardware, although Greenplum focuses on a quasi-appliance through Sun. (If my memory is correct, Greenplum actually boasts at least one text-indexing customer, namely O’Reilly.)

Of that group, Kognitio (formerly White Cross) actually has software the most analogous to text indexing, in that they rely on compressed bitmaps for their data access, vs. the hashes and table scans of the other contenders. So their experience might be the most relevant here. Kognitio started out as a custom hardware vendor, but now runs on utterly standard blades. That said, they’re getting interest from their customers in having them somehow prepackage standard hardware, much as IBM is with its BCUs (Base Configuration Units).
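
To see why bitmap-based access resembles text indexing, here is a minimal sketch. It is not Kognitio's implementation, and it leaves out compression entirely; the class name `BitmapIndex` is invented, and Python integers stand in for the bitmaps. Each value (or term) gets one bitmap with a bit set for every row (or document) that contains it, and a multi-term lookup is a bitwise AND.

```python
class BitmapIndex:
    """Uncompressed bitmap index: one integer bitset per distinct value."""

    def __init__(self):
        self.bitmaps = {}  # value -> int whose bit i is set if row i has it
        self.size = 0      # number of rows appended so far

    def append(self, values):
        """Add one row (or document) and return its row number."""
        row = self.size
        for value in set(values):
            self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << row)
        self.size += 1
        return row

    def rows_matching_all(self, values):
        """Rows containing every given value, found by ANDing the bitmaps."""
        if not values:
            return []
        result = (1 << self.size) - 1  # start with every row selected
        for value in values:
            result &= self.bitmaps.get(value, 0)
        return [row for row in range(self.size) if (result >> row) & 1]


index = BitmapIndex()
index.append(["search", "appliance"])     # row 0
index.append(["search", "web"])           # row 1
index.append(["appliance", "warehouse"])  # row 2
rows = index.rows_matching_all(["search", "appliance"])  # [0]
```

Swap "row" for "document" and "value" for "term" and this is a conjunctive text query over posting lists, which is the sense in which the two workloads are analogous.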

And on the theoretical side: If you look at the specific problems solved by proprietary data warehousing hardware, such as speedy moving around of intermediate query results, they don’t seem to have strong analogues in the text search case.

3. But that’s just actual querying. There’s also the matter of spidering/indexing. And that has strong aspects of stream processing, data communications, maybe security, etc. – and those kinds of things do often call for appliances, and sometimes even special-purpose chips (more precisely, specially-programmed FPGAs).

So there you have it. The preponderance of evidence suggests appliances are the way to go, and also that the market would welcome an appliance package. But it’s hardly conclusive.

As for the related question of whether purely in-house enterprise search should be done on appliances – well, the same considerations apply. There’s good market receptivity to low-end search appliances and high-end data warehousing appliances, so there’s at least a plausibility argument that high-end enterprise search should be done on appliances as well. But it’s not an open-and-shut case.