April 30th, 2007 Curt Monash
Baynote sells a recommendation engine whose motto appears to be “popularity implies accuracy.” While that leads to some interesting technological ideas (below), Baynote carries that principle to an unfortunate extreme in its marketing, which is jam-packed with inaccurate buzzspeak. While most of that is focused on a few trendy meme-oriented books, the low point of my briefing today was probably the insistence, against pushback, that “95%” of Google’s results depend on “PageRank.” (I think what Baynote really meant is “all off-page factors combined,” but anyhow I sure didn’t get the sense that accuracy was an important metric for them in setting their briefing strategy. And by the way, one reason I repeat the company’s name rather than referring to Baynote by a pronoun is that on-page factors DO matter in search engine rankings.)
That said, here’s the essence of Baynote’s story, as best I could figure it out.
Read the rest of this entry »
Posted in Baynote, Google, Ontologies and context identification, Search and text storage, Search engine optimization (SEO), Social software and media, Specialized search engines | 3 Comments »
April 17th, 2007 Curt Monash
In a recent post on the Monash Report, I drew a distinction between two aspects of the Internet: Jeffersonet and Edisonet. Jeffersonet deals in thoughts and ideas and research and scholarship and news and politics, and in commerce too. It’s what makes people so passionate about the Internet’s democracy-enhancing nature. It’s what needs to be protected by extreme network neutrality. And it’s modest enough in its bandwidth requirements that net neutrality is completely workable. (Edisonet, by way of contrast, comprises advanced applications in entertainment, teleconferencing, etc. that probably do require new capital investment and tiered pricing schemes.)
And if there’s one application that’s at the core of Jeffersonet, it’s search. No matter how much scary posturing telecom CEOs do – and no matter how profitable or monopolistic Google becomes – telecom carriers must never be allowed to show any preference among search engines! At least, that’s the case for text-centric search engines such as Google, Yahoo, and Microsoft run today. The reason is simple: The democratic part of the Internet only works so long as things can be found. And search will long be a huge part of how to find them. So search engine vendors must never be able to succeed based on a combination of good-enough results plus superior marketing and business development. They always have to be kept afraid of competition from engines that provide better actual search engine results.
Read the rest of this entry »
Posted in Censorship, Google, Search and text storage, Social software and media | No Comments »
February 23rd, 2007 Curt Monash
I’ve been musing about how big Google’s core database might be. Figuring that out is not a trivial problem, unless they’ve published the answer somewhere that I’m not aware of. But here’s a big clue, from an announcement about their n-gram data:
“We processed 1,024,908,267,229 words of running text”
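As a rough back-of-envelope sketch, that word count can be turned into a raw text size. The only number taken from the announcement is the word count; the figure of 6 bytes per word (letters plus one separator) is purely an assumption for illustration, and the result says nothing about index overhead, markup, or compression.

```python
# Word count quoted in Google's n-gram announcement.
WORDS = 1_024_908_267_229

# ASSUMPTION (not from the source): average bytes per word,
# counting one separator character per word.
ASSUMED_BYTES_PER_WORD = 6


def raw_text_terabytes(words: int, bytes_per_word: float) -> float:
    """Return the uncompressed text size in decimal terabytes (10**12 bytes)."""
    return words * bytes_per_word / 10**12


if __name__ == "__main__":
    size_tb = raw_text_terabytes(WORDS, ASSUMED_BYTES_PER_WORD)
    print(f"Roughly {size_tb:.1f} TB of raw text under the stated assumption")
```

Changing the assumed bytes per word scales the answer linearly, so the estimate is only as good as that guess.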
Read the rest of this entry »
Posted in Google, Search and text storage | No Comments »
February 3rd, 2007 Curt Monash
Hakia purports to be a new search engine that indexes “semantically,” which I presume means on phrases or concepts or something. But I’ve run a few queries side by side on Hakia and Google, and Hakia isn’t doing well. I think Hakia isn’t making sufficiently good use of page reputation. Try “web hosting forum” for an example of this, looking at the top two hits in both cases.
When I queried on “Viagra,” Hakia did — as it were — outperform Google. But that’s the only case I, uh, came up with. On less snigger-worthy searches, Google seemed to do as well as or better than Hakia.
Posted in Google, Search and text storage | Comments Off
January 31st, 2007 Curt Monash
Slashdot has a long, exclusive article on proposed US legislation to fight foreign internet censorship. The gist is that companies such as Yahoo and Google seem to be saying “Please, pass a law OBLIGATING us to resist censorship and other bad behavior.”
I think this is both admirable-if-true and, better yet, probably true. Clearly, US web search companies are vulnerable in theory to competition from less scrupulous competitors in other nations. But for now our search technology lead is strong enough that their main competition is with each other. If China (for example) can’t play one of them off against the other, there’s at least a chance it will be reluctant to throw the whole lot of them out.
Posted in Censorship, Directories and filtering, Google, Search and text storage | No Comments »
January 30th, 2007 Curt Monash
Ted Samsen of Infoworld is worried that the Chinese are attempting to ratchet up internet censorship yet further. Welcome to the club, buddy. This problem is a big one, and I don’t think it’s going to be addressed without vigorous action. In particular, I suspect that what is needed may be some major efforts in white-hat spamming. Lance Cottrell of Anonymizer has clever ideas along those lines for fighting censorship in the short term, but I think a bigger effort is needed as well.
Google, by the way, is caught in a tough spot and knows it.
Posted in Censorship, Directories and filtering, Google, Search and text storage, Spam and antispam, Web site filtering | No Comments »
January 23rd, 2007 Curt Monash
Popular on Digg, for obvious reasons, is a post showing that Google is better for searching Digg than Digg’s own search engine. No shock there. If I want to search Wikipedia for information on astrowidgets, I’ll just google on the phrase wikipedia astrowidgets. That works much better than Wikipedia’s own search.
Speaking of which — if you want to search for my writing, I’m using Google web search technology too. It works like a charm.
Posted in Google, Search and text storage, Specialized search engines | 2 Comments »
January 22nd, 2007 Curt Monash
Based on a patent application, SEOmoz has discerned 65 aspects of the Google ranking algorithm.* I counted only 24 that really had much at all to do with enterprise search. This leaves 41 or so focused on spam/SEO-fighting and/or on-page linking issues that have no enterprise parallel. And for more depth, here’s a long article from another SEO site, on a specific phrase-concurrence spam-fighting technique that has no apparent applicability to trusted corpuses.
*I highly recommend this link. It is by far the best single-page overview of web search algorithmic issues I’ve ever seen.
I’ve said it before, but it bears repeating — web search and enterprise search (or search of a constrained corpus) are very different technical problems.
Posted in Enterprise search, Google, Search and text storage | 5 Comments »
November 11th, 2006 Curt Monash
Most people in the text analytics market realize that text mining and search are somewhat related. But I don’t think they often stop to contemplate just how close the relationship is, could be, or someday probably will become. Here’s part of what I mean:
- Text mining powers search. The biggest text mining outfits in the world, possibly excepting the US intelligence community, are surely Google, Yahoo, and perhaps Microsoft.
- Search powers text mining. Restricting the corpus of documents to mine, even via a keyword search, makes tons of sense. That’s one of the good ideas in Attensity 4.
- Text mining and search are powered by the same underlying technologies. For starters, there’s all the tokenization, extraction, etc. that vendors in both areas license from Inxight and its competitors. Beyond that, I think there’s a future play in integrated taxonomy management that will rearrange the text analytics market landscape.
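The second bullet above, search as a front end to text mining, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Attensity 4 or any other product works: the function names `search` and `top_terms`, the naive regex tokenizer, and the sample documents are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive tokenizer (an assumption): lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def search(corpus: list[str], keyword: str) -> list[str]:
    """Keyword search: keep only the documents that contain the keyword as a token."""
    wanted = keyword.lower()
    return [doc for doc in corpus if wanted in tokenize(doc)]


def top_terms(docs: list[str], n: int = 5) -> list[tuple[str, int]]:
    """Toy 'mining' step: count term frequencies across the restricted corpus."""
    counts = Counter(token for doc in docs for token in tokenize(doc))
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    corpus = [
        "The battery died after one week",
        "Great screen, but the battery overheats",
        "Shipping was fast and the box was intact",
    ]
    hits = search(corpus, "battery")  # restrict the corpus first
    print(top_terms(hits))            # then mine only what matched
```

The point of the design is simply that the mining step never sees the documents the search step filtered out, which keeps the expensive analysis focused on a relevant subset.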
Read the rest of this entry »
Posted in Attensity, Business Objects and Inxight, Enterprise search, FAST, Google, IBM and UIMA, Ontologies and context identification, Open source text analytics, Search and text storage, Text mining | 3 Comments »