May 12th, 2008 Curt Monash
Powerset has done a great job of generating buzz for its version of smart search. That said, their current demo is mediocre, and that’s being polite. Powerset currently indexes little more than just Wikipedia, and the quality of its search results is roughly comparable to that of Wikipedia’s justly reviled internal search engine. To determine this, I did searches on both sites on five strings. Wikipedia typically had more total junk ranking higher, but it also put the very best hits of all higher than Powerset did. The strings were:
- Drosophila research
- Bill Clinton foreign policy
- Home run hitters
- Innocents on death row
- Text data mining
Powerset does have a nice set of UI features in terms of automatic faceted search and so on, but these days who doesn’t?
Some discussion of Powerset:
Posted in Powerset, Search and text storage, Structured search | 3 Comments »
May 8th, 2008 Curt Monash
Ironically coming right after a Google indexing problem, I am putting up my first sponsored blog post ever. It’s in connection with the forthcoming Text Analytics Summit, at which I will be speaking (in Boston) on June 16. The post itself offers a free white paper by the estimable Seth Grimes.
Read the rest of this entry »
Posted in Text Analytics Summit, Text mining | No Comments »
May 8th, 2008 Curt Monash
As previously noted, we were de-indexed by Google, due to the injection of a whole lot of spammy hidden links. We’re back now, after about two weeks, even on the blog (this one) where there was no official de-indexing notice and hence no way to apply for re-consideration. And thus we once again have high rankings for search terms such as Netezza, DATAllegro, Clarabridge, and Attivio.
We’re designing a new blog theme — the current one is just an emergency stopgap — that will (among myriad more important virtues) be more SEO-friendly. I’ll be curious to see whether that makes much actual difference from a search ranking standpoint.
Posted in Google, Search engine optimization (SEO), Spam and antispam | 1 Comment »
April 29th, 2008 Curt Monash
I’m putting up two posts this morning on Mark Logic and its MarkLogic product family. The main one, over on DBMS2, outlines the technical architecture — focusing on MarkLogic as an XML database management system — and provides a bit of overall context. This post attempts to position MarkLogic against alternative kinds of text analytics engine.
For the most part, MarkLogic is indeed sold (and bought) for the storage, manipulation, and retrieval of text. (One long-confidential exception to this rule is scheduled to be unveiled at the June user conference.) Most applications seem to fit a custom publishing/enhanced search paradigm:
- Ingest text.
- Enhance it.
- Serve it up in chunks, typically via a sophisticated search interface.
Differences vs. conventional search engines include:
- Documents are indexed on the fly, and available for query immediately upon ingestion.
- MarkLogic is a real, ACID-compliant DBMS. So everything else, such as a user tag or comment, is also available for immediate query. Mark Logic says customers are making a lot of use of this feature.
- MarkLogic has a real programming language, specifically XQuery. (Note: XQuery is a much fuller language than, say, standard SQL, with conditional logic, arithmetic, try/catch, and so on.)
- MarkLogic handles fielded information, document chunks, and whole documents in a completely integrated fashion. Truth be told, I don’t know exactly to what extent Autonomy or FAST do or don’t fall short of this standard, but it’s never seemed to be as much of a priority on their part as I’ve felt it should be.
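To make the "indexed on the fly" point concrete, here is a minimal Python sketch of an index that is updated as part of ingestion, so a document (or a user comment) is searchable the moment it is added. This is a toy illustration only, not how MarkLogic is implemented; the class and method names (`TinyIndex`, `ingest`, `search`) are hypothetical, and it ignores transactions, persistence, and re-ingestion of an existing id.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: documents are searchable as soon as they are added."""

    def __init__(self):
        self.docs = {}                    # doc_id -> original text
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def ingest(self, doc_id, text):
        # Indexing happens at ingestion time; there is no separate batch step.
        # (Re-ingesting an existing id is not handled in this sketch.)
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return the sorted ids of documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


index = TinyIndex()
index.ingest("d1", "Drosophila research methods")
index.ingest("note1", "user comment on drosophila")  # a tag or comment is just more content
print(index.search("drosophila"))
```

The design point is simply that ingestion and indexing are one step, which is what separates this from a crawl-then-reindex search engine.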
Mark Logic also claims huge advantages in corpus administration. Scalability seems good too; there’s a national-intelligence customer with a 200 terabyte database. And they’re proud of a feature called lexicons, although it seems so obvious to me that I’ve so far failed to muster what they’d probably regard as the proper level of excitement about it. (In SQL terms, it seems to be a combination of SELECT and COUNT DISTINCT, both of which are capabilities I’d think would be in XQuery anyway.)
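For readers who want the SQL analogy spelled out, here is a small Python sketch using the standard `sqlite3` module. It shows only the relational equivalent I have in mind (list the distinct values, with a count of distinct documents for each); it says nothing about how MarkLogic lexicons actually work. The table name `terms`, its columns, and the helper `lexicon` are made up for illustration.

```python
import sqlite3

# Hypothetical table: one row per (document, term) occurrence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (doc_id TEXT, term TEXT)")
conn.executemany(
    "INSERT INTO terms VALUES (?, ?)",
    [("d1", "search"), ("d1", "text"), ("d2", "search"), ("d2", "search")],
)


def lexicon(conn):
    # Each distinct term, with the number of distinct documents it appears in.
    rows = conn.execute(
        "SELECT term, COUNT(DISTINCT doc_id) FROM terms "
        "GROUP BY term ORDER BY term"
    )
    return rows.fetchall()


print(lexicon(conn))
```

With the sample rows above, "search" is counted once per document even though it occurs twice in d2, which is the COUNT DISTINCT part of the analogy.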
Please subscribe to our feed!
Posted in Application areas, Mark Logic | 3 Comments »
April 25th, 2008 Curt Monash
As per this job listing, at least one “major NYC investment bank” plans to do text mining on a proprietary trading desk.
The successful candidate will mine text data from numerous news sources and incorporate the information [into] the proprietary trading systems.
Posted in Application areas, Investment research and trading, Text mining | No Comments »
April 25th, 2008 Curt Monash
As previously noted, we got hit with some hidden text, probably by SQL injection, and that led to a Google de-listing. Of the three blogs affected by the attack, I got a de-indexing notice for one (DBMS2); another was de-listed without a notice (Text Technologies); and a third seems to have waltzed through still indexed (Software Memories). I also received a de-indexing notice for another site I have nothing to do with and indeed had never heard of before. Go figure …
We’ve now upgraded to WordPress 2.5, which should close the vulnerability. (Thank you Melissa Bradshaw!) Fearing our old, buggy theme would degrade further, we upgraded to a new one, Biru, designed by Bob. There are some teething-pain stability issues, but if they don’t cause a reversion in the next day, I’ll apply to Google for re-inclusion. (Uh, does anybody have a rough idea of how long that’s likely to take?)
All these hours of aggravation because some criminal wanted a bit of SEO advantage …
Posted in Google, Search engine optimization (SEO), Spam and antispam | 1 Comment »
April 7th, 2008 Curt Monash
The Microsoft/Yahoo negotiation is in a very public phase right now. In its latest letter, the Yahoo board makes two references to “certainty,” in one case spelling out that this encompasses “certainty of value” and “certainty of closing.”
It’s hard to imagine what the former could mean other than “Please make an all-cash offer (or, better yet, go away).” But as I previously noted, Microsoft can indeed afford to buy Yahoo entirely for cash.
The latter part is a reference to the antitrust boogeyman, obviously a non-trivial concern whenever Microsoft is involved.
Posted in Microsoft and Windows Live Search, Yahoo | 2 Comments »
March 15th, 2008 Curt Monash
Lynda Moulton introduced me to MuseGlobal, and specifically CEO Kate Noerr, last month. MuseGlobal sort of does ETL (Extract/Transform/Load) for text, although they prefer to call it Gather/Transform/Deliver. In any case, each of the three parts of the process is rather different for text than it is for traditional data warehousing. To wit:
Read the rest of this entry »
Posted in MuseGlobal | No Comments »
March 5th, 2008 Curt Monash
Google has begun to introduce a feature whereby, if your search obviously leads you to a single site (e.g., you searched on a company name), you get a second search box to search only within that site. More details at Google and Search Engine Land. Basically, this is Google Site Search made a lot easier to use.
I think this could be a really big deal. Read the rest of this entry »
Posted in Enterprise search, Google, Search and text storage, Specialized search engines | 4 Comments »