Structured search – Text Technologies

The future of search

Curt Monash — Mon, 26 Nov 2012 03:07:34 +0000

I believe there are two ways search will improve significantly in the future. First, since talking is easier than typing, speech recognition will allow longer and more accurate input strings. Second, search will be informed by much more persistent user information, with search companies having very detailed understanding of searchers. Based on that, I expect:

A small oligopoly dominating the conjoined businesses of mobile device software and search. The companies most obviously positioned for membership are Google and Apple.
The continued and growing combination of search, advertisement/recommendation, and alerting. The same user-specific data will be needed for all three.
A whole lot of privacy concerns.

My reasoning starts from several observations:

Enterprise search is greatly disappointing. My main reason for saying that is anecdotal evidence — I don’t notice users being much happier with search than they were 15 years ago. But business results are suggestive too:
- HP just disclosed serious problems with Autonomy.
- Microsoft’s acquisition of FAST was a similar debacle.
- Lesser enterprise search outfits never prospered much. (E.g., when’s the last time you heard mention of Coveo?)
- My favorable impressions of the e-commerce site search business turned out to be overdone. (E.g., Mercado’s assets were sold for a pittance soon after I wrote that, while Endeca and Inquira were absorbed into Oracle.)
- Lucene/Solr’s recent stirrings aren’t really in the area of search.
Web search, while superior to the enterprise kind, is disappointing people as well. Are Google’s results any better than they were 8 years ago? Google’s ongoing hard work notwithstanding, are they even as good?
Consumer computer usage is swinging toward mobile devices. I hope I don’t have to convince you about that one.

In principle, there are two main ways to make search better:

Understand more about the documents being searched over. But Google’s travails, combined with the rather dismal history of enterprise search, suggest we’re well into the diminishing-returns part of that project.
Understand more about what the searcher wants.

The latter, I think, is where significant future improvement will be found.

So how does a search engine understand what you want? It can listen to you directly, parsing your search string. It can ask for more clarity, through some kind of disambiguation interface. Or it can make inferences, based on — well, based on just about any kind of information that might exist about you and your online behavior.

Search strings are short, typically four words or fewer. That doesn’t leave room for a lot of innovative parsing. Not a lot of progress can be made until search strings get a lot longer, and that is unlikely except perhaps through the convenience of speech recognition.

Faceted/parameterized selection has its place. For example, when I search on Amazon.com, the site encourages me to also select a department from its dropdown menu; otherwise, it refuses to rank the search results. And when I buy shirts from Lands’ End, I just click through and never search at all. Still, Google’s been around for 15 years, and its successes in searcher-does-the-work disambiguation boil down to little more than:

A list of a few major subcategories to search (News, YouTube, etc.).
Spelling correction.
A desultory list of related/more specific searches, perhaps just longer search strings other people have recently entered.
Well-hidden “Advanced Search” features, which look much like AltaVista’s and AllTheWeb’s similar features did late in the 20th Century.

Whatever the user attitudes and behaviors are that constrain Google’s or its competitors’ success in this area, I can’t imagine them changing much — except, once again, in the event that speech recognition leads to richer human-computer conversations.

I’ve now highlighted two different ways in which there’s a search-interface challenge that will be tough to beat without turning to speech recognition. But the case for speech recognition is even stronger than that. We’re moving to small, mobile devices, and:

Traditional search interfaces work worse on mobile devices than on desktop computers. Typing is harder. So is dealing with picky forms.
Speech may work as well or better on mobile devices than at your desk. If you have upgraded your Apple device to iOS 6, you have both a microphone and Siri. The same may not be true of your desktop gear.

And so I conclude that speech recognition is a big part of the future of search.

What will that allow? Since talking is easier than typing, speech is a way to get longer text strings as search inputs, or more of them. It’s plausible that people might speak queries as complex as:

“I want to buy a recharger for an iPad 3 with delivery this week.”
“Where is 10gen’s Northern California office?” … “Which nearby restaurants have good Yelp reviews?”
“Tell me about the David Reed who went to the Kennedy School of Government around 1977, went to Dartmouth before that, and worked for the Federal Communications Commission.”

Getting search engines to the point that they can handle such queries will be difficult but straightforward. Even then, more progress will be needed. Search results for various queries will be greatly improved if the search engine “knows” things like:

The location of your home and office, and the distance you’re willing to go from them to eat or shop.
Your tastes in food, clothing, and gadgetry.
The level of sophistication at which you like to read about medicine, finance, or electronics.
Which people are or might be in your extended social network.
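To make that concrete, here is a minimal sketch of how stored user information could reshape a result list. Everything in it is an assumption for illustration: the `UserProfile` and `Result` classes, the `personalize` function, and the `taste_boost` weight are hypothetical names, not any search engine's actual design. It covers only two of the signals listed above, travel distance from home and tastes.

```python
import math
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    home: tuple                 # (latitude, longitude) in degrees
    max_km: float               # how far the user is willing to travel
    tastes: set = field(default_factory=set)

@dataclass
class Result:
    name: str
    location: tuple             # (latitude, longitude) in degrees
    tags: set
    base_score: float           # relevance from the underlying engine (assumed given)

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def personalize(results, profile, taste_boost=0.5):
    """Drop results outside the user's travel radius, then boost taste matches."""
    ranked = []
    for r in results:
        if haversine_km(profile.home, r.location) > profile.max_km:
            continue  # too far away for this particular user
        score = r.base_score + taste_boost * len(r.tags & profile.tastes)
        ranked.append((score, r.name))
    ranked.sort(key=lambda pair: -pair[0])  # highest score first
    return [name for _, name in ranked]
```

The same query would return different orderings for different users, which is the point: the improvement comes from data about the searcher, not from better parsing of the search string.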

And that will cement internet search squarely in the world of — for once I approve of the term — big data.

Data marts in the world of text

Curt Monash — Sun, 20 Sep 2009 09:08:53 +0000

CMS (Content Management System) and search expert Alan Pelz-Sharpe recently decried “Shadow IT”, by which he seems to mean departmental proliferation of data stores outside the control of the IT department. In other words, he’s talking about data marts, only for documents rather than tabular data.

Notwithstanding the manifest virtues of centralization, there are numerous reasons you might want data marts, in the tabular and document worlds alike. For example:

Price/performance. Your main/central data manager might be too expensive to support additional large specialized databases. Or different databases and applications might have sufficiently different profiles so as to get great price/performance from different kinds of data managers. This is particularly prevalent in the relational world, where column stores, sequentially-oriented row stores, and random I/O-oriented row stores each have compelling use cases.
Different SLAs (Service-Level Agreements). Similarly, different applications may have very different requirements for uptime, response time, and the like. (In the relational world, think of operational data stores.)
Different security requirements. Different subsets of the data may need different levels of security. This is particularly prevalent in the document world, where security problems are not as well-solved as in the tabular arena, and where it’s common for a search engine to index across different corpuses with radically different levels of sensitivity.
Integrated application and user interfaces. In the relational world, there’s a pretty clean separation between data management and interface logic; most serious business intelligence tools can talk to most DBMS. The document world is quite different. Some search engines bundle, for example, various kinds of faceted or parameterized search interfaces. What’s more, in public-facing search, a major differentiator is the facilities that the product offers for skewing search results.
Different text applications require different thesauruses or taxonomy management systems. Ideally, those should all be integrated — but the requisite technology still doesn’t exist.

Bottom line: Text data marts, much like relational data marts, are almost surely here to stay.

Related link

The future of data marts

Where “semantic” technology is or isn’t important

Curt Monash — Tue, 30 Dec 2008 00:59:55 +0000

At Lynda Moulton’s behest, I spoke a couple of times recently on the subject of where “semantic” technology is or isn’t likely to be important. One was at the Gilbane conference in early December. The slides were based on my previously posted deck for a June talk I gave on a text analytics market overview. The actual Gilbane slides may be found here.

My opinions about the applicability of semantic technology include:

The big bucks in web search are for “transactional” web search, and semantics isn’t the issue there. (Slides 3-4)
When UIs finally go beyond the simple search box — e.g. to clusters/facets or to voice — semantics should have a role to play. (Slide 5)
Public-facing site search depends — more than any other area of text analytics — on hand-tagging. (Slide 7)
“Enterprise” search that searches specialized external databases could benefit from semantic technologies. (Slide 8)
True enterprise search could benefit from semantic technologies in multiple ways, but has other problems as well. (Slides 10-11)
Semantics — specifically extraction — is central to custom publishing. (Slide 12 — upon review I regret using the word “sophisticated”)
Semantics is central to text mining. (Slide 18)
Semantics could play a big role in all sorts of exciting future developments. (Slide 19)

So what would your list be like?

Worst search UI ever

Curt Monash — Mon, 06 Oct 2008 01:48:34 +0000

On the whole, the Barack Obama campaign has been very internet-savvy. Maybe their web site JohnMcCainRecord.com is yet another example of same. But to my eyes, it has such an appallingly bad search interface that people going to the site are apt to be annoyed. To wit:

There’s a huge search box in the center of the screen.
All the search box ever does is take you to one of the 13 categories listed right below it.
Usually, it doesn’t even do that. Instead, it just fails. For example, I entered terrorism and hit “Go”, and got no response. Ditto nuclear energy.
When it does give you an answer, it’s apt not to be what you were looking for. For example, entering Iran takes you to the Foreign Policy page, which contains nothing about Iran.

And then, of course, there’s the funny stuff. For example, if you search on foo, you are taken to Rural Issues.

In general terms, I like the idea of the site. But absent some serious changes, JohnMcCainRecord.com should not have a search interface.

Edit: More here in my post on The Obama campaign’s Search Engine to Nowhere

Attivio update

Curt Monash — Sat, 20 Sep 2008 05:00:06 +0000

I talked w/ Andrew McKay of Attivio for 2 ½ hours Thursday. I’ve also been working with some Attivio engineers on a blog search engine. I think it’s time to post about Attivio.

In its full conception, the Attivio Intelligence Engine is something like Endeca + RDBMS + search engine + XML store + cool extra features. And all with seamless, lightweight, integrated installation and administration. That’s the goal, anyway. At this point, naturally, each individual piece is far from complete. For example:

Sufficient SQL support to handle most BI tools is still a matter for future releases — apparently in 2009, although Attivio is one of those agile companies for which pinning down product releases is somewhat difficult.
The same goes for some basic GUI features (such as most non-programmatic search tuning).
ACID compliance is not a high priority for Attivio. I actually think it should be higher, just because it’s increasingly become an “OK, we don’t have to worry about THAT” checkmark item.

Even in its early days, Attivio has had some nice-sounding customer successes. There are 8 paying Attivio customers, including 2 deals of more than $1 million each, one half-millionish dollar deal, and 1 large OEM. 3 represent actual deployments, with the rest in development. More sales are on the way, as are permissions to disclose customer names that people will actually recognize. Customer application stories Andrew told me about include:

A web-business parameterized, adjustable-weight search that’s starting with tabular data and only getting to free-text later.
An enterprise that’s using Attivio for content management, enterprise search, public-facing search, and data warehousing.
Something big/mysterious/classified, with large document volumes.
Something to do with compliance, about which Andrew was going to forward a lot more detail that evening (Hint, hint).
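For readers wondering what “parameterized, adjustable-weight search” over tabular data might mean in practice, here is a minimal sketch. It is not Attivio's API or implementation, about which I have no details; the `weighted_search` function, its arguments, and the row layout (including the `id` column) are all assumptions made up for illustration. Parameters act as exact-match filters on columns, and weights control how much a term match in each column contributes to the score.

```python
def weighted_search(rows, terms, weights, filters=None):
    """Rank tabular rows by weighted per-column term matches.

    rows    -- list of dicts, each with an "id" key (hypothetical layout)
    terms   -- search terms, matched case-insensitively as substrings
    weights -- column name -> weight added for each term found in that column
    filters -- column name -> required exact value (the "parameters")
    """
    filters = filters or {}
    terms = [t.lower() for t in terms]
    scored = []
    for row in rows:
        # Parameterized part: skip rows that fail any exact-match filter.
        if any(row.get(col) != val for col, val in filters.items()):
            continue
        # Adjustable-weight part: sum the weight of every (column, term) hit.
        score = sum(
            w
            for col, w in weights.items()
            for t in terms
            if t in str(row.get(col, "")).lower()
        )
        if score > 0:
            scored.append((score, row["id"]))
    scored.sort(key=lambda pair: -pair[0])  # highest score first
    return [row_id for _, row_id in scored]
```

Changing the `weights` dictionary reorders the results without touching the data, which is roughly what non-programmatic search tuning would expose through a GUI.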

Since the major RDBMS (Oracle, Microsoft SQL Server, DB2) all have text search and XML subsystems, they can in principle do everything Attivio does on the back end, and with a lot more features and maturity. The same would go for MarkLogic. Performance and overhead might be different matters, however; Andrew certainly believes so.

Except that Lucene is included on the search side, I haven’t actually figured out how Attivio stores data. The fact that SQL features are being added incrementally suggests Attivio is rolling its own relational database capability, but how it’s organized I don’t really know.

The Text Analytics Marketplace: Competitive landscape and trends

Curt Monash — Thu, 19 Jun 2008 07:35:39 +0000

As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are off-shoots of what used to be called “information retrieval”:

1. Web search

2. Public-facing site search

3. Enterprise search and knowledge management

4. Custom publishing

5. Text mining and extraction

Three are more standalone:

6. Spam filtering

7. Voice recognition

8. Machine translation

This list comes from a talk I gave Monday at the Text Analytics Summit called The Text Analytics Marketplace: Competitive landscape and trends. In half an hour, I covered the first five areas (in Sue Feldman’s word, at a “gallop”). The slide deck has been uploaded to the link below. I plan to break out the material from the talk into a series of blog posts over the next few (or perhaps not-so-few) weeks.

Slides:

The Text Analytics Marketplace: Competitive landscape and trends

Other posts based on those slides:

Three specialized markets for text analytics (based on Slide 2)
6 trends that could shake up the text analytics market (based on Slide 19)
Why search technologies are going to recombine (in A World of Bytes, based on Slide 19)

How text search has evolved over the past 15 years

Curt Monash — Sun, 15 Jun 2008 07:26:50 +0000

I just stumbled across a brilliant summary of evolution in text search technology, written four years ago. It’s equally valid today (which in itself says something). I found it on the Prism Legal blog, but the actual author is Sharon Flank. My own comments are interspersed in bold.

“There are several underlying important developments over the last decade or so:

Incorporating user feedback to refine search results, usually indirectly rather than explicitly, making results better through machine learning. [Amazon.com is the most often cited example of this with its “if you like A, you’ll also like B.”] [CAM] Technically, that’s not a search example, but the general point is correct even so.

Assessments based on usage or referral. This is what makes Google so useful and popular. This approach gives higher rankings if other web sites point to a target or if that target gets a lot of hits.

Various approaches to using taxonomies. The better applications use taxonomies as a navigation guide but don’t force it or require administrators to implement it. Vivisimo.com is an example of an interesting, automated clustering approach. [CAM] “Faceted search” seems to be the buzzword here. It’s a big part of what I call “structured search.” But taxonomy use is probably more trivial in search than it is in, say, text mining.

Better handling of phrases. Google automatically parses phrases and deals with search terms as phrases. This now seems natural but in the AltaVista days, you couldn’t tell a Venetian blind from a blind Venetian [example courtesy of Prof. George Miller, Princeton Univ. – too good not to cite].

Context-sensitive search is now an emerging trend. Systems track what users have previously searched for and infer interest in the same domain to refine search results. So if you look for “line” and a system knows you’ve just looked for “tacklebox,” then it infers you mean “fishing line.” Or if you search for bagels and the system knows you are in 20009, it tells you that you can buy them at Comet Liquors (which happens to sell bagels). [CAM] That happens a lot with ad serving. But I’m not convinced it hit actual search until Google’s personal search kicked off, and that was quite recent.
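[CAM] The “line”/“tacklebox” example can be sketched in a few lines of code. This is a toy under stated assumptions, not how any real engine does it: the `SENSES` and `DOMAIN_OF` tables are hand-built and hypothetical, where a production system would presumably learn such associations from query logs, and `SessionContext` is a made-up name.

```python
from collections import deque

# Hypothetical, hand-built table: ambiguous term -> {domain: expanded query}
SENSES = {
    "line": {"fishing": "fishing line", "telecom": "phone line"},
}

# Hypothetical mapping from previously seen queries to a domain of interest
DOMAIN_OF = {"tacklebox": "fishing", "lure": "fishing", "modem": "telecom"}

class SessionContext:
    """Remembers a user's recent queries and uses them to disambiguate new ones."""

    def __init__(self, history_size=5):
        self.recent = deque(maxlen=history_size)  # oldest queries fall off

    def search(self, query):
        """Return the (possibly rewritten) query, then record the original."""
        rewritten = self.rewrite(query)
        self.recent.append(query)
        return rewritten

    def rewrite(self, query):
        senses = SENSES.get(query.lower())
        if not senses:
            return query  # not a known ambiguous term; leave it alone
        # Prefer the domain of the most recent query that suggests one.
        for past in reversed(self.recent):
            domain = DOMAIN_OF.get(past.lower())
            if domain in senses:
                return senses[domain]
        return query  # no usable context, so fall back to the raw query
```

Used in a session, `search("tacklebox")` followed by `search("line")` yields “fishing line”, while a fresh session leaves “line” as typed.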

“More generally in natural language processing, the statistical and linguistic approaches are converging in a new way: use massive amounts of data (i.e. the Web) to get statistical answers to deep linguistic questions, like “How do we figure out what the most likely referent is for the pronoun ‘they’?” Or “How do we determine the correct sense for ambiguous words?” These things aren’t in search engines yet, but you can expect to see more “intelligent” features coming out of this approach.

“Looking at this list, you can see that the conceptual changes (breakthroughs?), with the exception of better phrase handling, are primarily focused around Web searches. When dealing with one-of-a-kind document collections behind the corporate firewall, many of these developments turn out not to add much to older approaches. So, at least for enterprise search, I too remain partial to some of the older products you mention, though I am disappointed that most of the old-time vendors have not updated their approaches beyond adding taxonomy support.” [CAM] Yep, web search and enterprise search are very different things.

The original blog post did have one error — Sharon’s PhD isn’t in Computational Linguistics, but rather Slavic Linguistics, as I recently noted in my post about text analytics careers for humanities majors.

Powerset is mildly interesting

Curt Monash — Mon, 12 May 2008 14:17:22 +0000

Powerset has done a great job of generating buzz for its version of smart search. That said, its current demo is mediocre, and that’s being polite. Powerset currently indexes little more than just Wikipedia, and the quality of its search results is roughly comparable to that of Wikipedia’s justly reviled internal search engine. To determine this, I did searches on both sites on five strings. Wikipedia typically had more total junk ranking higher, but it also put the very best hits of all higher than Powerset did. The strings were:

Drosophila research
Bill Clinton foreign policy
Home run hitters
Innocents on death row
Text data mining

Powerset does have a nice set of UI features in terms of automatic faceted search and so on, but these days who doesn’t?

Some discussion of Powerset:

Michael Arrington seems impressed with Powerset
Dan Farber thinks Microsoft may be impressed
Vanessa Fox definitely isn’t
VentureBeat is taking a wait and see attitude
So is Om Malik, who notes that Powerset performance is a bear

Implications of Microsoft’s bid for Yahoo

Curt Monash — Fri, 01 Feb 2008 13:32:22 +0000

As I write this, Microsoft has just announced an offer to acquire Yahoo. Early responses from the likes of Danny Sullivan, Henry Blodget, the Download Squad, TechCrunch, Raven SEO, Mashable, and others seem to boil down to:

Wow.
Both sides needed it.
Yahoo wasn’t going anywhere fast on its own.
Microsoft wasn’t going anywhere fast in search on its own.
This may be enough critical mass to matter.
Conference call at 8:30 am

I’ll try to be a bit more analytical than that, but this is still going to be quick. Assuming the deal goes through:

Microsoft will recombine both parts of the old FAST/alltheweb.com. Therefore, Microsoft will be able to use the same technology for web and enterprise search, to the extent that such commonality makes sense.
I’d expect Microsoft to try to differentiate its technology via faceted/structured search. That’s a FAST strength.
The old FAST search-as-BI dream might become pretty appealing to Microsoft/Yahoo.
In a non-search point, Microsoft is strong in games and Yahoo is strong in fantasy sports. Look for some synergies.
There sure would be a whole lot of non-Windows technology inside Microsoft.

Basically, Microsoft is a company that’s a lot more sophisticated in its thinking about user interfaces and experiences than Yahoo is. That’s where the really interesting competitive innovation would be most likely to occur.

More on Microsoft in enterprise search

Curt Monash — Tue, 08 Jan 2008 19:24:50 +0000

Following up on my prior posts about Microsoft’s impending acquisition of FAST, they’ve now had the conference call. By custom and indeed antitrust law, such calls are very light on content. But here are a few tidbits and takeaways, all from Jeff Raikes of Microsoft:

Jeff talked solely about FAST as adding to enterprise search, and rightly contrasted that with web search.
However, he deflected questions about web search with “We aren’t talking about that much detail right now” rather than with a firm “Well, we aren’t allowed to use FAST that way.”
Specifically, enterprise search is all about integration with SharePoint (portal).
Jeff said Microsoft’s current search could handle millions or maybe tens of millions of documents, but thought there was demand for FAST’s ability to handle billions.
He positioned FAST as an application development platform, giving an example of structured search (the actual word was “pivot”) in consumer electronics. … Well, at least he’s looking in the right direction.
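
As a rough illustration of what structured (“pivot”) search in consumer electronics could look like, here is a minimal sketch. It has nothing to do with FAST's actual product or API; the product records, field names, and the `filter_products` and `facet_counts` functions are all hypothetical. The idea is that each selection narrows the result set, and the engine reports how many remaining items fall under each value of the other facets, so the user can pivot again.

```python
from collections import Counter

# Made-up catalog for illustration only
PRODUCTS = [
    {"name": "Camera A", "category": "cameras", "brand": "Acme", "price_band": "under $200"},
    {"name": "Camera B", "category": "cameras", "brand": "Bolt", "price_band": "$200 and up"},
    {"name": "TV C", "category": "televisions", "brand": "Acme", "price_band": "$200 and up"},
    {"name": "TV D", "category": "televisions", "brand": "Bolt", "price_band": "$200 and up"},
]

def filter_products(products, **selected):
    """Keep only products matching every selected facet value."""
    return [p for p in products
            if all(p.get(facet) == value for facet, value in selected.items())]

def facet_counts(products, facets):
    """For each facet, count how many products carry each value."""
    return {f: dict(Counter(p[f] for p in products)) for f in facets}

# Pivot on category, then see what the remaining facets offer.
cameras = filter_products(PRODUCTS, category="cameras")
print(facet_counts(cameras, ["brand", "price_band"]))
```

The counts are what make this an interface rather than just a filter: they tell the user which further selections are worth making before any query is typed.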