Enterprise search – Text Technologies

The future of search

Curt Monash — Mon, 26 Nov 2012 03:07:34 +0000

I believe there are two ways search will improve significantly in the future. First, since talking is easier than typing, speech recognition will allow longer and more accurate input strings. Second, search will be informed by much more persistent user information, with search companies having very detailed understanding of searchers. Based on that, I expect:

A small oligopoly dominating the conjoined businesses of mobile device software and search. The companies most obviously positioned for membership are Google and Apple.
The continued and growing combination of search, advertisement/recommendation, and alerting. The same user-specific data will be needed for all three.
A whole lot of privacy concerns.

My reasoning starts from several observations:

Enterprise search is greatly disappointing. My main reason for saying that is anecdotal evidence — I don’t notice users being much happier with search than they were 15 years ago. But business results are suggestive too:
- HP just disclosed serious problems with Autonomy.
- Microsoft’s acquisition of FAST was a similar debacle.
- Lesser enterprise search outfits never prospered much. (E.g., when’s the last time you heard mention of Coveo?)
- My favorable impressions of the e-commerce site search business turned out to be overdone. (E.g., Mercado’s assets were sold for a pittance soon after I wrote that, while Endeca and Inquira were absorbed into Oracle.)
- Lucene/Solr’s recent stirrings aren’t really in the area of search.
Web search, while superior to the enterprise kind, is disappointing people as well. Are Google’s results any better than they were 8 years ago? Google’s ongoing hard work notwithstanding, are they even as good?
Consumer computer usage is swinging toward mobile devices. I hope I don’t have to convince you about that one.

In principle, there are two main ways to make search better:

Understand more about the documents being searched over. But Google’s travails, combined with the rather dismal history of enterprise search, suggest we’re well into the diminishing-returns part of that project.
Understand more about what the searcher wants.

The latter, I think, is where significant future improvement will be found.

So how does a search engine understand what you want? It can listen to you directly, parsing your search string. It can ask for more clarity, through some kind of disambiguation interface. Or it can make inferences, based on — well, based on just about any kind of information that might exist about you and your online behavior.

Search strings are short, typically four words or less. That doesn’t leave room for a lot of innovative parsing. Not a lot of progress can be made until search strings get a lot longer, and that is unlikely except perhaps through the convenience of speech recognition.

Faceted/parameterized selection has its place. For example, when I search on Amazon.com, the site encourages me to also select a department from its dropdown menu; otherwise, it refuses to rank the search results. And when I buy shirts from Lands’ End, I just click through and never search at all. Still, Google’s been around for 15 years, and about all its successes in searcher-does-the-work disambiguation boil down to:

A list of a few major subcategories to search (News, YouTube, etc.).
Spelling correction.
A desultory list of related/more specific searches, perhaps just longer search strings other people have recently entered.
Well-hidden “Advanced Search” features, which look much like AltaVista’s and AllTheWeb’s similar features did late in the 20th Century.

Whatever the user attitudes and behaviors are that constrain Google’s or its competitors’ success in this area, I can’t imagine them changing much — except, once again, in the event that speech recognition leads to richer human-computer conversations.

I’ve now highlighted two different ways in which there’s a search-interface challenge that will be tough to beat without turning to speech recognition. But the case for speech recognition is even stronger than that. We’re moving to small, mobile devices, and:

Traditional search interfaces work worse on mobile devices than on desktop computers. Typing is harder. So is dealing with picky forms.
Speech may work as well or better on mobile devices than at your desk. If you have upgraded your Apple device to iOS 6, you have both a microphone and Siri. The same may not be true of your desktop gear.

And so I conclude that speech recognition is a big part of the future of search.

What will that allow? Since talking is easier than typing, speech is a way to get longer text strings as search inputs, or more of them. It’s plausible that people might speak queries as complex as:

“I want to buy a recharger for an iPad 3 with delivery this week.”
“Where is 10gen’s Northern California office?” … “Which nearby restaurants have good Yelp reviews?”
“Tell me about the David Reed who went to the Kennedy School of Government around 1977, went to Dartmouth before that, and worked for the Federal Communications Commission.”

Getting search engines to the point that they can handle such queries will be difficult but straightforward — but even more progress is needed. Search results for various queries will be greatly improved if the search engine “knows” things like:

The location of your home and office, and the distance you’re willing to go from them to eat or shop.
Your tastes in food, clothing, and gadgetry.
The level of sophistication at which you like to read about medicine, finance, or electronics.
Which people are or might be in your extended social network.

And that will cement internet search squarely in the world of — for once I approve of the term — big data.
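As a rough illustration of what it might mean for a search engine to “know” such things, here is a minimal Python sketch that re-ranks results against a stored user profile. Everything in it is hypothetical: the `Profile` and `Result` classes, the `rerank` function, and the 0.5 taste weight are invented for illustration and do not describe how any real search engine works.

```python
from dataclasses import dataclass, field
from math import asin, cos, radians, sin, sqrt


@dataclass
class Profile:
    # Hypothetical persistent user information.
    home: tuple[float, float]          # (latitude, longitude)
    max_km: float                      # how far the user is willing to travel
    tastes: set[str] = field(default_factory=set)


@dataclass
class Result:
    name: str
    location: tuple[float, float]
    tags: set[str]
    base_score: float                  # relevance from the query text alone


def km_between(a, b):
    """Great-circle distance in kilometers (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def rerank(results, profile):
    """Drop results that are too far away, then boost those matching tastes."""
    scored = []
    for r in results:
        if km_between(profile.home, r.location) > profile.max_km:
            continue
        boost = 0.5 * len(r.tags & profile.tastes)  # arbitrary illustrative weight
        scored.append((r.base_score + boost, r))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored]
```

The point of the sketch is only that the query text supplies `base_score`, while everything else comes from data the engine already holds about the searcher.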

Data marts in the world of text

Curt Monash — Sun, 20 Sep 2009 09:08:53 +0000

CMS (Content Management System) and search expert Alan Pelz-Sharpe recently decried “Shadow IT”, by which he seems to mean departmental proliferation of data stores outside the control of the IT department. In other words, he’s talking about data marts, only for documents rather than tabular data.

Notwithstanding the manifest virtues of centralization, there are numerous reasons you might want data marts, in the tabular and document worlds alike. For example:

Price/performance. Your main/central data manager might be too expensive to support additional large specialized databases. Or different databases and applications might have sufficiently different profiles so as to get great price/performance from different kinds of data managers. This is particularly prevalent in the relational world, where column stores, sequentially-oriented row stores, and random I/O-oriented row stores each have compelling use cases.
Different SLAs (Service-Level Agreements). Similarly, different applications may have very different requirements for uptime, response time, and the like. (In the relational world, think of operational data stores.)
Different security requirements. Different subsets of the data may need different levels of security. This is particularly prevalent in the document world, where security problems are not as well-solved as in the tabular arena, and where it’s common for a search engine to index across different corpuses with radically different levels of sensitivity.
Integrated application and user interfaces. In the relational world, there’s a pretty clean separation between data management and interface logic; most serious business intelligence tools can talk to most DBMS. The document world is quite different. Some search engines bundle, for example, various kinds of faceted or parameterized search interfaces. What’s more, in public-facing search, a major differentiator is the facilities that the product offers for skewing search results.
Different text applications require different thesauruses or taxonomy management systems. Ideally, those should all be integrated — but the requisite technology still doesn’t exist.
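To make the security point above concrete, here is a minimal Python sketch of a single index that spans corpuses of differing sensitivity and trims results per user. The `Doc` class, the integer sensitivity scale, and the `search` function are all assumptions made up for illustration; real products handle document-level security in their own ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    title: str
    corpus: str
    sensitivity: int   # hypothetical scale: 0 = public, higher = more restricted


def search(index, query, clearance):
    """Naive substring match, trimmed to documents the user is cleared to see."""
    q = query.lower()
    return [
        d for d in index
        if q in d.title.lower() and d.sensitivity <= clearance
    ]
```

If the trimming step is missing or wrong, a single shared index leaks titles across corpuses, which is one reason a department might prefer its own separate text data mart.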

Bottom line: Text data marts, much like relational data marts, are almost surely here to stay.

Related link

The future of data marts

Where “semantic” technology is or isn’t important

Curt Monash — Tue, 30 Dec 2008 00:59:55 +0000

At Lynda Moulton’s behest, I spoke a couple of times recently on the subject of where “semantic” technology is or isn’t likely to be important. One was at the Gilbane conference in early December. The slides were based on my previously posted deck for a June talk I gave on a text analytics market overview. The actual Gilbane slides may be found here.

My opinions about the applicability of semantic technology include:

The big bucks in web search are for “transactional” web search, and semantics isn’t the issue there. (Slides 3-4)
When UIs finally go beyond the simple search box — e.g. to clusters/facets or to voice — semantics should have a role to play. (Slide 5)
Public-facing site search depends — more than any other area of text analytics — on hand-tagging. (Slide 7)
“Enterprise” search that searches specialized external databases could benefit from semantic technologies. (Slide 8)
True enterprise search could benefit from semantic technologies in multiple ways, but has other problems as well. (Slides 10-11)
Semantics — specifically extraction — is central to custom publishing. (Slide 12 — upon review I regret using the word “sophisticated”)
Semantics is central to text mining. (Slide 18)
Semantics could play a big role in all sorts of exciting future developments. (Slide 19)

So what would your list be like?

Lynda Moulton prefers enterprise search products that get up and running quickly

Curt Monash — Sun, 12 Oct 2008 02:46:07 +0000

Lynda Moulton, to put it mildly, disagrees with the Gartner Magic Quadrant analysis of enterprise search. Her preferred approach is captured in:

Coveo, Exalead, ISYS, Recommind, Vivisimo, and X1 are a few of a select group that are making a mark in their respective niches, as products ready for action with a short implementation cycle (weeks or months not years).

By way of contrast, Lynda opines:

Autonomy and Endeca continue to bring value to very large projects in large companies but are not plug-and-play solutions, by any means. Oracle, IBM, and Microsoft offer search solutions of a very different type with a heavy vendor or third-party service requirement. Google Search Appliance has a much larger installed base than any of these but needs serious tuning and customization to make it suitable to enterprise needs.

In particular, her views about FAST (now Microsoft) are scathing.

Attivio update

Curt Monash — Sat, 20 Sep 2008 05:00:06 +0000

I talked with Andrew McKay of Attivio for 2½ hours Thursday. I’ve also been working with some Attivio engineers on a blog search engine. I think it’s time to post about Attivio.

In its full conception, the Attivio Intelligence Engine is something like Endeca + RDBMS + search engine + XML store + cool extra features. And all with seamless, lightweight, integrated installation and administration. That’s the goal, anyway. At this point, naturally, each individual piece is far from complete. For example:

Sufficient SQL support to handle most BI tools is still a matter for future releases — apparently in 2009, although Attivio is one of those agile companies for which pinning down product releases is somewhat difficult.
The same goes for some basic GUI features (such as most non-programmatic search tuning).
ACID compliance is not a high priority for Attivio. I actually think it should be higher, just because it’s increasingly become an “OK, we don’t have to worry about THAT” checkmark item.

Even in its early days, Attivio has had some nice-sounding customer successes. There are 8 paying Attivio customers, including 2 deals over $1 million, one half-millionish dollar deal, and 1 large OEM. Three represent actual deployments, with the rest in development. More sales are on the way, as are permissions to disclose customer names that people will actually recognize. Customer application stories Andrew told me about include:

A web-business parameterized, adjustable-weight search that’s starting with tabular data and only getting to free-text later.
An enterprise that’s using Attivio for content management, enterprise search, public-facing search, and data warehousing.
Something big/mysterious/classified, with large document volumes.
Something to do with compliance, about which Andrew was going to forward a lot more detail that evening (Hint, hint).

Since the major RDBMS (Oracle, Microsoft SQL Server, DB2) all have text search and XML subsystems, they can in principle do everything Attivio does on the back end, and with a lot more features and maturity. The same would go for MarkLogic. Performance and overhead might be different matters, however; Andrew certainly believes so.

Except that Lucene is included on the search side, I haven’t actually figured out how Attivio stores data. The fact that SQL features are being added incrementally suggests Attivio is rolling its own relational database capability, but how it’s organized I don’t really know.

One overview of e-discovery

Curt Monash — Sat, 13 Sep 2008 09:17:21 +0000

I just found a year-old (almost) blog post from EMC executive Andrew Cohen that succinctly lays out his view (which he believes to mainly be a consensus stance) on e-discovery. Cohen is evidently both a lawyer and a honcho in document management system vendor EMC’s Compliance Division, which is probably relevant to interpreting his outlook, in the spirit of the old Kennedy School dictum that “Where you stand depends upon where you sit.”

Highlights included:

Information management is central to e-discovery.
In particular, auditability (my word) is central, if you want electronic documents to hold up as evidence in court.
Search is good enough, but it’s not the biggest issue in e-discovery.
E-mail archiving has reached the tipping point, and is increasingly a must-have, largely for its e-discovery benefits.

How good does e-discovery search need to be?

Curt Monash — Mon, 01 Sep 2008 04:44:58 +0000

Two years ago, CEO Mike Lynch of Autonomy tried to persuade me that Autonomy was and would remain dominant in the e-discovery search market because:

The essence of the buying decision was that enterprises wanted to fulfill obligations to make their information available in a way that would satisfy the courts.
Autonomy had some high-profile traction (e.g., the Enron case) that made it the default decision, and hence in particular a choice that met the requirement.

Recently, I ran that theory by David Ferris, whose firm Ferris Research has long been a/the leading small analyst firm covering e-mail and related technologies. He wasn’t buying. David believes courts are getting more sophisticated in their understanding of search technology. Even more to the point, David cited several other buying motivations that would lead enterprises to want best-available rather than just-good-enough e-discovery search technology, such as:

Enterprises want to know what information is available to be discovered against them.
Enterprises want to discover the information that will best aid their legal defense.
If they’re archiving the material for one purpose (e-discovery) anyway, enterprises want to get the most possible value out of it for other purposes while they’re at it.

The Attivio angle on the FAST story

Curt Monash — Tue, 08 Jul 2008 19:16:50 +0000

Attivio CEO Ali Riaz was previously CFO and COO of FAST. He tried to avoid involvement in the recent exposé of his former employer. For his troubles he got a parking lot ambush, a big photograph, and some unflattering coverage. Adriaan Bloem and Stephen Arnold have been hotly debating Ali’s culpability.

There are two general issues here, based on the fact that Ali and a couple of other key Attivio executives come from FAST. First, they were at a corrupt company — but resigned before the worst (and perhaps all) of the corruption happened. Second, they were at a company that did very well in some respects, but very badly in others, so it’s a mixed-quality resume item.

So far, no biggie. Lots of executives exude overoptimism about their companies’ products and business prospects. And I haven’t identified anything which suggests to me as a former stock analyst that the controls Ali put in place as CFO/COO were inadequate. (If he’d been long-time CEO, it would have been a different matter, as he would have been more responsible for the general ethical culture of the company, but he wasn’t.)

So the main serious charge is that FAST funneled a lot of sales through small reseller companies owned by its executives, including Ali. Such arrangements could be used either for misappropriation of funds, or to inflate revenue. In the article, Ali denies involvement in any reseller until after he left FAST’s employment, but the reporter purports to have discovered proof to the contrary. I couldn’t quite get Ali to reiterate his denial to me — or, indeed, to talk with me directly about the matter — but did get an emailed statement which reads:

Mr. Riaz categorically denies any wrongdoing during his tenure at FAST or in any relationship with FAST thereafter. He has not been an employee of FAST for almost two years now, and therefore must defer all further comments to Microsoft’s official 2006 and 2007 statements on the matter.

I’ve advised my clients at Attivio that they should be clearer and more specific, but so far I’m not carrying the day. So for now, we’ll go with that.

Recent reporting on the shenanigans at FAST

Curt Monash — Tue, 08 Jul 2008 19:16:18 +0000

A Norwegian newspaper did an exposé on FAST, dated June 28. Helpful search industry participants quickly distributed English translations to a variety of commentators, including me. TechCrunch posted a scan of part of the article.

The gist is that FAST followed a pattern very common in the packaged enterprise software industry:

It had trouble meeting its growth targets.
It inflated reported revenue (in the high-margin software industry, inflating license revenue has a huge impact on profits).
One technique whereby it inflated revenue was to count deals that actually closed after quarter end.
Another technique was to count deals as closed in which the customer hadn’t actually fully committed to buy.

There’s nothing new here. Back in the 1980s, we used to joke that MSA made 10% of its annual revenue and 100% of its profits between the 32nd and 40th of December.

Often, such problems are associated with difficulties getting product installations to succeed. Stephen Arnold suggests that’s exactly what happened in the case of FAST:

So, Fast Search’s problems began as soon as the company decided to push into the enterprise search market. The adjustments were, as noted in the documents I cited in my previous Fast Search analyses and in the TechCrunch article, small at the outset. Who knew that a customer would not pay his license fee installment? Then more customers groused about slow installs and the up front payments were not followed by any other payments. One Fast Search licensee told me that his Global 1000 company would not pay until Fast Search produced an engineer who could complete the installation per the task order. Well, Fast Search got an engineer to the client, but it was six months after I heard the complaint. Not surprisingly, this big outfit turned to a smaller vendor who got a different system up and running in three weeks.

Related links

The Attivio angle on this story
Edit: The actual article

6 trends that could shake up the text analytics market

Curt Monash — Thu, 19 Jun 2008 08:33:31 +0000

My last two posts were based on the introductory slide to my talk The Text Analytics Marketplace: Competitive landscape and trends. I’ll now jump straight ahead to the talk’s conclusion.

Text analytics vendors participate in the same trends as other software and technology vendors. For example, relational business intelligence and data warehousing products are increasingly being sold to departmental buyers. Those buyers place particularly high value on ease of installation. And golly gee whiz, both parts of that are also true in text mining.

But beyond such general trends, I’ve identified six developments that I think could radically transform the text analytics market landscape. Indeed, they could invalidate the neat little eight-bucket categorization I laid out in the prior post. Each is highly likely to occur, although in some cases the timing remains greatly in doubt.

These six market-transforming trends are:

Web/enterprise/messaging integration
BI integration
Universal message retention
Portable personal profiles
Electronic health records
Voice command & control

I’ll explain briefly.

1. Google and Microsoft are two of the three leaders in web search. Now that Microsoft has bought FAST, they are also two of the leaders in enterprise search. They are also two of the leaders in hosted email. Ditto instant messaging. So there’s a good chance these various disciplines will converge.

2. There are a number of ways text analytics and traditional analytics can be and are being integrated:

Enterprise search and business intelligence are akin; both involve digging information out of the data you already have.
Text mining is naturally integrated with business intelligence and/or data mining.
There’s a trend toward using text search to dig up business intelligence documents such as specific reports, spreadsheets, etc.

To date the last of these is focused on reports that already exist, rather than queries that could be run on the fly, but I hope and trust the technology will be extended over time. Natural language queries have merit anyway; I’d like to see the search box be extended in functionality to a true data-retrieval command line.

3. One of the big purchase drivers of storage, search, and clustering technology is mandates to preserve information and make it available to auditors, regulators, and/or people who want to sue you. Email in particular is changing from being ephemeral to becoming part of the permanent record. Well, if the information is being retained anyway, then maybe it’s time to see how to get useful insight from it.

Right now, a company’s overall text archives aren’t being leveraged in the same way data warehouses are. That will change.

4. For over a decade, online companies have fought to exploit the fact that users were registered with their sites or services, but not with others. Huge amounts of investment money were wasted in the dot-com bubble because people thought “registered users” was a significant metric, or that ISP subscribers could be directed to proprietary content. Enormous valuations are being assigned to Facebook and LinkedIn on similar theories today.

But as site owners and other marketers get ever more aggressive about exploiting user-specific information, users will get ever more sophisticated about controlling it. The obvious solution is for each internet user to control a sophisticated database of their contact information, presence information, actions, preferences, and writings, and to be very selective about which online services are allowed to see which portions of the data. I think that will come about some day, but I don’t know when. When it does, text analytics will be affected in a variety of interesting ways.
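For what a user-controlled profile with selective disclosure might look like as a data structure, here is a minimal Python sketch. The `PersonalProfile` class, its method names, and the example service names are hypothetical, invented purely for illustration; no such standard exists that I know of.

```python
from dataclasses import dataclass, field


@dataclass
class PersonalProfile:
    # Hypothetical user-controlled store, keyed by section name
    # (e.g. "contact", "presence", "actions", "preferences", "writings").
    sections: dict[str, object] = field(default_factory=dict)
    # Which sections each online service is allowed to see.
    grants: dict[str, set[str]] = field(default_factory=dict)

    def allow(self, service, *section_names):
        """Grant a service access to the named sections."""
        self.grants.setdefault(service, set()).update(section_names)

    def revoke(self, service, section_name):
        """Withdraw a previously granted section; a no-op if never granted."""
        self.grants.get(service, set()).discard(section_name)

    def view_for(self, service):
        """Return only the sections this service has been granted."""
        allowed = self.grants.get(service, set())
        return {k: v for k, v in self.sections.items() if k in allowed}
```

A service the user has never granted anything to gets an empty view, so the default is to disclose nothing.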

5. Electronic health records are almost unique in IT. What other enterprise app can you think of for which relational DBMS aren’t the default underpinning? (InterSystems’ object-oriented DBMS Caché has huge share in the clinical records market.) Normal tabular data, text, images, sensor output streams – health records have it all. What’s more, the health records area is coming upon some very interesting times in the area of data sharing, at least in the US.

Just as retailing went from being an IT backwater (through the mid-1980s), to a sophisticated user of database technology (1990s), to the leader of the internet revolution (rise of e-commerce), I think health care is due to take a leadership role in IT advances. And when it does, search, text mining, and voice recognition will all play important roles.

6. Most people reading this far have probably watched Star Trek. Well, what is keeping us from being able to command computers in a Star Trek fashion? Not really that much. Sure, there are some big missing pieces. We need a mapping from commands to the specific applications that would carry them out. We also need a more structured kind of analytic middle tier so that there’s something to map questions to. But those are solvable problems. And by the way – when everybody wears headphones, voice commands emanating from the next cubicle are no longer the big annoyance they would be today. Mobile/small devices only add to the business case for voice recognition advances.

When voice becomes a primary mode of human/device communication, “text” analytics will be affected in any number of ways.
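The “mapping from commands to the specific applications” piece can be sketched in a few lines of Python, assuming speech has already been transcribed to text. The command patterns, the handler functions, and the `dispatch` helper are all hypothetical; a real system would need far richer language understanding than regular expressions.

```python
import re

# Hypothetical registry: each entry pairs a compiled pattern with a handler.
COMMANDS = []


def command(pattern):
    """Register a handler for utterances matching the given regex."""
    def register(func):
        COMMANDS.append((re.compile(pattern, re.IGNORECASE), func))
        return func
    return register


@command(r"^open (?P<app>\w+)$")
def open_app(app):
    return f"launching {app}"


@command(r"^show (?P<metric>\w+) for (?P<period>.+)$")
def show_metric(metric, period):
    # Stand-in for a query against a structured analytic middle tier.
    return f"querying {metric} over {period}"


def dispatch(utterance):
    """Route already-transcribed text to the first matching handler."""
    text = utterance.strip()
    for pattern, handler in COMMANDS:
        match = pattern.match(text)
        if match:
            return handler(**match.groupdict())
    return None  # no application claims this command
```

For example, `dispatch("show revenue for last quarter")` would reach `show_metric`, while an utterance no pattern recognizes falls through and returns `None`.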

Related links:

The introductory post in this series
19 possible Microsoft/Yahoo synergies, many of them related to text technology convergence, e.g. between web search and enterprise search
The compelling case for letting Google handle your enterprise email
An old post on why BI vendors flocked to integrate with Google OneBox
A proposal to refactor social networks
An old post in which I outlined some of the criteria for Profiles 2.0
Why text technologies are going to recombine (in A World of Bytes)