Attivio – Text Technologies

Attivio update

Curt Monash — Sat, 20 Sep 2008 05:00:06 +0000

I talked w/ Andrew McKay of Attivio for 2 ½ hours Thursday. I’ve also been working with some Attivio engineers on a blog search engine. I think it’s time to post about Attivio.

In its full conception, the Attivio Intelligence Engine is something like Endeca + RDBMS + search engine + XML store + cool extra features. And all with seamless, lightweight, integrated installation and administration. That’s the goal, anyway. At this point, naturally, each individual piece is far from complete. For example:

Sufficient SQL support to handle most BI tools is still a matter for future releases — apparently in 2009, although Attivio is one of those agile companies for which pinning down product releases is somewhat difficult.
The same goes for some basic GUI features (such as most non-programmatic search tuning).
ACID compliance is not a high priority for Attivio. I actually think it should be higher, just because it’s increasingly become an “OK, we don’t have to worry about THAT” checkmark item.

Even in its early days, Attivio has had some nice-sounding customer successes. There are 8 paying Attivio customers, including 2 deals of more than $1 million, one half-millionish dollar deal, and 1 large OEM. Three represent actual deployments, with the rest in development. More sales are on the way, as are permissions to disclose customer names that people will actually recognize. Customer application stories Andrew told me about include:

A web business’s parameterized, adjustable-weight search that’s starting with tabular data and only getting to free text later.
An enterprise that’s using Attivio for content management, enterprise search, public-facing search, and data warehousing.
Something big/mysterious/classified, with large document volumes.
Something to do with compliance, about which Andrew was going to forward a lot more detail that evening (Hint, hint).

Since the major RDBMS (Oracle, Microsoft SQL Server, DB2) all have text search and XML subsystems, they can in principle do everything Attivio does on the back end, and with a lot more features and maturity. The same would go for Mark Logic. Performance and overhead might be different matters, however; Andrew certainly believes so.

Except that Lucene is included on the search side, I haven’t actually figured out how Attivio stores data. The fact that SQL features are being added incrementally suggests Attivio is rolling its own relational database capability, but how it’s organized I don’t really know.

The Attivio angle on the FAST story

Curt Monash — Tue, 08 Jul 2008 19:16:50 +0000

Attivio CEO Ali Riaz was previously CFO and COO of FAST. He tried to avoid involvement in the recent exposé of his former employer. For his troubles he got a parking lot ambush, a big photograph, and some unflattering coverage. Adriaan Bloem and Stephen Arnold have been hotly debating Ali’s culpability.

There are two general issues here, based on the fact that Ali and a couple of other key Attivio executives come from FAST. First, they were at a corrupt company — but resigned before the worst (and perhaps all) of the corruption happened. Second, they were at a company that did very well in some respects, but very badly in others, so it’s a mixed-quality resume item.

So far, no biggie. Lots of executives exude overoptimism about their companies’ products and business prospects. And I haven’t identified anything which suggests to me as a former stock analyst that the controls Ali put in place as CFO/COO were inadequate. (If he’d been long-time CEO, it would have been a different matter, as he would have been more responsible for the general ethical culture of the company. But he wasn’t.)

So the main serious charge is that FAST funneled a lot of sales through small reseller companies owned by its executives, including Ali. Such arrangements could be used either for misappropriation of funds, or to inflate revenue. In the article, Ali denies involvement in any reseller until after he left FAST’s employment, but the reporter purports to have discovered proof to the contrary. I couldn’t quite get Ali to reiterate his denial to me — or, indeed, to talk with me directly about the matter — but did get an emailed statement which reads:

Mr. Riaz categorically denies any wrongdoing during his tenure at FAST or in any relationship with FAST thereafter. He has not been an employee of FAST for almost two years now, and therefore must defer all further comments to Microsoft’s official 2006 and 2007 statements on the matter.

I’ve advised my clients at Attivio that they should be clearer and more specific, but so far I’m not carrying the day. So for now, we’ll go with that.

19 bullet points about the difference between enterprise and web search

Curt Monash — Mon, 14 Jan 2008 18:51:21 +0000

Eric Lai wrote in this week’s Computerworld about “Why is enterprise search harder than Google Web search?” Highlights included:

He described enterprise search as consisting mainly of a search box plus faceted searching, with maybe some automated tagging as well.
He observed that off-page factors such as PageRank don’t work nearly as well in an enterprise as they do on the Web, and that manual tagging by enterprise users falls far short of closing the gap.
He stumbled a bit comparing/contrasting search engines and “structured” DBMS.
He basically endorsed the worldview of Ali Riaz, late of FAST, now of Attivio.

On the whole, that’s not bad. If this were an easy subject to write about, I’d have explained it a lot more clearly in the past myself. OK. Let me get off my duff and give it a whirl now.

Actually, when writing, I generally stay on my duff. At least, that’s true if I’m guessing correctly what a “duff” is. And this is not just a vaguely humorous digression — it’s also an example of why information retrieval is so hard if you only have the text itself to go by.

With that said, here are some notes on web search, enterprise search, single-site search, and database management.

Web search has a huge problem that enterprise search doesn’t — adversarial information retrieval. Enterprise document creators generally don’t try to game search results.
Single-site search sometimes works very well, and sometimes works laughably badly. (I often use regular Google after a site-search engine offers up a long list of bug reports and minor upgrade notes, instead of product overviews.) For an egregious example see Oracle.com, where search seems to have gotten even worse than it was a couple of years ago.
The difference in almost every case, I think, is whether or not the site owner has done a good job of manual tagging. That would explain why changing a choice of search engine can make a site worse, if you don’t rebuild the tags; I suspect this is what happened in the Oracle case, after some acquisition.
Full structured search technology can be the difference between “pretty good search” and an e-commerce site that really rocks, but it doesn’t work at all without a lot of manual tagging.
When you have a revenue-generating e-commerce site, it’s easy to justify the work of manual tagging. When you have a marketing site, it can make sense. But inside an enterprise, the tagging isn’t going to happen much.
Documents in an enterprise lie in a broad range of disparate corpuses. There are spreadsheets, PowerPoint presentations, structured-field database entries, free-text-field database entries, email, instant messages, and so on. And there also are many different corpuses of traditional text documents. A large-enterprise search engine needs tools to dial up or dial down the relevance of different corpuses, both in general and for a specific search.
Each corpus may also have its own kinds of metadata that are helpful in ranking and summarizing search results.
As Guy Creese already pointed out in a comment to Eric’s article, security comes into play in enterprise search. I think of that as different users seeing different corpuses.
Link analysis doesn’t work inside enterprises. Indeed, most of the documents aren’t in HTML and don’t link to each other.
In web and enterprise search alike, you’re often satisfied by a page that doesn’t help you directly, if it points you at a better resource. But the details of the two cases differ. In enterprises, the better resource is usually a person — i.e., a document author — that you can contact directly. (This is pretty much the main part of knowledge management that actually works.)
I wrote two years ago about a key missing ingredient to enterprise search technology — an ontology management system. Unfortunately, I’m not aware of significant progress in that direction subsequently.
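
To make two of the points above concrete (dialing corpus relevance up or down, and different users seeing different corpuses), here is a minimal Python sketch. Everything in it is an assumption for illustration: the corpus names, the weights, the `Hit` type, and the `rank` function are hypothetical and not taken from any particular search product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    corpus: str        # hypothetical corpus names such as "email" or "wiki"
    base_score: float  # relevance as reported by the underlying text engine


def rank(hits, default_weights, user_corpora, query_weights=None):
    """Rank hits after security filtering and per-corpus weighting.

    default_weights: corpus -> multiplier applied to every search.
    user_corpora:    set of corpora this user is allowed to see.
    query_weights:   optional corpus -> multiplier for this search only.
    """
    query_weights = query_weights or {}
    scored = []
    for hit in hits:
        if hit.corpus not in user_corpora:
            continue  # security: this user never sees this corpus
        weight = default_weights.get(hit.corpus, 1.0) * query_weights.get(hit.corpus, 1.0)
        scored.append((hit.base_score * weight, hit.doc_id))
    # Highest weighted score first; ties broken by doc_id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


hits = [
    Hit("mail-17", "email", 0.9),
    Hit("wiki-3", "wiki", 0.6),
    Hit("hr-8", "hr_records", 0.95),
]
defaults = {"email": 0.5, "wiki": 1.5}  # email dialed down, wiki dialed up

general = rank(hits, defaults, user_corpora={"email", "wiki"})
boosted = rank(hits, defaults, user_corpora={"email", "wiki"},
               query_weights={"email": 4.0})
print(general)  # wiki outranks email under the default weights
print(boosted)  # email outranks wiki once this search boosts it
```

The multiplier is the crudest possible knob. A real engine would more likely fold corpus weights into its scoring function, but the shape of the problem is the same: one set of weights that applies in general, one that applies to a specific search, and a visibility filter that runs before either.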

Finally, I’d like to lay out a few points about the integration of text search and database management.

Text search became an option in object/relational database management systems a little over a decade ago. The basic paradigm is that a document is stored in a single field, and a full-text index on it is integrated into the RDBMS. SQL syntax was extended to include text operators. Oracle, IBM, and Informix all did this cleanly; Microsoft found a workaround. Although all these vendors were very disappointing in the quality and performance of their search, these engines get a lot of use in application areas where it’s obviously beneficial to integrate search and relational queries.
Search engines — both standalone and DBMS-integrated — have in recent years gained the capability to query any kind of database field. Why not? They typically can handle 1-200+ other document types as well. Anyhow, business intelligence based on that capability is part of the FAST and now Attivio stories.
The fundamental technology of full-text indexing is similar to a number of things that go on in the relational/tabular/structured worlds, such as ADABAS’s inverted lists, or anything in the bitmapped and columnar areas. As just one example, SAP’s columnar BI Accelerator is directly built on TREX technology.
Mark Logic makes a good case that if you’re going to integrate search and DBMS, XML may be the way to go rather than relational.
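
The kinship between full-text indexes and inverted lists or bitmaps can be shown with a toy Python sketch. It is not how any of the products named above work internally. The rows, the `region` column, and the function names are all made up, and plain Python integers stand in for bitmaps. The point is only that a text term and a structured column value can both map to "the set of matching rows", so a mixed query becomes a bitwise AND.

```python
import re
from collections import defaultdict

# Hypothetical table: one structured column plus one document-in-a-field column.
rows = [
    {"id": 0, "region": "EMEA", "body": "Quarterly compliance report for the audit team"},
    {"id": 1, "region": "APAC", "body": "Search tuning notes and relevance report"},
    {"id": 2, "region": "EMEA", "body": "Notes on search relevance and faceted navigation"},
]


def build_bitmaps(rows):
    """Return (term -> bitmap, region -> bitmap); bit i is set when row i matches."""
    term_bits = defaultdict(int)
    region_bits = defaultdict(int)
    for position, row in enumerate(rows):
        bit = 1 << position
        region_bits[row["region"]] |= bit
        for term in set(re.findall(r"[a-z]+", row["body"].lower())):
            term_bits[term] |= bit
    return term_bits, region_bits


def matching_ids(bitmap, rows):
    """Decode a bitmap back into the ids of the rows whose bits are set."""
    return [row["id"] for position, row in enumerate(rows) if (bitmap >> position) & 1]


term_bits, region_bits = build_bitmaps(rows)

# Pseudo-query (not any vendor's actual syntax):
#   region = 'EMEA' AND body contains 'search' AND body contains 'relevance'
result = term_bits["search"] & term_bits["relevance"] & region_bits["EMEA"]
print(matching_ids(result, rows))  # [2]
```

Production systems compress these structures and store positions, frequencies, and so on, but the text predicate and the relational predicate still meet at the same kind of row-set operation.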

Attivio tries to do it all

Curt Monash — Wed, 12 Dec 2007 04:38:55 +0000

When Andrew McKay was at FAST, I grumped about his search/BI integration story. Now that he’s trying to do the same thing at a startup called Attivio, it sounds more plausible.

Attivio is having a house party and product rollout in the latter part of January, and details are scarce in the meantime. But here are some highlights.

Attivio was founded in August. It has 21 people and 1 VC. The VC has invested >$6 million and committed >$12 million total.
Attivio has ambitious plans for a fully integrated data management/real-time BI stack. It’s currently called the “Active Intelligence Engine.”
The data management part combines tabular, text, and XML data. The tabular part is some kind of bitmap. The text part is fairly traditional, and based on Lucene.
One point of this architecture is that one can more or less seamlessly join different kinds of data.
Another point is surely that — with everything being more or less like a column or bitmap — memory management and administration are manageable issues.
Despite containing all these wonders, the code is under 10 megs total. At least right now. But then — how much code can one write in a few months?
Andrew didn’t want me to repeat everything he said about target markets, but clearly Wall Street is one of the top possibilities.
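
To picture what "joining different kinds of data" might look like from the outside, here is a small Python sketch. It says nothing about Attivio's actual implementation, which I don't know. The account, note, and trade data are invented, as is the `join_on_account` function. It simply joins tabular rows, free text, and XML records on a shared key.

```python
import xml.etree.ElementTree as ET

# Tabular data: one row per account (all values are made up).
accounts = [
    {"account_id": "A1", "desk": "equities"},
    {"account_id": "A2", "desk": "rates"},
]

# Text data: free-text notes keyed by account.
notes = {
    "A1": "Client asked about margin requirements after the earnings call.",
    "A2": "No contact this quarter.",
}

# XML data: trade records.
trades_xml = """
<trades>
  <trade account="A1" symbol="XYZ" qty="500"/>
  <trade account="A1" symbol="ABC" qty="200"/>
  <trade account="A2" symbol="XYZ" qty="50"/>
</trades>
"""


def join_on_account(accounts, notes, trades_xml, keyword):
    """Return accounts whose note mentions `keyword`, with desk and total traded quantity."""
    totals = {}
    for trade in ET.fromstring(trades_xml).iter("trade"):
        account = trade.get("account")
        totals[account] = totals.get(account, 0) + int(trade.get("qty"))
    return [
        {
            "account_id": row["account_id"],
            "desk": row["desk"],
            "total_qty": totals.get(row["account_id"], 0),
        }
        for row in accounts
        if keyword.lower() in notes.get(row["account_id"], "").lower()
    ]


joined = join_on_account(accounts, notes, trades_xml, "margin")
print(joined)
```

Here the text predicate (the note mentions "margin") picks the accounts, the table supplies the desk, and the XML supplies the aggregated quantity.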

Stay tuned.