January 14, 2008

19 bullet points about the difference between enterprise and web search

Eric Lai wrote in this week’s Computerworld about “Why is enterprise search harder than Google Web search?” Highlights included:

On the whole, that’s not bad. If this were an easy subject to write about, I’d have explained it a lot more clearly in the past myself. OK. Let me get off my duff and give it a whirl now.

Actually, when writing, I generally stay on my duff. At least, that’s true if I’m guessing correctly what a “duff” is. And this is not just a vaguely humorous digression — it’s also an example of why information retrieval is so hard if you only have the text itself to go by.

With that said, here are some notes on web search, enterprise search, single-site search, and database management.

Finally, I’d like to lay out a few points about the integration of text search and database management.

Comments

15 Responses to “19 bullet points about the difference between enterprise and web search”

  1. David Eddy on January 15th, 2008 7:05 pm

    Curt –

    Do you have any thoughts on why enterprise search vendors (to the best of my knowledge) have so far assiduously ignored the existence and importance of program source code?

    Source code is where the rules of the business have been automated, not MSWord policy documents.

    I agree with you that ontology management would help tremendously, but to date anything remotely practical remains a distant dream.

    – David

  2. Curt Monash on January 16th, 2008 12:12 am

    David,

    Are you talking more about:

    A. A non-programmer using a search engine and wanting source code as an answer?

    B. A programmer not having a better way to find source code than via a search engine?

    A strikes me as unlikely. B, if it’s likely, would be disappointing to me. I understand that configuration management systems use weird file types, or at least did a decade ago when I was more fully up to speed on them. But geez; I’d think they would have had that capability even then, let alone now.

    CAM

  3. David Eddy on January 16th, 2008 4:34 pm

    Curt –

    I could certainly see a savvy systems analyst (perhaps having come up through the ranks as a programmer so they have code reading/comprehension skills) being capable of searching code.

    It is my assumption that since the rules of the business are buried in source code, someone (likely NOT a programmer) is going to need to find all the places across systems where changes need to be made. Example: I’ve heard that NASDAQ is expanding its trading symbol from 4 to 6 characters. At an absolute minimum that impacts both Windows-based systems and mainframes (with probably some Unix & AS400 thrown in for fun). A first pass at project guesstimation needs to somehow establish how many different places need to be changed. A programmer typically isn’t going to have the sort of span-of-view to worry about both Windows and mainframe, but a business analyst would (should?).

    Configuration management systems aren’t much help since they’re typically like the silos they hold… isolated & hard to get at. COBOL & .NET code are not likely to be stored in the same place or format.

    In any case… by your answer I will (safely?) assume that this crazy idea—of there being value in searching source code more effectively—has not appeared on your enterprise search radar screen…?

    I’ve found a few companies that appear to be stepping up to at least part of the challenge… Krugle (www.krugle.com), codefetch (www.codefetch.com), codase (www.codase.com), and koders (www.koders.com). I have no idea how strong these offerings are, but a first thought would be that they seem to only look at software languages that have risen to popularity in the past 10 years, such as C, HTML, JavaScript, Java, PHP, Ruby, etc. No mention of legacy—but still in widespread use—languages such as COBOL, JCL, CICS, Assembler, etc.

    These vendors appear to be entirely disconnected from the enterprise search space.

    – David

  4. Curt Monash on January 16th, 2008 5:40 pm

    Such vendors should be pretty disconnected from enterprise search. There’s been the occasional attempt at an exception, but basically they’re very different problems. Search generally starts by identifying — tokenizing — words. Then it looks at syntax (grammatical structure) and semantics (synonyms and clusters). And it goes on from there.

    There’s very little overlap between how that’s done in human languages and how it’s done in computer languages.
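A minimal sketch of that mismatch at the tokenizing step alone, using only Python's standard `re` module. The function names (`tokenize_prose`, `split_identifier`, `tokenize_code`) and the splitting rules are illustrative assumptions, not how any particular search product works.

```python
import re

def tokenize_prose(text):
    """Prose-style tokenizer: lowercase, keep runs of letters/digits/apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def split_identifier(identifier):
    """Break a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]

def tokenize_code(line):
    """Code-style tokenizer: identifiers, numbers and operators stay distinct,
    and each identifier is paired with the words it is built from."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]", line)
    return [
        (tok, split_identifier(tok) if tok[0].isalpha() or tok[0] == "_" else [])
        for tok in tokens
    ]
```

Fed a line of code, the prose tokenizer drops the operators and fuses each identifier into one opaque lowercase string, so a query for "pay" would not match `weeklyPay`. The code tokenizer keeps the operators and recovers the words inside each identifier.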

    CAM

  5. Text Technologies»Blog Archive » Lynda Moulton on enterprise search on January 17th, 2008 2:11 pm

    [...] search quite similarly, as I discovered when she called me yesterday to praise my post on the many differences between enterprise and web search, and followed up with this one of her own. One of Lynda’s big themes is that large [...]

  6. David Eddy on January 17th, 2008 3:59 pm

    Curt –

    >
    > There’s very little overlap between how that’s done in human languages and how it’s done in computer languages.
    >

    Acknowledged.

    Obviously there’s been a huge amount of work done with human languages—proximity, stemming, probabilities, etc.—that simply will not work when applied to software languages.

    My point is that the rules of the business are deeply buried and highly fragmented in extremely difficult to comprehend software, not in scores of well written document formats (MSWord, email, PowerPoint, etc.) intended for human consumption.

    Enterprise search is going about the problem by looking for the lost keys under the street light (“Because that’s where the light is… but I lost the keys over by the car which is in the dark.”) simply because it’s easy.

    My primary beef here is that software (source code) is simply not considered to be a document… and therefore is not worthy of being brought to the search table.

    How are people going to know what’s really happening “behind the firewall” if source code is not included in the searching process?

    When the CEO issues the command “I want social security number either encrypted or removed from use where not necessary.” you’re going to rely on what your easily findable word processing format documents tell you? Surely you’re not telling me that?

    – David

  7. Curt Monash on January 17th, 2008 4:23 pm

    What I’m saying, David, is that a specialized tool is needed. It’s not realistic to ask one search product to find EVERYTHING.

    General search products don’t even work well across the full range I think they should cover. And the specific problem you’re referring to falls outside that range.

    Configuration management and app dev tools have ever more understanding of software’s syntax and semantics. I’d start from them as a base, rather than from traditional inverted-file text-string indexing.

    CAM

  8. David Eddy on January 18th, 2008 12:51 am

    Curt –

    >
    > a specialized tool is needed. It’s not realistic to ask one search product to find EVERYTHING.
    >

    Again, agreed.

    The ultimate enterprise search tool is obviously going to need a passel of highly specialized tools under the covers. Making it look as easy & slick as Google is going to be interesting. A major challenge, of course, is that most enterprises have only the foggiest idea of their applications inventory.

    First, source code has to be considered a valuable & important BUSINESS search resource before we start thinking about what sort of exotic tools are needed.

    It is my argument (clearly a voice of one) that the corporate knowledge buried in source code needs to be recognized as worthwhile to mine… rather than to leave it walking around in the heads of soon-to-retire experts. Currently, changing systems is slow, expensive, manual work, far too often highly dependent upon domain experts. It is my belief that enterprise search could be a significant help in whittling away at the “80% of my IT budget goes to legacy systems” problem.

    Obviously (after a lot of non-obvious rat holes) you have to approach enterprise search with a “white list” approach… first pass you identify what it is you’re trying to read (e.g. PowerPoint, MSWord, COBOL, dBase, etc.), second pass you process it with the appropriate reader. If you can’t identify what it is, then don’t try to read more than a few lines. Probably best not to rely on extensions (.exe, .doc) as gospel as to what the document truly is.
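A rough sketch of that two-pass "white list" idea in Python. The first pass identifies a document from its leading bytes rather than its extension; the second pass hands it to a matching reader, or peeks at only a few lines if nothing matches. The whitelist entries, the COBOL heuristic, the `readers` mapping, and the `HEAD_BYTES` and `PEEK_LINES` limits are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional

HEAD_BYTES = 4096   # assumed: how much of each file the first pass inspects
PEEK_LINES = 5      # assumed: how many lines to read from unidentified content

def looks_like_cobol(head: bytes) -> bool:
    """Crude heuristic (an assumption): COBOL programs open with this division."""
    return b"IDENTIFICATION DIVISION" in head.upper()

# Whitelist of (label, test on the first bytes). Extensions are never consulted.
WHITELIST = [
    ("pdf", lambda head: head.startswith(b"%PDF-")),
    ("ole2", lambda head: head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1")),
    ("zip", lambda head: head.startswith(b"PK\x03\x04")),
    ("cobol", looks_like_cobol),
]

def identify(head: bytes) -> Optional[str]:
    """First pass: return the whitelist label for this content, or None."""
    for label, matches in WHITELIST:
        if matches(head):
            return label
    return None

def process(data: bytes, readers: Dict[str, Callable[[bytes], str]]) -> str:
    """Second pass: use the matching reader, else read only a few lines."""
    kind = identify(data[:HEAD_BYTES])
    reader = readers.get(kind) if kind is not None else None
    if reader is not None:
        return reader(data)
    peek = data.splitlines()[:PEEK_LINES]
    return b"\n".join(peek).decode("latin-1")
```

Here `"ole2"` stands for the container used by legacy Word, PowerPoint and Excel files, and `"zip"` for the container used by newer Office formats among others, so a real reader would still have to look inside either one. A COBOL program saved as `payroll.doc` is still identified as COBOL, because only the content is examined.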

    I’m not aware of any application development tools that bring semantic understanding to the table. But then maybe I’m quibbling over the definition of “semantics.”

    Development tools (Xcode/ObjectiveC being my most current knowledge) that I’m familiar with are equally happy with:

    a = b * c or

    weeklyPay = hoursWorked * payRate.

    If an Eclipse plug-in has brought something more robust to the table, please point me in the right direction.

    – David

  9. Curt Monash on January 18th, 2008 8:54 pm

    David,

    Why don’t you take a look at the tools that purported to automate the finding (if not fixing) of Y2K 2-character data fields? That was, er, 8+ years ago, so they’ve had a lot of time to evolve since then.

    CAM

  10. David Eddy on January 18th, 2008 11:08 pm

    Curt –

    That’s precisely why I’m interested in enterprise search.

    I was a Y2K inventory/impact analysis tool vendor, getting into the market in late 1994. We had a tool that explicitly handled “odd” languages (beyond the biggies of COBOL & PL/1)… things like EasyTrieve, EasyTrieve Plus (they’re not related), Natural and others long forgotten. It was a hoot chasing down folks who had these bizarre languages I’d never heard of before. Ever heard of Extracto?

    I knew we were onto something interesting/challenging when Capers Jones sent me his Function Points languages list in 1995. There were 400+ software languages on the list. By 2005 that list had expanded to 650 before being “pruned” back to a more manageable 500.

    To the best of my knowledge none of the Y2K inventory/impact analysis tools have survived. I know ours didn’t (I know of a single surviving site).

    We got to 1/1/00. The world didn’t end. The tools & systems inventory knowledge went into the bit-bucket. End of story.

    It’s my belief that most “civilians” see Y2K as a giant techie hoax. I’m sure a lot of IT departments did not cover themselves in glory in the eyes of business executives for heavily porking up IT budgets under the dodge of “we need it for Y2K.”

    The business value of actively maintaining a complete, accurate & edge-to-edge inventory of an organization’s applications portfolio (with the additional benefit of being able to trace how the pieces are interrelated) is a very hard sell.

    The high-school dropout running the local 7-11 knows how many candy bars & jugs of milk he has on hand (inventory). Why doesn’t IT keep an inventory? There was a news item last year about EDS doing an outsourcing contract for the US Navy. They went in believing there were 5,000 systems. EDS ultimately found 100,000+.

    What is different now is that we have the delight of Google… which means people now want to have the same ease-of-use access to knowledge/answers/information behind the firewall. The fact that serious analysts clearly emphasize that Google & enterprise search are not even remotely comparable problems just falls on the floor as useless noise.

    Thanks for being interested.

    – David

  11. Spectate Swamp on March 7th, 2008 7:27 pm

    I use my Desktop Search to search source code at a telephone billing software company.

    It is a non-indexing search. The first step is to “Merge/Append” all the source code into one file, then search that file. When the files are merged, a start and stop header is put in the merged file around each one. When a match is found, the originating file name is displayed in the form title bar. It searches text at 20,000,000 cps. Any system worth its salt can export data to text. I have all my emails since 1996 in large text files. I can even use the search to extract lists of email addresses.
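The merge-then-scan approach described above can be sketched in a few lines of Python. This is an illustration only, not the commenter's actual tool: the header strings, the function names, and the choice to work on in-memory text rather than files on disk are all assumptions.

```python
START = "=====START "   # hypothetical header written before each file's text
STOP = "=====STOP "     # hypothetical header written after each file's text

def merge_sources(sources):
    """Append every source text into one string, wrapped in start/stop headers.

    `sources` maps a file name to that file's contents.
    """
    chunks = []
    for name, text in sources.items():
        chunks.append(START + name)
        chunks.extend(text.splitlines())
        chunks.append(STOP + name)
    return "\n".join(chunks) + "\n"

def search_merged(merged, needle):
    """Linear, non-indexed scan. Returns (file name, line number, line) hits."""
    hits = []
    current, line_no = None, 0
    for line in merged.splitlines():
        if line.startswith(START):
            current, line_no = line[len(START):], 0
        elif line.startswith(STOP):
            current = None
        else:
            line_no += 1
            if needle in line:
                hits.append((current, line_no, line))
    return hits
```

The headers are what let a hit be traced back to its originating file. One limitation of this sketch is that a source line that itself begins with one of the header strings would be misread as a boundary.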

    The search has evolved to randomly play mpg video and mp3 audio as well as pictures.

    I have been arguing search with everybody on the net, for years now.

    http://channel9.msdn.com/showuserthreads.aspx?userid=31672

    http://forums.thedailywtf.com/forums/t/7593.aspx

  12. DBMS2 — DataBase Management System Services » Blog Archive » The 4 main approaches to datatype extensibility on April 25th, 2008 12:10 am

    [...] Text search is a huge business on the web, and a separate big business in enterprises. And text doesn’t fit well into the relational paradigm at [...]

  13. How text search has evolved over the past 15 years | Text Technologies on June 15th, 2008 3:26 am

    [...] “Looking at this list, you can see that the conceptual changes (breakthroughs?), with the exception of better phrase handling, are primarily focused around Web searches. When dealing with one-of-a-kind document collections behind the corporate firewall, many of these developments turn out not to add much to older approaches. So, at least for enterprise search, I too remain partial to some of the older products you mention, though I am disappointed that most of the old-time vendors have not updated their approaches beyond adding taxonomy support.” [CAM] Yep, web search and enterprise search are very different things. [...]

  14. 19 bullet points about the difference between enterprise and web search | Text Technologies :: Kelvin Tan - Lucene Nutch Consulting on June 17th, 2008 10:47 am
  15. Wally on September 26th, 2008 1:19 pm

    You guys are geeks.
