October 19, 2005

Linkage among different text technologies

The first post in this blog describes its subject as “a group of interrelated, linguistics-based technology sectors, including text mining, search, speech recognition, and text command-and-control.” So I might as well kick off the discussion by summarizing some reasons why I think these sectors really are connected to each other. Very quickly:

IBM says so, and nobody (that I know of) is contradicting them.
The essence of the UIMA story is that a lot of different pieces of technology need to be swapped in and out, not just among different brands of the same text applications, but among different kinds of text app. The vendors I checked with are uniformly skeptical about whether UIMA will have a real market impact, but none disputes UIMA’s underlying premise.

The tokenization chain. This general industry agreement is only one of three major reasons I believe in the general UIMA premise (while sharing the skepticism about that particular framework’s early adoption). A second dates back to when I was first learning about text search. At a Verity User Conference in, I think, April 1997, I had a very interesting conversation about Verity’s new architecture. (Probably with Phil Nelson, maybe with somebody else, such as Hugh Njemanze or Nick Arnett.) Basically, the system had been modularized, and the way it had been modularized was to create a flow of tokenization after tokenization after tokenization. The third reason is the observation that Inxight, so central to the tokenization strategies of text search vendors, plays pretty much the same role for the text mining companies.
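
To make the "tokenization after tokenization" idea concrete, here is a minimal sketch of such a chain. It is not Verity's architecture, Inxight's product, or the UIMA API; the stage names, the stopword list, and the toy stemmer are all assumptions made up for illustration. The point is only that each stage consumes the previous stage's tokens, so stages can be swapped in and out per application.

```python
import re
from typing import Callable, Dict, List

Token = Dict[str, str]
Stage = Callable[[List[Token]], List[Token]]

# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"the", "a", "of", "and"}


def split_words(text: str) -> List[Token]:
    """First pass: break raw text into word tokens."""
    return [{"text": m.group(0)} for m in re.finditer(r"[A-Za-z0-9']+", text)]


def lowercase(tokens: List[Token]) -> List[Token]:
    """Add a normalized (lowercased) form to every token."""
    return [{**t, "norm": t["text"].lower()} for t in tokens]


def drop_stopwords(tokens: List[Token]) -> List[Token]:
    """Remove tokens whose normalized form is a stopword."""
    return [t for t in tokens if t["norm"] not in STOPWORDS]


def crude_stem(tokens: List[Token]) -> List[Token]:
    """Toy stand-in for a real stemmer: strip one trailing 's'."""
    out = []
    for t in tokens:
        norm = t["norm"]
        if len(norm) > 3 and norm.endswith("s"):
            norm = norm[:-1]
        out.append({**t, "stem": norm})
    return out


def run_chain(text: str, stages: List[Stage]) -> List[Token]:
    """Feed the output of each stage into the next one."""
    tokens = split_words(text)
    for stage in stages:
        tokens = stage(tokens)
    return tokens


# Two applications assembled from the same parts (assumed configurations).
SEARCH_CHAIN: List[Stage] = [lowercase, drop_stopwords, crude_stem]
MINING_CHAIN: List[Stage] = [lowercase, crude_stem]  # keeps stopwords

if __name__ == "__main__":
    for token in run_chain("The vendors of search engines", SEARCH_CHAIN):
        print(token["text"], "->", token["stem"])
```

A search application and a text mining application here differ only in which stages they list, which is roughly the sense in which the same tokenization components can serve both.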

The centrality of concept ontologies. I don’t currently have an opinion about the Semantic Web, but in a more limited sense it’s clear that ontologies will rule text applications. Whether for search, text data mining, or application command/control, it just doesn’t suffice to identify, find, weigh, or respond to individual words. Rather, you need to add other words indicating similar meaning (or a similar user “intent”) into the mix.*

This is a big deal, because simple-minded ontologies don’t work. They can’t just be automatically generated, and they can’t just be hand-built. They can’t just be customized for each user or user enterprise, but they also can’t be provided entirely by technology vendors. Almost no large enterprises currently have a good system of ontology building and management, but in the near future most will have to. Evolution in this area will be a crucial determinant of how multiple text technology submarkets are shaped.

In particular, this is a big enough deal that I think search and text data mining and other text technologies will, for each enterprise, tend to use the same ontology.

*Note: There’s a whole other question as to how long we’ll be able to get by just looking at semantics, or whether syntactic analysis absolutely needs to be in the mix as well. But first things first; without a good ontology, syntactic analysis is a pretty hopeless endeavor.
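
As a rough illustration of what "using the same ontology" could mean in practice, here is a small sketch in which one concept table drives both query expansion for search and concept counting for text mining. The ontology contents and the function names are hypothetical, and a real ontology would be far richer than a flat table of synonyms.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical hand-built concept ontology: concept -> terms that express it.
ONTOLOGY: Dict[str, Set[str]] = {
    "automobile": {"car", "auto", "automobile", "vehicle"},
    "purchase": {"buy", "purchase", "order"},
}

# Reverse index built once and shared by both applications below.
TERM_TO_CONCEPT: Dict[str, str] = {
    term: concept for concept, terms in ONTOLOGY.items() for term in terms
}


def expand_query(query: str) -> Set[str]:
    """Search side: add every term that shares a concept with a query word."""
    expanded: Set[str] = set()
    for word in query.lower().split():
        concept = TERM_TO_CONCEPT.get(word)
        # Words outside the ontology are kept as they are.
        expanded |= ONTOLOGY[concept] if concept else {word}
    return expanded


def count_concepts(documents: List[str]) -> Counter:
    """Mining side: roll individual word occurrences up to concept counts."""
    counts: Counter = Counter()
    for doc in documents:
        for word in doc.lower().split():
            concept = TERM_TO_CONCEPT.get(word)
            if concept:
                counts[concept] += 1
    return counts


if __name__ == "__main__":
    print(sorted(expand_query("buy car")))
    print(count_concepts(["Buy a car", "order an auto today"]))
```

Both functions read the same table, so any improvement to the ontology shows up in search and in mining at once. That is the practical argument for an enterprise maintaining one ontology rather than several.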

The use of text data mining in other areas. The automated part of the ontology building process involves a lot of text data mining. Large search engine companies generally do a lot of data mining to establish and validate tweaks to their search algorithms. The same goes for spam filters and more questionable forms of censorware. You can’t act intelligently without learning, and machines don’t learn well without doing statistical analyses.

I hope to post soon on each of these issues at more length, and I encourage comments on any of them as inputs to further work. But for now, I’ll just claim to have provided strong evidence for my initial point: Seemingly different text technologies are indeed closely related.


2 Responses to “Linkage among different text technologies”

  1. Francesco Sclano on November 20th, 2006 6:10 pm

    Hi everybody!
    TermExtractor, my master thesis, is online at the
    address http://lcl2.di.uniroma1.it.

    TermExtractor is a FREE and high-performing software package for Terminology
    Extraction. The software helps a web community to
    extract and validate relevant domain terms in their
    interest domain, by submitting an archive of
    domain-related documents in any format
    (txt, pdf, ps, dvi, tex, doc, rtf, ppt, xls, xml,
    html/htm, chm, wpd and also zip archives.)

    TermExtractor extracts terminology consensually
    referred to in a specific application domain. The
    software takes as input a corpus of domain documents,
    parses the documents, and extracts a list of
    “syntactically plausible” terms (e.g. compounds,
    adjective-nouns, etc.).
    Document parsing assigns greater importance
    to terms with distinctive text layouts (title, bold, italic,
    underlined, etc.). Two entropy-based measures, called
    Domain Relevance and Domain Consensus, are then used.
    Domain Consensus is used to select only the terms
    which are consensually referred to throughout the corpus
    documents. Domain Relevance is used to select only the terms
    which are relevant to the domain of interest; it is
    computed with reference to a set of
    contrastive terminologies from different domains.
    Finally, extracted terms are further filtered using
    Lexical Cohesion, that measures the degree of
    association of all the words in a terminological

    Francesco Sclano
    home page: http://lcl2.di.uniroma1.it/~sclano
    msn: francesco_sclano@yahoo.it
    skype: francesco978

  2. Derived data, progressive enhancement, and schema evolution | DBMS 2 : DataBase Management System Services on September 6th, 2011 3:10 am

    […] been the case that a simple text processing pipeline could have >15 steps of extraction; indeed, I learned about the “tokenization chain” in 1997. If all the “progression” in  the data enhancement occurs in a single processing run, […]
