Categorization and filtering
Analysis of technologies that focus on the categorization and filtering of documents and text. Related subjects include:
- Any subcategory
- Text mining
- Search engines
Are denial-of-insight attacks a threat to search logs and/or VOTC/VOTM apps?
TechTaxi points out that it’s at least theoretically possible to pollute somebody’s web-wide information gathering by polluting the Web. (Hat tip to Daniel Tunkelang.) They further assert this is a relatively near-term threat.
The theory can’t be denied. What’s more, bad actors have other motives to pollute the Web. For example, if they plant favorable automated comments about their own products or unfavorable ones about the competition’s, Voice of the Customer/Market applications will naturally be confused. And if automated reputation-checkers get more prominent, there will be a major incentive to game them, just as there has been for Google’s PageRank. So VOTC/VOTM market research tools could be polluted as a side effect.
Similarly, if somebody wants to test your e-commerce site by throwing a ton of searches at it, your search logs will lose value.
But disinformation of competitors for the sake of disinformation? Or, as the article suggests, vandalism/extortion? Off the top of my head, I’m not thinking of a serious near-term threat scenario.
| Categories: Competitive intelligence, Search engines, Spam and antispam, Voice of the Customer | 2 Comments |
The phrase “business intelligence” was COINED for text analytics
Late last year, there was a little flap about who invented the phrase business intelligence. Credit turns out to go to an IBM researcher named H. P. Luhn, as per this 1958 paper. Well, I finally took a look at the paper, after Jeff Jones of IBM sent over another copy. And guess what? It’s all about text analytics. Specifically, it’s about what we might now call a combination of classification and knowledge management.
Half a century later, the industry is finally poised to deliver on that vision.
| Categories: BI integration, Categorization and filtering, IBM and UIMA | 2 Comments |
3 specialized markets for text analytics
In the previous post, I offered a list of eight linguistics-based market segments, and a slide deck surveying them. And I promised a series of follow-up posts based on the slides.
| Categories: Language recognition, Natural language processing (NLP), Spam and antispam, Speech recognition | 2 Comments |
The Text Analytics Marketplace: Competitive landscape and trends
As I see it, there are eight distinct market areas that each depend heavily on linguistic technology. Five are off-shoots of what used to be called “information retrieval”:
1. Web search
2. Public-facing site search
3. Enterprise search and knowledge management
4. Custom publishing
5. Text mining and extraction
Three are more standalone:
6. Spam filtering
7. Voice recognition
8. Machine translation
How text search has evolved over the past 15 years
I just stumbled across a brilliant summary of evolution in text search technology, written four years ago. It’s equally valid today (which in itself says something). I found it on the Prism Legal blog, but the actual author is Sharon Flank. My own comments are interspersed in bold.
| Categories: Enterprise search, Ontologies, Search engines, Structured search | Leave a Comment |
Expert System S.p.A. update
I chatted with Brooke Aker, the new CEO of Expert System’s US subsidiary, for quite a while last week. Unfortunately, we had some cell phone problems, and email followup hasn’t gone well, so I’m hazy on a few details. But here are some highlights, as best I understood them.
| Categories: Application areas, Competitive intelligence, Coveo, Expert System S.p.A., Ontologies, Text mining | 2 Comments |
Google seems to have rehabilitated us
As previously noted, we were de-indexed by Google, due to the injection of a whole lot of spammy hidden links. We’re back now, after about two weeks, even on the blog (this one) where there was no official de-indexing notice and hence no way to apply for re-consideration. And thus we once again have high rankings for search terms such as Netezza, DATAllegro, Clarabridge, and Attivio.
We’re designing a new blog theme — the current one is just an emergency stopgap — that will (among myriad more important virtues) be more SEO-friendly. I’ll be curious to see whether that makes much actual difference from a search ranking standpoint.
| Categories: Google, Search engine optimization (SEO), Spam and antispam | 1 Comment |
Drive-by Google de-listing
As previously noted, we got hit with some hidden text, probably by SQL injection, and that led to a Google de-listing. Of the three blogs affected by the attack, I got a de-indexing notice for one (DBMS2); another was de-listed without a notice (Text Technologies); and a third seems to have waltzed through still indexed (Software Memories). I also received a de-indexing notice for another site I have nothing to do with and indeed had never heard of before. Go figure …
We’ve now upgraded to Wordpress 2.5, which should close the vulnerability. (Thank you Melissa Bradshaw!) Fearing our old, buggy theme would degrade further, we upgraded to a new one, Biru, designed by Bob. There are some teething-pain stability issues, but if they don’t cause a reversion in the next day, I’ll apply to Google for re-inclusion. (Uh, does anybody have a sense of how long that’s likely to take?)
All these hours of aggravation because some criminal wanted a bit of SEO advantage …
| Categories: Google, Search engine optimization (SEO), Spam and antispam | 1 Comment |
Over 80 percent of blog posts are probably spam
Doug Caverly highlights a Matt Mullenweg quote indicating that about 1/4 of all the blogs ever on Wordpress.com were spam (aka splogs). Now, that’s probably a higher fraction than for the blogoverse overall, because:
- Wordpress.com provides free hosting; using your own domain costs money.
- Besides being free, Wordpress.com hosting may provide a little “google juice”, which is the whole SEO point of spam blogging.
But there’s one more factor. Splogs have much higher posting frequency than real ones. 10-20+ posts per day is not uncommon, and 50-100+ is not unheard of. That’s 5-10X the post frequency of even the more active human-written blogs. So let’s assume:
- 10% of all blogs are spam.
- 10% of all blogs are actively written by humans.
- 80% of all blogs belong to humans, but are updated very infrequently if at all.
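In that case, with splogs posting 5-10X as often as the active human blogs and the dormant 80% contributing almost nothing, over 80% (and indeed probably over 90%) of all blog posts are made by machines rather than by human beings.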
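For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The function name and parameters are mine, purely for illustration. It assumes active human blogs post at a rate of 1 unit, splogs post at 5X or 10X that rate, and the rarely-updated blogs contribute effectively zero posts; that last simplification is an assumption, not a measured figure.

```python
def spam_post_share(spam_blogs, active_blogs, dormant_blogs,
                    spam_rate, active_rate=1.0, dormant_rate=0.0):
    """Fraction of all posts that come from spam blogs.

    Blog counts are shares of all blogs (e.g. 0.10 for 10%).
    Rates are posts per blog, relative to an active human blog.
    dormant_rate=0.0 is an assumed simplification.
    """
    spam_posts = spam_blogs * spam_rate
    human_posts = active_blogs * active_rate + dormant_blogs * dormant_rate
    return spam_posts / (spam_posts + human_posts)


# The post's assumptions: 10% spam, 10% active humans, 80% dormant.
for multiple in (5, 10):
    share = spam_post_share(0.10, 0.10, 0.80, spam_rate=multiple)
    print(f"splogs post {multiple}X as often: {share:.1%} of posts are spam")
```

At 5X the spam share works out to 5/6, and at 10X to 10/11, which is where "over 80%" and "over 90%" come from. Raising dormant_rate above zero pulls the share down, so the conclusion does depend on the dormant blogs really being close to silent.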
| Categories: Blogosphere, Search engine optimization (SEO), Social software and online media, Spam and antispam | Leave a Comment |
19 Microsoft/Yahoo synergies that could revolutionize the Internet
Many commentators on Microsoft’s bid for Yahoo, perhaps most of them, are thoroughly missing the point. The most interesting part of the bid isn’t the horse-race retrospective “How did they screw up so much as to need each other?” It’s not the incipient bidding war for Yahoo. And it’s certainly not the antitrust implications.
The Microsoft/Yahoo combination could revolutionize the Internet. I’m serious. The opportunities for huge synergies might just be enough to blast the merged companies out of their current uncreative, Innovator’s Dilemma funks. Search is open for radical transformation in user interface, universal search relevancy, Web/enterprise integration, and just about everything to do with advertising and monetization. Email stands to be utterly reinvented. Portals and business intelligence have only scratched the surface of their potential. And social networking is of course in its infancy.
Here’s an overview of where some synergies and opportunities for a combined Microsoft/Yahoo lie.
