Sunday, May 6, 2012

Content Analytics - Crossing the Chasm?

The overabundance of information is one of the greatest challenges today. Things have certainly changed since the 1990s, when Microsoft used to promise us a PC on every desk and information at our fingertips. Today, we have plenty of information at our fingertips. In fact, we have so much information coming at us from all sides that making sense of it has become one of the key information management challenges.

That’s where information analytics come in, with applications such as semantics and auto-classification. Simply put, these technologies analyze the actual content of information to derive insights, meaning, and understanding. They do so by applying powerful algorithms that analyze the content and perform complex tasks such as concept and entity extraction, similarity detection, trend identification, and sentiment analysis.
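To make the idea a bit more tangible, here is a deliberately naive sketch in Python - my own toy illustration, not how any commercial analytics engine actually works. It pulls out capitalized phrases as crude "entities" and scores sentiment against a tiny hand-made word list; real products replace both with far more sophisticated statistical and linguistic models.

```python
import re
from collections import Counter

# Tiny hand-made sentiment lexicon - purely illustrative, not a real model.
POSITIVE = {"win", "winning", "great", "powerful", "useful", "affordable"}
NEGATIVE = {"problem", "poor", "pathetic", "fear", "expensive", "inconsistent"}

def extract_entities(text):
    """Very naive 'entity extraction': runs of capitalized words."""
    return Counter(re.findall(r"\b[A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*\b", text))

def sentiment_score(text):
    """Count positive vs. negative lexicon hits; return a score from -1 to 1."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

if __name__ == "__main__":
    sample = ("IBM Watson was a powerful, Jeopardy winning machine, "
              "but the computing power was expensive.")
    print(extract_entities(sample).most_common(3))  # crude entities
    print(round(sentiment_score(sample), 2))        # crude sentiment
```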

Therein lies the problem with the technology. The greater the volume of information, the more useful content analytics become - if the volume is small, humans can do the job themselves. But to apply analytics to a large volume of information, one needs significant computing power.

That’s why the actual use of analytics was confined to a very few scenarios where money was not an issue. The US intelligence agencies used supercomputers running analytics to sift through millions of intercepted messages from suspected terrorists. Similarly, IBM’s Watson, the Jeopardy-winning machine, was a supercomputer fed with encyclopedic knowledge and optimized for a single task: winning Jeopardy. Even though content analytics have been around for well over a decade, most organizations simply could not afford the computing power required.

That may be changing now, thanks to a couple of market shifts. First, Moore’s Law is helping - computing power is becoming ever more affordable, and the algorithms are becoming more powerful and efficient.

The other improvement comes from our understanding of what the technology is expected to accomplish. The early requirements for analytics and classification demanded ultra-accuracy. One of the key objections used to be the perceived unreliability of automatic classification - if it wasn't 100% accurate, it was no good. But today, we understand that the alternative is not perfect by any stretch. The alternative is to rely on humans, who are notoriously poor and inconsistent at analyzing and classifying content. Getting to 60% accuracy is a typical result of human classification.

That means a technology that can get us to 80% or even 90% accuracy is actually much more accurate than humans - and this kind of approach doesn’t require nearly as much computing power as an attempt to reach 99.9% accuracy.

For example, one of the common applications for analytics is legal discovery - the need to quickly produce any electronic evidence requested by a court subpoena. Here, the goal today is to produce all the relevant documents and emails with a defensible level of accuracy. Of course we don’t want to pay expensive lawyers for manual review of thousands of documents. They are paid by the hour - and paid rather well. But we also don’t want to stand accused of failing to produce an important piece of evidence. Until recently, the fear was that unless we could prove we had electronically discovered every pertinent document, the approach would not be defensible in a court of law. And only humans (ehm, lawyers) could guarantee such accuracy - for a hefty fee...

Today, that has changed. The courts increasingly understand the futility of aiming for 100% accuracy and instead accept statistical sampling as a way to confirm accuracy at a reasonable level. After all, both opposing parties - the plaintiff and the defendant - are in the same boat when it comes to reviewing a mountain of electronic evidence. That means the lawyers no longer have to review every document. Instead, statistical evidence of accuracy is considered defensible.
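To make the sampling argument concrete, here is a minimal sketch - my own illustration, not a legal or statistical standard - of the basic idea: have humans review only a random sample of the auto-classified documents and report the observed accuracy with a confidence interval (a Wilson score interval in this sketch) instead of reviewing the whole corpus.

```python
import math
import random

def estimate_accuracy(reviewer_agrees, confidence=0.95):
    """Estimate classifier accuracy from a human-reviewed random sample.

    reviewer_agrees: list of booleans, True where the reviewer agreed with
    the automatic classification. Returns (observed accuracy, low, high)
    using the Wilson score interval.
    """
    n = len(reviewer_agrees)
    p = sum(reviewer_agrees) / n
    z = 1.96 if confidence == 0.95 else 2.576  # 95% or 99% only, for brevity
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - margin, center + margin

if __name__ == "__main__":
    random.seed(42)
    # Pretend the classifier agrees with human reviewers ~85% of the time
    # on a 400-document sample drawn from a much larger corpus.
    sample = [random.random() < 0.85 for _ in range(400)]
    acc, low, high = estimate_accuracy(sample)
    print(f"observed accuracy {acc:.1%}, 95% interval [{low:.1%}, {high:.1%}]")
```

The point is that the size of the sample, not the size of the corpus, drives the width of the interval - which is exactly why sampling is so much cheaper than exhaustive review.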

As a result, auto-classification no longer has to aim for 100% accuracy. Instead, a more reasonable level of accuracy backed by statistical sampling has become acceptable - because it is still much better than what humans could ever do manually. And cheaper, of course. That makes analytics much more effective and affordable. With that, analytics are no longer confined to the world of supercomputers and are finding real-world use cases. Analytics may indeed have crossed the chasm.
