Monday, November 7, 2011

Parallels between Document Capture and Voice Recognition

I was doing some research about the history of document capture last week. As I was reading about the early imaging machines capable of scanning 30 checks or lottery tickets per second, I came to realize an interesting parallel between the world of document capture and voice recognition.

At first, the purpose of document capture was simply to create a readable image of a paper document that could be stored and shared electronically. That alone was a big improvement in efficiency. The analogy in the audio world would be the creation of the MP3 standard, which allowed us to make inexpensive recordings of music and share them easily via services such as Napster. Too easily, complained the entertainment industry over and over, until Apple came and took over their business.

The next milestone in image capture was optical character recognition (OCR), which allowed us to extract the text from the image and make it searchable. Intelligent character recognition (ICR) augmented these capabilities by extracting handwritten text. That was particularly important for the high-volume imaging systems processing millions of checks or lottery tickets. In the audio world, OCR and ICR are akin to speech recognition software such as Dragon NaturallySpeaking by Nuance or IBM’s ViaVoice. The purpose of this software is to convert speech into searchable text - just like OCR.

OCR and voice recognition are both about searchable text

Finally, document capture evolved to the point where it became possible to automatically detect the document type through document recognition (e.g., invoice, job application, or travel expense report) and subsequently extract the actual data values from the document. Not just text, but rather metadata fields such as billing address, date, total, or payment terms. As a result, document capture can be connected directly with process automation software such as workflow or business process management (BPM) to gain even greater efficiencies from automated document processing.
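To make the two steps concrete, here is a minimal sketch in Python of "recognize the document type, then extract the fields." Everything in it is an assumption made for illustration: the keyword lists, the field patterns, and the function names are hypothetical, and real capture products rely on trained classifiers and layout analysis rather than simple keyword matching.

```python
import re

# Hypothetical keyword lists per document type (illustration only).
DOC_TYPE_KEYWORDS = {
    "invoice": ["invoice", "payment terms", "total due"],
    "travel_expenses": ["travel", "mileage", "per diem"],
    "job_application": ["resume", "position applied", "references"],
}

# Hypothetical field patterns for invoices (illustration only).
INVOICE_FIELDS = {
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total due:\s*\$?([\d,]+\.\d{2})", re.IGNORECASE),
    "payment_terms": re.compile(r"Payment terms:\s*(.+)", re.IGNORECASE),
}


def classify(text):
    """Step 1: guess the document type by counting keyword hits."""
    lowered = text.lower()
    scores = {
        doc_type: sum(keyword in lowered for keyword in keywords)
        for doc_type, keywords in DOC_TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def extract_invoice_fields(text):
    """Step 2: pull metadata fields out of the recognized (OCR) text."""
    fields = {}
    for name, pattern in INVOICE_FIELDS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1).strip()
    return fields


def capture(text):
    """Classify first, then extract fields only for types we know how to read."""
    doc_type = classify(text)
    fields = extract_invoice_fields(text) if doc_type == "invoice" else {}
    return {"type": doc_type, "fields": fields}


# Made-up OCR output standing in for a scanned invoice.
SAMPLE = """INVOICE
Date: 2011-11-07
Payment terms: Net 30
Total due: $1,250.00
"""

result = capture(SAMPLE)
print(result)
```

The dictionary returned by `capture` is the kind of structured record that a workflow or BPM system could act on, for example by routing an invoice for approval based on its total.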

In the audio world, the analogous technology is voice control, such as the recently introduced personal voice assistant Siri by Apple. The idea of this software is to combine voice commands with dictation (voice-to-text capture). The commands can make the computer perform a task or a process step. Many phones have long understood basic voice operations such as “Call home” but those are just shortcut commands comparable to barcodes and QR codes in the document capture world.


Understanding the meaning of natural language without requiring the user to learn predefined commands takes voice recognition to a different level. Such voice control has been featured in many sci-fi movies from 2001: A Space Odyssey to Avatar but so far remains mostly in the experimental stage. Microsoft has promised to ship a new version of Xbox with voice control for tasks such as movie or music search, which could be extremely useful. Siri appears to be the first intelligent voice control-based software entering the mass market with capabilities such as scheduling appointments, searching for music, sending messages, or checking the weather.

Voice recognition technology has been following an innovation trajectory similar to that of document capture. Today, software such as Siri raises voice technology to a level that is on par with the state of the art in document capture. It will be interesting to see what innovations will emerge in both of these worlds. In the meantime, we should practice interacting with a computer in natural language because Voice Recognition is about to Re-Wire our Brains.
