AIIM - OCR and Robots?

Some non-conventional uses of OCR technology that will make it outlast the need to convert text from a paper document into electronic form.

— Chris Riley

How optical character recognition (OCR) technology is used on paper documents is well understood. Those who currently use OCR speculate about the future of the technology as Electronic Data Interchange (EDI) becomes more popular and paper less so. The debate over whether paper will still exist in ten, twenty, or thirty years continues, and the arguments on both sides are good. A recent AIIM study found that imaging actually increased paper consumption due to increased printing, which suggests that paper will not vanish quickly. While I don't believe that paper is going to just magically disappear, one thing I do know is that the need to OCR paper is diminishing rapidly, and this is already obvious in certain verticals. For example, in healthcare, document types such as EOBs and medical claims are increasingly available via EDI, to the tune of a 30% increase per year. So what happens to OCR technology when we no longer need it to get information from paper?

I think OCR will be useful for a long time. Here are some things you might not have thought about regarding ongoing use of the tool.
 
Creation and Detection of Viruses and Spam
OCR can be, and is, used to thwart spammers, and even to detect or create viruses. For years spammers have realized that by embedding images of text in their messages they avoid the text analysis processes that detect the keywords that identify spam. But there is a way to get around this. You can run the images through an OCR engine and feed the result to your same text analysis process, catching even more spammers. This is deployed in some anti-spam applications, and its usage will become even more popular as the technology becomes more of a commodity. Today, this is primarily done in server-side spam detection rather than in client-side applications.
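The idea of reusing one text analysis step for both the message body and the text inside its images can be sketched in a few lines of Python. This is a minimal illustration, not how any particular anti-spam product works: the keyword list, the function names, and the `ocr` callable (which stands in for whatever OCR engine is available) are all assumptions made for the example.

```python
import re
from typing import Callable, Iterable, Set

# Hypothetical keyword list; a real filter would use a trained model or
# a much larger rule set.
SPAM_KEYWORDS = {"lottery", "winner", "free money"}


def keyword_hits(text: str) -> Set[str]:
    """Return the spam keywords found in text, ignoring case and extra spaces."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return {keyword for keyword in SPAM_KEYWORDS if keyword in normalized}


def scan_message(
    body: str,
    images: Iterable[bytes],
    ocr: Callable[[bytes], str],
) -> Set[str]:
    """Apply the same keyword analysis to the body and to OCR'd image text.

    `ocr` is an assumed callable that takes raw image bytes and returns
    the text recognized in that image.
    """
    hits = keyword_hits(body)
    for image in images:
        hits |= keyword_hits(ocr(image))
    return hits
```

Passing the OCR engine in as a parameter keeps the sketch independent of any specific library; in practice `ocr` would wrap a real engine, and a message would be flagged when `scan_message` returns a non-empty set.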

This trick seems obvious when you think about it, but how does OCR prevent viruses? If you are familiar with how viruses work, you know that, occasionally, viruses come to your machine as an invited friend of an already installed malware application. This works because already installed applications are granted greater access to machine resources than applications that are not yet installed. Usually the virus portion of the attack (the payload) is received from a website or silently downloaded. Virus protection applications are very good at spotting both the malware and the payload when it comes across as a text stream. But when the payload comes across as an image containing the code for the payload, it's a little trickier. The attacker is banking on the image passing the virus check, after which the malware uses OCR to convert the image to text, then compiles and runs it secretly. Now anti-virus engines are getting wise to this process and can OCR the image first to see if there is any code in it, stopping the payload before it even has a chance.
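A rough sketch of "OCR the image first to see if there is any code in it" might look like the following. The patterns, the threshold, and the `ocr` callable are illustrative assumptions; a real anti-virus engine would rely on far more robust signatures than a handful of regular expressions.

```python
import re
from typing import Callable

# Hypothetical markers of source code; chosen only for illustration.
CODE_PATTERNS = [
    re.compile(r"\b(eval|exec|system)\s*\("),
    re.compile(r"\bpowershell\b", re.IGNORECASE),
    re.compile(r"#include\s*<"),
    re.compile(r"\bimport\s+\w+"),
]


def looks_like_code(text: str, threshold: int = 2) -> bool:
    """Return True when at least `threshold` distinct code markers appear."""
    hits = sum(1 for pattern in CODE_PATTERNS if pattern.search(text))
    return hits >= threshold


def image_carries_code(image: bytes, ocr: Callable[[bytes], str]) -> bool:
    """OCR the image, then check whether the recognized text resembles code.

    `ocr` is an assumed callable that maps image bytes to recognized text.
    """
    return looks_like_code(ocr(image))
```

Requiring two distinct markers is a deliberate choice in this sketch: ordinary prose can easily contain one of these words, so a single match is treated as too weak a signal.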

Screen Scraping - Legacy System Migration

One of the biggest challenges in the IT space is migration from legacy systems, often mainframes, to modern operating systems and applications. Legacy systems are often classic green-screen UNIX or DOS systems. They are still in operation because of the critical nature of the data they contain. The vendors who make these systems have every intention of making it very hard to migrate from them. But there is a way, and it works very well.

OCR.

When you don't have any of the great standards that allow the exchange of data from one system to another, you always have what you can see. If you can see it, you can OCR it. By taking "pixel perfect" screenshots and reading them with OCR, data can be moved no matter the system or communication method. Either manually or programmatically, OCR is used to read screenshots from legacy systems into a new system, just as you would copy and paste data from one application to the next.

But OCR of screenshots is not just for migration from legacy systems. A big complaint from customers of enterprise software is how complex integration is and how poorly one software package communicates with another. Communication between expensive enterprise applications is critical, and connectors are very expensive to purchase or develop. Developers have to learn new APIs, which is time consuming and sometimes very frustrating, depending on vendor-to-vendor support. With screen scraping and OCR, you can write one method to get data off the screen of ANY active application window, search for the relevant content, and, presto, you never have to do it again. Using OCR to move data from one location to another is one of the most ingenious ways to ensure the neutrality of your data. Vendor lock-down attempts, old technology, or lack of connectivity should not prevent you from getting to what you own: the information.
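The "one method" idea can be sketched as a function that captures a window, OCRs it, and pulls out labeled fields. Everything here is an assumption made for illustration: `capture` and `ocr` are placeholder callables for a real screenshot tool and OCR engine, and the `LABEL: value` layout is just one plausible green-screen format.

```python
import re
from typing import Callable, Dict

# Matches lines such as "CUSTOMER NO: 00123"; the label must start with a letter.
FIELD_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 #/]*?)\s*:\s*(.+?)\s*$")


def extract_fields(screen_text: str) -> Dict[str, str]:
    """Parse 'LABEL: value' lines from OCR'd screen text into a dictionary."""
    fields: Dict[str, str] = {}
    for line in screen_text.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            fields[match.group(1).upper()] = match.group(2)
    return fields


def scrape_window(
    capture: Callable[[], bytes],
    ocr: Callable[[bytes], str],
) -> Dict[str, str]:
    """Capture the active window, OCR it, and return the fields found.

    `capture` returns screenshot bytes and `ocr` turns them into text;
    both are assumed callables supplied by the caller.
    """
    return extract_fields(ocr(capture()))
```

Because the capture and recognition steps are injected, the same `scrape_window` call works against any application whose screen can be photographed; only the field pattern would need adjusting for a different screen layout.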

Language Detection
By converting a document to an image for OCR, I can check the language of each word in the document. What is surprising in our digital world is that the use of a font (language character set) in a digital document does not indicate its language. Confused? You should be. Just because you write in a particular language does not mean your digital document has the language encoded in it. Documents represent a language within their code via a language encoding. This is how software can determine a language very quickly and accurately. But when this encoding is absent, there is not much that can be done digitally. In those cases, OCR's ability to read the shapes of characters is needed to determine the language. Another unique aspect of OCR engines is that they contain morphology and dictionaries. This is where OCR has improved its accuracy in the past five years. OCR engines attempt to identify the language of text in order to better read the document. Because this mechanism is already built into the engine, if I convert a digital file to an image and OCR it, I can tell you what languages exist in that document. Yes, I said print the digital file to an image just to OCR it. Crazy, huh? While I would much prefer to use a language detection tool on a digital file, there is no robust tool that exists to do this at volume yet.
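The dictionary side of this can be illustrated on text that an OCR engine has already produced. The sketch below is a toy, not a description of how any OCR engine works internally: the three word lists are tiny placeholders, and a real engine would combine full dictionaries with morphology and character-shape evidence.

```python
import re
from collections import Counter

# Tiny placeholder dictionaries of common words, assumed for illustration.
DICTIONARIES = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "french": {"le", "la", "et", "est", "les", "des"},
    "german": {"der", "die", "und", "ist", "das", "nicht"},
}


def languages_in(text: str) -> Counter:
    """Count, per language, how many words of the text appear in its dictionary."""
    counts: Counter = Counter()
    # [^\W\d_]+ matches runs of letters, including accented ones.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for language, words in DICTIONARIES.items():
            if word in words:
                counts[language] += 1
    return counts
```

Calling `languages_in` on the recognized text of a page returns a tally per language, so a mixed-language document shows up as more than one non-zero count.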

Normalization of Digital Formats
While a PDF created in Acrobat and a PDF created in a third-party tool look identical to the viewer, internally these PDF files are very different. In order to parse a PDF file digitally and accurately, you need a standard internal format. If you do not have one, you are dealing with variations both in how the document looks and in its internal structure. This becomes an overwhelming number of variations. For example, a collection of invoices has as many variations as there are invoice layouts multiplied by the number of PDF-generating applications. However, if you were to OCR the PDF and parse the result, rather than parsing it digitally, you are dealing with only the variants that exist in the invoices themselves. This is true of other formats as well, but PDF is the most common.

File Compression
Storing ASCII text takes up far less space than an image or video file. As part of the future of compression technologies, expect that OCR will be used to extract the text from an image and save it as ASCII or hexadecimal data instead of its image equivalent, thus dramatically reducing file size. Viewers will convert the text back to an image during viewing.
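A back-of-the-envelope comparison shows the scale of the difference. The page size, resolution, and character count below are assumptions chosen for illustration, and the image figure is for an uncompressed black-and-white bitmap; real scanned images are normally compressed, so the gap in practice is smaller than this.

```python
def bitmap_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int = 1) -> int:
    """Size in bytes of an uncompressed bitmap of the given page dimensions."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def text_bytes(text: str) -> int:
    """Size in bytes of the same content stored as ASCII text."""
    return len(text.encode("ascii", errors="replace"))


# Assumed example: a letter-size page scanned at 300 dpi in black and white,
# holding roughly 3,000 characters of text.
page_image = bitmap_bytes(8.5, 11, 300)
page_text = text_bytes("x" * 3000)
print(f"image: {page_image} bytes, text: {page_text} bytes")
```

The text version stores only the characters, so fonts, layout, and any non-text marks on the page are lost unless they are recorded separately.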

Robots
How else do you expect the robots of the future to read text? OCR, of course. The eyes of the robot are essentially a camera that rapidly takes pictures. When the robot is faced with comprehending text, the image will be converted using OCR and fed through an engine to gain meaning from the text and act on it. On the more practical side is the reading of license plates, signs, etc. This particular use of the technology will ensure its existence forever.

When you become an expert in OCR, you find yourself using the technology in the oddest places. While there is no question that the need for the technology will diminish on paper, I've shown you some interesting ways in which the technology is beneficial and will continue to be.

Chris Riley (chris.riley@livinganalytics.com) is founder of Living@nalyitcs (www.livinganalytics.com) where he uses his in-depth knowledge of data capture technologies to advise clients and proselytize the value of these tools.


 



