Some non-conventional uses of OCR technology that will make it outlast the need to convert text from a paper document into electronic form.
How optical character recognition (OCR) technology is
used on paper documents is well understood. Those who currently use OCR
speculate about the future of the technology as Electronic Data Interchange
(EDI) becomes more popular and paper less so. The debate over whether paper
will exist in ten, twenty, or thirty years continues, and the arguments on both
sides are good. A recent AIIM study found that imaging actually increased paper
consumption due to increased printing. This would seem to indicate that the
rapid vanishing of paper will not happen. While I don't believe that paper is
going to just magically disappear, one thing I do know is that the need to OCR
paper is diminishing rapidly, and this is obvious in certain verticals. For
example, in healthcare, document types such as EOBs and medical claims are
increasingly available via EDI, to the tune of a 30% increase per year. So what happens to OCR
technology when we no longer need it to get information from paper?
I think OCR will be useful for a long time. Here are some things you might
not have thought about regarding ongoing use of the
tool.
Creation and Detection of Viruses and Spam
OCR can be, and is, used to thwart spammers, and even to detect or
create viruses. For years spammers have realized that by embedding images with
text in their messages they avoid the text analysis processes that detect
the keywords that identify spam. But there is a way to get around this. You
can run the images with text through an OCR engine, then apply your same text
analysis process, catching even more spammers. This is deployed in some
anti-spam applications, and its usage will become even more popular as the
technology becomes more and more of a commodity. Today, this is primarily done in
server-side anti-spam detection rather than in client-side applications.
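To make the idea concrete, here is a minimal Python sketch, not any product's actual implementation. The keyword list, the function names, and the `ocr` callable are all assumptions for illustration; in practice `ocr` would wrap whatever OCR engine you already use. The point is only that the image text goes through the same analysis as the message body.

```python
from typing import Callable, Iterable

# Hypothetical keyword list; a real filter would reuse its existing text-analysis rules.
SPAM_KEYWORDS = {"lottery", "winner", "refinance", "prize"}


def spam_score(text: str) -> int:
    """Count how many distinct spam keywords appear in the text."""
    words = {w.strip(".,!?:;").lower() for w in text.split()}
    return len(words & SPAM_KEYWORDS)


def score_message(body: str, images: Iterable[bytes],
                  ocr: Callable[[bytes], str]) -> int:
    """Score the body plus the OCR'd text of every embedded image.

    `ocr` is an assumed callable that takes raw image bytes and returns text.
    """
    parts = [body] + [ocr(image) for image in images]
    return sum(spam_score(part) for part in parts)
```

A message whose body is clean but whose image reads "You are a lottery WINNER!" now scores just as it would have if the spammer had typed the words.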
This trick seems obvious when you think about it, but how does OCR help prevent
viruses? If you are familiar with how viruses work, you know that, occasionally,
viruses come to your machine as an invited guest of an already installed
malware application. This works because already installed
applications are granted greater access to machine resources than applications
that are not yet installed. Usually the virus portion of the attack (the payload) is
received from a website or silently downloaded. Virus protection applications
are very good at spotting both the malware and the payload when it comes across
as a text stream. But when the payload comes across as an image containing the
code for the payload, it's a little trickier. The attacker is banking on the
image passing the virus check, after which the malware uses OCR to convert the
image to text, then compiles and runs it secretly. Now anti-virus engines are
getting wise to this process and can OCR the image first to see if there is any
code in it, and stop the payload before it even has a chance.
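As a rough illustration of that last step, the sketch below checks OCR output for lines that look like source code. It is a toy heuristic under my own assumptions, not how any particular anti-virus engine works; the patterns, the threshold, and the function name are hypothetical.

```python
import re

# Hypothetical patterns that suggest source code rather than ordinary prose.
CODE_PATTERNS = [
    r"\bimport\s+\w+",
    r"\beval\s*\(",
    r"\bexec\s*\(",
    r"#include\s*<",
    r"[{};]\s*$",
]


def looks_like_code(ocr_text: str, threshold: int = 2) -> bool:
    """Return True when at least `threshold` lines match a code-like pattern."""
    hits = 0
    for line in ocr_text.splitlines():
        if any(re.search(pattern, line) for pattern in CODE_PATTERNS):
            hits += 1
    return hits >= threshold
```

An image that OCRs to a greeting passes; one that OCRs to several lines of script would be held for closer inspection.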
Screen Scraping - Legacy System Migration
One of the biggest challenges in the IT space is migration from legacy
systems, often mainframes, to modern operating systems and applications. Legacy
systems are often classic green-screen UNIX or DOS systems. They are still in
operation because of the critical nature of the data they contain. The vendors
who make these systems have every intention of making it very hard to migrate
from them. But there is a way, and it works very well.
OCR.
When you don't have any of the great standards that allow the exchange of
data from one system to another, you always have what you can see. If you can
see it, you can OCR it. By taking “pixel perfect” screenshots and reading them
with OCR, data can be moved no matter the system or communication. Either
manually or programmatically, OCR is used to read screenshots from legacy
systems into a new system just as you would copy and paste data from one
application to the next. But OCR of screenshots is not just for migration from
legacy systems. A big complaint from customers of enterprise software is how
complex integration is and how poorly one software package communicates with
another. Communication between expensive enterprise applications is critical,
and the cost to purchase or develop connectors is very high. Developers
have to learn new APIs, which is time consuming and sometimes very frustrating,
depending on vendor-to-vendor support. With screen scraping and OCR, you can
write one method to get data off the screen of ANY active application window,
search for the relevant content, and, presto, you never have to do it again. Using
OCR to move data from one location to another is one of the most ingenious ways
to ensure the neutrality of your data. Vendor lock-in attempts, old
technology, or lack of connectivity should not prevent you from getting to what
you own: the information.
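The "one method" idea might look something like the following Python sketch. Everything here is an assumption for illustration: `capture` stands in for whatever takes the screenshot, `ocr` for whatever engine reads it, and the screen is assumed to show fields as "LABEL: value" pairs, as many green-screen applications do.

```python
import re
from typing import Callable, Dict, List


def extract_fields(screen_text: str, labels: List[str]) -> Dict[str, str]:
    """Pull 'LABEL: value' pairs out of OCR'd screen text."""
    found = {}
    for label in labels:
        # Tolerate stray spaces around the colon; a value ends at a run of
        # two or more whitespace characters or at the end of the line.
        pattern = re.compile(
            rf"{re.escape(label)}\s*:\s*(.+?)(?:\s{{2,}}|$)",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(screen_text)
        if match:
            found[label] = match.group(1).strip()
    return found


def scrape_window(capture: Callable[[], bytes],
                  ocr: Callable[[bytes], str],
                  labels: List[str]) -> Dict[str, str]:
    """Screenshot the active window, OCR it, and return the requested fields.

    `capture` and `ocr` are assumed callables supplied by the caller.
    """
    return extract_fields(ocr(capture()), labels)
```

Because the method only depends on what is visible, the same code works whether the window belongs to a mainframe terminal emulator or a modern enterprise package.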
Language Detection
By converting a document to an image for
OCR, I can check the language of each word in the document. What is surprising
in our digital world is that the use of a font (language character set) in a
digital document does not indicate its language. Confused? You should be. Just
because you write in a particular language does not mean your digital document
has the language encoded in it. Documents represent a language within their code
via a language encoding. This is how software can determine a language very
quickly and accurately. But when this encoding is absent, there is not much that
can be done digitally. In those cases, OCR's ability to read the shapes of
characters is necessary to determine a language. Another unique
aspect of OCR engines is that they contain morphology and dictionaries. This is
where OCR has improved its accuracy in the past five years. OCR engines attempt
to identify the language of text in order to better read the document. Because
this mechanism is already built into the engine, if I convert a digital file to
an image and OCR it, I can tell you what languages exist in that document. Yes, I
said print the digital file to an image just to OCR it. Crazy, huh? While I would
much prefer to use a language detection tool on a digital file, there is no
robust tool that exists to do this at volume yet.
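The dictionary side of this can be sketched in a few lines of Python. This is a simplification under stated assumptions: the word lists are tiny and made up for the example, whereas a real OCR engine ships full dictionaries and morphology, and the function names are hypothetical.

```python
from collections import Counter
from typing import Optional

# Tiny illustrative word lists; real engines use complete dictionaries.
DICTIONARIES = {
    "english": {"the", "and", "is", "invoice", "of"},
    "spanish": {"el", "la", "y", "es", "factura", "de"},
    "german": {"der", "die", "und", "ist", "rechnung", "von"},
}


def word_language(word: str) -> Optional[str]:
    """Return the first language whose dictionary contains the word, if any."""
    cleaned = word.strip(".,;:!?").lower()
    for language, words in DICTIONARIES.items():
        if cleaned in words:
            return language
    return None


def languages_in(text: str) -> Counter:
    """Tally how many recognized words belong to each language."""
    counts = Counter()
    for word in text.split():
        language = word_language(word)
        if language is not None:
            counts[language] += 1
    return counts
```

Run over the recognized text of a page, the tally shows which languages are present and roughly in what proportion.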
Normalization of Digital Formats
While a PDF created in Acrobat and a PDF
created in a third-party tool look identical to the viewer, internally these PDF
files are very different. In order to parse a PDF file digitally and accurately, you
have to have a standard format. If you do not have a standard
format, you are dealing with variations both in the document's visual layout and in its
internal structure. This becomes an overwhelming number of variations. For example,
a collection of invoices has as many variations as there are invoices multiplied
by the number of PDF-generating applications that exist. However, if you were to OCR the
PDF and parse the result, rather than parse the file digitally, then you are dealing with only the number
of variants that exist in the invoices themselves. This is true with other
formats as well, but PDF is the most common.
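The arithmetic behind that claim is simple enough to write down. The Python sketch below uses made-up numbers (50 invoice variants, 12 PDF-generating applications) purely for illustration; the function name is hypothetical.

```python
def parsing_variants(invoice_variants: int, pdf_generators: int,
                     via_ocr: bool) -> int:
    """Number of distinct cases a parser must handle.

    Parsing the file digitally multiplies visual variants by the number of
    PDF-generating applications; parsing OCR output leaves only the visual
    variants, because the internal structure is no longer in play.
    """
    if via_ocr:
        return invoice_variants
    return invoice_variants * pdf_generators


# Hypothetical figures, chosen only to show the scale of the difference.
digital_cases = parsing_variants(50, 12, via_ocr=False)
ocr_cases = parsing_variants(50, 12, via_ocr=True)
```

With these example figures, digital parsing faces 600 cases while OCR-based parsing faces 50.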
File Compression
Storing ASCII text takes up far less
space than does an image or video file. As part of the future of compression
technologies, expect that OCR will be used to extract the text from an image and
save it as ASCII or hexadecimal data instead of its image equivalent, thus
dramatically reducing file size. Viewers will convert the text back to an image
during viewing.
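A back-of-the-envelope Python sketch shows the scale involved. It assumes an uncompressed black-and-white (1 bit per pixel) letter-size scan at 300 dpi and a page of roughly 3,000 characters; both figures are my assumptions, and real image formats compress the raster, so the actual gap is smaller than this upper bound.

```python
def bitonal_page_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Uncompressed size in bytes of a 1-bit-per-pixel scan of one page."""
    pixels = int(width_in * dpi) * int(height_in * dpi)
    return pixels // 8


def size_ratio(image_bytes: int, text: str) -> float:
    """How many times larger the image is than the same page stored as ASCII."""
    return image_bytes / len(text.encode("ascii"))


# Assumed example: a letter-size page scanned at 300 dpi.
scan_bytes = bitonal_page_bytes(8.5, 11, 300)  # 1,051,875 bytes
```

Against a 3,000-character page of text, that scan is about 350 times larger, which is why storing the recognized text and re-rendering it at view time is attractive.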
Robots
How else do you expect the robots of the future to
read text? OCR, of course. The eyes of the robot are essentially a camera that
rapidly takes pictures. When the robot is faced with the comprehension
of text, the image will be converted using OCR and fed through an engine to gain
meaning from the text and act on it. On the more practical side is the reading
of license plates, signs, etc. This particular use of the technology will
ensure its existence forever.
When you become an expert in OCR, you find yourself using the technology in
the oddest places. While there is no question that the need for the technology will
diminish on paper, I've shown you some interesting ways in which the technology
is beneficial and will continue to be.
Chris Riley (chris.riley@livinganalytics.com) is founder of Living@nalyitcs (www.livinganalytics.com) where
he uses his in-depth knowledge of data capture technologies to advise clients
and proselytize the value of these tools.