A document scan is worth a thousand words. A digital photograph? Maybe seven hundred. You can capture and scan a document with your cell phone, but do you really want to?
You wouldn’t think that a conversation about optical character recognition
(OCR) and the data capture accuracy of a document scan versus a digital photo
would be controversial. It is. When I explain why digital photographs are a
long way from producing the same recognition results as a document scan, I
meet resistance. Here’s why photographed documents differ.
Just Because We Say It Doesn’t Make It Real
The pace of technology has us believing that technology can do anything. When
you see some of the presentations of cutting-edge technology from an
organization like TED, you can see why. Some of the technologies look so obvious
and fantastic that they just have to be out there. Well, I generally hate
clichés, but here’s an appropriate one: “If it seems too good to be true, yeah,
it probably is.”
With enough of these supposed uses of technology floating around, it’s no
wonder the general public has some misconceptions. You can use digital
photography to capture and convert documents to text. It’s more of a reality
than, oh, teleportation, but it’s not a replacement for document scanners. Five
things prevent digital photography from taking over: repeatability, angle,
layers, recognition technology, and, finally, practicality. The digital
photography I'm discussing here is not the kind found in the high-speed photo
scanners often used for books. I'm talking about using your digital camera, cell
phone, or industry-specific PDA to take a photo of a document.
Repeatability
Capturing a document with a digital camera involves substantially more
variables than scanning it. Because of this, it's nearly impossible to get
consistent results. For example, take a picture of a document page, then seconds
later retake it. Load both photos onto a computer and compare. You will
immediately notice a difference. Do the same with a document scanner and, yes,
there will be differences, but you will have to really dig to find them. The
lack of consistency can wreak havoc on a data capture and OCR setup. To
accurately process documents with large variance, you have to configure the
software to be as general as possible, and the more general the recognition, the
lower the overall accuracy. So no matter the resolution of the digital
photograph, the lack of consistency alone will mean lower accuracy.
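One rough way to put a number on the "retake it and compare" experiment is the average per-pixel brightness difference between two captures. The sketch below is only an illustration: the function name `mean_abs_difference` is hypothetical, and the tiny grids are made-up toy values standing in for two grayscale photos, not measurements.

```python
def mean_abs_difference(image_a, image_b):
    """Average per-pixel brightness difference between two grayscale images.

    Each image is a list of rows of 0-255 values; both must be the same size.
    """
    same_shape = len(image_a) == len(image_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(image_a, image_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
    )
    pixel_count = sum(len(row) for row in image_a)
    return total / pixel_count

# Toy data (made up): a dark stroke on a light page, and the same stroke
# landing one pixel to the left in a second hand-held shot.
shot_1 = [[250, 30, 250], [250, 30, 250]]
shot_2 = [[30, 250, 250], [30, 250, 250]]

print(mean_abs_difference(shot_1, shot_1))  # identical captures
print(mean_abs_difference(shot_1, shot_2))  # shifted capture
```

A shift of a single pixel is enough to produce a large difference in this toy case, which is the kind of variance recognition software has to absorb between two photos of the same page.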
Angle
In a document scanner the angle of the scan is direct, so the top horizontal
border of the document and the bottom are exactly the same width. With a digital
photograph this is naturally not the case, and it has to be taken into account
during capture. Depending on the angle of the photo, the text will either shrink
or grow from the top of the document to the bottom. During recognition the
technology looks for uniformity in font height and width, which isn’t there in a
digital photo. The solution is to spend the time to make sure the shot is as
direct as possible. Some of the industry-specific PDAs out there project the
borders of the document with a laser to remove any angle. Your job, if you have
this, is to place the document within the laser borders. Some cell phone capture
applications have something similar, with boxes on the capture screen showing
you where the document should fit. Both are useful for traditional-size
documents, but time-consuming.
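To get a feel for how much a tilted shot changes the page, here is a minimal sketch using an idealized pinhole camera. The function name `edge_width_ratio`, the 11-inch page, the 14-inch shooting distance, and the tilt angles are all assumptions chosen for illustration; a real lens adds distortion of its own.

```python
import math

def edge_width_ratio(page_height_in, distance_in, tilt_deg):
    """Apparent width of the far page edge divided by that of the near edge.

    Idealized pinhole model (an assumption for illustration): the apparent
    width of a horizontal line shrinks in proportion to its depth from the
    camera. distance_in is the depth of the page center; tilt_deg is the
    angle between the page and the camera's image plane (0 is a direct shot).
    """
    half_depth = (page_height_in / 2) * math.sin(math.radians(tilt_deg))
    near = distance_in - half_depth
    far = distance_in + half_depth
    if near <= 0:
        raise ValueError("camera is too close for this tilt")
    return near / far

for tilt in (0, 5, 15, 30):
    ratio = edge_width_ratio(11.0, 14.0, tilt)
    print(f"tilt {tilt:>2} deg: far edge is {ratio:.0%} as wide as the near edge")
```

Under these assumptions, a 15 degree tilt already makes the far edge, and the text near it, noticeably smaller than the near edge, which is the non-uniformity in font size described above.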
Layers
The problem of layers is similar to that of angle: it's solved with patience
and awareness. In a document scan, the subject image is the entire image. A
digital photo will have additional elements: a table, the floor, perhaps a
finger. The goal when taking a digital photograph should be to fill your
camera's screen with the document, border to border in both height and width,
which reduces the number of layers. If the document is an awkward size, it is
best to put it on a surface that fills the rest of the screen so that at most
two layers have to be identified.
Recognition Technology
The cores of the OCR engines that exist today were built around the concept of
a document scan. Because it takes about 50 man-years to create a new OCR engine
with current approaches, changing these cores is really not an option. To
really excel at photographed document recognition, an engine specific to the
task would need to be created. I suggested in my “Operation OCR Re-Birth”
article that this will, at some point, happen. Until it does, OCR technology's
foundation is document scans, and that is where the engines will be most
accurate. While some OCR engines have had specific effort spent on fine-tuning
them for digital photographs, the cores of these engines are still built for
document scanners. These fine-tuned engines are better, but still not at the
quality you get from a document scan. They incorporate color and image
pre-processing and, strangely enough, remove some of the “experts” in the
engine that improve document scan OCR but perhaps damage digital photograph OCR.
Practicality
It may not be obvious until you try to capture several documents with a
camera, but once you have, you realize very quickly that this is a job for
ad-hoc capture only, not for multiple-page or double-sided documents. Consider
all my points above on angle and layers, plus everything the photographer now
has to manage (no shaking, minimal layers, consistent angles), and a good
capture will take 15 to 30 seconds. A document scan of a double-sided page at
300 DPI takes about 2 seconds, or roughly 1 second per side. Per side, the
scanner is 15 to 30 times faster.
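The arithmetic behind that comparison, as a small sketch. The timings are the ones quoted above; the 100-side job is a hypothetical example, and paper handling time is ignored for both methods.

```python
def seconds_per_side(total_seconds, sides):
    """Average capture time for one side of a page."""
    return total_seconds / sides

# Figures quoted in the text.
scan_per_side = seconds_per_side(2.0, 2)   # about 2 s for a double-sided page
photo_low, photo_high = 15.0, 30.0         # one good photo captures one side

print(f"scanner advantage: {photo_low / scan_per_side:.0f}x "
      f"to {photo_high / scan_per_side:.0f}x per side")

# Hypothetical job: 50 double-sided sheets, so 100 sides.
sides = 100
print(f"scanner: {sides * scan_per_side / 60:.1f} min")
print(f"camera:  {sides * photo_low / 60:.0f} to {sides * photo_high / 60:.0f} min")
```

For the hypothetical 100-side job this works out to under two minutes on the scanner against 25 to 50 minutes with a camera.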
Testing Is Believing
We all want to believe in technology, but sometimes we have to pay more
attention to the reality of use. Digitally capturing documents is a practice I
use regularly on ad-hoc documents that I want to remember, when I’m not terribly
concerned about the recognition accuracy. The biggest argument I receive against
the above lecture is resolution (you’ll note that I haven’t discussed
resolution, because it’s not really an issue). Yes, a 12 megapixel camera will
have a higher resolution than a 300 dpi scan. But resolution isn’t everything
(see “Scanning Can Make or Break Recognition”). Document scans are not going
away, and in industries that are highly regulated and/or where accuracy is key,
document scanning will remain the best practice.
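On the resolution argument above, a rough sketch of how a photo's pixel count translates into dots per inch on the page. The 3000 x 4000 pixel frame (one common 12 megapixel layout), the US Letter page, and the `fill_fraction` parameter are assumptions for illustration, not figures from any particular camera.

```python
def effective_dpi(image_px, page_in, fill_fraction=1.0):
    """Pixels per inch that actually land on the page.

    image_px: (width, height) of the photo in pixels.
    page_in: (width, height) of the page in inches.
    fill_fraction: share of the frame's width and height the page covers
    (1.0 means the page fills the frame along its limiting dimension).
    """
    width_px, height_px = image_px
    width_in, height_in = page_in
    return fill_fraction * min(width_px / width_in, height_px / height_in)

frame = (3000, 4000)   # assumed 12 megapixel frame, portrait orientation
letter = (8.5, 11.0)   # US Letter page in inches

for fill in (1.0, 0.8, 0.6):
    dpi = effective_dpi(frame, letter, fill)
    print(f"page fills {fill:.0%} of the frame: about {dpi:.0f} dpi on the page")
```

With these assumptions a perfectly framed shot comes to about 353 dpi, but a page covering 80 percent of the frame drops to about 282 dpi, below the 300 dpi scan, before angle and consistency are even considered.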
Chris Riley (chris.riley@livinganalytics.com)
is founder of Living@nalyitcs (www.livinganalytics.com)
where he uses his in-depth knowledge of data capture technologies to advise
clients and proselytize the value of these tools.