Document Versus Photo Capture and Scanning

A document scan is worth a thousand words. A digital photograph? Maybe seven hundred. You can capture and scan a document with your cell phone, but do you really want to?

You wouldn’t think that a conversation about optical character recognition (OCR) and the data capture accuracy of a document scan versus a digital photo would be controversial. It is. When I explain why digital photographs are a long way from producing the same recognition results as a document scan, I meet with contention. Here’s why photo scans differ.

Just Because We Say It Doesn’t Make It Real
The pace of technology has us believing that technology can do anything. When you see some of the presentations of cutting-edge technology from an organization like TED, you can see why. Some of the technologies look so obvious and fantastic that they just have to be out there. Well, I generally hate clichés, but here’s an appropriate one: “If it seems too good to be true, yeah, it probably is.”

With enough of these supposed uses of technology floating around, it’s no wonder the general public has some misconceptions. You can use digital photography to capture and convert documents to text. It’s more of a reality than, oh, teleportation, but it’s not a replacement for document scanners. Five areas prevent digital photography from taking over: repeatability, angle, layers, recognition technology, and, finally, practicality. The type of digital photography I'm discussing here is not the kind found in the high-speed photo scanners often used for books. I'm talking about using your digital camera, cell phone, or industry-specific PDA to take a photo of a document.

Repeatability
Digital photo capture involves substantially more variables than a document scan. Because of this, it's nearly impossible to achieve consistency. For example, take a picture of a document page, then retake it seconds later. Load both photos onto a computer and compare them. You will immediately notice a difference. Do the same with a document scanner and, yes, there will be differences, but you will have to really dig to find them. The lack of consistency can wreak havoc on a data capture and OCR setup. To accurately process documents with large variance, you have to allow the software to be as general as possible, and the more general the recognition technology, the lower the overall accuracy. Thus, no matter the resolution of the digital photograph, the lack of consistency alone will mean lower accuracy.
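One way to see the repeatability gap is to put a number on it. The sketch below is illustrative only: it assumes the two captures have already been decoded into equal-sized grids of grayscale values (0 to 255), and the function name `mean_abs_difference` and the tiny sample grids are invented for the example rather than taken from any capture product.

```python
def mean_abs_difference(image_a, image_b):
    """Average absolute per-pixel difference between two grayscale images.

    Each image is a list of rows of 0-255 integers; both must have the same shape.
    """
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of rows")
    total = 0
    count = 0
    for row_a, row_b in zip(image_a, image_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    return total / count


# Hypothetical 2x3 patches taken from two captures of the same page.
first_capture = [[250, 20, 250], [250, 20, 250]]
second_capture = [[240, 35, 245], [230, 50, 235]]  # lighting and position drifted

print(mean_abs_difference(first_capture, first_capture))   # a capture compared with itself
print(mean_abs_difference(first_capture, second_capture))  # the retaken photo
```

A score near zero means the two captures are nearly identical. Two passes through a scanner should land much closer to zero than two handheld photos taken seconds apart.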

Angle
In a document scanner the angle of the scan is direct, so the top and bottom horizontal borders of the document are exactly the same width. With a digital photograph this is naturally not the case, and it has to be taken into account during capture. Depending on the angle of the photo, the text in the document will either shrink or grow in size from top to bottom. During recognition the technology looks for uniformity in font height and width, which isn’t there with a digital photo. The solution is to spend the time to make sure the shot is as direct as possible. Some of the industry-specific PDAs out there project the borders of the document with a laser to remove any angle. Your job, if you have one of these, is to place the document within the laser borders. Some cell phone capture applications have something similar, with boxes on the capture screen showing you where the document should fit. Both are useful for traditional-size documents, but time-consuming.
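The effect of angle can be expressed as a simple ratio between the top and bottom border widths. The following is a minimal sketch, assuming the four page corners have already been located in the image (corner detection is out of scope here); the function name `keystone_ratio` and the corner coordinates are hypothetical.

```python
import math


def keystone_ratio(top_left, top_right, bottom_right, bottom_left):
    """Width of the page's top border divided by the width of its bottom border.

    Corners are (x, y) pixel coordinates. A direct, scanner-like capture gives 1.0.
    """
    top_width = math.dist(top_left, top_right)
    bottom_width = math.dist(bottom_left, bottom_right)
    if bottom_width == 0:
        raise ValueError("bottom border has zero width")
    return top_width / bottom_width


# A letter-size page scanned at 300 dpi: both borders are 2550 pixels wide.
scan_corners = [(0, 0), (2550, 0), (2550, 3300), (0, 3300)]

# Hypothetical photo taken with the camera tilted back: the top border looks narrower.
photo_corners = [(300, 200), (2100, 200), (2400, 3000), (0, 3000)]

print(keystone_ratio(*scan_corners))
print(keystone_ratio(*photo_corners))
```

The further the ratio drifts from 1.0, the more the characters at the top of the page differ in size from those at the bottom, which is the non-uniformity that recognition struggles with.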

Layers
The problem of layers is similar to the problem of angle: it's solved with patience and awareness. In a document scan, the subject is the entire image. A digital photo will have additional elements: a table, the floor, perhaps a finger. The goal when taking a digital photograph should be to fill your camera's screen with the document's borders in both height and width, which reduces the number of layers. If the document's size is awkward, it is best to put it on a surface that will fill the rest of the screen, so that at most there are only two layers to be identified.

Recognition Technology
The core of the OCR engines that exist today was built upon the concept of a document scan. Because it takes about 50 man-years to create a new OCR engine with current approaches, changing these cores is really not an option. To really excel at photographed document recognition, an engine specific to the task would need to be created. I suggested in my “Operation OCR Re-Birth” article that this will, at some point, happen. Until it does, however, OCR technology's foundation is document scans, and that is where the engines will be most accurate. While some OCR engines have had specific effort spent on fine-tuning them for digital photographs, the cores of these engines are still built for document scanners. These fine-tuned engines are better, but still not at the same quality. They incorporate color and image pre-processing and, strangely enough, remove some of the “experts” in the engine that improve document scan OCR but perhaps damage digital photograph OCR.

Practicality
It may not be obvious until you try to capture several documents with a camera, but once you have, you realize very quickly that this is a job for ad-hoc capture only, not for multiple-page or double-sided documents. Consider all my points above on angle and layers, plus the additional care the photographer must now take (no shaking, minimal layers, consistent angles), and a good capture will take between 15 and 30 seconds. A document scan of a double-sided page at 300 DPI takes about 2 seconds, or roughly 1 second per side. Per side, that makes scanning 15 to 30 times faster.
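The arithmetic scales quickly with page count. Here is a rough worked example using the timings quoted above; the 50-sheet document is an assumption chosen for illustration, and the function names are hypothetical.

```python
SCAN_SECONDS_PER_SHEET = 2.0            # double-sided page at 300 DPI (figure from the text)
PHOTO_SECONDS_PER_SIDE = (15.0, 30.0)   # good photo capture, low and high estimates (from the text)


def scan_seconds(sides):
    """Time to scan `sides` page sides on a duplex scanner (two sides per pass)."""
    return sides * SCAN_SECONDS_PER_SHEET / 2


def photo_seconds(sides, seconds_per_side):
    """Time to photograph `sides` page sides, one shot per side."""
    return sides * seconds_per_side


sides = 50 * 2  # assumed example: a 50-sheet, double-sided document
scan_total = scan_seconds(sides)
for per_side in PHOTO_SECONDS_PER_SIDE:
    photo_total = photo_seconds(sides, per_side)
    print(f"photo at {per_side:.0f} s/side: {photo_total / 60:.0f} min, "
          f"scan: {scan_total / 60:.1f} min, "
          f"scanner is {photo_total / scan_total:.0f}x faster")
```

For the assumed 100 sides, photographing takes 25 to 50 minutes, while the scanner finishes in under 2 minutes.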

Testing Is Believing
We all want to believe in technology, but sometimes we have to pay more attention to the reality of use. Digitally capturing documents is a practice I use regularly on ad-hoc documents that I want to remember and when I’m not terribly concerned about the recognition accuracy. The biggest argument I receive against the above lecture is resolution (you’ll note that I haven’t discussed resolution, because it’s not really an issue). Yes, a 12-megapixel camera will have a higher resolution than a 300 dpi scan. But resolution isn’t everything (see “Scanning Can Make or Break Recognition”). Document scans are not going away, and in industries that are highly regulated or where accuracy is key, document scanning will remain the best practice.
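For readers who want to check the resolution comparison themselves, here is a small sketch of the arithmetic. It assumes a US Letter page (8.5 by 11 inches) and a 12-megapixel sensor of 4000 by 3000 pixels held in portrait orientation; both are illustrative assumptions, as are the function names and the `fill_fraction` parameter.

```python
def scan_megapixels(width_in, height_in, dpi):
    """Total pixel count, in megapixels, of a page scanned at the given dpi."""
    return (width_in * dpi) * (height_in * dpi) / 1_000_000


def effective_dpi(sensor_short_px, sensor_long_px, page_short_in, page_long_in,
                  fill_fraction=1.0):
    """Pixels per inch landing on the page when it is photographed squarely.

    fill_fraction is the share of each frame dimension available to the page
    (1.0 means the page is framed as tightly as its shape allows).
    """
    dpi_short = sensor_short_px * fill_fraction / page_short_in
    dpi_long = sensor_long_px * fill_fraction / page_long_in
    return min(dpi_short, dpi_long)  # the tighter dimension limits the scale


print(scan_megapixels(8.5, 11.0, 300))                          # 300 dpi letter scan
print(effective_dpi(3000, 4000, 8.5, 11.0))                     # page fills the frame
print(effective_dpi(3000, 4000, 8.5, 11.0, fill_fraction=0.7))  # loosely framed page
```

Under these assumptions the 300 dpi scan is about 8.4 megapixels, and the tightly framed photo works out to roughly 353 dpi, which is consistent with the camera having the higher nominal resolution. Framing the page at 70 percent of the screen drops the photo to roughly 247 dpi, so the figure depends heavily on how the shot is taken.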

Chris Riley (chris.riley@livinganalytics.com) is founder of Living@nalytics (www.livinganalytics.com), where he uses his in-depth knowledge of data capture technologies to advise clients and proselytize the value of these tools.