Scanning Can Make or Break Recognition

While 300 dpi is a good enough setting for most documents, there are other variables to consider. Here’s a look at resolution and bit depth and what you need to be thinking about to maximize the recognition efficiency of your capture process.

By Chris Riley

While I’m not a great fan of cliché slogans such as “junk in, junk out,” I’m forced to admit that it’s 100% accurate when it comes to scanning and OCR (optical character recognition).

OCR and scanning are linked; any conversation I have about OCR includes a discussion of scanning. The goal is to give the OCR engine the best chance with the best possible image while balancing all the additional requirements. While scanning is its own distinct process, it has a phenomenal influence on OCR accuracy. Most users want there to be a universal (that is, a single best) scan setting. I’m sure this idea is fed by the fact that most scanning hardware has a setting specifically for OCR scanning, implying that it could be just that easy. Ironically, in my own practice I’ve never used the default OCR scan settings, and I have actually discovered instances where that setting hurts the results.

Unfortunately, there is no “catch-all” scanner setting that will always give you the best image for OCR. Instead, I’ll point out some methodologies and best practices to consider when scanning for OCR.

First, let’s start with the key settings: resolution and bit depth.

When you talk about resolution you are not just talking about the best resolution for OCR, but you are also making a decision about the image file saved after OCR. Some questions to ask are:

  • How are the input images being saved after OCR?
  • Will there be a future need to re-purpose the images?
  • Is storage space a concern?

These questions will quickly frame an answer to the question “At what resolution should I scan?” The magic number is 300 dpi (dots per inch); in general, 300 dpi is the best solution for both OCR and storage. A 300 dpi image

  • can be repurposed,
  • can produce a generally acceptable file size,
  • and, for OCR, is generally high enough in quality that going any higher affects accuracy only nominally, if at all.
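
To put the file-size trade-off in numbers, here is a minimal sketch that computes the pixel dimensions and raw (uncompressed) size of a scanned page. The US Letter page size and the function names are assumptions chosen for illustration, and real files will be smaller once compression is applied.

```python
def page_pixels(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw image size in bytes, before any compression is applied.

    Row padding used by some file formats is ignored.
    """
    w, h = page_pixels(width_in, height_in, dpi)
    return (w * h * bits_per_pixel + 7) // 8  # round up to whole bytes


# US Letter page (8.5 x 11 in), 8-bit grayscale, at three resolutions.
for dpi in (150, 300, 600):
    size = uncompressed_bytes(8.5, 11, dpi, 8)
    print(f"{dpi} dpi: {size / 1_000_000:.1f} MB uncompressed")
```

Because pixel count grows with the square of the resolution, each doubling of dpi quadruples the raw size, which is why 300 dpi tends to be the point where quality and storage balance.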

Below 300 dpi, you will usually notice a substantial decrease in accuracy. Some say that 300 dpi files are too large, but you do not have to feed your OCR engine the same image you output. It is not uncommon to OCR a 300 dpi image but output a 150 dpi file for reference. This is often overlooked.

The contrary point of view is that 300 dpi isn’t high enough for re-purposing. The trick: use the “dual stream” functionality found in most modern document scanners. One scanned image goes down the OCR path at 300 dpi; another goes down the path of storage, viewing, and re-purposing at, say, 600 dpi. There are two reasons you typically do not want to OCR images at a resolution higher than 300 dpi. First, modern commercial engines are built, trained, and tested on 300 dpi images. Second, the decrease in OCR speed at a higher resolution is often not worth the fraction-of-a-percent increase in accuracy or the small reduction in verification.

The last common argument I’ve heard against 300 dpi is scanning speed: people generally argue that 300 dpi slows scanning down too much. If you were to measure the time it takes a 300 dpi image to go through scan, OCR, verify, and store, compared to, say, a 150 dpi image, you might be surprised to find that the 300 dpi image actually takes less time. The reason: while 150 dpi capture speeds up the scanning process, it slows down verification and storage because more errors must be corrected.

The scenarios where I recommend going higher than 300 dpi are small fonts and handprint processing.
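
The end-to-end timing argument above can be sketched as a simple per-page model. Everything here is an assumption for illustration: the function name, the idea that each misread character costs a fixed correction time, and above all the input values, which are placeholders rather than measurements. Substitute numbers from your own tests.

```python
def seconds_per_page(scan_s, ocr_s, chars_per_page, accuracy, fix_s_per_error):
    """Estimate total handling time for one page.

    accuracy is the fraction of characters recognized correctly (0..1);
    every misrecognized character is assumed to cost fix_s_per_error
    seconds of manual verification.
    """
    errors = chars_per_page * (1.0 - accuracy)
    return scan_s + ocr_s + errors * fix_s_per_error


# PLACEHOLDER inputs, not measurements: replace with your own test results.
low_res = seconds_per_page(scan_s=1.0, ocr_s=1.0, chars_per_page=2000,
                           accuracy=0.95, fix_s_per_error=2.0)
high_res = seconds_per_page(scan_s=2.0, ocr_s=2.0, chars_per_page=2000,
                            accuracy=0.99, fix_s_per_error=2.0)
print(f"lower resolution:  {low_res:.0f} s per page")
print(f"higher resolution: {high_res:.0f} s per page")
```

The point of the model is structural rather than numerical: scan and OCR time are paid once per page, while correction time scales with the number of errors, so even a modest accuracy gain can outweigh a slower scan.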

The next major setting to consider is bit depth.

Bit depth, like resolution, affects not only OCR but also storage and future use of the image file. Obviously a color scan is best for future use, and there is impressive compression technology out there that makes it very feasible to always scan in color. When it comes to OCR, color or grayscale scanning does not affect the OCR algorithms themselves, because modern OCR engines always work on a black-and-white image anyway, but it may affect some of the imaging algorithms that precede them.
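
To illustrate what reducing an image to black and white involves, here is a minimal sketch of the simplest approach, a fixed global threshold applied to 8-bit grayscale values. This is an assumption for illustration only and not the method of any particular OCR engine; commercial engines typically use more sophisticated, adaptive thresholding. The `binarize` name and the sample pixel values are hypothetical.

```python
def binarize(gray_rows, threshold=128):
    """Convert 8-bit grayscale rows (0 = black, 255 = white) to 1-bit.

    Returns rows of 0/1 values where 1 marks a dark ("ink") pixel.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]


# A tiny 3 x 4 grayscale patch: dark strokes on a light, slightly noisy background.
patch = [
    [250, 40, 35, 245],
    [240, 30, 200, 235],
    [251, 45, 38, 180],
]
for row in binarize(patch):
    print("".join("#" if bit else "." for bit in row))
```

Notice that the pixels with values 200 and 180 fall on the background side at the default threshold but would become ink at a higher one. A grayscale scan lets the imaging software make that decision per document, while a scanner set to black and white has already made it for you.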

A grayscale image may solve some of the problems companies face with poor despeckling, deskewing, and background removal. These imaging tools usually increase OCR accuracy, so this may be a reason to scan in grayscale. Color scanning is useful when OCRing images produced not by a scanner but by a digital camera. OCR engines that are specifically tuned for digital camera images use color to determine layers, for example whether they are looking at a page or at the desk the page is sitting on, which in the end helps isolate the page and improves OCR accuracy.

Color also helps document analysis run better. Document analysis is the process of breaking a document down into its parts, such as graphics, text, paragraphs, columns, and lines. Color allows graphics to be easily separated from text. Sometimes, however, you don’t want this; occasionally you want to force OCR to read even the images on a page, not just the text. When it comes to bit depth, my usual advice is this: if you are just OCRing the image, use black and white; if you believe there will ever be a future need for the image, scan in color. There are arguments for everything in between; it really depends on your business process.

Resolution and bit depth are arguably the largest contributing factors in the quality of the image fed to OCR, but both settings can easily be negated or enhanced by image cleanup and/or file format and compression. Typically you want to feed OCR a TIFF Group 4 image because, just before the OCR algorithm runs, the image will be converted to this anyway. Compressed images can sometimes take away valuable information that is useful to OCR. Similarly, image cleanup can often hurt OCR. This could be an article on its own, and may very well become one, but for now just realize that when OCR was invented there was no image cleanup as advanced as character regeneration and specialized thresholding. You can imagine, then, that these algorithms could very well remove things that are useful for OCR.

What I want you to remember amidst all this detail and technical jabber is that there is no one magical setting; it depends on your scanner and the documents you are scanning. And finally, test, test, test.

Chris Riley is founder of Living Analytics, where he uses his in-depth knowledge of data capture technologies to advise clients and proselytize the value of these tools.

Chris was recently the featured speaker for our webinar on March 5: Tips and Tricks to Help You Automate your Office Documents (for Effective Data Capture). Listen.