While 300 dpi is a good enough setting for most documents, there are other variables to consider. Here’s a look at resolution and bit depth and what you need to be thinking about to maximize the recognition efficiency of your capture process.
By Chris Riley
While I’m not a great fan of clichéd slogans such as “junk in, junk out,” I’m
forced to admit that it’s 100% accurate when it comes to scanning and OCR
(optical character recognition).
OCR and scanning are linked; any conversation I have about OCR includes a
discussion of scanning. The goal is to give the OCR engine the best chance with
the best possible image while balancing all the additional requirements. While
scanning is its own distinct process, the influence it has on OCR accuracy is
phenomenal. Most users want there to be a universal (that is, a single best)
scan setting. I’m sure that this expectation is fed by the fact that most
scanning hardware has a setting specifically for OCR scanning, implying
that it could be just that easy. Ironically, in my own
practice I’ve never used default OCR scan settings and have actually discovered
instances where that setting hurts the results.
Unfortunately, there is not a “catch all” scanner setting that will always
give you the best image for OCR. I’ll point out some methodologies and best
practices to consider when scanning for OCR.
First, let’s start with the key settings: resolution and bit depth.
When you talk about resolution you are not just talking about the best
resolution for OCR, but you are also making a decision about the image file
saved after OCR. Some questions to ask are:
- How are the input images being saved after OCR?
- Will there be a future need to re-purpose the images?
- Is storage space a concern?
These questions will quickly frame an answer to the question “what resolution
should I scan at?” The magic number is 300 dpi (dots per inch); in general,
300 dpi is the best solution for both OCR and storage. A 300 dpi image
- can be repurposed,
- produces a generally acceptable file size,
- and, for OCR, is generally high enough quality that going any higher
leaves accuracy unaffected or only nominally improved.
Below 300 dpi, you will usually notice a substantial decrease in accuracy.
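To make the storage side of this trade-off concrete, here is a minimal sketch of the arithmetic behind raw image size. It assumes a US Letter page (8.5 x 11 inches) and uncompressed pixel data; the function names are illustrative only, and real files will be considerably smaller once compression is applied.

```python
import math


def pixel_dimensions(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw size in bytes before compression (each row padded to whole bytes)."""
    width_px, height_px = pixel_dimensions(width_in, height_in, dpi)
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return bytes_per_row * height_px


if __name__ == "__main__":
    # Assumed page size: US Letter, 8.5 x 11 inches.
    depths = (("black & white", 1), ("grayscale", 8), ("color", 24))
    for dpi in (150, 300, 600):
        w, h = pixel_dimensions(8.5, 11, dpi)
        for label, bpp in depths:
            size_mb = uncompressed_bytes(8.5, 11, dpi, bpp) / 1_000_000
            print(f"{dpi} dpi  {w}x{h}  {label:13s} {size_mb:7.1f} MB")
```

Doubling the resolution quadruples the pixel count, which is why the jump from 300 dpi to 600 dpi weighs so heavily on storage.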
Some say that 300 dpi files are too large. You do not have to feed your OCR
engine the same image you output. It is not uncommon to OCR a 300 dpi image but
output a 150 dpi file for reference. This is often overlooked.
The contrary point of view is that 300 dpi isn’t high enough for re-purposing.
The trick: use the “dual stream” functionality you find in most modern document
scanners. This means that one scanned image goes down the OCR path at 300 dpi,
while another goes down the path of storage, viewing, and re-purposing at, say,
600 dpi. There are two reasons you typically do not want to OCR images at a
resolution higher than 300 dpi. First, modern commercial engines are tested,
trained, and built on 300 dpi images. Second, the decrease in OCR speed at a
higher resolution is often not worth the fraction of a percent gained in
accuracy or the slight reduction in verification.
The last common argument I’ve heard against 300 dpi is speed of scan.
People generally argue that 300 dpi reduces scanning speed too much. If you
were to measure the time it takes a 300 dpi image to go through scan, OCR,
verify, and store compared to, say, a 150 dpi image, you might be surprised to
find that the 300 dpi image actually takes less time. The reason: while 150 dpi
capture speeds up the scanning process, it slows down verification and storage
because more errors must be corrected.
The scenarios where I do recommend going higher than 300 dpi are small fonts
and handprint processing.
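The small-font recommendation follows from simple arithmetic: a typographic point is 1/72 of an inch, so the pixel height of a character scales with both font size and dpi. The sketch below estimates what resolution gives small type the same pixel height that a reference size has at 300 dpi. The function names are illustrative, and the 10-point reference size is an assumption chosen for the example, not a figure from any OCR vendor.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def font_height_px(point_size, dpi):
    """Nominal height of the font's em box in pixels at a given resolution."""
    return point_size / POINTS_PER_INCH * dpi


def dpi_to_match(point_size, reference_point_size=10, reference_dpi=300):
    """Resolution at which `point_size` type is as tall in pixels as the
    reference size is at the reference dpi (reference values are assumed)."""
    return reference_dpi * reference_point_size / point_size


if __name__ == "__main__":
    for pt in (12, 10, 8, 6):
        print(f"{pt:>2} pt: {font_height_px(pt, 300):5.1f} px at 300 dpi, "
              f"about {dpi_to_match(pt):.0f} dpi to match 10 pt")
```

Under these assumptions, 6-point type would need roughly 500 dpi to give the engine as many pixels per character as 10-point type gets at 300 dpi.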
The next major setting to consider is bit depth.
Bit depth, like resolution, not only affects OCR but also affects storage and
future use of the image file. Obviously a color scan is the best for future use,
and there is impressive compression technology out there that makes it very
feasible to always scan in color. When it comes to OCR, color or grayscale
scanning does not affect the OCR algorithms themselves, because modern OCR
engines always work on a black and white image anyway, but it may affect some
of the imaging algorithms that precede them.
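The conversion from grayscale to a black and white image can be illustrated with a small binarization sketch. It uses Otsu’s method, a well-known global thresholding technique chosen here purely for illustration; nothing above says which method any particular OCR engine uses, and commercial engines may well use adaptive approaches instead. The function names and the tiny sample page are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t are treated as dark (ink), the rest as paper.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # everything is already in the dark class
        sum_dark += t * hist[t]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_var:
            best_var, best_t = variance, t
    return best_t


def binarize(rows, threshold):
    """Map a grayscale image (list of rows) to 1 for ink and 0 for paper."""
    return [[1 if p <= threshold else 0 for p in row] for row in rows]


if __name__ == "__main__":
    # Hypothetical 2x4 grayscale patch: low values are ink, high are paper.
    page = [[220, 210, 40, 215],
            [225, 35, 30, 220]]
    t = otsu_threshold([p for row in page for p in row])
    print(t, binarize(page, t))
```

Scanning in grayscale leaves this thresholding decision to software, whereas a black and white scan bakes in whatever threshold the scanner chose.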
A grayscale image may solve some problems companies face with poor
despeckling, deskew, and background removal. These imaging tools usually
increase OCR accuracy, so this may be a reason to scan in grayscale. Color
scanning is useful when OCRing images produced not by a scanner, but by a
digital camera. OCR engines that have specific tuning for digital images use
color to determine layers, for example, whether a region is the page or the
desk that the page is sitting on. This helps isolate the page and, in the end,
improves OCR accuracy.
Color also helps document analysis run better. Document
analysis is the process of breaking a document down into its parts such as
graphics, text, paragraphs, columns, lines, etc. Color allows
graphics to be easily separated from text. Sometimes, however, you don’t want
this; occasionally you want to force OCR to read even the images on a page, not
just the text. When it comes to bit depth, my usual advice is this: if you are
just OCRing the image, use black & white; if you believe there will ever
be a future need for the image, scan in color. There are arguments for
everything in between; it really depends on your business process.
Resolution and bit depth are arguably the largest contributing factors in
the quality of the image fed to OCR, but both settings can easily be negated or
enhanced by image cleanup and/or file format and compression. Typically you
want to feed OCR with a TIFF Group 4 image because, just before the OCR
algorithm runs, the image will be converted to this anyway. Compression,
particularly lossy compression, can sometimes take away valuable information
that is useful to OCR. Similarly, image cleanup can often hurt OCR. This could
be an article on its own, and may very well be, but for now just realize that
when OCR was invented there was no such advanced image cleanup as character
regeneration and specialized thresholding. You can imagine, then, that these
algorithms could very well remove things that are useful for OCR.
What I want you to remember amidst all this detail and technical jabber is
that there is no one magical setting; it depends on your scanner and the
documents you are scanning. And finally, test, test, test.
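One way to make “test, test, test” measurable is to OCR the same sample pages at several scanner settings and score each result against a hand-typed reference. The sketch below computes a simple character accuracy from edit distance. It is one possible metric under stated assumptions, not a standard mandated by any OCR vendor; the function names and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth, ocr_text):
    """Fraction of reference characters recovered, clamped to the range 0..1."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # Hypothetical reference text and OCR output for one sample field.
    reference = "Invoice 1001"
    ocr_output = "lnvoice l00l"
    print(f"accuracy: {character_accuracy(reference, ocr_output):.2%}")
```

Running the same comparison over a representative batch for each candidate setting turns the choice of resolution and bit depth into a measurement on your own documents rather than a guess.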
Chris Riley is founder of Living Analytics, where
he uses his in-depth knowledge of data capture technologies to advise clients
and proselytize the value of these tools.
Chris recently was the featured speaker for our webinar on March 5: Tips
and Tricks to Help You Automate your Office Documents (for Effective Data
Capture).