They're scanned images, not flowers

Keywords: image cleanup, OCR, ICR, quality, image

Of course you want your scanned images to look as pretty as possible on the screen, but who's to say the OCR engines agree with your standard of beauty?  This blog post, perhaps a slap in the face, is about why you can over-clean scanned images, to the point where your recognition accuracy decreases.

There is software that cleans up images so that document recognition technology has the best fighting chance at accuracy, and there is also software that makes scanned images look as if they originated digitally.  Very often these two technologies come bundled together.

This sounds good, but it has gotten many companies in trouble, especially those doing large-volume document scanning.  Why?  Because it is not unheard of for the technology that makes the image look good on the screen to hurt the image for recognition technologies such as OCR and ICR.  The logic is simple: the algorithms that do this image manipulation were created with two very different purposes in mind.

Image cleanup for viewing was created by looking at before-and-after images. Much like your eye doctor asking you very softly, "One or two? Two, or three?", the developers of this technology opened the original image on one monitor and the image processed with the proposed new algorithm on another, and judged which looked better.  The assumption was that if it looks better it will recognize better, which, as we will see, is not the case.

Image cleanup for OCR followed a similar before-and-after scenario.  The developers took images, often at the character level, and tested the OCR engine on them before and after the proposed new image cleanup algorithm.  If accuracy improved (i.e., correct recognition went up and the percentage of uncertainty went down), the algorithm was implemented.

To confuse you a little: I have yet to find a case where image cleanup for OCR was not also good for viewing, but I have found many cases where image cleanup for viewing was bad for OCR. The reality is, you can clean your images too much.  Here is how you know.

If you clean up typographic text too much, it looks to the OCR engine like a graphic.  Because of this, the OCR engine skips it, resulting in what is called a "high confidence blank".  If you clean up handprint too much, well, just don't do it.  Image cleanup on handprinted text removes portions of a hand stroke, the very information Intelligent Character Recognition (ICR, the technology for reading handprint) uses to figure out what the letters are.

Here are some tips. Match image cleanup to the desired purpose of the scan.  If the image is simply for viewing, clean it up to perfection.  If it's for OCR, stick to the settings most conducive to OCR.  If it's both, that is not a problem, as many scanners and software packages support what is called "dual stream": one image going down two paths.  The version enhanced for OCR goes to the recognition software; the version enhanced for viewing goes to an ECM system. Cleanup that is good for OCR and ICR (a short sketch follows the list) is:

  1. Despeckle (unless the font is dot matrix)
  2. Line Straightening
  3. Basic Thresholding
  4. Background removal
  5. Correction of Linear Distortion
  6. Dropout
  7. Line Removal (sometimes)
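
To make the "good" list concrete, here is a minimal sketch of the first three items plus the dual-stream idea, written in Python with OpenCV. The library choice, file names, and parameter values are my assumptions for illustration, not the only (or necessarily best) way to do it:

    import cv2
    import numpy as np

    def cleanup_for_ocr(path):
        # Load the scan as grayscale.
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

        # Despeckle: a small median filter removes isolated noise dots
        # without eating character strokes (skip this for dot-matrix fonts).
        despeckled = cv2.medianBlur(gray, 3)

        # Basic thresholding: one global (Otsu) cut point for the whole
        # page, rather than a locally varying ("adaptive") one.
        _, binary = cv2.threshold(despeckled, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)

        # Line straightening: estimate skew from the ink pixels and rotate
        # the page back. (OpenCV's angle convention has changed between
        # versions, hence the fold-back below.)
        coords = np.column_stack(np.where(binary < 128)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        if angle > 45:
            angle -= 90
        h, w = binary.shape
        rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        return cv2.warpAffine(binary, rot, (w, h),
                              flags=cv2.INTER_NEAREST, borderValue=255)

    # "Dual stream": one scan, two paths. The OCR-tuned image goes to the
    # recognition engine; a separately enhanced copy goes to the ECM system.
    cv2.imwrite("for_ocr.tif", cleanup_for_ocr("scan.tif"))
    cv2.imwrite("for_viewing.png", cv2.fastNlMeansDenoising(
        cv2.imread("scan.tif", cv2.IMREAD_GRAYSCALE)))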

Bad for OCR and ICR is:

  1. Adaptive Thresholding: Often causes a condition called "fuzzy characters", where "c"s become "e"s. For handprint, it often removes portions of characters; the comparison sketch after this list shows the effect.
  2. Character Regeneration: Removes information critical to the OCR and ICR processes. If you use it with OCR (machine print), you will notice more "high confidence blanks": the characters are so perfect they look like images to the OCR engine and are ignored. With ICR (handprint), you will damage the hand stroke of the characters, confusing the ICR algorithms and reducing training's ability to understand the subject, which ultimately reduces accuracy.
  3. Line Removal: Bad line removal makes bad OCR. Leftover line fragments seriously interfere with the OCR and ICR processes.
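
If you want to see the adaptive thresholding problem for yourself, the following sketch binarizes the same page both ways (same assumed OpenCV setup; the blockSize and C values are deliberately aggressive to exaggerate the failure). On thin strokes, the locally varying cut point can shave the opening of a "c" until it reads as an "e", and on handprint it erodes pieces of the stroke:

    import cv2

    gray = cv2.imread("scan.tif", cv2.IMREAD_GRAYSCALE)

    # Global (Otsu) threshold: one cut point for the whole page.
    _, global_bin = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Adaptive threshold: the cut point is recomputed per neighborhood.
    # A small blockSize with a large C biases the decision toward white,
    # which is exactly what erodes thin character strokes.
    adaptive_bin = cv2.adaptiveThreshold(gray, 255,
                                         cv2.ADAPTIVE_THRESH_MEAN_C,
                                         cv2.THRESH_BINARY, 11, 10)

    # A large drop in surviving ink pixels is a warning sign that strokes,
    # not just background, are being removed.
    print("ink pixels, global  :", int((global_bin == 0).sum()))
    print("ink pixels, adaptive:", int((adaptive_bin == 0).sum()))
    cv2.imwrite("global.png", global_bin)
    cv2.imwrite("adaptive.png", adaptive_bin)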

So there you have it.  It is not as clear cut as "the prettier the image, the better the recognition".  Although that can serve as a general guide, it is not fact; it is an assumption that has limited the success of recognition projects.  The simple answer to all of this is, drum roll... test!  Test just as the developers did when creating the technology.  Alternatively, you can become like me and slowly develop a built-in OCR result predictor.  I do not recommend the latter, as it does not promote a social life.
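
For that testing, a bare-bones harness might look like the sketch below. It is only a sketch: it assumes the pytesseract wrapper around the Tesseract engine and hypothetical file names ("scan_raw.tif", "scan_cleaned.tif", "truth.txt"). Swap in whatever engine and images your project actually uses; the point is the before/after comparison, not the tooling:

    import difflib
    import pytesseract
    from PIL import Image

    def accuracy(image_path, truth):
        # Run OCR on one image and score the text against ground truth.
        text = pytesseract.image_to_string(Image.open(image_path))
        return difflib.SequenceMatcher(None, text, truth).ratio()

    truth = open("truth.txt", encoding="utf-8").read()
    before = accuracy("scan_raw.tif", truth)      # as scanned
    after = accuracy("scan_cleaned.tif", truth)   # after image cleanup

    print(f"accuracy before cleanup: {before:.3f}")
    print(f"accuracy after cleanup:  {after:.3f}")
    if after < before:
        print("the cleanup hurt recognition -- dial it back")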


Comments

Sanooj Kutty

If I had come across something like this before, I could have been spared the challenges faced in earlier projects...
...and that Cyrillic character would not have been misread as a barcode, leading to errors. Alas, the painful backtrack was inevitable.
...and image cleanup would not have cleared away some valuable information too.

Must read for all!

Amila Hendahewa

Good read. I had the bitter experience of losing some characters due to line removal.

Mike Morper

Great article, but I must admit I was surprised to see the adaptive thresholding remark. If it weren't for adaptive thresholding, successfully extracting text from screened areas in documents would be quite difficult. Think bills of lading, etc. Adaptive thresholding effectively re-evaluates the point at which a pixel should stay black or drop out. If that decision is made only in the first few scan lines of a page, OCR results may not be quite what we'd all like to see later in the document when a screened area (especially a graduated screen) is introduced. My two cents :)
Chris Riley, ECMp, IOAp

Mike,

I think you are referring simply to binarization, the conversion of all images to bitonal (1-bit). This of course is necessary, as that is the level at which all OCR engines look at images. Unfortunately the term "adaptive thresholding" has come to mean many things, but it is not binarization. What I'm talking about here happens after the binarization threshold has already been determined by the software, but in certain places in the document the threshold is altered. When characters are printed on a gray background of some sort, or next to a color, the color ends up staying, because in that case the "adaptive" part of the algorithm decides that those pixels belong with the text that met the original threshold.

Were the core OCR engines created at a time when these types of conversions were commonplace, I'm sure it would not make a difference. But because the foundation of the engines was created when only basic binarization technology was around, some of the new imaging approaches just don't work well.

As a follow-up to the article: the critical mistake with these technologies is the assumption that just because it looks good/better on the screen, it's optimal for OCR.

This post and comment(s) reflect the personal perspectives of community members, and not necessarily those of their employers or of AIIM International