Send your OCR to College

Study hard. Get smarter. Make fewer mistakes. A trio of less frequently used optical character recognition techniques can further fine-tune your data acquisition efforts.

So, you’ve done everything possible to ensure that your OCR integration is as good as it can be. You are:

  • scanning the best possible images
  • doing all the right image clean-up
  • utilizing all the best settings in your OCR product

That’s it, right? Nope, there are just a few more things that you can do: it's time to send your OCR to college. For the last stages of optimizing OCR, companies should consider the use of:

  1. Dictionaries
  2. Voting
  3. Pattern training

Even though this particular article is a little on the technical side, business stakeholders should consider raising these techniques with their technical teams.

Dictionaries
While all modern OCR engines use dictionaries in their default processes, you can also supply specialized dictionaries of your own. A dictionary limits the number of possibilities the OCR engine considers for each word or character, giving it fewer chances to make a mistake. For every character, every combination of three characters, and every word, the OCR engine assigns a confidence value to each guess it makes about the content. The more choices, the more possible guesses. When you implement a dictionary, you limit and weight those guesses to what is present in the dictionary.

The most obvious form of dictionary is the base language a document is written in, such as English, but there are many more possibilities, especially in the world of data capture. For example, if you are processing invoices and already have a list of all vendor information, then your dictionary for the vendor field is just the vendors you have in your database and nothing more. Most products allow you to point to any ODBC-compatible database and use the information stored there as a dictionary. The most common dictionaries are addresses, names, and product names or descriptions. If you are processing medical or scientific documents, consider augmenting the traditional English dictionary with the special terms used in those fields. If your product doesn't support dictionaries, consider a change or, if you’re a developer, write your own implementation. Dictionaries are a rather obvious choice, and it's hard for me to picture a case, other than some full-page conversion, where they should not be used. The next two techniques call for more deliberation.
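For developers writing their own implementation, the vendor-field idea can be sketched as a post-recognition lookup. This is a minimal sketch and not how any particular OCR engine applies a dictionary internally: it assumes the vendor list has already been loaded from your database, uses Python's standard difflib for fuzzy matching, and the function name, sample vendors, and 0.8 cutoff are all hypothetical choices for illustration.

```python
import difflib

def snap_to_dictionary(ocr_text, dictionary, cutoff=0.8):
    """Return the dictionary entry closest to ocr_text, or None if nothing is close enough.

    cutoff is an assumed similarity threshold (0..1); tune it on your own documents.
    """
    # Compare case-insensitively, but return the entry exactly as stored.
    lookup = {entry.lower(): entry for entry in dictionary}
    matches = difflib.get_close_matches(
        ocr_text.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None

# Hypothetical vendor list; in practice this would come from your vendor database.
vendors = ["Acme Supply Co.", "Northwind Traders", "Contoso Ltd."]

print(snap_to_dictionary("Acrne Supply C0.", vendors))    # a typical OCR misread
print(snap_to_dictionary("Globex Corporation", vendors))  # not in the dictionary
```

Returning None for values that match nothing, rather than forcing the nearest entry, leaves room to route unknown vendors to a person for review.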

Voting
I'm talking about voting that works, not the kind advertised by many products. When it comes to voting, I don’t advise having two separate OCR products vote against each other. Why not? Simple: you are not voting like against like, so the result will be weighted toward one engine, even if it's the less accurate one. You can make it work, but the amount of research, study, and testing it takes on a single document type is phenomenal, and is only feasible in a specific project, not in general.

However, voting the same engine against itself is very successful. Most modern OCR engines already have an internal system of voting. What you as a user can do is use the same engine but vary the settings. For example, run a document through the engine with document analysis enabled, then run it through again with line straightening enabled. You could conceivably create as many instances of the engine as there are settings. When you compare the text from each pass, you get a more accurate voted result. The downside is that every new vote/expert adds another full pass, so processing time grows in proportion to the number of votes.

Voting full-page text is a challenge and for the scientist only, but voting data capture fields is relatively easy for any organization and will often improve the OCR results. The improvement to expect is typically less than a 5% reduction in errors, often fractions of a percent. The organization wanting to do this type of voting would most likely be very high volume. In line with voting is another, even more cautiously implemented technique: pattern training.
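Field-level voting across several passes of the same engine can be sketched as a simple majority vote. This is only an illustration under stated assumptions: each pass is assumed to have already produced a mapping of field name to recognized text, the field names and values are invented, and sending ties to manual review is one possible policy, not a rule from any product.

```python
from collections import Counter

def vote_field(readings):
    """Return the value most passes agree on, or None when there is no clear winner."""
    counts = Counter(r.strip() for r in readings if r and r.strip())
    if not counts:
        return None  # every pass came back empty
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two values: leave it for manual review
    return ranked[0][0]

def vote_document(passes):
    """Vote each field across all passes; passes is a list of {field: text} dicts."""
    fields = sorted({name for p in passes for name in p})
    return {name: vote_field([p.get(name, "") for p in passes]) for name in fields}

# Hypothetical output of three passes of the same engine with different settings.
passes = [
    {"invoice_no": "INV-1042", "total": "1,250.00"},
    {"invoice_no": "INV-1042", "total": "1,25O.00"},  # letter O read for zero
    {"invoice_no": "INV-1O42", "total": "1,250.00"},
]

print(vote_document(passes))
```

An odd number of passes reduces ties. If your engine reports confidence values, weighting each vote by confidence is a natural refinement.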

Pattern Training
Pattern training has the potential to get you from 85% accurate to 99.999999% accurate. It can also take your 85% accuracy down to 40%. Pattern training is the process of running entire documents, or just selected characters, through a manual training process that creates a training file. In the training process, a human sees a character on the screen and manually types what that character is, repeating this for a set number of instances. You can under-train, and you can over-train. Under-training is OK: the net result is that your training has no effect on the production environment. If you over-train, you introduce so much variance that both trained and untrained characters are overwhelmingly misrecognized, with high confidence, as something that has been trained. For example, if you over-train a 1 and a 0, then all your “I”s, “L”s, “O”s, “U”s, “D”s, and “Q”s will be read as a 1 or 0. When training is done, there is a compiled training file that the OCR engine uses during recognition.

Often organizations will train for that one pesky character that is always misrecognized. Additionally, pattern training is usually used in data capture and NOT full-page conversion, unless you are working with specialized fonts. In full-page conversion, if you choose training, the best practice is to train all characters, whether problematic or not, an equal number of times. In all cases of pattern training, you should train on not just one document but several copies of production documents. The kicker: because you are dialing in the settings so specifically, if you change ANYTHING in the scanning and imaging process, you will need to start over from scratch.
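The advice to train every character an equal number of times can be checked before the training file is compiled. The sketch below assumes you can export the list of labels the operator typed during a training session, one per character image; the function name and the 20% tolerance are hypothetical, and real training tools may expose this information differently or not at all.

```python
from collections import Counter

def unbalanced_labels(labels, tolerance=0.2):
    """Return {character: count} for characters trained noticeably less than the most-trained one.

    tolerance is an assumed allowance: 0.2 flags any character with fewer than
    80% of the samples given to the most-trained character.
    """
    counts = Counter(labels)
    if not counts:
        return {}
    most = max(counts.values())
    return {ch: n for ch, n in counts.items() if n < (1 - tolerance) * most}

# Hypothetical session: 1 and 0 were trained heavily, I only a handful of times.
session = ["1"] * 50 + ["0"] * 50 + ["I"] * 10

print(unbalanced_labels(session))
```

A non-empty result suggests either adding samples for the flagged characters or trimming the heavily trained ones before relying on the training file.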

Regardless of your familiarity with OCR, the idea here has been to bring to the fore some of the less frequently considered methods for improving it. All three of these techniques should be considered only once you have determined your scan and imaging settings and you are in or past the initial stage of integration, where you have a good understanding of the possibilities. Only then is your OCR ready for college.

Chris Riley is founder of http://www.livinganalytics.com, where he uses his in-depth knowledge of data capture technologies to advise clients and proselytize the value of these tools. Chris was recently the featured speaker for our March 5 webinar, Tips and Tricks to Help You Automate your Office Documents (for Effective Data Capture). Listen at www.aiim.org/webinararchive.