Study hard. Get smarter. Make few mistakes. A trio of less-frequently used optical character recognition techniques can further fine-tune your data acquisition efforts.
So, you’ve done everything possible to ensure that your OCR integration is as
good as it can be. You are:
- scanning the best possible images
- doing all the right image clean-up
- utilizing all the best settings in your OCR product
That’s it, right? Nope. There are a few more things you can do: it’s
time to send your OCR to college. For the last stages of optimizing OCR,
companies should consider using:
- Dictionaries
- Voting
- Pattern training
Even though this particular article is a little on the technical side,
high-level business constituents should consider these techniques for their
technical team.
Dictionaries
All modern OCR engines use dictionaries in their default processes, but you
can also supply specialized dictionaries of your own. A dictionary limits the
number of possibilities for each word or character, giving the OCR engine
fewer chances to make a mistake. For every character, every combination of
three characters, and every word, the OCR engine assigns a confidence value to
each guess it makes about the content. The more choices, the more possible
guesses. When you implement a dictionary, you are limiting and weighting the
guesses to only what is present in the dictionary.
The most obvious form of a dictionary is the base language a document is written
in, such as English, but there are many more possibilities especially in the
world of data capture. For example, if you are processing invoices and already
have a list of all vendor information, then your dictionary for the vendor field
will be just the vendors you have in your database and nothing more. Most
products allow you to point to any ODBC-compatible database and use the
information stored there as a dictionary. The most common sources are
addresses, names, and product names or descriptions. If you are processing
medical or scientific documents, consider augmenting the traditional English
dictionary with special terms from those fields. If your product doesn't
support dictionaries, consider a change or, if you’re a developer, write your
own implementation.
Dictionaries are rather obvious and it's hard for me to picture a case other
than some full-page conversion where they should not be used. The next two
techniques are used with more deliberation.
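As a rough illustration of the vendor-field case above, here is a minimal Python sketch of a do-it-yourself dictionary. It is not how any particular OCR product implements dictionaries: the function name `snap_to_dictionary`, the vendor names, and the 0.6 similarity cutoff are all assumptions made up for the example, and it uses the standard library's `difflib` for fuzzy matching instead of the engine's own confidence values.

```python
import difflib

def snap_to_dictionary(raw_text, dictionary, cutoff=0.6):
    """Return the dictionary entry closest to raw_text, or None if nothing is close enough."""
    # Compare case-insensitively, but return the entry exactly as stored.
    by_lower = {entry.lower(): entry for entry in dictionary}
    matches = difflib.get_close_matches(
        raw_text.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None

# Hypothetical vendor list; in practice this would come from your vendor database.
vendors = ["Acme Supply Co", "Northwind Traders", "Globex Corporation"]

# A raw OCR read with typical confusions ("m" read as "rn", "l" read as "1").
print(snap_to_dictionary("Acrne Supp1y Co", vendors))
# A read that resembles no known vendor is rejected, not forced into the list.
print(snap_to_dictionary("Zzzz", vendors))
```

Returning `None` when nothing is close enough is a deliberate choice here: a field that matches no vendor is probably better routed to a human than snapped to the wrong one.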
Voting
I'm talking about voting that works, not the kind advertised by many products.
I don’t advise having two separate OCR products vote against each other. Why
not? Simple: you are not voting like things, so the result will be weighted
toward one engine, even if it's the less accurate one. It can be done, but the
amount of research, study, and testing needed on a single document type to make
it work is enormous, which makes it practical only for a specific project and
not in general. However, voting the same engine against itself is very
successful. Most modern OCR engines actually
already have an internal system of voting. What you as a user can do is use the
same engine but vary the settings. For example, run a document through the
engine with document analysis enabled, then run it through again with line
straightening enabled. You could conceivably run as many instances of the
engine as there are settings. When you compare the text from each run, you get
a more accurate voted result. The downside is that every new vote/expert adds
another full pass, so processing time grows in proportion to the number of
voters. Voting full-page text is a challenge and for the scientist only, but
voting data capture fields is relatively easy for any organization and will
often improve the OCR results. The improvement to expect is typically less than
a 5% reduction in errors, and often only fractions of a percent, so the
organizations that do this type of voting are most likely very high volume. In
line with voting is another, even more cautiously implemented technique:
pattern training.
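To make field-level voting concrete, here is a minimal Python sketch of a simple majority vote. It assumes you have already run the same engine several times with different settings and collected the text each pass returned for one field. The function name `vote_field` and the sample readings are hypothetical, and a real setup might weight votes by engine confidence instead of counting them equally.

```python
from collections import Counter

def vote_field(readings):
    """Majority-vote one data capture field across several OCR passes.

    readings: one string per pass of the same engine with different settings.
    Returns (winning_text, agreement), where agreement is the share of
    passes that produced the winning text.
    """
    if not readings:
        raise ValueError("need at least one reading")
    counts = Counter(r.strip() for r in readings)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(readings)

# Hypothetical results for an invoice-number field from three passes.
passes = ["INV-10482", "INV-1O482", "INV-10482"]
value, agreement = vote_field(passes)
print(value, agreement)
```

The agreement share is worth keeping alongside the winner: a field where the passes split evenly is a good candidate for manual review.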
Pattern Training
Pattern training has the potential to get you from 85% accurate to 99.999999%
accurate. It also has the potential to take your 85% accuracy down to 40%.
Pattern training is the process of running entire documents, or just select
characters, through a manual training process that creates a training file. In
the training process, a human sees a character on the screen and manually types
what that character is, repeating this for a set number of instances. You can
undertrain, and you can overtrain. Undertraining is OK: the net result is that
your training has no effect on the production environment. If you overtrain,
you introduce so much variance that both trained and untrained characters are
overwhelmingly misrecognized, with high confidence, as something that has been
trained. For example, if you overtrain a 1 and a 0, then all your “I”s, “L”s,
“O”s, “U”s, “D”s, and “Q”s will be read as a 1 or 0. When training is done,
there is a compiled training file that the OCR engine uses during recognition.
Often
organizations will train for that one pesky character that is always
misrecognized. Additionally, pattern training is usually used in data capture
and NOT full-page conversion, unless you are working with specialized fonts. In
full-page conversion, if you choose training, the best practice is to train all
characters, whether problematic or not, an equal number of times. In all cases
of pattern training you should train on not just one document but several
copies of production documents. The kicker: because you are dialing in the
settings so specifically, if you change ANYTHING in the scanning and imaging
process you will need to start over from scratch.
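The "equal number of times" rule is one part of training you can check mechanically. The Python sketch below is hypothetical bookkeeping, not a feature of any OCR product's training tool: it assumes you keep your own log of the characters typed during training sessions, and the names `training_balance`, `alphabet`, and `target` are invented for the example.

```python
from collections import Counter

def training_balance(labels, alphabet, target):
    """Return {character: count - target} for each character in alphabet
    that was not trained exactly `target` times (negative means undertrained)."""
    counts = Counter(labels)  # a character missing from the log counts as 0
    return {ch: counts[ch] - target for ch in alphabet if counts[ch] != target}

# Hypothetical log of characters an operator typed during training.
session_log = list("0000011111OOO")
print(training_balance(session_log, alphabet="01OI", target=5))
```

An empty result means every character in the alphabet was trained the target number of times. Anything else shows where to add or remove samples before compiling the training file.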
Regardless of your familiarity with OCR, the idea has been to bring to the
fore some of the less frequently considered methods for improving it. All
three of these techniques should be considered only once you have settled
your scan and imaging settings and are in, or past, the initial stage of
integration, where you have a good understanding of the possibilities. Only
then is your OCR ready for college.
Chris Riley
is founder of http://www.livinganalytics.com
where he uses his in-depth knowledge of data capture technologies to advise
clients and proselytize the value of these tools. Chris was recently the featured
speaker for our March 5 webinar, Tips and Tricks to Help You Automate
your Office Documents (for Effective Data Capture). Listen at www.aiim.org/webinararchive.