What IT should know when preparing for OCR

Keywords: OCR, hardware, architecture, cpu, memory

Every now and then, you have to roll up your sleeves and get down to business. This post does just that, and it is aimed at IT managers tasked with implementing high-performance OCR environments. A properly architected OCR environment will maximize license value, shorten turnaround, and help improve accuracy.

One challenge organizations face when performing OCR at volume is that scaling up usually means purchasing additional licenses. However, you can get the most out of a single license by maximizing the hardware it runs on. The goal is to process as many pages as possible with a single OCR license before you add new ones.

Along the same lines, when you increase the efficiency of the machines doing OCR, you can enable advanced recognition features that increase overall accuracy. The more accurate an OCR engine is, the more computing power it requires, so many organizations end up deciding whether to turn off accuracy-improving features. With better hardware you can make fuller use of existing licenses and keep all the accuracy-enhancing features on, and you also reduce the time it takes an image to enter the OCR process and return a result.

The bottom line is that IT has a lot of control over the effectiveness of a production-level OCR environment. OCR will use every resource a machine makes available. Many organizations simply throw the latest and greatest technology at the problem to improve accuracy, and are surprised when it does not deliver the expected gains. Below is a list of the top things to consider when enhancing your OCR environment.

  1. Bus speed. OCR processes move images in and out of memory, and serialize them to the hard drive, more times than you can imagine. This alone can really slow down a machine. Let's try an analogy. San Francisco and New York are two very large cities with an amazing capacity for people and things. Let's say San Francisco is the best computer memory and New York is the best, largest hard drive. If I and 200 of my friends want to move from San Francisco to New York with all our stuff, driving 100 or so VW Beetles cross-country, it would take a LONG time. That is a poor bus. But if we all loaded onto a jumbo jet, we would be there in a matter of hours. The slower the bus between memory, hard drive, and CPU, the longer the delay in moving image files from one location to another. Server-grade hardware often has fast bus speeds but carries a tremendous amount of overhead that gets in the way. Bus speed is a very important consideration when looking at hardware components and how they benefit OCR. You might maximize memory size, but if it takes too long to write the images to memory, that capacity is never used.
  2. OCR is a CPU hog. It will take 99% of any single thread while it is running, so investing in a more powerful CPU with more threads is not a bad idea. However, assuming that a server-grade CPU such as the Xeon is better than a desktop CPU such as the i7 might be a mistake. The reason is twofold. First, servers again have more overhead, which can get in the way of processes that move a lot of data from one place to another. Second, and more important, the chipsets of the older, established CPUs are just that: older. Because OCR is so math intensive, chips optimized for math operations outperform the rest, so it is not surprising that the chips that run the latest video games amazingly well also tend to do very well with OCR. Two chips may have the same MHz rating, yet one may lack some of the faster math processing found in newer chipsets that is so good for OCR. It's like the difference between a diesel engine and a Ferrari engine. The diesel is a powerhouse once it gets going, but out of the gate it is just not as fast.
  3. Hard drive speed. This is the same story as bus speed. Images are serialized very often during OCR, so you want your hard drives to write quickly. Not only do you want the drive to be fast, you also want its connection to the motherboard to be fast. Serial ATA is so far the proven fastest option. Servers tend to implement SCSI, which is great for redundancy but, because of the overhead, does not promote speed. On the flip side, the promise of solid state drives is great. In tests, a solid state drive does magic for OCR performance. However, the reliability is not there yet.
  4. Memory. Memory is important, but the amount of memory matters less than its speed. 4 GB should be sufficient for most activity any machine can handle. The difference in speed between DDR and DDR3 is huge.
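One way to compare candidate drives for the serialization work described in items 1 and 3 is to time a write-and-read-back loop on each. The following is a minimal sketch, not a benchmark of any particular OCR product: the function name, the 8 MB buffer standing in for a page image, and the round count are all assumptions chosen for illustration.

```python
import os
import tempfile
import time


def serialization_throughput(directory, size_mb=8, rounds=20):
    """Write and read back `rounds` buffers of `size_mb` MB; return MB/s."""
    # Random bytes stand in for one uncompressed page image (an assumption).
    payload = os.urandom(size_mb * 1024 * 1024)
    start = time.perf_counter()
    for i in range(rounds):
        path = os.path.join(directory, f"page_{i}.bin")
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the write to the drive
        with open(path, "rb") as f:
            data = f.read()
        assert len(data) == len(payload)
        os.remove(path)
    elapsed = time.perf_counter() - start
    # Each round moves the payload twice: once out, once back in.
    return (2 * size_mb * rounds) / elapsed


if __name__ == "__main__":
    # Point this at a directory on the drive you want to evaluate.
    with tempfile.TemporaryDirectory() as scratch:
        print(f"{serialization_throughput(scratch):.1f} MB/s")
```

The read-back will often be served from the operating system's cache rather than the drive, so the figure is best used to compare drives on the same machine, not as an absolute number.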

If you keep it simple and focus on the components that REALLY increase OCR performance, you may be surprised to find that you pay less and get more. Often a desktop machine chosen with the right considerations will outperform a server, because OCR uses and abuses a system in quick spurts rather than drawing resources steadily. The four items above, in order, are the top considerations when architecting your OCR production environment for the greatest efficiency and quality.

Comments

Michael Jahn

I normally like to use the idea of water when it comes to explaining data, with straws, garden hoses, fire hoses and sewer mains as the different ways water can move. That covers bandwidth (as in, we really do not have ANY issues with bandwidth, there are PLENTY of bandwidth options), but the problem is in BUS SPEED. For that, I use the same 'think of water as data' story, and then suggest that if we think of our bodies as the computer, then our bus is our mouth, and our stomach is our hard drive.

We can manage fine with the bandwidth of a straw, even a couple of straws. A garden hose turned on 1/8 of the way might work, but you need to let things 'catch up' and buffer, and there is no way you will last even a second with a fire hose strapped to your face.

The fact is, USB scanners can do a pretty good job sending about 100 double-sided pages onto most hard drives without any real problems, and converting these scans into OCRed searchable archives might range from 3 seconds per page (for poor quality) to 'some amount of time much higher', depending on whether you are double typing to achieve the ultimate in accuracy.

But that is beside the point. I love analogies and wanted to share the one I like to use. Hope that helps!

Chris Riley, ECMp, IOAp

Michael,

Absolutely. The hose vs. straw analogy here is great. Were I to start preparing my posts several days in advance, I might not come up with silly ones on the spot :)

and you were building a scanning/indexing/OCR station for a production imaging environment, what would it include? Don't have to name names, but what type of processor, what bus, etc.? Also, what are your thoughts on hardware acceleration?

Best,

jesse

Chris Riley, ECMp, IOAp

Jesse,

Good question. I will answer it in terms of what I would include and what I would not include.

1.) I would include a folder-driven recognition product that allows me to distribute processing across multiple threads.
2.) I would use a dedicated physical machine, not a VM. I would assign all threads but one to OCR processes, keeping the remaining thread for interrupts and for tasks such as checking logs or setting up new jobs.
3.) I would feed images to this machine over the network, but keep the machine itself untouched. If there is verification, I would farm it out to desktops on the network.
4.) I would consider a processor with good math optimization and many cores. Only if the OCR engine I use runs natively in 64-bit would I get a 64-bit processor and OS; otherwise I would stick with 32-bit.
5.) I would get DDR3 memory and focus on fewer sticks of memory. For example, if I have 4 slots and want 8 GB, I would try to get 2x4 GB sticks rather than 4x2 GB sticks.
6.) I would find the best internal solid state drive and dedicate it to the OCR services for serialization ONLY, not for storing the input or results.
7.) I would always favor accuracy settings in the OCR product over performance settings.
8.) I would purchase a Ferrari so that, should anything go wrong, I could get to the server as quickly as possible.
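
For readers who script their own pipeline, items 1 and 2 might look roughly like the sketch below: a watched folder whose images are spread across all logical CPUs but one. This is only an illustration under stated assumptions. `recognize` is a hypothetical placeholder for whatever OCR engine is in use, and the `incoming` folder name and `*.tif` pattern are made up for the example.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def ocr_worker_count(logical_cpus=None):
    """Return all logical CPUs but one, and never fewer than one worker."""
    cpus = logical_cpus if logical_cpus is not None else (os.cpu_count() or 1)
    return max(1, cpus - 1)


def recognize(image_path):
    """Hypothetical placeholder: call your OCR engine here."""
    return f"{image_path}: recognized"


def process_folder(folder, pattern="*.tif"):
    """Run `recognize` over every matching image, one process per worker."""
    images = sorted(str(p) for p in Path(folder).glob(pattern))
    with ProcessPoolExecutor(max_workers=ocr_worker_count()) as pool:
        return list(pool.map(recognize, images))


if __name__ == "__main__":
    # "incoming" is an assumed folder name; substitute your own drop folder.
    for line in process_folder("incoming"):
        print(line)
```

Processes are used here instead of Python threads so that each worker can keep a CPU busy on its own; a commercial engine with its own thread management would replace this layer entirely.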

This post and comment(s) reflect the personal perspectives of community members, and not necessarily those of their employers or of AIIM International