AIIM — The Enterprise Content Management Association

Data Capture: Export Is the Least of Your Worries

Don't make the mistake of purchasing your data capture solution based on the product's export functionality. Almost always, export is easy. Instead, focus on how accurate the capture product will be for you.

Nov 25, 2009

Companies that use optical character recognition (OCR) and data capture applications see the technology as a necessary evil to get to a final result. The thinking goes: something goes in, magic happens in the middle, and, finally, an electronic file comes out. Consequently, organizations focus on the resulting file format, instead of how that data is generated. Such thinking can result in picking a product based only on its export functionality, rather than the accuracy of data extraction. It can also result in neglecting aspects of the technology that could save time, and never really understanding what is possible.

Format Addicted
I first discovered this thinking through my experiences showing these technologies to end users. I've given hundreds of technology demos during the last six years, showcasing various data capture and OCR products. I was surprised that the conversation about export format was often the reason I missed my next meeting. While my demos would focus on extraction, the hour-long demo would stretch to an hour and a half, with the last thirty minutes all about export. Either the demo of the technology was so awesome that it left no questions, or, more likely, export was all the audience really cared about. It often made me very nervous when I reached the export step and had heard no questions about accuracy or the difficulty of setting up the recognition step. At that point I would always force the topic by stressing the importance of quality assurance and business rules to check data.

If You Have the Data, You Have It However You Want
The simple fact is that nearly any XML or text file format can be converted very quickly and accurately to any other format, as long as it contains the proper data. The place where export nearly always takes precedence over the extraction process is healthcare. In healthcare there are three specific file formats: 837, 857, and HL7. These formats have very specific methods to report and load data. I've seen clinics and hospitals purchase inaccurate recognition technology simply because these formats were supported. The fact is that all three of these formats can be generated from an XML file, so the hospital ended up with a lesser product for more money. But healthcare is not the only industry with specific file formats, and as Electronic Data Interchange (EDI)** becomes more popular, more formats are being developed.

Because these industry-specific file formats are the largest portion of any business process they feed, I'm not surprised at the focus given to them. However, when a company asks if a text file export can use two tabs of spacing instead of one, or a field separator that is an underscore instead of a tab, or if the first two lines of the export file can contain a static header, I start to scratch my head. I understand the need to get the right output, but I don't understand how this need is more important than how data extraction occurs and how accurate it is.
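To illustrate how little work such requests usually involve, here is a minimal Python sketch that turns an XML export into a delimited text file with a configurable separator and a static header. The XML layout, the element names (`invoice`, `number`, `vendor`, `total`), and the function `xml_to_delimited` are all assumptions made up for this example; a real capture product's export will look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML export from a capture product; element names are assumed.
SAMPLE_XML = """<documents>
  <invoice>
    <number>INV-001</number>
    <vendor>Acme Corp</vendor>
    <total>1250.00</total>
  </invoice>
  <invoice>
    <number>INV-002</number>
    <vendor>Globex</vendor>
    <total>89.95</total>
  </invoice>
</documents>"""


def xml_to_delimited(xml_text, fields, separator="\t", header_lines=()):
    """Flatten each <invoice> record into one delimited line of text."""
    root = ET.fromstring(xml_text)
    lines = list(header_lines)  # static header lines come first, unchanged
    for record in root.iter("invoice"):
        # A missing element becomes an empty field instead of raising an error.
        values = [(record.findtext(name) or "").strip() for name in fields]
        lines.append(separator.join(values))
    return "\n".join(lines) + "\n"


# Underscore separator and a two-line static header, as in the requests above.
output = xml_to_delimited(
    SAMPLE_XML,
    fields=["number", "vendor", "total"],
    separator="_",
    header_lines=["EXPORT FILE", "number_vendor_total"],
)
print(output)
```

Changing the separator or the header is a one-argument change here, which is the point: once the data is correct, reshaping it is the cheap part.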

Most of the products out there actually have the concept of “Custom Export.” This is a tremendous tool: if you can dream up an export, you can pretty much create it. Even where it's not available, I've yet to find a product whose export cannot be easily translated. XML is probably the most popular format, and there are many XML translation tools available. The only time export format comes with barriers is when the data on the form has to be replaced with normalized values from a database. However, a simple database lookup should fix this. For example, if your accounting system only accepts numbers without a dollar sign or comma on invoices, then you will have to apply some simple find-and-replace rules.
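As a rough sketch of what those cleanup rules can look like, the Python below strips dollar signs and commas from an amount and swaps a captured vendor name for a normalized ID. The `VENDOR_IDS` dictionary is a stand-in for a real database table, and both function names and the ID values are hypothetical, chosen only for illustration.

```python
import re
from decimal import Decimal

# Stand-in for a database table of normalized values (hypothetical data).
VENDOR_IDS = {
    "acme corp": "V-1001",
    "globex": "V-1002",
}


def normalize_amount(raw):
    """Remove dollar signs, commas, and whitespace, then validate the number."""
    cleaned = re.sub(r"[$,\s]", "", raw)
    # Decimal raises InvalidOperation if the cleaned text is not a number,
    # which flags a bad capture instead of passing it downstream.
    return str(Decimal(cleaned))


def normalize_vendor(raw, lookup=VENDOR_IDS):
    """Return the normalized vendor ID, or None if the name is unknown."""
    return lookup.get(raw.strip().lower())


print(normalize_amount("$1,250.00"))
print(normalize_vendor(" Acme Corp "))
```

An unknown vendor returns `None` rather than a guess, so the record can be routed to a person for review, which ties back to the quality assurance step mentioned earlier.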

In an ideal world, the product with the best export will also have the best extraction capabilities for your needs. But there is no excuse for buying a data capture product that is a poor fit for you because of its excellent export format functionality. I've witnessed integrations done very poorly, with low accuracy, just because the product offered the proper file format. While the format of the data coming out of a recognition system is the key to all downstream processes, it's the easiest part of the whole document automation process. Organizations should first consider how accurately and how capably a product extracts information, and only then how to get the extracted data into the correct format.

Chris Riley (chris.riley@livinganalytics.com) is founder of Living@nalytics (www.livinganalytics.com), where he uses his in-depth knowledge of data capture technologies to advise clients and proselytize the value of these tools.

**Note, this is a Wikipedia link. A good description, but, as always with Wikipedia, accuracy may not be 100%.
