Don't make the mistake of purchasing your data capture solution based on the product's export functionality. Almost always, export is easy. Instead, focus on how accurate the capture product will be for you.
Companies that use optical character recognition (OCR)
and data capture applications see the technology as a necessary evil to get to a
final result. The thinking goes: something goes in, magic happens in the middle,
and, finally, an electronic file comes out. Consequently, organizations focus on
the resulting file format, instead of how that data is
generated. Such thinking can result in picking a product based only on its
export functionality, rather than the accuracy of data extraction. It can also
result in neglecting aspects of the technology that could save time, and never
really understanding what is possible.
Format Addicted
I first discovered this thinking through
my experiences showing these technologies to end-users. I've given hundreds of
technology demos during the last six years, showcasing various data capture and
OCR products. I was surprised that the conversation about export format was
often the cause of my missing the next meeting. While my demos would focus on
extraction, the hour-long demo would stretch to an hour and a half, with the
last thirty minutes all about export. Either the demo of the technology was
so awesome that it left no questions, or, more likely, export was all the
audience really cared about. It often made me very nervous when I got
to the point of export and had heard no questions about the accuracy or the
difficulty of setup for the recognition step. At this point I would always
force the topic by stressing the importance of quality assurance and business
rules for checking data.
If You Have the Data, You Have It However You Want
The
simple fact is that nearly any XML or text file format can be converted very
quickly and accurately to any other format -- as long as it contains the proper
data. The place where export nearly always takes precedence over the extraction
process is healthcare. In healthcare there are three specific file formats:
837, 857, and HL7. These formats have very specific methods for reporting and
loading data. I've seen clinics and hospitals purchase inaccurate recognition
technology simply because these formats were supported. The fact is that all
three of these formats can be generated from an XML file, so the hospital ended
up with a lesser product for more money. But healthcare is not the only
industry with specific file formats, and as Electronic Data
Interchange (EDI)** becomes more popular, more formats are being
developed.
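To make the conversion point concrete, here is a minimal Python sketch that turns XML capture output into two other formats. The XML layout (documents, document, and field elements with a name attribute) and the field names are assumptions made up for illustration, not any product's actual schema. A real 837 or HL7 writer would have to layer the rules of those standards on top of the same idea.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical capture output; real products use their own element names.
SAMPLE_XML = """
<documents>
  <document>
    <field name="invoice_number">INV-1001</field>
    <field name="vendor">Acme Supply</field>
    <field name="total">1250.00</field>
  </document>
  <document>
    <field name="invoice_number">INV-1002</field>
    <field name="vendor">Globex</field>
    <field name="total">89.50</field>
  </document>
</documents>
"""


def xml_to_records(xml_text):
    """Turn each <document> into a dict of field name -> extracted value."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for doc in root.findall("document"):
        record = {f.get("name"): (f.text or "").strip() for f in doc.findall("field")}
        records.append(record)
    return records


def records_to_json(records):
    """Serialize the records as a JSON array."""
    return json.dumps(records, indent=2)


def records_to_csv(records, columns):
    """Serialize the records as CSV with a header row, in the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = xml_to_records(SAMPLE_XML)
print(records_to_json(records))
print(records_to_csv(records, ["invoice_number", "vendor", "total"]))
```

Once the extracted values sit in plain records, each additional output format is one more small function. None of them can repair a value that was misread during recognition.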
Because these industry-specific file formats are the largest portion of any
business process they feed, I'm not surprised at the focus given to them.
However, when a company asks if a text file export can have two-tab spacing
instead of one, or use a field separator that is an underscore instead of a tab,
or if the first two lines of the export file can contain a static header, I
start to scratch my head. I understand the need to get the right output, but I
don't understand how this need is more important than how data extraction occurs
and how accurate it is.
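Each of those requests amounts to a parameter in a few lines of code. The sketch below is one hedged illustration in Python; the function name export_text, the record layout, and the header text are all hypothetical and not taken from any capture product.

```python
def export_text(records, columns, separator="\t", header_lines=()):
    """Write records as delimited text.

    separator    -- string placed between fields (one tab, two tabs, "_", ...)
    header_lines -- static lines written before the data rows
    """
    lines = list(header_lines)
    for record in records:
        lines.append(separator.join(str(record.get(col, "")) for col in columns))
    return "\n".join(lines) + "\n"


# Hypothetical extracted data.
records = [
    {"invoice_number": "INV-1001", "total": "1250.00"},
    {"invoice_number": "INV-1002", "total": "89.50"},
]
columns = ["invoice_number", "total"]

# Two-tab spacing instead of one.
two_tabs = export_text(records, columns, separator="\t\t")

# Underscore separator with a static two-line header.
underscored = export_text(
    records,
    columns,
    separator="_",
    header_lines=["EXPORT v1", "SOURCE capture"],
)

print(two_tabs)
print(underscored)
```

One caveat with this sketch: it does no escaping, so an underscore separator would collide with any field value that itself contains an underscore.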
Most of the products out there actually have the concept of “Custom Export.”
This is a tremendous tool: if you can dream up an export, you can pretty much
create it. Even where it's not available, I've yet to find a product whose
export cannot be easily translated. XML is probably the most popular
format and there are many XML translation tools available. The only time export
format comes with barriers is when the data on the form has to be replaced with
normalized values from a database. However, a simple database lookup should fix
this. For example, if your accounting system only accepts numbers without a
dollar sign or comma on invoices, then you will have to apply some simple
find-and-replace rules.
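Both fixes are small. The following Python sketch shows a find-and-replace rule for amounts and a database lookup for normalized values. The vendors table, its columns, and the sample IDs are assumptions for illustration, and an in-memory SQLite database stands in for whatever system holds the real reference data.

```python
import re
import sqlite3


def normalize_amount(raw):
    """Strip dollar signs, commas, and whitespace from a captured amount."""
    return re.sub(r"[$,\s]", "", raw)


def build_vendor_db():
    """Create a hypothetical lookup table mapping captured names to vendor IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vendors (captured_name TEXT PRIMARY KEY, vendor_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO vendors VALUES (?, ?)",
        [("acme supply", "V-001"), ("globex", "V-002")],
    )
    return conn


def lookup_vendor_id(conn, captured_name):
    """Return the normalized vendor ID, or None if the name is unknown."""
    row = conn.execute(
        "SELECT vendor_id FROM vendors WHERE captured_name = ?",
        (captured_name.strip().lower(),),
    ).fetchone()
    return row[0] if row else None


conn = build_vendor_db()
print(normalize_amount("$1,250.00"))
print(lookup_vendor_id(conn, " Acme Supply "))
```

Returning None for an unknown name, rather than guessing, lets a quality assurance step flag the record for review.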
In an ideal world, the product with the best export will also have the best
extraction capabilities for your needs. But there is no excuse to get a data
capture product that is a poor fit for you because of its excellent export
format functionality. I've witnessed integrations that were done poorly and
delivered low accuracy, chosen just because the product offered the proper file
format. While the format of the data
coming out of a recognition system is the key to all downstream processes, it's
the easiest part of the whole document automation process. Organizations should
first consider how accurately and how capably a product extracts information,
and only then how to get the extracted data into the correct format.
Chris Riley (chris.riley@livinganalytics.com)
is founder of
Living@nalyitcs (www.livinganalytics.com) where he uses
his in-depth knowledge of data capture technologies to advise clients and
proselytize the value of these tools.
**Note, this is a Wikipedia link. A good
description, but, as always with Wikipedia, accuracy may not be
100%.