Documents. With Class.
Document classification is a key part of any data capture strategy. However, it can also be used in advance of rolling out your entire capture strategy. A few thoughts on the importance of document classification.
By Chris Riley
Any advanced technology has several components that could stand alone for other
purposes. Data mining has basic search. Content management has basic tagging.
Data capture is no different.
While most people consider “data capture” a single thing, a trend is evolving,
as the market demands more education and explanation, to start looking at the
sub-components of data capture. This trend allows organizations to deploy only
those pieces that make the most sense and have a clear path to success. Once
success is achieved they then can move to the entirety of data capture.
One component of data capture that has been overlooked and extremely
underestimated is document classification.
What Class Is Your Document In?
Before data capture
technologies can do the magic of field location and extraction using optical
character recognition (OCR) or intelligent character recognition (ICR), they
must first decide the page type (sometimes, the type of an entire document).
Types might be obvious to the world, or only specific to an organization. Types
can be determined by layout (lines, barcodes, graphics), or by context (words,
codes, dictionaries). All data capture solutions have this built in as a part of
the template matching or document identification process. When companies
deployed data capture packages, classification was geared towards feeding the
data capture process, not necessarily to stand alone as a function.
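To make the "by context" idea concrete, here is a minimal sketch in Python of classifying a page by the words found in its OCR text. It is an illustration only, not how any particular data capture product works: the page types, the keyword lists, and the function name classify_page are all hypothetical, and a real system would combine this with layout cues such as lines and barcodes.

```python
# Hypothetical page types and keywords; an organization would define its own.
KEYWORDS = {
    "purchase_order": ["purchase order", "po number"],
    "vendor_invoice": ["invoice", "amount due"],
    "disclaimer": ["terms and conditions", "payment instructions"],
}


def classify_page(text, keywords=KEYWORDS):
    """Return the page type whose keywords occur most often, or 'unknown'."""
    lowered = text.lower()
    # Score each type by counting how many times its keywords appear.
    scores = {
        page_type: sum(lowered.count(word) for word in words)
        for page_type, words in keywords.items()
    }
    best_type = max(scores, key=scores.get)
    # No keyword hit at all means the page goes to a person for review.
    return best_type if scores[best_type] > 0 else "unknown"


print(classify_page("INVOICE 1234\nAmount due: $50"))
```

Pages that come back as "unknown" are the ones that still need a human, which is exactly the labor the rest of the process is trying to reduce.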
Interestingly enough, however, many organizations have bought data capture
applications just for the purpose of classification. They have done so with a
success rate that seems to dwarf that of the overall data capture process. Let’s look at
why.
One major challenge with data capture is the human labor associated with
putting documents into groups. With documents automatically classed, this
expense and time suck goes away. Because of this, I think using document
classification on its own is only going to become more popular as companies see
that they can
first tackle that one major problem. Once successful, a company can then embark
on the next, laborious steps towards data capture – but with a better chance of
success. This approach also allows a company to better frame the process
step-by-step for the technology vendors – tightly nailing down a well-defined
problem and then moving outward from there with the technology. Vendors are
often inclined to be helpful because they want the license value (for their
bottom line) of the company’s entire data capture process.
Politics? What Politics!?
Classification can be a dream
or a true nightmare to set up. It all depends on the documents. (I'm using the
term “document” to mean a record that could be a single page or multiple pages,
with each page somehow relating to all the others.) If you are a little confused, you
should be. Understanding your documents is the greatest stumbling block to
classification. Sometimes, documents are very clear. Take accounts payable
processing as an example. A document could be a purchase order that connects to
a received invoice: this is the entire document. Within this document are two
page types, purchase order and vendor invoice. That was not so bad. Now what
happens if you scan in duplex and the back of the invoice has payment
instructions or disclaimers? What do you do with this page? That’s still
probably not too complicated, as you may just decide to omit the page from the
document if it does not have pertinent payable data. The point: this is just a
small illustration of how quickly the definition of a document gets complicated
for an organization.
The preferred approach is to start with a study of what your objective types
(page-level understanding) are. This could go as deep as disclaimers, waivers,
and descriptor pages. Once this is done, determine the rules that combine the
pages into documents. In most environments the rules are flexible. For example,
an invoice from a vendor can be 1 to 10 pages: the first page will have a
header, the last page will have a total, and everything in between is a detail
page. Doing this lets you use all the cool tools automated document
classification has to offer. The only problem with this approach is the risk of
a never-ending list of objective page-level types.
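The invoice rule above can be sketched in a few lines of Python. This is only one possible reading of the rule, under assumptions of my own: each page has already been classified and is represented as a dict with hypothetical "header" and "total" flags, and anything that does not fit the rule is set aside for a person to review. The function name group_invoice_pages is likewise made up for the example.

```python
MAX_PAGES = 10  # upper bound taken from the invoice example in the text


def group_invoice_pages(pages, max_pages=MAX_PAGES):
    """Group page indexes into invoices.

    Each page is a dict with boolean 'header' and 'total' flags (a
    hypothetical output of page-level classification). An invoice opens on
    a header page and closes on a total page; pages in between are detail
    pages. Returns (documents, exceptions), each a list of page-index lists.
    """
    documents, exceptions, current = [], [], []
    for index, page in enumerate(pages):
        if page["header"] and current:
            # A new header arrived before a total: send the open run to review.
            exceptions.append(current)
            current = []
        if not page["header"] and not current:
            # A page with no open invoice to attach to.
            exceptions.append([index])
            continue
        current.append(index)
        if len(current) > max_pages:
            # Longer than the rule allows: send the run to review.
            exceptions.append(current)
            current = []
        elif page["total"]:
            documents.append(current)
            current = []
    if current:
        # The batch ended before a total page was seen.
        exceptions.append(current)
    return documents, exceptions


batch = [
    {"header": True, "total": True},    # one-page invoice
    {"header": True, "total": False},   # first page of a longer invoice
    {"header": False, "total": False},  # detail page
    {"header": False, "total": True},   # last page
    {"header": False, "total": False},  # stray page
]
print(group_invoice_pages(batch))
```

Keeping the exceptions separate from the finished documents is a deliberate choice here: it keeps the rule simple and leaves the odd cases to a human instead of adding another page type for every one of them.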
Why Is Class Important?
What is so cool about classification is that you get even tighter control over
the quality of the automation, because it's much easier to mark each result as
right or wrong. Once an organization has a clear understanding of its documents,
and of how complex they are relative to automated classification, it can
determine an actual ROI (or at least get close). Also, because it's just one
component of the whole data capture process, classification lets the
organization roll out exception handling faster and perform initial setup
faster with less expertise. Document classification –
whether acknowledged or not – is a mandatory step in any data capture process
and cannot be avoided. Why not excel at it?
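As a rough illustration of how simple the right-or-wrong check can be, the sketch below tallies accuracy per page type from a reviewer's marks. The input format (pairs of predicted type and a right/wrong flag) and the function name accuracy_by_type are assumptions made for this example, and the sample marks are invented purely to show the call.

```python
from collections import defaultdict


def accuracy_by_type(reviewed_pages):
    """Return the share of correct results for each predicted page type.

    reviewed_pages is an iterable of (predicted_type, is_correct) pairs,
    where is_correct is the reviewer's right/wrong mark.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for predicted_type, is_correct in reviewed_pages:
        totals[predicted_type] += 1
        if is_correct:
            correct[predicted_type] += 1
    return {page_type: correct[page_type] / totals[page_type] for page_type in totals}


# Invented sample marks, for illustration only.
marks = [("invoice", True), ("invoice", False), ("purchase_order", True)]
print(accuracy_by_type(marks))
```

Numbers like these, tracked per type, are the kind of input an ROI estimate would need.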
As I mentioned before, the trend of tackling data capture piece by piece, rather
than as a whole, is growing as market education on this type of technology
increases. Companies are seeking a path to success in document automation. The
step-by-step path is much less overwhelming than taking on an entire data
capture process. When an organization decides to do this and to truly
understand its documents, it is taking the accuracy of an automated system into
its own hands and giving the technology the best chance to work.
Chris Riley is founder of http://www.livinganalytics.com, where he uses his
in-depth knowledge of data capture technologies to advise clients and
proselytize the value of these tools. Chris recently was the featured speaker
for our March 5 webinar, Tips and Tricks to Help You Automate your Office
Documents (for Effective Data Capture).