For a successful imaging rollout, you need to understand your documents: not only how they fit into the process (that’s important too), but also whether they are fixed, semi-structured, or unstructured. What? Read on.
Knowing the nature of your documents is not only critical to the initial
decisions you will make at time of integration; it's also one of the greatest
challenges. Why is it so challenging? Not only is it a somewhat difficult task
in and of itself, but vendors don’t help. Vendor terms and definitions are
often used in different ways, and people get confused. Eliminate one source of
confusion by understanding your documents and whether your forms are fixed,
semi-structured, or unstructured.
I find myself educating even industry
peers on the topic regularly. Sometimes the discussion begins with some
variant of “unstructured document processing” and how it works or exists. At
other times an organization explains to me that their forms are fixed when they
clearly are not. Understanding what is meant when talking about document
structure is very important.
Your first, and greatest, challenge is to be picky with your terms and their
definitions. “Unstructured” is the sexy term du jour for the capture crowd, akin
to “Cloud” and “SaaS” in other software spaces. It has its place in marketing,
but technically it tells you very little.
Definitions
First, let’s define a document: a document
is a collection of one or many pages that has a business process associated with
it. Documents of a single type can vary in length, but the content they contain,
and whether that content is present at all, is constrained.
Organizations see the single entity that is a document, but what is often
overlooked is that during document automation all processes are page level, and
then their results roll up to the document level using rules. This could be why
people believe unstructured document processing exists.
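To make the page-level idea concrete, here is a minimal sketch of results rolling up from pages to a document. Everything in it is an assumption made for illustration: the `PageResult` type, the `roll_up` function, and the "first page to supply a field wins" rule are hypothetical, not taken from any particular capture product.

```python
from dataclasses import dataclass


@dataclass
class PageResult:
    """Hypothetical output of page-level processing: a page type plus extracted fields."""
    page_type: str
    fields: dict


def roll_up(pages, required_types):
    """Combine page-level results into one document-level record using simple rules."""
    merged = {}
    for page in pages:
        for name, value in page.fields.items():
            # Assumed rule: the first page that supplies a field wins.
            merged.setdefault(name, value)
    found = {page.page_type for page in pages}
    # Assumed rule: the document is complete only if every required page type was seen.
    missing = sorted(set(required_types) - found)
    return {"fields": merged, "missing_page_types": missing, "complete": not missing}
```

Page order does not matter to the completeness rule, which is the point: the document can be loose about ordering while each page is still identified and processed on its own.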
While a document in its entirety may be unstructured because the pages can go
in any order and may or may not exist, each individual page is not. Unstructured
documents are rare; examples include contracts and agreements. The test is
simple: if at any moment you can pull a page from the document and state
what that page is and what information it would contain, then it IS NOT
unstructured. Take a page out of the middle of an agreement and try to identify
it. With certain static contracts this will be possible, but for the most part
it is impossible, as there are as many versions of agreements as there are
instances. A page may start with a paragraph continued from another page.
Warranty information might be on page 3 of one contract and on pages 7-8 of
another. Another example of unstructured documents is corporate annual reports.
An annual report will contain a balance sheet somewhere, but you don't know
where, and each company formats it differently. The type most often mistaken
for unstructured is mortgage documents. Popular opinion holds that mortgage
documents are unstructured. In reality, you can take any page and objectively
determine its type, which makes it semi-structured.
The ability to process unstructured documents is limited to very concrete
scenarios, and the organization doing the processing usually has a close
connection to the generation of that document’s original content. In general,
the ability to process unstructured documents does not exist. As a conversation
progresses, I quickly get to the fact that we’re actually talking about
semi-structured forms.
Now: the difference between semi-structured and fixed.
Is It Fixed?
The difference between fixed and
semi-structured is fairly easy to state, though the lines can be blurred. Some,
if not all, fixed forms can be processed as semi-structured; some
semi-structured forms can be processed as fixed. Let's define these two types.
In fixed form processing you use x/y coordinates, along with a width and
height, for each field to tell where the field is located. But before
coordinates can be created for a fixed form, each page needs to be normalized to
a template. This normalization requires cornerstones and reference marks that
allow the software to align fields. On a fixed form the number and location of
fields do not change. This includes field width and height. An invoice page from
a single vendor IS NOT a fixed form, but a survey from an airplane magazine is.
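A minimal sketch of coordinate-based field lookup follows. The template values, field names, and the `align_fields` function are all hypothetical, and the alignment is deliberately simplified to a pure translation from a single reference mark; real normalization would also need to handle skew and scale.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A field zone: top-left corner plus size, in pixels."""
    x: int
    y: int
    width: int
    height: int


# Hypothetical template: where the reference mark and fields sit on the master form.
TEMPLATE_MARK = (50, 50)
TEMPLATE_FIELDS = {
    "seat_number": Box(400, 120, 80, 30),
    "rating": Box(400, 200, 40, 30),
}


def align_fields(detected_mark, template_mark=TEMPLATE_MARK, fields=TEMPLATE_FIELDS):
    """Shift every template zone by the offset between detected and template marks."""
    dx = detected_mark[0] - template_mark[0]
    dy = detected_mark[1] - template_mark[1]
    # Width and height never change on a fixed form; only the position is corrected.
    return {name: Box(b.x + dx, b.y + dy, b.width, b.height) for name, b in fields.items()}
```

If the reference mark is found at (53, 48) on a scanned page, every zone moves 3 pixels right and 2 pixels up before the field contents are read.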
Semi-structured forms comprise 80% of documents. A page from a
semi-structured form may or may not contain information from a static list of
fields, and the location of the fields can change. Instead of coordinates,
rules are used to locate information by, for example, looking for keywords or
graphics that indicate field location.
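As an illustration of a keyword rule, the sketch below looks for an invoice-number label anywhere in the recognized text and takes the value that follows it. The function name, the label variants, and the assumption that OCR output arrives as a list of text lines are all hypothetical; a production rule set would be considerably richer.

```python
import re

# Assumed label variants: "Invoice No", "Invoice No.", "Invoice Number", "Invoice #".
INVOICE_LABEL = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)",
    re.IGNORECASE,
)


def find_invoice_number(lines):
    """Return the value that follows an invoice-number label, wherever it appears."""
    for line in lines:
        match = INVOICE_LABEL.search(line)
        if match:
            return match.group(1)
    return None  # the field may legitimately be absent on a semi-structured page
```

The rule does not care where on the page the label sits, which is exactly what distinguishes it from a coordinate-based zone.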
The confusion between form types arises when information on a form seems to be
located in the same general place. On an invoice, the invoice number moves
significantly from vendor to vendor, though it usually stays in the top right
quarter of the page. On a tax form, by contrast, the location where the
organization name is printed is pretty well set. But even if a field appears in
the same general location on every page of a particular type, that does not
make the form fixed. The printer has to print the company name within a
specified range, yet it can print more to the left or more to the top, and the
length will vary with every name. This makes the form semi-structured. In
addition, the document will shift left, right, up, or down by small amounts
when scanned. While you can process a tax form as fixed, the challenge will be
making fields big enough to contain the data. Oversized fields lower optical
character recognition accuracy because of the extra white space, and shifts may
pull in additional, incorrect text. The test here is very simple: if your form
has registration marks, it's fixed; if it does not, it's semi-structured.
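The two tests in this article (registration marks, and whether a page pulled out on its own can be identified) can be combined into a small decision sketch. The function and parameter names are hypothetical, and the answers to both questions are assumed to come from a person or an upstream step, not from this code.

```python
def classify_page(has_registration_marks: bool, identifiable_in_isolation: bool) -> str:
    """Apply the article's two rules of thumb to a single page."""
    if has_registration_marks:
        # Registration marks let software normalize the page to a template.
        return "fixed"
    if identifiable_in_isolation:
        # You can say what the page is and what it should contain.
        return "semi-structured"
    return "unstructured"
```

A mortgage page with no registration marks that can still be identified on its own comes out as semi-structured, while a page pulled from the middle of a one-off agreement comes out as unstructured.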
Form type is slightly more than a recommendation when it comes to
semi-structured forms. It is possible to process them as fixed. Doing so
reduces software cost and increases ease of integration and processing
efficiency. But it also reduces accuracy. Additionally, you have to create many
fixed templates instead of a few semi-structured ones, which increases
integration time and the risk of template mismatch and false positives.
I find it helps to define technologies objectively rather than based on
popular opinion. When most people talk about technology, they throw out terms
the way popular opinion uses them and seldom define them objectively. When
dealing with advanced technologies such as document automation this can
confuse, and when that confusion bleeds into implementation, the results can be
tragic.
Chris Riley is
founder of Living Analytics
where he uses his in-depth knowledge of data capture technologies to advise
clients and proselytize the value of these tools.