If It’s Semi-Structured Why Fixed It?

For a successful imaging rollout, you need to understand your documents: not only how they fit into the process (that’s important too), but also whether they are fixed, semi-structured, or unstructured. What? Read on.


Knowing the nature of your documents is critical to the initial decisions you will make at integration time, and it is also one of the greatest challenges. Why is it so challenging? Not only is it a somewhat difficult task in and of itself, but vendors don’t help. Vendors often use the same terms with different definitions, and people get confused. Eliminate one source of confusion by understanding your documents and knowing whether your forms are fixed, semi-structured, or unstructured.

I regularly find myself educating even industry peers on the topic. Sometimes the discussion begins with some variant of “unstructured document processing” and a claim about how it works. Other times an organization explains to me that their forms are fixed when they clearly are not. Understanding what is meant when talking about document structure is very important.

Your first, and greatest, challenge is to be picky with your terms and their definitions. “Unstructured” is the sexy term du jour for the capture crowd, akin to “Cloud” and “SaaS” in other software spaces. It has its place in marketing, but technically it tells you very little.

Definitions
First, let’s define a document: a document is a collection of one or more pages that has a business process associated with it. Documents of a single type can vary in length, but the content they contain, and whether a given piece of content can appear at all, is constrained.

Organizations see the single entity that is a document, but what is often overlooked is that during document automation all processes are page level, and then their results roll up to the document level using rules. This could be why people believe unstructured document processing exists.
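
To make the page-to-document roll-up concrete, here is a minimal sketch in Python. It assumes each page has already been classified and extracted on its own. The names `PageResult` and `roll_up`, and the two rules used (first non-empty value wins, every required page type must be present), are hypothetical illustrations and not any product's actual behavior.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PageResult:
    page_type: str  # label assigned by a page-level classifier (assumed)
    fields: dict    # field name -> value extracted from this page only


def roll_up(pages, required_types):
    """Combine page-level results into one document-level record.

    Hypothetical rules: a field keeps the first non-empty value found in
    page order, and the document is complete only when every required
    page type was seen at least once.
    """
    merged = {}
    for page in pages:
        for name, value in page.fields.items():
            if value and name not in merged:
                merged[name] = value
    seen = Counter(page.page_type for page in pages)
    missing = [t for t in required_types if seen[t] == 0]
    return {"fields": merged, "complete": not missing, "missing": missing}
```

The point of the sketch is that nothing in it looks at the document as a whole: every decision starts from a single page, and the document is only assembled afterward.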

While a document in its entirety may be unstructured because the pages can go in any order and may or may not exist, each individual page is not. Unstructured documents are rare; an example could be contracts and agreements. The test is simple: if at any moment you can pull a page from the document and state what that page is and what information it would contain, then it IS NOT unstructured. Take a page out of the middle of an agreement and try to identify it. With certain static contracts this will be possible, but for the most part it is impossible, as there are as many versions of agreements as there are instances. A page may start with a paragraph continued from another page. Warranty information might be on page 3 of one contract and on pages 7-8 of another. Another example of unstructured would be corporate annual reports. An annual report will have a balance sheet somewhere, but you don't know where, and each company will format it differently. The type most often mistaken for unstructured is mortgage documents; popular opinion holds that they are unstructured. In reality, you can take any page and objectively determine its type, which makes it semi-structured.

The ability to process unstructured documents is limited to very concrete scenarios, and the organization doing the processing usually has a close connection to the generation of that document’s original content. In general, the ability to process unstructured documents does not exist. As a conversation progresses, I quickly get to the fact that we’re actually talking about semi-structured forms.

Now: the difference between semi-structured and fixed.

Is It Fixed?
The difference between fixed and semi-structured is fairly easy to state, though the lines can be blurred. Some, if not all, fixed forms can be processed as semi-structured; some semi-structured forms can be processed as fixed. Let’s define these two types.

In fixed form processing you use coordinates (an x/y position with a width and height) for each field to tell the software where the field is located. But before coordinates can be created for a fixed form, each page needs to be normalized to a template. This normalization requires cornerstones and reference marks that allow the software to align fields. On a fixed form the number, location, width, and height of the fields do not change. An invoice page from a single vendor IS NOT a fixed form, but a survey from an airplane magazine is.
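
As a rough illustration of template normalization, the Python sketch below shifts every template field box by the offset between where a reference mark should be and where it was actually found on the scan. It is deliberately simplified: it handles translation only, while real products typically also correct skew and scale. The template values, field names, and `locate_fields` are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


# Hypothetical template: where the reference mark and each field sit on a
# perfectly aligned page.
TEMPLATE_MARK = (50, 50)
TEMPLATE_FIELDS = {
    "seat_number": Box(400, 120, 80, 30),
    "rating": Box(400, 200, 40, 30),
}


def locate_fields(found_mark, template_mark=TEMPLATE_MARK, fields=TEMPLATE_FIELDS):
    """Shift every template box by the offset of the detected reference mark."""
    dx = found_mark[0] - template_mark[0]
    dy = found_mark[1] - template_mark[1]
    return {
        name: Box(box.x + dx, box.y + dy, box.width, box.height)
        for name, box in fields.items()
    }
```

Note that the width and height of each box never change, only its position. That is exactly the property that makes a form fixed.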

Semi-structured forms comprise 80% of documents. A page from a semi-structured form may or may not contain each field from a static list of fields, and the location of those fields can change. Instead of coordinates, rules are used to locate information, for example by looking for keywords or graphics that indicate field location.
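
A keyword rule of this kind can be sketched in a few lines of Python. The example assumes an OCR step has already produced a list of words with positions. The `Word` type, the function name `find_value_right_of`, and the "nearest word to the right on the same line" rule are hypothetical simplifications; real rule engines support far richer searches.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int  # left edge of the word, in pixels
    y: int  # top edge of the word, in pixels


def find_value_right_of(words, keyword, line_tolerance=10):
    """Return the nearest word to the right of `keyword` on the same line.

    Returns None when the keyword, or a word to its right, is not found.
    """
    anchors = [w for w in words if w.text.lower().rstrip(":") == keyword.lower()]
    if not anchors:
        return None
    anchor = anchors[0]
    candidates = [
        w for w in words
        if w.x > anchor.x and abs(w.y - anchor.y) <= line_tolerance
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w.x).text
```

Because the rule is anchored to the keyword rather than to the page, it keeps working when the label and its value move together to a different part of the page.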

The confusion between form types arises when information on a form seemingly sits in the same general place. On an invoice, the invoice number is usually in the top right quarter of the page, but within that area it moves significantly from vendor to vendor. On a tax form, the location where the organization name is printed is pretty well set. But even if a field appears in the same general location on every page of a particular type, that does not make the form fixed. The printer has to print the company name within a specified range, but it can print more to the left or more to the top, and the length will vary with every name. This makes it semi-structured. Additionally, when scanned, the document will shift left, right, up, or down by small amounts. While you can process a tax form as fixed, the challenge will be making fields big enough to contain the data. Larger fields lower optical character recognition accuracy because of the extra white space, and shifts may pull in additional, incorrect text. The test here is very simple: if your form has registration marks, it's fixed; if it does not, it's semi-structured.
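
The two tests described in this article, whether you can identify any page pulled from the document and whether the form carries registration marks, can be written down as a small decision function. This is only a sketch of the reasoning in Python; the function name and its two boolean inputs are hypothetical, and answering them for a real document is still a human judgment.

```python
def classify_page(page_type_identifiable, has_registration_marks):
    """Apply the article's two tests in order.

    page_type_identifiable: can you pull the page out of the document and
        state what it is and what information it would contain?
    has_registration_marks: does the form carry marks the software can
        use to align it to a template?
    """
    if not page_type_identifiable:
        return "unstructured"
    if has_registration_marks:
        return "fixed"
    return "semi-structured"
```

For example, a tax form page is identifiable but has no registration marks, so `classify_page(True, False)` gives "semi-structured".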

Form type is slightly more than a recommendation when it comes to semi-structured forms. It is possible to process them as fixed. Doing so reduces software cost and increases ease of integration and processing efficiency. But it also reduces accuracy. Additionally, you have to create more fixed templates instead of fewer semi-structured ones, which increases integration time and the risk of template mismatch and false positives.

I find it helps to define technologies objectively rather than based on popular opinion. When most people talk about technology, they throw out terms the way popular opinion uses them and seldom objectively. When dealing with advanced technologies such as document automation this can confuse, and when that confusion bleeds into implementation, the results can be tragic.

Chris Riley is founder of Living Analytics where he uses his in-depth knowledge of data capture technologies to advise clients and proselytize the value of these tools.