Avoiding the document imaging Mulligan

Keywords: scanning, OCR, paperless office

One of the worst things that can happen in an imaging environment is having to re-image documents.  But it happens.  It can happen because of dramatic advances in the technology being used, because the initial planning was poor, or, very commonly, because images are being lost.  The last is partly poor planning, yes, but it also happens because imaging does not live in isolation, as many believe.  To build a successful document imaging environment, organizations must not treat it as just a content-gathering process; they must also think about how the data will be retrieved.

If there is an opposite of computer rage, I have it.  Call it computer mania: get a new technology, use it exhaustively, and ignore the obvious consequences.  I have learned a lot of great lessons doing this.  One of those lessons relates to my paperless office.  When I first started imaging ALL my documents about five years ago, I did so blindly.  I assumed that, with OCR results, I would be able to find any document.  What I neglected to investigate was how I searched.  A blind reliance on desktop search clients resulted in my losing documents for a period of time.

Initially, I approached the problem by trying every desktop search client I could find.  I waited for large indexes to build, then searched for documents I knew existed but could not find.  Some of the missing documents I could find manually, and I could also verify that the OCR results were there to back up the search.  What I found is that the core functionality of each search client was more or less the same, and none could combat the problem I had created.  So I was stuck.  I could not re-image the documents because, in my brilliance, I had shredded every piece of paper.  My options were, first, to re-design the system and put in some serious manual effort to get current documents to comply, or second, to re-OCR a large volume of already OCRed PDFs.  Because I knew how badly I would compromise the final result with the second option, I rolled up my sleeves, did a re-design, and began the slow process of getting my existing documents to comply.

That was 300K documents (no, I did not look at every one).  What if I had a million?  This little story illustrates the cost of not thinking about how you will retrieve documents at the same time you are thinking about how you input them.  I ended up refining my system, and it works very well.  There are now some additional input steps on my part, but the assurance those steps provide will save me from ever facing the issue again.  Here are some cool things you should consider bringing to the document imaging table.

  1. Taxonomy.  Build a high-level classification for your documents, so that you can at the very least reduce the burden of the search to some subset of images.  Taxonomy is also useful in refining search results.  A well-designed taxonomy will reduce your reliance on search.
  2. Meaningful file names.  When you get your scanner, your images may be produced with a name generated from some prefix, a date stamp, and maybe an incrementing number.  If at all possible, name your documents with meta-data, or some more relevant piece of information.  This could even be at the batch level.  I’m not suggesting you get rid of dates; they are always useful.  When I had to use brute force search on a large collection of documents, good naming would have saved me time.
  3. Facets / Keywords.  Incorporate into the meta-data of a document keywords or facets that clearly signify that document’s topic.  These will help in search filtering as well as in getting the right documents in the right place.
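
As a rough illustration of point 2, here is a minimal Python sketch that builds a descriptive file name from a few pieces of meta-data instead of accepting the scanner's default.  The function name `build_scan_name`, the field order, and the separator characters are all assumptions made for this example, not a standard; adapt them to whatever your own taxonomy and scanning software allow.

```python
import re
from datetime import date


def build_scan_name(category: str, subject: str, scanned_on: date,
                    sequence: int, ext: str = "pdf") -> str:
    """Compose a file name from meta-data (hypothetical naming scheme)."""

    def slug(text: str) -> str:
        # Lowercase, collapse anything that is not a letter or digit
        # into a single hyphen, and trim hyphens from both ends.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # Keep the date (always useful) and a zero-padded sequence number
    # so that files from one batch still sort in scan order.
    return (f"{slug(category)}_{slug(subject)}_"
            f"{scanned_on.isoformat()}_{sequence:04d}.{ext}")


name = build_scan_name("Taxes", "Property Tax Bill", date(2010, 3, 15), 7)
print(name)
```

A name built this way can be narrowed down with nothing more than a file-name filter, which is exactly what helps when you are reduced to brute force search.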

You will notice a pattern in these three tools: they all provide a quick way to take a large population of documents and create more manageable subsets.  This improves search and, when brute force search is unavoidable, reduces the effort required.  Proper implementation of these techniques also lets you create an endless number of virtual folders on the fly, which improves not only search but also your ability to perform more advanced analysis, such as business intelligence.
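
To make the "virtual folder" idea concrete, here is a small Python sketch that treats a virtual folder as nothing more than a filter over a document's taxonomy category and keywords.  The `ScannedDoc` record, the `virtual_folder` function, and the slash-separated category strings are hypothetical choices for this example; a real system would read this meta-data from the PDFs or from an index.

```python
from dataclasses import dataclass, field


@dataclass
class ScannedDoc:
    path: str
    category: str                       # taxonomy node, e.g. "finance/taxes"
    keywords: set = field(default_factory=set)


def virtual_folder(docs, category_prefix="", required_keywords=()):
    """Return the documents under a taxonomy branch that carry all keywords."""
    required = {k.lower() for k in required_keywords}
    return [
        d for d in docs
        # Taxonomy narrows the population to one branch of the tree...
        if d.category.startswith(category_prefix)
        # ...and facets/keywords narrow it further (case-insensitive match).
        and required <= {k.lower() for k in d.keywords}
    ]


docs = [
    ScannedDoc("a.pdf", "finance/taxes", {"2009", "receipt"}),
    ScannedDoc("b.pdf", "finance/insurance", {"policy"}),
    ScannedDoc("c.pdf", "medical/bills", {"receipt"}),
]
finance = virtual_folder(docs, "finance")
receipts = virtual_folder(docs, required_keywords=["Receipt"])
```

Because each folder is only a query, the same document can appear in as many of them as you like without being copied or moved.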

If you do have to start over and take a document imaging Mulligan, consider re-imaging the original documents.  Unless you have also saved a TIFF Group 4 version of each image, OCRing already OCRed documents such as PDFs dramatically reduces the quality of the output.  If you have the storage space and plan ahead, you can keep a TIFF Group 4 copy of every document, which gives you the best starting point should the need to re-OCR ever arise.

I will repeat this mistake with some other new technology until I find some cure for computer mania.  But you can learn from my mistake.  Do it right the first time and plan.

Comments

Abhijit Kulkarni

Hi Chris,
A thought came to my mind while searching documents. Can we have search based on an image? E.g., I could have a template image where different sections are highlighted. When I select this image for search, all documents based on such templates would be found. Here we don't have to remember names or index fields for the search. Documents would be searched based on the outline/template of the document. Do you know if such tools are available?

Abhijit
Chris Riley, ECMp, IOAp

Abhijit,

Thank you for calling it out. Image search is something I've been very passionate about. Out of the box, there are a few search engines attempting to incorporate image search. There are also apps by Nokia, Google, etc. that do this on mobile. However, none are really doing it in the way I've envisioned for enterprise use.

I have independently played heavily with the concept using an image-based classification engine, and other imaging technology. The potential is there for sure, but nothing mainstream.
Abhijit Kulkarni

Hi Chris,
Thanks for the reply. I can also see potential in this concept, as people identify a document mainly by its image rather than by the content on that document. If you are doing any research and development on this topic, I would like to be part of it.

Regards,
Abhijit

This post and comment(s) reflect the personal perspectives of community members, and not necessarily those of their employers or of AIIM International