Losing Metadata during OCR process

Keywords: metadata, ediscovery, job server, captiva, OCR, PDF

During document migration, we are using Captiva/Job Server to add an OCR layer to our non-text PDFs, so they are fully indexed in Documentum. Our current process is stripping the original document metadata from the new OCRed file. For eDiscovery purposes, we need to retain that original metadata. Has anyone experienced this problem? Does anyone have suggestions for a solution?
Responses

Shane,

I'm not sure I fully understand what you are asking, so I will make several assumptions. It sounds like the problem you are describing is the difference between full-text OCR and data capture: full-text OCR captures all the text on the document, while data capture extracts specified field/value pairs. If that is the case, most companies run both processes during OCR: the full text serves search and retrieval (eDiscovery), and the field-level metadata feeds a business process. If you are stuck with what you have, it is not unheard of to parse the text layer of the PDF: convert the text layer to ASCII in memory and parse the field-level data out of it, much as a data capture engine would during OCR. The risk is that data capture and full-text OCR are two very different things. The accuracy tuning of the OCR engine is specific to each mode, so using one for the other costs you accuracy. I hope this gives you some guidance. Feel free to elaborate with more details: What types of documents? What data do you need from the documents? What is the data used for?
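
For illustration, here is a minimal sketch of the text-parsing idea in Python. It assumes the text layer has already been extracted to a plain string by whatever tool you use. The field names and patterns (`invoice_number`, `invoice_date`) and the sample text are hypothetical, chosen only to show the approach; real documents would need their own patterns.

```python
import re

# Hypothetical field patterns; adjust to the labels that actually
# appear on your documents.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number)\s*[:#]?\s*(\S+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(
        r"Date\s*:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE
    ),
}


def parse_fields(text):
    """Return the first match for each known field found in the OCR text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields


# Made-up sample of what an extracted text layer might look like.
sample_text = "ACME Corp\nInvoice No: 12345\nDate: 2011-03-07\nTotal due: 99.00"
print(parse_fields(sample_text))
```

Fields that are not found are simply left out of the result, so the caller can decide whether a missing value should send the document to manual review.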


I think Shane meant that in the process of adding the full-text layer to the original PDF file, the metadata already contained in the original PDF (PDF creator, etc.) gets overwritten.

Since this is always a risk, I would tend to archive both the original PDF and its OCRed rendition. Of course this doubles the content, but it would satisfy both needs, and the original would stay untouched.


Thomas,

Thank you for clarifying. Yes, by default the OCR products create a NEW file, so all the document properties are new. They do not simply add the layer to the original. The fix I would see is a pretty simple custom app that copies the properties over.
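
As a rough sketch of what such an app might do, the Python below shows only the merge policy: start from the OCRed file's properties and let every property from the original win. It assumes the properties have already been read into plain dictionaries keyed by the standard PDF document information entries (`Title`, `Author`, `Creator`, `Producer`, `CreationDate`, and so on). Reading and writing them would be done with whatever PDF library you use, which is not shown here. The function name and the sample values are hypothetical.

```python
def carry_over_properties(original_info, ocr_info):
    """Merge two property dictionaries, preferring the original's values.

    Properties that exist only in the OCR output are kept; any property
    present in the original overwrites the OCR output's value.
    """
    merged = dict(ocr_info)        # copy, so the inputs are not modified
    merged.update(original_info)   # original values take precedence
    return merged


# Made-up example values for illustration only.
original = {
    "Author": "J. Smith",
    "Creator": "Word",
    "CreationDate": "D:20090115120000",
}
ocr_output = {
    "Creator": "OCR engine",
    "Producer": "OCR engine",
    "CreationDate": "D:20110307090000",
}

merged = carry_over_properties(original, ocr_output)
print(merged)
```

Letting the original win on every key is one possible policy. If you need to record that the file was OCRed, you might instead keep the OCR tool's value under a separate custom property.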


You can add both the original PDF and the OCRed file to one virtual document, or you can add the OCRed file as a rendition of the original. A simple script in Captiva can help with this.
