Automated Redaction: Cake Walk? Nightmare?

There’s hardly any middle ground between the two extremes in automated redaction. Regardless of where you are on the spectrum, automating redaction can increase the efficiency of your capture solution.

In the field of data capture and OCR (optical character recognition), one of the coolest and most effective specific uses of the technology is automated redaction. Redaction is the process of taking a paper document and removing private information. It comes in several flavors: traditional redaction, where you cover the private information with a black rectangle; whiteout, where you cover the private information with a white rectangle; or sanitization, where you replace real information with fake information.

You typically see redaction with court documents, bank statements, or tax documents, and you typically see people removing social security numbers, addresses, or account numbers. The point of automated redaction is to make this a scanning process whereby the paper is scanned, automatically redacted, and an image file is saved with the private information blacked out. While there are many variations to this process, such as multi-layer PDFs with redacted and unredacted layers, these are the basics.

Automated redaction is either very easy, fast, and accurate, or very difficult and sometimes overwhelming. So let’s take a look at the differences between the two situations.

If you are dealing with typographic text, good image quality, and structured field information to be redacted (such as a social security number or account number), fully automated redaction can be very accurate, very fast, and one of the easiest data capture solutions to implement. When I say structured field information, I mean that the data being redacted comes in a predictable format; I am not so much concerned about its location on the page. For example, if you are redacting a social security number on a typographic document, it’s very easy to search the entire document for all instances of the format “NNN-NN-NNNN” where N represents any digit 0-9. What happens in an automated fashion is this: the document is OCRed, all instances of the social security format are found, and their coordinates are reported back to the software, which then draws a black rectangle over each location. Those who have technical experience with an OCR engine, together with a developer, can create such a solution in weeks; I know this from experience.

Now, if you know beforehand the coordinates (X, Y, height, and width) of where the to-be-redacted field is located, you don’t even have to run OCR: just slap a black or white rectangle on that spot. This is the easiest scenario. It can get more complex with different field types, and if you have to look in documents that are almost 100% text, like a letter, or where the to-be-redacted field wraps across lines, you will usually have to reference other words in the document to find the data. But all of the above still fall into the category of “easy.”
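To make the “easy” case concrete, here is a minimal Python sketch of the search-and-cover step. It assumes the OCR engine has already returned each word with its bounding box; the Word record, the function names, and the list-of-rows image are all hypothetical stand-ins for whatever your OCR engine and imaging toolkit actually provide.

```python
import re
from dataclasses import dataclass

# The "NNN-NN-NNNN" format, where N is any digit 0-9.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass(frozen=True)
class Word:
    """One OCRed word with its bounding box (hypothetical record layout)."""
    text: str
    x: int
    y: int
    w: int
    h: int


def find_redaction_boxes(words):
    """Return the (x, y, w, h) box of every word containing the SSN format."""
    return [(wd.x, wd.y, wd.w, wd.h) for wd in words if SSN_PATTERN.search(wd.text)]


def black_out(image, boxes):
    """Fill each box with black (0) in a grayscale image stored as rows of ints."""
    for x, y, w, h in boxes:
        for row in range(max(y, 0), min(y + h, len(image))):
            for col in range(max(x, 0), min(x + w, len(image[row]))):
                image[row][col] = 0
    return image


if __name__ == "__main__":
    # Pretend OCR output for one line of a page.
    ocr_words = [Word("SSN:", 0, 0, 4, 2), Word("123-45-6789", 5, 0, 11, 2)]
    page = [[255] * 20 for _ in range(4)]  # small all-white "page"
    black_out(page, find_redaction_boxes(ocr_words))
    for row in page:
        print("".join("#" if px == 0 else "." for px in row))
```

Note that the whole word box is covered, so any punctuation OCRed as part of the same word is blacked out too. When the coordinates are known in advance, find_redaction_boxes can be skipped entirely and a fixed box passed straight to black_out.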

For the most part, organizations will choose to redact information rather than replace it with fake information, but there is a very interesting use case for sanitization, which is replacing private data with fake information instead of masking it. One such case is re-purposing documents where private information cannot be shared. For instance, if you are seeking a vendor proposal or seeking OCR software, you may have to show samples, but your policy is that you cannot give out private information, meaning the samples you provide must be both indicative of your population of documents and clean of private data. In these scenarios, whiting out the private information and overwriting it with fake but typical data is very useful, and just one additional step beyond the “easy” redaction described above.
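One way to produce “fake but typical” data is to keep the shape of the original value and randomize its characters. The sketch below is only an illustration under that assumption; fake_like is a hypothetical helper, and a real project might instead draw from number ranges known never to be issued, since a random value can coincide with a real one.

```python
import random
import string


def fake_like(value, rng):
    """Return a fake value with the same format as `value`.

    Digits become random digits, ASCII letters become random letters of the
    same case, and everything else (dashes, spaces, punctuation) is kept.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isascii() and ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random()  # pass random.Random(seed) for repeatable samples
    print(fake_like("123-45-6789", rng))
    print(fake_like("ACCT 0042-XY", rng))
```

The fake text would then be drawn into the whited-out rectangle, so the sample still looks like a typical document to the vendor or the OCR software being evaluated.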

I am not yet sure what share of documents makes up the “easy” category, but in my experience I would hazard a guess that it’s the majority, likely above 60 percent. In most data capture technologies there is a nice gradient between easy and difficult, but in the case of redaction the same is not true. What I’ve found is that all redaction projects either fall into the above category or are extremely difficult.

If you have hand-printed text in unpredictable formats and locations, degraded documents with deteriorated characters, and/or fields of non-predictable formats to redact, then automated redaction can be one of the most complex tasks you will undertake. This usually pops up in projects where the documents are very old, of poor quality, or handwritten, and/or where you are redacting things like an address or proper name. All these challenges have unique solutions. For example, use the focus technique for poor quality documents: first focus on the part of the document where the information likely occurs and then do your search (the goal is to search the smallest possible region of the document). With a handwritten document, focus on the prefix and postfix words that would signify a field, and use tools that distinguish handprint from typographic text; the most common side effect here is OVER redaction. The reason these cases become so extremely hard is that redaction requires you to be 100% accurate (OK, OK: 99.999999998% accurate).
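The focus technique and the prefix-word idea can be combined in a few lines. The following Python sketch is an assumption-laden illustration, not a production method: it presumes OCR words arrive in reading order with bounding boxes, and the Word record, the function names, the region tuple, and the max_following cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCRed word with its bounding box (hypothetical record layout)."""
    text: str
    x: int
    y: int
    w: int
    h: int


def in_region(word, region):
    """True if the word's box lies entirely inside region = (x, y, w, h)."""
    rx, ry, rw, rh = region
    return (rx <= word.x and word.x + word.w <= rx + rw
            and ry <= word.y and word.y + word.h <= ry + rh)


def words_after_prefix(words, prefixes, region, max_following=3):
    """Focus on `region`, then flag the words that follow a prefix word.

    `words` must be in reading order. Up to `max_following` words on the same
    line as the prefix are flagged, whatever they contain, which is why this
    approach tends to over-redact rather than under-redact.
    """
    focused = [wd for wd in words if in_region(wd, region)]
    hits = []
    for i, word in enumerate(focused):
        if word.text.rstrip(":").lower() not in prefixes:
            continue
        for nxt in focused[i + 1:i + 1 + max_following]:
            if abs(nxt.y - word.y) > word.h // 2:  # dropped to another line
                break
            hits.append(nxt)
    return hits


if __name__ == "__main__":
    page_words = [
        Word("Name:", 10, 110, 40, 12),
        Word("J0hn", 55, 110, 30, 12),   # degraded text no pattern would match
        Word("Sm1th", 90, 111, 35, 12),
        Word("Date", 10, 126, 30, 12),
    ]
    flagged = words_after_prefix(page_words, {"name"}, region=(0, 100, 200, 40))
    print([wd.text for wd in flagged])
```

Because the search relies on position relative to the prefix word rather than on the content of the field, it still works when the characters themselves are too degraded to match any format.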

The consequences of bad redaction are huge, akin to the consequences of misreading a prescription in healthcare. You need to get it right or not do it at all, regardless of the document quality. Determining whether you got it right is very difficult. That’s where the QA process comes in; QA is a key part of any redaction project, yet it is often overlooked or underdeveloped.

It does not matter how easy or complex your redaction documents are; a proper QA process fed by automated redaction is the surefire way to get the most out of such technology. The QA process has to make it easy for an operator to see what is redacted, or what the software “thinks” should be redacted, and to change a redaction quickly prior to any commitment. Certain state laws will require that every single redacted page be reviewed through QA, and others will warrant some percentage of confident output. In any case only the “easy” documents should ever fully pass QA. Extra effort put into this QA process should not be feared, because it is the key to a quality solution that will hit the optimum ROI and be your “get out of jail” card.
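As a rough illustration of feeding QA from automated redaction, the Python sketch below decides which pages an operator must review before anything is committed. Everything in it is an assumption for the sake of the example: the per-redaction confidence score, the 0.98 threshold, the review_all switch for jurisdictions that require every page to be checked, and the conservative choice to send pages with no redactions to review as well.

```python
from dataclasses import dataclass, field


@dataclass
class Redaction:
    box: tuple          # (x, y, w, h) of the proposed redaction
    confidence: float   # 0.0 to 1.0, as reported by a hypothetical engine


@dataclass
class Page:
    page_id: str
    redactions: list = field(default_factory=list)


def needs_review(page, threshold=0.98, review_all=False):
    """Decide whether an operator must look at this page before commitment."""
    if review_all:
        return True  # policy or law requires every page to be reviewed
    if not page.redactions:
        return True  # assumption: "nothing found" may be a missed field
    # One doubtful redaction is enough to send the whole page to QA.
    return min(r.confidence for r in page.redactions) < threshold


if __name__ == "__main__":
    pages = [
        Page("p1", [Redaction((5, 0, 11, 2), 0.995)]),
        Page("p2", [Redaction((5, 0, 11, 2), 0.99), Redaction((9, 40, 11, 2), 0.71)]),
        Page("p3"),
    ]
    qa_queue = [p.page_id for p in pages if needs_review(p)]
    print(qa_queue)
```

The threshold is a dial between operator workload and risk; given the consequences of a missed redaction, it makes sense to start strict and loosen it only with evidence from QA results.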

Chris Riley is founder of Living@nalytics (www.livinganalytics.com), where he uses his in-depth knowledge of data capture technologies to advise clients and proselytize the value of these tools.

Chris was recently the featured speaker for our March 5 webinar, Tips and Tricks to Help You Automate your Office Documents (for Effective Data Capture). Listen at www.aiim.org/webinararchive.