There’s hardly any middle ground between the two extremes in automated redaction. Regardless of where you are on the spectrum, automating redaction can increase the efficiency of your capture solution.
In the field of data capture and OCR (optical character recognition) one of
the coolest and most effective specific uses of the technology is automated
redaction. Redaction is the process of taking a paper document and removing
private information. It comes in several flavors: traditional redaction, where
you cover the private information with a black rectangle; whiteout, where you
cover the private information with a white rectangle; or sanitization, where you
replace real information with fake information.
You typically see redaction with court documents, bank statements, or tax
documents, and you typically see people removing social security numbers,
addresses, or account numbers. The point of automated redaction is to make this
a scanning process whereby the paper is scanned, automatically redacted, and an
image file is saved with the private information blacked out. While there are
many variations to this process, such as multi-layer PDFs with redacted and
un-redacted layers, these are the basics.
In the world of automated redaction, it’s either very easy, fast, and
accurate, or very difficult and sometimes overwhelming. So let’s take a look at
the differences between the two situations.
If you are dealing with typographic text, good image quality, and structured
field information to be redacted (such as a social security number or account
number), fully automated redaction can be very accurate, very fast, and one of
the easiest data capture solutions to implement. When I say structured field
information I am referring to the fact that the data being redacted comes in a
predictable format. I am not so much concerned about its location on the page.
For example, if you are redacting a social security number on a typographic
document it’s very easy to search the entire document for all instances of the
format “NNN-NN-NNNN” where N represents any digit 0-9. In an automated workflow,
the document is OCRed, every instance of the social security format is found,
and the coordinates of each match are reported back to the software, which then
draws a black rectangle over that location. Anyone with technical experience
with an OCR engine, plus a developer, can create such a solution in weeks; I
know this from experience. Now if you know beforehand the XYHW coordinates of
where the to-be-redacted field is located, you don’t even have to run OCR: just
slap a black or white rectangle on that spot. This is the easiest scenario. It
can get more complex with different field types, and if you have to look in
documents that are almost 100% text, like a letter, or where the to-be-redacted
field wraps lines, you will usually have to reference other words in the
document to find the data. But all of the above still fall into the category of
“easy.”
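The "easy" flow described above can be sketched roughly as follows. This is a minimal illustration, not a production implementation: the `Word` record stands in for whatever word-plus-bounding-box output your OCR engine returns, the page is modeled as plain rows of grayscale values, and the names `find_ssn_boxes` and `blackout` are hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for OCR output: one recognized word plus its
# bounding box (x, y = top-left corner; w, h = width and height in pixels).
@dataclass(frozen=True)
class Word:
    text: str
    x: int
    y: int
    w: int
    h: int

# NNN-NN-NNNN, not embedded in a longer run of digits.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")

def find_ssn_boxes(words):
    """Return the (x, y, w, h) box of every word containing an SSN-formatted string."""
    return [(wd.x, wd.y, wd.w, wd.h) for wd in words if SSN_PATTERN.search(wd.text)]

def blackout(page, box, fill=0):
    """Overwrite the pixels inside box in place; page is a list of pixel rows."""
    x, y, w, h = box
    for row in page[y:y + h]:
        # Slices clip at the page edge, so the row length never changes.
        row[x:x + w] = [fill] * len(row[x:x + w])

# Usage: a tiny all-white "page" and two OCR words, one of them an SSN.
page = [[255] * 200 for _ in range(40)]
ocr_words = [Word("SSN:", 10, 10, 40, 12), Word("123-45-6789", 55, 10, 90, 12)]
for box in find_ssn_boxes(ocr_words):
    blackout(page, box)
```

When the field coordinates are known beforehand, the OCR and search steps drop out and only the `blackout` call remains; passing `fill=255` gives the whiteout variant.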
For the most part organizations will choose to redact information rather than
replace it with fake information, but there are some very interesting use cases
for sanitization, which is replacing with fake information instead of masking.
One is re-purposing documents where private information cannot be shared. For
instance, if you are seeking a vendor proposal or evaluating OCR software, you
may have to show samples, but your policy is that you cannot give out private
information, meaning the samples you provide must both be representative of your
population of documents and clean of private data. In these scenarios
whiting out the private information and overwriting it with fake but typical
data is very useful, and just one additional step beyond the “easy” redaction
described above.
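As a rough sketch of that extra sanitization step, the snippet below swaps each social security number in OCRed text for a fake one of the same format. It works on text only; rendering the fake value back onto the whited-out image region is left out. The function names are hypothetical, and the choice of the 000 area prefix rests on my understanding that this prefix is never issued, which you should verify before relying on it.

```python
import random
import re

# NNN-NN-NNNN, not embedded in a longer run of digits.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")

def fake_ssn(rng):
    """Build a fake value in SSN format.

    Assumption: the 000 area prefix is never issued, so the result
    cannot collide with a real number.
    """
    return f"000-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"

def sanitize_text(text, seed=None):
    """Replace every SSN-formatted string with a fake one of the same length."""
    rng = random.Random(seed)  # a fixed seed makes the output repeatable
    return SSN_PATTERN.sub(lambda match: fake_ssn(rng), text)

sample = "Applicant SSN 123-45-6789, spouse 987-65-4321."
print(sanitize_text(sample, seed=1))
```

Because the replacement has the same length and format as the original, the sanitized sample should still exercise an OCR or capture product the way a real document would.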
I am not yet sure what share of documents makes up the “easy” category,
but from my experience I would guess that it’s the majority, likely above
60 percent. In most data capture technologies there is a nice gradient between
easy and difficult, but in the case of redaction that is not true. What I’ve
found is that redaction projects either fall into the above category or are
extremely difficult.
If you have hand-printed text in unpredictable format and location, if you
have degraded documents with deteriorated characters, and/or if you are
redacting fields of non-predictable formats then automated redaction can be one
of the most complex tasks you will undertake. This usually pops up in projects
where the documents are very old, of poor quality, or handwritten, and/or where
you are redacting things like an address or proper name. All these challenges
have unique solutions. For example, use the focus technique for poor quality
documents: first focus on the part of the document where the information likely
occurs and then do your search (the goal is to search the smallest possible
region of the document). With a handwritten document, focus on prefix and
postfix words that would signify a field, and use tools that distinguish
handprint from typographic text; the most common side effect here is OVER
redaction. The reason these become so extremely hard is that the requirement on
redaction is that you must be 100% accurate (OK, OK: 99.999999998% accurate).
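To make the focus technique and the prefix/postfix idea a little more concrete, here is a minimal sketch. It assumes the OCR engine returns words in reading order with bounding boxes; the `Word` record, the `boxes_between_anchors` name, and the sample anchor words are all hypothetical. It first narrows the search to a region of the page, then redacts whatever sits between a prefix word and a postfix word.

```python
from dataclasses import dataclass

# Hypothetical stand-in for OCR output, assumed to arrive in reading order.
@dataclass(frozen=True)
class Word:
    text: str
    x: int
    y: int
    w: int
    h: int

def in_region(word, region):
    """True if the word's box lies fully inside region (x, y, w, h)."""
    rx, ry, rw, rh = region
    return (rx <= word.x and ry <= word.y
            and word.x + word.w <= rx + rw
            and word.y + word.h <= ry + rh)

def normalize(text):
    """Lowercase and drop trailing label punctuation such as 'Name:'."""
    return text.lower().strip(":.,")

def boxes_between_anchors(words, region, prefixes, postfixes, max_span=4):
    """Return boxes of the words that follow a prefix word inside the region.

    Collection stops at a postfix word or after max_span words,
    whichever comes first.
    """
    focused = [w for w in words if in_region(w, region)]  # the focus step
    boxes = []
    for i, word in enumerate(focused):
        if normalize(word.text) not in prefixes:
            continue
        for target in focused[i + 1:i + 1 + max_span]:
            if normalize(target.text) in postfixes:
                break
            boxes.append((target.x, target.y, target.w, target.h))
    return boxes

words = [
    Word("Name:", 10, 10, 40, 12), Word("Jane", 55, 10, 30, 12),
    Word("Doe", 90, 10, 30, 12), Word("DOB:", 10, 30, 40, 12),
]
header = (0, 0, 200, 100)  # only search the top of the page
print(boxes_between_anchors(words, header, {"name"}, {"dob"}))
```

Note how `max_span` errs on the side of covering too much when no postfix word is found, which is exactly the over-redaction trade-off mentioned above.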
The consequences of bad redaction are huge, akin to the consequences of
misreading a prescription in healthcare. You need to get it right or not do it
at all, regardless of the document quality. Determining whether you got it right
is very difficult. That’s where the QA process comes in; QA is a key part of
the solution, yet it is often overlooked or underdeveloped.
It does not matter how easy or complex your redaction documents are; a proper
QA process fed by automated redaction is the surefire bet to maximize the
utilization of such technology. The QA process has to make it easy for an
operator to see what is redacted, or see what the software “thinks” should be
redacted, and to change a redaction quickly prior to any commitment. Certain
state laws will require that every single redacted page be reviewed through QA,
and others will only warrant review of some percentage of the confident output. In any case only the
“easy” documents should ever fully pass QA. Extra effort put into this QA
process should not be feared because it is the key to a quality solution that
will hit the optimum ROI and be your “get out of jail” card.
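One way to feed a QA queue from automated redaction is sketched below. It is only an illustration under stated assumptions: the `confidence` score, the 0.98 threshold, the 10 percent spot-check rate, and the `route_for_qa` name are all hypothetical, and the right values depend on your engine and on the rules that apply to you.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    page: int
    box: tuple          # (x, y, w, h) of the proposed redaction
    confidence: float   # hypothetical 0..1 score from the OCR/matching step

def route_for_qa(candidates, threshold=0.98, sample_rate=0.1,
                 review_all=False, seed=None):
    """Split proposed redactions into (auto_accepted, needs_review).

    review_all=True models a rule that every redacted page must be reviewed.
    Otherwise low-confidence candidates always go to an operator, and a
    random sample_rate share of the confident ones is pulled for a spot check.
    """
    rng = random.Random(seed)
    auto_accepted, needs_review = [], []
    for cand in candidates:
        confident = cand.confidence >= threshold
        spot_check = rng.random() < sample_rate
        if review_all or not confident or spot_check:
            needs_review.append(cand)
        else:
            auto_accepted.append(cand)
    return auto_accepted, needs_review

proposed = [
    Candidate(1, (55, 10, 90, 12), 0.995),
    Candidate(2, (40, 300, 120, 14), 0.71),
]
auto, review = route_for_qa(proposed, sample_rate=0.0)
print(len(auto), "auto-accepted;", len(review), "sent to an operator")
```

Nothing in either list has been committed to the image yet; the operator can still adjust or reject a redaction before the final file is written.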
Chris Riley is founder of
Living@nalyitcs (www.livinganalytics.com) where he uses
his in-depth knowledge of data capture technologies to advise clients and
proselytize the value of these tools.
Chris recently was the featured speaker for our webinar on March 5:
Tips and Tricks to Help You Automate your Office Documents (for Effective Data
Capture). Listen at www.aiim.org/webinararchive.