Automating Records Management with Artificial Intelligence

 

How It Works – in Real Life – Today
Just as there are many definitions of human intelligence, there are many definitions of artificial intelligence (AI) for computers. The Turing Test perhaps best defines the ultimate goal for artificial intelligence: a machine whose responses are indistinguishable from those of a human. But in the more than half a century since the test was conceived by Alan Turing, widely regarded as a founder of computer science, no machine has passed it. (Yes, this is the same Alan Turing – an eccentric British genius known for riding his bicycle with a gas mask to combat hay fever – who played a pivotal role in breaking the Nazi code during World War II.)

Still, some contemporary scientists believe that such machines are on the horizon. In The Singularity Is Near, inventor and futurist Ray Kurzweil suggests that machines will pass the Turing Test within the next quarter-century.

Regardless of when (if ever) machines that pass the Turing Test become available, there is expert consensus, backed by empirical data, that some machines already do certain types of work as well as, or better than, humans. More than a decade ago, rather than debate the questions “What is AI?” or “Has AI yet been achieved?”, computer scientist and science-fiction writer Vernor Vinge proposed a more pragmatic notion of how machines amplify human intelligence, which he calls intelligence amplification, or IA.

Vinge stated: "IA is proceeding very naturally, in most cases not even recognized for what it is by its developers. But every time our ability to access information and to communicate it to others is improved, in some sense we have achieved an increase over natural intelligence."

From Fighting Wars . . . to Video Games?
There are many different branches of AI, each with many different applications. Dr. John McCarthy, an AI pioneer at Stanford, identifies the following branches: logical AI, search, pattern recognition, representation, inference, common-sense knowledge and reasoning, learning from experience, planning, epistemology, ontology, heuristics, and genetic programming. He also notes that AI applications include game playing, speech recognition, understanding natural language, computer vision, expert systems, and heuristic classification.

Both the branches of AI and its applications are interdisciplinary. That is, one branch can draw on techniques from other branches – e.g., search might engage pattern recognition, while expert systems might use heuristics, search, and pattern recognition – all of which are integral to game-playing applications, even the video games your children are playing today. This, by the way, is analogous to interdisciplinary fields such as policy science, economics, statistics, and a host of others.

Before a machine passes the Turing Test, however, it will likely require mastery of techniques from all of the AI branches and integration of applications. One major area of work which is successfully advancing various AI branches and application techniques today is enterprise records information management.

Currently, interdisciplinary AI solutions are, with a high degree of accuracy, performing three foundational functions of records management: classification, extraction of structured data, and redaction of data. The following scenario illustrates how innovative records managers armed with the right tools are already integrating the various branches and applications of AI to achieve dramatic results.

How AI Works In Records Management
Assume you are a records manager. On your computer is a collection of millions of enterprise records of various vintages. You have no idea what all the records are. They could be medical records, legal documents, administrative documents, finance records, educational documents, or a wide variety of other documents from throughout your enterprise.

They could have fixed formats, such as forms labeled with organizational codes, e.g., IRS 1040. Or, they might have a partial format, such as a letter or an email. Or, they could have no format at all – just some sketchy information in a Word document, such as a task list. They could be single page or multi-page. They might or might not have page numbers. They could be electronically generated documents, or poor-quality scans of hard-copy documents.

What is your task? It is to:

  1. Classify these documents into a taxonomy which contains 1,000 different classification codes based on both document structure and content. The taxonomy requires content differentiation by both type and sub-types of content or various versions of the same form.
  2. Extract structured data from the documents based on the document classification. From document Type 1, you are to extract the date the document was created. From document Type 2, you are to extract the name of the organization which created the document. From document Type 3, you are to extract the diagnostic phrases for a mammography. Note that there are typically multiple data extractions from each document.
  3. Redact data from the documents based on the document classification. From document Type 999, you are to redact the name, social security number, and credit-card account numbers.

Training the Computer
Now, let’s break it down into something the computer understands. How do you approach these tasks? How long will it take you to classify 10,000 of these documents and extract/redact the correct data? How accurately can you perform these tasks?

First, the records manager creates an AI expert-system knowledge base. This knowledge base contains important and unique facts for each classification code in the taxonomy to complete Task 1: classification. In preparation for Tasks 2 and 3, the knowledge base also identifies the elements which are to be extracted or redacted for each classification code. The knowledge base includes customized lexicons for each element to be extracted/redacted as well as a set of logical rules. To create the knowledge base, lexicons, and logical rules, the records manager uses various AI and classical statistical techniques including searching, pattern matching, heuristics, and probabilities.
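One way to picture such a knowledge base is as a mapping from classification codes to distinguishing facts and to the elements to be extracted or redacted. The following is a minimal sketch only: the class names, field names, codes such as TYPE-1, and the regular expressions standing in for lexicons are all hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class ElementRule:
    """One data element to extract or redact (hypothetical structure)."""
    name: str
    action: str    # "extract" or "redact"
    pattern: str   # a regular expression standing in for a customized lexicon


@dataclass
class ClassEntry:
    """Knowledge-base entry for one classification code."""
    code: str
    key_phrases: list[str]                      # distinguishing facts for Task 1
    elements: list[ElementRule] = field(default_factory=list)


# Illustrative entries only; a real taxonomy would hold about 1,000 codes.
knowledge_base = {
    "TYPE-1": ClassEntry(
        code="TYPE-1",
        key_phrases=["memorandum", "subject:"],
        elements=[ElementRule("created_date", "extract",
                              r"\b\d{1,2}/\d{1,2}/\d{4}\b")],
    ),
    "TYPE-999": ClassEntry(
        code="TYPE-999",
        key_phrases=["account statement", "social security number"],
        elements=[ElementRule("ssn", "redact", r"\b\d{3}-\d{2}-\d{4}\b")],
    ),
}


def elements_for(code: str, action: str) -> list[str]:
    """List the element names to extract or redact for a classification code."""
    return [e.name for e in knowledge_base[code].elements if e.action == action]
```

Keeping the classification facts and the extraction/redaction rules in the same entry means that once a code is chosen, the follow-on tasks can be looked up directly.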

Second, the records manager uses sophisticated OCR technology – which itself employs AI pattern recognition techniques – to convert every document which was not already electronically searchable into electronically searchable text. The OCR technology provides critical metadata regarding each text element, including font type and size, location on the page, case, and context.
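The per-element metadata described here could be represented roughly as follows. This is a sketch under stated assumptions: the field names, the fractional page coordinates, and the "large type near the top of the page" heuristic are illustrative choices, not the output format of any specific OCR engine.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    """One OCR text element with its metadata (hypothetical field names)."""
    text: str
    font: str
    size_pt: float
    x: float   # horizontal position as a fraction of page width (0.0 = left)
    y: float   # vertical position as a fraction of page height (0.0 = top)

    @property
    def is_upper(self) -> bool:
        # Case is derived from the text rather than stored separately.
        return self.text.isupper()


def heading_candidates(elements, min_size_pt=14.0, top_fraction=0.25):
    """Return elements that look like headings: large type near the top."""
    return [e for e in elements
            if e.size_pt >= min_size_pt and e.y <= top_fraction]


# Made-up sample page for illustration.
page = [
    TextElement("FORM 1040", "Helvetica", 18.0, 0.10, 0.05),
    TextElement("Your first name and initial", "Helvetica", 8.0, 0.10, 0.20),
    TextElement("Page 1", "Helvetica", 8.0, 0.90, 0.97),
]
```

Metadata of this kind lets later steps weigh where and how a phrase appears, not just whether it appears.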

Third, the records manager engages AI search and pattern-recognition techniques to match the text of each document to the facts contained in the taxonomy knowledge base. This process provides the best classification code for each document.
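In its simplest form, matching a document's text against the knowledge base can be sketched as counting how many distinguishing phrases for each code appear in the text and choosing the code with the most hits. The phrase lists and codes below are hypothetical, and a production system would presumably combine this with the heuristics and probabilities mentioned earlier rather than rely on raw counts.

```python
# Hypothetical, tiny knowledge base: classification code -> distinguishing phrases.
KEY_PHRASES = {
    "TYPE-1": ["memorandum", "subject:"],
    "TYPE-3": ["mammography", "radiology report"],
    "TYPE-999": ["account statement", "social security number"],
}


def classify(text, key_phrases=KEY_PHRASES):
    """Return (best_code, score), or (None, 0) when no phrase matches."""
    lowered = text.lower()
    # Score each code by the number of its phrases found in the text.
    scores = {
        code: sum(phrase in lowered for phrase in phrases)
        for code, phrases in key_phrases.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, 0
    return best, scores[best]
```

Returning no code when nothing matches, rather than forcing a guess, leaves room to route unrecognized documents to a human reviewer.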

Fourth, once a classification code is determined, the records manager queries the knowledge base to confirm the specific data elements to be extracted or redacted. The records manager acquires the appropriate data element lexicons for this task, ensuring, for example, that the right date among the many on the page is extracted.
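A minimal sketch of this step, using regular expressions in place of full lexicons: the creation date is picked out by the label that precedes it, so that other dates on the page are ignored, and social security numbers are masked. The label wording, date format, and replacement token are assumptions made for illustration.

```python
import re

# Assumed date format (MM/DD/YYYY) and assumed labels for the creation date.
DATE = r"\d{1,2}/\d{1,2}/\d{4}"
CREATED_DATE = re.compile(
    r"(?:date created|created on)\s*:?\s*(" + DATE + r")", re.IGNORECASE
)
# Social security numbers in the common NNN-NN-NNNN layout.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def extract_created_date(text):
    """Return the date that follows a creation label, or None if absent."""
    match = CREATED_DATE.search(text)
    return match.group(1) if match else None


def redact_ssn(text):
    """Replace every social security number with a fixed token."""
    return SSN.sub("[REDACTED]", text)


# Made-up sample text with two dates; only the labeled one is wanted.
sample = "Received: 03/02/2009\nDate created: 01/15/2009\nSSN 123-45-6789"
created = extract_created_date(sample)
cleaned = redact_ssn(sample)
```

Here `created` holds the labeled date rather than the earlier "Received" date, and `cleaned` no longer contains the original number.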

Fifth, once all the tasks are completed, the records manager provides the classification code and extraction/redaction information for each document in an appropriate format, such as XML, using classic data conversion techniques, if necessary. These results can then be transmitted to the appropriate points in an enterprise to facilitate a wide variety of enterprise content management (ECM) requirements.
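The per-document result could be serialized to XML along these lines. The element and attribute names are invented for this sketch; an actual deployment would follow whatever schema the receiving ECM system expects.

```python
import xml.etree.ElementTree as ET


def result_to_xml(doc_id, code, extracted, redacted):
    """Build an XML string for one document's results (hypothetical schema)."""
    root = ET.Element("document", id=doc_id)
    ET.SubElement(root, "classification").text = code
    # Extracted elements carry their values.
    ext = ET.SubElement(root, "extracted")
    for name, value in extracted.items():
        ET.SubElement(ext, "element", name=name).text = value
    # Redacted elements are listed by name only, never by value.
    red = ET.SubElement(root, "redacted")
    for name in redacted:
        ET.SubElement(red, "element", name=name)
    return ET.tostring(root, encoding="unicode")


xml_text = result_to_xml(
    "doc-0001", "TYPE-1", {"created_date": "01/15/2009"}, ["ssn"]
)
```

Recording only the names of redacted elements keeps the sensitive values out of the output file.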

Wish You Were This Good?
Using today’s average processors, once the knowledge base is established, steps 2-5 complete in 2-3 seconds per document – about the time it takes a human being to click on a page and begin to read it!
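As a rough check on what these figures mean for the 10,000-document batch posed earlier, the arithmetic can be worked through directly. The sketch assumes documents are processed one after another on a single processor, which the article does not state.

```python
def batch_hours(n_docs, seconds_per_doc):
    """Total processing time in hours for a sequential batch."""
    return n_docs * seconds_per_doc / 3600


low = batch_hours(10_000, 2)    # 20,000 seconds
high = batch_hours(10_000, 3)   # 30,000 seconds
print(f"{low:.2f} to {high:.2f} hours")   # 5.56 to 8.33 hours
```

Under that assumption, the batch finishes within roughly one working day.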

As processor speeds increase, this time will shrink to fractions of a second. Classification will be consistent because the same knowledge base is used for every classification decision, whereas consistency is difficult to achieve when multiple individuals perform classification tasks. Extraction and redaction will likewise be comprehensive and accurate – no tired eyes to miss an extraction or redaction element and no keying errors.

This is just one simple scenario of how AI works today. Coming uses will transform ECM and society at large in ever more dramatic ways. Even in its nascent stages, however, the benefits of AI technologies are already increasing human efficiencies and enabling mankind to address increasingly complex challenges.

Dr. Kim Mitchel is Chief Science Officer at ECOMPEX, Inc., where she is responsible for strategic planning and technology development with emphasis on knowledge-based imaging systems and other technology innovations that dramatically improve the integrity, effectiveness, and economics of large-scale governmental and corporate administrative processes. Visit www.ecompex.com for more information or contact Jesse Lake, Director of Marketing, at (703) 288-3382, x1214.