Digging for Data

Deriving new, "disruptive" business intelligence from unstructured data. Leveraging existing reports can lead to a more intelligent and aware organization.

Organizations are long overdue for a fresh look at how they provide enterprise-wide business intelligence, unbiased by long-held misconceptions about what a business intelligence solution "must" entail. As business intelligence expert Colin White recently observed, disillusionment with existing business intelligence software is already well underway. Some companies are starting to rebel, demanding easier and cheaper solutions. These companies, White notes, typically have fewer of the IT resources and skills needed to implement business intelligence projects successfully and may be struggling to implement even basic business intelligence capabilities.

This demand for simpler solutions “disrupts” the traditional business intelligence model and challenges long-held misconceptions about what successful business intelligence should entail. Companies must now pay attention to overlooked, unstructured sources of actionable enterprise data, which can dramatically simplify the task of providing the right information to the right person at the right time.

The Unstructured Definitions of Unstructured Data
Traditional business intelligence systems typically rely on structured data, and there is clear consensus as to what structured data is. To repeat a bit of Data 101, structured data is data organized into fields, columns, tables, rows, and indexes; relational databases are the standard example. Because there is a very high degree of predictability and order, computers can reliably maintain and manipulate the information, and standardized Structured Query Language (SQL) can be used to access desired information.
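As a minimal illustration of that predictability, the sketch below uses Python's built-in sqlite3 module with a hypothetical orders table. The table, column names, and figures are invented for this example and are not taken from any system discussed here. It shows how a declared schema and an index let a single SQL statement retrieve a summary reliably.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer TEXT NOT NULL, "
    "amount REAL NOT NULL)"
)
# An index on a known column: possible only because the structure is fixed.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Acme", 120.0), ("Acme", 80.0), ("Globex", 45.5)],
)

# One standardized query returns a per-customer summary.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Nothing comparable is available for a contract or a medical note, which is the gap the rest of this article is concerned with.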

However, the vast majority of enterprise information is found in unstructured data sources; that is, information sources outside of databases. While there has been much hype around the concept of harvesting unstructured data, technology writers still disagree about what unstructured data is and whether it has any value at all.

Definitions of unstructured data, unlike those of structured data, are rather, well, unstructured. According to Bill Inmon, unstructured data is data with no particular order to it.(1) True enough. However, the degree to which unstructured data has "no particular order" varies widely. Putting aside alternative genres of information sources such as video, audio, and images, the most unstructured of unstructured data sources is free-form text, such as medical reports, warranties, and contracts. There are no rules governing the creation or usage of free-form text: no keys, no indexes, no columns, and no attributes. Free-form text is as disorderly as structured data is orderly, says Inmon. Transforming free-form text into some semblance of structured data, some useful source of business intelligence, is somewhat akin to harvesting useful gems from low-grade ore: possible, but frustratingly difficult to do with consistently useful results.

Thankfully, not all unstructured data is as profoundly unstructured, and as dubious a business intelligence source, as free-form text. Some unstructured data, as noted by business intelligence writer David Loshin, might have some implicit structure that is generally followed, but seemingly not enough of a regular structure to "qualify" for the kinds of management usually applied to structured data.(2) This definition is a very useful "sanity check" to assess whether an unstructured data source in question is worth the time and effort to try to transform it into viable, actionable business intelligence: some structure to the unstructured data sources is essential for this purpose. If no such structure exists, further efforts with that source of unstructured data are not warranted.

Identifying Unstructured Data Sources Worth Mining
XML definitely qualifies as a genre of unstructured data with enough structure to make it potentially useful as a business intelligence source. XML purports to be a simple, vendor-neutral textual external representation for hierarchically-structured data. That's a reasonably accurate definition, except for the simplicity bit, says Aaron Crane of the UK-based consulting firm GBDirect. Crane has provided very candid commentaries on the problems of XML. First, XML documents are frustratingly and unnecessarily verbose for human authors. Second, XML is complex. There is no clear view, notes Crane, of what XML is meant to accomplish: Is it for humans or machines?(3)

As Crane also notes, the numerous problems posed by XML are difficult, often maddening, but ultimately surmountable, and overcoming them can result in useful business intelligence. Unlike free-form text, XML is a higher-grade ore from which the gems of useful business intelligence can ultimately be drawn. But is there an even higher grade of unstructured data, one from which useful data for business intelligence can be mined substantially more easily than from XML?

A useful clue comes from a practical comment from Dan Linstedt: "The best use of Unstructured/Semi-Structured data," Linstedt wrote, "is one that has a predefined business question/business case to answer to."(4)

Yi Chen of Arizona State University recently provided an even more granular definition of unstructured data with some structure: data that does not conform to a fixed database schema, may have an "irregular" structure, and is typically published by relational databases.(5)

Putting these two comments together, what unstructured data source answers "a predefined business question," has some structure, and is "typically published by relational databases"? Well, that would most certainly be the existing reports and business documents already produced within every organization.

Putting Existing Reports to Work as a Business Intelligence Source
Organizations running enterprise resource planning (ERP) systems already own a library of existing "canned" reports. All of the work is already done for them: there is no coding to do, no security to work out, and all the information that they need is embedded within the delivered standard report. SAP, for example, offers several thousand canned reports.

Organizations maintaining a huge historical database for an ERP (or other core system) often face a frustrating reality: most of the effort and cost associated with their complex business intelligence tool is intended to allow the organization to create and work with customized views of the very same enterprise data that already appears within various existing ERP report outputs.

Another key attribute of existing reports is that they already contain business rules, which transform raw data collected by enterprise applications into actionable information. These business rules do not exist within the database. Instead, they are executed by the application at runtime, when an existing, or canned, report is created.

For example, a healthcare organization trying to enable more effective materials management found that reports produced by its ERP solution included critical and complex FIFO (First In, First Out) “inventory turns” calculations that had been performed at report runtime. As a result, it was virtually impossible to replicate the same calculations using a business intelligence tool or report writer. Imagine the frustration of seeing crucial “inventory turns” data on an existing healthcare report, with no apparent way to actually work with that data interactively.
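To make the point concrete, here is a rough sketch of the kind of business rule that might run only when a report is produced. The source does not describe the healthcare organization's actual calculation. The function names, the sample lots, and the use of the common definition of inventory turns (cost of goods issued divided by average inventory value) are all assumptions for illustration.

```python
from collections import deque

def fifo_cogs(receipts, units_issued):
    """Cost issued units against the oldest receipts first (FIFO).

    receipts: list of (quantity, unit_cost) tuples in the order received.
    Returns (cost_of_goods_issued, remaining_lots).
    """
    lots = deque(receipts)
    cost = 0.0
    remaining = units_issued
    while remaining > 0:
        if not lots:
            raise ValueError("issued more units than were received")
        qty, unit_cost = lots.popleft()
        used = min(qty, remaining)
        cost += used * unit_cost
        remaining -= used
        if qty > used:
            # Put the unused part of this lot back at the front of the queue.
            lots.appendleft((qty - used, unit_cost))
    return cost, list(lots)

def inventory_turns(cogs, opening_value, closing_value):
    """Assumed definition: cost of goods issued / average inventory value."""
    average = (opening_value + closing_value) / 2
    return cogs / average

# Hypothetical period: two receipt lots on hand, then 150 units issued.
receipts = [(100, 2.00), (100, 3.00)]
opening_value = sum(q * c for q, c in receipts)
cogs, remaining_lots = fifo_cogs(receipts, 150)
closing_value = sum(q * c for q, c in remaining_lots)
turns = inventory_turns(cogs, opening_value, closing_value)
print(cogs, remaining_lots, round(turns, 3))
```

In a setup like this, the database holds only the receipts and issues. The FIFO cost and the turns figure exist nowhere until code of this kind runs, which is why the finished report is often the only place they appear.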

Report mining enables the intelligent recognition and parsing of data within existing reports and business documents, typically in plain text or PDF format, into a valid data table, complete with optional new calculated fields and database lookups that bring in additional data located elsewhere. Report mining also facilitates sorting, filtering, combination with other data, and summarization with subtotals and grand totals, as well as easy export to Excel, PDF, data marts or warehouses, and other applications. Reports, particularly when intelligently indexed and archived within a report mining-enabled enterprise report management system (a "report warehouse"), can become a wellspring of easily accessed and manipulated data for programming-free operational business intelligence.
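The parsing step described above can be sketched in a few lines. This is only a toy illustration, not how any particular report mining product works: the report layout, the regular expressions, and the function names mine_report and subtotals are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical plain-text report; the layout is assumed for illustration.
REPORT = """\
INVENTORY VALUATION REPORT                 PAGE 1
WAREHOUSE: EAST
ITEM      DESCRIPTION          QTY    UNIT COST
A-100     Surgical gloves      120         4.50
A-220     Syringes 10ml         75         0.80
WAREHOUSE: WEST
ITEM      DESCRIPTION          QTY    UNIT COST
B-310     Gauze pads           200         1.25
"""

# A group header line carries a value that applies to the detail lines below it.
GROUP = re.compile(r"^WAREHOUSE:\s+(\S+)")
# A detail line: item code, free-text description, quantity, unit cost.
DETAIL = re.compile(
    r"^(?P<item>[A-Z]-\d+)\s+(?P<desc>.+?)\s+(?P<qty>\d+)\s+(?P<cost>\d+\.\d{2})\s*$"
)

def mine_report(text):
    """Turn recognizable report lines into a list of row dictionaries."""
    rows = []
    warehouse = None
    for line in text.splitlines():
        group = GROUP.match(line)
        if group:
            warehouse = group.group(1)
            continue
        detail = DETAIL.match(line)
        if detail:
            qty = int(detail["qty"])
            cost = float(detail["cost"])
            rows.append({
                "warehouse": warehouse,
                "item": detail["item"],
                "description": detail["desc"],
                "qty": qty,
                "unit_cost": cost,
                "value": round(qty * cost, 2),  # new calculated field
            })
    return rows  # titles, page numbers, and column headers are skipped

def subtotals(rows):
    """Summarize the calculated value by warehouse."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["warehouse"]] += row["value"]
    return dict(totals)

rows = mine_report(REPORT)
print(rows)
print(subtotals(rows))
```

The rows produced this way form an ordinary data table, so sorting, filtering, and export can be handled by whatever tools the organization already uses.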

As organizations seek to free themselves from complicated business intelligence tools, report mining warrants an in-depth review as a new and "disruptive" business intelligence solution that smartly leverages the most abundant source of viable unstructured enterprise data: existing reports and business documents.

Report mining is not applicable to all enterprise business intelligence needs. It is no substitute, for example, for business intelligence applications that require real-time data acquisition and analysis. A report warehouse will never replace the wide functionality of a data warehouse. However, report mining can play a valuable support role as a means for programming-free ETL sourced from reports. More importantly, organizations may find that report mining allows them to achieve pervasive business intelligence at dramatically lower cost and with less complexity than either XML conversion technologies or traditional business intelligence solutions.

Mike Urbonas is product marketing manager at Datawatch, a provider of business intelligence tools.

Notes:
1. Bill Inmon, "Structured and Unstructured Data: Bridging the Gap," B-Eye Network, June 21, 2007.
2. David Loshin, "Simple Semi-Structured Data," B-Eye Network, October 17, 2005.
3. Aaron Crane, "Does XML Suck? Or: Why XML is Technologically Terrible, but You Have to Use It Anyway," May 14, 2002.
4. Dan E. Linstedt, "Hidden in the un-structured information...," B-Eye Network, March 6, 2006.
5. Yi Chen, "Data on the Web," Arizona State University, January 17, 2006.