Deriving new, "disruptive" business intelligence from unstructured data. Leveraging existing reports can lead to a more intelligent and aware organization.
Organizations are long overdue for a fresh look at how
they can provide enterprise-wide business intelligence, unbiased by long-held
misconceptions about what a business intelligence solution "must" entail. As
business intelligence expert Colin White recently observed, disillusionment with
existing business intelligence software is already well underway. Some companies
are starting to rebel, demanding easier and cheaper solutions. These companies,
White notes, typically have fewer IT resources and skills necessary to
successfully implement business intelligence projects and may be struggling to
implement even basic business intelligence capabilities.
This demand for simpler solutions “disrupts” the traditional business
intelligence model and challenges the long-held misconceptions about what
successful business intelligence should entail. Companies must now pay
attention to the overlooked unstructured sources of actionable enterprise data,
which can dramatically simplify the demands of providing the right information
to the right person at the right time.
The Unstructured Definitions of Unstructured Data
Traditional business intelligence systems typically rely on
structured data, and there is clear consensus as to what structured data is. To
repeat a bit of Data 101, structured data is data that has fields, columns,
tables, rows, and indexes; for example, relational databases are structured
data. There is a very high degree of predictability and order: computers can
reliably maintain and manipulate the information, and standardized Structured
Query Language (SQL) can be used to access desired information.
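As a minimal illustration of that predictability, the sketch below loads a few rows into an in-memory relational table and retrieves a summary with SQL. The table name `orders`, its columns, and the sample rows are invented for this example; they are not drawn from any system discussed in this article.

```python
import sqlite3


def total_by_region(rows):
    """Load (region, amount) rows into an in-memory table and total them per region."""
    conn = sqlite3.connect(":memory:")
    try:
        # Fields, columns, and an index: the hallmarks of structured data.
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        conn.execute("CREATE INDEX idx_orders_region ON orders (region)")
        conn.executemany(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", rows
        )
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM orders "
            "GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical sample data.
sample_rows = [("East", 120.0), ("West", 75.5), ("East", 30.0)]
print(total_by_region(sample_rows))
```

Because every row follows the same schema, a single declarative query is enough to answer the question; no parsing or guesswork about layout is needed.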
However, the vast majority of enterprise information is found in unstructured
data sources; that is, information sources outside of databases. While there has
been much hype around the concept of harvesting unstructured data, technology
writers still disagree about what unstructured data is and whether it has any
value at all.
Definitions of unstructured data, unlike those of structured data, are
rather, well, unstructured. According to Bill Inmon, unstructured data is data
with no particular order to it.(1) True enough. Unlike structured data, however,
the degree of disorder in unstructured data varies widely.
Putting aside alternative genres of information sources such as video, audio,
images, and the like, the most unstructured of unstructured data sources is
free-form text, such as medical reports, warranties, contracts, etc. There are
no rules governing the creation or usage of free-form text, no keys, no indexes,
no columns, or attributes. Free-form text is as disorderly as structured data is
orderly, says Inmon. Efforts to transform free-form text into some semblance of
structured data, some useful source of business intelligence, are somewhat akin
to harvesting useful gems from low-grade ore: possible, but it is frustratingly
difficult to obtain consistently useful results.
Thankfully, not all unstructured data is as profoundly unstructured, and as
dubious a business intelligence source, as free-form text. Some unstructured
data, as noted by business intelligence writer David Loshin, might have some
implicit structure that is generally followed, but seemingly not enough of a
regular structure to "qualify" for the kinds of management usually applied to
structured data.(2) This definition is a very useful "sanity check" to assess
whether an unstructured data source in question is worth the time and effort to
try to transform it into viable, actionable business intelligence: some
structure to the unstructured data sources is essential for this purpose. If no
such structure exists, further efforts with that source of unstructured data are
not warranted.
Identifying Unstructured Data Sources Worth Mining
XML definitely qualifies as a genre of unstructured data with sufficient structure
that might make it useful as a business intelligence source. XML purports to be
a simple, vendor-neutral textual external representation for
hierarchically-structured data. That's a reasonably accurate definition, except
for the simplicity bit, says Aaron Crane of the UK-based consulting firm
GBDirect. Crane has provided very candid commentaries on the problems of XML:
First, XML documents are frustratingly and unnecessarily verbose for human
authors. Second, XML is complex. There is no clear view, notes Crane, of what
XML is meant to accomplish: Is it for humans or machines?(3)
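To make the trade-off concrete, here is a small sketch, using Python's standard `xml.etree.ElementTree` module, that flattens a hierarchical XML document into table-like rows. The `<invoices>` document and its element names are hypothetical, made up purely for illustration. Note how much markup surrounds only three line items.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are assumptions.
SAMPLE_XML = """
<invoices>
  <invoice id="1001">
    <customer>Acme Corp</customer>
    <line><item>Widget</item><qty>4</qty></line>
    <line><item>Gasket</item><qty>10</qty></line>
  </invoice>
  <invoice id="1002">
    <customer>Globex</customer>
    <line><item>Widget</item><qty>1</qty></line>
  </invoice>
</invoices>
"""


def flatten_invoices(xml_text):
    """Turn nested invoice/line elements into flat (id, customer, item, qty) rows."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for invoice in root.findall("invoice"):
        invoice_id = invoice.get("id")
        customer = invoice.findtext("customer")
        # Each <line> becomes one row, repeating the parent invoice's fields.
        for line in invoice.findall("line"):
            rows.append(
                (
                    invoice_id,
                    customer,
                    line.findtext("item"),
                    int(line.findtext("qty")),
                )
            )
    return rows


for row in flatten_invoices(SAMPLE_XML):
    print(row)
```

The hierarchy is regular enough to walk programmatically, but the code must know the document's element names in advance, which is part of the complexity Crane describes.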
As Crane also notes, the numerous problems posed by XML present difficult,
often maddening, but ultimately surmountable challenges on the way to useful
business intelligence. Unlike with free-form text, the gems of useful business
intelligence can ultimately be drawn from the higher-grade ore of XML. But is
there an even higher grade of unstructured data from which mining of useful data
for business intelligence is substantially easier than working with XML?
A useful clue comes from a practical comment from Dan Linstedt: "The best use
of Unstructured/Semi-Structured data," Linstedt wrote, "is one that has a
predefined business question/business case to answer to."(4)
Yi Chen of Arizona State University recently provided an even more granular
definition of unstructured data with some structure: data that does not conform
to a fixed database schema, may have an "irregular" structure, and is typically
published by relational databases.(5)
Putting these two comments together, what unstructured data source "answers a
predefined business question" with some structure and is "typically published by
relational databases"? Well, that would most certainly be the existing reports
and business documents already produced within every organization.
Putting Existing Reports to Work as a Business Intelligence Source
Organizations running enterprise resource planning (ERP)
systems, for example, already own a library of existing "canned" reports. All of
the work is already done for them – there is no coding to do, no security to
work out, and all the information that they need is embedded within the
delivered standard report. For example, SAP offers several thousand canned
reports.
Organizations maintaining a huge historical database for an ERP (or other core
system) often face a frustrating reality: most of the effort and cost
associated with their complex business intelligence tool is intended to allow
the organization to create and work with customized views of the very same
enterprise data that already appears within various existing ERP report
outputs.
Another key attribute of existing reports is that they already contain
business rules, which transform raw data collected by enterprise applications
into actionable information. These business rules do not exist within the
database. Instead, they are executed by the application at runtime, when an
existing, or canned, report is created.
For example, a healthcare organization trying to enable more effective
materials management found that reports produced by its ERP solution included
critical and complex FIFO (First In, First Out) “inventory turns” calculations
that had been performed at report runtime. As a result, it was virtually
impossible to replicate the same calculations using a business intelligence tool
or report writer. Imagine the frustration of seeing crucial “inventory turns”
data on an existing healthcare report, with no apparent way to actually work
with that data interactively.
Report mining enables the intelligent recognition and parsing of data within
existing reports and business documents -- typically in plain text or PDF format
-- into a valid data table, complete with optional new calculated fields of data
and database lookups that allow the inclusion of additional data located
elsewhere. Report mining also facilitates sorting, filtering, combination with
other data, and summarization with subtotals and grand totals, as well as easy
export to Excel, PDF, data marts or warehouses, and other applications. Reports,
particularly when intelligently indexed and archived within a report
mining-enabled enterprise report management system (a "report warehouse"), can
become a wellspring of easily accessed and manipulated data for programming-free
operational business intelligence.
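The general idea can be sketched in a few lines of Python. This is only an illustrative approximation under stated assumptions, not a description of how any commercial report mining product works: the report layout, the `WAREHOUSE:` group line, the item-code pattern, and the function names `mine_report` and `subtotal_by_warehouse` are all hypothetical. The sketch recognizes detail lines in a plain-text report, carries the group value down to each row, adds a calculated field, and produces subtotals.

```python
import re
from collections import defaultdict
from decimal import Decimal

# Hypothetical plain-text report, invented for this example.
SAMPLE_REPORT = """\
INVENTORY STATUS REPORT                      PAGE 1
WAREHOUSE: BOSTON
ITEM      DESCRIPTION        QTY   UNIT COST
A100      Surgical gloves     40        2.50
A200      Syringes           100        0.30
WAREHOUSE: DENVER
ITEM      DESCRIPTION        QTY   UNIT COST
B300      Gauze pads          25        1.20
"""

# Group lines announce the warehouse for the detail lines that follow.
GROUP_LINE = re.compile(r"^WAREHOUSE:\s*(?P<warehouse>\S.*?)\s*$")
# Detail lines: item code, description, quantity, unit cost.
DETAIL_LINE = re.compile(
    r"^(?P<item>[A-Z]\d{3})\s+(?P<description>.+?)\s+"
    r"(?P<qty>\d+)\s+(?P<unit_cost>\d+\.\d{2})\s*$"
)


def mine_report(report_text):
    """Parse detail lines into rows, tagging each with its warehouse."""
    rows = []
    warehouse = None
    for line in report_text.splitlines():
        group = GROUP_LINE.match(line)
        if group:
            warehouse = group.group("warehouse")
            continue
        detail = DETAIL_LINE.match(line)
        if detail is None:
            continue  # skip page headers, column headings, blank lines
        qty = int(detail.group("qty"))
        unit_cost = Decimal(detail.group("unit_cost"))
        rows.append(
            {
                "warehouse": warehouse,
                "item": detail.group("item"),
                "description": detail.group("description"),
                "qty": qty,
                "unit_cost": unit_cost,
                "extended_cost": qty * unit_cost,  # new calculated field
            }
        )
    return rows


def subtotal_by_warehouse(rows):
    """Summarize extended cost per warehouse."""
    totals = defaultdict(Decimal)
    for row in rows:
        totals[row["warehouse"]] += row["extended_cost"]
    return dict(totals)


rows = mine_report(SAMPLE_REPORT)
print(subtotal_by_warehouse(rows))
```

Once the rows exist as a table, sorting, filtering, and export become ordinary data operations. The values on the report, including any figures the application calculated at runtime, are captured as printed rather than recomputed.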
As organizations seek to free themselves from complicated business
intelligence tools, it’s important to note that report mining warrants an
in-depth review as one new and "disruptive" business intelligence solution that
smartly leverages the most abundant source of viable unstructured enterprise
data: existing reports and business documents.
Report mining is not applicable to all enterprise business intelligence
needs. It is no substitute, for example, for business intelligence applications
that require real-time data acquisition and analysis. A report warehouse will
never replace the wide functionality of a data warehouse. However, report mining
can play a valuable support role as a means for programming-free ETL sourced
from reports. More importantly, organizations may find that report mining allows
them to achieve pervasive business intelligence at a dramatically lower cost and
complexity than either XML conversion technologies or traditional business
intelligence solutions.
Mike Urbonas is product marketing manager at Datawatch, a provider of business
intelligence tools.
Notes:
1. Bill Inmon, "Structured and Unstructured Data: Bridging the Gap," B-Eye Network, June 21, 2007.
2. David Loshin, "Simple Semi-Structured Data," B-Eye Network, October 17, 2005.
3. Aaron Crane, "Does XML Suck? Or: Why XML is Technologically Terrible, but You Have to Use It Anyway," May 14, 2002.
4. Dan E. Linstedt, "Hidden in the un-structured information...," B-Eye Network, March 6, 2006.
5. Yi Chen, "Data on the Web," Arizona State University, January 17, 2006.