TABLE OF CONTENTS
GOAL OF GLOBAL DATA MANAGEMENT
A SHORT HISTORY OF SECURITY
THE SITUATION TODAY
“ALIEN” DISTANT DATA
MULTI-JURISDICTIONAL ISSUES
RISKS AND REWARDS OF A TRADE-OFF GOVERNANCE POLICY
THE FABRIC
THE GLOBAL DATA MANAGEMENT (GDM) PROGRAM
METADATA
LINEAGE
GOVERNANCE
SECURITY
LIFECYCLE
WHAT A GDM PERSON DOES (PERSONALLY AND THROUGH THE TEAM)
OTHER KEY ROLES
CONCLUSION
ABOUT THE AUTHOR
GOAL OF GLOBAL DATA MANAGEMENT
Organizations have a growing, aching desire to capture data and draw insight from it
for a multitude of improvements and innovations in operations, customer service, and
even in completely new businesses¹. That effort has become more complicated with the
emergence of hybrid, distributed computing and data architectures (big data, cloud
variants, multi-clouds and IoT). To succeed, organizations need a broader data
management philosophy that incorporates collaboration, standardization, reuse,
retention (of data and models) and, especially, security and governance. A short
history of enterprise security and governance illustrates this need.
A SHORT HISTORY OF SECURITY
Before the cloud, before big data, and even into the present, security was
implemented one application system at a time. If you were in the finance
department, you might be granted access to post manual ledger entries through the
accounting system. If you were in Human Resources, you might be granted access to
view and/or modify an employee’s records through an HR system. These grants were
either embedded in the application logic based on your role, or applied externally.
But the grants and restrictions were all administered through separate application
systems, and their security schemes were not transferable from one application to
another. As a result, the overall picture was fractured, inconsistent and difficult to
administer. It was developed at a time when people in organizations had tightly
constrained roles. Today, employees are expected to be agile, adaptable and able to
handle multiple roles in the organization simultaneously.
Again, before the cloud, before big data, before data science, analysts did devise
quantitative methods. In the early days of e-commerce for example, websites already
employed recommendation engines, dynamic decision making based on scoring and
decision trees for next-best-offer or propensity models. They did this by getting
access, usually one data source at a time, from IT. Data warehouses both aided and
hindered their work: aided by integrating data from multiple sources and collapsing
the security model to just one source, hindered by only providing aggregated data and
a rigid design that couldn’t adapt quickly (in fairness, any good data warehouse
designer could enhance a schema, but provisioning new data was a slow process). The
only thing that prevented the data warehouse from ingesting all of the data, internal
and external, that analysts craved was scarcity. The data warehouse could only scale
in terms of volume, throughput and demanding use at extreme cost.
1 We use the term “businesses” loosely, as these innovations also apply to government, non-profits, charities, NGOs and any
other type of organization.
What organizations crave seems to shift over decades. Fifty years ago, computers
were employed for record-keeping. Reporting from these systems was limited to
copious printing of records. The demand for actual reporting generated long backlogs
for systems analysts and programmers, who created a massive, unmanaged hairball of
“interfaces.” Early Business Intelligence (BI) emerged and shifted the burden to
analysts, freeing IT to focus on new generations of application systems. Data access
and security shifted to the data warehouse.
About ten years ago, Tom Davenport published his landmark book, “Competing on
Analytics,”² which put the term “analytics” in play. Suddenly, analytics rose to the top
of enterprise computing. Predictive analytics, data science, machine learning and
Artificial Intelligence became top of mind, but they needed a place to live.
The process of analyzing data in organizations has for decades applied tools designed
for the individual. Spreadsheets, for example, proved to be the de-facto modeling and
reporting tool for thirty years or more, but they never adequately provided security,
governance, or efficient creation and maintenance of metadata. Other tools for
analysis and reporting, such as BI, provided their own solutions for metadata,
collaboration, version control, etc., but they were point solutions, only useful for the
product itself. (Unfortunately, the same can be said for some of the newer data
science workbench products.)
When Hadoop burst on the scene ten years ago, it too shared many of these gaps.
That’s not an indictment of DIY (do-it-yourself) analytics or wider analytic practices
based on self-service. Rather, it’s a cautionary tale that in an enterprise, the most
well-meaning and well-crafted analysis by individual contributors will always bog
down in redundancy without adequate data management.
THE SITUATION TODAY
With Global Data Management methodology and tools, all of your data can be
accessed and used no matter where it is or where it is from: on-premises, private
cloud, public cloud(s), hybrid cloud, open source, third-party data and any
combination of these, with security, privacy and governance applied as if they
were a single entity. Ingenious software products and the economics of computing
make it economical to do this. Not free, but feasible.
Large data platforms, such as Hadoop, by their nature contain many different types of
data from many different sources. In past decades, IT organizations built business-
oriented data models and massaged an often unruly collection of data in data
warehouses (frankly, an approach that still has merit), but for today’s technology,
that approach is too slow and too limiting for the accelerating digital transformation
facing every industry.
2 Davenport, T. H., & Harris, J. G. (2007). Competing on analytics: The new science of winning. Boston, Mass: Harvard Business
School Press.
While corporate IT designs for Security and Governance were conceived in an
environment of highly controlled data management and computing, for both
operational and analytical processes, those designs are counterproductive in a hybrid,
distributed, complex and increasingly streaming near-real-time world. Definitions of
security and governance in this environment are quite different. For example:
- Old (and still prevalent) meaning of security: To protect against loss of data and
against malicious, innocent or inadvertent access to, or distribution of, data that
can cause damage. To isolate various organizational entities from each other.
To throttle activity by managing from scarcity.
- New meaning of security: Ensuring that useful and important analysis is not
missed because of overly restrictive or misapplied restrictions, usually the
result of a lack of shared understanding between data stewards and, for example,
data scientists.
- Old meaning of governance: A framework that provides a formal structure
for organizations to produce measurable results toward achieving their
strategies and ensures that IT investments support business objectives. The
most commonly used frameworks are COBIT, ITIL, COSO, CMMI and FAIR.
- New meaning of governance: Governance should be driven by a simple
concept (though hard to practice): trade-offs. Given the complexity of the
computing/data environment today, governance should aim toward a shared
understanding of risk and reward, evaluated and managed across the enterprise
by intelligent agents that augment the work of data professionals and analytics
practitioners. For example, it may be in the organization’s interest to relax
some access and use rules derived from simple assumptions to achieve more
productive analytics from data scientists. Trade-offs are the opposite of rigidity.
“ALIEN” DISTANT DATA
The major issue is that enterprise data no longer exists solely in a data center or even
in a single cloud; it may span several clouds, or a combination of clouds and data
centers. Edge analytics for IoT, for example, captures, digests, curates and even pulls
data from other, different application platforms and live connections to partners,
previously a snail-like process using obsolete methods like EDI or even batch ETL.
Edge computing can be thought of as decentralized from on-premises networks,
cellular networks, data center networks, or the cloud. All of these factors pose a risk
of data originating in far-flung environments, where the data structures and
semantics are not well understood or documented³. The risk of easily moving data
from place to place, and the complexity of moving the logic to the data while
everything is in motion, are too great for manual methods.
MULTI-JURISDICTIONAL ISSUES
Currently, organizations, at best, have governance programs for data and use in their
own jurisdictions. But even those organizations that primarily operate in a single
jurisdiction may have exposure to regulatory requirements in many others. The 2018
phase-in of the European Union GDPR (General Data Protection Regulation) is one
such instance. The solution is a Global Data Management scheme that operates as a
single program in all jurisdictions.
RISKS AND REWARDS OF A TRADE-OFF GOVERNANCE POLICY
The cadence of technology innovation clearly surpasses most organizations’ ability to
implement each new or improved technique before the next one arrives. Governance
and data management can never be a pure, complete process. They require trade-offs:
picking the issues that make the most sense, have the greatest centrality to the
organization’s strategy (or strategies) and provide the most protection against danger,
while ensuring the organization can be as effective as possible.
Governance and data management tools today are not designed for a trade-off
approach. They are layered with rules and restrictions with a “better safe than sorry”
mentality. Governance has to be a continuing process between IT and the rest of the
organization. Modern governance approaches cannot work if IT has the last
word in every discussion; that only leads to dysfunction and missed opportunities.
Nor can governance be done with the tools and methodologies of past decades.
THE FABRIC
The best way to describe the solution is as a data management “fabric” that
metaphorically drapes over all of these environments and provides the management
and governance services needed. A short description of its functions follows.
The Fabric drapes over all the data resources. It is a completely different approach to
enterprise data management. It allows an organization to finally derive more value
from its data management initiatives than the cost of implementing them. Areas of
the organization that previously were denied the insight that could have been
provided by data the organization captured (somewhere) can leverage the latent
value in distributed data stores, enabled by the capabilities the GDM provides. You
can also think of the fabric as an underlying mechanism that orchestrates all of the
functions of the GDM and allows for plugging in new capabilities in an open and
seamless fashion.
3 A trucking company may have more than twenty separate telematics providers in the cab, each with its own protocols for
applications that require the trucking company to absorb data and react to it in near-real-time.
THE GLOBAL DATA MANAGEMENT (GDM) PROGRAM
Metadata, lineage, governance, security and lifecycle are the components of the GDM.
Just as important are the program, the people and their skills.
The first step is to have an actual implementation of the “fabric.” Hortonworks
provides this through its DataPlane service. The common foundation includes the
ability to manage and govern data across distributed data lakes.
METADATA
Metadata has a wide variety of definitions and sub-classes, but for the purposes of
GDM, it powers both operation and understanding. By accelerating the time to value
of your data investments, metadata democratizes access and improves the
understanding of data and processes across the organization. It rapidly improves the
productivity of analysts and data scientists. While operational metadata is the bedrock
for technical and operational aspects of uptime, performance, cost, etc., metadata is
also fundamental in lifting the productivity of analysts by addressing these six questions:
- What does the data mean (semantic)?
- Where does it come from (lineage)?
- Can I trust it (trust metrics)?
- Does its meaning vary by context (interpretation)?
- How do I find it?
- Who do I ask (data stewards, SMEs)?
Metadata is the key to governance and use. Metadata has to be developed for both
consistency of use and understanding as well as flexibility as the organization
changes. The scope of the metadata catalogs is beyond the capabilities of data
stewards to develop manually. The GDM must have intelligent software to:
- Capture and catalog metadata for new or modified data assets
- Allow data stewards to examine the machine-generated metadata and make
adjustments as necessary
- Manage metadata repositories across instances to ensure they are consistent
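
The three capabilities listed above can be sketched in a few lines of Python. This is a minimal illustration only, not a description of how any Hortonworks product works: the names `CatalogEntry`, `capture_metadata`, `steward_adjust` and `inconsistent_assets`, the sample asset name and the dictionary-based catalog are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """Metadata for one data asset (all field names are illustrative)."""
    asset: str
    columns: dict                  # column name -> inferred type name
    description: str = ""          # filled in or corrected by a data steward
    steward_reviewed: bool = False


def capture_metadata(asset, records):
    """Machine-generated metadata: infer a simple schema from sample records."""
    columns = {}
    for record in records:
        for name, value in record.items():
            # Keep the first type seen for each column.
            columns.setdefault(name, type(value).__name__)
    return CatalogEntry(asset=asset, columns=columns)


def steward_adjust(entry, description, column_overrides=None):
    """Let a data steward review and correct the machine-generated entry."""
    entry.description = description
    entry.columns.update(column_overrides or {})
    entry.steward_reviewed = True
    return entry


def inconsistent_assets(catalog_a, catalog_b):
    """Assets present in both catalog instances whose column metadata differs."""
    shared = catalog_a.keys() & catalog_b.keys()
    return sorted(a for a in shared if catalog_a[a].columns != catalog_b[a].columns)


# Hypothetical sample data for a telematics feed.
sample = [{"truck_id": "T-17", "speed_kph": 88.5},
          {"truck_id": "T-18", "speed_kph": 91.0}]
entry = capture_metadata("telematics.readings", sample)
steward_adjust(entry, "Per-truck speed readings", {"speed_kph": "float (km/h)"})
```

In this sketch the steward's correction makes one catalog instance diverge from another that still holds the raw machine-generated entry, which is exactly the kind of inconsistency the third capability is meant to surface.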
LINEAGE
Lineage records where the data originated and how it has been manipulated, along
with (crowd-sourced) trust metrics. A lot of analytical data wrangling is still a manual
process. One drawback is the difficulty of keeping track of provenance, i.e., what the
source of the data is and whether it is still current. Data is rarely gathered just once. It
can be reused for multiple versions of the analysis, or even continuously
updated/refreshed as models are refreshed for continuous improvement. In addition,
outcomes often need to be traced back to the original data sources for validation.
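
As a rough illustration of tracing an outcome back to its original sources, the sketch below records each derivation step in a small graph and walks it backwards. The class `LineageGraph`, its methods and the dataset names are hypothetical, chosen only for this example.

```python
from collections import defaultdict


class LineageGraph:
    """Records which datasets each derived dataset was built from."""

    def __init__(self):
        self._parents = defaultdict(set)   # dataset -> datasets it was derived from
        self._steps = {}                   # dataset -> description of the transformation

    def record(self, output, inputs, step):
        """Register that `output` was produced from `inputs` by `step`."""
        self._parents[output].update(inputs)
        self._steps[output] = step

    def original_sources(self, dataset):
        """Walk the graph backwards to the datasets that have no recorded parents."""
        sources, stack, seen = set(), [dataset], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            parents = self._parents.get(node, set())
            if parents:
                stack.extend(parents)
            else:
                sources.add(node)
        return sources


lineage = LineageGraph()
lineage.record("cleaned_orders", ["raw_orders"], "drop duplicates")
lineage.record("churn_features", ["cleaned_orders", "crm_export"], "join and aggregate")
lineage.record("churn_scores", ["churn_features"], "score with model")

sources = lineage.original_sources("churn_scores")
```

Here `sources` contains `raw_orders` and `crm_export`, the two datasets that an analyst would need to revisit to validate the scores.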
GOVERNANCE
Governance takes security and access to a new level. Security grants and restrictions
are driven by context, not location. For example, as an analyst, you manage a corpus
of work: data, models, presentations, notebooks. Access to data you need is granted based on
the components you use, no matter where in the world they are. Time-consuming
requests to IT or data stewards are unnecessary as access is driven by intelligent
agents that understand your role.
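
One way to picture access driven by context rather than location is the sketch below. It is an assumption-laden illustration, not the mechanism of any specific product: the role names, tags, the `ROLE_ALLOWED_TAGS` policy table and the function `is_access_granted` are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    name: str
    location: str       # e.g. "on-prem" or "cloud-eu"; recorded but never consulted
    tags: frozenset     # sensitivity tags attached by governance


@dataclass(frozen=True)
class AccessContext:
    role: str
    project_assets: frozenset   # assets referenced by the analyst's corpus of work


# Hypothetical policy: the sensitivity tags each role may read.
ROLE_ALLOWED_TAGS = {
    "analyst": frozenset({"public", "internal"}),
    "data_scientist": frozenset({"public", "internal", "pseudonymized"}),
}


def is_access_granted(ctx, asset):
    """Grant access when the asset is part of the user's work and the role permits
    every tag on it. The asset's location plays no part in the decision."""
    allowed = ROLE_ALLOWED_TAGS.get(ctx.role, frozenset())
    return asset.name in ctx.project_assets and asset.tags <= allowed


readings = DataAsset("sensor_readings", "cloud-eu", frozenset({"internal"}))
visits = DataAsset("patient_visits", "on-prem", frozenset({"pseudonymized"}))
ctx = AccessContext("analyst", frozenset({"sensor_readings", "patient_visits"}))
```

With this policy the analyst can read `sensor_readings` wherever it is stored, but not `patient_visits`, because the decision depends only on role, tags and the assets in the analyst's own work.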
The Hortonworks Data Steward Studio (DSS), which operates with the DataPlane Service,
provides businesses the capability to develop trust in their data and comply with
regulations by understanding data provenance, origin, lineage and impact. The GDM
by its nature is too complicated for one or more data stewards to manage with current
manual methods. The DSS provides them with the tools to secure, govern and provide
the data for today’s distributed, hybrid world.
A popular misconception about data scientists is that all of their work is one-off and
ad hoc, grabbing data and massaging it until it yields answers. In fact, their work is
much more formal than that. They have to assign business-friendly and intuitive
names to data files that they create or download and then organize those files into
directories, according to a rational naming convention. When they refresh those files,
they must version them and keep track of their differences. This is a complicated
process. Data doesn’t always reside in logical files. For example, clinical and
scientific lab equipment can generate hundreds or thousands of data files that
scientists must name and organize before running computational analyses on them.
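
The versioning chore described above can be sketched with nothing more than a content hash. This is a simplified illustration under stated assumptions: datasets are treated as lists of text rows, and `content_hash`, `VersionedDataset` and the sample rows are hypothetical names and values.

```python
import hashlib


def content_hash(rows):
    """Stable fingerprint of a dataset given as a list of text rows."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


class VersionedDataset:
    """Keeps every distinct refresh of a dataset and reports what changed."""

    def __init__(self, name):
        self.name = name
        self.versions = []   # list of (version number, content hash, rows)

    def refresh(self, rows):
        """Store a new version only if the content changed; return the version number."""
        new_hash = content_hash(rows)
        if self.versions and self.versions[-1][1] == new_hash:
            return self.versions[-1][0]
        number = len(self.versions) + 1
        self.versions.append((number, new_hash, list(rows)))
        return number

    def diff(self, old, new):
        """Rows added and rows removed between two version numbers."""
        old_rows = set(self.versions[old - 1][2])
        new_rows = set(self.versions[new - 1][2])
        return sorted(new_rows - old_rows), sorted(old_rows - new_rows)


results = VersionedDataset("lab_run_results")
first = results.refresh(["s1,0.41", "s2,0.39"])
unchanged = results.refresh(["s1,0.41", "s2,0.39"])   # identical content, no new version
second = results.refresh(["s1,0.41", "s3,0.52"])
added, removed = results.diff(first, second)
```

An unchanged refresh does not create a new version, and `diff` reports the rows that appeared and disappeared between two versions.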
SECURITY
Previously, data management was highly driven by “silos,” collections of domains in
locations. Schemes for governance were highly localized. Access to a data warehouse
could be broad for an analyst, but deeper analysis requiring access to other data
sources was dependent on the data management in place at those sources.
Where most data warehouses disappointed practitioners of advanced modeling and
analysis, such as data scientists building machine learning models, was in access to
raw data not otherwise needed in the data warehouse, including detail from
source systems, sensor data streaming from the edge, and all manner of external data
sources. Existing data management and security programs typically allow access to
data sources used by an analyst and a cohort of others on a “normal” basis, but
requests beyond that range fire an alert. The paradox is, a productive analyst should
spend more time working “out-of-the-box” than in it. Fractured data management
and security programs thwart their efforts.
Your organization is likely composed of a mosaic of data stores (or will be soon):
multi-cloud, IoT, data lakes, data warehouses, on-prem, hybrid cloud, at-rest and
streaming. At-rest data can be catalogued and even updated/refreshed according to a
governance scheme, but streaming data presents a more challenging problem, one that
cannot be solved manually because the flow can change without notice. GDM should
provide tools to deal with it, but governance policy is the map; the software that
implements the policy is the journey.
LIFECYCLE
Everything discussed so far only addresses a scheme of security and governance in
place. A GDM must be able to perform as a lifecycle process. That means putting in
place a program and architecture that is capable of dynamically adjusting to changing
business realities as well as the rapid cadence of new technology: integrating
new data and features, adjusting governance policy and administration to changing
conditions, and doing all of that on a consistent set of tools and metadata.
A robust GDM program cannot be implemented as a “project”; it continues through a
lifecycle. Hortonworks provides the tools to maintain your GDM through its Data
Lifecycle Manager.
WHAT A GDM PERSON DOES (PERSONALLY AND THROUGH THE TEAM)
One thing to keep in mind is that the fortunes of an organization do not change merely
by implementing technology. That is only the first step.
The leader of the GDM initiative in the organization (often given the title Chief Data
Officer, or CDO) needs, above all, to inspire confidence among the various
stakeholders in the organization. Above and beyond any particular previous skill and
experience in data management, it is paramount that the person in this role has the vision
to motivate and encourage the organization. This requires someone with the gravitas
and communication and political skill to navigate the currents of diverse backgrounds
and requirements.
The GDM leader is the keeper of the strategy and ensures it doesn’t flag, as the process
is not without challenges. This encompasses all aspects of GDM: architecture, data
catalogs, quality, lineage and metadata. To establish policies, measures, standards
and requirements that fit the spirit of the initiative, the GDM leader must dismantle
obsolete security and governance methodologies that degrade the vision. Driving the
selection process of the components ensures the program can scale economically from
both implementation and TCO perspectives.
The GDM leader owns the initiative, no matter how influential various others are in
the organization. The breaking down of silos, fiefdoms and data czars is key to
delivering data democratization in support of all services, analytics and data
products. Inevitable change management requires careful and thorough
communication to business owners and their designated data managers and stewards.
The GDM leader is the point person with the C-Suite on all matters relating to data for
compliance, privacy and governance, and has responsibility for the initial creation of
control apparatus to ensure integrity in the program. At some point, it is wise for the
GDM leader to delegate these roles and move on as the project becomes a program.
OTHER KEY ROLES
There are four key roles that you will need to establish and nurture. Many people in
your organization can step up to these roles with training, but will need to re-orient
their practices for a global, elastic, governed process:
- Data scientists and data analysts to understand cross-source lineage, apply
models across types of data and gain access to data for deeper insight into
both pre- and post-transaction analysis.
- Data stewards to investigate lineage, improve quality and eliminate
redundancies across data assets.
- Data engineers to move, back up and restore data assets across environments
and sources, while implementing an efficient data storage tiering policy.
- Data architects to define security and governance policies that are
automatically enforced to meet compliance requirements.
CONCLUSION
No organization today is immune from the push for some form of digital
transformation. The late Peter Drucker famously said, “The computer actually may
have aggravated management's degenerative tendency to focus inward…”⁴ That was
almost twenty years ago and is almost certainly not true today. However, it illustrates
how information systems have changed, and how quickly. It is no longer sufficient
to thresh through only your internal record-keeping systems for insight, and it is
very likely that you already do your analytics in multiple locations, on multiple
platforms and multiple clusters, and with very different kinds of data. In addition, more
of your staff are engaged in analytics as a result of better software tools, and more
will continue to be. It is time to jettison your old piecemeal approach to data
management from the mindset of twenty years ago. Global data management is not
optional.
4 Peter F. Drucker (2009). “The Effective Executive: The Definitive Guide to Getting the Right Things Done”, p.16, Harper
Collins.
ABOUT THE AUTHOR
Neil Raden, based in Santa Fe, NM, is an active industry analyst, consultant,
widely published author and speaker, and the founder of Hired Brains Research.
Hired Brains provides thought leadership, context and advisory consulting and
implementation services in Information Management, Analytics/Data Science,
Machine Learning/AI and IoT for clients worldwide. Hired Brains also provides
consulting, market research, product marketing and advisory services to the software
industry. Neil is the co-author of Smart (Enough) Systems: How to Deliver Competitive
Advantage by Automating Hidden Decisions, Prentice-Hall. He welcomes your
comments at nraden@hiredbrains.com.