IBM Journal of Research and Development
VOLUME 57, NUMBER 3/4, MAY/JULY 2013
Massive-Scale Analytics
Including IBM Systems Journal
Massive-Scale Analytics
"Big Data" refers to large data sets that are beyond the capability of traditional software tools
to quickly manage, process, and analyze. The development of techniques for gaining insight
from such information provides potential benefits in such arenas as business, science, and
public policy. This special issue of the IBM Journal emphasizes applications, analytics, software,
and hardware technologies that form the foundational building blocks for massive-scale
analytics and the processing of Big Data.
Preface: Massive-scale analytics
A. Soffer, Guest Editor
1 Governing Big Data: Principles and practices
P. Malik
2 Trends and outlook for the massive-scale analytics stack
A. N. Ghoting, J. A. Gunnels, P. Kambadur, E. P. Pednault, and M. S. Squillante
3 Understanding system design for Big Data workloads
H. P. Hofstee, G. C. Chen, F. H. Gebara, K. Hall, J. Herring, D. Jamsek, J. Li, Y. Li, J. W. Shi, and
P. W. Y. Wong
4 A platform for eXtreme Analytics
A. Balmin, K. Beyer, V. Ercegovac, J. McPherson, F. Özcan, H. Pirahesh, E. Shekita, Y. Sismanis,
S. Tata, and Y. Tian
5 GPFS-SNC: An enterprise cluster file system for Big Data
R. Jain, P. Sarkar, and D. Subhraveti
6 Toward a scale-out data-management middleware for low-latency enterprise computing
L. L. Fong, Y. Gao, X. R. Guerin, Y. G. Liu, T. Salo, S. R. Seelam, W. Tan, and S. Tata
© Copyright 2013 by International Business Machines Corporation. See individual articles for copying information. Post-1994 articles that
carry a code at the bottom of the first page may be copied, provided the per-copy fee indicated in the code is paid through the Copyright
Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, U.S.A. Pages containing the table of contents may be freely copied and
distributed in any form. ISSN 0018-8646. Printed in U.S.A.
7 IBM Streams Processing Language: Analyzing Big Data in motion
M. Hirzel, H. Andrade, B. Gedik, G. Jacques-Silva, R. Khandekar, V. Kumar, M. Mendell,
H. Nasgaard, S. Schneider, R. Soulé, and K.-L. Wu
8 Real-time analysis and management of big time-series data
A. Biem, H. Feng, A. V. Riabov, and D. S. Turaga
9 Novel document detection for massive data streams using distributed dictionary learning
S. P. Kasiviswanathan, G. Cong, P. Melville, and R. D. Lawrence
10 Big Data text-oriented benchmark creation for Hadoop
A. Gattiker, F. H. Gebara, H. P. Hofstee, J. D. Hayes, and A. Hylick
11 Platform and applications for massive-scale streaming network analytics
P. Zerfos, M. Srivatsa, H. Yu, D. Dennerline, H. Franke, and D. Agrawal
12 Scalable community detection in massive social networks using MapReduce
J. Shi, W. Xue, W. Wang, Y. Zhang, B. Yang, and J. Li
13 Visual analysis of large-scale network anomalies
Q. Liao, L. Shi, and C. Wang
14 A statistical approach to mining customers’ conversational data from social media
D. Konopnicki, M. Shmueli-Scheuer, D. Cohen, B. Sznajder, J. Herzig, A. Raviv, N. Zwerling,
H. Roitman, and Y. Mass
15 A real-time stream storage and analysis platform for underwater acoustic monitoring
J. P. Hayes, H. R. Kolar, A. Akhriev, M. G. Barry, M. E. Purcell, and E. P. McKeown
Governing Big Data:
Principles and practices
P. Malik
As data-intensive decision making is being increasingly adopted
by businesses, governments, and other agencies around the world,
most organizations encountering a very large amount and variety
of data are still contemplating and assessing their readiness
to embrace "Big Data." While these organizations devise various ways
to deal with the challenges such data brings, the impact and
importance of Big Data to information quality and governance
programs should not be underestimated. Drawing upon
implementation experiences of early adopters of Big Data
technologies across multiple industries, this paper explores the issues
and challenges involved in the management of Big Data, highlighting
the principles and best practices for effective Big Data governance.
Introduction
Big Data, a term commonly associated with massive datasets
growing at a rapid pace, also represents complexity and
variety in the types of data that are being collected and
analyzed by organizations. Whether the data is used to
explain global climatic change patterns and their impact,
predict customer behavior, forecast purchasing intent,
retain customers, target political donors, influence voters
for political campaigns, or even help understand the
characteristics of our very early universe via the discovery
of subatomic particles such as the Higgs boson particle [1],
Big Data clearly has a significant role to play.
Many different definitions and theories on what constitutes
Big Data exist today. The most often cited definition [2]
captures its essence: "Big Data exceeds the reach of
commonly used hardware environments and software tools to
capture, manage, and process it within a tolerable elapsed
time for its user population." A seminal report [3] from
McKinsey Global Institute in 2011 helped promote benefits
of using Big Data in the corporate environment, but it was
more than ten years earlier that an analyst introduced the
idea in a research note [4], discussing ways and means of
controlling volume (e.g., massive size of datasets), velocity
(e.g., amount of data transferred per unit of time), and variety
of data (e.g., images, audio, web information, or computer
logs, as well as various structured database records), which
is the basis for the popularity of the letter "V" used in many
Big Data marketing communications.
Big Data has significantly altered the data management
considerations for the information technology (IT)
professional. According to International Data Corporation
(IDC), approximately 90% of the digital data we encounter
today did not exist two years ago, and the total volume of
digital data is predicted to grow from 2.7 zettabytes (ZB)
in 2012 to 35 ZB by the year 2020 [5]. However, Big Data
is not just big in terms of size;
it is transmitted at high rates and usually implies
heterogeneity of data types, representation, and semantic
interpretation. For example, it can come from online
customer behavior via social media, from click streams
(e.g., records of the paths taken by users through a company
website), call detail and data-usage records (xDRs;
also known as event detail records) from telecommunications
companies, and data files or transaction notes or audio
files captured by contact centers (e.g., facilities that manage
client contacts through telephone calls). Sensors and other
machines automatically and continuously generate data
and event streams. Biometric data (such as fingerprints
or retina scans) as well as medical images and health care
data add variety, while traditional mission-critical
applications continue to produce terabytes of transaction
data.
Likewise, the world faces a data deluge because of the
widespread proliferation of the Internet, combined with the
ubiquity of advanced computing and mobile phone
technologies. Organizations will need to handle it
appropriately and responsibly not only for competitive
advantage, but for survival. To add to the challenge of
handling structured data in rows and columns in traditional
© Copyright 2013 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without
alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed
royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
P. MALIK 1:1 IBM J. RES. & DEV. VOL. 57 NO. 3/4 PAPER 1 MAY/JULY 2013
0018-8646/13/$5.00 © 2013 IBM
Digital Object Identifier: 10.1147/JRD.2013.2241359
data stores, organizations must also contend with
semi-structured data from web logs and machine-to-machine
(M2M) logs, as well as unstructured data in the form of texts,
documents, PDF (portable document format) files, images,
videos, and audio streams.
Big Data analytics enables organizations to make
"smarter" decisions and execute them when it is most
beneficial to the business, its customers, and its
partners, such as in real time. For example, organizations
deploy predictive models on event streams to identify
real-time trends. In this way, they are improving the speed
and intelligence of their response to customer behavior,
suspicious activity, and aberrations in processes. The
following cases illustrate diverse applications of Big Data
for smarter decision making.
In the first case, the 2012 U.S. elections provided a
glimpse of how effectively the Democratic political party
staff used voter contact data combined with social media
analytics to precisely adjust its campaign messaging,
create simulation models, and mine data to decide exactly
when and on which media channel to place its
advertisements to maximize campaign donations, effectively
influencing the outcome of the elections in its
favor [6].
In another example, researchers at the University of
Ontario Institute of Technology (UOIT) in Canada use
streams of medical sensor data to help doctors detect and
analyze subtle changes in the condition of critically ill
premature babies to predict the onset of neonatal
life-threatening infections as much as 24 hours in advance,
thereby enabling doctors to take precautionary measures and save lives [7].
Finally, consider analytics applications in video
surveillance. The image data file of the video is large and
uninteresting, but the associated data becomes more useful
after adding various forms of metadata, indices of frames
in the video, who and what is contained therein, transcripts of
dialogs from the video, and comments from viewers. In short,
all the information that makes the video searchable and
relevant adds both value and volume. A commercial
application of such systems can be found in ensuring the
security of buildings and physical infrastructure, and these
systems are being incorporated in many home video
surveillance and security systems as well. Adding audio
and video analytics for this streaming data in real time,
TerraEchos offers value-added solutions across industry
segments for monitoring and protecting sensitive
infrastructure (e.g., infrastructures involving government,
oil, gas, and power grids) as well as for cybersecurity [8].
As mentioned, Big Data is a term that refers to the
management, processing, and analysis of large amounts
of data. Big Data includes the data in storage systems
(e.g., Hadoop** Distributed File System [HDFS], under open
source Apache Hadoop and MapReduce frameworks [9])
and databases, as well as data in the process of being
transmitted (e.g., data in stream-processing systems).
As the corporate world explores and recognizes the
importance of using Big Data analytics, a few organizations
have started realizing the potential of Big Data, though
many organizations are still struggling to manage their
transactional and operational data and are not yet ready
[10, 11]. For those organizations that aspire to be "world
class" and beat their competition, it is only natural that
they will need to strengthen their information management
programs to secure, manage, and expand governance
initiatives to address Big Data.
The subsequent sections of this paper outline the need for
governing Big Data and the challenges involved. Further,
they will highlight the principles and practices for governing
Big Data as demonstrated by early adopters of Big Data
technologies.
Why govern Big Data?
Governance of information involves the orchestration of
people, processes, and technology to turn data into a strategic
asset for the organization. Data governance embodies the
exercise of decision making and authority for data-related
matters spanning policies, rules, rights, organizational
structures, and accountabilities of people and information
systems as they perform information-related processes to
accomplish business objectives [12]. Depending on the
maturity of the information management function, varying
levels of adoption dictate the sophistication of the
information quality and governance discipline within the
enterprise. While Big Data introduces more and new types of
information, the reasons to govern data are well established.
We govern data to manage risk and, more importantly,
to extract value from data.
Big Data for operations and cost management
Improving operational efficiencies through reducing direct
and indirect costs is of interest to business managers across
industries. As mentioned, in any industry, Big Data plays
a role through continuous monitoring and analysis of data
emitted by sophisticated machines, embedded environmental
sensors, and operational metrics. This M2M sensor data,
which traditionally was often discarded because of
unmanageable volumes or the speed at which it is generated,
has the potential to improve the operating efficiency of equipment
as well as overall operational safety. For example, in oil
drilling rigs, proactively acting on early warning signs of
mechanical or electronic equipment component failures
can prevent deadly accidents and oil spills. Analyzing this
type of M2M data is valuable to predict the health of
equipment components and for timely preventative
maintenance, thereby avoiding costly industrial downtime.
Thus, adequately governing such Big Data contributes to
efficiency of operations.
Big Data for value creation
Social media companies such as Twitter, Facebook, and
LinkedIn rely on users’ willingness to provide their personal
stories, status updates, and posts to create a valuable corpus
of rich content. On each social media platform, aggregated
data is monetized and ultimately creates revenue streams.
Thus, when Facebook was valued at $100 billion
at the launch of its initial public offering in early 2012,
though many had expressed doubts about the valuation,
the data scientists and social media software experts took
pride in the coming of age of Big Data.
Tremendous possibilities emerge for transforming not only
businesses, but also entire industries with data. Leading
organizations in financial services, health care, retail,
telecommunications, media, and other industries are already
benefitting from new opportunities generated by Big Data
analysis. For example, organizations are actively analyzing
the content from thousands of social media posts every day
to uncover new customer insights that enable them to
profit from emerging trends and deliver a richer customer
experience. Some are improving the consistency of
information across sales channels to provide a seamless
experience for customers as they move among web,
mobile-app, in-person, phone, and other interaction points.
Many others are using real-time analytics to identify new
cross-sell and upsell opportunities, making predictions, and
retaining customers. Still others are working
on enhancing the efficiency of operations and increasing
service availability by analyzing tens of billions of system
event records. In many cases, companies are scrutinizing
identities, relationships, and previous customer interactions
and are using predictive analytics to anticipate fraud and
avoid costly losses [13].
In an IBM study [14] of more than 1,700 chief marketing
officers (CMOs), the authors highlight the transforming
of the CMO’s agenda as a result of glaring gaps in
organizational readiness to handle Big Data. At the same
time, the marketing professionals led by the CMO are
increasingly benefitting from Big Data adoption and are
readily engaging in new data-centric initiatives.
There is intrinsic value in historical customer data,
augmented by behavioral data that has been enhanced with
micro-segmentation (e.g., demographic categorization to a
near-personal level), online browsing, telephone calling
patterns (e.g., who, when, and how), and social media
activity to predict purchasing intent, influencers, and
consumer buying trends. In the past, available computing
technology did not allow for combining all of this data
economically to perform the real-time tracking, monitoring,
and fine-tuning of promotions and campaigns. However,
processing Big Data has provided unprecedented avenues
for marketers to access and analyze all types of internal
or externally procured data, thereby empowering the
marketing executives.
Big Data for risk and compliance management
The risk management function across industries has the
potential to gain from the application of Big Data. If the
corporate governance policies for banking and financial
institutions were robust and mortgage loan officers had been
empowered with data-driven measures, we would possibly be
living in a different world today. For example, if bankers
had the ability to use the full profile of the prospective loan
applicants, identifying their creditworthiness, they could
assess historic banking patterns, spending habits, and future
earnings potential clearly. That, combined with enriched
social media and economic data from external sources, could
have produced a comprehensive borrower risk profile for
each applicant, and as a result, bankers and lenders could
have been better prepared to handle the collapsing U.S.
financial markets, which was a classic case of lack of
governance at many levels [15]. Likewise, if the health care
industry executives had better insight and capability to
analyze millions of health care transactions daily, they could,
for example, better identify fraudulent claims in health
care and potentially prevent an estimated worldwide loss of
180 billion annually, per a 2010 study by the European
Healthcare Fraud and Corruption Network [16].
Most organizations focus on efficiently handling financial
data to ensure the accuracy and integrity of their financial
information and protect their chief financial officer (CFO)
and chief executive officer (CEO). While increasing scrutiny
from regulators and compliance mandates such as
Sarbanes-Oxley, Basel II, Solvency II, and the Health Insurance
Portability and Accountability Act (HIPAA) has brought
increasing responsibility to the chief information officer
(CIO) organization for information quality and governance,
the data ownership and hence responsibility for its
trustworthiness need to remain with the business.
Big Data for operations, risk, and value
simultaneously
When data is governed in order to meet all three business
needs (i.e., operations, risk, and value management),
the value obtained by the organization is amplified. As
illustrated in the examples below, converting sensor and
diagnostics Big Data to valuable insights is changing the
airlines and utilities industries.
Machine and sensor data from aircraft engines, which
could generate as much as 640 TB per flight [17], can be put
to effective use, including fault prediction and prevention.
This ultimately results in better fleet care, reduced flight
delays, and reduced cancellations due to unforeseen technical
issues with the plane. Those frequent fliers who have missed
connecting flights because of last-minute maintenance crew
"checkups" will certainly find this application of Big Data
useful.
Smart meters, capturing electric- and gas-usage data every
few minutes, provide a good illustration of Big Data
benefitting the consumer as well as the provider of energy.
The utility companies can now intelligently match supply
with demand and offer consumer incentives to change usage
patterns and behaviors. With active governance of such data,
isolation of faults and quick fixing of issues can prevent
systemic energy grid collapse [18, 19].
Thus, Big Data is enhancing operations, reducing
operational and performance risks, and generating value for
the consumer. Almost every industry has the potential to
benefit significantly from the application and analysis of
Big Data. Having established the importance and reasons for
governing Big Data, the next section examines how leading
organizations are overcoming challenges in managing and
governing it today.
Challenges and opportunities offered by
Big Data
As several use cases of Big Data have evolved, it is clear that
much of this data can be of value and must be considered for
exploratory analysis. Once the exploration starts to show
some useful trends, it provides insight into which datasets
can be further deeply analyzed and which data may be
discarded or purged. This section highlights some trends
that are shaping opportunities and challenges for governing
Big Data.
Confluence of mobile, Internet, and social activity
With roughly half of the world’s population today being
online, and a majority of them being mobile, consumers
as well as enterprises of all sizes have benefitted from the
Internet and mobile revolution. Increasing propensity to use
social media tools to shop, spend, and share insights has
heralded the emergence of social business [20]. Obviously,
in this rapid confluence of mobile, Internet, and social
activity, more data is being generated than can be managed
effectively by humans. Machine-learning algorithms and
recommendation engines that use this data, such as those
from Amazon and Netflix, are not only getting smarter
(e.g., more sophisticated and effective), but also generating
more data while enriching our lives. Data is being recognized
as central to innovation and hence can no longer be just
relegated to enterprise back-office systems, and this data
must be governed appropriately.
Evolving consumer behavior
Rapidly evolving consumer behavior in which purchase
behavior is influenced more by friends or online reviews than
by advertisements [21] signals a trend. Moreover, a study
found that more than 70% of online shoppers are more likely
to pay attention to and act upon a stranger's comments on social
media than believe the direct targeted messages from the
manufacturer, and the younger generation is more likely
than baby boomers to do the same [22]. More
window-shopping than actual buying occurs in physical
stores when a company website can offer the same or a
better product at a lower price, with targeted promotions
combined with incentives such as no sales tax and free
shipping. Thus, it is valuable to understand
consumers across their multiple social personas
(e.g., presence at social networks such as Facebook**,
LinkedIn**, Twitter**, and MySpace**, along with
blogging or reviewing activities), households, and online
activities. Combining historical buying patterns with online
and in-store activities reveals challenges for governing
the master data (e.g., linking the system of engagement with
the system of record to formulate the single view of the
consumer) that may not be ready for Big Data.
Rise of social commerce
Social media is no longer reserved just for the younger
generation; it is also being embraced by mainstream
businesses. The popularity of massively multiplayer online
games, in-app purchases with a phone, tablet, or other mobile
devices, and the advent of gamification (i.e., use of game
design techniques and mechanics to incentivize or change
consumer behavior in non-gaming situations), combined
with social shopping behaviors, have accelerated the arrival
of social commerce.
In such an era, social shopping is rapidly becoming the
norm and is generating even more data than traditionally
generated through point-of-sale (POS) systems. A huge
Big Data opportunity for retailers and marketing science
enthusiasts alike, this shopping data, however, poses data
governance challenges because it needs to be combined
with enterprise corporate data. With the incorporation of
Facebook, Twitter, and YouTube** into digital marketing,
public relations, and other traditional business operations,
companies now must deal with massive amounts of
unstructured data. This data onslaught comes in addition
to rapid growth in the amount of customer data housed
in enterprise systems. Though hardware storage costs are
decreasing as well, we often see that multiple copies of the
same datasets continue to propagate within the enterprise and
compound problems in management, archive, and retrieval
of the latest, most complete, and relevant data. Companies
must now find cost-effective ways to integrate and
analyze the collective pool of all relevant data to generate
granular business insights.
Security and privacy
The extent to which our digital footprints, combined with
in-store activity and credit-card usage patterns, can be
exploited by marketers using Big Data is aptly illustrated
by the overly aggressive focused promotional campaign
by retailer Target for baby products to expectant mothers,
thereby antagonizing some customers [23]. This marketing
was made possible through mining data using algorithms that
could infer that different digital transactions in different
systems are related to the activity of a single person or
household. Commonly termed entity resolution (ER),
this is a great use of technology for public benefit
such as counterterrorism and detecting frauds
(e.g., anti-money-laundering approaches); however, it raises
serious privacy concerns for the ordinary citizen and
motivates public policy, data privacy, and governance
debates. A Big Data governance program needs
to address the security challenges of sharing data between
applications and compliance with geographical trans-border
data regulations, along with strong protection requirements
for personal, health, and financial data.
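The entity-resolution idea described above, linking records held in different systems to a single person or household, can be illustrated with a deliberately simplified matching rule. This is a minimal sketch under stated assumptions, not the algorithm used by any commercial ER product; the record fields, sample values, and threshold are all hypothetical:

```python
from difflib import SequenceMatcher

# Toy customer records from two systems; all field names and values
# are hypothetical illustrations.
store_records = [
    {"id": "s1", "name": "Jon A. Smith", "zip": "10001"},
    {"id": "s2", "name": "Mary Jones", "zip": "94103"},
]
web_records = [
    {"id": "w9", "name": "John Smith", "zip": "10001"},
    {"id": "w4", "name": "M. Johnson", "zip": "60601"},
]

def similarity(a, b):
    """Fuzzy string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(left, right, threshold=0.7):
    """Link record pairs that likely describe the same person:
    require an exact ZIP match (a cheap blocking key) plus a
    fuzzy name score at or above the threshold."""
    matches = []
    for l in left:
        for r in right:
            if l["zip"] == r["zip"] and similarity(l["name"], r["name"]) >= threshold:
                matches.append((l["id"], r["id"]))
    return matches

print(resolve(store_records, web_records))  # [('s1', 'w9')]
```

Production ER systems add probabilistic scoring, richer blocking keys, and survivorship rules; it is precisely this ability to join seemingly unrelated records that creates the privacy and governance obligations discussed here.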
Technology advancements and open-source issues
Technical challenges for handling Big Data are compounded
by the rapid pace at which new technologies and
methods are becoming available in the open-source world.
Unstructured data (including free-form text, images, and
log data) does not easily lend itself to data patterns or data
relationships because it does not need to conform to defined
data formats. The historical value of such data is limited
until a pattern is discovered. Loading such Big Data into
traditional relational databases is very time-consuming,
error prone, and cost prohibitive. In order to manage this,
many companies are deploying advanced technologies such
as the parallel-processing MapReduce-based Hadoop
software (e.g., an open-source version or a commercial
distribution such as IBM BigInsights*) to process and
analyze Big Data using distributed commodity hardware
server clusters. They then integrate Hadoop with systems
housing other customer data to gain richer insights. We have
the intersection of the traditional methods of delivering,
managing, and viewing information, as well as a new
approach that allows data of all types and formats to be
quickly sorted for transactional and operational opportunity.
This new era of data exchange requires next-generation
compute, storage, and input/output technologies [24], as well
as systemic thinking. Hadoop, streaming, and NoSQL
(not-only Structured Query Language) technologies
encourage fresh thinking; however, traditional governance
programs will not achieve their intended impact if not
expanded for handling Big Data.
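The MapReduce programming model mentioned above can be sketched in a few lines of plain Python. This toy simulation only illustrates the map, shuffle, and reduce phases of a word count; it stands in for neither Hadoop nor IBM BigInsights, which distribute the same pattern across clusters of commodity servers:

```python
from collections import defaultdict
from itertools import chain

# Input records standing in for lines of unstructured text or log data.
documents = [
    "big data needs governance",
    "big data and big insights",
]

def map_phase(doc):
    """Emit (key, value) pairs; here, one (word, 1) pair per token."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregate all values observed for one key; here, a simple sum."""
    return key, sum(values)

pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # 3
```

Because each map and reduce call is independent, the framework can run them in parallel on separate machines, which is what makes the model attractive for Big Data workloads.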
Quality and uncertainty
The phrase "data is the new oil" [25] underscores the
immense value of Big Data. However, just like crude oil,
the most value is in its refined state (e.g., cleansed data).
Increasing volume and speed of data generation can result
in decreasing data certainty. This raises the all-important
issue of data quality and trustworthiness that must be
addressed given the incomplete and often uncertain nature
of Big Data. For example, in social media channels
(e.g., Facebook posts or Twitter tweets), human sentiments
and expressions are often inherently treated with skepticism.
Thus, without the full context, this data leaves an element
of doubt in situations such as those involving a customer’s
preference and sentiment for a brand or future buying
decision for a specific product in a specific geography.
None of the traditional data-cleansing technologies can
routinely make this data better or useful. However, this data
contains valuable information that could yield trustworthy
data when triangulated with multiple similar sources and
geospatial location and advanced mathematical modeling
techniques such as emerging sense-making systems [26].
Thus, Big Data must be managed in context; despite its
inherent noisiness and lack of an upfront model, this raises
the need to track data lineage and origin in order to handle
uncertainty and error.
Public datasets and data consortiums
Publicly available high-quality datasets, such as satellite
data, astronomical data, and imaging and mapping geospatial
data, along with up-to-date weather data feeds combined
with public-agency demographic datasets such as those from
data.gov [27], are a boon for both consumers and the
scientific research community. Numerous public utility
data-mashups and "data apps" are now available for practical
applications, such as an in-city parking locator [28] or a
crime stopper (e.g., through gunshot acoustic analytics
offered by ShotSpotter**). Likewise, with massive amounts
of data being collected about consumers and their buying
patterns based on store loyalty cards, coupon redemption
or credit-card transaction history data, or mobile calling and
data-usage patterns and online behaviors, many companies
have amassed very rich customer repositories and are
in a position to act as data consortiums for the industry.
These datasets and infomediaries (e.g., businesses working
as agents on behalf of customers, helping them to monetize
and maximize the value of their information profiles [29])
would need to be governed as this Bexternal[ data becomes
integrated into the enterprise.
Integrating Big Data with traditional enterprise data
Historic IT behaviors, such as building systems without
considering the real need of the business yet deploying
systems for the sake of novelty in technology, will not work
in the age of Big Data. From being product centric,
organizations have evolved to being customer centric;
however, in the Big Data era, even this will not suffice.
Differentiated service based on past interactions, anticipated
needs, and a 360-degree view of the customer is needed to
succeed today. IT organizations must partner closely with
business organizations and must adopt agile and iterative
experimentation approaches. In essence, this involves a total
shift in mindset. However, it is amply evident that silos of
data will not work. Thus, historic enterprise data must be
fully integrated with the newer types of datasets; while
keeping performance and efficiency at scale
(e.g., handling large workloads through distributed
computing architectures), organizations must be able to
transform all of this new data into actionable information
within budget [30]. As a result, Big Data governance needs
to assume a central role if competitive advantage from data
is desired.
Governance principles and leading practices
Big Data governance is defined as the emerging set of
processes, methods, technologies, and practices that enable
the rapid discovery, collection, processing, analysis, storage,
and defensible disposal of large volumes and fast streams
of structured and unstructured data with security, privacy,
and cost efficiency [31]. Although the strategic reasons
for instituting data governance programs do not change by
integrating Big Data in the enterprise, the newer types of data
and related characteristics make existing tools, processes,
and practices inadequate to handle the data. Because of
inherent veracity concerns, context and provenance gain
prominence in Big Data governance, and hence automation
and metadata need to be addressed. Some of the early adopters of
Big Data technologies have experimented with and embraced
a variety of processes and principles that can serve as
guidance for the industry. Leading practices in Big Data
governance are still evolving and will be refined as Big Data
initiatives become widely accepted. This "Governance
principles" section highlights a few of these.
Aligning goals with business strategy to build
a roadmap
As with all strategic initiatives, success relies on maintaining
a clear vision of the final goal. Information governance
is a discipline that governs company data assets throughout
their life cycles [32]. Data governance programs aim to
maximize the value of data through active engagement
of people, processes, and technology. Big Data governance
is associated with an overarching mission to maximize the
value of all types of enterprise data throughout their life
cycle. The first step of the journey toward Big Data
governance involves stakeholder engagement as well as goal
alignment with the business strategy of an organization.
Establishing time value of data
Organizations view Big Data technologies as change agents,
especially because several of these technologies support
data in its native format. At the starting point in the Big Data
life cycle, organizations do not always know which data
sources have value and whether organizations want to invest
vast resources to gather requirements and sponsor formal
information governance programs [33]. Although it is
difficult for some to overcome long-time habits, it must be
made clear that not all data is valuable and worth keeping
forever. Direct storage and indirect handling expenses can be
cost prohibitive, given the rapidly increasing volumes of
Big Data. Thus, it must be recognized that data has varying
degrees of value. Even with such recognition, that value will
change over time as circumstances change and use cases
evolve. At a minimum, the value is at least related to the
cost of storage, and, thus, the data is more like a natural
resource. If it is abundant but has no use, the value is low.
However, data can gain value when it has a novel purpose.
Time value of data (TVoD) takes into account the
"temperature" of data (e.g., frequency of access guiding a
tiered archival and storage policy), use case, and relative
importance (e.g., expiry and aging policy, and short shelf life
for certain types of data), thereby helping establish optimum
governance practices.
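The TVoD notion above lends itself to a simple policy function. The sketch below is illustrative only: the tier names, access-frequency thresholds, and shelf-life rule are assumptions made for this example, not anything prescribed in the paper.

```python
# Illustrative TVoD tiering policy: data "temperature" is driven by access
# frequency, and shelf life by age; all thresholds here are assumed.
def storage_tier(accesses_last_30d: int, age_days: int, shelf_life_days: int) -> str:
    if age_days > shelf_life_days:
        return "dispose"  # past its useful life: candidate for defensible disposal
    if accesses_last_30d >= 100:
        return "hot"      # frequently accessed: fast, expensive storage
    if accesses_last_30d >= 10:
        return "warm"     # occasionally accessed: mid-tier storage
    return "cold"         # rarely accessed: cheap archival storage

# A dataset accessed 250 times in the last month stays on fast storage.
print(storage_tier(250, age_days=40, shelf_life_days=365))   # hot
# The same dataset past its shelf life is flagged for disposal.
print(storage_tier(250, age_days=400, shelf_life_days=365))  # dispose
```

In a real program, the thresholds and shelf lives would themselves be governance artifacts, set per data category and revisited as use cases evolve.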
Identifying a business use case
Experimenting with Big Data technologies to explore
viable use cases in the context of an industry domain
is a good practice for both IT and business teams, helping them
appreciate the potential of Big Data first-hand. Once some
useful patterns emerge from data exploration, a business case
can be identified and further refined to quantify benefits.
Ideally, this should be in line with the objectives or
imperatives set forth by the organization. For example,
a telecommunications firm had a goal of reducing "churn"
(i.e., customer turnover) and improving customer
intimacy. By analyzing transactional call data records
(CDRs), the company was able to predict the likelihood of
customers discontinuing their service by analyzing the
number of dropped calls for each consumer within each
month. By using geo-location data from the phone, they were
able to offer location-based promotions and services.
Detailed telecommunications "churn" models also include
social media data from Facebook and Twitter to derive
customer sentiment. In this example, in order to correlate this
data and perform deep predictive analytics, the data had to
be shipped to an overseas vendor each day. This caused
significant concern with respect to safeguarding the privacy
of customer data. After appropriate deliberation, the
telecom operator decided to mask sensitive data, such as
subscriber name, because the calling and receiving telephone
numbers were the primary fields of value for churn
analytics [34].
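The masking step in the telecom example above might look like the following sketch. The record layout, field names, and hashing scheme are hypothetical; the only point taken from the text is that the subscriber name is masked while the phone-number fields needed for churn analytics are preserved.

```python
import hashlib

def mask_cdr(record: dict) -> dict:
    """Mask a call-detail record before shipping it to an external vendor.

    The directly identifying field (here, a hypothetical subscriber_name)
    is replaced by a one-way hash token so records stay linkable without
    being identifiable; the calling/receiving numbers, the fields of value
    for churn analytics, are kept as-is.
    """
    masked = dict(record)
    token = hashlib.sha256(record["subscriber_name"].encode()).hexdigest()[:12]
    masked["subscriber_name"] = f"SUBJ-{token}"
    return masked

cdr = {"subscriber_name": "Jane Doe", "caller": "555-0101",
       "callee": "555-0202", "dropped": True}
print(mask_cdr(cdr)["subscriber_name"])  # a SUBJ-… token, no real name
```

A plain unsalted hash of a low-entropy value such as a name is reversible by brute force, so a production design would use a salted hash or a tokenization service; the sketch only illustrates where masking sits in the pipeline.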
A sampling of examples from the oil and gas industry
illustrates the concept of establishing use cases for Big Data
governance to further gain sponsorships [35]. Geospatial
data, which is critical to exploration and production, may
relate to land-based drilling, offshore drilling, or wells that
might be abandoned. There are several examples of poor
geospatial data governance that have had a profound business
impact. In one instance, an oil company used incorrect
geospatial coordinates to drill a well at the exact location of
an abandoned well, which resulted in losses of millions
of dollars. In another instance, an oil company used incorrect
navigational coordinates to drill a hole in an adjacent field
that belonged to another company. The company faced legal
issues and public embarrassment in addition to lost time and
money. Instances such as these are sufficient to convince
stakeholders of the need for governance and thereby support
a specific set of use cases.
Performing maturity assessment
An early step in the implementation of an
information-governance program is to assess the current state
and determine the gaps and efforts needed to reach the
desired future state. Such maturity assessments are often
conducted as part of a strategic information agenda,
IT-business alignment, or an organizational readiness
fact-finding exercise. According to an IT executive at the
Turkish bank Akbank, "Big Data has all the characteristics
of 'small data' when it comes to governance. The only
difference is in the complexity and variety of channels that
it comes from. Although there are greater demands on
organizational energy and resources to govern Big Data,
the gain in terms of business value is that much higher. Being
able to analyze Big Data from the Web, and take necessary
actions, can have a major impact on the profit of a
company. A maturity model for Big Data governance is a
critical first step in this journey" [33]. A framework for
Big Data maturity assessment can be established and
deployed as an extension of a traditional
information-governance assessment initiative using a
capability maturity model (CMM) such as the IBM
Information Governance Council Maturity Model as a
starting point. However, additional factors (such as
the use-case domain, industry context, and unique data types
for Big Data) help in shaping the questions and formulating
the maturity assessment. For example, in "customer
centricity improvement" use cases to determine customer
sentiment, the customer master data needs to be integrated
with social media data, but the social media data may not
require cleansing, extensive documentation, or individual
record life-cycle management. Thus, because the purpose of
governance is to establish and deliver trusted information,
use cases dictate the rigor applied in Big Data governance
and thus the future level of maturity desired.
Handling veracity through context and provenance
Popularly characterized by the three Vs, as discussed earlier
in the introduction of this paper, Big Data has more recently
been subjected to additional criteria, each starting with the letter V.
The Global Technology Outlook (GTO) study by IBM
Research [36] identified veracity as an important criterion
in the characterization of Big Data. The GTO states, "Today,
the world’s data contains increasing uncertainties that arise
from such sources as ambiguities associated with social
media, imprecise data from sensors, imperfect object
recognition in video streams, model approximations, and
other sources. Industry analysts believe that by 2015, 80% of
the world’s data will be uncertain." Figure 1 depicts the
increasing data uncertainty as data volumes are projected
to increase. The rate of increase is more pronounced for the
largely unstructured data (e.g., from VoIP [Voice over
Internet Protocol], social media, and sensors) compared with
the enterprise data, which is mostly contained in structured
databases.
Figure 1
Increasing data volumes usher in the era of higher uncertainty in data, especially as the quantity and incidence of unstructured data are rising. The
phrase "Internet of Things" refers to identifiable objects and their representations in an Internet-like structure. (VoIP: Voice over Internet Protocol;
IDC: International Data Corporation. Figure adapted from the 2012 IBM GTO Study, an overview of which is given in [36].)
This decreasing "veracity" of data is a natural byproduct of
increasing complexity, volume, and velocity of data.
As discussed in a previous section, veracity of data must be
acknowledged and addressed. Techniques for managing the
predictability of decisions based on imprecise datasets
must be refined. Enhancing the metadata through correlation
of context and provenance (i.e., lineage) of data improves the
veracity. For example, one IBM solution for fighting crime
uses historic criminal records and uses analytics to predict
and reduce occurrence of new incidents. The latest advances
in law enforcement and crime prevention integrate fragments
of data from multiple sources, but details described in
eyewitness accounts, incident reports, media and journalist
reports, and other individual observations can differ widely
and ultimately limit accuracy. The extrapolation across
missing data points, using sophisticated software, enables
the solution to be useful, even with imprecision in the
available datasets.
Establishing or adapting a data-quality program
to embrace Big Data
Data quality has some characteristics of art, with its worth
realized after the data is employed away from its source
of generation (in the case of data, to produce information
for business decision making). The value of most Big Data
depreciates with time, especially data from social media
channels. Poor-quality data, to some, may not appear
inaccurate, inconsistent, or redundant, because it may be fit
for the intended purpose or use case [37]. There have been
efforts in the past to quantify the amount of losses
corporations experience if the problem of poor data quality
is not addressed [38, 39]. However, as a practice, dealing
with poor data quality is often neglected, except in cases of
financial data or compliance mandates that could draw
attention from regulators and company management. In a
2012 study by the International Association for Information
and Data Quality (IAIDQ), more than 70% of respondents
considered data a strategic asset. Increasing regulatory
activity and an increase in the understanding of the value of
data have raised the importance of data quality as a discipline
across organizations [40]. Most organizations have
established data-quality initiatives that improve the level
of trustworthiness of corporate data. Some address the
data-quality issue by aggregating, matching, merging, and
cleansing data in intermediary staging databases
(referred to as "downstream" data fixing). However, it is
well established that quality problems primarily arise
from inadequate controls at the transactional source system
and that the cost of fixing them increases downstream.
To date, many of these data-quality initiatives have only
addressed traditional or structured data. However,
as mentioned, in the area of Big Data, data sources such as
social media marketing, real-time sensor feeds, and IT system
logs have historically not been linked to official reference
data from transactional systems. Traditionally, such data
did not usually have to be cleansed because the data was
examined in isolation by specialist teams that often addressed
issues "offline." However, cross-information-type
analytics, which are common in the Big Data space, have
changed this dynamic.
Consider an example of a large health care insurance
company that processes more than 500 million claims per
year, each claim record consisting of 600 to 1,000 attributes
[34]. The company uses predictive analytics to determine
whether certain proactive measures were required for a small
subset of members. However, an audit found that physicians
were using inconsistent procedure codes to submit claims,
thereby limiting the effectiveness of the predictive analytics.
Further, the text notes within claims documents were
ambiguous. For example, the team used terms such as
"chronic congestion" and "blood-sugar monitoring" to
determine that certain members might be candidates for
disease management programs for asthma and diabetes,
respectively. These types of cases are likely to encounter
data-quality issues when information sources that have
historically been underutilized are used rigorously, such as
in cross-information matching for deeper insights.
Clearly, handling massive amounts of data implies that
organizations need to be able to deal with the uncertainty
in the quality of this data. Although not all data needs
an in-depth review, the practical approaches to data-quality
management that have applied to traditional structured data
will not be sufficient for the Big Data era, and
data-quality programs have to be adapted.
Thus, the governance program needs to adopt the
following substeps to address data quality. First, those
involved with such a program must work with business
stakeholders to identify critical data elements that ensure the
success of the information quality and governance program.
Second, they must build a business case to justify the
management of data quality for the organization in general
and extend it so as to relate to the Big Data aspects.
In addition, they must build the overall data-quality plan,
covering vital parameters such as those involving
specification, timeliness, availability, accuracy, precision,
consistency, synchronization, security, and accessibility [41].
Of course, not all of these kinds of dimensions that apply to
traditional data need to apply to Big Data (e.g., timeliness
and accuracy are important dimensions for sensor data,
whereas the same kind of accuracy may not be critical
for social media data).
Those involved with governance programs must also
establish data-quality standards and policies and deploy
proactive monitoring through technology and automation.
They must develop a scorecard to monitor the data-quality
metrics along each dimension and establish confidence
intervals for the quality of virtually all types of data. They
should also appoint data stewards who will be accountable to
the information governance council for improving the
quality metrics over time. These data stewards must be
educated in handling Big Data types and effectively
communicate among one another and with business and
IT stakeholders. Finally, governance-program leaders should
refresh the information governance scorecard on the basis
of ongoing quality improvement progress and the needs of
the business.
With a combination of advancing technology and adequate
processes, organizations will be able to improve the quality of
data-based decisions and compensate for the sparsity,
ambiguity, and veracity issues of Big Data.
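The dimension metrics and scorecard described above can be sketched as a small roll-up computation. The dimensions, checks, weights, and sample records below are illustrative assumptions, loosely echoing the health care claims example earlier in this section.

```python
# Illustrative data-quality scorecard: each dimension is scored as the
# fraction of records passing its check, then rolled up into a weighted score.
def quality_scorecard(records, checks, weights):
    scores = {}
    for dim, check in checks.items():
        passed = sum(1 for r in records if check(r))
        scores[dim] = passed / len(records)
    overall = sum(weights[d] * scores[d] for d in scores) / sum(weights.values())
    return scores, overall

claims = [
    {"code": "J45", "note": "chronic congestion"},
    {"code": "",    "note": "blood-sugar monitoring"},
    {"code": "E11", "note": ""},
    {"code": "J45", "note": "chronic congestion"},
]
checks = {
    "completeness": lambda r: bool(r["code"]),                  # procedure code present
    "consistency":  lambda r: r["code"] in {"J45", "E11", ""},  # code drawn from a known set
}
weights = {"completeness": 2.0, "consistency": 1.0}

scores, overall = quality_scorecard(claims, checks, weights)
print(scores)   # completeness 0.75, consistency 1.0
print(overall)  # weighted roll-up, roughly 0.83
```

A real scorecard would track each dimension over time per data steward and compare it against the confidence intervals the governance council has set.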
Formally accommodating Big Data roles in the
governance organization
Data governance must be institutionalized through a formal
organizational structure. Given the disruptive potential of
Big Data, specialized people need to be hired, trained,
nurtured, and integrated into the organization. This is one
of the "softer" (i.e., people-related) aspects of governance, but
organizational development and change management are
key here.
In order to effectively accommodate Big Data in the
enterprise, the organization will need to extend the
information governance team and its charter to incorporate
tenets of Big Data. If it does not already exist, a data
stewardship program should be established in the
organization. Otherwise, it should be extended for Big Data,
ensuring that new roles (such as Big Data analyst and
data scientist) are defined and recognized as key players.
This means the data stewards must be familiar with new
data types and help with the data-validation processes.
Data stewards should be trained to handle data profiling
and anomalous pattern detection with Big Data.
Detecting outliers and understanding the nuances between
false positives and false negatives helps them in proper
data exploration and discovery. Tools and technologies
that help with profiling and visualizing Big Data
must be made available to the team. This may mean that
the existing data governance team must be further
trained and enabled in Big Data technologies and
processes.
People, processes, and technologies must be aligned as
in a traditional data governance program. Further, those
individuals with data governance organizational roles
(including chief data officers, data owners, data governance
council members, data stewards, and data analysts) need to
be savvy with respect to Big Data, as policy making and
the oversight of Big Data are critical.
While the different governance roles need to interact
with one another, without structure and organization,
the workflow can stall. Best practices indicate the need to
establish clear, structured, and well-directed
communications, escalation paths, and workflow-monitoring
patterns that would aid in handling policies, procedures,
and methods concerning Big Data in the organization,
for quick resolution in case any challenges arise.
Automating business-rules monitoring for detecting
governance policy exceptions
Determination of policies and standards for governance is
typically done in a collaborative manner with IT and business
teams coming together to agree on a framework of set
policies, processes, and a RACI (responsible, accountable,
consulted, informed) matrix to establish role clarity.
Adapting this framework from a traditional governance
program to deal with Big Data requires automation
and technology. With the pace and volume of data involved,
traditional data monitoring and manual interventions would
not suffice as data governance tactics for Big Data.
Closed-loop feedback systems need to be established for
high-speed data streams filtered through automated business
rules and policies. Consider an example from the financial
services industry where companies seek ways to gain
competitive advantage in trading stocks, commodities,
precious metals, and other financial instruments.
High-frequency traders using algorithmic trading will adopt
approaches that shave milliseconds from trade transaction
time, even if these approaches require digging trenches
to lay dedicated fiber-optic networks [42, 43] or even laying
special transatlantic underwater cables and moving closer to
the trading action at the exchange. However, such systems need
to automate the monitoring processes and provide a
feedback loop in order to provide a "circuit breaker"
effect and prevent runaway machine-to-machine (M2M)
interactions, lest they cause disruption to the operations
of the entire trading exchange.
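The closed-loop, circuit-breaker style of automated rule monitoring described above might look like the following sketch. The rule, threshold, and window size are assumptions made for illustration.

```python
from collections import deque

class CircuitBreaker:
    """Halt processing when too many policy exceptions occur in a sliding window.

    Illustrative sketch: each event is screened by an automated business rule;
    if the exception rate over the last `window` events crosses `threshold`,
    the breaker trips, the feedback loop that stops runaway machine-to-machine
    interactions before they disrupt the wider system.
    """
    def __init__(self, window=100, threshold=0.2):
        self.recent = deque(maxlen=window)
        self.threshold = threshold
        self.tripped = False

    def observe(self, event, rule):
        self.recent.append(not rule(event))  # record whether the rule was violated
        if sum(self.recent) / len(self.recent) > self.threshold:
            self.tripped = True
        return self.tripped

# Assumed rule: an order is within policy if its notional value stays under a cap.
rule = lambda order: order["value"] <= 1_000_000

breaker = CircuitBreaker(window=10, threshold=0.3)
for value in [10, 20, 2_000_000, 3_000_000, 4_000_000, 30]:
    breaker.observe({"value": value}, rule)
print(breaker.tripped)  # True: the violation rate in the window exceeded 30%
```

Once tripped, a production system would suspend the affected flow and escalate through the governance workflow rather than silently continue.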
Provisioning for security and privacy of Big Data
by design
Security principles, such as separation of duties, separation
of concerns, the principle of least privilege, and defense in
depth, apply to all types of data. Extracting value from
Big Data requires a responsible approach to security and
privacy. There is sensitive data that needs to be protected,
retention policies that need to be determined, and personal
data that needs to be masked before transmittal. For example,
the Big Data governance program will need to define
policies surrounding new digital data, such as policies
involving RFID (radio-frequency identification) tag
deactivation or the sharing of geo-location codes, because these
may be fraught with security and privacy issues depending
on the country of data origin.
For companies that deal with consumer information,
extra precautions need to be considered in order to maintain
trust of subscribers. The governance program needs
to be sensitive to invasion of privacy. The need for an
ethics-before-profits policy around privacy dictates not
renting or sharing collected data that could identify
individuals. Governance practices and policies, such as
those involving data activity monitoring covering all read,
write, or access functions on sensitive and private
data, should be enabled through the use of suitable processes
and technologies for automatic logging and alerting.
Data masking and obfuscation of sensitive portions of
structured and unstructured data, such as in documents and
electronic files, is another security practice that should be
part of the governance program.
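The data-activity-monitoring practice above, logging and alerting on every read or write of sensitive data, can be sketched as a thin wrapper around a data store. The store API, field names, and tagging scheme are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data-activity")

class MonitoredStore:
    """Wrap a plain dict so every access to sensitive keys is recorded.

    Hypothetical sketch of data activity monitoring: reads and writes of
    fields tagged as sensitive are appended to an audit trail and logged
    with the acting user, giving automated alerting rules something to run over.
    """
    def __init__(self, data, sensitive):
        self._data = data
        self._sensitive = set(sensitive)
        self.audit_trail = []

    def read(self, user, key):
        if key in self._sensitive:
            self.audit_trail.append(("read", user, key))
            log.info("sensitive read: user=%s key=%s", user, key)
        return self._data[key]

    def write(self, user, key, value):
        if key in self._sensitive:
            self.audit_trail.append(("write", user, key))
            log.info("sensitive write: user=%s key=%s", user, key)
        self._data[key] = value

store = MonitoredStore({"name": "Jane Doe", "plan": "gold"}, sensitive={"name"})
store.read("analyst7", "plan")   # not sensitive: no audit entry
store.read("analyst7", "name")   # sensitive: logged and recorded
print(store.audit_trail)         # [('read', 'analyst7', 'name')]
```

In practice this interception would live in the database or access layer rather than application code, so that no path to sensitive data bypasses the audit trail.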
Increasingly, an individual’s personal data in any company
transaction system is no longer retained in isolation;
instead, it is aggregated so that prospective customers can be
shown targeted advertisements or directed to customized
services. Advertising is just one way data can be collected,
aggregated, and monetized. Organizations can assess credit
worthiness, evaluate employees, or even take the step toward
linking with government or other legal data. The security
risks arise because users must relinquish control over their
data. Because Big Data can be focused on the aggregation of
customer data across an organization, security-protection
efforts should focus on the elimination of the "silos of data"
and examine ways to control and enhance the data-protection
processes that exist.
As mentioned, realizing that not all data is equally
useful and that it decays at varying paces, information
life-cycle governance practices need to be extended
to consider TVoD. Governance policies are needed for
selective storage and defensible disposal. Regulatory
requirements may necessitate keeping certain types of data
for seven years (e.g., financial information) or up to ten years
(e.g., pharmaceutical clinical trial data), but retaining data
beyond that is not only an expensive proposition but also a
potential legal exposure. Organizations need to be prepared
to purge nonuseful data after a certain time, especially when
data trends have been determined and aggregate data stored
and processed.
Figure 2
A framework to establish principles of Big Data governance. The arrows indicate a communication pattern and a workflow that helps in the practical
realization of these principles.
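The selective-retention and defensible-disposal policy described above can be made concrete with a small check. The seven-year financial and ten-year clinical-trial periods mirror the examples in the text; the category names, the remaining period, and the function shape are assumptions.

```python
from datetime import date

# Retention periods by data category, in years; the financial and clinical
# figures follow the examples in the text, the rest is an assumed placeholder.
RETENTION_YEARS = {
    "financial": 7,
    "clinical_trial": 10,
    "web_clickstream": 1,
}

def disposal_due(category: str, created: date, today: date) -> bool:
    """Return True when data has outlived its regulatory retention period
    and becomes a candidate for defensible disposal."""
    age_years = (today - created).days / 365.25
    return age_years > RETENTION_YEARS[category]

print(disposal_due("financial", date(2005, 1, 1), date(2013, 5, 1)))       # True: over 7 years old
print(disposal_due("clinical_trial", date(2005, 1, 1), date(2013, 5, 1)))  # False: under 10 years
```

"Defensible" disposal also means recording why and when each purge happened, so the check above would feed a logged disposal workflow rather than delete data directly.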
Establishing, monitoring, and measuring metrics
and communicating
The success or failure of any program depends on how well it
meets its goals over the established timeline. Without
establishing suitable metrics and key performance indicators,
it will be difficult to ascertain whether the Big Data program
is progressing as planned. Further, regular measurement,
monitoring, and sharing of these metrics with the
stakeholders is a sound governance practice that can be easily
adopted from the traditional data governance programs and
provides a vital feedback loop to fine-tune strategy with
changing business priorities.
Thus, many of the information governance strategies,
programs, and practices developed for the "structured world"
are valid and transferable to the Big Data world. However,
these practices must be extended, expanded, and adapted
for Big Data as discussed above. Figure 2 summarizes the
principles of Big Data governance and provides a framework
for putting them into practice. The arrows show workflow
and communication patterns. The tan boxes at the top of the
figure (Strategy, Process, Policy, People, and
Technology and Automation) refer to the key domains
involved in Big Data governance. Note that the most often
used definition of governance involves an orchestration of
people, process, and technology. The blue boxes in Figure 2
refer to the primary principles of Big Data governance,
as explained herein.
Conclusion
Our world is becoming increasingly instrumented and
interconnected, with a proliferation of information. This
provides a clear opportunity to transform the world into a
more "intelligent" planet [44]. According to the 2012 IBM CEO
study, the ability of an organization to derive value from
data is strongly correlated with performance, where
outperforming organizations are twice as good as
underperformers at accessing and drawing insights from data
[45]. With the advent of Big Data, organizations now have
the opportunity to realize even more value from their
information assets. Early adopters and nearly two-thirds of
respondents in the 2012 Big Data@Work study [10] confirm
that the use of information (including Big Data) and analytics
is creating a competitive advantage for their organizations.
Clearly, there is a great emphasis on data and deriving its
value through effective governance.
Customer-centric analytics are top priority for Big Data
initiatives among many early adopters. As a result of
advanced techniques (such as machine learning and natural
language processing), sophisticated hardware, and "smart"
software, computers have become adept at processing and
determining trends in consumer behavior that were hitherto
cost prohibitive to analyze and impossible for humans to
decipher. Every click on a website and each phone call, social
media post, tweet, blog entry, or credit-card purchase
creates a record that can be stored and has the potential to be
analyzed in a manner that will create tangible value for the
business. Enhancing the inherent value of such data in
the enterprise is a key Big Data governance objective,
as discussed throughout this paper.
As observed through numerous examples presented in
this paper, Big Data is revolutionizing our world at an
unprecedented pace despite the nascent and rapidly evolving
technologies involved [46]. Several use cases have emerged
with a potential to cause significant disruption in the
conventional data management thinking. Extending the
information management framework to address data types
unique to Big Data, and incorporating modified practices
suitable for integrating Big Data, allows us to govern
Big Data as an integral part of the enterprise data management
function as opposed to dealing with it in a silo. However,
harnessing the full power of data in an enterprise setting
requires removal of many obstacles that still remain in the
path [47]. For instance, a Gartner analyst recently observed
that "although information arguably meets accounting
standards criteria for an asset, and more specifically, further
litmus tests for an intangible asset, it is not found on public
companies’ balance sheets" [48].
This provokes a series of thoughts. For example,
if information is really considered a corporate asset, then
why does it not appear in any company balance sheets and
regulatory filings, like brand value does? In addition, what is
the value of information owned by corporations and how
is it measured? Would an incident involving an information
security breach cause a company to book losses on the
quarterly earnings statement? When these questions are
answered, it will bring focus and attention to the
quantification of the intrinsic value of data and further
underscore the need for governing all types of data.
As Big Data becomes ubiquitous in the coming years,
the need for governing and managing it effectively in the
enterprise will only become stronger. The principles and
practices involved in Big Data governance that have been
presented in this paper serve as a starting point and will
further evolve and ultimately become pervasive to the extent
of being clearly incorporated into normal business and
IT functions. Until then, organizations must learn to deal
with the volume, velocity, variety, and veracity of Big Data
through the application of strategy, processes, automation
technology, and people implementing those processes.
*Trademark, service mark, or registered trademark of International
Business Machines Corporation in the United States, other countries,
or both.
**Trademark, service mark, or registered trademark of Apache
Software Foundation, Facebook, Inc., LinkedIn Corporation, Twitter,
Inc., MySpace, Inc., Google, Inc., or ShotSpotter, Inc., in the
United States, other countries, or both.
References
1. B. Johnson, What the web is saying about the god particle.
[Online]. Available: http://gigaom.com/2012/07/04/what-the-web-
is-saying-about-the-god-particle/
2. M. Adrian, "Big Data," Teradata Magazine. [Online]. Available:
http://www.teradatamagazine.com/v11n01/Features/Big-Data/
3. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs,
C. Roxburgh, and A. H. Byers, "Big data: The next frontier for
innovation, competition and productivity," McKinsey & Comp.,
San Francisco, CA. [Online]. Available: http://www.mckinsey.
com/Insights/MGI/Research/Technology_and_Innovation/
Big_data_The_next_frontier_for_innovation
4. D. Laney, "Application delivery strategies," META Group Inc.,
Stamford, CT. [Online]. Available: http://blogs.gartner.com/
doug-laney/files/2012/01/ad949-3D-Data-Management-
Controlling-Data-Volume-Velocity-and-Variety.pdf
5. IDC, IDC Go-to-Market Services: The 2011 Digital Universe
Study: Extracting Value from Chaos. [Online]. Available: http://
www.emc.com/collateral/demos/microsites/emc-digital-universe-
2011/index.htm
6. M. Scherer, Inside the Secret World of Quants and Data Crunchers
Who Helped Obama Win. [Online]. Available: http://swampland.
time.com/2012/11/07/inside-the-secret-world-of-quants-and-data-
crunchers-who-helped-obama-win/
7. N. Bressan, C. McGregor, and A. James, "Trends and
opportunities for integrated real time neonatal clinical decision
support," in Proc. IEEE-EMBS Int. Conf. BHI, Jan. 2012,
pp. 687–690.
8. D. Boyd and K. Crawford, "Six provocations for big data: A
decade in internet time: Symposium on the dynamics of the internet
and society," Social Sci. Res. Netw., New York. [Online].
Available: http://ssrn.com/abstract=1926431 or http://dx.doi.org/
10.2139/ssrn.1926431
9. D. Borthakur, "The Hadoop distributed file system: Architecture
and design," Apache Softw. Found., Forest Hill, MD. [Online].
Available: http://mit.edu/~mriap/hadoop/hadoop-0.13.1/docs/
hdfs_design.pdf
10. M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, and
P. Tufano, Analytics: The real-world use of big data, IBV and
Saïd Business School, University of Oxford. [Online].
Available: http://public.dhe.ibm.com/common/ssi/ecm/en/
gbe03519usen/GBE03519USEN.PDF
11. B. Franks, Taming the Big Data Tidal Wave: Finding
Opportunities in Huge Data Streams with Advanced Analytics.
Hoboken, NJ: Wiley, 2012.
12. T. Fisher, The Data Asset: How Smart Companies Govern Their
Data for Business Success. Hoboken, NJ: Wiley, 2010.
13. Big Data: New Insights Transform Industries, IBM Corp., New
York, Jun. 2012, a whitepaper, IBM. [Online]. Available: https://
www14.software.ibm.com/webapp/iwm/web/signup.do?source=
sw-infomgt&S_PKG=ov8620&S_TACT=109HF63W&S_CMP=
is_bdwp2_bdhub.
14. IBM Global CMO Study, IBM Corp., New York, 2011. [Online].
Available: http://www-935.ibm.com/services/us/cmo/
cmostudy2011/cmo-registration.html.
15. G. Kirkpatrick, "The corporate governance lessons from the
financial crisis," Org. Econ. Co-oper. Develop., Cedex, France.
[Online]. Available: http://www.oecd.org/dataoecd/32/1/
42229620.pdf
16. J. Gee, M. Button, and G. Brooks, "Financial cost of healthcare
fraud," Eur. Healthcare Fraud Corrupt. Netw., Brussels,
Belgium. [Online]. Available: http://www.ehfcn.org/media/
documents/The-Financial-Cost-of-Healthcare-FraudVFinal-
%282%29.pdf
17. S. Rogers. (2011, Sep.). Big data is scaling BI and analytics.
Inf. Manage. [Online]. 21(5), p. 14. Available: http://www.
information-management.com/issues/21_5/big-data-is-scaling-
bi-and-analytics-10021093-1.html
18. S. Soares, Smart Meters and Big Data. [Online]. Available: http://
www.smartgridnews.com/artman/publish/Technologies_MDM/
Smart-meters-show-Big-Data-governance-best-practices-4433.
html
19. G. Raine, Power grid failure leaves 2 million in the dark. [Online].
Available: http://www.sfgate.com/news/article/Power-grid-failure-
leaves-2-million-in-the-dark-3134893.php
20. S. Carter, Get Bold: Using Social Media to Create a New Type of
Social Business. Upper Saddle River, NJ: IBM Press, 2012.
21. R. Scott, Social Retail. [Online]. Available: http://
searchenginewatch.com/article/2192745/Social-Retail-Finding-Engaging-
Cultivating-Todays-Connected-Consumer
22. Talking to Strangers: Millennials Trust People over Brands,
Bazaarvoice, Austin, TX, 2012. [Online]. Available:
http://www.bazaarvoice.com/files/whitepapers/
BV_whitepaper_millenials.pdf.
23. C. Duhigg, "How Companies Learn Your Secrets," The New York
Times Co., New York. [Online]. Available: http://www.nytimes.
com/2012/02/19/magazine/shopping-habits.html?=rss&
pagewanted=all
24. P. Nist, The Big Data Challenge: Social Data Meets Corporate
Data. [Online]. Available: http://communities.intel.com/
community/openportit/server/blog/2012/03/29/the-big-data-
challenge-social-data-meets-corporate-data
25. M. Palmer, Data is the new oil. [Online]. Available: http://ana.
blogs.com/maestros/2006/11/data_is_the_new.html
26. A. Cavoukian and J. Jonas, "Privacy by design in the age of big
data," Privacy by Design, Toronto, ON, Canada. [Online].
Available: http://privacybydesign.ca/content/uploads/2012/06/
pbd-big_data.pdf
27. Data.gov Concept of Operations, OMB, New York. [Online].
Available: http://www.data.gov/sites/default/files/attachments/
data_gov_conops_v1.0.pdf.
28. Streetline. [Online]. Available: http://www.wired.com/autopia/
2010/11/city-parking-smartens-up-with-streetline/
29. J. Hagel and J. F. Rayport, "Coming battle for customer
information," Harvard Bus. Rev., Boston, MA. [Online].
Available: http://cb.hbsp.harvard.edu/cb/web/product_detail.seam;
jsessionid=A641511F8D6094B1AF8431E85981183D?E=
60482&R=97104-PDF-ENG&conversationId=230602
30. S. Swoyer, Big Data Integration. [Online]. Available: http://tdwi.
org/Research/2012/06/TDWI-EBook-Big-Data-Integration.aspx
31. P. Malik, "Big data governance: The next frontier for
the information economy," IAIDQ, Baltimore, MD. [Online].
Available: http://iaidq.org/webinars/2012-04-24.shtml
32. M. Godinez, E. Hechler, K. Koenig, S. Lockwood, M. Oberhofer,
and M. Schroeck, The Art of Enterprise Information Architecture:
A Systems-Based Approach for Unlocking Business Insight.
Boston, MA: IBM Press, 2010.
33. S. Soares, T. Deutsch, S. Hanna, and P. Malik, "Big data
governance: A framework to assess maturity," IBM Corp., New
York. [Online]. Available: http://ibmdatamag.com/2012/04/
big-data-governance-a-framework-to-assess-maturity/
34. S. Soares, "A framework that focuses on the 'Data' in big data
governance," IBM Corp., New York. [Online]. Available: http://
ibmdatamag.com/2012/06/a-framework-that-focuses-on-the-data-
in-big-data-governance/
35. S. Soares, Selling Information Governance to the Business: Best
Practices by Industry and Job Function. Ketchum, ID: MC
Press, 2011.
36. IBM GTO 2012 Study: Managing Uncertain Data at Scale, IBM
Corp., New York, 2012. [Online]. Available: http://www.zurich.
ibm.com/pdf/isl/infoportal/GTO_2012_Booklet.pdf.
37. P. Malik, "Information integrity for CRM in a virtual world," in
Encyclopedia of Virtual Communities and Technologies.
Hershey, PA: Idea Group, 2006, pp. 266–272.
38. W. Eckerson, "Data quality and the bottom line," The Data
Warehousing Inst., Chatsworth, CA, TDWI Rep. Ser., TDWI,
101 Commun., 2002. [Online]. Available: http://download.
101com.com/pub/tdwi/Files/DQReport.pdf
1 : 12 P. MALIK IBM J. RES. & DEV. VOL. 57 NO. 3/4 PAPER 1 MAY/JULY 2013
39. L. English and P. Malik, Plain English about Information Quality.
[Online]. Available: http://www.information-management.com/
issues/20070401/1079288-1.html
40. E. Pierce, C. L. Yonke, P. Malik, and C. K. Nagaraj, "The state of
information and data quality industry report 2012," IAIDQ,
Baltimore, MD. [Online]. Available: http://iaidq.org/publications/
pierce-2012-11.shtml
41. S. Ruschka-Taylor, C. Evask, P. Malik, and S. Minsinger,
"Transforming information integrity," IBM Corp., New York.
[Online]. Available: http://www-935.ibm.com/services/us/gbs/bus/
pdf/g510-3831-transforming-enterprise-information-integrity.pdf
42. D. I. Amin and A. F. Bach, "NYSE Euronext and 100G: The drive
to zero latency," Lightwave, Nashua, NH. [Online]. Available:
http://www.lightwaveonline.com/articles/print/volume-27/issue-1/
applications/case-by-case/nyse-euronext-and-100g-the-drive-to-
zero-latency.html
43. A. Troianovski, Networks Built on Milliseconds. [Online].
Available: http://online.wsj.com/article/
SB10001424052702304065704577426500918047624.html
44. C. Harrison, B. Eckman, R. Hamilton, P. Hartswick,
J. Kalagnanam, J. Paraszczak, and P. Williams, "Foundations for
Smarter Cities," IBM J. Res. & Dev., vol. 54, no. 4, pp. 1–16,
Jul./Aug. 2010, paper 1.
45. IBM Global CEO Study 2012, IBM Corp., New York, 2012.
[Online]. Available: ibm.com/services/us/en/c-suite/ceostudy2012.
46. P. Zikopoulos, C. Eaton, T. Deutsch, D. Deroos, and G. Lapis,
Understanding Big Data: Analytics for Enterprise Class Hadoop
and Streaming Data. New York: McGraw-Hill, 2012.
47. C. Stakutis and J. Webster, Inescapable Data: Harnessing the
Power of Convergence. Upper Saddle River, NJ: IBM Press,
2005.
48. D. Laney, "Infonomics: The practice of information economics,"
in Forbes. [Online]. Available: http://www.forbes.com/sites/
gartnergroup/2012/05/22/infonomics-the-practice-of-information-
economics/
Received August 1, 2012; accepted for publication
August 28, 2012
Piyush Malik IBM Global Business Services, San Jose, CA 95131
USA (Piyush.malik@us.ibm.com). Mr. Malik leads the Worldwide
Big Data Services Center of Excellence within the IBM Global
Business Analytics and Optimization (BAO) consulting practice.
Specializing in information management strategy and architecture,
information quality and governance, business intelligence, master data,
and advanced analytics, Mr. Malik has more than 24 years of
international consulting, practice building, sales, and delivery
experience with Fortune 500 clients across multiple industries. He has
served as Founding Director of the IBM Global BAO Center of
Competency and previously led the Information Integrity consulting
practice at PricewaterhouseCoopers Management Consulting Services
before it was acquired by IBM in 2002. Mr. Malik has also been serving
on the Board of Directors of International Association for Information
and Data Quality (IAIDQ) since 2008. He received an undergraduate
degree in electronics and communications engineering in 1989 and
a master’s degree in management of technology from Indian Institute
of Technology, Delhi, in 1995. He is a frequent speaker at industry
conferences and has authored several articles and papers.