VOLUME 57, NUMBER 3/4, MAY/JUL. 2013
Journal of Research
and Development
Massive-Scale Analytics
Including IBM Systems Journal
“Big Data” refers to large data sets that are beyond the capability of traditional software tools
to quickly manage, process, and analyze. The development of techniques for gaining insight
from such information provides potential benefits in such arenas as business, science, and
public policy. This special issue of the IBM Journal emphasizes applications, analytics, software,
and hardware technologies that form the foundational building blocks for massive-scale
analytics and the processing of Big Data.
Preface: Massive-scale analytics
A. Soffer, Guest Editor
1 Governing Big Data: Principles and practices
P. Malik
2 Trends and outlook for the massive-scale analytics stack
A. N. Ghoting, J. A. Gunnels, P. Kambadur, E. P. Pednault, and M. S. Squillante
3 Understanding system design for Big Data workloads
H. P. Hofstee, G. C. Chen, F. H. Gebara, K. Hall, J. Herring, D. Jamsek, J. Li, Y. Li, J. W. Shi, and
P. W. Y. Wong
4 A platform for eXtreme Analytics
A. Balmin, K. Beyer, V. Ercegovac, J. McPherson, F. Özcan, H. Pirahesh, E. Shekita, Y. Sismanis,
S. Tata, and Y. Tian
5 GPFS-SNC: An enterprise cluster file system for Big Data
R. Jain, P. Sarkar, and D. Subhraveti
6 Toward a scale-out data-management middleware for low-latency enterprise computing
L. L. Fong, Y. Gao, X. R. Guerin, Y. G. Liu, T. Salo, S. R. Seelam, W. Tan, and S. Tata
© Copyright 2013 by International Business Machines Corporation. See individual articles for copying information. Post-1994 articles that
carry a code at the bottom of the first page may be copied, provided the per-copy fee indicated in the code is paid through the Copyright
Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, U.S.A. Pages containing the table of contents may be freely copied and
distributed in any form. ISSN 0018-8646. Printed in U.S.A.
7 IBM Streams Processing Language: Analyzing Big Data in motion
M. Hirzel, H. Andrade, B. Gedik, G. Jacques-Silva, R. Khandekar, V. Kumar, M. Mendell,
H. Nasgaard, S. Schneider, R. Soulé, and K.-L. Wu
8 Real-time analysis and management of big time-series data
A. Biem, H. Feng, A. V. Riabov, and D. S. Turaga
9 Novel document detection for massive data streams using distributed dictionary learning
S. P. Kasiviswanathan, G. Cong, P. Melville, and R. D. Lawrence
10 Big Data text-oriented benchmark creation for Hadoop
A. Gattiker, F. H. Gebara, H. P. Hofstee, J. D. Hayes, and A. Hylick
11 Platform and applications for massive-scale streaming network analytics
P. Zerfos, M. Srivatsa, H. Yu, D. Dennerline, H. Franke, and D. Agrawal
12 Scalable community detection in massive social networks using MapReduce
J. Shi, W. Xue, W. Wang, Y. Zhang, B. Yang, and J. Li
13 Visual analysis of large-scale network anomalies
Q. Liao, L. Shi, and C. Wang
14 A statistical approach to mining customers’ conversational data from social media
D. Konopnicki, M. Shmueli-Scheuer, D. Cohen, B. Sznajder, J. Herzig, A. Raviv, N. Zwerling,
H. Roitman, and Y. Mass
15 A real-time stream storage and analysis platform for underwater acoustic monitoring
J. P. Hayes, H. R. Kolar, A. Akhriev, M. G. Barry, M. E. Purcell, and E. P. McKeown
Governing Big Data:
Principles and practices
P. Malik
As data-intensive decision making is being increasingly adopted
by businesses, governments, and other agencies around the world,
most organizations encountering a very large amount and variety
of data are still contemplating and assessing their readiness
to embrace “Big Data.” While these organizations devise various ways
to deal with the challenges such data brings, the impact and
importance of Big Data to information quality and governance
programs should not be underestimated. Drawing upon
implementation experiences of early adopters of Big Data
technologies across multiple industries, this paper explores the issues
and challenges involved in the management of Big Data, highlighting
the principles and best practices for effective Big Data governance.
Introduction
Big Data, a term commonly associated with massive datasets
growing at a rapid pace, also represents complexity and
variety in the types of data that are being collected and
analyzed by organizations. Whether the data is used to
explain global climatic change patterns and their impact,
predict customer behavior, forecast purchasing intent,
retain customers, target political donors, influence voters
for political campaigns, or even help understand the
characteristics of our very early universe via the discovery
of subatomic particles such as the Higgs boson particle [1],
Big Data clearly has a significant role to play.
Many different definitions and theories on what constitutes
Big Data exist today. The most often cited definition [2]
captures its essence: “Big Data exceeds the reach of
commonly used hardware environments and software tools to
capture, manage, and process it within a tolerable elapsed
time for its user population.” A seminal report [3] from
McKinsey Global Institute in 2011 helped promote benefits
of using Big Data in the corporate environment, but it was
more than ten years earlier that an analyst introduced the
idea in a research note [4], discussing ways and means of
controlling volume (e.g., massive size of datasets), velocity
(e.g., amount of data transferred per unit of time), and variety
of data (e.g., images, audio, web information, or computer
logs, as well as various structured database records), which
is the basis for the popularity of the letter “V” used in many
Big Data marketing communications.
Big Data has significantly altered the data management
considerations for the information technology (IT)
professional. According to International Data Corporation
(IDC), approximately 90% of the digital data we encounter
today did not exist two years ago and is predicted to grow
from 2.7 zettabytes (ZB) in 2012 to 35 ZB by the year 2020
[5]. However, Big Data is not just big in terms of size;
it is transmitted at high rates and usually implies
heterogeneity of data types, representation, and semantic
interpretation. For example, it can come from online
customer behavior via social media, from click streams
(e.g., records of the paths taken by users through a company
website), call detail and data-usage records (xDRs;
also known as event detail records) from telecommunications
companies, and data files or transaction notes or audio
files captured by contact centers (e.g., facilities that manage
client contacts through telephone calls). Sensors and other
machines automatically and continuously generate data
and event streams. Biometric data (such as fingerprints
or retina scans) as well as medical images and health care
data add variety, while traditional mission-critical
applications continue to produce terabytes of transaction
data.
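The IDC figures quoted above (2.7 ZB in 2012, 35 ZB by 2020) imply a compound annual growth rate that is easy to check. The following is a minimal Python sketch of that arithmetic; the function name `implied_annual_growth` is illustrative and not taken from the cited report, and the calculation assumes a constant yearly growth rate between the two endpoints.

```python
def implied_annual_growth(start_zb: float, end_zb: float, years: int) -> float:
    """Return the constant yearly growth rate that turns start_zb into end_zb."""
    if start_zb <= 0 or end_zb <= 0 or years <= 0:
        raise ValueError("sizes and year count must be positive")
    return (end_zb / start_zb) ** (1.0 / years) - 1.0


# Values cited in the text: 2.7 ZB in 2012 growing to 35 ZB by 2020.
rate = implied_annual_growth(2.7, 35.0, 2020 - 2012)
print(f"Implied growth: {rate:.1%} per year")
```

Under the constant-rate assumption, the projection corresponds to data volume growing by roughly 38 percent each year.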
Likewise, the world faces a data deluge because of the
widespread proliferation of the Internet, combined with the
ubiquity of advanced computing and mobile phone
technologies. Organizations will need to handle it
appropriately and responsibly not only for competitive
advantage, but for survival. To add to the challenge of
handling structured data in rows and columns in traditional
data stores, organizations must also contend with
semi-structured data from web logs and machine-to-machine
(M2M) logs, as well as unstructured data in the form of texts,
documents, PDF (portable document format) files, images,
videos, and audio streams.
© Copyright 2013 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without
alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed
royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
P. MALIK 1 : 1 IBM J. RES. & DEV. VOL. 57 NO. 3/4 PAPER 1 MAY/JULY 2013
0018-8646/13/$5.00 © 2013 IBM
Digital Object Identifier: 10.1147/JRD.2013.2241359
Big Data analytics enables organizations to make
“smarter” decisions and execute them when it is most
beneficial to the business, its customers, and its
partners, such as in real time. For example, organizations
deploy predictive models on event streams to identify
real-time trends. In this way, they are improving the speed
and intelligence of their response to customer behavior,
suspicious activity, and aberrations in processes. The
following cases illustrate diverse applications of Big Data
for smarter decision making.
In the first case, the 2012 U.S. elections provided a
glimpse of how effectively Democratic Party staff used
voter contact data combined with social media analytics
to precisely adjust campaign messaging. The staff
created simulation models and mined data to decide exactly
when and on which media channel to place
advertisements to maximize campaign donations, and
effectively influenced the outcome of the elections in the
party’s favor [6].
In another example, researchers at the University of
Ontario Institute of Technology (UOIT) in Canada use
streams of medical sensor data to help doctors detect and
analyze subtle changes in the condition of critically ill
premature babies to predict the onset of life-threatening
neonatal infections as much as 24 hours in advance,
so that precautionary measures can be taken and lives saved [7].
Finally, consider analytics applications in video
surveillance. The image data file of the video is large and
uninteresting, but the associated data becomes more useful
after adding various forms of metadata, indices of frames
in the video, who and what is contained therein, transcripts of
dialogs from the video, and comments from viewers. In short,
all the information that makes the video searchable and
relevant adds both value and volume. A commercial
application of such systems can be found in ensuring the
security of buildings and physical infrastructure, and these
systems are being incorporated in many home video
surveillance and security systems as well. Adding audio
and video analytics for this streaming data in real time,
TerraEchos offers value-added solutions across industry
segments for monitoring and protecting sensitive
infrastructure (e.g., infrastructures involving government,
oil, gas, and power grids) as well as for cybersecurity [8].
As mentioned, Big Data is a term that refers to the
management, processing, and analysis of large amounts
of data. Big Data includes the data in storage systems
(e.g., Hadoop** Distributed File System [HDFS], under open
source Apache Hadoop and MapReduce frameworks [9])
and databases, as well as data in the process of being
transmitted (e.g., data in stream-processing systems).
As the corporate world explores and recognizes the
importance of using Big Data analytics, a few organizations
have started realizing the potential of Big Data, though
many organizations are still struggling to manage their
transactional and operational data and are not yet ready
[10, 11]. For those organizations that aspire to be “world
class” and beat their competition, it is only natural that
they will need to strengthen their information management
programs to secure, manage, and expand governance
initiatives to address Big Data.
The subsequent sections of this paper outline the need for
governing Big Data and the challenges involved. Further,
they will highlight the principles and practices for governing
Big Data as demonstrated by early adopters of Big Data
technologies.
Why govern Big Data?
Governance of information involves the orchestration of
people, processes, and technology to turn data into a strategic
asset for the organization. Data governance embodies the
exercise of decision making and authority for data-related
matters spanning policies, rules, rights, organizational
structures, and accountabilities of people and information
systems as they perform information-related processes to
accomplish business objectives [12]. Depending on the
maturity of the information management function, varying
levels of adoption dictate the sophistication of the
information quality and governance discipline within the
enterprise. While Big Data introduces more and new types of
information, the reasons to govern data are well established.
We govern data to manage risk and, more importantly,
to extract value from data.
Big Data for operations and cost management
Improving operational efficiencies through reducing direct
and indirect costs is of interest to business managers across
industries. As mentioned, in any industry, Big Data plays
a role through continuous monitoring and analysis of data
emitted by sophisticated machines, embedded environmental
sensors, and operational metrics. This M2M sensor data,
which traditionally was often discarded because of
unmanageable volumes or speed at which it gets generated,
has potential to improve operating efficiency of equipment
as well as overall operational safety. For example, in oil
drilling rigs, proactively acting on early warning signs of
mechanical or electronic equipment component failures
can prevent deadly accidents and oil spills. Analyzing this
type of M2M data is valuable to predict the health of
equipment components and for timely preventative
maintenance, thereby avoiding costly industrial downtime.
Thus, adequately governing such Big Data contributes to
efficiency of operations.
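The idea of acting on early warning signs in M2M sensor data can be illustrated with a small sketch. The Python code below flags readings that deviate sharply from a rolling window of recent values. It is only an illustration under stated assumptions: the window size, the three-standard-deviation threshold, the function name `early_warnings`, and the sample vibration values are all hypothetical, and production predictive-maintenance models are typically far more elaborate.

```python
from collections import deque
from statistics import mean, stdev


def early_warnings(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far outside the recent rolling window."""
    recent = deque(maxlen=window)  # holds only the last `window` readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            # Flag the reading if it lies more than `threshold` standard
            # deviations away from the mean of the preceding window.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
        recent.append(value)


# Hypothetical vibration readings from one equipment component.
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 1.02, 4.8, 1.0]
alerts = list(early_warnings(vibration))
print(alerts)
```

Because the sketch processes one reading at a time and keeps only a fixed-size window, it does not need to store the full stream, which is the property that makes this style of analysis feasible for data that was previously discarded because of its volume or speed.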
Big Data for value creation
Social media companies such as Twitter, Facebook, and
LinkedIn rely on users’ willingness to provide their personal
stories, status updates, and posts to create a valuable corpus
of rich content. On each social media platform, aggregated
data is monetized and ultimately creates revenue streams.
Thus, when Facebook valued the company at $100 billion
at the launch of their initial public offering in early 2012,
though many had expressed doubts about the valuation,
the data scientists and social media software experts took
pride in the coming of age of Big Data.
Tremendous possibilities emerge for transforming not only
businesses, but also entire industries with data. Leading
organizations in financial services, health care, retail,
telecommunications, media, and other industries are already
benefitting from new opportunities generated by Big Data
analysis. For example, organizations are actively analyzing
the content from thousands of social media posts every day
to uncover new customer insights that enable them to
profit from emerging trends and deliver a richer customer
experience. Some are improving the consistency of
information across sales channels to provide a seamless
experience for customers as they move among web,
mobile-app, in-person, phone, and other interaction points.
Many others are using real-time analytics to identify new
cross-sell and upsell opportunities while performing
predictions and retaining customers. Still others are working
on enhancing the efficiency of operations and increasing
service availability by analyzing tens of billions of system
event records. In many cases, companies are scrutinizing
identities, relationships, and previous customer interactions
and are using predictive analytics to anticipate fraud and
avoid costly losses [13].
In an IBM study [14] of more than 1,700 chief marketing
officers (CMOs), the authors highlight the transforming
of the CMO’s agenda as a result of glaring gaps in
organizational readiness to handle Big Data. At the same
time, the marketing professionals led by the CMO are
increasingly benefitting from Big Data adoption and are
readily engaging in new data-centric initiatives.
There is intrinsic value of historical customer data,
augmented by behavioral data that has been enhanced with
micro-segmentation (e.g., demographic categorization to a
near-personal level), online browsing, telephone calling
patterns (e.g., who, when, and how), and social media
activity to predict purchasing intent, influencers, and
consumer buying trends. In the past, available computing
technology did not allow for combining all of this data
economically to perform the real-time tracking, monitoring,
and fine-tuning of promotions and campaigns. However,
processing Big Data has provided unprecedented avenues
for marketers to access and analyze all types of internal
or externally procured data, thereby empowering the
marketing executives.
Big Data for risk and compliance management
The risk management function across industries has the
potential to gain from the application of Big Data. If the
corporate governance policies for banking and financial
institutions had been robust and mortgage loan officers had been
empowered with data-driven measures, we would possibly be
living in a different world today. For example, if bankers
had the ability to use the full profile of the prospective loan
applicants, identifying their creditworthiness, they could
assess historic banking patterns, spending habits, and future
earnings potential clearly. That, combined with enriched
social media and economic data from external sources, could
have produced a comprehensive borrower risk profile for
each applicant, and as a result, bankers and lenders could
have been better prepared to handle the collapsing U.S.
financial markets, which was a classic case of lack of
governance at many levels [15]. Likewise, if the health care
industry executives had better insight and capability to
analyze millions of health care transactions daily, they could,
for example, better identify fraudulent claims in health
care and potentially prevent an estimated worldwide
180 billion loss annually, per a 2010 study by the European
Healthcare Fraud and Corruption Network [16].
Most organizations focus on efficiently handling financial
data to ensure the accuracy and integrity of their financial
information and protect their chief financial officer (CFO)
and chief executive officer (CEO). While increasing scrutiny
from regulators and from compliance mandates such as
Sarbanes-Oxley, Basel II, Solvency II, and the Health Insurance
Portability and Accountability Act (HIPAA) has brought
increasing responsibility to the chief information officer
(CIO) organization for information quality and governance,
data ownership, and hence responsibility for the
trustworthiness of data, needs to remain with the business.
Big Data for operations, risk, and value
simultaneously
When data is governed in order to meet all three business
needs (i.e., operations, risk, and value management),
the value obtained by the organization is amplified. As
illustrated in the examples below, converting sensor and
diagnostics Big Data to valuable insights is changing the
airlines and utilities industries.
Machine and sensor data from aircraft engines, which
could generate as much as 640 TB per flight [17], can be put
to effective use, including fault prediction and prevention.
This ultimately results in better fleet care, reduced flight
delays, and reduced cancellations due to unforeseen technical
issues with the plane. Those frequent fliers who have missed
connecting flights because of last-minute maintenance crew
“checkups” will certainly find this application of Big Data
useful.
Smart meters, capturing electric- and gas-usage data every
few minutes, provide a good illustration of Big Data
benefitting the consumer as well as the provider of energy.
The utility companies can now intelligently match supply
with demand and offer consumer incentives to change usage
patterns and behaviors. With active governance of such data,
isolation of faults and quick fixing of issues can prevent
systemic energy grid collapse [18, 19].
Thus, Big Data is enhancing operations, reducing
operational and performance risks, and generating value for
the consumer. Almost every industry has the potential to
benefit significantly from the application and analysis of
Big Data. Having established the importance and reasons for
governing Big Data, the next section examines how leading
organizations are overcoming challenges in managing and
governing it today.
Challenges and opportunities offered by
Big Data
As several use cases of Big Data have evolved, it is clear that
much of this data can be of value and must be considered for
exploratory analysis. Once the exploration starts to show
some useful trends, it provides insight into which datasets
can be further deeply analyzed and which data may be
discarded or purged. This section highlights some trends
that are shaping opportunities and challenges for governing
Big Data.
Confluence of mobile, Internet, and social activity
With roughly half of the world’s population today being
online, and a majority of them being mobile, consumers
as well as enterprises of all sizes have benefitted from the
Internet and mobile revolution. Increasing propensity to use
social media tools to shop, spend, and share insights has
heralded the emergence of social business [20]. Obviously,
in this rapid confluence of mobile, Internet, and social
activity, more data is being generated than can be managed
effectively by humans. Machine-learning algorithms and
recommendation engines that use this data, such as those
from Amazon and Netflix, are not only getting smarter
(e.g., more sophisticated and effective), but also generating
more data while enriching our lives. Data is being recognized
as central to innovation and hence can no longer be just
relegated to enterprise back-office systems, and this data
must be governed appropriately.
Evolving consumer behavior
Rapidly evolving consumer behavior in which purchase
behavior is influenced more by friends or online reviews than
by advertisements [21] signals a trend. Moreover, a study
found that more than 70% of online shoppers are more likely
to pay attention and act upon a stranger’s comments on social
media than believe the direct targeted messages from the
manufacturer, and the younger generation is more likely
than baby boomers to do the same [22]. There is more
window-shopping activity than actual buying that occurs
in physical stores in the cases where a company website is
able to offer the same or better product cheaper with targeted
promotions combined with incentives such as no sales
tax and free shipping. Thus, it is valuable to understand
consumers across their multiple social personas
(e.g., presence at social networks such as Facebook**,
LinkedIn**, Twitter**, and MySpace**, along with
blogging or reviewing activities), households, and online
activities. Combining historical buying patterns with online
and in-store activities reveals challenges for governing
the master data (e.g., linking the system of engagement with
the system of record to formulate the single view of the
consumer) that may not be ready for Big Data.
Rise of social commerce
Social media is no longer reserved just for the younger
generation; it is also being embraced for mainstream
businesses. Popularity of massively multiplayer online
games, in-app purchases with a phone, tablet, or other mobile
devices, and the advent of gamification (i.e., use of game
design techniques and mechanics to incentivize or change
consumer behavior in non-gaming situations), combined
with social shopping behaviors, have accelerated the arrival
of social commerce.
In such an era, social shopping is rapidly becoming the
norm and is generating even more data than traditionally
generated through point-of-sale (POS) systems. A huge
Big Data opportunity for retailers and marketing science
enthusiasts alike, this shopping data, however, poses data
governance challenges because it needs to be combined
with enterprise corporate data. With the incorporation of
Facebook, Twitter, and YouTube** into digital marketing,
public relations, and other traditional business operations,
companies now must deal with massive amounts of
unstructured data. This data onslaught comes in addition
to rapid growth in the amount of customer data housed
in enterprise systems. Although hardware storage costs are
decreasing, we often see that multiple copies of the
same datasets continue to propagate within the enterprise and
compound problems in management, archive, and retrieval
of the latest, most complete, and relevant data. Companies
must now find cost-effective ways to integrate and
analyze the collective pool of all relevant data to generate
granular business insights.
Security and privacy
The extent to which our digital footprints, combined with
in-store activity and credit-card usage patterns, can be
exploited by marketers using Big Data is aptly illustrated
by the overly aggressive focused promotional campaign
by retailer Target for baby products to expectant mothers,
thereby antagonizing some customers [23]. This marketing
was made possible through mining data using algorithms that
could infer that different digital transactions in different
systems are related to the activity of a single person or
household. Commonly termed entity resolution (ER),
this is a great use of technology for public benefit
such as counterterrorism and detecting frauds
(e.g., anti-money-laundering approaches); however, it raises
some serious invasion of privacy concerns for the ordinary
citizen and motivates public policy, data privacy, and
governance concerns. A Big Data governance program needs
to address the security challenges of sharing data between
applications and compliance with geographical trans-border
data regulations, along with strong protection requirements
for personal, health, and financial data.
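To make the notion of entity resolution concrete, the following Python sketch links records from different systems that appear to describe the same person. It is a deliberately simplified illustration: the field names (`source`, `name`, `postal_code`), the sample records, and the exact-match rule on a normalized key are all assumptions made for this example, whereas real ER systems generally rely on probabilistic or fuzzy matching across many more attributes.

```python
from collections import defaultdict


def match_key(record):
    """Build a crude matching key from the normalized name and postal code."""
    name = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    return name, record["postal_code"].replace(" ", "")


def resolve_entities(records):
    """Return keys seen in more than one record, with the contributing sources."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record["source"])
    return {key: sources for key, sources in groups.items() if len(sources) > 1}


# Hypothetical records held in three separate systems.
records = [
    {"source": "loyalty_card", "name": "Jane Q. Doe", "postal_code": "55401"},
    {"source": "web_store", "name": "jane q doe", "postal_code": "55401"},
    {"source": "credit_card", "name": "John Roe", "postal_code": "10001"},
]
linked = resolve_entities(records)
print(linked)
```

Even this crude rule joins the loyalty-card and web-store records into one profile, which shows why a governance program has to decide which linkages are permitted before such matching is run on personal data.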
Technology advancements and open-source issues
Technical challenges for handling Big Data are morphing
with the rapid pace at which new technologies and
methods are becoming available in the open-source world.
Unstructured data (including free-form text, images, and
log data) does not easily lend itself to data patterns or data
relationships because it does not need to conform to defined
data formats. The historical value of such data is limited
until a pattern is discovered. Loading such Big Data into
traditional relational databases is very time-consuming,
error prone, and cost prohibitive. In order to manage this,
many companies are deploying advanced technologies such
as the parallel-processing MapReduce-based Hadoop
software (e.g., an open-source version or a commercial
distribution such as IBM BigInsights*) to process and
analyze Big Data using distributed commodity hardware
server clusters. They then integrate Hadoop with systems
housing other customer data to gain richer insights. We have
the intersection of the traditional methods of delivering,
managing, and viewing information, as well as a new
approach that allows data of all types and formats to be
quickly sorted for transactional and operational opportunity.
This new era of data exchange requires next-generation
compute, storage, and input/output technologies [24], as well
as systemic thinking. Hadoop, streaming, and NoSQL
(not-only Structured Query Language) technologies
encourage fresh thinking; however, traditional governance
programs will not achieve their intended impact if not
expanded for handling Big Data.
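For readers unfamiliar with the MapReduce model mentioned above, the following sketch imitates its three stages (map, shuffle, and reduce) in plain Python on a single machine. It does not use the Hadoop API; the function names and the sample log lines are hypothetical, and the point is only to show how a computation is split into independent per-record work and per-key aggregation, which is what allows a framework to distribute it across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the map, shuffle, and reduce stages in sequence."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical unstructured log lines.
logs = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(logs)
print(counts)
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on one key, so both stages could run in parallel on separate servers without changing the result.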
Quality and uncertainty
The phrase “data is the new oil” [25] underscores the
immense value of Big Data. However, just like crude oil,
the most value is in its refined state (e.g., cleansed data).
Increasing volume and speed of data generation can result
in decreasing data certainty. This raises the all-important
issue of data quality and trustworthiness that must be
addressed given the incomplete and often uncertain nature
of Big Data. For example, in social media channels
(e.g., Facebook posts or Twitter tweets), human sentiments
and expressions are often inherently treated with skepticism.
Thus, without the full context, this data leaves an element
of doubt in situations such as those involving a customer’s
preference and sentiment for a brand or future buying
decision for a specific product in a specific geography.
None of the traditional data-cleansing technologies can
routinely make this data better or useful. However, this data
contains valuable information that could yield trustworthy
data when triangulated with multiple similar sources and
geospatial location and advanced mathematical modeling
techniques such as emerging sense-making systems [26].
Thus, Big Data must be managed in context, which despite
its inherent noisiness and lack of an upfront model raises
the need to track data lineage and its origin in order to handle
uncertainty and error.
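One simple way to picture triangulation with lineage tracking is shown in the Python sketch below. Each observation carries its source and an assumed trust weight, and the combined estimate is returned together with the list of sources that produced it. The confidence-weighted average is a stand-in chosen for this illustration, not the method used by the sense-making systems cited above, and the class name, field names, and sample scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    source: str        # where the value came from (its lineage)
    value: float       # e.g., a sentiment score between -1 and 1
    confidence: float  # assumed trust weight for this source


def triangulate(observations):
    """Return a confidence-weighted estimate and the sources behind it."""
    total_weight = sum(o.confidence for o in observations)
    if total_weight == 0:
        raise ValueError("at least one observation needs non-zero confidence")
    estimate = sum(o.value * o.confidence for o in observations) / total_weight
    lineage = sorted(o.source for o in observations)
    return estimate, lineage


# Hypothetical brand-sentiment readings from three channels.
obs = [
    Observation("twitter", 0.8, 0.25),
    Observation("facebook", 0.4, 0.25),
    Observation("call_center_notes", 0.2, 0.5),
]
estimate, lineage = triangulate(obs)
print(round(estimate, 3), lineage)
```

Keeping the lineage alongside the estimate means that, if one source is later found to be unreliable, the affected results can be identified and recomputed.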
Public datasets and data consortiums
Publicly available high-quality datasets (such as satellite
data, astronomical data, and imaging and mapping geospatial
data, along with up-to-date weather data feeds combined
with public-agency demographic datasets such as those from
data.gov [27]) are a boon for both consumers and the
scientific research community. Numerous public utility
data-mashups and “data apps” are now available for practical
applications, such as an in-city parking locator [28] or a
crime stopper (e.g., through gunshot acoustic analytics
offered by ShotSpotter**). Likewise, with massive amounts
of data being collected about consumers and their buying
patterns based on store loyalty cards, coupon redemption
or credit-card transaction history data, or mobile calling and
data-usage patterns and online behaviors, many companies
have amassed very rich customer repositories and are
in a position to act as data consortiums for the industry.
These datasets and infomediaries (e.g., businesses working
as agents on behalf of customers, helping them to monetize
and maximize the value of their information profiles [29])
would need to be governed as this “external” data becomes
integrated into the enterprise.
Integrating Big Data with traditional enterprise data
Historic IT behaviors, such as building systems without
considering the real need of the business yet deploying
systems for the sake of novelty in technology, will not work
in the age of Big Data. From being product centric,
organizations have evolved to being customer centric;
however, in the Big Data era, even this will not suffice.
Differentiated service based on past interactions, anticipated
needs, and a 360-degree view of the customer is needed to
succeed today. IT organizations must partner closely with
business organizations and must adopt agile and iterative
experimentation approaches. In essence, this involves a total
shift in mindset. Moreover, it is amply evident that silos of
data will not work. Thus, historic enterprise data must be
fully integrated with the newer types of datasets, and,
while maintaining performance and efficiency at scale
(e.g., handling large workloads through distributed
computing architectures), organizations must be able to
transform all of this new data into actionable information
within budget [30]. As a result, Big Data governance needs
to assume a central role if competitive advantage from data
is desired.
Governance principles and leading practices
Big Data governance is defined as the emerging set of
processes, methods, technologies, and practices that enable
the rapid discovery, collection, processing, analysis, storage,
and defensible disposal of large volumes and fast streams
of structured and unstructured data with security, privacy,
and cost efficiency [31]. Although the strategic reasons
for instituting data governance programs do not change by
integrating Big Data in the enterprise, the newer types of data
and related characteristics make existing tools, processes,
and practices inadequate to handle the data. Because the
veracity of Big Data is inherently uncertain, context and
provenance gain prominence in Big Data governance, and hence,
automation and metadata need to be addressed. Some of the early adopters of
Big Data technologies have experimented with and embraced
a variety of processes and principles that can serve as
guidance for the industry. Leading practices in Big Data
governance are still evolving and will be refined as Big Data
initiatives become widely accepted. This “Governance
principles” section highlights a few of these.
Aligning goals with business strategy to build
a roadmap
As with all strategic initiatives, success relies on maintaining
a clear vision of the final goal. Information governance
is a discipline that governs company data assets throughout
their life cycles [32]. Data governance programs aim to
maximize the value of data through active engagement
of people, processes, and technology. Big Data governance
is associated with an overarching mission to maximize the
value of all types of enterprise data throughout their life
cycle. The first step of the journey toward Big Data
governance involves stakeholder engagement as well as goal
alignment with the business strategy of an organization.
Establishing time value of data
Organizations view Big Data technologies as change agents,
especially because several of these technologies support
data in its native format. At the starting point in the Big Data
life cycle, organizations do not always know which data
sources have value and whether they want to invest
vast resources to gather requirements and sponsor formal
information governance programs [33]. Although it is
difficult for some to overcome long-time habits, it must be
made clear that not all data is valuable and worth keeping
forever. Direct storage and indirect handling expenses can be
cost prohibitive, given the rapidly increasing volumes of
Big Data. Thus, it must be recognized that data has varying
degrees of value. Even with such recognition, that value will
change over time as circumstances change and use cases
evolve. At a minimum, the value is related to the
cost of storage, and, thus, the data is more like a natural
resource. If it is abundant but has no use, the value is low.
However, data can gain value when it has a novel purpose.
Time value of data (TVoD) takes into account the
“temperature” of data (e.g., frequency of access guiding a
tiered archival and storage policy), use case, and relative
importance (e.g., expiry and aging policy, and short shelf life
for certain types of data), thereby helping establish optimum
governance practices.
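As an illustration only, the TVoD idea can be sketched as a small tiering rule. The tier names, the access-count thresholds, and the `DatasetProfile` fields below are hypothetical and are not taken from any product or from the studies cited here; an actual policy would be set by the governance council.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetProfile:
    """Hypothetical description of a dataset for TVoD purposes."""
    name: str
    created: date
    accesses_last_30_days: int  # the "temperature" of the data
    shelf_life_days: int        # expiry/aging policy for this data type


def storage_tier(profile: DatasetProfile, today: date) -> str:
    """Return 'dispose', 'hot', 'warm', or 'cold' (assumed tier names)."""
    age_days = (today - profile.created).days
    # Data past its shelf life is a candidate for defensible disposal.
    if age_days > profile.shelf_life_days:
        return "dispose"
    # Assumed thresholds: frequent access keeps data on fast storage.
    if profile.accesses_last_30_days >= 100:
        return "hot"
    if profile.accesses_last_30_days >= 1:
        return "warm"
    return "cold"
```

In this sketch, shelf life is checked before temperature, so that short-lived data (e.g., social media feeds) is flagged for disposal even if it is still being accessed.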
Identifying a business use case
Experimenting with Big Data technologies, to explore some
viable use cases in the context of an industry domain,
is a good practice for both IT and business teams in order to
appreciate the potential of Big Data first-hand. Once some
useful patterns emerge from data exploration, a business case
can be identified and further refined to quantify benefits.
Ideally, this should be in line with the objectives or
imperatives set forth by the organization. For example,
a telecommunications firm had a goal of reducing “churn”
(i.e., customer turnover) and improving customer
intimacy. Using transactional call data records
(CDRs), the company was able to predict the likelihood of
customers discontinuing their service by analyzing the
number of dropped calls for each consumer within each
month. By using geo-location data from the phone, they were
able to offer location-based promotions and services.
Detailed telecommunications “churn” models also include
social media data from Facebook and Twitter to derive the
customer sentiment. In this example, in order to correlate this
data and perform deep predictive analytics, this data was
to be shipped to an overseas vendor each day. This caused
significant concern with respect to safeguarding the privacy
of customer data. After appropriate deliberation, the
telecom operator decided to mask sensitive data, such as
subscriber name, because the calling and receiving telephone
numbers were the primary fields of value for churn
analytics [34].
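A minimal sketch of this kind of masking is shown below. It is not the operator's actual implementation; the field names (`subscriber_name`, `calling_number`, `receiving_number`) and the use of a keyed hash are assumptions made for illustration.

```python
import hashlib
import hmac

# Assumed set of fields to mask before records leave the enterprise.
SENSITIVE_FIELDS = {"subscriber_name"}


def mask_record(record: dict, secret_key: bytes) -> dict:
    """Return a copy of a CDR with sensitive fields replaced by pseudonyms."""
    masked = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            # Keyed hash: the same input always yields the same pseudonym,
            # but the original value cannot be read from the output.
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()[:16]
        else:
            # Fields needed for churn analytics pass through unchanged.
            masked[field] = value
    return masked
```

Because the pseudonym is deterministic for a given key, the vendor can still group records belonging to the same subscriber, while the key itself stays with the operator.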
A sampling of examples from the oil and gas industry
illustrates the concept of establishing use cases for Big Data
governance to further gain sponsorships [35]. Geospatial
data, which is critical to exploration and production, may
relate to land-based drilling, offshore drilling, or wells that
might be abandoned. There are several examples of poor
geospatial data governance that have had a profound business
impact. In one instance, an oil company used incorrect
geospatial coordinates to drill a well at the exact location of
an abandoned well, which resulted in losses of millions
of dollars. In another instance, an oil company used incorrect
navigational coordinates to drill a hole in an adjacent field
that belonged to another company. The company faced legal
issues and public embarrassment in addition to lost time and
money. Instances such as these are sufficient to convince
stakeholders of the need for governance and thereby support
a specific set of use cases.
Performing maturity assessment
An early step in the implementation of an
information-governance program is to assess the current state
and determine the gaps and efforts needed to reach the
desired future state. Such maturity assessments are often
conducted as part of a strategic information agenda,
IT-business alignment, or an organizational readiness
fact-finding exercise. According to an IT executive at the
Turkish bank, Akbank, “Big Data has all the characteristics
of ‘small data’ when it comes to governance. The only
difference is in the complexity and variety of channels that
it comes from. Although there are greater demands on
organizational energy and resources to govern Big Data,
the gain in terms of business value is that much higher. Being
able to analyze Big Data from the Web, and take necessary
actions, can have a major impact on the profit of a
company. A maturity model for Big Data governance is a
critical first step in this journey” [33]. A framework for
Big Data maturity assessment can be established and
deployed as an extension of a traditional
information-governance assessment initiative using a
capability maturity model (CMM) such as the IBM
Information Governance Council Maturity Model as a
starting point. However, additional factors (such as
the use-case domain, industry context, and unique data types
for Big Data) help in shaping the questions and formulating
the maturity assessment. For example, in “customer
centricity improvement” use cases to determine customer
sentiment, the customer master data needs to be integrated
with social media data, but the social media data may not
require cleansing, extensive documentation, or individual
record life-cycle management. Thus, because the purpose of
governance is to establish and deliver trusted information,
use cases dictate the rigor applied in Big Data governance
and thus the future level of maturity desired.
Handling veracity through context and provenance
Popularly characterized by the three Vs, as discussed earlier
in the introduction of this paper, Big Data has more recently
been subjected to additional criteria starting with the letter V.
The Global Technology Outlook (GTO) study by IBM
Research [36] identified veracity as an important criterion
in characterization of Big Data. The GTO states, “Today,
the world’s data contains increasing uncertainties that arise
from such sources as ambiguities associated with social
media, imprecise data from sensors, imperfect object
recognition in video streams, model approximations, and
other sources. Industry analysts believe that by 2015, 80% of
the world’s data will be uncertain.” Figure 1 depicts the
increasing data uncertainty as the data volumes are projected
to increase. The rate of increase is more pronounced for the
largely unstructured data (e.g., from VoIP [Voice over
Internet Protocol], social media, and sensors), compared with
the enterprise data, which is mostly contained in structured
databases.
Figure 1
Increasing data volumes usher in the era of higher uncertainty in data, especially as the quantity and incidence of unstructured data are rising. The
phrase “Internet of Things” refers to identifiable objects and their representations in an Internet-like structure. (VoIP: Voice over Internet Protocol;
IDC: International Data Corporation. Figure adapted from the 2012 IBM GTO Study, an overview of which is given in [36].)
This decreasing “veracity” of data is a natural byproduct of
increasing complexity, volume, and velocity of data.
As discussed in a previous section, veracity of data must be
acknowledged and addressed. Techniques for managing the
predictability of decisions based on imprecise datasets
must be refined. Enhancing the metadata through correlation
of context and provenance (i.e., lineage) of data improves the
veracity. For example, one IBM solution for fighting crime
applies analytics to historic criminal records to predict
and reduce the occurrence of new incidents. The latest advances
in law enforcement and crime prevention integrate fragments
of data from multiple sources, but details described in
eyewitness accounts, incident reports, media and journalist
reports, and other individual observations can differ widely
and ultimately limit accuracy. The extrapolation across
missing data points, using sophisticated software, enables
the solution to be useful, even with imprecision in the
available datasets.
Establishing or adapting a data-quality program
to embrace Big Data
Data quality has some characteristics of art, with its worth
realized after the data is employed and away from its source
of generation (in the case of data, to produce information
for business decision making). The value of most Big Data
depreciates with time, especially data from social media
channels. Poor-quality data, to some, may not appear
inaccurate, inconsistent, or redundant, because it may be fit
for the intended purpose or use case [37]. There have been
efforts in the past to quantify the amount of losses
corporations experience if the problem of poor data quality
is not addressed [38, 39]. However, as a practice, dealing
with poor data quality is often neglected, except in cases of
financial data or compliance mandates that could draw
attention from regulators and company management. In a
2012 study by the International Association for Information
and Data Quality (IAIDQ), more than 70% of respondents
considered data a strategic asset. Increasing regulatory
activity and an increase in the understanding of the value of
data have raised the importance of data quality as a discipline
across organizations [40]. Most organizations have
established data-quality initiatives that improve the level
of trustworthiness of corporate data. Some address the
data-quality issue by aggregating, matching, merging, and
cleansing data in intermediary staging databases
(referred to as “downstream” data fixing). However, it is
well established that quality problems primarily arise
from inadequate controls at the transactional source system
and that the cost of fixing them increases in a downstream setting.
To date, many of these data-quality initiatives have only
addressed traditional or structured data. However,
as mentioned, in the area of Big Data, data sources such as
social media marketing, real-time sensor feeds, and IT system
logs have historically not been linked to official reference
data from transactional systems. Traditionally, such data
did not usually have to be cleansed because the data was
examined in isolation by specialist teams that often addressed
issues “offline.” However, cross-information-type
analytics, which are common in the Big Data space, have
changed this dynamic.
Consider an example of a large health care insurance
company that processes more than 500 million claims per
year, each claim record consisting of 600 to 1,000 attributes
[34]. The company uses predictive analytics to determine
whether certain proactive measures are required for a small
subset of members. However, an audit found that physicians
were using inconsistent procedure codes to submit claims,
thereby limiting the effectiveness of the predictive analytics.
Further, the text notes within claims documents were
ambiguous. For example, the team used terms such as
“chronic congestion” and “blood-sugar monitoring” to
determine that certain members might be candidates for
disease management programs for asthma and diabetes,
respectively. These types of cases are likely to encounter
data-quality issues when information sources that have
historically been underutilized are rigorously used, such as
in cross-information matching for deeper insights.
Clearly, handling massive amounts of data implies that
organizations need to be able to deal with the uncertainty
in the quality of this data. Although not all data needs to have
an in-depth review, practical approaches to data-quality
management, as applied to traditional structured data,
will not be sufficient for the Big Data era, and the
data-quality programs have to be adapted.
Thus, the governance program needs to adopt the
following substeps to address data quality. First, those
involved with such a program must work with business
stakeholders to identify critical data elements that ensure the
success of the information quality and governance program.
Second, they must build a business case to justify the
management of data quality for the organization in general
and extend it so as to relate to the Big Data aspects.
In addition, they must build the overall data-quality plan,
covering vital parameters such as those involving
specification, timeliness, availability, accuracy, precision,
consistency, synchronization, security, and accessibility [41].
Of course, not all of these kinds of dimensions that apply to
traditional data need to apply to Big Data (e.g., timeliness
and accuracy are important dimensions for sensor data,
whereas the same kind of accuracy may not be critical
for social media data).
Those involved with governance programs must also
establish data-quality standards and policies and deploy
proactive monitoring through technology and automation.
They must develop a scorecard to monitor the data-quality
metrics along each dimension and establish confidence
intervals for the quality of virtually all types of data. They
should also appoint data stewards who will be accountable to
the information governance council for improving the
quality metrics over time. These data stewards must be
educated in handling Big Data types and effectively
communicate among one another and with business and
IT stakeholders. Finally, governance-program leaders should
refresh the information governance scorecard on the basis
of ongoing quality improvement progress and the needs of
the business.
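The scorecard idea above can be sketched as follows. This is a simplified illustration under stated assumptions: only two of the quality dimensions (here called completeness and validity) are measured, the target thresholds are invented, and the confidence intervals mentioned in the text are not computed.

```python
def completeness(records, field):
    """Fraction of records that carry a non-empty value for the field."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def validity(records, field, allowed):
    """Fraction of non-empty values that belong to the allowed set."""
    present = [r[field] for r in records if r.get(field) not in (None, "")]
    if not present:
        return 0.0
    return sum(1 for v in present if v in allowed) / len(present)


def scorecard(records, field, allowed, targets):
    """Score each dimension and flag whether it meets its (assumed) target."""
    scores = {
        "completeness": completeness(records, field),
        "validity": validity(records, field, allowed),
    }
    return {
        dim: {"score": score, "meets_target": score >= targets[dim]}
        for dim, score in scores.items()
    }
```

A data steward could refresh such a scorecard on a schedule and report the dimensions that fall below target to the information governance council.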
With a combination of advancing technology and adequate
processes, organizations will be able to improve quality of
data-based decisions and compensate for sparsity, ambiguity,
and veracity of Big Data.
Formally accommodating Big Data roles in the
governance organization
Data governance must be institutionalized through a formal
organizational structure. Given the disruptive potential of
Big Data, specialized people need to be hired, trained,
nurtured, and integrated into the organization. This is one
of the “softer” (e.g., people) aspects of governance, but
organizational development and change management are
key here.
In order to effectively accommodate Big Data in the
enterprise, the organization will need to extend the
information governance team and its charter to incorporate
tenets of Big Data. If it does not already exist, a data
stewardship program should be established in the
organization. Otherwise, it should be extended for Big Data,
ensuring that new roles (such as Big Data analyst and
data scientist) are defined and recognized as key players.
This means the data stewards must be familiar with new
data types and help with the data-validation processes.
Data stewards should be trained to handle data profiling
and anomalous pattern detection with Big Data.
Detecting outliers and understanding the nuances between
false positives and false negatives help them in proper
data exploration and discovery. Tools and technologies
that help with profiling and visualizing Big Data
must be made available to the team. This may mean that
the existing data governance team must be further
trained and enabled in Big Data technologies and
processes.
People, processes, and technologies must be aligned as
in a traditional data governance program. Further, those
individuals with data governance organizational roles
(including chief data officers, data owners, data governance
council members, data stewards, and data analysts) need to
be savvy with respect to Big Data, as policy making and
the oversight of Big Data are critical.
While the different governance roles need to interact
with one another, without structure and organization,
the workflow can stall. Best practices indicate the need to
establish clear, structured, and well-directed
communications, escalation paths, and workflow-monitoring
patterns that would aid in handling policies, procedures,
and methods concerning Big Data in the organization,
for quick resolution in case any challenges arise.
Automating business-rules monitoring for detecting
governance policy exceptions
Determination of policies and standards for governance is
typically done in a collaborative manner with IT and business
teams coming together to agree on a framework of set
policies, processes, and a RACI (responsible, accountable,
consulted, informed) matrix to establish role clarity.
Adapting this framework from a traditional governance
program for dealing with Big Data requires automation
and technology. With the pace and volume of data involved,
traditional data monitoring and manual interventions would
not suffice as data governance tactics for Big Data.
Closed-loop feedback systems need to be established for
high-speed data streams filtered through automated business
rules and policies. Consider an example from the financial
services industry where companies seek ways to gain
competitive advantage in trading stocks, commodities,
precious metals, and other financial instruments.
High-frequency traders using algorithmic trading will use
approaches to reduce milliseconds from trading transition
time, even if these approaches require digging trenches
to lay dedicated fiber-optic networks [42, 43] or even special
transatlantic underwater cables and moving closer to the
trading action at the exchange. However, such systems need
to automate the monitoring processes and provide a
feedback loop in order to provide a “circuit breaker”
effect and prevent runaway M2M interactions, lest they
cause disruption to the operations of the entire trading
exchange.
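One way to picture such a circuit breaker is a rate limit that halts automated activity once a threshold is exceeded and stays halted until it is deliberately reset. The class below is a hypothetical sketch, not a description of any exchange's mechanism; the class name, the sliding-window rule, and the parameters are assumptions.

```python
from collections import deque


class OrderRateCircuitBreaker:
    """Trips when more than max_orders arrive within window_seconds."""

    def __init__(self, max_orders: int, window_seconds: float):
        self.max_orders = max_orders
        self.window_seconds = window_seconds
        self._timestamps = deque()
        self.tripped = False

    def allow(self, now: float) -> bool:
        """Return True if an order at time `now` may proceed."""
        if self.tripped:
            return False
        # Forget orders that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_orders:
            # Too much activity: halt until a reset (the feedback loop).
            self.tripped = True
            return False
        self._timestamps.append(now)
        return True

    def reset(self) -> None:
        """Re-enable order flow, e.g., after a review of the exception."""
        self._timestamps.clear()
        self.tripped = False
```

The breaker deliberately does not recover on its own; requiring an explicit reset keeps a person or a supervising process in the loop after a policy exception.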
Provisioning for security and privacy of Big Data
by design
Security principles (such as separation of duties, separation
of concerns, principle of least privilege, and defense in
depth) apply to all types of data. Extracting value from
Big Data requires a responsible approach to security and
privacy. There is sensitive data that needs to be protected,
retention policies that need to be determined, and personal
data that needs to be masked before transmittal. For example,
the Big Data governance program will need to define
policies surrounding new digital data, such as policies
involving RFID (radio-frequency identification) tag
deactivation or the sharing of geo-location codes, because these
may be fraught with security and privacy issues depending
on the country of data origin.
For companies that deal with consumer information,
extra precautions need to be considered in order to maintain
trust of subscribers. The governance program needs
to be sensitive to invasion of privacy. An
ethics-before-profits policy around privacy dictates not
renting or sharing collected data that could identify
individuals. Governance practices and policies (such as
those involving data activity monitoring covering all read,
write, or access functions on sensitive and private
data) should be enabled through use of suitable processes
and technologies for automatic logging and alerting.
Data masking and obfuscation of sensitive portions of
structured and unstructured data, such as in documents and
electronic files, is another security practice that should be
part of the governance program.
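The data activity monitoring described above might look like the following sketch, in which every read or write of a sensitive field is logged and an alert is raised when one user reads sensitive data unusually often. The `MonitoredStore` class, its in-memory storage, and the per-user read threshold are hypothetical and stand in for whatever monitoring technology an organization actually deploys.

```python
import logging

logger = logging.getLogger("data_activity")


class MonitoredStore:
    """In-memory store that logs access to sensitive fields."""

    def __init__(self, sensitive_fields, alert_threshold):
        self._data = {}
        self._sensitive = set(sensitive_fields)
        self._alert_threshold = alert_threshold  # assumed per-user read limit
        self._reads_by_user = {}
        self.alerts = []

    def write(self, user, key, field, value):
        self._data[(key, field)] = value
        if field in self._sensitive:
            logger.info("WRITE user=%s key=%s field=%s", user, key, field)

    def read(self, user, key, field):
        if field in self._sensitive:
            logger.info("READ user=%s key=%s field=%s", user, key, field)
            count = self._reads_by_user.get(user, 0) + 1
            self._reads_by_user[user] = count
            if count > self._alert_threshold:
                # Record the exception so that it can be escalated.
                self.alerts.append((user, count))
                logger.warning("ALERT user=%s sensitive reads=%d", user, count)
        return self._data.get((key, field))
```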
Increasingly, an individual’s personal data in any company
transaction system is no longer retained in isolation;
instead, it is aggregated so that prospective customers can be
shown targeted advertisements or directed to customized
services. Advertising is just one way data can be collected,
aggregated, and monetized. Organizations can assess credit
worthiness, evaluate employees, or even take the step toward
linking with government or other legal data. The security
risks arise because users must relinquish control over their
data. Because Big Data can be focused on the aggregation of
customer data across an organization, security-protection
efforts should focus on the elimination of the “silos of data”
and examine ways to control and enhance the data-protection
processes that exist.
Figure 2
A framework to establish principles of Big Data governance. The arrows indicate a communication pattern and a workflow that helps in the practical
realization of these principles.
As mentioned, because not all data is equally
useful and data decays at varying paces, information
life-cycle governance practices need to be extended
to take TVoD into account. Governance policies are needed for
selective storage and defensible disposal. Regulatory
requirements may necessitate keeping certain types of data
for seven years (e.g., financial information) or up to ten years
(e.g., pharmaceutical clinical trial data), but retaining data
beyond that is not only an expensive proposition, but also a
potential legal exposure. Organizations need to be prepared
to purge nonuseful data after a certain time, especially when
data trends have been determined and aggregate data stored
and processed.
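A retention rule of this kind can be sketched as a simple date check. The seven- and ten-year periods are the examples given above and are not legal guidance; the category names and the `legal_hold` flag are assumptions added for illustration.

```python
from datetime import date

# Example periods from the text; actual values depend on the regulation.
RETENTION_YEARS = {"financial": 7, "clinical_trial": 10}


def disposal_due(category: str, created: date, today: date,
                 legal_hold: bool = False) -> bool:
    """Return True if a record has outlived its retention period."""
    if legal_hold:
        # Assumed rule: data under a legal hold is never purged.
        return False
    years = RETENTION_YEARS.get(category)
    if years is None:
        # Unknown category: leave the decision to a data steward.
        return False
    try:
        expiry = created.replace(year=created.year + years)
    except ValueError:
        # Created on February 29 and the expiry year is not a leap year.
        expiry = created.replace(year=created.year + years, day=28)
    return today >= expiry
```

Returning False for unknown categories is a conservative choice in this sketch; defensible disposal implies that data is purged only under an explicit policy.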
Establishing, monitoring, and measuring metrics
and communicating
The success or failure of any program depends on how well it
meets its goals over the established timeline. Without
establishing suitable metrics and key performance indicators,
it will be difficult to ascertain whether the Big Data program
is progressing as planned. Further, regular measurement,
monitoring, and sharing of these metrics with the
stakeholders is a sound governance practice that can be easily
adopted from the traditional data governance programs and
provides a vital feedback loop to fine-tune strategy with
changing business priorities.
Thus, many of the information governance strategies,
programs, and practices developed for the “structured world”
are valid and transferrable to the Big Data world. However,
these practices must be extended, expanded, and adapted
for Big Data as discussed above. Figure 2 summarizes the
principles of Big Data governance and provides a framework
for putting them into practice. The arrows show workflow
and communication patterns. The tan boxes at the top of the
figure (i.e., Strategy, Process, Policy, People, and
Technology and Automation) refer to the key domains
involved in Big Data governance. Note that the most-often
used definition of governance involves an orchestration of
People, Process, and Technology. The blue boxes in Figure 2
refer to the primary principles of Big Data governance
as explained herein.
Conclusion
Our world is becoming increasingly instrumented and
interconnected, with a proliferation of information. This
provides a clear opportunity to transform the world into a
more “intelligent” planet [44]. As per the 2012 IBM CEO
study, the ability of an organization to derive value from
data is strongly correlated with performance, where
outperforming organizations are twice as good as
underperformers at accessing and drawing insights from data
[45]. With the advent of Big Data, organizations now have
the opportunity to realize even more value from their
information assets. Early adopters and nearly two-thirds of
respondents in the 2012 Big Data@Work study [10] confirm
that the use of information (including Big Data) and analytics
is creating a competitive advantage for their organizations.
Clearly, there is a great emphasis on data and deriving its
value through effective governance.
Customer-centric analytics are a top priority for Big Data
initiatives among many early adopters. As a result of
advanced techniques (such as machine learning and natural
language processing), sophisticated hardware, and “smart”
software, computers have become adept at processing and
determining trends in consumer behavior that were hitherto
cost prohibitive to analyze and impossible for humans to
decipher. Every click on a website, each phone call, social
media post, tweet, blog entry, or a credit-card purchase
creates a record that can be stored and has the potential to be
analyzed in a manner that will create tangible value for the
business. Enhancing the inherent value of such data in
the enterprise is a key Big Data governance objective,
as discussed throughout this paper.
As observed through numerous examples presented in
this paper, Big Data is revolutionizing our world at an
unprecedented pace despite the nascent and rapidly evolving
technologies involved [46]. Several use cases have emerged
with a potential to cause significant disruption in the
conventional data management thinking. Extending the
information management framework to address data types
unique to Big Data, and incorporating modified practices
suitable for integrating Big Data, allows us to govern
Big Data as an integral part of enterprise data management
function as opposed to dealing with it in a silo. However,
harnessing the full power of data in an enterprise setting
requires removal of many obstacles that still remain in the
path [47]. For instance, a Gartner analyst recently observed
that “although information arguably meets accounting
standards criteria for an asset, and more specifically, further
litmus tests for an intangible asset, it is not found on public
companies’ balance sheets” [48].
This provokes a series of thoughts. For example,
if information is really considered a corporate asset, then
why does it not appear in any company balance sheets and
regulatory filings, like brand value does? In addition, what is
the value of information owned by corporations and how
is it measured? Would an incident involving an information
security breach cause a company to book losses on the
quarterly earnings statement? When these questions are
answered, it will bring focus and attention to the
quantification of the intrinsic value of data and further
underscore the need for governing all types of data.
As Big Data becomes ubiquitous in the coming years,
the need for governing and managing it effectively in the
enterprise will only become stronger. The principles and
practices involved in Big Data governance that have been
presented in this paper serve as a starting point and will
further evolve and ultimately become pervasive to the extent
of being clearly incorporated into normal business and
IT functions. Until then, organizations must learn to deal
with volume, velocity, variety, and veracity of Big Data
through the application of strategy, processes, automation
technology, and people implementing those processes.
*Trademark, service mark, or registered trademark of International
Business Machines Corporation in the United States, other countries,
or both.
**Trademark, service mark, or registered trademark of Apache
Software Foundation, Facebook, Inc., LinkedIn Corporation, Twitter,
Inc., MySpace, Inc., Google, Inc., or ShotSpotter, Inc., in the
United States, other countries, or both.
References
1. B. Johnson, What the web is saying about the god particle.
[Online]. Available: http://gigaom.com/2012/07/04/what-the-web-
is-saying-about-the-god-particle/
2. M. Adrian, “Big Data,” Teradata Magazine. [Online]. Available:
http://www.teradatamagazine.com/v11n01/Features/Big-Data/
3. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs,
C. Roxburgh, and A. H. Byers, “Big data: The next frontier for
innovation, competition and productivity,” McKinsey & Comp.,
San Francisco, CA. [Online]. Available: http://www.mckinsey.
com/Insights/MGI/Research/Technology_and_Innovation/
Big_data_The_next_frontier_for_innovation
4. D. Laney, “Application delivery strategies,” META Group Inc.,
Stamford, CT. [Online]. Available: http://blogs.gartner.com/
doug-laney/files/2012/01/ad949-3D-Data-Management-
Controlling-Data-Volume-Velocity-and-Variety.pdf
5. IDC, IDC Go-to-Market Services: The 2011 Digital Universe
Study: Extracting Value from Chaos. [Online]. Available: http://
www.emc.com/collateral/demos/microsites/emc-digital-universe-
2011/index.htm
6. M. Scherer, Inside the Secret World of Quants and Data Crunchers
who helped Obama Win. [Online]. Available: http://swampland.
time.com/2012/11/07/inside-the-secret-world-of-quants-and-data-
crunchers-who-helped-obama-win/
7. N. Bressan, C. McGregor, and A. James, “Trends and
opportunities for integrated real time neonatal clinical decision
support in conference proceedings,” in Proc. IEEE-EMBS Int.
Conf. BHI, Jan. 2012, pp. 687–690.
8. D. Boyd and K. Crawford, “Six provocations for big data: A
decade in internet time: symposium on the dynamics of the internet
and society,” Social Sci. Res. Netw., New York. [Online].
Available: http://ssrn.com/abstract=1926431 or http://dx.doi.org/
10.2139/ssrn.1926431
9. D. Borthakur, “The Hadoop distributed file system: Architecture
and design,” Apache Softw. Found., Forest Hill, MD. [Online].
Available: http://mit.edu/~mriap/hadoop/hadoop-0.13.1/docs/
hdfs_design.pdf
10. M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, and
P. Tufano. Analytics: The real-world use of big data, IBV and
Saïd Business School, University of Oxford. [Online].
Available: http://public.dhe.ibm.com/common/ssi/ecm/en/
gbe03519usen/GBE03519USEN.PDF
11. B. Franks, Taming the Big Data Tidal Wave: Finding
Opportunities in Huge Data Streams with Advanced Analytics.
Hoboken, NJ: Wiley, 2012.
12. T. Fisher, The Data Asset: How Smart Companies Govern Their
Data for Business Success. Hoboken, NJ: Wiley, 2010.
13. Big Data: New Insights Transform Industries, IBM Corp., New
York, Jun. 2012, a whitepaper, IBM. [Online]. Available: https://
www14.software.ibm.com/webapp/iwm/web/signup.do?source=
sw-infomgt&S_PKG=ov8620&S_TACT=109HF63W&S_CMP=
is_bdwp2_bdhub.
14. IBM Global CMO Study, IBM Corp., New York, 2011. [Online].
Available: http://www-935.ibm.com/services/us/cmo/
cmostudy2011/cmo-registration.html.
15. G. Kirkpatrick, “The corporate governance lessons from the
financial crisis,” Org. Econ. Co-oper. Develop., Cedex, France.
[Online]. Available: http://www.oecd.org/dataoecd/32/1/
42229620.pdf
16. J. Gee, M. Button, and G. Brooks, “Financial cost of healthcare
fraud,” Eur. Healthcare Fraud Corrupt. Netw., Brussels,
Belgium. [Online]. Available: http://www.ehfcn.org/media/
documents/The-Financial-Cost-of-Healthcare-FraudVFinal-
%282%29.pdf
17. S. Rogers. (2011, Sep.). Big data is scaling BI and analytics.
Inf. Manage. [Online]. 21(5), p. 14. Available: http://www.
information-management.com/issues/21_5/big-data-is-scaling-
bi-and-analytics-10021093-1.html
18. S. Soares, Smart Meters and Big Data. [Online]. Available: http://
www.smartgridnews.com/artman/publish/Technologies_MDM/
Smart-meters-show-Big-Data-governance-best-practices-4433.
html
19. G. Raine, Power grid failure leaves 2 million in the dark. [Online].
Available: http://www.sfgate.com/news/article/Power-grid-failure-
leaves-2-million-in-the-dark-3134893.php
20. S. Carter, Get Bold: Using Social Media to Create a New Type of
Social Business. Upper Saddle River, NJ: IBM Press, 2012.
21. R. Scott, Social Retail. [Online]. Available: http://
searchenginewatch.com/article/2192745/Social-Retail-Finding-Engaging-
Cultivating-Todays-Connected-Consumer
22. Talking to Strangers: Millennials Trust People over Brands,
Bazaarvoice, Austin, TX, 2012. [Online]. Available:
http://www.bazaarvoice.com/files/whitepapers/
BV_whitepaper_millenials.pdf.
23. C. Duhigg, “How companies learn your secrets,” The New York
Times Co., New York. [Online]. Available: http://www.nytimes.
com/2012/02/19/magazine/shopping-habits.html?=rss&pagewanted=all
24. P. Nist, The Big Data Challenge: Social Data Meets Corporate
Data. [Online]. Available: http://communities.intel.com/
community/openportit/server/blog/2012/03/29/the-big-data-
challenge-social-data-meets-corporate-data
25. M. Palmer, Data is the new oil. [Online]. Available: http://ana.
blogs.com/maestros/2006/11/data_is_the_new.html
26. A. Cavoukian and J. Jonas, “Privacy by design in the age of big
data,” Privacy by Design, Toronto, ON, Canada. [Online].
Available: http://privacybydesign.ca/content/uploads/2012/06/
pbd-big_data.pdf
27. Data.gov Concept of Operations, OMB, New York. [Online].
Available: http://www.data.gov/sites/default/files/attachments/
data_gov_conops_v1.0.pdf.
28. Streetline. [Online]. Available: http://www.wired.com/autopia/
2010/11/city-parking-smartens-up-with-streetline/
29. J. Hagel and J. F. Rayport, “Coming battle for customer
information,” Harvard Bus. Rev., Boston, MA. [Online].
Available: http://cb.hbsp.harvard.edu/cb/web/product_detail.seam;
jsessionid=A641511F8D6094B1AF8431E85981183D?E=
60482&R=97104-PDF-ENG&conversationId=230602
30. S. Swoyer, Big Data Integration. [Online]. Available: http://tdwi.
org/Research/2012/06/TDWI-EBook-Big-Data-Integration.aspx
31. P. Malik, “Big data governance: The next frontier for
the information economy,” IAIDQ, Baltimore, MD. [Online].
Available: http://iaidq.org/webinars/2012-04-24.shtml
32. M. Godinez, E. Hechler, K. Koenig, S. Lockwood, M. Oberhofer,
and M. Schroeck, The Art of Enterprise Information Architecture:
A Systems-Based Approach for Unlocking Business Insight.
Boston, MA: IBM Press, 2010.
33. S. Soares, T. Deutsch, S. Hanna, and P. Malik, “Big data
governance: A framework to assess maturity,” IBM Corp., New
York. [Online]. Available: http://ibmdatamag.com/2012/04/
big-data-governance-a-framework-to-assess-maturity/
34. S. Soares, “A framework that focuses on the ‘data’ in big data
governance,” IBM Corp., New York. [Online]. Available: http://
ibmdatamag.com/2012/06/a-framework-that-focuses-on-the-data-
in-big-data-governance/
35. S. Soares, Selling Information Governance to the Business: Best
Practices by Industry and Job Function. Ketchum, ID: MC
Press, 2011.
36. IBM GTO 2012 Study: Managing Uncertain Data at scale, IBM
Corp., New York, 2012. [Online]. Available: http://www.zurich.
ibm.com/pdf/isl/infoportal/GTO_2012_Booklet.pdf.
37. P. Malik, “Information integrity for CRM in a virtual world,” in
Encyclopedia of Virtual Communities and Technologies.
Hershey, PA: Idea Group, 2006, pp. 266–272.
38. W. Eckerson, “Data quality and the bottom line,” The Data
Warehousing Inst., Chatsworth, CA, TDWI Rep. Ser., TDWI,
101 Commun., 2002. [Online]. Available: http://download.
101com.com/pub/tdwi/Files/DQReport.pdf
39. L. English and P. Malik, Plain English about Information Quality.
[Online]. Available: http://www.information-management.com/
issues/20070401/1079288-1.html
40. E. Pierce, C. L. Yonke, P. Malik, and C. K. Nagaraj, “The state of
information and data quality industry report 2012,” IAIDQ,
Baltimore, MD. [Online]. Available: http://iaidq.org/publications/
pierce-2012-11.shtml
41. S. Ruschka-Taylor, C. Evask, P. Malik, and S. Minsinger,
“Transforming information integrity,” IBM Corp., New York.
[Online]. Available: http://www-935.ibm.com/services/us/gbs/bus/
pdf/g510-3831-transforming-enterprise-information-integrity.pdf
42. D. I. Amin and A. F. Bach, “NYSE Euronext and 100G: The drive
to zero latency,” Lightwave, Nashua, NH. [Online]. Available:
http://www.lightwaveonline.com/articles/print/volume-27/issue-1/
applications/case-by-case/nyse-euronext-and-100g-the-drive-to-
zero-latency.html
43. A. Troianovski, Networks Built on Milliseconds. [Online].
Available: http://online.wsj.com/article/
SB10001424052702304065704577426500918047624.html
44. C. Harrison, B. Eckman, R. Hamilton, P. Hartswick,
J. Kalagnanam, J. Paraszczak, and P. Williams, “Foundations for
Smarter Cities,” IBM J. Res. & Dev., vol. 54, no. 4, pp. 1–16,
Jul./Aug. 2010, paper 1.
45. IBM Global CEO Study 2012, IBM Corp., New York, 2012.
[Online]. Available: ibm.com/services/us/en/c-suite/ceostudy2012.
46. P. Zikopoulos, C. Eaton, T. Deutsch, D. Deroos, and G. Lapis,
Understanding Big Data: Analytics for Enterprise Class Hadoop
and Streaming Data. New York: McGraw-Hill, 2012.
47. C. Stakutis and J. Webster, Inescapable Data: Harnessing the
Power of Convergence. Upper Saddle River, NJ: IBM Press,
2005.
48. D. Laney, “Infonomics: The practice of information economics,”
in Forbes. [Online]. Available: http://www.forbes.com/sites/
gartnergroup/2012/05/22/infonomics-the-practice-of-information-
economics/
Received August 1, 2012; accepted for publication
August 28, 2012
Piyush Malik IBM Global Business Services, San Jose, CA 95131
USA (Piyush.malik@us.ibm.com). Mr. Malik leads the Worldwide
Big Data Services Center of Excellence within the IBM Global
Business Analytics and Optimization (BAO) consulting practice.
Specializing in information management strategy and architecture,
information quality and governance, business intelligence, master data,
and advanced analytics, Mr. Malik has more than 24 years of
international consulting, practice building, sales, and delivery
experience with Fortune 500 clients across multiple industries. He has
served as Founding Director of the IBM Global BAO Center of
Competency and previously led the Information Integrity consulting
practice at PricewaterhouseCoopers Management Consulting Services
before it was acquired by IBM in 2002. Mr. Malik has also served
on the Board of Directors of the International Association for Information
and Data Quality (IAIDQ) since 2008. He received an undergraduate
degree in electronics and communications engineering in 1989 and
a master’s degree in management of technology from the Indian Institute
of Technology, Delhi, in 1995. He is a frequent speaker at industry
conferences and has authored several articles and papers.

Massive-Scale Analytics Journal

  • 1. VOLUME 57, NUMBER 3/4, MAY/JUL. 2013 Journal of Research and Development Massive-Scale Analytics Including IBM Systems Journal
  • 2. Vol. 57, No. 3/4, May/July 2013 Massive-Scale Analytics
    “Big Data” refers to large data sets that are beyond the capability of traditional software tools to quickly manage, process, and analyze. The development of techniques for gaining insight from such information provides potential benefits in such arenas as business, science, and public policy. This special issue of the IBM Journal emphasizes applications, analytics, software, and hardware technologies that form the foundational building blocks for massive-scale analytics and the processing of Big Data.
    Preface: Massive-scale analytics. A. Soffer, Guest Editor
    1 Governing Big Data: Principles and practices. P. Malik
    2 Trends and outlook for the massive-scale analytics stack. A. N. Ghoting, J. A. Gunnels, P. Kambadur, E. P. Pednault, and M. S. Squillante
    3 Understanding system design for Big Data workloads. H. P. Hofstee, G. C. Chen, F. H. Gebara, K. Hall, J. Herring, D. Jamsek, J. Li, Y. Li, J. W. Shi, and P. W. Y. Wong
    4 A platform for eXtreme Analytics. A. Balmin, K. Beyer, V. Ercegovac, J. McPherson, F. Özcan, H. Pirahesh, E. Shekita, Y. Sismanis, S. Tata, and Y. Tian
    5 GPFS-SNC: An enterprise cluster file system for Big Data. R. Jain, P. Sarkar, and D. Subhraveti
    6 Toward a scale-out data-management middleware for low-latency enterprise computing. L. L. Fong, Y. Gao, X. R. Guerin, Y. G. Liu, T. Salo, S. R. Seelam, W. Tan, and S. Tata
    © Copyright 2013 by International Business Machines Corporation. See individual articles for copying information. Post-1994 articles that carry a code at the bottom of the first page may be copied, provided the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, U.S.A. Pages containing the table of contents may be freely copied and distributed in any form. ISSN 0018-8646. Printed in U.S.A.
  • 3. 7 IBM Streams Processing Language: Analyzing Big Data in motion. M. Hirzel, H. Andrade, B. Gedik, G. Jacques-Silva, R. Khandekar, V. Kumar, M. Mendell, H. Nasgaard, S. Schneider, R. Soulé, and K.-L. Wu
    8 Real-time analysis and management of big time-series data. A. Biem, H. Feng, A. V. Riabov, and D. S. Turaga
    9 Novel document detection for massive data streams using distributed dictionary learning. S. P. Kasiviswanathan, G. Cong, P. Melville, and R. D. Lawrence
    10 Big Data text-oriented benchmark creation for Hadoop. A. Gattiker, F. H. Gebara, H. P. Hofstee, J. D. Hayes, and A. Hylick
    11 Platform and applications for massive-scale streaming network analytics. P. Zerfos, M. Srivatsa, H. Yu, D. Dennerline, H. Franke, and D. Agrawal
    12 Scalable community detection in massive social networks using MapReduce. J. Shi, W. Xue, W. Wang, Y. Zhang, B. Yang, and J. Li
    13 Visual analysis of large-scale network anomalies. Q. Liao, L. Shi, and C. Wang
    14 A statistical approach to mining customers’ conversational data from social media. D. Konopnicki, M. Shmueli-Scheuer, D. Cohen, B. Sznajder, J. Herzig, A. Raviv, N. Zwerling, H. Roitman, and Y. Mass
    15 A real-time stream storage and analysis platform for underwater acoustic monitoring. J. P. Hayes, H. R. Kolar, A. Akhriev, M. G. Barry, M. E. Purcell, and E. P. McKeown
  • 4. Governing Big Data: Principles and practices P. Malik As data-intensive decision making is being increasingly adopted by businesses, governments, and other agencies around the world, most organizations encountering a very large amount and variety of data are still contemplating and assessing their readiness to embrace “Big Data.” While these organizations devise various ways to deal with the challenges such data brings, the impact and importance of Big Data to information quality and governance programs should not be underestimated. Drawing upon implementation experiences of early adopters of Big Data technologies across multiple industries, this paper explores the issues and challenges involved in the management of Big Data, highlighting the principles and best practices for effective Big Data governance. Introduction Big Data, a term commonly associated with massive datasets growing at a rapid pace, also represents complexity and variety in the types of data that are being collected and analyzed by organizations. Whether the data is used to explain global climatic change patterns and their impact, predict customer behavior, forecast purchasing intent, retain customers, target political donors, influence voters for political campaigns, or even help understand the characteristics of our very early universe via the discovery of subatomic particles such as the Higgs boson particle [1], Big Data clearly has a significant role to play. Many different definitions and theories on what constitutes Big Data exist today.
The most often cited definition [2] captures its essence: “Big Data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” A seminal report [3] from McKinsey Global Institute in 2011 helped promote benefits of using Big Data in the corporate environment, but it was more than ten years earlier that an analyst introduced the idea in a research note [4], discussing ways and means of controlling volume (e.g., massive size of datasets), velocity (e.g., amount of data transferred per unit of time), and variety of data (e.g., images, audio, web information, or computer logs, as well as various structured database records), which is the basis for the popularity of the letter “V” used in many Big Data marketing communications. Big Data has significantly altered the data management considerations for the information technology (IT) professional. According to International Data Corporation (IDC), approximately 90% of the digital data we encounter today did not exist two years ago, and the total is predicted to grow from 2.7 zettabytes (ZB) in 2012 to 35 ZB by the year 2020 [5]. However, Big Data is not just big in terms of size; it is transmitted at high rates and usually implies heterogeneity of data types, representation, and semantic interpretation. For example, it can come from online customer behavior via social media, from click streams (e.g., records of the paths taken by users through a company website), call detail and data-usage records (xDRs; also known as event detail records) from telecommunications companies, and data files or transaction notes or audio files captured by contact centers (e.g., facilities that manage client contacts through telephone calls). Sensors and other machines automatically and continuously generate data and event streams.
Biometric data (such as fingerprints or retina scans) as well as medical images and health care data add variety, while traditional mission-critical applications continue to produce terabytes of transaction data. Likewise, the world faces a data deluge because of the widespread proliferation of the Internet, combined with the ubiquity of advanced computing and mobile phone technologies. Organizations will need to handle it appropriately and responsibly not only for competitive advantage, but for survival. To add to the challenge of handling structured data in rows and columns in traditional
© Copyright 2013 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
P. MALIK 1:1. IBM J. RES. & DEV. VOL. 57 NO. 3/4 PAPER 1 MAY/JULY 2013. 0018-8646/13/$5.00 © 2013 IBM. Digital Object Identifier: 10.1147/JRD.2013.2241359
  • 5. data stores, organizations must also contend with semi-structured data from web logs and machine-to-machine (M2M) logs, as well as unstructured data in the form of texts, documents, PDF (portable document format) files, images, videos, and audio streams. Big Data analytics enables organizations to make “smarter” decisions and execute them when it is most beneficial to the business, its customers, and its partners, such as in real time. For example, organizations deploy predictive models on event streams to identify real-time trends. In this way, they are improving the speed and intelligence of their response to customer behavior, suspicious activity, and aberrations in processes. The following cases illustrate diverse applications of Big Data for smarter decision making. In the first case, the 2012 U.S. elections provided a glimpse of how effectively the Democratic political party staff used voter contact data combined with social media analytics to precisely adjust its campaign messaging, created simulation models, and mined data to decide exactly when and on which media channel to place their advertisements to maximize the campaign donations, and effectively influenced the outcome of the elections in its favor [6]. In another example, researchers at the University of Ontario Institute of Technology (UOIT) in Canada use streams of medical sensor data to help doctors detect and analyze subtle changes in the condition of critically ill premature babies to predict the onset of neonatal life-threatening infections as much as 24 hours in advance, allowing them to take precautionary measures and save lives [7]. Finally, consider analytics applications in video surveillance. The image data file of the video is large and uninteresting, but the associated data becomes more useful after adding various forms of metadata, indices of frames in the video, who and what is contained therein, transcripts of dialogs from the video, and comments from viewers.
In short, all the information that makes the video searchable and relevant adds both value and volume. A commercial application of such systems can be found in ensuring the security of buildings and physical infrastructure, and these systems are being incorporated in many home video surveillance and security systems as well. Adding audio and video analytics for this streaming data in real time, TerraEchos offers value-added solutions across industry segments for monitoring and protecting sensitive infrastructure (e.g., infrastructures involving government, oil, gas, and power grids) as well as for cybersecurity [8]. As mentioned, Big Data is a term that refers to the management, processing, and analysis of large amounts of data. Big Data includes the data in storage systems (e.g., Hadoop** Distributed File System [HDFS], under open source Apache Hadoop and MapReduce frameworks [9]) and databases, as well as data in the process of being transmitted (e.g., data in stream-processing systems). As the corporate world explores and recognizes the importance of using Big Data analytics, a few organizations have started realizing the potential of Big Data, though many organizations are still struggling to manage their transactional and operational data and are not yet ready [10, 11]. For those organizations that aspire to be “world class” and beat their competition, it is only natural that they will need to strengthen their information management programs to secure, manage, and expand governance initiatives to address Big Data. The subsequent sections of this paper outline the need for governing Big Data and the challenges involved. Further, they will highlight the principles and practices for governing Big Data as demonstrated by early adopters of Big Data technologies. Why govern Big Data? Governance of information involves the orchestration of people, processes, and technology to turn data into a strategic asset for the organization.
Data governance embodies the exercise of decision making and authority for data-related matters spanning policies, rules, rights, organizational structures, and accountabilities of people and information systems as they perform information-related processes to accomplish business objectives [12]. Depending on the maturity of the information management function, varying levels of adoption dictate the sophistication of the information quality and governance discipline within the enterprise. While Big Data introduces more and new types of information, the reasons to govern data are well established. We govern data to manage risk and, more importantly, to extract value from data. Big Data for operations and cost management Improving operational efficiencies through reducing direct and indirect costs is of interest to business managers across industries. As mentioned, in any industry, Big Data plays a role through continuous monitoring and analysis of data emitted by sophisticated machines, embedded environmental sensors, and operational metrics. This M2M sensor data, which traditionally was often discarded because of unmanageable volumes or speed at which it gets generated, has potential to improve operating efficiency of equipment as well as overall operational safety. For example, in oil drilling rigs, proactively acting on early warning signs of mechanical or electronic equipment component failures can prevent deadly accidents and oil spills. Analyzing this type of M2M data is valuable to predict the health of equipment components and for timely preventative maintenance, thereby avoiding costly industrial downtime. Thus, adequately governing such Big Data contributes to efficiency of operations.
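As a rough illustration of acting on early warning signs in M2M sensor data, the sketch below flags readings that deviate sharply from a rolling baseline. The rolling z-score, the window size, the threshold, and the sample vibration readings are all hypothetical choices made for this example; the paper does not prescribe a particular detection technique.

```python
from collections import deque
from statistics import mean, pstdev


def early_warnings(readings, window=5, threshold=3.0):
    """Return the indices of readings that stray far from the rolling baseline.

    window and threshold are illustrative assumptions, not values from the paper.
    """
    recent = deque(maxlen=window)  # the last `window` readings form the baseline
    flagged = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # A flat baseline (sigma == 0) gives no scale to compare against, so skip it.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append(i)
        recent.append(value)
    return flagged


# Hypothetical vibration readings from one equipment component.
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 4.0, 1.0]
print(early_warnings(vibration))  # [6]
```

In a governed pipeline, the flagged indices would presumably be retained along with the raw readings around them, since those are the records with the most value for preventative maintenance.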
  • 6. Big Data for value creation Social media companies such as Twitter, Facebook, and LinkedIn rely on users’ willingness to provide their personal stories, status updates, and posts to create a valuable corpus of rich content. On each social media platform, aggregated data is monetized and ultimately creates revenue streams. Thus, when Facebook valued the company at $100 billion at the launch of their initial public offering in early 2012, though many had expressed doubts about the valuation, the data scientists and social media software experts took pride in the coming of age of Big Data. Tremendous possibilities emerge for transforming not only businesses, but also entire industries with data. Leading organizations in financial services, health care, retail, telecommunications, media, and other industries are already benefitting from new opportunities generated by Big Data analysis. For example, organizations are actively analyzing the content from thousands of social media posts every day to uncover new customer insights that enable them to profit from emerging trends and deliver a richer customer experience. Some are improving the consistency of information across sales channels to provide a seamless experience for customers as they move among web, mobile-app, in-person, phone, and other interaction points. Many others are using real-time analytics to identify new cross-sell and upsell opportunities while performing predictions and retaining customers. Still others are working on enhancing the efficiency of operations and increasing service availability by analyzing tens of billions of system event records. In many cases, companies are scrutinizing identities, relationships, and previous customer interactions and are using predictive analytics to anticipate fraud and avoid costly losses [13]. 
In an IBM study [14] of more than 1,700 chief marketing officers (CMOs), the authors highlight the transforming of the CMO’s agenda as a result of glaring gaps in organizational readiness to handle Big Data. At the same time, the marketing professionals led by the CMO are increasingly benefitting from Big Data adoption and are readily engaging in new data-centric initiatives. There is intrinsic value of historical customer data, augmented by behavioral data that has been enhanced with micro-segmentation (e.g., demographic categorization to a near-personal level), online browsing, telephone calling patterns (e.g., who, when, and how), and social media activity to predict purchasing intent, influencers, and consumer buying trends. In the past, available computing technology did not allow for combining all of this data economically to perform the real-time tracking, monitoring, and fine-tuning of promotions and campaigns. However, processing Big Data has provided unprecedented avenues for marketers to access and analyze all types of internal or externally procured data, thereby empowering the marketing executives. Big Data for risk and compliance management The risk management function across industries has the potential to gain from the application of Big Data. If the corporate governance policies for banking and financial institutions were robust and mortgage loan officers had been empowered in data-driven measures, we would possibly be living in a different world today. For example, if bankers had the ability to use the full profile of the prospective loan applicants, identifying their creditworthiness, they could assess historic banking patterns, spending habits, and future earnings potential clearly. That, combined with enriched social media and economic data from external sources, could have produced a comprehensive borrower risk profile for each applicant, and as a result, bankers and lenders could have been better prepared to handle the collapsing U.S. 
financial markets, which was a classic case of lack of governance at many levels [15]. Likewise, if the health care industry executives had better insight and capability to analyze millions of health care transactions daily, they could, for example, better identify fraudulent claims in health care and potentially prevent an estimated worldwide €180 billion loss annually, per a 2010 study by the European Healthcare Fraud and Corruption Network [16]. Most organizations focus on efficiently handling financial data to ensure the accuracy and integrity of their financial information and protect their chief financial officer (CFO) and chief executive officer (CEO). While increasing scrutiny caused by such regulators and compliance mandates as Sarbanes-Oxley, Basel II, Solvency II, and the Health Insurance Portability and Accountability Act (HIPAA) has brought increasing responsibility to the chief information officer (CIO) organization for information quality and governance, the data ownership and hence responsibility for its trustworthiness need to remain with the business. Big Data for operations, risk, and value simultaneously When data is governed in order to meet all three business needs (i.e., operations, risk, and value management), the value obtained by the organization is amplified. As illustrated in the examples below, converting sensor and diagnostics Big Data to valuable insights is changing the airlines and utilities industries. Machine and sensor data from aircraft engines, which could generate as much as 640 TB per flight [17], can be put to effective use, including fault prediction and prevention. This ultimately results in better fleet care, reduced flight delays, and reduced cancellations due to unforeseen technical issues with the plane. Those frequent fliers who have missed connecting flights because of last-minute maintenance crew “checkups” will certainly find this application of Big Data useful.
Smart meters, capturing electric- and gas-usage data every few minutes, provide a good illustration of Big Data
  • 7. benefitting the consumer as well as the provider of energy. The utility companies can now intelligently match supply with demand and offer consumer incentives to change usage patterns and behaviors. With active governance of such data, isolation of faults and quick fixing of issues can prevent systemic energy grid collapse [18, 19]. Thus, Big Data is enhancing operations, reducing operational and performance risks, and generating value for the consumer. Almost every industry has the potential to benefit significantly from the application and analysis of Big Data. Having established the importance and reasons for governing Big Data, the next section examines how leading organizations are overcoming challenges in managing and governing it today. Challenges and opportunities offered by Big Data As several use cases of Big Data have evolved, it is clear that much of this data can be of value and must be considered for exploratory analysis. Once the exploration starts to show some useful trends, it provides insight into which datasets can be further deeply analyzed and which data may be discarded or purged. This section highlights some trends that are shaping opportunities and challenges for governing Big Data. Confluence of mobile, Internet, and social activity With roughly half of the world’s population today being online, and a majority of them being mobile, consumers as well as enterprises of all sizes have benefitted from the Internet and mobile revolution. Increasing propensity to use social media tools to shop, spend, and share insights has heralded the emergence of social business [20]. Obviously, in this rapid confluence of mobile, Internet, and social activity, more data is being generated than can be managed effectively by humans. 
Machine-learning algorithms and recommendation engines using this data like the engines from Amazon and Netflix are not only getting smarter (e.g., more sophisticated and effective), but also generating more data while enriching our lives. Data is being recognized as central to innovation and hence can no longer be just relegated to enterprise back-office systems, and this data must be governed appropriately. Evolving consumer behavior Rapidly evolving consumer behavior in which purchase behavior is influenced more by friends or online reviews than by advertisements [21] signals a trend. Moreover, a study found that more than 70% of online shoppers are more likely to pay attention and act upon a stranger’s comments on social media than believe the direct targeted messages from the manufacturer, and the younger generation is more likely than baby boomers to do the same [22]. There is more window-shopping activity than actual buying that occurs in physical stores in the cases where a company website is able to offer the same or better product cheaper with targeted promotions combined with incentives such as no sales tax and free shipping. Thus, it is valuable to understand consumers across their multiple social personas (e.g., presence at social networks such as Facebook**, LinkedIn**, Twitter**, and MySpace**, along with blogging or reviewing activities), households, and online activities. Combining historical buying patterns with online and in-store activities reveals challenges for governing the master data (e.g., linking the system of engagement with the system of record to formulate the single view of the consumer) that may not be ready for Big Data. Rise of social commerce Social media is no longer reserved just for the younger generation; it is also being embraced for mainstream businesses. 
Popularity of massively multiplayer online games, in-app purchases with a phone, tablet, or other mobile devices, and the advent of gamification (i.e., use of game design techniques and mechanics to incentivize or change consumer behavior in non-gaming situations), combined with social shopping behaviors, have accelerated the arrival of social commerce. In such an era, social shopping is rapidly becoming the norm and is generating even more data than traditionally generated through point-of-sale (POS) systems. A huge Big Data opportunity for retailers and marketing science enthusiasts alike, this shopping data, however, poses data governance challenges because it needs to be combined with enterprise corporate data. With the incorporation of Facebook, Twitter, and YouTube** into digital marketing, public relations, and other traditional business operations, companies now must deal with massive amounts of unstructured data. This data onslaught comes in addition to rapid growth in the amount of customer data housed in enterprise systems. Though hardware storage costs are decreasing, we often see that multiple copies of the same datasets continue to propagate within the enterprise and compound problems in management, archive, and retrieval of the latest, most complete, and relevant data. Companies must now find cost-effective ways to integrate and analyze the collective pool of all relevant data to generate granular business insights. Security and privacy The extent to which our digital footprints, combined with in-store activity and credit-card usage patterns, can be exploited by marketers using Big Data is aptly illustrated by the overly aggressive focused promotional campaign by retailer Target for baby products to expectant mothers, thereby antagonizing some customers [23]. This marketing was made possible through mining data using algorithms that could infer that different digital transactions in different
  • 8. systems are related to the activity of a single person or household. Commonly termed entity resolution (ER), this is a great use of technology for public benefit such as counterterrorism and detecting frauds (e.g., anti-money-laundering approaches); however, it raises some serious invasion of privacy concerns for the ordinary citizen and motivates public policy, data privacy, and governance concerns. A Big Data governance program needs to address the security challenges of sharing data between applications and compliance with geographical trans-border data regulations, along with strong protection requirements for personal, health, and financial data. Technology advancements and open-source issues Technical challenges for handling Big Data are morphed with the rapid pace with which new technologies and methods are becoming available in the open-source world. Unstructured data (including free-form text, images, and log data) does not easily lend itself to data patterns or data relationships because it does not need to conform to defined data formats. The historical value of such data is limited until a pattern is discovered. Loading such Big Data into traditional relational databases is very time-consuming, error prone, and cost prohibitive. In order to manage this, many companies are deploying advanced technologies such as the parallel-processing MapReduce-based Hadoop software (e.g., an open-source version or a commercial distribution such as IBM BigInsights*) to process and analyze Big Data using distributed commodity hardware server clusters. They then integrate Hadoop with systems housing other customer data to gain richer insights. We have the intersection of the traditional methods of delivering, managing, and viewing information, as well as a new approach that allows data of all types and formats to be quickly sorted for transactional and operational opportunity. 
This new era of data exchange requires next-generation compute, storage, and input/output technologies [24], as well as systemic thinking. Hadoop, streaming, and NoSQL (not-only Structured Query Language) technologies encourage fresh thinking; however, traditional governance programs will not achieve their intended impact if not expanded for handling Big Data. Quality and uncertainty The phrase “data is the new oil” [25] underscores the immense value of Big Data. However, just like crude oil, the most value is in its refined state (e.g., cleansed data). Increasing volume and speed of data generation can result in decreasing data certainty. This raises the all-important issue of data quality and trustworthiness that must be addressed given the incomplete and often uncertain nature of Big Data. For example, in social media channels (e.g., Facebook posts or Twitter tweets), human sentiments and expressions are often inherently treated with skepticism. Thus, without the full context, this data leaves an element of doubt in situations such as those involving a customer’s preference and sentiment for a brand or future buying decision for a specific product in a specific geography. None of the traditional data-cleansing technologies can routinely make this data better or useful. However, this data contains valuable information that could yield trustworthy data when triangulated with multiple similar sources and geospatial location and advanced mathematical modeling techniques such as emerging sense-making systems [26]. Thus, Big Data must be managed in context, which despite its inherent noisiness and lack of an upfront model raises the need to track data lineage and its origin in order to handle uncertainty and error.
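One way to picture triangulation with lineage tracking is the minimal sketch below, in which each observation records its source, and agreement across distinct sources raises the combined confidence in a claim. The `Observation` type, the per-source confidence scores, and the noisy-OR combination rule (which assumes the sources are independent) are illustrative assumptions for this example, not the sense-making systems the paper cites.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    claim: str         # what is asserted, e.g. a sentiment for a brand in a region
    source: str        # where the record came from (its lineage)
    confidence: float  # assumed trust in this single source, between 0 and 1


def triangulate(observations):
    """Map each claim to (combined confidence, sorted list of contributing sources)."""
    by_claim = defaultdict(dict)
    for obs in observations:
        # Keep one score per source so repeated records from it are not double counted.
        previous = by_claim[obs.claim].get(obs.source, 0.0)
        by_claim[obs.claim][obs.source] = max(previous, obs.confidence)

    combined = {}
    for claim, sources in by_claim.items():
        doubt = 1.0
        for confidence in sources.values():
            doubt *= 1.0 - confidence  # noisy-OR: every source would have to be wrong
        combined[claim] = (1.0 - doubt, sorted(sources))
    return combined


# Hypothetical observations about one claim from three feeds.
observations = [
    Observation("brand X liked in region Y", "social-posts", 0.5),
    Observation("brand X liked in region Y", "store-surveys", 0.5),
    Observation("brand X liked in region Y", "social-posts", 0.4),
]
score, lineage = triangulate(observations)["brand X liked in region Y"]
print(round(score, 2), lineage)  # 0.75 ['social-posts', 'store-surveys']
```

Keeping the source list next to the score is the point of the sketch: when a claim later turns out to be wrong, the lineage shows which feeds contributed to it.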
Public datasets and data consortiums Publicly available high-quality datasets (such as satellite data, astronomical data, and imaging and mapping geospatial data, along with up-to-date weather data feeds combined with public-agency demographic datasets such as those from data.gov [27]) are a boon for both consumers and the scientific research community. Numerous public utility data-mashups and “data apps” are now available for practical applications, such as an in-city parking locator [28] or a crime stopper (e.g., through gunshot acoustic analytics offered by ShotSpotter**). Likewise, with massive amounts of data being collected about consumers and their buying patterns based on store loyalty cards, coupon redemption or credit-card transaction history data, or mobile calling and data-usage patterns and online behaviors, many companies have amassed very rich customer repositories and are in a position to act as data consortiums for the industry. These datasets and infomediaries (e.g., businesses working as agents on behalf of customers, helping them to monetize and maximize the value of their information profiles [29]) would need to be governed as this “external” data becomes integrated into the enterprise. Integrating Big Data with traditional enterprise data Historic IT behaviors, such as building systems without considering the real need of the business yet deploying systems for the sake of novelty in technology, will not work in the age of Big Data. From being product centric, organizations have evolved to being customer centric; however, in the Big Data era, even this will not suffice. Differentiated service based on past interactions, anticipated needs, and a 360-degree view of the customer is needed to succeed today. IT organizations must partner closely with business organizations and must adopt agile and iterative experimentation approaches. In essence, this involves a total shift in mindset. However, it is amply evident that silos of data will not work.
Thus, historic enterprise data must be fully integrated with the newer types of datasets, and, keeping in view the aspect of performance and efficiency at scale
  • 9. (e.g., handling large workloads through distributed computing architectures), organizations must be able to transform all of this new data into actionable information within budget [30]. As a result, Big Data governance needs to assume a central role if competitive advantage from data is desired. Governance principles and leading practices Big Data governance is defined as the emerging set of processes, methods, technologies, and practices that enable the rapid discovery, collection, processing, analysis, storage, and defensible disposal of large volumes and fast streams of structured and unstructured data with security, privacy, and cost efficiency [31]. Although the strategic reasons for instituting data governance programs do not change by integrating Big Data in the enterprise, the newer types of data and related characteristics make existing tools, processes, and practices inadequate to handle the data. Because of inherent veracity concerns, context and provenance gain prominence in Big Data governance, and hence, automation and metadata need to be addressed. Some of the early adopters of Big Data technologies have experimented with and embraced a variety of processes and principles that can serve as guidance for the industry. Leading practices in Big Data governance are still evolving and will be refined as Big Data initiatives become widely accepted. This “Governance principles” section highlights a few of these. Aligning goals with business strategy to build a roadmap As with all strategic initiatives, success relies on maintaining a clear vision of the final goal. Information governance is a discipline that governs company data assets throughout their life cycles [32]. Data governance programs aim to maximize the value of data through active engagement of people, processes, and technology. Big Data governance is associated with an overarching mission to maximize the value of all types of enterprise data throughout their life cycle.
The first step of the journey toward Big Data governance involves stakeholder engagement as well as goal alignment with the business strategy of an organization.

Establishing time value of data

Organizations view Big Data technologies as change agents, especially because several of these technologies support data in its native format. At the starting point in the Big Data life cycle, organizations do not always know which data sources have value and whether they want to invest vast resources to gather requirements and sponsor formal information governance programs [33]. Although it is difficult for some to overcome long-time habits, it must be made clear that not all data is valuable and worth keeping forever. Direct storage and indirect handling expenses can be cost prohibitive, given the rapidly increasing volumes of Big Data. Thus, it must be recognized that data has varying degrees of value. Even with such recognition, that value will change over time as circumstances change and use cases evolve. At a minimum, the value is related to the cost of storage, and thus the data is more like a natural resource: if it is abundant but has no use, the value is low. However, data can gain value when it has a novel purpose. Time value of data (TVoD) takes into account the "temperature" of data (e.g., frequency of access guiding a tiered archival and storage policy), use case, and relative importance (e.g., expiry and aging policy, and short shelf life for certain types of data), thereby helping establish optimum governance practices.

Identifying a business use case

Experimenting with Big Data technologies, to explore some viable use cases in the context of an industry domain, is a good practice for both IT and business teams in order to appreciate the potential of Big Data first-hand. Once some useful patterns emerge from data exploration, a business case can be identified and further refined to quantify benefits.
Ideally, this should be in line with the objectives or imperatives set forth by the organization. For example, a telecommunications firm had a goal of reducing "churn" (i.e., customer turnover) and improving customer intimacy. Using transactional call data records (CDRs), the company was able to predict the likelihood of customers discontinuing their service by analyzing the number of dropped calls for each consumer within each month. By using geo-location data from the phone, they were able to offer location-based promotions and services. Detailed telecommunications "churn" models also include social media data from Facebook and Twitter to derive the customer sentiment. In this example, in order to correlate this data and perform deep predictive analytics, the data was to be shipped to an overseas vendor each day. This caused significant concern with respect to safeguarding the privacy of customer data. After appropriate deliberation, the telecom operator decided to mask sensitive data, such as subscriber name, because the calling and receiving telephone numbers were the primary fields of value for churn analytics [34].

A sampling of examples from the oil and gas industry illustrates the concept of establishing use cases for Big Data governance to further gain sponsorships [35]. Geospatial data, which is critical to exploration and production, may relate to land-based drilling, offshore drilling, or wells that might be abandoned. There are several examples of poor geospatial data governance that have had a profound business impact. In one instance, an oil company used incorrect geospatial coordinates to drill a well at the exact location of an abandoned well, which resulted in losses of millions of dollars. In another instance, an oil company used incorrect navigational coordinates to drill a hole in an adjacent field
that belonged to another company. The company faced legal issues and public embarrassment in addition to lost time and money. Instances such as these are sufficient to convince stakeholders of the need for governance and thereby support a specific set of use cases.

Performing maturity assessment

An early step in the implementation of an information-governance program is to assess the current state and determine the gaps and efforts needed to reach the desired future state. Such maturity assessments are often conducted as part of a strategic information agenda, IT-business alignment, or an organizational readiness fact-finding exercise. According to an IT executive at the Turkish bank Akbank, "Big Data has all the characteristics of 'small data' when it comes to governance. The only difference is in the complexity and variety of channels that it comes from. Although there are greater demands on organizational energy and resources to govern Big Data, the gain in terms of business value is that much higher. Being able to analyze Big Data from the Web, and take necessary actions, can have a major impact on the profit of a company. A maturity model for Big Data governance is a critical first step in this journey" [33].

A framework for Big Data maturity assessment can be established and deployed as an extension of a traditional information-governance assessment initiative using a capability maturity model (CMM) such as the IBM Information Governance Council Maturity Model as a starting point. However, additional factors (such as the use-case domain, industry context, and unique data types for Big Data) help in shaping the questions and formulating the maturity assessment. For example, in "customer centricity improvement" use cases to determine customer sentiment, the customer master data needs to be integrated with social media data, but the social media data may not require cleansing, extensive documentation, or individual record life-cycle management.
Thus, because the purpose of governance is to establish and deliver trusted information, use cases dictate the rigor applied in Big Data governance and thus the future level of maturity desired.

Handling veracity through context and provenance

Popularly characterized by the three Vs, as discussed earlier in the introduction of this paper, Big Data has more recently been subjected to additional criteria starting with the letter V. The Global Technology Outlook (GTO) study by IBM Research [36] identified veracity as an important criterion in the characterization of Big Data. The GTO states, "Today, the world's data contains increasing uncertainties that arise from such sources as ambiguities associated with social media, imprecise data from sensors, imperfect object recognition in video streams, model approximations, and other sources. Industry analysts believe that by 2015, 80% of the world's data will be uncertain."

[Figure 1. Increasing data volumes usher in the era of higher uncertainty in data, especially as the quantity and incidence of unstructured data are rising. The phrase "Internet of Things" refers to identifiable objects and their representations in an Internet-like structure. (VoIP: Voice over Internet Protocol; IDC: International Data Corporation. Figure adapted from the 2012 IBM GTO Study, an overview of which is given in [36].)]

Figure 1 depicts the increasing data uncertainty as the data volumes are projected
to increase. The rate of increase is more pronounced for the largely unstructured data (e.g., from VoIP [Voice over Internet Protocol], social media, and sensors), compared with the enterprise data, which is mostly contained in structured databases. This decreasing "veracity" of data is a natural byproduct of increasing complexity, volume, and velocity of data.

As discussed in a previous section, veracity of data must be acknowledged and addressed. Techniques for managing the predictability of decisions based on imprecise datasets must be refined. Enhancing the metadata through correlation of context and provenance (i.e., lineage) of data improves the veracity. For example, one IBM solution for fighting crime uses historic criminal records and analytics to predict and reduce the occurrence of new incidents. The latest advances in law enforcement and crime prevention integrate fragments of data from multiple sources, but details described in eyewitness accounts, incident reports, media and journalist reports, and other individual observations can differ widely and ultimately limit accuracy. The extrapolation across missing data points, using sophisticated software, enables the solution to be useful, even with imprecision in the available datasets.

Establishing or adapting a data-quality program to embrace Big Data

Data quality has some characteristics of art, with its worth realized after the data is employed and away from its source of generation (in the case of data, to produce information for business decision making). The value of most Big Data depreciates with time, especially data from social media channels. Poor-quality data, to some, may not appear inaccurate, inconsistent, or redundant, because it may be fit for the intended purpose or use case [37]. There have been efforts in the past to quantify the amount of losses corporations experience if the problem of poor data quality is not addressed [38, 39].
However, as a practice, dealing with poor data quality is often neglected, except in cases of financial data or compliance mandates that could draw attention from regulators and company management. In a 2012 study by the International Association for Information and Data Quality (IAIDQ), more than 70% of respondents considered data a strategic asset. Increasing regulatory activity and an increase in the understanding of the value of data have raised the importance of data quality as a discipline across organizations [40].

Most organizations have established data-quality initiatives that improve the level of trustworthiness of corporate data. Some address the data-quality issue by aggregating, matching, merging, and cleansing data in intermediary staging databases (referred to as "downstream" data fixing). However, it is well established that quality problems primarily arise from inadequate controls at the transactional source system and that the cost of fixing them increases downstream. To date, many of these data-quality initiatives have only addressed traditional or structured data. However, as mentioned, in the area of Big Data, data sources such as social media marketing, real-time sensor feeds, and IT system logs have historically not been linked to official reference data from transactional systems. Traditionally, such data did not usually have to be cleansed because the data was examined in isolation by specialist teams that often addressed issues "offline." However, cross-information-type analytics, which are common in the Big Data space, have changed this dynamic.

Consider an example of a large health care insurance company that processes more than 500 million claims per year, each claim record consisting of 600 to 1,000 attributes [34]. The company uses predictive analytics to determine whether certain proactive measures are required for a small subset of members.
However, an audit found that physicians were using inconsistent procedure codes to submit claims, thereby limiting the effectiveness of the predictive analytics. Further, the text notes within claims documents were ambiguous. For example, the team used terms such as "chronic congestion" and "blood-sugar monitoring" to determine that certain members might be candidates for disease management programs for asthma and diabetes, respectively. These types of cases are likely to encounter data-quality issues when information sources that have historically been underutilized are rigorously used, such as in cross-information matching for deeper insights.

Clearly, handling massive amounts of data implies that organizations need to be able to deal with the uncertainty in the quality of this data. Although not all data needs an in-depth review, practical approaches to data-quality management, as applied to traditional structured data, will not be sufficient for the Big Data era, and data-quality programs have to be adapted. Thus, the governance program needs to adopt the following substeps to address data quality. First, those involved with such a program must work with business stakeholders to identify critical data elements that ensure the success of the information quality and governance program. Second, they must build a business case to justify the management of data quality for the organization in general and extend it so as to relate to the Big Data aspects. In addition, they must build the overall data-quality plan, covering vital parameters such as those involving specification, timeliness, availability, accuracy, precision, consistency, synchronization, security, and accessibility [41]. Of course, not all of the dimensions that apply to traditional data need to apply to Big Data (e.g., timeliness and accuracy are important dimensions for sensor data, whereas the same kind of accuracy may not be critical for social media data).
Those involved with governance programs must also establish data-quality standards and policies and deploy
proactive monitoring through technology and automation. They must develop a scorecard to monitor the data-quality metrics along each dimension and establish confidence intervals for the quality of virtually all types of data. They should also appoint data stewards who will be accountable to the information governance council for improving the quality metrics over time. These data stewards must be educated in handling Big Data types and must communicate effectively among one another and with business and IT stakeholders. Finally, governance-program leaders should refresh the information governance scorecard on the basis of ongoing quality improvement progress and the needs of the business. With a combination of advancing technology and adequate processes, organizations will be able to improve the quality of data-based decisions and compensate for the sparsity, ambiguity, and uncertain veracity of Big Data.

Formally accommodating Big Data roles in the governance organization

Data governance must be institutionalized through a formal organizational structure. Given the disruptive potential of Big Data, specialized people need to be hired, trained, nurtured, and integrated into the organization. This is one of the "softer" (i.e., people) aspects of governance, but organizational development and change management are key here. In order to effectively accommodate Big Data in the enterprise, the organization will need to extend the information governance team and its charter to incorporate tenets of Big Data. If it does not already exist, a data stewardship program should be established in the organization. Otherwise, it should be extended for Big Data, ensuring that new roles (such as Big Data analyst and data scientist) are defined and recognized as key players. This means the data stewards must be familiar with new data types and help with the data-validation processes. Data stewards should be trained to handle data profiling and anomalous pattern detection with Big Data.
Detecting outliers and understanding the nuances between false positives and false negatives help them in proper data exploration and discovery. Tools and technologies that help with profiling and visualizing Big Data must be made available to the team. This may mean that the existing data governance team must be further trained and enabled in Big Data technologies and processes. People, processes, and technologies must be aligned as in a traditional data governance program. Further, those individuals with data governance organizational roles (including chief data officers, data owners, data governance council members, data stewards, and data analysts) need to be savvy with respect to Big Data, as policy making and the oversight of Big Data are critical.

While the different governance roles need to interact with one another, without structure and organization, the workflow can stall. Best practices indicate the need to establish clear, structured, and well-directed communications, escalation paths, and workflow-monitoring patterns that aid in handling policies, procedures, and methods concerning Big Data in the organization, so that any challenges that arise can be resolved quickly.

Automating business-rules monitoring for detecting governance policy exceptions

Determination of policies and standards for governance is typically done in a collaborative manner, with IT and business teams coming together to agree on a framework of set policies, processes, and a RACI (responsible, accountable, consulted, informed) matrix to establish role clarity. Adapting this framework from a traditional governance program for dealing with Big Data requires automation and technology. With the pace and volume of data involved, traditional data monitoring and manual interventions would not suffice as data governance tactics for Big Data. Closed-loop feedback systems need to be established for high-speed data streams filtered through automated business rules and policies.
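To make the closed-loop idea concrete, the following is a minimal Python sketch, not a reference implementation, of automated business rules applied to a stream of records, with a feedback step that halts processing once policy exceptions accumulate. The rule names, the record fields, and the exception threshold are all illustrative assumptions and do not come from any product or policy discussed in this paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Record = Dict[str, int]
Rule = Callable[[Record], bool]  # returns True when the record complies


@dataclass
class RuleMonitor:
    """Applies business rules to each record and trips after too many exceptions."""
    rules: Dict[str, Rule]
    max_exceptions: int = 3  # hypothetical threshold, set by policy in practice
    exceptions: List[str] = field(default_factory=list)
    tripped: bool = False

    def check(self, record: Record) -> bool:
        """Return True if the record passes every rule; log each failed rule."""
        failed = [name for name, rule in self.rules.items() if not rule(record)]
        for name in failed:
            self.exceptions.append(f"{record['id']}: {name}")
        if len(self.exceptions) >= self.max_exceptions:
            self.tripped = True  # feedback step: stop accepting further records
        return not failed

    def process(self, stream: Iterable[Record]) -> List[Record]:
        """Pass compliant records downstream until the monitor trips."""
        accepted: List[Record] = []
        for record in stream:
            if self.tripped:
                break
            if self.check(record):
                accepted.append(record)
        return accepted


# Hypothetical rules for a stream of orders.
rules: Dict[str, Rule] = {
    "positive_quantity": lambda r: r["quantity"] > 0,
    "quantity_limit": lambda r: r["quantity"] <= 1000,
}
monitor = RuleMonitor(rules=rules, max_exceptions=2)
stream = [
    {"id": 1, "quantity": 10},
    {"id": 2, "quantity": -5},
    {"id": 3, "quantity": 5000},
    {"id": 4, "quantity": 20},
]
accepted = monitor.process(stream)
```

In this sketch the monitor stops reading the stream after the second exception, so the fourth record is never processed; the logged exceptions are what a data steward would review before resetting the monitor.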
Consider an example from the financial services industry, where companies seek ways to gain competitive advantage in trading stocks, commodities, precious metals, and other financial instruments. High-frequency traders using algorithmic trading will adopt approaches that shave milliseconds off trading transition time, even if these approaches require digging trenches to lay dedicated fiber-optic networks [42, 43], laying special transatlantic underwater cables, or moving closer to the trading action at the exchange. However, such systems need to automate the monitoring processes and provide a feedback loop in order to provide a "circuit breaker" effect and prevent runaway M2M interactions, lest they cause disruption to the operations of the entire trading exchange.

Provisioning for security and privacy of Big Data by design

Security principles, such as separation of duties, separation of concerns, the principle of least privilege, and defense in depth, apply to all types of data. Extracting value from Big Data requires a responsible approach to security and privacy. There is sensitive data that needs to be protected, retention policies that need to be determined, and personal data that needs to be masked before transmittal. For example, the Big Data governance program will need to define policies surrounding new digital data, such as policies involving RFID (radio-frequency identification) tag deactivation or the sharing of geo-location codes, because these may be fraught with security and privacy issues depending on the country of data origin.
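As an illustration of masking personal data before transmittal, the sketch below redacts one field outright and replaces others with a keyed hash, so that records can still be correlated without exposing the raw values. This is a minimal example under stated assumptions: the field names and the split between redacted and pseudonymized fields are hypothetical, and the use of keyed hashing is one possible technique rather than the method used in any case described in this paper.

```python
import hashlib
import hmac
from typing import Dict

# Hypothetical field classification; in practice this would come from policy metadata.
REDACT_FIELDS = {"subscriber_name"}
PSEUDONYMIZE_FIELDS = {"calling_number", "receiving_number"}


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a value with a keyed hash so that equal inputs still match each other."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_record(record: Dict[str, str], key: bytes) -> Dict[str, str]:
    """Return a copy of the record with sensitive fields redacted or pseudonymized."""
    masked: Dict[str, str] = {}
    for name, value in record.items():
        if name in REDACT_FIELDS:
            masked[name] = "***"
        elif name in PSEUDONYMIZE_FIELDS:
            masked[name] = pseudonymize(value, key)
        else:
            masked[name] = value
    return masked


# Demonstration values only; a real key would be stored and rotated securely.
key = b"demo-key-not-for-production"
record = {
    "subscriber_name": "Jane Doe",
    "calling_number": "5550100",
    "receiving_number": "5550199",
    "dropped_calls": "4",
}
masked = mask_record(record, key)
```

Because the hash is keyed, the same telephone number always maps to the same token within one key's lifetime, which preserves joins across records while keeping the key holder as the only party able to re-derive the mapping.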
For companies that deal with consumer information, extra precautions need to be considered in order to maintain the trust of subscribers. The governance program needs to be sensitive to invasion of privacy. The need for an ethics-before-profits policy around privacy dictates not renting or sharing collected data that could identify individuals. Governance practices and policies, such as those involving data activity monitoring covering all read, write, or access functions on sensitive and private data, should be enabled through the use of suitable processes and technologies for automatic logging and alerting. Data masking and obfuscation of sensitive portions of structured and unstructured data, such as in documents and electronic files, is another security practice that should be part of the governance program.

Increasingly, an individual's personal data in any company transaction system is no longer retained in isolation; instead, it is aggregated so that prospective customers can be shown targeted advertisements or directed to customized services. Advertising is just one way data can be collected, aggregated, and monetized. Organizations can assess credit worthiness, evaluate employees, or even take the step toward linking with government or other legal data. The security risks arise because users must relinquish control over their data. Because Big Data can be focused on the aggregation of customer data across an organization, security-protection efforts should focus on the elimination of the "silos of data" and examine ways to control and enhance the data-protection processes that exist.

As mentioned, because not all data is equally useful and data decays at varying paces, information life-cycle governance practices need to be extended to consider TVoD. Governance policies are needed for selective storage and defensible disposal.
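A policy of selective storage and defensible disposal can be sketched as a small decision function that combines data "temperature" with a retention period. The example below is illustrative only: the tier names, the access-count threshold, and the retention values assigned to each dataset are assumptions, and actual retention periods would be set by the applicable regulations and the governance council.

```python
from dataclasses import dataclass
from datetime import date

HOT_THRESHOLD = 100  # hypothetical: accesses in the last 90 days that count as "hot"


@dataclass
class DataSet:
    name: str
    created: date
    accesses_last_90_days: int
    retention_years: int  # assumed to be supplied by policy or regulation


def storage_action(ds: DataSet, today: date) -> str:
    """Pick a storage tier, or flag the dataset for disposal review."""
    age_years = (today - ds.created).days / 365.25
    if ds.accesses_last_90_days >= HOT_THRESHOLD:
        return "hot-tier"
    if ds.accesses_last_90_days > 0:
        return "warm-tier"
    # Unused data: keep it archived only while retention still applies.
    if age_years > ds.retention_years:
        return "disposal-review"
    return "archive"


today = date(2013, 7, 1)
actions = {
    ds.name: storage_action(ds, today)
    for ds in [
        DataSet("clickstream", date(2012, 1, 1), 250, 1),
        DataSet("sensor_feed", date(2013, 1, 1), 3, 1),
        DataSet("ledger", date(2010, 1, 1), 0, 7),
        DataSet("old_logs", date(2005, 1, 1), 0, 7),
    ]
}
```

The function only flags a dataset for review rather than deleting it, which reflects the view that disposal should be a documented, defensible decision rather than an automatic side effect.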
[Figure 2. A framework to establish principles of Big Data governance. The arrows indicate a communication pattern and a workflow that helps in the practical realization of these principles.]

Regulatory requirements may necessitate keeping certain types of data for seven years (e.g., financial information) or up to ten years (e.g., pharmaceutical clinical trial data), but retaining data beyond that is not only an expensive proposition, but also a
potential legal exposure. Organizations need to be prepared to purge nonuseful data after a certain time, especially when data trends have been determined and aggregate data stored and processed.

Establishing, monitoring, and measuring metrics and communicating

The success or failure of any program depends on how well it meets its goals over the established timeline. Without establishing suitable metrics and key performance indicators, it will be difficult to ascertain whether the Big Data program is progressing as planned. Further, regular measurement, monitoring, and sharing of these metrics with the stakeholders is a sound governance practice that can be easily adopted from traditional data governance programs, and it provides a vital feedback loop to fine-tune strategy with changing business priorities.

Thus, many of the information governance strategies, programs, and practices developed for the "structured world" are valid and transferable to the Big Data world. However, these practices must be extended, expanded, and adapted for Big Data as discussed above. Figure 2 summarizes the principles of Big Data governance and provides a framework for putting them into practice. The arrows show workflow and communication patterns. The tan boxes at the top of the figure (i.e., Strategy, Process, Policy, People, and Technology and Automation) refer to the key domains involved in Big Data governance. Note that the most often used definition of governance involves an orchestration of People, Process, and Technology. The blue boxes in Figure 2 refer to the primary principles of Big Data governance as explained herein.

Conclusion

Our world is becoming increasingly instrumented and interconnected, with a proliferation of information. This provides a clear opportunity to transform the world into a more "intelligent" planet [44].
According to the 2012 IBM CEO study, the ability of an organization to derive value from data is strongly correlated with performance: outperforming organizations are twice as good as underperformers at accessing and drawing insights from data [45]. With the advent of Big Data, organizations now have the opportunity to realize even more value from their information assets. Early adopters and nearly two-thirds of respondents in the 2012 Big Data@Work study [10] confirm that the use of information (including Big Data) and analytics is creating a competitive advantage for their organizations. Clearly, there is a great emphasis on data and deriving its value through effective governance.

Customer-centric analytics are a top priority for Big Data initiatives among many early adopters. As a result of advanced techniques (such as machine learning and natural language processing), sophisticated hardware, and "smart" software, computers have become adept at processing and determining trends in consumer behavior that were hitherto cost prohibitive to analyze and impossible for humans to decipher. Every click on a website, each phone call, social media post, tweet, blog entry, or credit-card purchase creates a record that can be stored and has the potential to be analyzed in a manner that will create tangible value for the business. Enhancing the inherent value of such data in the enterprise is a key Big Data governance objective, as discussed throughout this paper.

As observed through numerous examples presented in this paper, Big Data is revolutionizing our world at an unprecedented pace despite the nascent and rapidly evolving technologies involved [46]. Several use cases have emerged with the potential to cause significant disruption in conventional data management thinking.
Extending the information management framework to address data types unique to Big Data, and incorporating modified practices suitable for integrating Big Data, allows us to govern Big Data as an integral part of the enterprise data management function as opposed to dealing with it in a silo.

However, harnessing the full power of data in an enterprise setting requires removal of many obstacles that still remain in the path [47]. For instance, a Gartner analyst recently observed that "although information arguably meets accounting standards criteria for an asset, and more specifically, further litmus tests for an intangible asset, it is not found on public companies' balance sheets" [48]. This provokes a series of questions. For example, if information is really considered a corporate asset, then why does it not appear in any company balance sheets and regulatory filings, as brand value does? In addition, what is the value of information owned by corporations, and how is it measured? Would an incident involving an information security breach cause a company to book losses on the quarterly earnings statement? Answering these questions will bring focus and attention to the quantification of the intrinsic value of data and further underscore the need for governing all types of data.

As Big Data becomes ubiquitous in the coming years, the need for governing and managing it effectively in the enterprise will only become stronger. The principles and practices involved in Big Data governance that have been presented in this paper serve as a starting point and will further evolve and ultimately become pervasive to the extent of being clearly incorporated into normal business and IT functions. Until then, organizations must learn to deal with the volume, velocity, variety, and veracity of Big Data through the application of strategy, processes, automation technology, and people implementing those processes.
*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.
**Trademark, service mark, or registered trademark of Apache Software Foundation, Facebook, Inc., LinkedIn Corporation, Twitter, Inc., MySpace, Inc., Google, Inc., or ShotSpotter, Inc., in the United States, other countries, or both.

References

1. B. Johnson, What the web is saying about the god particle. [Online]. Available: http://gigaom.com/2012/07/04/what-the-web-is-saying-about-the-god-particle/
2. M. Adrian, "Big Data," Teradata Magazine. [Online]. Available: http://www.teradatamagazine.com/v11n01/Features/Big-Data/
3. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, "Big data: The next frontier for innovation, competition and productivity," McKinsey & Comp., San Francisco, CA. [Online]. Available: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
4. D. Laney, "Application delivery strategies," META Group Inc., Stamford, CT. [Online]. Available: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
5. IDC, IDC Go-to-Market Services: The 2011 Digital Universe Study: Extracting Value from Chaos. [Online]. Available: http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm
6. M. Scherer, Inside the Secret World of Quants and Data Crunchers who helped Obama Win. [Online]. Available: http://swampland.time.com/2012/11/07/inside-the-secret-world-of-quants-and-data-crunchers-who-helped-obama-win/
7. N. Bressan, C. McGregor, and A. James, "Trends and opportunities for integrated real time neonatal clinical decision support in conference proceedings," in Proc. IEEE-EMBS Int. Conf. BHI, Jan. 2012, pp. 687–690.
8. D. Boyd and K. Crawford, "Six provocations for big data: A decade in internet time: symposium on the dynamics of the internet and society," Social Sci. Res. Netw., New York. [Online].
Available: http://ssrn.com/abstract=1926431 or http://dx.doi.org/10.2139/ssrn.1926431
9. D. Borthakur, "The Hadoop distributed file system: Architecture and design," Apache Softw. Found., Forest Hill, MD. [Online]. Available: http://mit.edu/~mriap/hadoop/hadoop-0.13.1/docs/hdfs_design.pdf
10. M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, and P. Tufano, Analytics: The real-world use of big data, IBV and Saïd School of Business, University of Oxford. [Online]. Available: http://public.dhe.ibm.com/common/ssi/ecm/en/gbe03519usen/GBE03519USEN.PDF
11. B. Franks, Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics. Hoboken, NJ: Wiley, 2012.
12. T. Fisher, The Data Asset: How Smart Companies Govern Their Data for Business Success. Hoboken, NJ: Wiley, 2010.
13. Big Data: New Insights Transform Industries, IBM Corp., New York, Jun. 2012, a whitepaper, IBM. [Online]. Available: https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=sw-infomgt&S_PKG=ov8620&S_TACT=109HF63W&S_CMP=is_bdwp2_bdhub
14. IBM Global CMO Study, IBM Corp., New York, 2011. [Online]. Available: http://www-935.ibm.com/services/us/cmo/cmostudy2011/cmo-registration.html
15. G. Kirkpatrick, "The corporate governance lessons from the financial crisis," Org. Econ. Co-oper. Develop., Cedex, France. [Online]. Available: http://www.oecd.org/dataoecd/32/1/42229620.pdf
16. J. Gee, M. Button, and G. Brooks, "Financial cost of healthcare fraud," Eur. Healthcare Fraud Corrupt. Netw., Brussels, Belgium. [Online]. Available: http://www.ehfcn.org/media/documents/The-Financial-Cost-of-Healthcare-FraudVFinal-%282%29.pdf
17. S. Rogers. (2011, Sep.). Big data is scaling BI and analytics. Inf. Manage. [Online]. 21(5), p. 14. Available: http://www.information-management.com/issues/21_5/big-data-is-scaling-bi-and-analytics-10021093-1.html
18. S. Soares, Smart Meters and Big Data. [Online].
Available: http://www.smartgridnews.com/artman/publish/Technologies_MDM/Smart-meters-show-Big-Data-governance-best-practices-4433.html
19. G. Raine, Power grid failure leaves 2 million in the dark. [Online]. Available: http://www.sfgate.com/news/article/Power-grid-failure-leaves-2-million-in-the-dark-3134893.php
20. S. Carter, Get Bold: Using Social Media to Create a New Type of Social Business. Upper Saddle River, NJ: IBM Press, 2012.
21. R. Scott, Social Retail. [Online]. Available: http://searchenginewatch.com/article/2192745/Social-Retail-Finding-Engaging-Cultivating-Todays-Connected-Consumer
22. Talking to Strangers: Millennials Trust People over Brands, Bazaarvoice, Austin, TX, 2012. [Online]. Available: http://www.bazaarvoice.com/files/whitepapers/BV_whitepaper_millenials.pdf
23. C. Duhigg, "How companies learn your secrets," The New York Times Co., New York. [Online]. Available: http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?=rss&pagewanted=all
24. P. Nist, The Big Data Challenge: Social Data Meets Corporate Data. [Online]. Available: http://communities.intel.com/community/openportit/server/blog/2012/03/29/the-big-data-challenge-social-data-meets-corporate-data
25. M. Palmer, Data is the new oil. [Online]. Available: http://ana.blogs.com/maestros/2006/11/data_is_the_new.html
26. A. Cavoukian and J. Jonas, "Privacy by design in the age of big data," Privacy by Design, Toronto, ON, Canada. [Online]. Available: http://privacybydesign.ca/content/uploads/2012/06/pbd-big_data.pdf
27. Data.gov Concept of Operations, OMB, New York. [Online]. Available: http://www.data.gov/sites/default/files/attachments/data_gov_conops_v1.0.pdf
28. Streetline. [Online]. Available: http://www.wired.com/autopia/2010/11/city-parking-smartens-up-with-streetline/
29. J. Hagel and J. F. Rayport, "Coming battle for customer information," Harvard Bus. Rev., Boston, MA. [Online].
Available: http://cb.hbsp.harvard.edu/cb/web/product_detail.seam;jsessionid=A641511F8D6094B1AF8431E85981183D?E=60482&R=97104-PDF-ENG&conversationId=230602
30. S. Swoyer, Big Data Integration. [Online]. Available: http://tdwi.org/Research/2012/06/TDWI-EBook-Big-Data-Integration.aspx
31. P. Malik, “Big data governance: The next frontier for the information economy,” IAIDQ, Baltimore, MD. [Online]. Available: http://iaidq.org/webinars/2012-04-24.shtml
32. M. Godinez, E. Hechler, K. Koenig, S. Lockwood, M. Oberhofer, and M. Schroeck, The Art of Enterprise Information Architecture: A Systems-Based Approach for Unlocking Business Insight. Boston, MA: IBM Press, 2010.
33. S. Soares, T. Deutsch, S. Hanna, and P. Malik, “Big data governance: A framework to assess maturity,” IBM Corp., New York. [Online]. Available: http://ibmdatamag.com/2012/04/big-data-governance-a-framework-to-assess-maturity/
34. S. Soares, “A framework that focuses on the ‘Data’ in big data governance,” IBM Corp., New York. [Online]. Available: http://ibmdatamag.com/2012/06/a-framework-that-focuses-on-the-data-in-big-data-governance/
35. S. Soares, Selling Information Governance to the Business: Best Practices by Industry and Job Function. Ketchum, ID: MC Press, 2011.
36. IBM GTO 2012 Study: Managing Uncertain Data at Scale, IBM Corp., New York, 2012. [Online]. Available: http://www.zurich.ibm.com/pdf/isl/infoportal/GTO_2012_Booklet.pdf
37. P. Malik, “Information integrity for CRM in a virtual world,” in Encyclopedia of Virtual Communities and Technologies. Hershey, PA: Idea Group, 2006, pp. 266–272.
38. W. Eckerson, “Data quality and the bottom line,” The Data Warehousing Inst., Chatsworth, CA, TDWI Rep. Ser., TDWI, 101 Commun., 2002. [Online]. Available: http://download.101com.com/pub/tdwi/Files/DQReport.pdf
39. L. English and P. Malik, Plain English about Information Quality. [Online]. Available: http://www.information-management.com/issues/20070401/1079288-1.html
40. E. Pierce, C. L. Yonke, P. Malik, and C. K. Nagaraj, “The state of information and data quality industry report 2012,” IAIDQ, Baltimore, MD. [Online]. Available: http://iaidq.org/publications/pierce-2012-11.shtml
41. S. Ruschka-Taylor, C. Evask, P. Malik, and S. Minsinger, “Transforming information integrity,” IBM Corp., New York. [Online]. Available: http://www-935.ibm.com/services/us/gbs/bus/pdf/g510-3831-transforming-enterprise-information-integrity.pdf
42. D. I. Amin and A. F. Bach, “NYSE Euronext and 100G: The drive to zero latency,” Lightwave, Nashua, NH. [Online]. Available: http://www.lightwaveonline.com/articles/print/volume-27/issue-1/applications/case-by-case/nyse-euronext-and-100g-the-drive-to-zero-latency.html
43. A. Troianovski, Networks Built on Milliseconds. [Online]. Available: http://online.wsj.com/article/SB10001424052702304065704577426500918047624.html
44. C. Harrison, B. Eckman, R. Hamilton, P. Hartswick, J. Kalagnanam, J. Paraszczak, and P. Williams, “Foundations for Smarter Cities,” IBM J. Res. & Dev., vol. 54, no. 4, pp. 1–16, Jul./Aug. 2010, paper 1.
45. IBM Global CEO Study 2012, IBM Corp., New York, 2012. [Online]. Available: ibm.com/services/us/en/c-suite/ceostudy2012
46. P. Zikopoulos, C. Eaton, T. Deutsch, D. Deroos, and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. New York: McGraw-Hill, 2012.
47. C. Stakutis and J. Webster, Inescapable Data: Harnessing the Power of Convergence. Upper Saddle River, NJ: IBM Press, 2005.
48. D. Laney, “Infonomics: The practice of information economics,” in Forbes. [Online].
Available: http://www.forbes.com/sites/gartnergroup/2012/05/22/infonomics-the-practice-of-information-economics/

Received August 1, 2012; accepted for publication August 28, 2012

Piyush Malik IBM Global Business Services, San Jose, CA 95131 USA (Piyush.malik@us.ibm.com). Mr. Malik leads the Worldwide Big Data Services Center of Excellence within the IBM Global Business Analytics and Optimization (BAO) consulting practice. Specializing in information management strategy and architecture, information quality and governance, business intelligence, master data, and advanced analytics, Mr. Malik has more than 24 years of international consulting, practice building, sales, and delivery experience with Fortune 500 clients across multiple industries. He has served as Founding Director of the IBM Global BAO Center of Competency and previously led the Information Integrity consulting practice at PricewaterhouseCoopers Management Consulting Services before it was acquired by IBM in 2002. Mr. Malik has also served on the Board of Directors of the International Association for Information and Data Quality (IAIDQ) since 2008. He received an undergraduate degree in electronics and communications engineering in 1989 and a master’s degree in management of technology from the Indian Institute of Technology, Delhi, in 1995. He is a frequent speaker at industry conferences and has authored several articles and papers.