Emerging Data Quality Trends for
Governing and Analyzing Big Data
Harald Smith
Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a focus on
data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance
and Data Integration
• Blog author: “Data Democratized”
Agenda
• Ongoing Data Challenges
• Four Emerging Data Quality Trends
• Approaches to addressing Data Quality needs
• Questions
Why is Data Quality
so important?
Data: the fuel of the future
Data is to this century, what oil was to the last one: a driver of
growth and change.
The Economist: Fuel of the future - Data is giving rise to a new economy: 6th May 2017
Flows of data have created new infrastructures, new businesses,
new monopolies, new politics and crucially new economics.
Digital information is unlike any previous resource: it is extracted,
refined, valued, bought and sold in different ways.
It changes the rules for markets and it demands new approaches
from regulators.
Many a battle will be fought over who should own, and benefit
from, data.
5 Emerging Data Quality Trends
Data impacts all areas of the business
Sales, Marketing, Finance, Legal, IT, Operations, Management
Activities shown: Analysis; Segmentation; Data compliance; Access; Scheduling; All reports!;
Competitor analysis; Sales reports; Single Customer / 360 View; Data regulation; Security; Workloads;
Aggregations; HR / recruitment; Dashboards; CRM; Content; Governance; Capacity Management;
Performance planning; Forecasting & modelling; Overall business strategy!; Performance metrics;
Campaign management; Risk; Optimization & SLA’s; Route planning; Cash flow; Territory management;
ROI; Disaster Recovery; Inventory; Contingency planning; UX
Data Governance & Quality are top of mind
3 V’s of Big Data
Volume, variety, and velocity
of data are growing
Ever more Analysis
New tools allow more
granular data dissection and
segmentation
Dichotomy in Outcomes
Expectations of data are
increasing, yet confidence in
data is falling
Governance Requirements
Broader and deeper
compliance & regulation
expectations
trust & confidence
“Get to Know Me”…
• Design and deliver rich, individualized experiences that build customer loyalty
• Increasingly broad spectrum of data sources involved in, and required for,
effectively personalizing customer experiences and targeted marketing offers
What Types of Data?
• Internal sources – often many/overlapping
• 3rd Party data – geospatial, demographics, firmographics
• Suppression data – keeping customer information updated
• New sources – mobile, social media
What Data Challenges?
• Incorporating and managing the expected exponential increase in digital
demographic data
• Tapping into customer technology histories to build and evolve an understanding
of individual customers
Use Case: 360 View of Customer
Internal Data
▪ Customer Master Data
▪ Point-of-Sale Data
▪ Contact Form Data
▪ Loyalty Program Data
▪ ecommerce Data
▪ Customer Service Data
Suppression Data
▪ Change of Address
▪ Mortality
▪ Do Not Call
Third-Party Data
▪ Age
▪ Occupation
▪ Education
▪ Gender
▪ Income
▪ Geospatial/Location
Social Data
▪ Digital demographics
▪ Sentiment
▪ Opinions
▪ Interests
▪ Social handles
Protect Financial Assets and Ensure Compliance
• Flag credit card fraud in real time
• Identify and report on money laundering
What Types of Data?
• Internal sources – often many/overlapping
• Suppression data – keeping customer information updated
• Mobile data – devices, locations
• New sources – social media, 3rd party data, …
What Data Challenges?
• Fraudulent transaction detection requires:
• Huge volumes of customer profile data
• Recent transaction activity with “last known” values
• Device data with geolocation and time-based tagging
• Data used to refine Machine Learning models (e.g., anomaly detection,
implausible behavior analysis) to review new transactions in real time
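One form of "implausible behavior analysis" can be sketched as a travel-speed check between the "last known" location and a new transaction. This is a minimal illustration, not the deck's method: the record layout (`lat`, `lon`, `time`), the function names, and the 900 km/h threshold are all assumptions made for the example.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_implausible(last_known, new_txn, max_speed_kmh=900.0):
    """Flag a transaction if reaching it from the last known location
    would require a speed above max_speed_kmh (an assumed threshold)."""
    hours = (new_txn["time"] - last_known["time"]).total_seconds() / 3600.0
    km = haversine_km(last_known["lat"], last_known["lon"],
                      new_txn["lat"], new_txn["lon"])
    if hours <= 0:
        # Same instant or out-of-order events: only a change of place is suspicious.
        return km > 1.0
    return km / hours > max_speed_kmh

# Hypothetical events: a card used in New York, then in London one hour later.
last_known = {"lat": 40.7128, "lon": -74.0060, "time": datetime(2019, 1, 1, 12, 0)}
new_txn = {"lat": 51.5074, "lon": -0.1278, "time": datetime(2019, 1, 1, 13, 0)}
flagged = is_implausible(last_known, new_txn)
```

The check only works if the "last known" values are current and the device timestamps and coordinates are trustworthy, which is why the data quality of these fields matters before any model sees them.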
Use Case: Anti-Fraud/Anti-Money Laundering
Internal Data
▪ Customer Master Data
▪ Point-of-Sale Data
▪ Contact Form Data
▪ Loyalty Program Data
▪ ecommerce Data
▪ Customer Service Data
Mobile Data
▪ Device
▪ Location
▪ Wearables
▪ Mobile wallets
Suppression Data
▪ Change of Address
▪ Mortality
▪ Do Not Call
Social Data
▪ Digital Demographics
▪ Sentiment
▪ Opinions
▪ Interests
▪ Social handles
Only 35% of senior executives have a high
level of trust in the accuracy of
their Big Data Analytics
KPMG 2016 Global CEO Outlook
92% of
executives are concerned about
the negative impact of data and
analytics on corporate
reputation
KPMG 2017 Global CEO Outlook
80% of AI/ML projects are stalling due
to poor data quality
Dimensional Research, 2019
Big Data
Needs
Data Quality
“Societal trust in business is
arguably at an all-time low
and, in a world increasingly
driven by data and technology,
reputations and brands are
ever harder to protect.”
EY “Trust in Data and Why it Matters”, 2017.
The importance of data quality
in the enterprise:
• Decision making – Trust the data
that drives your business
• Customer centricity – Get a single,
complete and accurate view of your
customer for better sales, marketing
and customer service
• Compliance – Know your data, and
ensure its accuracy to meet industry
and government regulations
• Machine learning & AI – High quality
models require training on high
quality, accurate data
Four Emerging
Data Quality Trends
Four Emerging Data Quality Trends
All the traditional DQ issues remain, but now consider:
1. New DQ considerations for new types of data
2. New application considerations (e.g. Machine learning)
3. Processing at scale/meeting SLAs
4. Data Democratization and resource/knowledge constraints
1. New Data, New Measures
Common Data Quality Problems
All the traditional data quality issues
remain, but now at greater scale and
in more places
• Many data records with different layouts
• Inconsistent data formats (number
formatting, measurements, languages,
postal conventions and dates)
• Lack of standardization of the different
fields
• Names spelled differently, partially entered,
or multiple names provided
• Misspellings and keystroke errors
• Data sourced from third parties does not
contain all the necessary fields or is out-of-date
• Invalid values: codes, reference data, out-of-range,
future dates
Lack of Standardization
Common Data Quality Measurements
What measures can we take advantage of?
• Completeness – Are the relevant fields populated?
• Integrity – Does the data maintain an internal structural
integrity or a relational integrity across sources?
• Uniqueness – Are keys or records unique?
• Validity – Does the data have the correct values?
• Code and reference values
• Valid ranges
• Valid value combinations
• Consistency – Is the data at consistent levels of aggregation
or does it have consistent valid values over time?
• Timeliness – Did the data arrive in a time period
that makes it useful or usable?
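Several of the measures above reduce to simple ratios over a set of records. The sketch below shows one possible way to compute them; the call records, field names, valid-value rules, and the one-day timeliness window are invented for illustration and are not taken from the deck.

```python
from datetime import datetime, timedelta

def completeness(records, field):
    """Share of records where the field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def uniqueness(records, key):
    """Distinct key values divided by the number of records."""
    values = [r.get(key) for r in records]
    return len(set(values)) / len(values)

def validity(records, field, is_valid):
    """Share of populated values that pass the is_valid rule."""
    values = [r[field] for r in records if r.get(field) not in (None, "")]
    if not values:
        return 1.0
    return sum(1 for v in values if is_valid(v)) / len(values)

def timeliness(records, field, as_of, max_age):
    """Share of records that arrived within max_age of the as_of time."""
    return sum(1 for r in records if as_of - r[field] <= max_age) / len(records)

# Hypothetical call center records.
calls = [
    {"call_id": 1, "duration": 120, "center": "CHI", "received": datetime(2019, 6, 1, 12)},
    {"call_id": 2, "duration": 0, "center": "CHI", "received": datetime(2019, 6, 1, 12)},
    {"call_id": 2, "duration": 45, "center": "", "received": datetime(2019, 6, 1, 12)},
    {"call_id": 4, "duration": -5, "center": "XX", "received": datetime(2019, 5, 30, 12)},
]

scores = {
    "complete(center)": completeness(calls, "center"),
    "unique(call_id)": uniqueness(calls, "call_id"),
    "valid(duration)": validity(calls, "duration", lambda d: d > 0),
    "timely(received)": timeliness(calls, "received", datetime(2019, 6, 2), timedelta(days=1)),
}
```

Whether a duration of 0 counts as invalid is a business decision, not a technical one; the rule passed to `validity` is where that decision is recorded.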
Example: Call Center Record
Unique ✓
Integrity ✓
Complete ?
Consistent ✓
Timely ✓
Valid ?
Is Duration = 0 important?
Is 01/01/20xx a defaulted date?
And how will this be linked or
connected with my other data?
The file appears complete, but does
it cover all call centers?
Example: Social Media Feed
Unique?
Integrity?
Complete?
Consistent?
Timely?
Valid?
New Data Quality Problems
New data, new data quality challenges
• 3rd Party and external data with unknown provenance or relevance
• Bias in the data – whether in collection, extraction, or other processing
• Data without standardized structure or formatting
• Continuously streaming data
• Disjointed data (e.g. gaps in receipt)
• Consistency and verification of data sources
• Changes and transformation applied to data (i.e. does it really represent the
original input)
“34 percent of bankers in our survey report that their
organization has been the target of adversarial AI at least
once, and 78 percent believe automated systems create new
risks, such as fake data, external data manipulation, and
inherent bias.”
Accenture Banking Technology Vision 2018
What else can we review or measure?
Provenance – Where did the data originate, who gathered it, and what criteria were used to create it?
• E.g. government agency, 3rd party provider, free or paid data
Coverage (Relevance) – How well does the data source meet the defined needs?
• E.g. does it cover the relevant geography? Is it biased (and if so, how)?
Continuity – Are there data points for all intervals or expected intervals?
• E.g. sensors, weather records, call data records
Triangulation – What Gartner describes as ‘consistency of data across proximate data points’, i.e. consistent measurements from
related points of reference.
• E.g. if temperatures in Chicago and Louisville are 30° and 32°, then the temperature in Indianapolis for the same day is unlikely to be 70°
Transformation from origin – how many layers and/or changes has the data passed through?
• E.g. has the original data source already been merged with two other record sources? And is the result accurate?
Repetition or duplication of data patterns – Data points exactly the same across multiple recording intervals or across multiple
sensors.
• E.g. is there tampering with sensors or call data?
Additional Measures of Data Quality
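Three of these measures (continuity, triangulation, and repeated patterns) lend themselves to small mechanical checks. The following is a rough sketch under assumed inputs: the function names, the neighbor map, and the tolerance are hypothetical, and the temperature values reuse the Chicago/Louisville/Indianapolis example above.

```python
from statistics import median

def continuity_gaps(timestamps, expected_step):
    """Return (start, end) pairs where consecutive readings are further
    apart than expected_step, i.e. intervals with missing data points."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > expected_step]

def triangulation_outliers(readings, neighbors, tolerance):
    """Flag locations whose reading differs from the median of their
    listed neighbors' readings by more than tolerance."""
    flagged = []
    for place, value in readings.items():
        nearby = [readings[n] for n in neighbors.get(place, []) if n in readings]
        if nearby and abs(value - median(nearby)) > tolerance:
            flagged.append(place)
    return flagged

def repeated_runs(values, min_run):
    """Return (start_index, length) for runs of identical consecutive values
    at least min_run long, a possible sign of a stuck or tampered sensor."""
    runs, start = [], 0
    for i in range(1, len(values) + 1):
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run:
                runs.append((start, i - start))
            start = i
    return runs

# Hourly readings with hours 3 and 4 missing.
gaps = continuity_gaps([0, 1, 2, 5, 6], expected_step=1)

# Only Indianapolis is given neighbors here, to keep the example small.
temps = {"Chicago": 30, "Louisville": 32, "Indianapolis": 70}
suspect = triangulation_outliers(temps, {"Indianapolis": ["Chicago", "Louisville"]}, tolerance=15)

stuck = repeated_runs([5, 5, 5, 7, 8, 8], min_run=3)
```

A flagged value is a prompt for review rather than proof of an error: a real cold front or a genuinely steady sensor would trip the same checks.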
Example: New Data Quality Measures applied
Triangulated
Continuity
Provenance
Coverage
Usage
Repeated
patterns
Transformation
Jane Doe pulled from
Twitter based on
#Blackberry
All items for #Blackberry in
relevant time interval
appear to be included
Marketing confirms this
data has high value
Good association with
current product & sales
data
All tweets appear
unique within the date
& vs. prior feeds
This needed to include
#BB and #Crackberry as
well!
No changes or merges of
the data were applied
2. Machine Learning & Data Quality
“The magic of machine learning is that you build a
statistical model based on the most valid dataset for
the domain of interest.
If the data is junk, then you’ll be building a junk
model that will not be able to do its job.”
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
Common Machine Learning Applications
Marketing
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
Risk Management
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Know your customer
Data Challenges with Machine Learning
Five Big Challenges of Enabling Machine Learning
1. Scattered and Difficult to Access Datasets
Much of the necessary data is trapped in mainframes or streams in from POS systems and ATMs in incompatible formats, making it difficult to gather
and prepare the data for model training.
2. Data Cleansing at Scale
Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools are not designed to work on that scale of data.
3. Entity Resolution and Customer Identification
Distinguishing matches across massive datasets that indicate a single specific entity requires sophisticated multi-field matching algorithms and a lot of
compute power. Essentially everything has to be compared to everything.
4. Need for Near Real-Time Current Data
Tracking and detection need to happen very rapidly. Current transactions need to be constantly added to combined datasets, prepared, and presented
to models as close to real-time as possible.
5. Tracking Lineage from the Source
Data changes made to help train models have to be exactly duplicated in production, in order for models to accurately make predictions on new data,
and for required audit trails. Capture of complete lineage, from source to end point is needed.
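To make challenge 3 concrete: comparing everything to everything grows quadratically, so matching is commonly restricted to records that share a "blocking" key. The sketch below illustrates that idea with multi-field fuzzy scoring from Python's standard `difflib`. The record fields, the postal-code blocking key, the equal field weights, and the 0.85 threshold are all assumptions for the example, not the approach of any particular product.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def normalize(text):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(cleaned.split())

def similarity(a, b):
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def candidate_pairs(records, block_key):
    """Yield only pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)

def match_records(records, block_key, threshold=0.85):
    """Score candidate pairs on name and address; keep pairs above threshold."""
    matches = []
    for a, b in candidate_pairs(records, block_key):
        score = (similarity(a["name"], b["name"])
                 + similarity(a["address"], b["address"])) / 2
        if score >= threshold:
            matches.append((a["id"], b["id"], round(score, 2)))
    return matches

# Hypothetical customer records.
customers = [
    {"id": 1, "name": "Jane Doe", "address": "12 Main St.", "zip": "60601"},
    {"id": 2, "name": "Jane  Doe", "address": "12 Main St", "zip": "60601"},
    {"id": 3, "name": "John Smith", "address": "99 Oak Ave", "zip": "60601"},
    {"id": 4, "name": "Jane Doe", "address": "12 Main St", "zip": "10001"},
]
matches = match_records(customers, block_key=lambda r: r["zip"])
```

Blocking cuts the four records from six possible comparisons to three, but record 4 is never compared with records 1 and 2 because its postal code differs. That trade-off is one reason entity resolution at scale tends to need multiple passes with different keys.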
Data Quality Challenges with Machine Learning
Incorrect, Incomplete, Mis-Formatted, and Sparse “Dirty Data” – Mistakes
and errors are almost never the patterns you’re looking for in a data set.
Sparse data generates other issues. Correcting and standardizing will tend
to boost the signal, but must account for bias.
Missing context – Many data sources lack context around location or
population segments. Unless enriched with other data sets, (e.g.
geospatial, demographics, or firmographics data), some ML algorithms
will not be usable.
Multiple copies – If your data comes from many sources, as it often does,
it may contain multiple records of information about the same person,
company, product or other entity. Removing duplicates and enhancing the
overall depth and accuracy of knowledge about a single entity can make a
huge difference.
Spurious correlations – Just as missing context may hinder some ML
algorithms, inclusion of already correlated data (e.g. city and postal code)
may result in overfitting of ML algorithms.
Correcting data problems vastly increases a data set’s usefulness for machine learning.
Traditional data quality processes are an
effective method to remove defects.
However, traditional data quality software is
designed to work on smaller data sets.
And data analysts may not be aware of
specific data quality issues that must be
addressed to support machine learning.
Example: Missing segments of populations
Event: Hurricane Sandy
20 million tweets
• Majority of tweets came from Manhattan, not from
hard-hit areas such as Seaside Heights and Midland
Beach, where power outages and draining cell phone
batteries limited tweeting
• Despite the millions of Spanish-speakers
affected, very few Spanish-language tweets
collected
• Assess % across and against all likely
locations
• Seek out disconfirming information
Data: Boston Potholes
Street Bump App
• Draws on accelerometer and GPS data to help
passively detect potholes
• Lower income groups in the US are less likely to
have smartphones, particularly older residents -
penetration as low as 16%
• Result is underreporting of road problems in
more elderly communities
• Assess % across all likely locations
• Add other sources
• Utilize demographics for evaluations
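"Assess % across all likely locations" can be read as comparing each segment's share of the collected data with its expected share of the affected population. The sketch below is one way to express that; the segment names, counts, population shares, and the 0.5 cut-off are made-up illustration values, not figures from either example.

```python
def coverage_gaps(sample_counts, population_share, min_ratio=0.5):
    """Return segments whose share of the sample is less than min_ratio
    times their expected share of the population."""
    total = sum(sample_counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    gaps = {}
    for segment, expected in population_share.items():
        observed = sample_counts.get(segment, 0) / total
        ratio = observed / expected
        if ratio < min_ratio:
            gaps[segment] = round(ratio, 2)
    return gaps

# Hypothetical: records collected per area vs. each area's population share.
collected = {"area_a": 800, "area_b": 20, "area_c": 180}
expected_share = {"area_a": 0.5, "area_b": 0.2, "area_c": 0.3}
underrepresented = coverage_gaps(collected, expected_share)
```

Here `area_b` supplies 2% of the records against an expected 20%, a ratio of 0.1. The expected shares have to come from an independent source such as demographic data, which is where the "add other sources" advice comes in.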
Example: Noise, or Inserted content
“Bots are just a tool for making the
numbers look how you want them
to look.”
Sam Woolley
Researcher, Oxford University’s
Project on Computational
Propaganda
Wired: Nov 8, 2016
“The Political Twitter Bots Will Rage This Election Day”
Event: Election
Bot tweets
• ~400,000 bots tweeting on the election
• ~20% of all election-related tweets came from an army of influential
bots
• 55-80% of Twitter activity—the likes, follows, and retweets —are
from bots
• It had been easier to identify earlier bots, but now it’s incredibly
difficult for a human to make a determination
• Evaluate patterns
• Is there any real sentiment here?
• How much repetitive content is there?
• How much “influence” comes from a single or a
few sources (negative or positive)?
• Will it skew the analysis?
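Two of the questions above, how much content is repetitive and how much comes from a few sources, can be approximated with simple counts. This is only a sketch: the normalization rule and the sample feed are assumptions, and real bot detection needs far more signals than these two ratios.

```python
from collections import Counter

def repetition_rate(messages):
    """Share of messages whose normalized text appears more than once."""
    normalized = [" ".join(m.lower().split()) for m in messages]
    counts = Counter(normalized)
    return sum(1 for m in normalized if counts[m] > 1) / len(normalized)

def top_source_share(authors, top_n=1):
    """Share of all messages posted by the top_n most active accounts."""
    counts = Counter(authors)
    return sum(n for _, n in counts.most_common(top_n)) / len(authors)

# Hypothetical feed: (author, text) pairs.
feed = [
    ("acct_a", "Vote now!"),
    ("acct_a", "vote  NOW!"),
    ("acct_b", "I like pie"),
    ("acct_a", "Vote now!"),
]
repeat_share = repetition_rate([text for _, text in feed])
concentration = top_source_share([author for author, _ in feed])
```

High values on both measures suggest that sentiment scores from the feed reflect a handful of accounts rather than a population, which is the skew the last bullet asks about.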
Example: Simple bias
“The “black sheep problem” is that if you
were to try to guess what color most sheep
were by looking [at] language data, it would
be very difficult for you to conclude that
they weren't almost all black. In English,
“black sheep” outnumbers “white sheep”
about 25:1 (many "black sheeps” are movie
references); in French it's 3:1; in German it's
12:1. Some languages get it right; in Korean
it's 1:1.5 in favor of white sheep…”
Hal Daumé
Associate Professor, University of Maryland
Blog: June 24, 2016
“Language bias and black sheep”
http://nlpers.blogspot.com/2016/06/language-bias-and-black-sheep.html
Data: Google Word2Vec data set
Word2vec
• Converts words into a vector space for analysis
• “Numerous researchers have begun to use the data to better understand
everything from machine translation to intelligent Web searching.”
• Embeddings for about 3 million words and phrases, trained on text from Google News
• Researchers from Boston University and Microsoft have found it is
“blatantly sexist”
• Impacts the ability to create personalized services
• Evaluate % of words & associations
• How do I interpret a sentiment?
• Does this data set contain hidden and
unexpressed bias?
• Will I miss opportunities because of hidden
assumptions?
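The "black sheep" ratio quoted above is a phrase count, and the same kind of count is a cheap first check on any text corpus before training on it. A minimal sketch, assuming a small in-memory list of texts and a hypothetical helper name:

```python
import re

def phrase_counts(texts, phrase_a, phrase_b):
    """Count whole-phrase, case-insensitive occurrences of two phrases."""
    def count(phrase):
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        return sum(len(pattern.findall(t)) for t in texts)
    return count(phrase_a), count(phrase_b)

# Made-up sentences, only to show the mechanics.
corpus = [
    "The black sheep of the family.",
    "A Black Sheep and a white sheep.",
    "black sheeps?",
]
black, white = phrase_counts(corpus, "black sheep", "white sheep")
```

The word-boundary match deliberately skips "black sheeps". Whether such variants should count is exactly the sort of interpretation question the slide raises.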
3. Data Quality at Scale
Challenges To Ensuring
Data Quality
Many sources of data (70%) and volume of data (48%)
are among the top 3 challenges companies face when
ensuring high quality data.
Applying governance processes to manage and measure
data quality is second with 50%.
* Syncsort, 2019 Enterprise Data Quality survey
What are the greatest challenges you face
when ensuring high data quality?
• 70% – Many sources of data
• 50% – Applying governance processes to manage and measure data…
• 48% – Volume of data
• 47% – Inconsistent formats of data
• 46% – Inconsistent definitions of data
• 43% – Missing information
• 32% – Connecting policies and rules to data
• 27% – Misfielded data
• 27% – Lack of skills/staff
• 25% – Lack of tools (or inadequate tools)
• 15% – Not seen as an organizational priority
Processing at Scale
New Data Quality considerations
• Handling data volumes and distributed data
• Profiling data – assessing high volumes and streaming data
• Standardizing and enriching data content
• Matching entities – not just master data – e.g. transactions for fraud detection
• Meeting Service Level Agreements (SLA’s)
• Running consistently on new and regularly changing platforms (Hadoop,
Spark, Cloud)
Big Data at scale distributes data across many nodes –
not necessarily with other relevant data!
• Data Quality functions must be performed in a consistent
manner, no matter where actual processing takes place, how
the data is segmented, and what the data volume is
• Cleansing, standardization, and data validation will generally scale
linearly
• Data Enrichment: Reference data, lookups must be readily
accessible by any process wherever executed
Handling distributed data volumes
Source: HP Analyst Briefing
• But particular implications for profiling, joining, sorting, and
matching data
• Profiling: Identification of outliers necessitates full volume views
and need to aggregate statistics and frequencies of data
distributed across cluster
• Joins & sorts: Efficient shuffling of data stored across cluster is
critical
• Entity Resolution: Distinguishing matches that indicate a single
specific entity across so much data requires multiple passes with
sophisticated multi-field matching algorithms – with results that
are understandable by business users in order to be meaningful
Handling distributed data volumes
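The profiling point above, that full-volume statistics must be aggregated from data spread across a cluster, depends on keeping per-partition summaries that can be merged. The sketch below shows the idea in plain Python rather than on Spark or MapReduce; the class and function names are hypothetical, and the sum-of-squares variance is a simple but numerically naive choice.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from functools import reduce

@dataclass
class PartialProfile:
    """Mergeable summary of one numeric column on one partition."""
    count: int = 0
    nulls: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    freq: Counter = field(default_factory=Counter)

    def add(self, value):
        self.count += 1
        if value is None:
            self.nulls += 1
            return
        self.total += value
        self.total_sq += value * value
        self.freq[value] += 1

    def merge(self, other):
        """Combine two partial summaries without revisiting the raw data."""
        return PartialProfile(self.count + other.count, self.nulls + other.nulls,
                              self.total + other.total, self.total_sq + other.total_sq,
                              self.freq + other.freq)

    def mean(self):
        return self.total / (self.count - self.nulls)

    def stddev(self):
        n = self.count - self.nulls
        return math.sqrt(max(self.total_sq / n - self.mean() ** 2, 0.0))

def profile_partition(values):
    """Build the summary for one partition (would run on each node)."""
    profile = PartialProfile()
    for v in values:
        profile.add(v)
    return profile

# Two hypothetical partitions of the same column.
partitions = [[1, 2, None], [2, 3]]
overall = reduce(PartialProfile.merge, (profile_partition(p) for p in partitions))
```

Only the small summaries move between nodes. An outlier can then be judged against the merged mean and standard deviation instead of against whatever values happened to land on the same node.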
Anti-Money
Laundering on
Hadoop at
Global Bank
• Must provide cluster-native
data verification,
enrichment, and demanding
multi-field fuzzy matching for
entity resolution to Golden
Record
• Massive data volumes
• Scattered data – Mainframe,
RDBMS, Cloud, …
• Must be secure – Kerberos,
LDAP
• Must have lineage – data
origin to end point
• Must archive unaltered
mainframe data
Full Anti-Money Laundering
regulatory compliance with
financial crimes data lake –
high performance
results at massive scale.
• Full end-to-end data lineage
supplied to Apache Atlas and ASG
Data Intelligence
• Cluster-native data verification,
enrichment, and demanding
multi-field entity resolution on
Spark
• Unmodified mainframe “Golden
Records” stored on Hadoop
Bank must monitor transactions to
detect Money Laundering for FCA
compliance.
Leverage Machine learning at scale
to detect patterns, but …
Requires large amounts of current,
clean data.
4. Data Literacy / Democratization
Data Democratization
Data Quality is a key component to user empowerment
• Data Literacy - critical to understand:
• Business context and language
• Data (including data structures and data types)
• Data access (how and where to find)
• Data usage (how will the data be used by the business)
• Basic Statistics
• Data Quality dimensions
• Data Quality techniques and tools
• Resource constraints – in both Data Quality and technologies
• What questions to ask?
• Where to find answers?
Approaches to Addressing
Emerging Data Quality Trends
Approaches
Data Literacy / Data Governance
• Communicating Best Practices in Data Quality for everyone
“Universal” Data Quality Best Practices
• Establish Scope: ask core questions
• Identifying data requirements
• Address bias
• Understand context
• Address and resolve data quality issues
• Apply data governance processes
Solving “Big Data” Data Quality Challenges
• Handle scale
• Ensure consistent data quality application
across platforms
Culture of Data Literacy
• “Democratization of Data” requires cultural support
• Empowered to ask questions about the data
• Trained to understand and use data
• Trained to understand how to approach and evaluate data quality
• Traditional data, new data, machine learning requirements, …
• Understand the business context of the data
Program of Data Governance
• Provide the processes and practices necessary for success
• Measure, monitor, and improve
• Continuous iteration and development
Center of Excellence/Knowledge Base
• Where do you go to find answers?
• Who can help show you how?
Communicate!
Data Literacy: challenges & best practices
• Lack of Common Terminology
• Organizational Barriers & Silos
• Isolated or Unknown Work
• Lack of Engagement
Establish a Common Language
• Define terminology – a ‘stake in the ground’
• Map information
• Support with policies/standards
Gain Broader Buy In
• Bring stakeholders together
• Build the structure, culture,
ownership, steering groups,
stewardship over time
Enrich Information
• Discover what you don’t know
• Resolve differences
• Enhance/annotate to increase insight
Share Insights Regularly
• Produce and share tangible outcomes
• Highlight ‘wins’
• Demonstrate efficiencies & savings
Copyright © Syncsort 2019
“If you don’t know what you want to
get out of the data, how can you
know what data you need – and
what insight you’re looking for?”
Wolf Ruzicka
Chairman of the Board at EastBanc
Technologies
Blog post: June 1, 2017
“Grow A Data Tree Out Of The “Big Data” Swamp”
Establish Scope
• Understand the business objective and problem
• Ask the “right questions” about your data (not just “what”
and “how”)
• Empower users (“Who”) to gain new clarity into the core
problem (“Why”)
• “High-quality data” definition will vary by business problem
Identify Requirements & Processes
• Do you have all the data required?
• Do you understand the characteristics and context of the data?
• How will data be matched, consolidated, or connected?
• What’s needed to facilitate the matching, consolidation, or
connection required?
• Have you evaluated the sources?
• What’s the Fitness for your Purpose?
Universal Data Quality best practices
Understand Context
• What are the Critical Data Elements?
• What qualities do we need to address, or leave alone?
• When, and where, do we need to transform or enrich the data
content?
• How are we connecting, relating, or combining data?
Develop, Test, and Deploy Corrective Measures
• Consistent application of standardization, transformation,
enrichment, and entity resolution
• Common templates, rules, metrics, and processes that can be
leveraged
• Deploy into batch, real-time, or embedded services
Apply Data Governance
• Deploy and implement metrics and measures for ongoing
assessment and evaluation
Universal Data Quality best practices
“Never lead with a data set;
lead with a question.”
Anthony Scriffignano
Chief Data Scientist, Dun & Bradstreet
Forbes Insights, May 31, 2017
“The Data Differentiator”
Quantify: challenges & best practices
• Hidden Activities
• Money, Time and Resource
Waste
• Lack of Transparency and Trust
• Disconnect Between Process
and Measures
Identify Baseline Measures
• Keep a focus on lean and agile
• Define value accurately for the business
Link to Business Performance
• Create and refine streams of value
• Transform culture through action
and empowerment
Monitor, Report and Remediate Issues
• Continuously review
• Ensure issues are visible and understood
• Understand root causes
• Address/resolve issues
Quantify Impact of Changes
• Demonstrate through clearly understood measures
• Establish value continuously
• Finish early, finish often
Leverage tools built for Big Data
• Focus on the data quality challenges, not the Big Data ones
• Connect to and process hundreds of millions of records of data
• Standardize, enhance, and match international data sets with postal and
country-code validation
• Integrate, enrich, and match new and legacy customer data from multiple
disparate sources
• Deploy data quality workflows as native, parallel MapReduce or Spark
processes for optimal efficiency on premises or in the Cloud
• Increase processing efficiency by expanding cluster, not rebuilding
processes
• Support failover through fault-tolerant designs; during a node failure,
processing is redirected to another node
Simplify: Design Once, Deploy Anywhere
Intelligent Execution - Insulate your organization from underlying complexities of Big Data
Get excellent performance every time
without tuning, load balancing, etc.
Avoid re-design, re-compile, re-work
• Future-proof job designs for emerging compute
frameworks
• Move from dev to test to production
• Move from on-premises to Cloud
• Move from one Cloud to another
Use existing Data Quality skills
• Focus on data quality problems, not technical ones
Design Once
in visual GUI
Deploy Anywhere!
On-Premises,
Cloud
MapReduce, Spark,
Future Platforms
Windows, Linux,
Unix
Batch,
Streaming
Single Node,
Cluster
Data Quality remains Data Quality, even at scale
“Data and analytics leaders need to understand the
business priorities and challenges of their organization.
Only then will they be in the right position to create
compelling business cases that connect data quality
improvement with key business priorities.”
Ted Friedman
VP Distinguished Analyst, Gartner
Smarter with Gartner at Gartner.com: June 12, 2018
“How to Create a Business Case for Data Quality Improvement”
Q&A
harald.smith@syncsort.com

Más contenido relacionado

La actualidad más candente

Accelerating Personalization to Cut Through Digital Noise
Accelerating Personalization to Cut Through Digital NoiseAccelerating Personalization to Cut Through Digital Noise
Accelerating Personalization to Cut Through Digital NoisePrecisely
 
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on TrackYour AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on TrackPrecisely
 
Most Marketers Unaware of What Digital ROI Means, Fail to Measure it Appropri...
Most Marketers Unaware of What Digital ROI Means, Fail to Measure it Appropri...Most Marketers Unaware of What Digital ROI Means, Fail to Measure it Appropri...
Most Marketers Unaware of What Digital ROI Means, Fail to Measure it Appropri...Path of the Blue Eye Project
 
Big Data is Here for Financial Services White Paper
Big Data is Here for Financial Services White PaperBig Data is Here for Financial Services White Paper
Big Data is Here for Financial Services White PaperExperian
 
Building an Effective Data Governance Framework
Building an Effective Data Governance FrameworkBuilding an Effective Data Governance Framework
Building an Effective Data Governance FrameworkEnsighten
 
Lecture notes on being Data-Driven and doing Data Science
Lecture notes on being Data-Driven and doing Data Science Lecture notes on being Data-Driven and doing Data Science
Lecture notes on being Data-Driven and doing Data Science Johan Himberg
 
Poster presetation for "Using Big Data for Marketing Analytics"
Poster presetation for "Using Big Data for Marketing Analytics"Poster presetation for "Using Big Data for Marketing Analytics"
Poster presetation for "Using Big Data for Marketing Analytics"Touseef Ahmed
 
Waters USA 2013: Data Leaders vs. Data Laggards
Waters USA 2013: Data Leaders vs. Data LaggardsWaters USA 2013: Data Leaders vs. Data Laggards
Waters USA 2013: Data Leaders vs. Data LaggardsState Street
 
Company Evolution – Evolving Beyond the Traditional Scope Through Data Moneti...
Company Evolution – Evolving Beyond the Traditional Scope Through Data Moneti...Company Evolution – Evolving Beyond the Traditional Scope Through Data Moneti...
Company Evolution – Evolving Beyond the Traditional Scope Through Data Moneti...Molly Alexander
 
In the Absence of Fact - Stephen Harris
In the Absence of Fact - Stephen HarrisIn the Absence of Fact - Stephen Harris
In the Absence of Fact - Stephen HarrisMolly Alexander
 
Hidden security and privacy consequences around mobility (Infosec 2013)
Hidden security and privacy consequences around mobility (Infosec 2013)Hidden security and privacy consequences around mobility (Infosec 2013)
Hidden security and privacy consequences around mobility (Infosec 2013)Huntsman Security
 
Understanding the impact of your fraud strategy
Understanding the impact of your fraud strategy Understanding the impact of your fraud strategy
Understanding the impact of your fraud strategy European Merchant Services
 
Emergence of Big Data in Digital Marketing
Emergence of Big Data  in Digital MarketingEmergence of Big Data  in Digital Marketing
Emergence of Big Data in Digital MarketingKrishnan Parasuraman
 
Smarter analytics101 v2.0.1
Smarter analytics101 v2.0.1Smarter analytics101 v2.0.1
Smarter analytics101 v2.0.1Jenawahl
 
Integrate Your Data Science & Omni-channel Strategy to Reduce Cost and Increa...
Integrate Your Data Science & Omni-channel Strategy to Reduce Cost and Increa...Integrate Your Data Science & Omni-channel Strategy to Reduce Cost and Increa...
Integrate Your Data Science & Omni-channel Strategy to Reduce Cost and Increa...Molly Alexander
 
Big Data: The Road to Know More About Your Business
Big Data:  The Road to Know More About Your BusinessBig Data:  The Road to Know More About Your Business
Big Data: The Road to Know More About Your BusinessOAUGNJ
 

Emerging Data Quality Trends for Governing and Analyzing Big Data

  • 1. Emerging Data Quality Trends for Governing and Analyzing Big Data Harald Smith
  • 2. Speaker: Harald Smith
      • Director of Product Marketing, Syncsort
      • 20+ years in Information Management with a focus on data quality, integration, and governance
      • Co-author of Patterns of Information Management
      • Author of two Redbooks on Information Governance and Data Integration
      • Blog author: “Data Democratized”
  • 3. Agenda
      • Ongoing Data Challenges
      • Four Emerging Data Quality Trends
      • Approaches to addressing Data Quality needs
      • Questions
  • 4. Why is Data Quality so important?
  • 5. Data: the fuel of the future
      • Data is to this century, what oil was to the last one: a driver of growth and change.
      • Flows of data have created new infrastructures, new businesses, new monopolies, new politics and crucially new economics.
      • Digital information is unlike any previous resource: it is extracted, refined, valued, bought and sold in different ways.
      • It changes the rules for markets and it demands new approaches from regulators.
      • Many a battle will be fought over who should own, and benefit from, data.
    Source: The Economist, “Fuel of the future: Data is giving rise to a new economy”, 6th May 2017
  • 6. Data impacts all areas of the business
    Business areas: Sales, Marketing, Finance, Legal, IT, Operations, Management
    Examples shown on the slide: Analysis Segmentation Data compliance Access Scheduling All reports! Competitor analysis Sales reports Single Customer / 360 View Data regulation Security Workloads Aggregations HR / recruitment Dashboards CRM Content Governance Capacity Management Performance planning Forecasting & modelling Overall business strategy! Performance metrics Campaign management Risk Optimization & SLAs Route planning Cash flow Territory management ROI Disaster Recovery Inventory Contingency planning UX
  • 7. Data Governance & Quality are top of mind
      • 3V’s of Big Data: Volume, variety, and velocity of data are growing
      • Ever more Analysis: New tools allow more granular data dissection and segmentation
      • Dichotomy in Outcomes: Expectations of data are increasing, yet confidence in data is falling
      • Governance Requirements: Broader and deeper compliance & regulation expectations
    Trust & confidence
  • 8. Use Case: 360 View of Customer
    “Get to Know Me”…
      • Design and deliver rich, individualized experiences that build customer loyalty
      • Increasingly broad spectrum of data sources involved in, and required for, effectively personalizing customer experiences and targeted marketing offers
    What Types of Data?
      • Internal sources – often many/overlapping
      • 3rd Party data – geospatial, demographics, firmographics
      • Suppression data – keeping customer information updated
      • New sources – mobile, social media
    What Data Challenges?
      • Incorporating and managing the expected exponential increase in digital demographic data
      • Tapping into customer technology histories to build and evolve an understanding of individual customers
    Internal Data
      ▪ Customer Master Data
      ▪ Point-of-Sale Data
      ▪ Contact Form Data
      ▪ Loyalty Program Data
      ▪ ecommerce Data
      ▪ Customer Service Data
    Suppression Data
      ▪ Change of Address
      ▪ Mortality
      ▪ Do Not Call
    Third-Party Data
      ▪ Age
      ▪ Occupation
      ▪ Education
      ▪ Gender
      ▪ Income
      ▪ Geospatial/Location
    Social Data
      ▪ Digital demographics
      ▪ Sentiment
      ▪ Opinions
      ▪ Interests
      ▪ Social handles
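As a rough illustration of how suppression data keeps customer information updated, the sketch below applies the three suppression sources named on the slide to a list of customer records. The record layout and every name in it (`customer_id`, `address`, `callable`, `apply_suppression`) are assumptions made for this example; they do not come from the presentation or from any particular product.

```python
def apply_suppression(customers, change_of_address, deceased_ids, do_not_call_ids):
    """Return cleaned copies of customer records (hypothetical layout).

    customers          list of dicts with "customer_id" and "address" keys
    change_of_address  dict mapping customer_id -> new address
    deceased_ids       set of customer_ids found in mortality data
    do_not_call_ids    set of customer_ids on a Do Not Call list
    """
    cleaned = []
    for customer in customers:
        cid = customer["customer_id"]
        if cid in deceased_ids:
            # Mortality data: drop the record from outreach entirely.
            continue
        record = dict(customer)  # copy so the input list is left untouched
        if cid in change_of_address:
            # Change of Address data: replace the stale address.
            record["address"] = change_of_address[cid]
        # Do Not Call data: keep the record but mark it as not callable.
        record["callable"] = cid not in do_not_call_ids
        cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    customers = [
        {"customer_id": "C1", "address": "1 Old Road"},
        {"customer_id": "C2", "address": "2 High Street"},
        {"customer_id": "C3", "address": "3 Mill Lane"},
    ]
    result = apply_suppression(
        customers,
        change_of_address={"C1": "9 New Avenue"},
        deceased_ids={"C2"},
        do_not_call_ids={"C3"},
    )
    for row in result:
        print(row)
```

Returning copies rather than editing records in place is a deliberate choice in this sketch: the original internal data stays available for audit while the cleaned view feeds the 360 view.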
  • 9. Use Case: Anti-Fraud/Anti-Money Laundering. Protect Financial Assets and Ensure Compliance • Flag credit card fraud in real time • Identify and report on money laundering. What Types of Data? • Internal sources – often many/overlapping • Suppression data – keeping customer information updated • Mobile data – devices, locations • New sources – social media, 3rd party data, … What Data Challenges? • Fraudulent transaction detection requires: • Huge volumes of customer profile data • Recent transaction activity with “last known” values • Device data with geolocation and time-based tagging • Data used to refine Machine Learning models (e.g., anomaly detection, implausible behavior analysis) to review new transactions in real time. Internal Data ▪ Customer Master Data ▪ Point-of-Sale Data ▪ Contact Form Data ▪ Loyalty Program Data ▪ ecommerce Data ▪ Customer Service Data. Mobile Data ▪ Device ▪ Location ▪ Wearables ▪ Mobile wallets. Suppression Data ▪ Change of Address ▪ Mortality ▪ Do Not Call. Social Data ▪ Digital Demographics ▪ Sentiment ▪ Opinions ▪ Interests ▪ Social handles
  • 10. Big Data Needs Data Quality • Only 35% of senior executives have a high level of trust in the accuracy of their Big Data Analytics (KPMG 2016 Global CEO Outlook) • 92% of executives are concerned about the negative impact of data and analytics on corporate reputation (KPMG 2017 Global CEO Outlook) • 80% of AI/ML projects are stalling due to poor data quality (Dimensional Research, 2019). “Societal trust in business is arguably at an all-time low and, in a world increasingly driven by data and technology, reputations and brands are ever harder to protect.” EY “Trust in Data and Why it Matters”, 2017. The importance of data quality in the enterprise: • Decision making – Trust the data that drives your business • Customer centricity – Get a single, complete and accurate view of your customer for better sales, marketing and customer service • Compliance – Know your data, and ensure its accuracy to meet industry and government regulations • Machine learning & AI – High quality models require training on high quality, accurate data
  • 12. Four Emerging Data Quality Trends. All the traditional DQ issues remain, but now consider: 1. New DQ considerations for new types of data 2. New application considerations (e.g. Machine learning) 3. Processing at scale/meeting SLAs 4. Data Democratization and resource/knowledge constraints
  • 13. 1. New Data, New Measures
  • 14. Common Data Quality Problems. All the traditional data quality issues remain, but now at greater scale and in more places • Many data records with different layouts • Inconsistent data formats (number formatting, measurements, languages, postal conventions and dates) • Lack of standardization of the different fields • Names spelled differently, partially entered, or multiple names provided • Misspellings and keystroke errors • Data sourced from third parties does not contain all the necessary fields or is out-of-date • Invalid values: codes, reference data, out-of-range, future dates
  • 15. Common Data Quality Measurements. What measures can we take advantage of? • Completeness – Are the relevant fields populated? • Integrity – Does the data maintain an internal structural integrity or a relational integrity across sources? • Uniqueness – Are keys or records unique? • Validity – Does the data have the correct values? • Code and reference values • Valid ranges • Valid value combinations • Consistency – Is the data at consistent levels of aggregation or does it have consistent valid values over time? • Timeliness – Did the data arrive in a time period that makes it useful or usable?
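Three of these measures are straightforward to compute as ratios. The following is a minimal sketch, assuming records arrive as Python dictionaries; the field names (`call_id`, `duration_s`, `call_date`) and the sample rows are hypothetical and invented for illustration, not taken from the deck.

```python
from datetime import date


def _populated(records, field):
    """Values of `field` that are present and non-empty."""
    return [r[field] for r in records if r.get(field) not in (None, "")]


def completeness(records, field):
    """Share of records where `field` is populated."""
    return len(_populated(records, field)) / len(records) if records else 0.0


def uniqueness(records, key):
    """Share of distinct values among the populated values of `key`."""
    values = _populated(records, key)
    return len(set(values)) / len(values) if values else 0.0


def validity(records, field, is_valid):
    """Share of populated values of `field` that pass the `is_valid` rule."""
    values = _populated(records, field)
    return sum(1 for v in values if is_valid(v)) / len(values) if values else 0.0


# Hypothetical call-center records (assumption, for illustration only).
calls = [
    {"call_id": "A1", "duration_s": 125, "call_date": date(2019, 3, 4)},
    {"call_id": "A2", "duration_s": 0, "call_date": date(2019, 3, 4)},
    {"call_id": "A2", "duration_s": 310, "call_date": None},
    {"call_id": "A4", "duration_s": -5, "call_date": date(2019, 3, 5)},
]

print("completeness(call_date):", completeness(calls, "call_date"))
print("uniqueness(call_id):", uniqueness(calls, "call_id"))
print("validity(duration_s >= 0):", validity(calls, "duration_s", lambda s: s >= 0))
```

Each ratio only says how often a rule holds. Whether a zero duration or a missing date actually matters remains a business question.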
  • 16. Example: Call Center Record • Unique ✓ • Integrity ✓ • Complete ? • Consistent ✓ • Timely ✓ • Valid ? Questions raised: Is Duration = 0 important? Is 01/01/20xx a defaulted date? And how will this be linked or connected with my other data? The file appears complete, but does it cover all call centers?
  • 17. Example: Social Media Feed. Unique? Integrity? Complete? Consistent? Timely? Valid?
  • 18. New Data Quality Problems. New data, new data quality challenges • 3rd Party and external data with unknown provenance or relevance • Bias in the data – whether in collection, extraction, or other processing • Data without standardized structure or formatting • Continuously streaming data • Disjointed data (e.g. gaps in receipt) • Consistency and verification of data sources • Changes and transformation applied to data (i.e. does it really represent the original input?) “34 percent of bankers in our survey report that their organization has been the target of adversarial AI at least once, and 78 percent believe automated systems create new risks, such as fake data, external data manipulation, and inherent bias.” Accenture Banking Technology Vision 2018
  • 19. Additional Measures of Data Quality. What else can we review or measure? Provenance – Where did the data originate, who gathered it, and what criteria were used to create it? • E.g. government agency, 3rd party provider, free or paid data. Coverage (Relevance) – How well does the data source meet the defined needs? • E.g. does it cover the relevant geography? Is it biased (and if so, how)? Continuity – Are there data points for all intervals or expected intervals? • E.g. sensors, weather records, call data records. Triangulation – What Gartner describes as ‘consistency of data across proximate data points’, i.e. consistent measurements from related points of reference. • E.g. if temperatures in Chicago and Louisville are 30° and 32°, then the temperature in Indianapolis for the same day is unlikely to be 70°. Transformation from origin – How many layers and/or changes has the data passed through? • E.g. has the original data source already been merged with two other record sources? And is the result accurate? Repetition or duplication of data patterns – Data points exactly the same across multiple recording intervals or across multiple sensors. • E.g. is there tampering with sensors or call data?
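Continuity, triangulation, and repetition lend themselves to simple automated checks. The sketch below is one possible reading of these measures, not a method prescribed by the deck. The function names, the integer timestamps, and the 15° tolerance are assumptions. The temperatures are the ones quoted on the slide.

```python
from statistics import median


def missing_intervals(timestamps, step):
    """Continuity: expected timestamps (min..max, every `step`) that never arrived."""
    if not timestamps:
        return []
    seen = set(timestamps)
    expected = range(min(timestamps), max(timestamps) + step, step)
    return [t for t in expected if t not in seen]


def triangulation_outliers(readings, tolerance):
    """Triangulation: sites whose reading is far from the median of all proximate sites."""
    center = median(readings.values())
    return [site for site, value in readings.items()
            if abs(value - center) > tolerance]


def longest_repeat(values):
    """Repetition: length of the longest run of identical consecutive values."""
    best = run = 0
    previous = object()  # sentinel that equals no real reading
    for v in values:
        run = run + 1 if v == previous else 1
        best = max(best, run)
        previous = v
    return best


# Hourly sensor feed with one missing hour (hypothetical data).
print(missing_intervals([0, 1, 2, 4, 5], step=1))

# Temperatures from the slide's example; the tolerance of 15 degrees is an assumption.
temps = {"Chicago": 30, "Louisville": 32, "Indianapolis": 70}
print(triangulation_outliers(temps, tolerance=15))

# A sensor reporting the same value several times in a row (hypothetical data).
print(longest_repeat([5, 5, 5, 7]))
```

A flagged value is a prompt for review rather than proof of an error: a long run of identical readings may be tampering, or simply a stable measurement.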
  • 20. Example: New Data Quality Measures applied • Provenance: Jane Doe pulled from Twitter based on #Blackberry • Continuity: All items for #Blackberry in relevant time interval appear to be included • Usage: Marketing confirms this data has high value • Triangulated: Good association with current product & sales data • Repeated patterns: All tweets appear unique within the date & vs. prior feeds • Coverage: This needed to include #BB and #Crackberry as well! • Transformation: No changes or merges of the data were applied
  • 21. 2. Machine Learning & Data Quality
  • 22. “The magic of machine learning is that you build a statistical model based on the most valid dataset for the domain of interest. If the data is junk, then you’ll be building a junk model that will not be able to do its job.” James Kobielus, SiliconANGLE Wikibon Lead Analyst for Data Science, Deep Learning, App Development, 2018
  • 23. Common Machine Learning Applications. Marketing • Targeted marketing • Recommendation engine • Next best action • Customer churn prevention. Risk Management • Anti-money laundering • Fraud detection • Cybersecurity • Know your customer
  • 24. Data Challenges with Machine Learning. Five Big Challenges of Enabling Machine Learning: 1. Scattered and Difficult to Access Datasets: Much of the necessary data is trapped in mainframes or streams in from POS and ATM machines in incompatible formats, making it difficult to gather and prepare the data for model training. 2. Data Cleansing at Scale: Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools are not designed to work on that scale of data. 3. Entity Resolution and Customer Identification: Distinguishing matches across massive datasets that indicate a single specific entity requires sophisticated multi-field matching algorithms and a lot of compute power. Essentially everything has to be compared to everything. 4. Need for Near Real-Time Current Data: Tracking and detection need to happen very rapidly. Current transactions need to be constantly added to combined datasets, prepared and presented to models as close to real-time as possible. 5. Tracking Lineage from the Source: Data changes made to help train models have to be exactly duplicated in production, in order for models to accurately make predictions on new data, and for required audit trails. Capture of complete lineage, from source to end point, is needed.
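To make the cost of "everything compared to everything" concrete, here is a small sketch of multi-field fuzzy matching. It uses blocking (comparing only records that share a key), which is a common way to cut the number of comparisons and is an assumption here, not something the deck describes. The record layout, the postcode blocking key, and the 0.85 threshold are all hypothetical; string similarity comes from the standard library's `difflib.SequenceMatcher`.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(records, block_key):
    """Group records by a blocking key and yield pairs only within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def similarity(a, b, fields=("name", "street")):
    """Average string similarity (0.0 to 1.0) over the compared fields."""
    scores = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
              for f in fields]
    return sum(scores) / len(scores)


def find_matches(records, threshold=0.85):
    """Return id pairs whose multi-field similarity meets the threshold."""
    pairs = candidate_pairs(records, lambda rec: rec["postcode"])
    return [(a["id"], b["id"]) for a, b in pairs if similarity(a, b) >= threshold]


# Hypothetical customer records (assumption, for illustration only).
customers = [
    {"id": 1, "name": "Jane Doe", "street": "12 Main Street", "postcode": "60601"},
    {"id": 2, "name": "Jane Do", "street": "12 Main St", "postcode": "60601"},
    {"id": 3, "name": "John Smith", "street": "4 Oak Avenue", "postcode": "60601"},
    {"id": 4, "name": "Jane Doe", "street": "12 Main Street", "postcode": "40202"},
]

print(find_matches(customers))
```

Record 4 is never compared with record 1 because its postcode differs. That is the trade-off of blocking: fewer comparisons, at the risk of missing matches when the blocking field itself is wrong or has changed.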
  • 25. Data Quality Challenges with Machine Learning. Incorrect, Incomplete, Mis-Formatted, and Sparse “Dirty Data” – Mistakes and errors are almost never the patterns you’re looking for in a data set. Sparse data generates other issues. Correcting and standardizing will tend to boost the signal, but must account for bias. Missing context – Many data sources lack context around location or population segments. Unless enriched with other data sets (e.g. geospatial, demographics, or firmographics data), some ML algorithms will not be usable. Multiple copies – If your data comes from many sources, as it often does, it may contain multiple records of information about the same person, company, product or other entity. Removing duplicates and enhancing the overall depth and accuracy of knowledge about a single entity can make a huge difference. Spurious correlations – Just as missing context may hinder some ML algorithms, inclusion of already correlated data (e.g. city and postal code) may result in overfitting of ML algorithms. Correcting data problems vastly increases a data set’s usefulness for machine learning. However, traditional data quality software is designed to work on smaller data sets. And data analysts may not be aware of specific data quality issues that must be addressed to support machine learning. Traditional data quality processes are an effective method to remove defects.
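One simple way to spot "already correlated" fields such as city and postal code is to test whether one field fully determines the other. The sketch below assumes dictionary records; the function name `determines` and the sample rows are hypothetical.

```python
from collections import defaultdict


def determines(records, source, target):
    """True if every value of `source` maps to exactly one value of `target`."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[source]].add(rec[target])
    return all(len(targets) == 1 for targets in seen.values())


# Hypothetical address rows (assumption, for illustration only).
rows = [
    {"postcode": "60601", "city": "Chicago"},
    {"postcode": "60602", "city": "Chicago"},
    {"postcode": "40202", "city": "Louisville"},
    {"postcode": "60601", "city": "Chicago"},
]

# Postal code fixes the city, so city carries no extra information here.
print(determines(rows, "postcode", "city"))
# The reverse does not hold: one city spans several postal codes.
print(determines(rows, "city", "postcode"))
```

Whether to drop the redundant field is a modeling decision. The check only makes the redundancy visible.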
  • 26. Example: Missing segments of populations. Event: Hurricane Sandy, 20 million tweets • Majority of tweets from Manhattan, not the hard-hit areas such as Seaside Heights and Midland Beach, due to power outages and diminishing cell phone batteries • Despite the millions of Spanish-speakers affected, very few Spanish-language tweets collected. Checks: • Assess % across and against all likely locations • Seek out disconfirming information. Data: Boston Potholes, Street Bump App • Draws on accelerometer and GPS data to help passively detect potholes • Lower income groups in the US are less likely to have smartphones, particularly older residents - penetration as low as 16% • Result is underreporting of road problems in more elderly communities. Checks: • Assess % across all likely locations • Add other sources • Utilize demographics for evaluations
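The "assess % across all likely locations" check can be sketched as a comparison between each location's share of records and the share one would expect. Everything numeric below is invented for illustration: the tweet counts, the expected shares, and the `min_ratio` cut-off are assumptions, not figures from Hurricane Sandy.

```python
from collections import Counter


def coverage_gaps(observed_locations, expected_share, min_ratio=0.5):
    """Locations whose share of records is below `min_ratio` times the expected share.

    Returns {location: (observed_share, expected_share)}.
    """
    counts = Counter(observed_locations)
    total = sum(counts.values())
    gaps = {}
    for location, expected in expected_share.items():
        observed = counts.get(location, 0) / total if total else 0.0
        if observed < min_ratio * expected:
            gaps[location] = (observed, expected)
    return gaps


# Hypothetical sample: location tag of each collected tweet.
tweets = ["Manhattan"] * 90 + ["Seaside Heights"] * 2 + ["Midland Beach"] * 8

# Hypothetical expected shares, e.g. derived from affected population.
expected = {"Manhattan": 0.6, "Seaside Heights": 0.2, "Midland Beach": 0.2}

for location, (obs, exp) in coverage_gaps(tweets, expected).items():
    print(f"{location}: observed {obs:.0%}, expected about {exp:.0%}")
```

The expected shares have to come from an independent source such as census or demographic data. That is the point of the slide's advice to add other sources.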
  • 27. Example: Noise, or Inserted content. “Bots are just a tool for making the numbers look how you want them to look.” Sam Woolley, Researcher, Oxford University’s Project on Computational Propaganda. Wired: Nov 8, 2016 “The Political Twitter Bots Will Rage This Election Day”. Event: Election Bot tweets • ~400,000 bots tweeting on the election • ~20% of all election-related tweets came from an army of influential bots • 55-80% of Twitter activity (the likes, follows, and retweets) are from bots • It had been easier to identify earlier bots, but now it’s incredibly difficult for a human to make a determination. Checks: • Evaluate patterns • Is there any real sentiment here? • How much repetitive content is there? • How much “influence” comes from a single or a few sources (negative or positive)? • Will it skew the analysis?
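Two of the questions above, how much content is repetitive and how much comes from a few sources, can be approximated with simple counts. This is a rough sketch under stated assumptions: posts are dictionaries with hypothetical `account` and `text` fields, and text is normalized only by lowercasing and collapsing whitespace. It does not detect bots.

```python
from collections import Counter


def top_source_share(posts, top_n=1):
    """Share of all posts produced by the `top_n` most active accounts."""
    counts = Counter(p["account"] for p in posts)
    if not counts:
        return 0.0
    top = sum(n for _, n in counts.most_common(top_n))
    return top / sum(counts.values())


def repeated_text_share(posts):
    """Share of posts whose normalized text appears more than once."""
    texts = [" ".join(p["text"].lower().split()) for p in posts]
    if not texts:
        return 0.0
    counts = Counter(texts)
    return sum(1 for t in texts if counts[t] > 1) / len(texts)


# Hypothetical posts (assumption, for illustration only).
posts = [
    {"account": "a", "text": "Vote now!"},
    {"account": "a", "text": "vote  NOW!"},
    {"account": "b", "text": "Polls open at 7"},
    {"account": "c", "text": "Vote now!"},
]

print("top account share:", top_source_share(posts))
print("repeated text share:", repeated_text_share(posts))
```

High values on either measure suggest the feed may skew an analysis and deserves a closer look before sentiment is read from it.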
  • 28. Example: Simple bias. “The “black sheep problem” is that if you were to try to guess what color most sheep were by looking [at] language data, it would be very difficult for you to conclude that they weren't almost all black. In English, “black sheep” outnumbers “white sheep” about 25:1 (many "black sheeps” are movie references); in French it's 3:1; in German it's 12:1. Some languages get it right; in Korean it's 1:1.5 in favor of white sheep…” Hal Daumé, Associate Professor, University of Maryland. Blog: June 24, 2016 “Language bias and black sheep” http://nlpers.blogspot.com/2016/06/language-bias-and-black-sheep.html Data: Google Word2Vec data set. Word2vec • Converts words into a vector space for analysis • “Numerous researchers have begun to use the data to better understand everything from machine translation to intelligent Web searching.” • Embeddings based on a group of 300 million words taken from Google News • Researchers from Boston University and Microsoft have found it is “blatantly sexist” • Impacts the ability to create personalized services. Checks: • Evaluate % of words & associations • How do I interpret a sentiment? • Does this data set contain hidden and unexpressed bias? • Will I miss opportunities because of hidden assumptions?
  • 29. 3. Data Quality at Scale
  • 30. Challenges To Ensuring Data Quality. Many sources of data (70%) and volume of data (48%) are among the top 3 challenges companies face when ensuring high quality data. Applying governance processes to manage and measure data quality is second with 50%. Survey question: What are the greatest challenges you face when ensuring high data quality? • 70% Many sources of data • 50% Applying governance processes to manage and measure data… • 48% Volume of data • 47% Inconsistent formats of data • 46% Inconsistent definitions of data • 43% Missing information • 32% Connecting policies and rules to data • 27% Misfielded data • 27% Lack of skills/staff • 25% Lack of tools (or inadequate tools) • 15% Not seen as an organizational priority. * Syncsort, 2019 Enterprise Data Quality survey
  • 31. Processing at Scale. New Data Quality considerations • Handling data volumes and distributed data • Profiling data – assessing high volumes and streaming data • Standardizing and enriching data content • Matching entities – not just master data – e.g. transactions for fraud detection • Meeting Service Level Agreements (SLAs) • Running consistently on new and regularly changing platforms (Hadoop, Spark, Cloud)
  • 32. Handling distributed data volumes. Big Data at scale distributes data across many nodes – not necessarily with other relevant data! • Data Quality functions must be performed in a consistent manner, no matter where actual processing takes place, how the data is segmented, and what the data volume is • Cleansing, standardization, and data validation will generally scale linearly • Data Enrichment: Reference data, lookups must be readily accessible by any process wherever executed. Source: HP Analyst Briefing
  • 33. Handling distributed data volumes (continued) • But there are particular implications for profiling, joining, sorting, and matching data • Profiling: Identification of outliers necessitates full volume views and the need to aggregate statistics and frequencies of data distributed across the cluster • Joins & sorts: Efficient shuffling of data stored across the cluster is critical • Entity Resolution: Distinguishing matches that indicate a single specific entity across so much data requires multiple passes with sophisticated multi-field matching algorithms – with results that are understandable by business users in order to be meaningful
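The profiling point can be illustrated without a cluster: each partition computes a small summary locally, and the summaries are merged to give the full-volume statistics. This is a minimal single-machine sketch, assuming numeric values; the `Profile` class and `profile_partition` helper are hypothetical names, and a real deployment would run the per-partition step inside a framework such as Spark.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class Profile:
    """Mergeable summary of one partition: count, sums, and value frequencies."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    freq: Counter = field(default_factory=Counter)

    def add(self, value):
        self.count += 1
        self.total += value
        self.total_sq += value * value
        self.freq[value] += 1

    def merge(self, other):
        """Combine two partition summaries into one; order does not matter."""
        return Profile(self.count + other.count,
                       self.total + other.total,
                       self.total_sq + other.total_sq,
                       self.freq + other.freq)

    @property
    def mean(self):
        return self.total / self.count

    @property
    def stddev(self):
        # Population standard deviation from sums; adequate for a sketch,
        # though numerically fragile for very large values.
        variance = self.total_sq / self.count - self.mean ** 2
        return math.sqrt(max(variance, 0.0))


def profile_partition(values):
    """Build the local summary for one partition of the data."""
    profile = Profile()
    for v in values:
        profile.add(v)
    return profile


# Two hypothetical partitions, as if stored on different nodes.
partitions = [[1, 2, 3], [4, 5]]
merged = reduce(Profile.merge, (profile_partition(p) for p in partitions))
print(merged.count, merged.mean, round(merged.stddev, 4))
```

Because only the summaries travel between nodes, outlier thresholds such as "more than three standard deviations from the mean" can be derived from the merged profile without moving the raw data.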
  • 34. Anti-Money Laundering on Hadoop at Global Bank • Must provide cluster-native data verification, enrichment, and demanding multi-field fuzzy matching for entity resolution to Golden Record • Massive data volumes • Scattered data – Mainframe, RDBMS, Cloud, … • Must be secure – Kerberos, LDAP • Must have lineage – data origin to end point • Must archive unaltered mainframe data Full Anti-Money Laundering regulatory compliance with financial crimes data lake – high performance results at massive scale. • Full end-to-end data lineage supplied to Apache Atlas and ASG Data Intelligence • Cluster-native data verification, enrichment, and demanding multi-field entity resolution on Spark • Unmodified mainframe “Golden Records” stored on Hadoop Bank must monitor transactions to detect Money Laundering for FCA compliance. Leverage Machine learning at scale to detect patterns, but … Requires large amounts of current, clean data.
  • 35. 4. Data Literacy / Democratization
  • 36. Data Democratization. Data Quality is a key component to user empowerment • Data Literacy - critical to understand: • Business context and language • Data (including data structures and data types) • Data access (how and where to find) • Data usage (how will the data be used by the business) • Basic Statistics • Data Quality dimensions • Data Quality techniques and tools • Resource constraints – in both Data Quality and technologies • What questions to ask? • Where to find answers?
  • 37. Approaches to Addressing Emerging Data Quality Trends
  • 38. Approaches. Data Literacy / Data Governance • Communicating Best Practices in Data Quality for everyone. “Universal” Data Quality Best Practices • Establish Scope: ask core questions • Identify data requirements • Address bias • Understand context • Address and resolve data quality issues • Apply data governance processes. Solving “Big Data” Data Quality Challenges • Handle scale • Ensure consistent data quality application across platforms
  • 39. Communicate! Culture of Data Literacy • “Democratization of Data” requires cultural support • Empowered to ask questions about the data • Trained to understand and use data • Trained to understand how to approach and evaluate data quality • Traditional data, new data, machine learning requirements, … • Understand the business context of the data. Program of Data Governance • Provide the processes and practices necessary for success • Measure, monitor, and improve • Continuous iteration and development. Center of Excellence/Knowledge Base • Where do you go to find answers? • Who can help show you how?
  • 40. Data Literacy: challenges & best practices • Lack of Common Terminology • Organizational Barriers & Silos • Isolated or Unknown Work • Lack of Engagement Establish a Common Language • Define terminology – a ‘stake in the ground’ • Map information • Support with policies/standards Gain Broader Buy In • Bring stakeholders together • Build the structure, culture, ownership, steering groups, stewardship over time Enrich Information • Discover what you don’t know • Resolve differences • Enhance/annotate to increase insight Share Insights Regularly • Produce and share tangible outcomes • Highlight ‘wins’ • Demonstrate efficiencies & savings Copyright © Syncsort 2019
  • 41. Universal Data Quality best practices. “If you don’t know what you want to get out of the data, how can you know what data you need – and what insight you’re looking for?” Wolf Ruzicka, Chairman of the Board at EastBanc Technologies. Blog post: June 1, 2017 “Grow A Data Tree Out Of The “Big Data” Swamp”. Establish Scope • Understand the business objective and problem • Ask the “right questions” about your data (not just “what” and “how”) • Empower users (“Who”) to gain new clarity into the core problem (“Why”) • “High-quality data” definition will vary by business problem. Identify Requirements & Processes • Do you have all the data required? • Do you understand the characteristics and context of the data? • How will data be matched, consolidated, or connected? • What’s needed to facilitate the matching, consolidation, or connection required? • Have you evaluated the sources? • What’s the Fitness for your Purpose?
  • 42. Universal Data Quality best practices (continued). Understand Context • What are the Critical Data Elements? • What qualities do we need to address, or leave alone? • When, and where, do we need to transform or enrich the data content? • How are we connecting, relating, or combining data? Develop, Test, and Deploy Corrective Measures • Consistent application of standardization, transformation, enrichment, and entity resolution • Common templates, rules, metrics, and processes that can be leveraged • Deploy into batch, real-time, or embedded services. Apply Data Governance • Deploy and implement metrics and measures for ongoing assessment and evaluation. “Never lead with a data set; lead with a question.” Anthony Scriffignano, Chief Data Scientist, Dun & Bradstreet. Forbes Insights, May 31, 2017 “The Data Differentiator”
  • 43. Quantify: challenges & best practices • Hidden Activities • Money, Time and Resource Waste • Lack of Transparency and Trust • Disconnect Between Process and Measures Identify Baseline Measures • Keep a focus on lean and agile • Define value accurately for the business Link to Business Performance • Create and refine streams of value • Transform culture through action and empowerment Monitor, Report and Remediate Issues • Continuously review • Ensure issues are visible and understood • Understand root causes • Address/resolve issues Quantify Impact of Changes • Demonstrate through clearly understood measures • Establish value continuously • Finish early, finish often
  • 44. Leverage tools built for Big Data • Focus on the data quality challenges, not the Big Data ones • Connect to and process hundreds of millions of records of data • Standardize, enhance, and match international data sets with postal and country-code validation • Integrate, enrich, and match new and legacy customer data from multiple disparate sources • Deploy data quality workflows as native, parallel MapReduce or Spark processes for optimal efficiency on premises or in the Cloud • Increase processing efficiency by expanding cluster, not rebuilding processes • Support failover through fault-tolerant designs; during a node failure, processing is redirected to another node
  • 45. Simplify: Design Once, Deploy Anywhere. Intelligent Execution - Insulate your organization from underlying complexities of Big Data. Get excellent performance every time without tuning, load balancing, etc. Avoid re-design, re-compile, re-work • Future-proof job designs for emerging compute frameworks • Move from dev to test to production • Move from on-premises to Cloud • Move from one Cloud to another. Use existing Data Quality skills • Focus on data quality problems, not technical ones. Design Once in visual GUI. Deploy Anywhere! On-Premises, Cloud; MapReduce, Spark, Future Platforms; Windows, Linux, Unix; Batch, Streaming; Single Node, Cluster
  • 46. Data Quality remains Data Quality, even at scale. “Data and analytics leaders need to understand the business priorities and challenges of their organization. Only then will they be in the right position to create compelling business cases that connect data quality improvement with key business priorities.” Ted Friedman, VP Distinguished Analyst, Gartner. Smarter with Gartner at Gartner.com: June 12, 2018 “How to Create a Business Case for Data Quality Improvement”. “Never lead with a data set; lead with a question.” Anthony Scriffignano, Chief Data Scientist, Dun & Bradstreet. Forbes Insights, May 31, 2017 “The Data Differentiator”
  • 47. Q&A