Emerging Data Quality Trends for Governing and Analyzing Big Data

Harald Smith

Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a focus on
data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance
and Data Integration
• Blog author: “Data Democratized”

Agenda
• Ongoing Data Challenges
• Four Emerging Data Quality Trends
• Approaches to addressing Data Quality needs
• Questions

Why is Data Quality
so important?

Data: the fuel of the future
Data is to this century, what oil was to the last one: a driver of
growth and change.
The Economist: Fuel of the future - Data is giving rise to a new economy: 6th May 2017
Flows of data have created new infrastructures, new businesses,
new monopolies, new politics and crucially new economics.
Digital information is unlike any previous resource: it is extracted,
refined, valued, bought and sold in different ways.
It changes the rules for markets and it demands new approaches
from regulators.
Many a battle will be fought over who should own, and benefit
from, data.

Data impacts all areas of the business
Business areas: Sales, Marketing, Finance, Legal, IT, Operations, Management
Data-dependent activities across these areas:
• Analysis, Segmentation, Data compliance, Access, Scheduling, All reports
• Competitor analysis, Sales reports, Single Customer / 360 View
• Data regulation, Security, Workloads, Aggregations, HR / recruitment
• Dashboards, CRM, Content, Governance, Capacity Management
• Performance planning, Forecasting & modelling, Overall business strategy
• Performance metrics, Campaign management, Risk, Optimization & SLAs
• Route planning, Cash flow, Territory management
• ROI, Disaster Recovery, Inventory, Contingency planning, UX

Data Governance & Quality are top of mind
• 3V’s of Big Data – Volume, variety, and velocity of data are growing
• Ever more Analysis – New tools allow more granular data dissection and segmentation
• Dichotomy in Outcomes – Expectations of data are increasing, yet confidence in data is falling
• Governance Requirements – Broader and deeper compliance & regulation expectations
Trust & confidence

“Get to Know Me”…
• Design and deliver rich, individualized experiences that build customer loyalty
• Increasingly broad spectrum of data sources involved in, and required for,
effectively personalizing customer experiences and targeted marketing offers
What Types of Data?
• Internal sources – often many/overlapping
• 3rd Party data – geospatial, demographics, firmographics
• Suppression data – keeping customer information updated
• New sources – mobile, social media
What Data Challenges?
• Incorporating and managing the expected exponential increase in digital
demographic data
• Tapping into customer technology histories to build and evolve an understanding
of individual customers
Use Case: 360 View of Customer
Internal Data
▪ Customer Master Data
▪ Point-of-Sale Data
▪ Contact Form Data
▪ Loyalty Program Data
▪ ecommerce Data
▪ Customer Service Data
Suppression Data
▪ Change of Address
▪ Mortality
▪ Do Not Call
Third-Party Data
▪ Age
▪ Occupation
▪ Education
▪ Gender
▪ Income
▪ Geospatial/Location
Social Data
▪ Digital demographics
▪ Sentiment
▪ Opinions
▪ Interests
▪ Social handles

Protect Financial Assets and Ensure Compliance
• Flag credit card fraud in real time
• Identify and report on money laundering
What Types of Data?
• Internal sources – often many/overlapping
• Suppression data – keeping customer information updated
• Mobile data – devices, locations
• New sources – social media, 3rd party data, …
What Data Challenges?
• Fraudulent transaction detection requires:
• Huge volumes of customer profile data
• Recent transaction activity with “last known” values
• Device data with geolocation and time-based tagging
• Data used to refine Machine Learning models (e.g., anomaly detection,
implausible behavior analysis) to review new transactions in real time
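The geolocation and time-tagged device data listed above can feed a simple implausible-behavior check. The following is a minimal sketch, assuming each transaction carries a timestamp and a latitude/longitude. The `Txn` record, the helper names, and the 900 km/h threshold are illustrative assumptions, not part of any product or of the original slides.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # assumed threshold, roughly airliner cruising speed


@dataclass
class Txn:
    """Hypothetical transaction record with device location and time."""
    when: datetime
    lat: float
    lon: float


def haversine_km(a: Txn, b: Txn) -> float:
    """Great-circle distance between two transactions in kilometres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def implied_speed_kmh(prev: Txn, cur: Txn) -> float:
    """Speed needed to travel from the last known location to the current one."""
    distance = haversine_km(prev, cur)
    if distance == 0:
        return 0.0
    hours = (cur.when - prev.when).total_seconds() / 3600
    if hours <= 0:
        return float("inf")  # different places at the same (or an earlier) time
    return distance / hours


def is_implausible(prev: Txn, cur: Txn) -> bool:
    """Flag a transaction whose implied travel speed exceeds the threshold."""
    return implied_speed_kmh(prev, cur) > MAX_PLAUSIBLE_KMH


chicago = Txn(datetime(2019, 5, 1, 12, 0), 41.88, -87.63)
london = Txn(datetime(2019, 5, 1, 13, 0), 51.51, -0.13)
nearby = Txn(datetime(2019, 5, 1, 13, 0), 41.90, -87.65)

print(is_implausible(chicago, london))  # same card in London one hour later
print(is_implausible(chicago, nearby))  # a few kilometres away one hour later
```

A check like this depends on the "last known" location and timestamp being current and accurate, which is why the data quality of the device feed matters as much as the model that consumes it.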
Use Case: Anti-Fraud/Anti-Money Laundering
Internal Data
▪ Customer Master Data
▪ Point-of-Sale Data
▪ Contact Form Data
▪ Loyalty Program Data
▪ ecommerce Data
▪ Customer Service Data
Mobile Data
▪ Device
▪ Location
▪ Wearables
▪ Mobile wallets
Suppression Data
▪ Change of Address
▪ Mortality
▪ Do Not Call
Social Data
▪ Digital Demographics
▪ Sentiment
▪ Opinions
▪ Interests
▪ Social handles

Big Data Needs Data Quality
• Only 35% of senior executives have a high level of trust in the accuracy of their Big Data Analytics (KPMG 2016 Global CEO Outlook)
• 92% of executives are concerned about the negative impact of data and analytics on corporate reputation (KPMG 2017 Global CEO Outlook)
• 80% of AI/ML projects are stalling due to poor data quality (Dimensional Research, 2019)
“Societal trust in business is arguably at an all-time low and, in a world increasingly driven by data and technology, reputations and brands are ever harder to protect.”
EY “Trust in Data and Why it Matters”, 2017.
The importance of data quality
in the enterprise:
• Decision making – Trust the data
that drives your business
• Customer centricity – Get a single,
complete and accurate view of your
customer for better sales, marketing
and customer service
• Compliance – Know your data, and
ensure its accuracy to meet industry
and government regulations
• Machine learning & AI – High quality
models require training on high
quality, accurate data

Four Emerging
Data Quality Trends

Four Emerging Data Quality Trends
All the traditional DQ issues remain, but now consider:
1. New DQ considerations for new types of data
2. New application considerations (e.g. Machine learning)
3. Processing at scale/meeting SLAs
4. Data Democratization and resource/knowledge constraints

Common Data Quality Problems
All the traditional data quality issues
remain, but now at greater scale and
in more places
• Many data records with different layouts
• Inconsistent data formats (number
formatting, measurements, languages,
postal conventions and dates)
• Lack of standardization of the different
fields
• Names spelled differently, partially entered,
or multiple names provided
• Misspellings and keystroke errors
• Data sourced from third parties does not
contain all the necessary fields or is out-of-
date
• Invalid values: codes, reference data, out-of-
range, future dates
Lack of Standardization

Common Data Quality Measurements
What measures can we take advantage of?
• Completeness – Are the relevant fields populated?
• Integrity – Does the data maintain an internal structural
integrity or a relational integrity across sources
• Uniqueness – Are keys or records unique?
• Validity – Does the data have the correct values?
• Code and reference values
• Valid ranges
• Valid value combinations
• Consistency – Is the data at consistent levels of aggregation
or does it have consistent valid values over time?
• Timeliness – Did the data arrive in a time period
that makes it useful or usable?
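Several of these measures reduce to simple ratios over a set of records. The sketch below shows one way completeness, uniqueness, and validity might be computed. The sample records, the field names, the `VALID_STATUS` reference set, and the duration bound are all hypothetical values chosen for illustration.

```python
from datetime import date

# Hypothetical call records; None marks a missing value.
records = [
    {"id": 1, "agent": "A17", "duration": 125, "call_date": date(2019, 3, 4), "status": "CLOSED"},
    {"id": 2, "agent": None, "duration": 0, "call_date": date(2019, 3, 4), "status": "OPEN"},
    {"id": 2, "agent": "B02", "duration": 310, "call_date": date(2019, 3, 5), "status": "closed?"},
    {"id": 4, "agent": "A17", "duration": -5, "call_date": date(2019, 3, 6), "status": "OPEN"},
]

VALID_STATUS = {"OPEN", "CLOSED"}  # assumed reference values
MAX_DURATION_S = 4 * 60 * 60       # assumed upper bound for a plausible call


def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, key):
    """Share of distinct key values among all rows (1.0 means no duplicates)."""
    values = [r[key] for r in rows]
    return len(set(values)) / len(values)


def validity(rows, field, is_valid):
    """Share of rows whose field passes the supplied validity rule."""
    return sum(bool(is_valid(r[field])) for r in rows) / len(rows)


report = {
    "agent completeness": completeness(records, "agent"),
    "id uniqueness": uniqueness(records, "id"),
    "status validity": validity(records, "status", lambda s: s in VALID_STATUS),
    "duration validity": validity(
        records, "duration", lambda d: d is not None and 0 <= d <= MAX_DURATION_S
    ),
}
for name, score in report.items():
    print(f"{name}: {score:.0%}")
```

Keeping each measure as a small function that returns a ratio makes it straightforward to track the same numbers over time.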

Example: Call Center Record
Unique ✓
Integrity ✓
Complete ?
Consistent ✓
Timely ✓
Valid ?
Is Duration = 0 important?
Is 01/01/20xx a defaulted date?
And how will this be linked or
connected with my other data?
The file appears complete, but does
it cover all call centers?

Example: Social Media Feed
Unique?
Integrity?
Complete?
Consistent?
Timely?
Valid?

New Data Quality Problems
New data, new data quality challenges
• 3rd Party and external data with unknown provenance or relevance
• Bias in the data – whether in collection, extraction, or other processing
• Data without standardized structure or formatting
• Continuously streaming data
• Disjointed data (e.g. gaps in receipt)
• Consistency and verification of data sources
• Changes and transformation applied to data (i.e. does it really represent the
original input)
“34 percent of bankers in our survey report that their
organization has been the target of adversarial AI at least
once, and 78 percent believe automated systems create new
risks, such as fake data, external data manipulation, and
inherent bias.”
Accenture Banking Technology Vision 2018

Additional Measures of Data Quality
What else can we review or measure?
Provenance – Where did the data originate, who gathered it, and what criteria were used to create it?
• E.g. government agency, 3rd party provider, free or paid data
Coverage (Relevance) – How well does the data source meet the defined needs?
• E.g. does it cover the relevant geography? Is it biased (and if so, how)?
Continuity – Are there data points for all intervals or expected intervals?
• E.g. sensors, weather records, call data records
Triangulation – What Gartner describes as ‘consistency of data across proximate data points’, i.e. consistent measurements from related points of reference.
• E.g. if temperatures in Chicago and Louisville are 30° and 32°, then the temperature in Indianapolis for the same day is unlikely to be 70°
Transformation from origin – How many layers and/or changes has the data passed through?
• E.g. has the original data source already been merged with two other record sources? And is the result accurate?
Repetition or duplication of data patterns – Data points exactly the same across multiple recording intervals or across multiple sensors.
• E.g. is there tampering with sensors or call data?
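Three of these measures (continuity, triangulation, and repetition) lend themselves to small automated checks. The sketch below is one possible approach under stated assumptions: the function names, the hourly interval, and the 15-degree tolerance are hypothetical, and the median is used as a simple stand-in for "proximate data points".

```python
from datetime import datetime, timedelta
from statistics import median


def find_gaps(timestamps, expected):
    """Continuity: return (start, end) pairs that are further apart than expected."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > expected]


def triangulation_outliers(readings, tolerance):
    """Triangulation: flag points that differ from the median reading by more than tolerance."""
    center = median(readings.values())
    return [name for name, value in readings.items() if abs(value - center) > tolerance]


def longest_repeat_run(values):
    """Repetition: length of the longest run of identical consecutive values."""
    best = run = 0
    previous = object()  # sentinel that equals no real value
    for v in values:
        run = run + 1 if v == previous else 1
        best = max(best, run)
        previous = v
    return best


hourly = [datetime(2019, 1, 1, h) for h in (0, 1, 2, 5, 6)]
print(find_gaps(hourly, timedelta(hours=1)))

temps = {"Chicago": 30, "Louisville": 32, "Indianapolis": 70}
print(triangulation_outliers(temps, tolerance=15))

sensor = [21.5, 21.5, 21.5, 21.5, 22.0]
print(longest_repeat_run(sensor))
```

The temperature values reuse the Chicago, Louisville, and Indianapolis example; a real triangulation check would need a domain-specific notion of which points count as proximate.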

Example: New Data Quality Measures applied
• Provenance – Jane Doe pulled from Twitter based on #Blackberry
• Continuity – All items for #Blackberry in relevant time interval appear to be included
• Usage – Marketing confirms this data has high value
• Triangulated – Good association with current product & sales data
• Repeated patterns – All tweets appear unique within the date & vs. prior feeds
• Coverage – This needed to include #BB and #Crackberry as well!
• Transformation – No changes or merges of the data were applied

2. Machine Learning & Data Quality

“The magic of machine learning is that you build a statistical model based on the most valid dataset for the domain of interest. If the data is junk, then you’ll be building a junk model that will not be able to do its job.”
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018

Common Machine Learning Applications
Marketing
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
Risk Management
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Know your customer

Data Challenges with Machine Learning
Five Big Challenges of Enabling Machine Learning
1. Scattered and Difficult to Access Datasets
Much of the necessary data is trapped in mainframes or streams in from POS systems and ATMs in incompatible formats, making it difficult to gather and prepare the data for model training.
2. Data Cleansing at Scale
Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools are not designed to work on that scale of data.
3. Entity Resolution and Customer Identification
Distinguishing matches across massive datasets that indicate a single specific entity requires sophisticated multi-field matching algorithms and a lot of
compute power. Essentially everything has to be compared to everything.
4. Need for Near Real-Time Current Data
Tracking and detection need to happen very rapidly. Current transactions need to be constantly added to combined datasets, prepared, and presented to models as close to real time as possible.
5. Tracking Lineage from the Source
Data changes made to help train models have to be exactly duplicated in production, both so that models make accurate predictions on new data and to satisfy required audit trails. Complete lineage must be captured, from source to end point.
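Challenge 3 notes that, in principle, everything has to be compared to everything. One common way to reduce that cost is blocking: only records that share a cheap key are compared with the more expensive multi-field matcher. The sketch below illustrates the idea with Python's standard `difflib`. The sample customers, the choice of postcode as the blocking key, the field weights, and the 0.8 threshold are all assumptions, and this is not a description of any vendor's matching algorithm.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records from overlapping sources.
customers = [
    {"id": 1, "name": "Jonathan Smith", "postcode": "60601", "email": "jon.smith@example.com"},
    {"id": 2, "name": "Jon Smith", "postcode": "60601", "email": "jon.smith@example.com"},
    {"id": 3, "name": "Joan Smythe", "postcode": "60601", "email": "joan@example.org"},
    {"id": 4, "name": "Jonathan Smith", "postcode": "40202", "email": "jsmith@example.net"},
]


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(r1, r2):
    """Weighted multi-field score; the weights are illustrative only."""
    return 0.6 * similarity(r1["name"], r2["name"]) + 0.4 * similarity(r1["email"], r2["email"])


def candidate_pairs(rows, block_key):
    """Blocking: only pair up records that share the same block key value."""
    blocks = defaultdict(list)
    for r in rows:
        blocks[r[block_key]].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)


def find_matches(rows, block_key="postcode", threshold=0.8):
    """Return id pairs whose multi-field score meets the threshold."""
    return [
        (a["id"], b["id"])
        for a, b in candidate_pairs(rows, block_key)
        if match_score(a, b) >= threshold
    ]


print(len(list(candidate_pairs(customers, "postcode"))), "pairs compared instead of 6")
print(find_matches(customers))
```

The trade-off is visible in record 4: it shares a name with record 1 but sits in a different block, so it is never compared. Production entity resolution typically runs several passes with different blocking keys for that reason.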

Data Quality Challenges with Machine Learning
Incorrect, Incomplete, Mis-Formatted, and Sparse “Dirty Data” – Mistakes
and errors are almost never the patterns you’re looking for in a data set.
Sparse data generates other issues. Correcting and standardizing will tend
to boost the signal, but must account for bias.
Missing context – Many data sources lack context around location or
population segments. Unless enriched with other data sets (e.g.
geospatial, demographics, or firmographics data), some ML algorithms
will not be usable.
Multiple copies – If your data comes from many sources, as it often does,
it may contain multiple records of information about the same person,
company, product or other entity. Removing duplicates and enhancing the
overall depth and accuracy of knowledge about a single entity can make a
huge difference.
Spurious correlations – Just as missing context may hinder some ML
algorithms, inclusion of already correlated data (e.g. city and postal code)
may result in overfitting of ML algorithms.
Correcting data problems vastly increases a data set’s usefulness for machine learning.
Traditional data quality processes are an effective method to remove defects.
However, traditional data quality software is designed to work on smaller data sets.
And data analysts may not be aware of specific data quality issues that must be addressed to support machine learning.
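The city and postal code example of already correlated data can be checked before training by asking how often one field fully determines another. The sketch below is a simple redundancy check, not a statistical correlation test. The `dependency_strength` helper and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical address rows.
rows = [
    {"city": "Chicago", "postal": "60601"},
    {"city": "Chicago", "postal": "60601"},
    {"city": "Chicago", "postal": "60614"},
    {"city": "Indianapolis", "postal": "46204"},
    {"city": "Louisville", "postal": "40202"},
]


def dependency_strength(records, src, dst):
    """Share of distinct src values that map to exactly one dst value."""
    seen = defaultdict(set)
    for r in records:
        seen[r[src]].add(r[dst])
    single = sum(1 for targets in seen.values() if len(targets) == 1)
    return single / len(seen)


print(dependency_strength(rows, "postal", "city"))  # does postal code determine city?
print(dependency_strength(rows, "city", "postal"))  # does city determine postal code?
```

A value near 1.0 suggests the destination field adds little independent information once the source field is present, which is a prompt to review whether both belong in the feature set.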

Example: Missing segments of populations
Event: Hurricane Sandy
20 million tweets
• Majority of tweets came from Manhattan, not from hard-hit
areas such as Seaside Heights and Midland Beach, due to
power outages and diminishing cell phone batteries
• Despite the millions of Spanish-speakers
affected, very few Spanish-language tweets
were collected
• Assess % across and against all likely
locations
• Seek out disconfirming information
Data: Boston Potholes
Street Bump App
• Draws on accelerometer and GPS data to help
passively detect potholes
• Lower income groups in the US are less likely to
have smartphones, particularly older residents -
penetration as low as 16%
• Result is underreporting of road problems in
more elderly communities
• Assess % across all likely locations
• Add other sources
• Utilize demographics for evaluations
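"Assess % across all likely locations" can be made concrete by comparing each segment's share of the collected data with its share of a reference population. The sketch below assumes such a reference exists. All counts and shares are invented for illustration and are not figures from the Hurricane Sandy or Street Bump studies.

```python
def coverage_gaps(sample_counts, population_share, min_ratio=0.5):
    """Return segments whose share of the sample is well below their population share.

    The ratio is observed share divided by expected share; 1.0 means the
    segment is represented in proportion to the reference population.
    """
    total = sum(sample_counts.values())
    gaps = {}
    for segment, expected in population_share.items():
        if expected <= 0:
            continue  # nothing to compare against
        observed = sample_counts.get(segment, 0) / total
        ratio = observed / expected
        if ratio < min_ratio:
            gaps[segment] = round(ratio, 2)
    return gaps


# Hypothetical tweet counts per area and hypothetical affected-population shares.
tweets = {"Manhattan": 700, "Seaside Heights": 5, "Midland Beach": 15, "Other": 280}
affected = {"Manhattan": 0.4, "Seaside Heights": 0.1, "Midland Beach": 0.1, "Other": 0.4}

print(coverage_gaps(tweets, affected))
```

The same comparison works for language, age band, or income group, provided a trustworthy demographic reference is available for enrichment.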

Example: Noise, or Inserted content
“Bots are just a tool for making the
numbers look how you want them
to look.”
Sam Woolley
Researcher, Oxford University’s
Project on Computational
Propaganda
Wired: Nov 8, 2016
“The Political Twitter Bots Will Rage This Election Day”
Event: Election
Bot tweets
• ~400,000 bots tweeting on the election
• ~20% of all election-related tweets came from an army of influential
bots
• 55-80% of Twitter activity (the likes, follows, and retweets) is
from bots
• It had been easier to identify earlier bots, but now it’s incredibly
difficult for a human to make a determination
• Evaluate patterns
• Is there any real sentiment here?
• How much repetitive content is there?
• How much “influence” comes from a single or a
few sources (negative or positive)?
• Will it skew the analysis?
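Two of these questions, how much content is repetitive and how much comes from a few sources, can be estimated with simple counts before any sentiment analysis is run. The sketch below uses hypothetical helper names and a made-up sample of messages.

```python
from collections import Counter


def repetition_rate(texts):
    """Share of messages that duplicate another message after basic normalization."""
    normalized = [" ".join(t.lower().split()) for t in texts]
    return 1 - len(set(normalized)) / len(normalized)


def top_source_share(sources, k=1):
    """Share of all messages produced by the k most active sources."""
    counts = Counter(sources)
    return sum(n for _, n in counts.most_common(k)) / len(sources)


# Hypothetical feed: (account, text) pairs.
feed = [
    ("bot1", "Vote now!  #election"),
    ("bot1", "vote now! #election"),
    ("bot1", "VOTE NOW! #election"),
    ("user1", "Long lines at my polling place this morning"),
    ("user2", "Just voted, took ten minutes"),
]

texts = [text for _, text in feed]
accounts = [account for account, _ in feed]

print(repetition_rate(texts))
print(top_source_share(accounts, k=1))
```

High values on either measure do not prove bot activity, but they indicate that a small amount of underlying content could skew the analysis.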

Example: Simple bias
“The “black sheep problem” is that if you
were to try to guess what color most sheep
were by looking [at] language data, it would
be very difficult for you to conclude that
they weren't almost all black. In English,
“black sheep” outnumbers “white sheep”
about 25:1 (many "black sheeps” are movie
references); in French it's 3:1; in German it's
12:1. Some languages get it right; in Korean
it's 1:1.5 in favor of white sheep…”
Hal Daumé
Associate Professor, University of Maryland
Blog: June 24, 2016
“Language bias and black sheep”
http://nlpers.blogspot.com/2016/06/language-bias-
and-black-sheep.html
Data: Google Word2Vec data set
Word2vec
• Converts words into a vector space for analysis
• “Numerous researchers have begun to use the data to better understand
everything from machine translation to intelligent Web searching.”
• Embeddings based on a group of 300 million words taken from Google News
• Researchers from Boston University and Microsoft have found it is
“blatantly sexist”
• Impacts the ability to create personalized services
• Evaluate % of words & associations
• How do I interpret a sentiment?
• Does this data set contain hidden and
unexpressed bias?
• Will I miss opportunities because of hidden
assumptions?
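"Evaluate % of words & associations" can start with something as basic as counting which modifiers appear next to a word in the corpus, in the spirit of the black sheep example. The sketch below counts adjacent word pairs. The tiny `corpus` string and the function names are assumptions, and a real evaluation of embeddings would need far more than bigram counts.

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count adjacent word pairs in lower-cased text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(words, words[1:]))


def modifier_counts(text, noun, modifiers):
    """How often each modifier directly precedes the noun."""
    counts = bigram_counts(text)
    return {m: counts[(m, noun)] for m in modifiers}


# Hypothetical text sample.
corpus = (
    "He was the black sheep of the family. The black sheep returned home. "
    "A white sheep grazed nearby. Black Sheep is also a movie title."
)

print(modifier_counts(corpus, "sheep", ["black", "white"]))
```

A skewed ratio here reflects how people write, not what the world looks like, which is exactly the hidden bias the slide warns about.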

Challenges To Ensuring
Data Quality
Many sources of data (70%) and volume of data (48%)
are among the top 3 challenges companies face when
ensuring high quality data.
Applying governance processes to manage and measure
data quality is second with 50%.
* Syncsort, 2019 Enterprise Data Quality survey
What are the greatest challenges you face when ensuring high data quality?
• 70% – Many sources of data
• 50% – Applying governance processes to manage and measure data…
• 48% – Volume of data
• 47% – Inconsistent formats of data
• 46% – Inconsistent definitions of data
• 43% – Missing information
• 32% – Connecting policies and rules to data
• 27% – Misfielded data
• 27% – Lack of skills/staff
• 25% – Lack of tools (or inadequate tools)
• 15% – Not seen as an organizational priority

3. Processing at Scale
New Data Quality considerations
• Handling data volumes and distributed data
• Profiling data – assessing high volumes and streaming data
• Standardizing and enriching data content
• Matching entities – not just master data – e.g. transactions for fraud detection
• Meeting Service Level Agreements (SLAs)
• Running consistently on new and regularly changing platforms (Hadoop,
Spark, Cloud)

Big Data at scale distributes data across many nodes –
not necessarily with other relevant data!
• Data Quality functions must be performed in a consistent
manner, no matter where actual processing takes place, how
the data is segmented, and what the data volume is
• Cleansing, standardization, and data validation will generally scale
linearly
• Data Enrichment: Reference data, lookups must be readily
accessible by any process wherever executed
Handling distributed data volumes
Source: HP Analyst Briefing

• But particular implications for profiling, joining, sorting, and
matching data
• Profiling: Identification of outliers necessitates full volume views
and need to aggregate statistics and frequencies of data
distributed across cluster
• Joins & sorts: Efficient shuffling of data stored across cluster is
critical
• Entity Resolution: Distinguishing matches that indicate a single
specific entity across so much data requires multiple passes with
sophisticated multi-field matching algorithms – with results that
are understandable by business users in order to be meaningful
Handling distributed data volumes
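The profiling point above, aggregating statistics and frequencies for data distributed across a cluster, relies on partial results that can be merged. The sketch below shows the pattern in plain Python: each partition is profiled independently and the partial profiles are then combined. The `Profile` class and the sample partitions are hypothetical, and a real deployment would run the per-partition step inside a framework such as Spark or MapReduce.

```python
from collections import Counter
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class Profile:
    """Mergeable column profile: counts, sum, and value frequencies."""
    count: int = 0
    nulls: int = 0
    total: float = 0.0
    freq: Counter = field(default_factory=Counter)

    def add(self, value):
        """Fold one value into the profile."""
        self.count += 1
        if value is None:
            self.nulls += 1
            return
        self.total += value
        self.freq[value] += 1

    def merge(self, other):
        """Combine two partial profiles without revisiting the raw data."""
        return Profile(
            self.count + other.count,
            self.nulls + other.nulls,
            self.total + other.total,
            self.freq + other.freq,
        )

    @property
    def mean(self):
        populated = self.count - self.nulls
        return self.total / populated if populated else float("nan")


def profile_partition(values):
    """Profile one partition; on a cluster this step would run on each node."""
    p = Profile()
    for v in values:
        p.add(v)
    return p


# Hypothetical column values spread across three nodes.
partitions = [[1, 2, None], [2, 3], [100]]
merged = reduce(Profile.merge, (profile_partition(p) for p in partitions))

print(merged.count, merged.nulls, merged.mean)
print(merged.freq.most_common(1))
```

The outlier value 100 only stands out once the partial profiles are merged, which illustrates why outlier identification needs a full-volume view.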

Anti-Money
Laundering on
Hadoop at
Global Bank
• Must provide cluster-native
data verification,
enrichment, and demanding
multi-field fuzzy matching for
entity resolution to Golden
Record
• Massive data volumes
• Scattered data – Mainframe,
RDBMS, Cloud, …
• Must be secure – Kerberos,
LDAP
• Must have lineage – data
origin to end point
• Must archive unaltered
mainframe data
Full Anti-Money Laundering
regulatory compliance with
financial crimes data lake –
high performance
results at massive scale.
• Full end-to-end data lineage
supplied to Apache Atlas and ASG
Data Intelligence
• Cluster-native data verification,
enrichment, and demanding
multi-field entity resolution on
Spark
• Unmodified mainframe “Golden
Records” stored on Hadoop
Bank must monitor transactions to
detect Money Laundering for FCA
compliance.
Leverage Machine learning at scale
to detect patterns, but …
Requires large amounts of current,
clean data.

4. Data Literacy / Democratization

Data Democratization
Data Quality is a key component to user empowerment
• Data Literacy - critical to understand:
• Business context and language
• Data (including data structures and data types)
• Data access (how and where to find)
• Data usage (how will the data be used by the business)
• Basic Statistics
• Data Quality dimensions
• Data Quality techniques and tools
• Resource constraints – in both Data Quality and technologies
• What questions to ask?
• Where to find answers?

Approaches to Addressing
Emerging Data Quality Trends

Approaches
Data Literacy / Data Governance
• Communicating Best Practices in Data Quality for everyone
“Universal” Data Quality Best Practices
• Establish Scope: ask core questions
• Identify data requirements
• Address bias
• Understand context
• Address and resolve data quality issues
• Apply data governance processes
Solving “Big Data” Data Quality Challenges
• Handle scale
• Ensure consistent data quality application
across platforms

Culture of Data Literacy
• “Democratization of Data” requires cultural support
• Empowered to ask questions about the data
• Trained to understand and use data
• Trained to understand how to approach and evaluate data quality
• Traditional data, new data, machine learning requirements, …
• Understand the business context of the data
Program of Data Governance
• Provide the processes and practices necessary for success
• Measure, monitor, and improve
• Continuous iteration and development
Center of Excellence/Knowledge Base
• Where do you go to find answers?
• Who can help show you how?
Communicate!

Data Literacy: challenges & best practices
• Lack of Common Terminology
• Organizational Barriers & Silos
• Isolated or Unknown Work
• Lack of Engagement
Establish a Common Language
• Define terminology – a ‘stake in the ground’
• Map information
• Support with policies/standards
Gain Broader Buy In
• Bring stakeholders together
• Build the structure, culture,
ownership, steering groups,
stewardship over time
Enrich Information
• Discover what you don’t know
• Resolve differences
• Enhance/annotate to increase insight
Share Insights Regularly
• Produce and share tangible outcomes
• Highlight ‘wins’
• Demonstrate efficiencies & savings
Copyright © Syncsort 2019

“If you don’t know what you want to
get out of the data, how can you
know what data you need – and
what insight you’re looking for?”
Wolf Ruzicka
Chairman of the Board at EastBanc
Technologies
Blog post: June 1, 2017
“Grow A Data Tree Out Of The “Big Data” Swamp”
Establish Scope
• Understand the business objective and problem
• Ask the “right questions” about your data (not just “what”
and “how”)
• Empower users (“Who”) to gain new clarity into the core
problem (“Why”)
• “High-quality data” definition will vary by business problem
Identify Requirements & Processes
• Do you have all the data required?
• Do you understand the characteristics and context of the data?
• How will data be matched, consolidated, or connected?
• What’s needed to facilitate the matching, consolidation, or
connection required?
• Have you evaluated the sources?
• What’s the Fitness for your Purpose?
Universal Data Quality best practices

Understand Context
• What are the Critical Data Elements?
• What qualities do we need to address, or leave alone?
• When, and where, do we need to transform or enrich the data
content?
• How are we connecting, relating, or combining data?
Develop, Test, and Deploy Corrective Measures
• Consistent application of standardization, transformation,
enrichment, and entity resolution
• Common templates, rules, metrics, and processes that can be
leveraged
• Deploy into batch, real-time, or embedded services
Apply Data Governance
• Deploy and implement metrics and measures for ongoing
assessment and evaluation
Universal Data Quality best practices
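Ongoing assessment usually means comparing current metric values with an agreed baseline and raising the differences for review. The sketch below shows one minimal way to do that. The metric names, the baseline values, and the 0.02 tolerance are illustrative assumptions.

```python
def regressions(baseline, current, tolerance=0.02):
    """Return metrics that dropped below their baseline by more than the tolerance.

    Each result maps the metric name to a (baseline, current) pair.
    """
    return {
        name: (baseline[name], value)
        for name, value in current.items()
        if name in baseline and baseline[name] - value > tolerance
    }


# Hypothetical data quality scores between 0.0 and 1.0.
baseline = {"completeness": 0.98, "uniqueness": 0.995, "validity": 0.97}
this_week = {"completeness": 0.93, "uniqueness": 0.994, "validity": 0.975}

for metric, (was, now) in regressions(baseline, this_week).items():
    print(f"{metric} fell from {was:.1%} to {now:.1%}")
```

The tolerance keeps small fluctuations from generating noise, so that the issues surfaced are ones worth a root cause review.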
“Never lead with a data set;
lead with a question.”
Anthony Scriffignano
Chief Data Scientist, Dun & Bradstreet
Forbes Insights, May 31, 2017
“The Data Differentiator”

Quantify: challenges & best practices
• Hidden Activities
• Money, Time and Resource
Waste
• Lack of Transparency and Trust
• Disconnect Between Process
and Measures
Identify Baseline Measures
• Keep a focus on lean and agile
• Define value accurately for the business
Link to Business Performance
• Create and refine streams of value
• Transform culture through action
and empowerment
Monitor, Report and Remediate Issues
• Continuously review
• Ensure issues are visible and understood
• Understand root causes
• Address/resolve issues
Quantify Impact of Changes
• Demonstrate through clearly understood measures
• Establish value continuously
• Finish early, finish often

Leverage tools built for Big Data
• Focus on the data quality challenges, not the Big Data ones
• Connect to and process hundreds of millions of records of data
• Standardize, enhance, and match international data sets with postal and
country-code validation
• Integrate, enrich, and match new and legacy customer data from multiple
disparate sources
• Deploy data quality workflows as native, parallel MapReduce or Spark
processes for optimal efficiency on premises or in the Cloud
• Increase processing efficiency by expanding cluster, not rebuilding
processes
• Support failover through fault-tolerant designs; during a node failure,
processing is redirected to another node

Simplify: Design Once, Deploy Anywhere
Intelligent Execution - Insulate your organization from underlying complexities of Big Data
Get excellent performance every time
without tuning, load balancing, etc.
Avoid re-design, re-compile, re-work
• Future-proof job designs for emerging compute
frameworks
• Move from dev to test to production
• Move from on-premises to Cloud
• Move from one Cloud to another
Use existing Data Quality skills
• Focus on data quality problems, not technical ones
Design Once
in visual GUI
Deploy Anywhere!
On-Premises,
Cloud
MapReduce, Spark,
Future Platforms
Windows, Linux,
Unix
Batch,
Streaming
Single Node,
Cluster

Data Quality remains Data Quality, even at scale
“Data and analytics leaders need to understand the
business priorities and challenges of their organization.
Only then will they be in the right position to create
compelling business cases that connect data quality
improvement with key business priorities.”
Ted Friedman
VP Distinguished Analyst, Gartner
Smarter with Gartner at Gartner.com: June 12, 2018
“How to Create a Business Case for Data Quality Improvement”
