SlideShare una empresa de Scribd logo
1 de 23
Descargar para leer sin conexión
Presented By:
Mastering Your
Customer Data
on Apache Spark
About Caserta Concepts
• Award-winning technology innovation consulting with
expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
What motivated our solution
How many people do you know??
Anthropologists say it’s 150…
MAX
What if we could increase this number?
Opportunities
Closing
Revenue
Project Dunbar - Internal
Developed in:
• Python
• Neo4j Database
• Build a social graph based on internal and external data
• Run pathing algorithms to understand strategic opportunity advantages
..and then
And then one of our customers wanted us to build it for
them:
• Much larger dataset
• 6 million customers > 30% duplication rate
• 100’s of millions of customer interactions
• Few direct links across channels
How to?
Throwing a bunch of unrelated points in a graph
will not give us a useable solution.
We need to
MASTER
our contact data…
• We need to clean and
normalize our incoming
interaction and relationship data
(edges)
• Clean normalize and match our
entities (vertexes)
Mastering Customer Data
Customer Data Integration (CDI):
is the process of consolidating and managing customer information
from all available sources.
In other words…
We need to figure out how to
LINK people across systems!
Steps required
Standardization
Matching
Survivorship
Validation
Traditional Standardization and Matching
Cleanse and Parse:
• Names
• Resolve nicknames
• Create deterministic hash,
phonetic representation
• Addresses
• Emails
• Phone Numbers
Matching:
join based on combinations
of cleansed and standardized
data to create match results
Great – But the NEW data is different
Reveal
• Wait for the customer to “reveal” themselves
• Create a link between anonymous self and known
profile
Vector
• May need behavioral statistical profiling
• Compare use vectors
Rebuild
• Recluster all prior activities
• Rebuild the Graph
Why Spark
“Big Box” MDM tools vs ROI?
• Prohibitively expensive à limited by licensing $$$
• Typically limited to the scalability of a single server
We Spark!
• Development local or distributed is identical
• Beautiful high level API’s
• Databricks cloud is soo Easy
• Full universe of Python modules
• Open source and Free**
• Blazing fast!
Spark has become our default
processing engine for a myriad of
engineering problems
Spark map operations
Cleansing, transformation, and standardization of both
interaction and customer data
Amazing universe of Python modules:
• Address Parsing: usaddress, postal-address, etc
• Name Hashing: fuzzy, etc
• Genderization: sexmachine, etc
And all the goodies of the standard library!
We can now parallelize our workload against a large
number of machines:
Matching process
• We now have clean standardized, linkable data
• We need to resolve our links between our customer
• Large table self joins
• We can even use SQL:
Matching process
The matching process output gives us the relationships between
customers:
Great, but it’s not very useable, you need to traverse the dataset
to find out 1234 and 1235 are the same person (and this is a
trivial case)
And we need to cluster and identify our survivors (vertex)
xid yid match_type
1234 4849 phone
4849 5499 email
5499 1235 address
4849 7788 cookie
5499 7788 cookie
4849 1234 phone
Graphx to the rescue
1234 4849
5499
7788
We just need to import our
edges into a graph and “dump”
out communities
Don’t think table…
think Graph!
These matches
are actually
communities
1235
Connected components
Connected Components algorithm labels each connected
component of the graph with the ID of its lowest-numbered
vertex
This lowest number vertex can serve as our “survivor” (not
field survivorship)
Is it possible to write less code?
Field level survivorship rules
We now need to survive fields to our survivor to make it a
“best record”.
Depending on the attribution we choose:
• Latest reported
• Most frequently used
We then do some simple ranking
(thanks windowed functions)
What you really want to know...
With a customer dataset of
approximately 6 million customers
and 100’s of millions of data points.
… when directly compared to
traditional “big box” enterprise MDM
software
Was Spark faster….
Map operations like cleansing, standardization saw a
10x improvement
on a 4 node r3.2xlarge cluster
But the real winner was the “Graph Trick”…
The survivorship step was reduced from nearly 2 hours using
traditional tools to under 5 minutes using Spark SQL and GraphX**
Connected components itself took less than 1 minute
**loading RDD’s from S3, creating matches, building the graph, and running connected
components and storing the results to S3
Community
Elliott Cordo
Chief Architect, Caserta Concepts
elliott@casertaconcepts.com
info@casertaconcepts.com
1 (855) 755-2246
www.casertaconcepts.com
Thank You
Kevin
Data Engineer
kevin@casertaconcepts.com

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Role of Data in Digital Transformation
Role of Data in Digital TransformationRole of Data in Digital Transformation
Role of Data in Digital Transformation
 
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform System
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Data Migration to Azure
Data Migration to AzureData Migration to Azure
Data Migration to Azure
 
Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance
Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance
Collibra Data Citizen '19 - Bridging Data Privacy with Data Governance
 
Democratizing Data at Airbnb
Democratizing Data at AirbnbDemocratizing Data at Airbnb
Democratizing Data at Airbnb
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation Criteria
 
The ABCs of Treating Data as Product
The ABCs of Treating Data as ProductThe ABCs of Treating Data as Product
The ABCs of Treating Data as Product
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Data Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data QualityData Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data Quality
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Graph Databases – Benefits and Risks
Graph Databases – Benefits and RisksGraph Databases – Benefits and Risks
Graph Databases – Benefits and Risks
 
Master Data Management's Place in the Data Governance Landscape
Master Data Management's Place in the Data Governance Landscape Master Data Management's Place in the Data Governance Landscape
Master Data Management's Place in the Data Governance Landscape
 
How to Sell Acquia DXP, Marketing Cloud, and Drupal Cloud
How to Sell Acquia DXP, Marketing Cloud, and Drupal Cloud How to Sell Acquia DXP, Marketing Cloud, and Drupal Cloud
How to Sell Acquia DXP, Marketing Cloud, and Drupal Cloud
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 

Destacado

Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
Caserta
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Spark Summit
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Spark Summit
 

Destacado (20)

Big MDM Part 2: Using a Graph Database for MDM and Relationship Management
Big MDM Part 2: Using a Graph Database for MDM and Relationship ManagementBig MDM Part 2: Using a Graph Database for MDM and Relationship Management
Big MDM Part 2: Using a Graph Database for MDM and Relationship Management
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
 
Product recommendation guide for eCommerce
Product recommendation guide for eCommerceProduct recommendation guide for eCommerce
Product recommendation guide for eCommerce
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
 
Key insights in ecommerce personalisation - via Shopboostr
Key insights in ecommerce personalisation - via ShopboostrKey insights in ecommerce personalisation - via Shopboostr
Key insights in ecommerce personalisation - via Shopboostr
 
Omnichannel First Steps
Omnichannel First StepsOmnichannel First Steps
Omnichannel First Steps
 
More Channels = More Money
More Channels = More MoneyMore Channels = More Money
More Channels = More Money
 
Kodu Game Lab e Project Spark
Kodu Game Lab e Project SparkKodu Game Lab e Project Spark
Kodu Game Lab e Project Spark
 
Wejście do Omnichannel - 5 czynników sukcesu
Wejście do Omnichannel - 5 czynników sukcesuWejście do Omnichannel - 5 czynników sukcesu
Wejście do Omnichannel - 5 czynników sukcesu
 
Electronic Shopper Archetypes
Electronic Shopper ArchetypesElectronic Shopper Archetypes
Electronic Shopper Archetypes
 
BIg Data Trends in 2016
BIg Data Trends in 2016BIg Data Trends in 2016
BIg Data Trends in 2016
 
1DMP: Marketing Data Platform - the future of data-driven marketing
1DMP: Marketing Data Platform - the future of data-driven marketing1DMP: Marketing Data Platform - the future of data-driven marketing
1DMP: Marketing Data Platform - the future of data-driven marketing
 
Neo4j Solutions - Master Data Management
Neo4j Solutions - Master Data ManagementNeo4j Solutions - Master Data Management
Neo4j Solutions - Master Data Management
 
Internet of Things and Big Data
Internet of Things and Big DataInternet of Things and Big Data
Internet of Things and Big Data
 
Thank Bunny - Customer Engagement Platform
Thank Bunny - Customer Engagement PlatformThank Bunny - Customer Engagement Platform
Thank Bunny - Customer Engagement Platform
 
The Big Data Ecosystem for Financial Services
The Big Data Ecosystem for Financial ServicesThe Big Data Ecosystem for Financial Services
The Big Data Ecosystem for Financial Services
 
Cloud transition - The Trivadis approach
Cloud transition - The Trivadis approachCloud transition - The Trivadis approach
Cloud transition - The Trivadis approach
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 

Similar a Mastering Your Customer Data on Apache Spark by Elliott Cordo

Similar a Mastering Your Customer Data on Apache Spark by Elliott Cordo (20)

A6 big data_in_the_cloud
A6 big data_in_the_cloudA6 big data_in_the_cloud
A6 big data_in_the_cloud
 
Introduction: Relational to Graphs
Introduction: Relational to GraphsIntroduction: Relational to Graphs
Introduction: Relational to Graphs
 
How to Empower Your Business Users with Oracle Data Visualization
How to Empower Your Business Users with Oracle Data VisualizationHow to Empower Your Business Users with Oracle Data Visualization
How to Empower Your Business Users with Oracle Data Visualization
 
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4j
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4jNeo4j GraphTalks - Introduction to GraphDatabases and Neo4j
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4j
 
Knowledge Graphs Webinar- 11/7/2017
Knowledge Graphs Webinar- 11/7/2017Knowledge Graphs Webinar- 11/7/2017
Knowledge Graphs Webinar- 11/7/2017
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data ScienceGet Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
 
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsPower to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
 
In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph Strategy
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova Generazione
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data Science
 
Introducing Neo4j
Introducing Neo4jIntroducing Neo4j
Introducing Neo4j
 
Beyond Big Data: Leverage Large-Scale Connections
Beyond Big Data: Leverage Large-Scale ConnectionsBeyond Big Data: Leverage Large-Scale Connections
Beyond Big Data: Leverage Large-Scale Connections
 
A whirlwind tour of graph databases
A whirlwind tour of graph databasesA whirlwind tour of graph databases
A whirlwind tour of graph databases
 
SPSNYC2019 - What is Common Data Model and how to use it?
SPSNYC2019 - What is Common Data Model and how to use it?SPSNYC2019 - What is Common Data Model and how to use it?
SPSNYC2019 - What is Common Data Model and how to use it?
 
Intro to Neo4j
Intro to Neo4jIntro to Neo4j
Intro to Neo4j
 
Neo4j Graph Data Science - Webinar
Neo4j Graph Data Science - WebinarNeo4j Graph Data Science - Webinar
Neo4j Graph Data Science - Webinar
 

Más de Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Más de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 

Último (20)

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Mastering Your Customer Data on Apache Spark by Elliott Cordo

  • 2. About Caserta Concepts • Award-winning technology innovation consulting with expertise in: • Big Data Solutions • Data Warehousing • Business Intelligence • Core focus in the following industries: • eCommerce / Retail / Marketing • Financial Services / Insurance • Healthcare / Ad Tech / Higher Ed • Established in 2001: • Increased growth year-over-year • Industry recognized work force • Strategy, Implementation • Writing, Education, Mentoring • Data Science & Analytics • Data on the Cloud • Data Interaction & Visualization
  • 3. What motivated our solution How many people do you know??
  • 5. What if we could increase this number? Opportunities Closing Revenue
  • 6. Project Dunbar - Internal Developed in: • Python • Neo4j Database • Build a social graph based on internal and external data • Run pathing algorithms to understand strategic opportunity advantages
  • 7. ..and then And then one of our customers wanted us to build it for them: • Much larger dataset • 6 million customers > 30% duplication rate • 100’s of millions of customer interactions • Few direct links across channels
  • 8. How to? Throwing a bunch of unrelated points in a graph will not give us a useable solution. We need to MASTER our contact data… • We need to clean and normalize our incoming interaction and relationship data (edges) • Clean normalize and match our entities (vertexes)
  • 9. Mastering Customer Data Customer Data Integration (CDI): is the process of consolidating and managing customer information from all available sources. In other words… We need to figure out how to LINK people across systems!
  • 11. Traditional Standardization and Matching Cleanse and Parse: • Names • Resolve nicknames • Create deterministic hash, phonetic representation • Addresses • Emails • Phone Numbers Matching: join based on combinations of cleansed and standardized data to create match results
  • 12. Great – But the NEW data is different Reveal • Wait for the customer to “reveal” themselves • Create a link between anonymous self and known profile Vector • May need behavioral statistical profiling • Compare use vectors Rebuild • Recluster all prior activities • Rebuild the Graph
  • 13. Why Spark “Big Box” MDM tools vs ROI? • Prohibitively expensive à limited by licensing $$$ • Typically limited to the scalability of a single server We Spark! • Development local or distributed is identical • Beautiful high level API’s • Databricks cloud is soo Easy • Full universe of Python modules • Open source and Free** • Blazing fast! Spark has become our default processing engine for a myriad of engineering problems
  • 14. Spark map operations Cleansing, transformation, and standardization of both interaction and customer data Amazing universe of Python modules: • Address Parsing: usaddress, postal-address, etc • Name Hashing: fuzzy, etc • Genderization: sexmachine, etc And all the goodies of the standard library! We can now parallelize our workload against a large number of machines:
  • 15. Matching process • We now have clean standardized, linkable data • We need to resolve our links between our customer • Large table self joins • We can even use SQL:
  • 16. Matching process The matching process output gives us the relationships between customers: Great, but it’s not very useable, you need to traverse the dataset to find out 1234 and 1235 are the same person (and this is a trivial case) And we need to cluster and identify our survivors (vertex) xid yid match_type 1234 4849 phone 4849 5499 email 5499 1235 address 4849 7788 cookie 5499 7788 cookie 4849 1234 phone
  • 17. Graphx to the rescue 1234 4849 5499 7788 We just need to import our edges into a graph and “dump” out communities Don’t think table… think Graph! These matches are actually communities 1235
  • 18. Connected components Connected Components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex This lowest number vertex can serve as our “survivor” (not field survivorship) Is it possible to write less code?
  • 19. Field level survivorship rules We now need to survive fields to our survivor to make it a “best record”. Depending on the attribution we choose: • Latest reported • Most frequently used We then do some simple ranking (thanks windowed functions)
  • 20. What you really want to know... With a customer dataset of approximately 6 million customers and 100’s of millions of data points. … when directly compared to traditional “big box” enterprise MDM software Was Spark faster….
  • 21. Map operations like cleansing, standardization saw a 10x improvement on a 4 node r3.2xlarge cluster But the real winner was the “Graph Trick”… The survivorship step was reduced from nearly 2 hours using traditional tools to under 5 minutes using Spark SQL and GraphX** Connected components itself took less than 1 minute **loading RDD’s from S3, creating matches, building the graph, and running connected components and storing the results to S3
  • 23. Elliott Cordo Chief Architect, Caserta Concepts elliott@casertaconcepts.com info@casertaconcepts.com 1 (855) 755-2246 www.casertaconcepts.com Thank You Kevin Data Engineer kevin@casertaconcepts.com