SlideShare una empresa de Scribd logo
1 de 33
Joe Caserta
President
September 30, 2016, Javits Center, New York City
Building new Data Ecosystem for
Customer Analytics
Caserta Timeline
Launched Big Data practice
Co-author, with Ralph Kimball, The Data
Warehouse ETL Toolkit (Wiley)
Data Analysis, Data Warehousing and Business
Intelligence since 1996
Began consulting database programing and data
modeling 25+ years hands-on experience building database
solutions
Founded Caserta Concepts in NYC
Web log analytics solution published in Intelligent
Enterprise magazine
Launched Data Science, Data Interaction and Cloud
practices
Laser focus on extending Data Analytics with Big Data
solutions
1986
2004
1996
2009
2001
2013
2012
2014
Dedicated to Data Governance Techniques on Big
Data (Innovation)
Awarded Top 20 Big Data
Companies 2016
Top 20 Most Powerful
Big Data consulting firms
Launched Big Data Warehousing (BDW) Meetup NYC:
2,000+ Members
2016 Awarded Fastest Growing Big Data
Companies 2016
Established best practices for big data ecosystem
implementations
About Caserta Concepts
– Consulting Data Innovation and Modern Data Engineering
– Award-winning company
– Internationally recognized work force
– Strategy, Architecture, Implementation, Governance
– Innovation Partner
– Strategic Consulting
– Advanced Architecture
– Build & Deploy
• Leader in Enterprise Data Solutions
– Big Data Analytics
– Data Warehousing
– Business Intelligence
Data Science
Cloud Computing
Data Governance
Awards & Recognition
Top 10
Fastest Growing
Big Data Companies
2016
Our Partners
Use Case Objectives
• Cross-Channel Behavior Tracking / Identity Resolution
• Access to Atomic Level Transaction Data
• Blend Data Assets
• Fast Data Onboarding and Discovery
• Ensure Data Quality
• Single platform / Landscape simplification
• Understand Path-to-Purchase Customer Journey
• Improve Customer Experience & Increase Sales
The Recipe: Build a Dynamic Data Platform
OLD WAY:
• Structure Data Ingest Data Analyze Data
• Fixed Capacity
• Monolith
NEW WAY:
• Ingest Data Analyze Data Structure Data
• Dynamic Capacity
• Ecosystem
RECIPE:
• Cloud
• Data Lake
• Holistic Architecture & Framework
Analytics: The Whole Brain Challenge
Front
Back
Analytics Oriented
• Data Science
• Research
Process Oriented
• Data Governance
• Compliance
Operations Oriented
• Shared Services
• Data Engineering
Revenue Oriented
• Revenue Goals
• Monetizing Data
Chief Data Organization (Oversight)
Vertical Business Area
[Sales/Finance/Marketing/Operations/Customer Svc]
Product Owner
SCRUM Master
Development Team
Business Subject Matter Expertise
Data Librarian/Data Stewardship
Data Science/ Statistical Skills
Data Engineering / Architecture
Presentation/ BI Report Development Skills
Data Quality Assurance
DevOps
IT Organization
(Oversight)
Enterprise Data Architect
Solution Engineers
Data Integration Practice
User Experience Practice
QA Practice
Operations Practice
Advanced Analytics
Business Analysts
Data Analysts
Data Scientists
Statisticians
Data Engineers
Planning Organization
Project Managers
Data Organization
Data Gov Coordinator
Data Librarians
Data Stewards
It Takes a Village!
Unexpected Reaction to Change
Global economics
Intensity of competition
Reduce costs
Move to cross-functional teams
New executive leadership
Speed of technical change
Social trends and changes
Period of time in present role
Status & perks of office/dept under threat
No apparent reasons for proposed changes
Lack of understanding of proposed changes
Fear of inability to cope with new technology
Concern over job security
Forces for Change Forces Resisting Change
Status Quo
Moving the Status Quo
http://www.change-management-coach.com/force-field-analysis.html
The Data Lake Paradigm
Technology:
• Scalable distributed storage S3
• Pluggable fit-for-purpose processing EMR
• Consistent extensible framework Spark
• Dimensional Data Warehouse Redshift
Functional Capabilities:
• Remove barriers between data ingestion and analysis
• Democratize Data with Just Enough Data Governance
Why AWS?
Why Spark?
“Big Box” tools vs ROI?
– Prohibitively expensive limited by licensing $$$
– Typically limited to the scalability of a single server
We Spark!
• Development local or distributed is identical
• Beautiful high level API’s
• Full universe of Python modules
• Open source and Free
• Blazing fast!
• Databricks cloud makes it easier
Spark has become our default processing engine for a
myriad of engineering & science problems
Ingest Raw
Data
Organize, Define,
Complete
Munging, Blending
Machine Learning
Data Quality and Monitoring
Metadata, ILM , Security
Data Catalog
Data Integration
Fully Governed ( trusted)
Arbitrary/Ad-hoc Queries
and Reporting
Big
Data
Ware
house
Data Science Workspace
Data Lake – Integrated Sandbox
Landing Area – Source Data in “Full Fidelity”
Usage Pattern Data Governance
Metadata, ILM,
Security
Data Architecture is the new Data Model
Data Lifecycle
Ingest
TransformAnalyze
Output
Understand
Problem
Ingest
Data
Explore and
Understand
Data
Clean and
Shape Data
Evaluate
Data
Create and
build Models
Communic
ate Results
Deliver &
Deploy Model
Data Engineer
Architect how data is organized
& ensure operability
Data Scientist
Deep analytics and modeling
for hidden insights
Business Analyst
Work with data to apply
insights to business strategy
App Developer
Integrates data & insights with
existing or new applications
Type
Comments
Single Touch Rules-Based Statistically Driven
Assign the credit
to the first or last
exposure
Assign the credit to
each interaction
based on business
rules
Assign the credit to
interactions based
on data-driven
model
Ad-Click Mailing MailingE-mail E-mailAd-Click Ad-Click
100% 33% 33% 33% 27% 49% 24%
- Last touch only
- Ignores bulk of
customer journey
- Undervalues
other interactions
and influencers
- Subjective
- Assigns arbitrary
values to each
interaction
- Lacks analytics rigor
to determine weights
ü Looks at full behavior
patterns
ü Consider all touch points
ü Can apply different
models for best results
ü Use data to find
correlations between
touch points (winning
combinations)
Path-to-Purchase Methods
Unifying the Customer Across Channels
Customer Data Integration (CDI):
Match and manage customer information from all available sources
Marketing channels: DMP, Salesforce, Adobe, Social, Direct Mail, Call Center, CRM
In other words…
We need to figure out how to
LINK people across systems!
Mastering Master Data is Still MDM
Standardize
Match
Survivorship
Validate
Standardization and Matching
Cleanse and Parse:
• Names
• Resolve nicknames
• Create deterministic hash, phonetic
representation
• Addresses
• Emails
• Phone Numbers
Matching:
Join based on combinations of cleansed
and standardized data to create match
results:
Spark map operations:
• Data cleansing, transformation, and
standardization
– Address Parsing: usaddress, postal-
address, etc
– Name Hashing: fuzzy, etc
– Genderization: sexmachine, etc
Mastering Unmanageable Source Data
Reveal
• Wait for the customer to “reveal” themselves
• Create link between anonymous self and known profile
Vector
• May need behavioral statistical profiling
• Compare use vectors
Rebuild
• Recluster all prior activities
• Rebuild the Graph
The Matching Process
The matching process output gives us the relationships between customers:
Great, but it’s not very useable, you need to traverse the dataset to find out 1234 and 1235 are the
same person (and this is a trivial case)
And we need to cluster and identify our survivors (vertex)
xid yid match_type
1234 4849 phone
4849 5499 email
5499 1235 address
4849 7788 cookie
5499 7788 cookie
4849 1234 phone
Graph to the Rescue
1234 4849
5499
7788
We just need to import our edges into a graph
and “dump” out communities
Don’t think table…
think Graph! These matches are
actually communities
1235
Connected Components algorithm labels each connected component of the
graph with the ID of its lowest-numbered vertex
This lowest number vertex can serve as our “survivor” (not field survivorship)
Connected Components
xid yid
1234 4849
1234 5499
1234 1235
1234 7788
1234 7788
1234 1234
Identity Resolution Process
Data Ecosystem Architecture
ODS
ETL/ID Res
The Notebook is the ETL Tool
The Notebook is the Data Science Tool
The Redshift DW is still Dimensional
Use Graph for Data Lineage
The Goal: Top Conversion Paths
Source: Accelerom AG, Zurich
• Data Lake: S3 / NoSQL or SQL as Needed
• Identity Resolution: Spark / Graph Frames
• ETL: Spark / Airflow
• Path-to-Purchase: S3 / Spark / MLlib
• Data Warehouse: RedShift
• Shared Interface: Notebooks (Jupyter or Databricks)
• Business Intelligence: Tableau
Recap
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
@joe_Caserta
• Award-winning company
• Transformative Data Strategies
• Modern Data Engineering
• Advanced Architecture
• Innovation Partner
• Strategic Consulting
• Advanced Technical Design
• Build & Deploy Solutions
• BDW Meetup
• New York City
• 3,000+ members
• Knowledge sharing
Data is not important, it’s what you do with it that’s important!
Thank You

Más contenido relacionado

La actualidad más candente

Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
Caserta
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
Caserta
 
DataOps: Nine steps to transform your data science impact Strata London May 18
DataOps: Nine steps to transform your data science impact  Strata London May 18DataOps: Nine steps to transform your data science impact  Strata London May 18
DataOps: Nine steps to transform your data science impact Strata London May 18
Harvinder Atwal
 

La actualidad más candente (20)

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data Lake
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
 
Journey to Cloud Analytics
Journey to Cloud Analytics Journey to Cloud Analytics
Journey to Cloud Analytics
 
Agile Leadership: Guiding DataOps Teams Through Rapid Change and Uncertainty
Agile Leadership: Guiding DataOps Teams Through Rapid Change and UncertaintyAgile Leadership: Guiding DataOps Teams Through Rapid Change and Uncertainty
Agile Leadership: Guiding DataOps Teams Through Rapid Change and Uncertainty
 
Michael Stonebraker: Big Data, Disruption, and the 800 Pound Gorilla in the ...
Michael Stonebraker:  Big Data, Disruption, and the 800 Pound Gorilla in the ...Michael Stonebraker:  Big Data, Disruption, and the 800 Pound Gorilla in the ...
Michael Stonebraker: Big Data, Disruption, and the 800 Pound Gorilla in the ...
 
Defining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentDefining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business Environment
 
The Heart of Data Modeling: 7 Ways Your Agile Project is Managing Data Wrong
The Heart of Data Modeling: 7 Ways Your Agile Project is Managing Data WrongThe Heart of Data Modeling: 7 Ways Your Agile Project is Managing Data Wrong
The Heart of Data Modeling: 7 Ways Your Agile Project is Managing Data Wrong
 
DataOps: Nine steps to transform your data science impact Strata London May 18
DataOps: Nine steps to transform your data science impact  Strata London May 18DataOps: Nine steps to transform your data science impact  Strata London May 18
DataOps: Nine steps to transform your data science impact Strata London May 18
 

Destacado

Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystem
magda3695
 

Destacado (15)

Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystem
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Big Data Ecosystem - 1000 Simulated Drones
Big Data Ecosystem - 1000 Simulated DronesBig Data Ecosystem - 1000 Simulated Drones
Big Data Ecosystem - 1000 Simulated Drones
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Webinar: The Hybrid Data Ecosystem: Are You Battling an Illogical Data Wareho...
Webinar: The Hybrid Data Ecosystem: Are You Battling an Illogical Data Wareho...Webinar: The Hybrid Data Ecosystem: Are You Battling an Illogical Data Wareho...
Webinar: The Hybrid Data Ecosystem: Are You Battling an Illogical Data Wareho...
 
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopBig Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
 
KNOWLEDGE SCIENCE; NOT INFORMATION SCIENCE OR TECHNOLOGY- SCOPE,THEORIES AND...
KNOWLEDGE SCIENCE; NOT INFORMATION SCIENCE OR TECHNOLOGY-  SCOPE,THEORIES AND...KNOWLEDGE SCIENCE; NOT INFORMATION SCIENCE OR TECHNOLOGY-  SCOPE,THEORIES AND...
KNOWLEDGE SCIENCE; NOT INFORMATION SCIENCE OR TECHNOLOGY- SCOPE,THEORIES AND...
 
Towards Neuro–Information Science
Towards Neuro–Information ScienceTowards Neuro–Information Science
Towards Neuro–Information Science
 
Big Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use casesBig Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use cases
 
Data Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalData Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobal
 
Big data + data science startup focus points
Big data + data science startup focus pointsBig data + data science startup focus points
Big data + data science startup focus points
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Data Driven Decisions - Big Data Warehousing Meetup, FICO
Data Driven Decisions - Big Data Warehousing Meetup, FICOData Driven Decisions - Big Data Warehousing Meetup, FICO
Data Driven Decisions - Big Data Warehousing Meetup, FICO
 

Similar a Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016

Sterling IT
Sterling ITSterling IT
Sterling IT
kbass101
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual Workshop
CCG
 

Similar a Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016 (20)

Building the Artificially Intelligent Enterprise
Building the Artificially Intelligent EnterpriseBuilding the Artificially Intelligent Enterprise
Building the Artificially Intelligent Enterprise
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-shared
 
Delivering Value Through Business Analytics
Delivering Value Through Business AnalyticsDelivering Value Through Business Analytics
Delivering Value Through Business Analytics
 
"Hadoop: What we've learned in 5 years", Martin Oberhuber, Senior Data Scient...
"Hadoop: What we've learned in 5 years", Martin Oberhuber, Senior Data Scient..."Hadoop: What we've learned in 5 years", Martin Oberhuber, Senior Data Scient...
"Hadoop: What we've learned in 5 years", Martin Oberhuber, Senior Data Scient...
 
Why an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business SuccessWhy an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business Success
 
Self-Service Analytics Framework - Connected Brains 2018
Self-Service Analytics Framework - Connected Brains 2018Self-Service Analytics Framework - Connected Brains 2018
Self-Service Analytics Framework - Connected Brains 2018
 
Agile & Data Modeling – How Can They Work Together?
Agile & Data Modeling – How Can They Work Together?Agile & Data Modeling – How Can They Work Together?
Agile & Data Modeling – How Can They Work Together?
 
Self-service Analytic for Business Users-19july2017-final
Self-service Analytic for Business Users-19july2017-finalSelf-service Analytic for Business Users-19july2017-final
Self-service Analytic for Business Users-19july2017-final
 
Building enterprise advance analytics platform
Building enterprise advance analytics platformBuilding enterprise advance analytics platform
Building enterprise advance analytics platform
 
Max Cottica slides from Future of Business Intelligence
Max Cottica slides from Future of Business Intelligence Max Cottica slides from Future of Business Intelligence
Max Cottica slides from Future of Business Intelligence
 
Sterling IT
Sterling ITSterling IT
Sterling IT
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual Workshop
 
Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...
Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...
Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...
 
Implementing Data Mesh WP LTIMindtree White Paper
Implementing Data Mesh WP LTIMindtree White PaperImplementing Data Mesh WP LTIMindtree White Paper
Implementing Data Mesh WP LTIMindtree White Paper
 
Effectively Leveraging Graph Technology - Ann Grubbs, Lockheed Martin
Effectively Leveraging Graph Technology - Ann Grubbs, Lockheed MartinEffectively Leveraging Graph Technology - Ann Grubbs, Lockheed Martin
Effectively Leveraging Graph Technology - Ann Grubbs, Lockheed Martin
 
Data Architecture, Solution Architecture, Platform Architecture — What’s the ...
Data Architecture, Solution Architecture, Platform Architecture — What’s the ...Data Architecture, Solution Architecture, Platform Architecture — What’s the ...
Data Architecture, Solution Architecture, Platform Architecture — What’s the ...
 
ADV Slides: How to Improve Your Analytic Data Architecture Maturity
ADV Slides: How to Improve Your Analytic Data Architecture MaturityADV Slides: How to Improve Your Analytic Data Architecture Maturity
ADV Slides: How to Improve Your Analytic Data Architecture Maturity
 
DataEd Slides: Unlock Business Value Using Reference and Master Data Manageme...
DataEd Slides: Unlock Business Value Using Reference and Master Data Manageme...DataEd Slides: Unlock Business Value Using Reference and Master Data Manageme...
DataEd Slides: Unlock Business Value Using Reference and Master Data Manageme...
 

Más de Caserta

Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
Caserta
 

Más de Caserta (6)

Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Real Time Big Data Processing on AWS
Real Time Big Data Processing on AWSReal Time Big Data Processing on AWS
Real Time Big Data Processing on AWS
 
Big Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data LakeBig Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data Lake
 

Último

Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Último (20)

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016

  • 1. Joe Caserta President September 30, 2016, Javits Center, New York City Building new Data Ecosystem for Customer Analytics
  • 2. Caserta Timeline Launched Big Data practice Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley) Data Analysis, Data Warehousing and Business Intelligence since 1996 Began consulting database programing and data modeling 25+ years hands-on experience building database solutions Founded Caserta Concepts in NYC Web log analytics solution published in Intelligent Enterprise magazine Launched Data Science, Data Interaction and Cloud practices Laser focus on extending Data Analytics with Big Data solutions 1986 2004 1996 2009 2001 2013 2012 2014 Dedicated to Data Governance Techniques on Big Data (Innovation) Awarded Top 20 Big Data Companies 2016 Top 20 Most Powerful Big Data consulting firms Launched Big Data Warehousing (BDW) Meetup NYC: 2,000+ Members 2016 Awarded Fastest Growing Big Data Companies 2016 Established best practices for big data ecosystem implementations
  • 3. About Caserta Concepts – Consulting Data Innovation and Modern Data Engineering – Award-winning company – Internationally recognized work force – Strategy, Architecture, Implementation, Governance – Innovation Partner – Strategic Consulting – Advanced Architecture – Build & Deploy • Leader in Enterprise Data Solutions – Big Data Analytics – Data Warehousing – Business Intelligence Data Science Cloud Computing Data Governance
  • 4. Awards & Recognition Top 10 Fastest Growing Big Data Companies 2016
  • 6. Use Case Objectives • Cross-Channel Behavior Tracking / Identity Resolution • Access to Atomic Level Transaction Data • Blend Data Assets • Fast Data Onboarding and Discovery • Ensure Data Quality • Single platform / Landscape simplification • Understand Path-to-Purchase Customer Journey • Improve Customer Experience & Increase Sales
  • 7. The Recipe: Build a Dynamic Data Platform OLD WAY: • Structure Data Ingest Data Analyze Data • Fixed Capacity • Monolith NEW WAY: • Ingest Data Analyze Data Structure Data • Dynamic Capacity • Ecosystem RECIPE: • Cloud • Data Lake • Holistic Architecture & Framework
  • 8. Analytics: The Whole Brain Challenge Front Back Analytics Oriented • Data Science • Research Process Oriented • Data Governance • Compliance Operations Oriented • Shared Services • Data Engineering Revenue Oriented • Revenue Goals • Monetizing Data
  • 9. Chief Data Organization (Oversight) Vertical Business Area [Sales/Finance/Marketing/Operations/Customer Svc] Product Owner SCRUM Master Development Team Business Subject Matter Expertise Data Librarian/Data Stewardship Data Science/ Statistical Skills Data Engineering / Architecture Presentation/ BI Report Development Skills Data Quality Assurance DevOps IT Organization (Oversight) Enterprise Data Architect Solution Engineers Data Integration Practice User Experience Practice QA Practice Operations Practice Advanced Analytics Business Analysts Data Analysts Data Scientists Statisticians Data Engineers Planning Organization Project Managers Data Organization Data Gov Coordinator Data Librarians Data Stewards It Takes a Village!
  • 11. Global economics Intensity of competition Reduce costs Move to cross-functional teams New executive leadership Speed of technical change Social trends and changes Period of time in present role Status & perks of office/dept under threat No apparent reasons for proposed changes Lack of understanding of proposed changes Fear of inability to cope with new technology Concern over job security Forces for Change Forces Resisting Change Status Quo Moving the Status Quo http://www.change-management-coach.com/force-field-analysis.html
  • 12. The Data Lake Paradigm Technology: • Scalable distributed storage S3 • Pluggable fit-for-purpose processing EMR • Consistent extensible framework Spark • Dimensional Data Warehouse Redshift Functional Capabilities: • Remove barriers between data ingestion and analysis • Democratize Data with Just Enough Data Governance
  • 14. Why Spark? “Big Box” tools vs ROI? – Prohibitively expensive limited by licensing $$$ – Typically limited to the scalability of a single server We Spark! • Development local or distributed is identical • Beautiful high level API’s • Full universe of Python modules • Open source and Free • Blazing fast! • Databricks cloud makes it easier Spark has become our default processing engine for a myriad of engineering & science problems
  • 15. Ingest Raw Data Organize, Define, Complete Munging, Blending Machine Learning Data Quality and Monitoring Metadata, ILM , Security Data Catalog Data Integration Fully Governed ( trusted) Arbitrary/Ad-hoc Queries and Reporting Big Data Ware house Data Science Workspace Data Lake – Integrated Sandbox Landing Area – Source Data in “Full Fidelity” Usage Pattern Data Governance Metadata, ILM, Security Data Architecture is the new Data Model
  • 16. Data Lifecycle Ingest TransformAnalyze Output Understand Problem Ingest Data Explore and Understand Data Clean and Shape Data Evaluate Data Create and build Models Communic ate Results Deliver & Deploy Model Data Engineer Architect how data is organized & ensure operability Data Scientist Deep analytics and modeling for hidden insights Business Analyst Work with data to apply insights to business strategy App Developer Integrates data & insights with existing or new applications
  • 17. Type Comments Single Touch Rules-Based Statistically Driven Assign the credit to the first or last exposure Assign the credit to each interaction based on business rules Assign the credit to interactions based on data-driven model Ad-Click Mailing MailingE-mail E-mailAd-Click Ad-Click 100% 33% 33% 33% 27% 49% 24% - Last touch only - Ignores bulk of customer journey - Undervalues other interactions and influencers - Subjective - Assigns arbitrary values to each interaction - Lacks analytics rigor to determine weights ü Looks at full behavior patterns ü Consider all touch points ü Can apply different models for best results ü Use data to find correlations between touch points (winning combinations) Path-to-Purchase Methods
  • 18. Unifying the Customer Across Channels Customer Data Integration (CDI): Match and manage customer information from all available sources Marketing channels: DMP, Salesforce, Adobe, Social, Direct Mail, Call Center, CRM In other words… We need to figure out how to LINK people across systems!
  • 19. Mastering Master Data is Still MDM Standardize Match Survivorship Validate
  • 20. Standardization and Matching Cleanse and Parse: • Names • Resolve nicknames • Create deterministic hash, phonetic representation • Addresses • Emails • Phone Numbers Matching: Join based on combinations of cleansed and standardized data to create match results: Spark map operations: • Data cleansing, transformation, and standardization – Address Parsing: usaddress, postal- address, etc – Name Hashing: fuzzy, etc – Genderization: sexmachine, etc
  • 21. Mastering Unmanageable Source Data Reveal • Wait for the customer to “reveal” themselves • Create link between anonymous self and known profile Vector • May need behavioral statistical profiling • Compare use vectors Rebuild • Recluster all prior activities • Rebuild the Graph
  • 22. The Matching Process The matching process output gives us the relationships between customers: Great, but it’s not very useable, you need to traverse the dataset to find out 1234 and 1235 are the same person (and this is a trivial case) And we need to cluster and identify our survivors (vertex) xid yid match_type 1234 4849 phone 4849 5499 email 5499 1235 address 4849 7788 cookie 5499 7788 cookie 4849 1234 phone
  • 23. Graph to the Rescue 1234 4849 5499 7788 We just need to import our edges into a graph and “dump” out communities Don’t think table… think Graph! These matches are actually communities 1235
  • 24. Connected Components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex This lowest number vertex can serve as our “survivor” (not field survivorship) Connected Components xid yid 1234 4849 1234 5499 1234 1235 1234 7788 1234 7788 1234 1234
  • 27. The Notebook is the ETL Tool
  • 28. The Notebook is the Data Science Tool
  • 29. The Redshift DW is still Dimensional
  • 30. Use Graph for Data Lineage
  • 31. The Goal: Top Conversion Paths Source: Accelerom AG, Zurich
  • 32. • Data Lake: S3 / NoSQL or SQL as Needed • Identity Resolution: Spark / Graph Frames • ETL: Spark / Airflow • Path-to-Purchase: S3 / Spark / MLlib • Data Warehouse: RedShift • Shared Interface: Notebooks (Jupyter or Databricks) • Business Intelligence: Tableau Recap
  • 33. Joe Caserta President, Caserta Concepts joe@casertaconcepts.com @joe_Caserta • Award-winning company • Transformative Data Strategies • Modern Data Engineering • Advanced Architecture • Innovation Partner • Strategic Consulting • Advanced Technical Design • Build & Deploy Solutions • BDW Meetup • New York City • 3,000+ members • Knowledge sharing Data is not important, it’s what you do with it that’s important! Thank You