This session covers how to capture and analyze customer behavior to create more relevant contexts for customers. We will cover how to use your current BI features and, more importantly, how newer technologies approach the challenge. You will walk away with a good idea of how to build and deliver more contextually relevant customer experiences for even more successful engagements.
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Personalization
1. Leveraging Customer Data to Enhance Relevancy in Personalization
“Using Apache Data Processing Projects on top of MongoDB”
Marc Schwering
Sr. Solution Architect – EMEA
marc@mongodb.com
@m4rcsch
2. 2
Big Data Analytics Track
1. Driving Personalized Experiences Using Customer Profiles
2. Leveraging Data to Enhance Relevancy in Personalization
3. Machine Learning to Engage the Customer, with Apache Spark, IBM Watson, and MongoDB
3. 3
Agenda For This Session
• Personalization Process Review
• The Life of an Application
• Separation of Concerns / Real World Architecture
• Apache Spark and Flink Data Processing Projects
• Clustering with Apache Flink
• Next Steps
4. 4
High Level Personalization Process
1. Profile created
2. Enrich with public data
3. Capture activity
4. Clustering analysis
5. Define Personas
6. Tag with personas (see the sketch below)
7. Personalize interactions
Batch analytics
Public data
Common technologies:
• R
• Hadoop
• Spark
• Python
• Java
• Many other options
Personas change much less often than tagging
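As a minimal sketch of step 6, assuming the MongoDB Java driver used from Scala with hypothetical database, collection, user id, and field names, tagging a profile with personas could look like this:

  import com.mongodb.MongoClient
  import org.bson.Document
  import java.util.Arrays

  object TagPersonas {
    def main(args: Array[String]): Unit = {
      val client = new MongoClient("localhost", 27017)
      // Database, collection, user id, and field names are hypothetical.
      val profiles = client.getDatabase("shop").getCollection("profiles")
      // Attach the persona vector produced by the batch analytics job.
      val update = new Document("$set",
        new Document("personaVectors", Arrays.asList(0.7, 0.1, 0.2)))
      profiles.updateOne(new Document("_id", "user123"), update)
      client.close()
    }
  }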
7. 7
One size/document fits all?
• Profile Data
– Preferences
– Personal information
• Contact information
• DOB, gender, ZIP...
• Customer Data
– Purchase History
– Marketing History
• "Session Data"
– View History
– Shopping Cart Data
– Information Broker Data
• Personalization Data
– Persona Vectors
– Product and Category recommendations
Application
Batch analytics
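For illustration, a "one document fits all" profile might look like the sketch below; every field name here is an assumption, not from the deck:

  import org.bson.Document
  import java.util.Arrays

  // Hypothetical all-in-one profile: every concern lives in a single document.
  val profile = new Document("_id", "user123")
    .append("preferences", new Document("newsletter", true))
    .append("personal", new Document("dob", "1980-01-01")
      .append("gender", "f").append("zip", "10115"))
    .append("purchaseHistory", Arrays.asList(new Document("sku", "A-42")))
    .append("marketingHistory", Arrays.asList(new Document("campaign", "spring-sale")))
    .append("viewHistory", Arrays.asList("A-42", "B-7"))     // session data
    .append("cart", Arrays.asList(new Document("sku", "B-7").append("qty", 1)))
    .append("personaVectors", Arrays.asList(0.7, 0.1, 0.2))  // personalization data

Every read and write touches this one document, which is exactly what the separation on the next slides avoids.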
8. 8
Separation of Concerns
• Profile Data
– Preferences
– Personal information
• Contact information
• DOB, gender, ZIP...
• Customer Data
– Purchase History
– Marketing History
• "Session Data"
– View History
– Shopping Cart Data
– Information Broker Data
• Personalization Data
– Persona Vectors
– Product and Category recommendations
Batch analytics layer
Frontend system: Profile Service, Customer Service, Session Service, Persona Service
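A sketch of the separated layout, assuming one collection per service (all names are hypothetical):

  import com.mongodb.MongoClient

  val client = new MongoClient("localhost", 27017)
  val db = client.getDatabase("shop")
  // Each service reads and writes only its own collection.
  val profiles  = db.getCollection("profiles")   // Profile Service
  val customers = db.getCollection("customers")  // Customer Service
  val sessions  = db.getCollection("sessions")   // Session Service
  val personas  = db.getCollection("personas")   // Persona Service / batch analytics layer

Each service can then evolve its schema, team, and language independently.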
9. 9
Benefits
• Code does less; documents and code stay focused
• Splittability
– Different Teams
– New Languages
– Defined Dependencies
10. 10
Result
• Code does less; documents and code stay focused
• Splittability
– Different Teams
– New Languages
– Defined Dependencies
KISS
=> Keep it simple, stupid!
=> Clean Code <=
• Robert C. Martin: https://cleancoders.com/
• M. Fowler / B. Meyer et al.: Command Query Separation
17. 17
Hadoop in a Nutshell
• An open-source framework for distributed storage and distributed, batch-oriented processing
• Hadoop Distributed File System (HDFS) to store data on
commodity hardware
• Yarn as resource management platform
• MapReduce as programming model working on top of HDFS
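As a small sketch of the HDFS side, assuming a hypothetical path, reading a file with the Hadoop FileSystem API looks like this:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import scala.io.Source

  val conf = new Configuration()                  // picks up core-site.xml / hdfs-site.xml
  val fs = FileSystem.get(conf)
  val in = fs.open(new Path("/data/events.log"))  // hypothetical HDFS path
  Source.fromInputStream(in).getLines().take(5).foreach(println)
  in.close()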
18. 18
Spark in a Nutshell
• Spark is a top-level Apache project
• Can be run on top of YARN and can read any
Hadoop API data, including HDFS or MongoDB
• Fast and general engine for large-scale data processing and
analytics
• Advanced DAG execution engine with support for data locality
and in-memory computing
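A minimal sketch of reading a MongoDB collection into a Spark RDD through the mongo-hadoop connector; the URI, collection, and field name are assumptions:

  import org.apache.hadoop.conf.Configuration
  import org.apache.spark.{SparkConf, SparkContext}
  import org.bson.BSONObject
  import com.mongodb.hadoop.MongoInputFormat

  val sc = new SparkContext(new SparkConf().setAppName("MongoSparkExample"))
  val mongoConf = new Configuration()
  mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/shop.sessions")
  val docs = sc.newAPIHadoopRDD(mongoConf,
    classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])
  // Count views per product across all session documents.
  val views = docs.map { case (_, doc) => (String.valueOf(doc.get("productId")), 1) }
    .reduceByKey(_ + _)
  views.take(10).foreach(println)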
19. 19
Flink in a Nutshell
• Flink is a top-level Apache project
• Can be run on top of YARN and can read any
Hadoop API data, including HDFS or MongoDB
• A distributed streaming dataflow engine
• Streaming and batch
• Iterative in-memory execution and handling
• Cost-based optimizer
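The same read sketched in Flink, following the pattern of the flink-mongodb-example repository linked under Next Steps; the URI and field name remain assumptions:

  import org.apache.flink.api.scala._
  import org.apache.flink.api.scala.hadoop.mapreduce.HadoopInputFormat
  import org.apache.hadoop.mapreduce.Job
  import org.bson.BSONObject
  import com.mongodb.hadoop.MongoInputFormat

  val env = ExecutionEnvironment.getExecutionEnvironment
  val job = Job.getInstance()
  job.getConfiguration.set("mongo.input.uri", "mongodb://localhost:27017/shop.sessions")
  val mongoFormat = new HadoopInputFormat[Object, BSONObject](
    new MongoInputFormat(), classOf[Object], classOf[BSONObject], job)
  // Each element is a (key, BSONObject) pair, one per document.
  val docs = env.createInput(mongoFormat)
  docs.map(pair => (String.valueOf(pair._2.get("productId")), 1))
    .groupBy(0).sum(1)
    .print()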
20. 20
Latency of query operations
(Figure: workloads ordered by latency, from queries and aggregation to MapReduce and cluster algorithms, and where MongoDB, Hadoop, and Spark/Flink each fit.)
25. 25
Iterations in Flink
• Dedicated iteration operators
• Tasks keep running for the iterations, not redeployed for each step
• Caching and optimizations done automatically
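For instance, a bulk iteration keeps the loop inside the dataflow; this sketch, adapted from the pattern in the Flink documentation, estimates pi over 10,000 iterations:

  import org.apache.flink.api.scala._

  val env = ExecutionEnvironment.getExecutionEnvironment
  val initial = env.fromElements(0)
  // The step function is applied 10000 times without redeploying tasks.
  val count = initial.iterate(10000) { iterationInput =>
    iterationInput.map { i =>
      val x = Math.random()
      val y = Math.random()
      i + (if (x * x + y * y < 1) 1 else 0)
    }
  }
  val pi = count.map(c => c / 10000.0 * 4)
  pi.print()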
29. 29
Takeaways
• Stay focused => Start and stay small
– Evaluate with big documents, but do a PoC focused on the topic
• Extending functionality is easy
– Aggregation, MapReduce (see the sketch after this list)
– The Hadoop connector opens up a new variety of use cases
• Extending functionality can be challenging
– Evolution is outpacing help channels
– A lot of options (Spark, Flink, Storm, Hadoop, …)
– More than just a binary
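For the aggregation point above, a minimal sketch with the MongoDB Java driver from Scala; the collection and field names are hypothetical:

  import com.mongodb.MongoClient
  import org.bson.Document
  import java.util.Arrays
  import scala.collection.JavaConverters._

  val coll = new MongoClient("localhost", 27017)
    .getDatabase("shop").getCollection("sessions")
  // $group counts views per product; $sort ranks the products.
  val pipeline = Arrays.asList(
    new Document("$group", new Document("_id", "$productId")
      .append("views", new Document("$sum", 1))),
    new Document("$sort", new Document("views", -1)))
  coll.aggregate(pipeline).iterator().asScala.foreach(println)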
30. 30
Next Steps
• Next Session => Hands-on Spark and Watson content!
– "Machine Learning to Engage the Customer, with Apache Spark, IBM Watson, and MongoDB"
– RDD Examples
– RDD Examples
• Try out Spark and Flink
– http://bit.ly/MongoDB_Hadoop_Spark_Webinar
– http://flink.apache.org/
– https://github.com/mongodb/mongo-hadoop
– https://github.com/m4rcsch/flink-mongodb-example
• Participate and ask Questions!
– @m4rcsch
– marc@mongodb.com
Personalization Process Review (What We Heard)
Access Pattern and Development Cycle
Separation of Concerns (MongoDB Point of View)
Event counts, and therefore personas, are very helpful.
A good problem to have is too much information to personalize with – start simple, measure, and add.
Profile: show logical document parts
Frontend caching system like Varnish
KISS => Keep it simple, stupid!
Hadoop: great for big data that is partitionable
Spark: MapReduce iterations are fast
Amongst Hadoop and others these ar...
In a distributed system, a conventional program would not work, as the data is split across nodes. A DAG (Directed Acyclic Graph) is a programming style for distributed systems; you can think of it as an alternative to MapReduce. While MR has just two steps (map and reduce), a DAG can have multiple levels that can form a tree structure. Say you want to execute a SQL query: a DAG is more flexible, with more functions like map, filter, union, etc. DAG execution is also faster, as in the case of Apache Tez, which succeeds MR, because intermediate results are not written to disk.
Coming to Spark, the main concept is the RDD (Resilient Distributed Dataset). To understand the Spark architecture, it is best to read the Berkeley paper.
In brief, RDDs are distributed data sets that can stay in memory and fall back to disk gracefully. If lost, RDDs can easily be rebuilt using a graph that says how to reconstruct them. RDDs are great if you want to keep holding a data set in memory and fire a series of queries; this works better than fetching data from disk every time. Another important RDD concept is that there are two types of things that can be done on an RDD: 1) transformations, like map and filter, that result in another RDD; 2) actions, like count, that result in an output. A Spark job comprises a DAG of tasks executing transformations and actions on RDDs.
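A short sketch of that transformation/action split, assuming an existing SparkContext sc and a made-up log path:

  val lines = sc.textFile("hdfs:///data/app.log")     // base RDD
  val errors = lines.filter(_.contains("ERROR"))      // transformation: lazy, returns a new RDD
  errors.cache()                                      // keep this RDD in memory for reuse
  val total = errors.count()                          // action: triggers execution, returns a value
  val byCode = errors.map(l => (l.split(" ")(1), 1))  // more transformations...
    .reduceByKey(_ + _)
  byCode.take(5).foreach(println)                     // ...until an action runs the DAG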
Cluster Algorithms…
No black box: logic and hooks stay visible.
K-means explained; more complex themes also explained.
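As a reference for k-means itself, one iteration in plain Scala: assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster (the data is made up for the sketch):

  type Point = (Double, Double)

  def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1; val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // One k-means step: assignment, then centroid update.
  def kMeansStep(points: Seq[Point], centroids: Seq[Point]): Seq[Point] =
    points
      .groupBy(p => centroids.minBy(c => dist2(p, c)))
      .values
      .map(cl => (cl.map(_._1).sum / cl.size, cl.map(_._2).sum / cl.size))
      .toSeq

  val points = Seq((1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5))
  var centroids: Seq[Point] = Seq((0.0, 0.0), (5.0, 5.0))
  for (_ <- 1 to 10) centroids = kMeansStep(points, centroids)
  println(centroids)  // two centroids near the two clusters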
Don't buy in too early. Solve real problems; choose the right tool.
RDD and/or clustering jobs are a natural fit.
Stay operational and low-latency focused.