© 2014 MapR Technologies 1
© MapR Technologies, confidential
Hadoop Summit 2014
Which Algorithms Really Matter?
Me, Us
• Ted Dunning, Chief Application Architect, MapR
Committer and PMC member: Mahout, ZooKeeper, Drill
Bought the beer at the first HUG
• MapR
Distributes more open source components for Hadoop
Adds major technology for performance, HA, industry standard APIs
• Info
Hash tag - #mapr
See also - @ApacheMahout @ApacheDrill
@ted_dunning and @mapR
Topic For Today
• What is important? What is not?
• Why?
• What is the difference from academic research?
• Some examples
What is Important?
• Deployable
– Clever prototypes don’t count if they can’t be standardized
• Robust
– Mishandling is common
• Transparent
– Will degradation be obvious?
• Skillset and mindset matched?
– How long will your fancy data scientist enjoy doing standard ops tasks?
• Proportionate
– Where is the highest value per minute of effort?
Academic Goals vs Pragmatics
• Academic goals
– Reproducible
– Isolate theoretically important aspects
– Work on novel problems
• Pragmatics
– Highest net value
– Available data is constantly changing
– Diligence and consistency have larger impact than cleverness
– Many systems feed themselves, so exploration and exploitation are both important
– Engineering constraints on budget and schedule
Example 1:
Making Recommendations Better
Recommendation Advances
• What are the most important algorithmic advances in
recommendations over the last 10 years?
• Cooccurrence analysis?
• Matrix completion via factorization?
• Latent factor log-linear models?
• Temporal dynamics?
The Winner – None of the Above
• What are the most important algorithmic advances in
recommendations over the last 10 years?
1. Result dithering
2. Anti-flood
The Real Issues
• Exploration
• Diversity
• Speed
• Not the last fraction of a percent
Result Dithering
• Dithering is used to re-order recommendation results
– Re-ordering is done randomly
• Dithering is guaranteed to make off-line performance worse
• Dithering also has a near-perfect record of making actual performance much better
“Made more difference than any other change”
Simple Dithering Algorithm
• Generate synthetic score from log rank plus Gaussian: $s = \log r + N(0, \varepsilon)$
• Pick noise scale to provide desired level of mixing: $\Delta r \propto r \exp \varepsilon$
• Typically $\varepsilon \in [0.4, 0.8]$
• Oh… use floor(t/T) as seed
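The dithering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `dither` and `time_seed` are hypothetical, ε is treated as the standard deviation of the Gaussian noise (the slide does not say whether it is a standard deviation or a variance), and results are sorted so that the smallest synthetic score comes first.

```python
import math
import random


def time_seed(t, period):
    """floor(t / T): the seed, and so the ordering, is fixed within each window of length T."""
    return int(t // period)


def dither(ranked_items, epsilon=0.5, seed=None):
    """Re-order ranked_items using the synthetic score log(rank) + Gaussian noise."""
    rng = random.Random(seed)
    scored = []
    for rank, item in enumerate(ranked_items, start=1):
        # Assumption: epsilon is the standard deviation of the noise.
        score = math.log(rank) + rng.gauss(0.0, epsilon)
        scored.append((score, rank, item))
    scored.sort()  # a smaller synthetic score lands nearer the top
    return [item for _, _, item in scored]


# Hypothetical usage: the same user sees a stable ordering for 10 minutes.
results = list(range(1, 51))
page = dither(results, epsilon=0.5, seed=time_seed(1000.0, 600))
```

Because the noise is added to the log of the rank, the top few results move only a little while deeper results can jump a long way, which is what lets items from the second page surface occasionally.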
Example … ε = 0.5
1 2 6 5 3 4 13 16
1 2 3 8 5 7 6 34
1 4 3 2 6 7 11 10
1 2 4 3 15 7 13 19
1 6 2 3 4 16 9 5
1 2 3 5 24 7 17 13
1 2 3 4 6 12 5 14
2 1 3 5 7 6 4 17
4 1 2 7 3 9 8 5
2 1 5 3 4 7 13 6
3 1 5 4 2 7 8 6
2 1 3 4 7 12 17 16
Example … ε = log 2 = 0.69
1 2 8 3 9 15 7 6
1 8 14 15 3 2 22 10
1 3 8 2 10 5 7 4
1 2 10 7 3 8 6 14
1 5 33 15 2 9 11 29
1 2 7 3 5 4 19 6
1 3 5 23 9 7 4 2
2 4 11 8 3 1 44 9
2 3 1 4 6 7 8 33
3 4 1 2 10 11 15 14
11 1 2 4 5 7 3 14
1 8 7 3 22 11 2 33
Exploring The Second Page
Lesson 1:
Exploration is good
Example 2:
Bayesian Bandits
Bayesian Bandits
• Based on Thompson sampling
• Very general sequential test
• Near optimal regret
• Trade-off exploration and exploitation
• Possibly best known solution for exploration/exploitation
• Incredibly simple
Thompson Sampling
• Select each shell according to the probability that it is the best
• Probability that it is the best can be computed using posterior
$$P(i \text{ is best}) = \int I\left[ E[r_i \mid \theta] = \max_j E[r_j \mid \theta] \right] P(\theta \mid D) \, d\theta$$
• But I promised a simple answer
Thompson Sampling – Take 2
• Sample θ: $\theta \sim P(\theta \mid D)$
• Pick i to maximize reward: $i = \arg\max_j E[r_j \mid \theta]$
• Record result from using i
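The three steps above map almost directly onto code. The sketch below assumes Bernoulli rewards (click or no click) with a Beta(1, 1) prior on each arm, which is a simpler model than the Gamma-Normal one named elsewhere in the talk; the class `BetaBandit` and the function `simulate` are hypothetical names used only for illustration.

```python
import random


class BetaBandit:
    """Thompson sampling for Bernoulli rewards with a Beta(1, 1) prior per arm."""

    def __init__(self, n_arms, seed=None):
        self.wins = [0] * n_arms
        self.losses = [0] * n_arms
        self.rng = random.Random(seed)

    def choose(self):
        # Sample theta ~ P(theta | D): one draw per arm from its Beta posterior.
        theta = [self.rng.betavariate(1 + w, 1 + l)
                 for w, l in zip(self.wins, self.losses)]
        # For Bernoulli arms E[r_j | theta] = theta_j, so pick the argmax.
        return max(range(len(theta)), key=theta.__getitem__)

    def record(self, arm, reward):
        # Record the result from using the chosen arm.
        if reward:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1


def simulate(true_rates, steps, seed=0):
    """Run the bandit against simulated arms and count pulls per arm."""
    bandit = BetaBandit(len(true_rates), seed=seed)
    env = random.Random(seed + 1)
    pulls = [0] * len(true_rates)
    for _ in range(steps):
        arm = bandit.choose()
        bandit.record(arm, env.random() < true_rates[arm])
        pulls[arm] += 1
    return pulls
```

Sampling from the posterior and then acting greedily on the sample selects each arm with the probability that it is the best, without ever evaluating the integral.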
Fast Convergence
[Figure: regret versus n, for n from 0 to 1000, comparing ε-greedy (ε = 0.05) with a Bayesian Bandit with Gamma-Normal]
Thompson Sampling on Ads
An Empirical Evaluation of Thompson Sampling - Chapelle and Li, 2011
Bayesian Bandits versus Result Dithering
• Many useful systems are difficult to frame in fully Bayesian form
• Thompson sampling cannot be applied without posterior sampling
• Can still do useful exploration with dithering
• But better to use Thompson sampling if possible
Lesson 2:
Exploration is pretty easy to do and pays big benefits.
Example 3:
On-line Clustering
The Problem
• K-means clustering is useful for feature extraction or compression
• At scale and at high dimension, the desirable number of clusters increases
• A very large number of clusters may require more passes through the data
• Super-linear scaling is generally infeasible
The Solution
• Sketch-based algorithms produce a sketch of the data
• Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution
• The size of the sketch grows very slowly with increasing data size
• Many operations such as clustering are well behaved on sketches
Fast and Accurate k-means For Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson.
Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis, Michael Jordan.
An Example
The Cluster Proximity Features
• Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing number of clusters
• Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
• Or we can increase the number of clusters (n-fold increase adds log n bits per point, decreases error by sqrt(n))
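As a rough illustration of these features, the sketch below computes the nearest-cluster index, the two-nearest-cluster description, and the number of bits needed for a cluster index. The helper names `nearest_two` and `index_bits` are hypothetical, and the figure of 20 clusters is an inference (log2 20 ≈ 4.3), not something the slide states.

```python
import math


def nearest_two(point, centroids):
    """Return ((index, distance), (index, distance)) for the two nearest centroids."""
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return (i1, d1), (i2, d2)


def index_bits(n_clusters):
    """Bits needed to name one of n_clusters clusters."""
    return math.log2(n_clusters)


# Hypothetical centroids; a real sketch would supply many more.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
first, second = nearest_two((1.0, 0.0), centroids)
```

Keeping only `first[0]` gives the nearest-cluster code; keeping both indices and both distances gives the richer two-cluster description.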
Diagonalized Cluster Proximity
Lots of Clusters Are Fine
Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd's algorithm
Result is that these two clusters get glued together
Streaming k-means Ideas
• By using a sketch with lots (k log N) of centroids, we avoid pathological cases
• We still get a very good result if the sketch is created
– in one pass
– with approximate search
• In fact, adaptive dp-means works just fine
• In the end, the sketch can be used for clustering or …
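A one-pass sketch builder in the spirit of these ideas might look like the following. It is a simplified illustration, not the algorithm from the cited papers or any library implementation: the function names, the `threshold` and `growth` parameters, and the rule of starting a new centroid with probability proportional to distance are all assumptions made for the example, and the nearest-centroid search here is exact rather than approximate.

```python
import math
import random


def _absorb(sketch, p, w, threshold, rng):
    """Merge weighted point p into its nearest centroid, or start a new centroid."""
    if sketch:
        j = min(range(len(sketch)), key=lambda i: math.dist(sketch[i][0], p))
        d = math.dist(sketch[j][0], p)
        # Far points are likely to start a new centroid; near points get merged.
        if rng.random() >= d / threshold:
            c, cw = sketch[j]
            total = cw + w
            merged = [(a * cw + b * w) / total for a, b in zip(c, p)]
            sketch[j] = [merged, total]
            return
    sketch.append([p, w])


def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    """One pass over points; returns a list of [centroid, weight] pairs."""
    if max_centroids < 1:
        raise ValueError("max_centroids must be at least 1")
    rng = random.Random(seed)
    sketch = []
    for p in points:
        _absorb(sketch, list(p), 1.0, threshold, rng)
        # Too many centroids: loosen the threshold and collapse the sketch.
        while len(sketch) > max_centroids:
            threshold *= growth
            old, sketch = sketch, []
            rng.shuffle(old)
            for c, w in old:
                _absorb(sketch, c, w, threshold, rng)
    return sketch
```

The resulting weighted centroids can then be handed to an ordinary in-memory k-means, since the sketch is far smaller than the original data. Merging by weighted average keeps the total weight and the overall mean of the data intact.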
Lesson 3:
Sketches make big data small.
Example 4:
Search Abuse
Recommendation
Alice got an apple and a puppy
Charles got a bicycle
Bob got an apple. What else would Bob like?
A puppy!
History Matrix: Users x Items
Alice
Bob
Charles
✔ ✔ ✔
✔ ✔
✔ ✔
Co-Occurrence Matrix: Items x Items
Use LLR test to turn co-occurrence into indicators of interesting co-occurrence
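One way to picture this step: count, for each pair of items, how many users have both, one, or neither, and score the resulting 2x2 table with the log-likelihood ratio (G²) statistic. The sketch below is a minimal illustration; the function names, the `threshold` parameter, and the extra rule of skipping pairs that never co-occur are assumptions for the example rather than details from the talk.

```python
import math
from itertools import combinations


def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) score for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def indicators(history, threshold):
    """history maps user -> set of items; returns item -> set of indicator items."""
    n = len(history)
    item_users = {}
    for user, items in history.items():
        for item in items:
            item_users.setdefault(item, set()).add(user)
    result = {item: set() for item in item_users}
    for a, b in combinations(sorted(item_users), 2):
        k11 = len(item_users[a] & item_users[b])  # users with both items
        k12 = len(item_users[a]) - k11            # a without b
        k21 = len(item_users[b]) - k11            # b without a
        k22 = n - k11 - k12 - k21                 # neither item
        if k11 > 0 and llr(k11, k12, k21, k22) > threshold:
            result[a].add(b)
            result[b].add(a)
    return result
```

Items that pass the threshold for a given item become the contents of that item's indicator field.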
Co-occurrence Binary Matrix
Indicator Matrix: Anomalous Co-Occurrence
Result: The marked row will be added to the indicator field in the item document…
Indicator Matrix
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)
That one row from the indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine.
Note: data for the indicator field is added directly to the metadata for a document in the Solr index. You don’t need to create a separate index for the indicators.
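To see why a search engine is enough to serve recommendations, here is a toy in-memory stand-in for the query: the user's recent item ids are matched against each document's indicator field, and documents are ranked by the number of matches. This is not Solr's API or its relevance scoring; `recommend` is a hypothetical function, and the `t1` document with title "apple" is invented for the example, since the slide only shows the `t4` document.

```python
def recommend(docs, history, limit=10):
    """Rank documents by how many of their indicators appear in the user's history."""
    seen = set(history)
    scored = []
    for doc in docs:
        overlap = len(seen & set(doc.get("indicators", ())))
        # Skip items the user already has and items with no matching indicator.
        if overlap and doc["id"] not in seen:
            scored.append((-overlap, doc["id"], doc["title"]))
    scored.sort()
    return [title for _, _, title in scored[:limit]]


docs = [
    # Hypothetical document: the slide does not show what t1 is.
    {"id": "t1", "title": "apple", "indicators": []},
    # Fields taken from the slide's example document.
    {"id": "t4", "title": "puppy", "indicators": ["t1"]},
]

suggestions = recommend(docs, history=["t1"])
```

In a real deployment the same idea is expressed as an ordinary search query against the indicator field, so the search engine's own ranking does the work.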
Internals of the Recommender Engine
Looking Inside LucidWorks
What to recommend if a new user listened to 2122: Fats Domino & 303: Beatles?
Recommendation is “1710: Chuck Berry”
Real-time recommendation query and results: Evaluation
Real-life example
Lesson 4:
Recursive search abuse pays
Search can implement recs
Which can implement search
Summary
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 

Último (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

How to Determine which Algorithms Really Matter

  • 1. © 2014 MapR Technologies 1 © MapR Technologies, confidential Hadoop Summit 2014 Which Algorithms Really Matter?
  • 2. © 2014 MapR Technologies 2 Me, Us • Ted Dunning, Chief Application Architect, MapR Committer PMC member, Mahout, Zookeeper, Drill Bought the beer at the first HUG • MapR Distributes more open source components for Hadoop Adds major technology for performance, HA, industry standard API’s • Info Hash tag - #mapr See also - @ApacheMahout @ApacheDrill @ted_dunning and @mapR
  • 3. © 2014 MapR Technologies 4 Topic For Today • What is important? What is not? • Why? • What is the difference from academic research? • Some examples
  • 4. © 2014 MapR Technologies 5 What is Important? • Deployable • Robust • Transparent • Skillset and mindset matched? • Proportionate
  • 5. © 2014 MapR Technologies 6 What is Important? • Deployable – Clever prototypes don’t count if they can’t be standardized • Robust • Transparent • Skillset and mindset matched? • Proportionate
  • 6. © 2014 MapR Technologies 7 What is Important? • Deployable – Clever prototypes don’t count • Robust – Mishandling is common • Transparent – Will degradation be obvious? • Skillset and mindset matched? • Proportionate
  • 7. © 2014 MapR Technologies 8 What is Important? • Deployable – Clever prototypes don’t count • Robust – Mishandling is common • Transparent – Will degradation be obvious? • Skillset and mindset matched? – How long will your fancy data scientist enjoy doing standard ops tasks? • Proportionate – Where is the highest value per minute of effort?
  • 8. © 2014 MapR Technologies 9 Academic Goals vs Pragmatics • Academic goals – Reproducible – Isolate theoretically important aspects – Work on novel problems • Pragmatics – Highest net value – Available data is constantly changing – Diligence and consistency have larger impact than cleverness – Many systems feed themselves, exploration and exploitation are both important – Engineering constraints on budget and schedule
  • 9. © 2014 MapR Technologies 10 Example 1: Making Recommendations Better
  • 10. © 2014 MapR Technologies 11 Recommendation Advances • What are the most important algorithmic advances in recommendations over the last 10 years? • Cooccurrence analysis? • Matrix completion via factorization? • Latent factor log-linear models? • Temporal dynamics?
  • 11. © 2014 MapR Technologies 12 The Winner – None of the Above • What are the most important algorithmic advances in recommendations over the last 10 years? 1. Result dithering 2. Anti-flood
  • 12. © 2014 MapR Technologies 13 The Real Issues • Exploration • Diversity • Speed • Not the last fraction of a percent
  • 13. © 2014 MapR Technologies 14 Result Dithering • Dithering is used to re-order recommendation results – Re-ordering is done randomly • Dithering is guaranteed to make off-line performance worse • Dithering also has a near perfect record of making actual performance much better
  • 14. © 2014 MapR Technologies 15 Result Dithering • Dithering is used to re-order recommendation results – Re-ordering is done randomly • Dithering is guaranteed to make off-line performance worse • Dithering also has a near perfect record of making actual performance much better “Made more difference than any other change”
  • 15. © 2014 MapR Technologies 16 Simple Dithering Algorithm • Generate synthetic score from log rank plus Gaussian: s = log r + N(0, ε) • Pick noise scale to provide desired level of mixing: Δr ∝ r exp ε • Typically ε ∈ [0.4, 0.8] • Oh… use floor(t/T) as seed
  • 16. © 2014 MapR Technologies 17 Example … ε = 0.5
      1 2 6 5 3 4 13 16
      1 2 3 8 5 7 6 34
      1 4 3 2 6 7 11 10
      1 2 4 3 15 7 13 19
      1 6 2 3 4 16 9 5
      1 2 3 5 24 7 17 13
      1 2 3 4 6 12 5 14
      2 1 3 5 7 6 4 17
      4 1 2 7 3 9 8 5
      2 1 5 3 4 7 13 6
      3 1 5 4 2 7 8 6
      2 1 3 4 7 12 17 16
  • 17. © 2014 MapR Technologies 18 Example … ε = log 2 = 0.69
      1 2 8 3 9 15 7 6
      1 8 14 15 3 2 22 10
      1 3 8 2 10 5 7 4
      1 2 10 7 3 8 6 14
      1 5 33 15 2 9 11 29
      1 2 7 3 5 4 19 6
      1 3 5 23 9 7 4 2
      2 4 11 8 3 1 44 9
      2 3 1 4 6 7 8 33
      3 4 1 2 10 11 15 14
      11 1 2 4 5 7 3 14
      1 8 7 3 22 11 2 33
  • 18. © 2014 MapR Technologies 19 Exploring The Second Page
  • 19. © 2014 MapR Technologies 20 Lesson 1: Exploration is good
  • 20. © 2014 MapR Technologies 21 Example 2: Bayesian Bandits
  • 21. © 2014 MapR Technologies 22 Bayesian Bandits • Based on Thompson sampling • Very general sequential test • Near optimal regret • Trade-off exploration and exploitation • Possibly best known solution for exploration/exploitation • Incredibly simple
  • 22. © 2014 MapR Technologies 23 Thompson Sampling • Select each shell according to the probability that it is the best • Probability that it is the best can be computed using posterior: P(i is best) = ∫ I[ E[r_i | θ] = max_j E[r_j | θ] ] P(θ | D) dθ • But I promised a simple answer
  • 23. © 2014 MapR Technologies 24 Thompson Sampling – Take 2 • Sample θ: θ ~ P(θ | D) • Pick i to maximize reward: i = argmax_j E[r_j | θ] • Record result from using i
  • 24. © 2014 MapR Technologies 25 Fast Convergence [Plot: regret (0 to 0.12) versus n (up to 1000) for ε-greedy with ε = 0.05 and for a Bayesian Bandit with Gamma-Normal]
  • 25. © 2014 MapR Technologies 26 Thompson Sampling on Ads An Empirical Evaluation of Thompson Sampling - Chapelle and Li, 2011
  • 26. © 2014 MapR Technologies 27 Bayesian Bandits versus Result Dithering • Many useful systems are difficult to frame in fully Bayesian form • Thompson sampling cannot be applied without posterior sampling • Can still do useful exploration with dithering • But better to use Thompson sampling if possible
  • 27. © 2014 MapR Technologies 28 Lesson 2: Exploration is pretty easy to do and pays big benefits.
  • 28. © 2014 MapR Technologies 29 Example 3: On-line Clustering
  • 29. © 2014 MapR Technologies 30 The Problem • K-means clustering is useful for feature extraction or compression • At scale and at high dimension, the desirable number of clusters increases • Very large number of clusters may require more passes through the data • Super-linear scaling is generally infeasible
  • 30. © 2014 MapR Technologies 31 The Solution • Sketch-based algorithms produce a sketch of the data • Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution • The size of the sketch grows very slowly with increasing data size • Many operations such as clustering are well behaved on sketches • Fast and Accurate k-means For Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson. • Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis, Michael Jordan.
  • 31. © 2014 MapR Technologies 32 An Example
  • 32. © 2014 MapR Technologies 33 An Example
  • 33. © 2014 MapR Technologies 34 The Cluster Proximity Features • Every point can be described by the nearest cluster – 4.3 bits per point in this case – Significant error that can be decreased (to a point) by increasing number of clusters • Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities) – Error is negligible – Unwinds the data into a simple representation • Or we can increase the number of clusters (an n-fold increase adds log n bits per point and decreases error by sqrt(n))
  • 34. © 2014 MapR Technologies 35 Diagonalized Cluster Proximity
  • 35. © 2014 MapR Technologies 36 Lots of Clusters Are Fine
  • 36. © 2014 MapR Technologies 37 Typical k-means Failure • Selecting two seeds here cannot be fixed with Lloyd’s algorithm • Result is that these two clusters get glued together
  • 37. © 2014 MapR Technologies 38 Streaming k-means Ideas • By using a sketch with lots (k log N) of centroids, we avoid pathological cases • We still get a very good result if the sketch is created – in one pass – with approximate search • In fact, adaptive dp-means works just fine • In the end, the sketch can be used for clustering or …
  • 38. © 2014 MapR Technologies 39 Lesson 3: Sketches make big data small.
  • 39. © 2014 MapR Technologies 40 Example 4: Search Abuse
  • 40. © 2014 MapR Technologies 41 Recommendation • Alice got an apple and a puppy • Charles got a bicycle • Bob got an apple
  • 41. © 2014 MapR Technologies 42 Recommendation • Alice got an apple and a puppy • Charles got a bicycle • Bob got an apple. What else would Bob like?
  • 42. © 2014 MapR Technologies 43 Recommendation • Alice got an apple and a puppy • Charles got a bicycle • Bob: A puppy!
  • 43. © 2014 MapR Technologies 44 History Matrix: Users x Items Alice Bob Charles ✔ ✔ ✔ ✔ ✔ ✔ ✔
  • 44. © 2014 MapR Technologies 45 Co-Occurrence Matrix: Items x Items - 1 2 1 1 1 1 2 1 0 0 0 0 • Use LLR test to turn co-occurrence into indicators of interesting co-occurrence
  • 45. © 2014 MapR Technologies 46 Indicator Matrix: Anomalous Co-Occurrence ✔ ✔
  • 46. © 2014 MapR Technologies 47 Co-occurrence Binary Matrix 1 1not not 1
  • 47. © 2014 MapR Technologies 48 Indicator Matrix: Anomalous Co-Occurrence ✔ ✔ Result: The marked row will be added to the indicator field in the item document…
  • 48. © 2014 MapR Technologies 49 Indicator Matrix ✔ id: t4 title: puppy desc: The sweetest little puppy ever. keywords: puppy, dog, pet indicators: (t1) That one row from indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine. Note: data for the indicator field is added directly to meta-data for a document in Solr index. You don’t need to create a separate index for the indicators.
  • 49. © 2014 MapR Technologies 50 Internals of the Recommender Engine
  • 50. © 2014 MapR Technologies 51 Internals of the Recommender Engine
  • 51. © 2014 MapR Technologies 52 Looking Inside LucidWorks • Real-time recommendation query and results: Evaluation • What to recommend if new user listened to 2122: Fats Domino & 303: Beatles? • Recommendation is “1710: Chuck Berry”
  • 52. © 2014 MapR Technologies 53 Real-life example
  • 53. © 2014 MapR Technologies 54 Lesson 4: Recursive search abuse pays • Search can implement recs • Which can implement search
  • 54. © 2014 MapR Technologies 55 Summary
  • 55. © 2014 MapR Technologies 56
  • 56. © 2014 MapR Technologies 57 Me, Us • Ted Dunning, Chief Application Architect, MapR Committer PMC member, Mahout, Zookeeper, Drill Bought the beer at the first HUG • MapR Distributes more open source components for Hadoop Adds major technology for performance, HA, industry standard API’s • Info Hash tag - #mapr See also - @ApacheMahout @ApacheDrill @ted_dunning and @mapR
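
The dithering recipe on slide 15 (synthetic score s = log r + N(0, ε), seeded by floor(t/T)) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function name `dither`, the `period` and `user_key` parameters, and the choice to treat ε as the standard deviation of the Gaussian are all assumptions made here.

```python
import math
import random


def dither(ranked_items, epsilon=0.5, t=0.0, period=600.0, user_key=""):
    """Re-order ranked results using the score s = log(rank) + N(0, epsilon).

    ranked_items: results in their original order, best first.
    epsilon: noise scale (assumed here to be a standard deviation).
    t, period: time and window length; floor(t / period) is used as the seed.
    user_key: hypothetical extra seed component so users get different mixes.
    """
    # Seed from the time window so the order changes only once per period.
    rng = random.Random(f"{user_key}:{math.floor(t / period)}")
    scored = [
        (math.log(rank) + rng.gauss(0.0, epsilon), item)
        for rank, item in enumerate(ranked_items, start=1)
    ]
    # Lowest synthetic score is shown first.
    scored.sort(key=lambda pair: pair[0])
    return [item for _, item in scored]


if __name__ == "__main__":
    original = list(range(1, 51))
    print(dither(original, epsilon=0.5, t=1234.0)[:8])
```

With ε = 0 the original order comes back unchanged, and larger values pull more deep results onto the first page. Seeding from the time window presumably keeps the order stable while a user pages through results within one window.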

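The two-step recipe on slide 23 (sample θ from the posterior, then pick the option that maximizes expected reward under that sample) can be sketched for the simplest case. The sketch below assumes binary rewards with an independent Beta posterior per arm, a model swapped in here for brevity; the convergence plot in the slides uses a Gamma-Normal model instead. The class name `BetaBernoulliBandit` and the `simulate` helper are hypothetical.

```python
import random


class BetaBernoulliBandit:
    """Thompson sampling for binary rewards (assumed model: Beta-Bernoulli)."""

    def __init__(self, n_arms, seed=None):
        # Beta(1, 1) is a uniform prior on each arm's success rate.
        self.alpha = [1.0] * n_arms
        self.beta = [1.0] * n_arms
        self.rng = random.Random(seed)

    def choose(self):
        # Sample theta ~ P(theta | D), one draw per arm.
        theta = [
            self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)
        ]
        # For Bernoulli rewards E[r_j | theta] = theta_j, so take the argmax.
        return max(range(len(theta)), key=theta.__getitem__)

    def update(self, arm, reward):
        # Record the result: a success or a failure for the chosen arm.
        if reward:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against fixed success rates; return pulls per arm."""
    bandit = BetaBernoulliBandit(len(true_rates), seed=seed)
    env = random.Random(seed + 1)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = bandit.choose()
        bandit.update(arm, env.random() < true_rates[arm])
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(simulate([0.2, 0.5, 0.8], rounds=2000))
```

Exploration needs no separate tuning knob in this sketch: arms with little data have wide posteriors and so still win the draw occasionally.
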
Editor's notes

  1. TED: consider using the word “interesting” instead of “anomalous”… people may think you are talking about anomaly detection…
  2. Old joke: all the world can be divided into 2 categories: Scotch tape and non-Scotch tape… This is a way to think about the co-occurrence.
  3. The only important co-occurrence is that puppy follows apple.
  4. Take that row of the matrix and combine it with all the meta-data we might have. The important thing to get from the co-occurrence matrix is this indicator. Cool thing: this is analogous to what a lot of recommendation engines do. This row forms the indicator field in a Solr document containing meta-data (you do NOT have to build a separate index for the indicators). Find the useful co-occurrence and get rid of the rest: sparsify and keep the anomalous co-occurrence.
  5. Note to trainer: take a little time to explore this here and on the next couple of slides. Details are enlarged on the next slide.
  6. This indicator field is where the output of the Mahout recommendation engine is stored (the row from the indicator matrix that identified significant or interesting co-occurrence). Keep in mind that this recommendation indicator data is added to the same original document in the Solr index that contains meta-data for the item in question.
  7. This is a diagnostics window in the LucidWorks Solr index (not the web interface a user would see). It’s a way for the developer to do a rough evaluation (laugh test) of the choices offered by the recommendation engine. In other words, do these indicator artists, represented by their indicator IDs, make reasonable recommendations? Note to trainer: artist 303 happens to be The Beatles. Is that a good match for Chuck Berry?
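
Slide 44 says to use the LLR test to turn raw co-occurrence into indicators of interesting co-occurrence. A minimal sketch of that log-likelihood ratio score for a 2 × 2 table of counts follows, written in the entropy form of the G² statistic. The function names and the `histories` layout are assumptions for illustration, and the toy data mirrors the Alice/Bob/Charles slides; this is not the Mahout code itself.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) score for a 2 x 2 contingency table.

    k11: users with both items      k12: users with A but not B
    k21: users with B but not A     k22: users with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def cooccurrence_counts(histories, a, b):
    """Build the 2 x 2 table for items a and b from user -> set-of-items."""
    k11 = k12 = k21 = k22 = 0
    for items in histories.values():
        has_a, has_b = a in items, b in items
        if has_a and has_b:
            k11 += 1
        elif has_a:
            k12 += 1
        elif has_b:
            k21 += 1
        else:
            k22 += 1
    return k11, k12, k21, k22


# Toy history matrix from the slides.
histories = {
    "alice": {"apple", "puppy"},
    "bob": {"apple"},
    "charles": {"bicycle"},
}
counts = cooccurrence_counts(histories, "apple", "puppy")
score = llr(*counts)
```

Item pairs whose score exceeds a chosen threshold would be kept as indicators; the threshold is a tuning choice that the slides do not specify.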