© 2014 MapR Technologies
What’s New in Apache Mahout:
A Preview of Mahout 1.0
21 May 2014 Boulder/Denver Big Data Meet-up #BDBDM
Ted Dunning, Chief Applications Architect MapR Technologies
Twitter @Ted_Dunning
Email tdunning@mapr.com tdunning@apache.org
There was just an explosion in Apache Mahout…
Apache Mahout up to now…
• Open source Apache project http://mahout.apache.org/
• Current Mahout version is 0.9, released Feb 2014; it included Scala
– Summary 0.9 blog at http://bit.ly/1rirUUL
• Library of scalable algorithms for machine learning
– Some run on Apache Hadoop distributions; others do not require Hadoop
– Some can be run at small scale
– Some are run in parallel; others are sequential
• Includes the following main areas:
– Clustering & related techniques
– Classification
– Recommendation
– Mahout Math Library
Roadmap to Mahout 1.0
• Say good-bye to MapReduce
– New MR algorithms will not be accepted
– Support for existing ones will continue for now
• Support for Apache Spark
– Under construction; some features already available
• Support for H2O being explored
• Support for Apache Stratosphere possibly in the future
Roadmap: Apache Mahout 1.0
Apache Spark
• Apache Spark http://spark.apache.org/
– Open source “fast and general engine for large-scale data processing”
– Especially fast in memory
– Became a top-level Apache project
• Feb 2014
• http://spark.apache.org/
• over 100 committers
– Original developers have started a company called Databricks (Berkeley, CA) http://databricks.com/
Mahout and Scala
• Scala http://www.scala-lang.org/
– Open source; appeared in 2003
– Wiki describes it as an “object-functional programming and scripting language”
• Scala provides functional style
– Makes lazy evaluation much safer
– Notationally compact
– Minor syntax extensions allowed
– Makes math much easier
Here’s what DSL & Spark will mean for Mahout
• Scala DSL provides convenient notation for expressing parallel
machine learning
• Spark (and other engines) provide execution environment
• Overview of Scala and Apache Spark bindings in Mahout can be
found at
https://mahout.apache.org/users/sparkbindings/home.html
What do clusters, Cap’n Crunch and Cocoa Puffs have in common?
They’re part of the data in the new Mahout Spark shell tutorial…
And you shouldn’t be eating them.
Tutorial: Mahout-Spark Shell
• Find it here http://bit.ly/RSTeMr
• Early-stage code: play with Mahout's Scala DSL for linear algebra and the Mahout-Spark shell
– Uses publicly available breakfast cereal data set
– Challenge: Fit linear model that infers customer ratings from ingredients
– Toy data set, but loaded with Mahout to mimic a huge data set
• Mahout's linear algebra DSL has an abstraction called DistributedRowMatrix (DRM)
– Models a matrix that is partitioned by rows and stored in the memory of a cluster of machines
Dissecting the Model
• Components
– Cereal ingredients are the features
– Ratings are the target variable
• Linear regression assumes that target variable y is generated by
linear combination of feature matrix X with parameter vector β
plus the noise ε
y = Xβ + ε
• Goal: Find estimate of parameter vector β that explains data
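One standard way to obtain such an estimate is ordinary least squares, which is consistent with the XtX and Xty quantities computed in the Scala code later in the deck. The derivation below is a sketch under the usual assumption that XᵀX is invertible; the hat notation for the estimate is introduced here and does not appear on the slides.

```latex
\begin{aligned}
\hat{\beta} &= \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 \\
0 &= \left. \nabla_{\beta} \lVert y - X\beta \rVert_2^2 \right|_{\beta = \hat{\beta}}
   = -2\, X^{\top} \bigl( y - X\hat{\beta} \bigr) \\
X^{\top} X \, \hat{\beta} &= X^{\top} y \\
\hat{\beta} &= \bigl( X^{\top} X \bigr)^{-1} X^{\top} y
\end{aligned}
```

The third line is the linear system that a call such as solve(XtX, Xty) works on: XᵀX has only one row and column per feature, so it stays small even when X has a very large number of rows.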
What do you see in this matrix?
// columns: protein, fat, carbo, sugars, rating
val drmData = drmParallelize(dense(
(2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios
(1, 2, 12, 12, 18.042851), // Cap'n'Crunch
(1, 1, 12, 13, 22.736446), // Cocoa Puffs
(2, 1, 11, 13, 32.207582), // Froot Loops
(1, 2, 12, 11, 21.871292), // Honey Graham Ohs
(2, 1, 16, 8, 36.187559), // Wheaties Honey Gold
(6, 2, 17, 1, 50.764999), // Cheerios
(3, 2, 13, 7, 40.400208), // Clusters
(3, 3, 13, 4, 45.811716)), // Great Grains Pecan
numPartitions = 2);
Add Bias Column
// features: the first four columns of drmData
val drmX = drmData(::, 0 until 4)
val drmX1 = drmX.mapBlock(ncol = drmX.ncol + 1) {
  case (keys, block) =>
    // create a new block with an additional column
    val blockWithBiasColumn = block.like(block.nrow, block.ncol + 1)
    // copy data from current block into the new block
    blockWithBiasColumn(::, 0 until block.ncol) := block
    // last column consists of ones
    blockWithBiasColumn(::, block.ncol) := 1
    keys -> blockWithBiasColumn
}
Solve Linear System, Compute Error
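// target variable: the last column of drmData, collected in-core
val y = drmData.collect(::, 4)
val XtX = (drmX1.t %*% drmX1).collect
val Xty = (drmX1.t %*% y).collect(::, 0)
val beta = solve(XtX, Xty)
val fittedY = (drmX1 %*% beta).collect(::, 0)
val error = (y - fittedY).norm(2)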
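For readers without a Mahout-Spark shell at hand, the same steps (bias column, XᵀX, Xᵀy, solve, error norm) can be followed on a single machine. The sketch below is plain Python with no libraries and is not Mahout code; the helper names (transpose, matmul, solve, fit) are made up for illustration, and the data rows are copied from the matrix slide.

```python
# Single-machine sketch of the normal-equations fit; not Mahout code.
# Each row: four ingredient features followed by the rating.
DATA = [
    (2, 2, 10.5, 10, 29.509541),  # Apple Cinnamon Cheerios
    (1, 2, 12, 12, 18.042851),    # Cap'n'Crunch
    (1, 1, 12, 13, 22.736446),    # Cocoa Puffs
    (2, 1, 11, 13, 32.207582),    # Froot Loops
    (1, 2, 12, 11, 21.871292),    # Honey Graham Ohs
    (2, 1, 16, 8, 36.187559),     # Wheaties Honey Gold
    (6, 2, 17, 1, 50.764999),     # Cheerios
    (3, 2, 13, 7, 40.400208),     # Clusters
    (3, 3, 13, 4, 45.811716),     # Great Grains Pecan
]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit(data):
    """Return (beta, fitted values, 2-norm of the residual)."""
    x1 = [list(row[:-1]) + [1.0] for row in data]  # append bias column of ones
    y = [row[-1] for row in data]
    x1t = transpose(x1)
    xtx = matmul(x1t, x1)                                       # X'X
    xty = [sum(u * v for u, v in zip(col, y)) for col in x1t]   # X'y
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in x1]
    error = sum((u - f) ** 2 for u, f in zip(y, fitted)) ** 0.5
    return beta, fitted, error


beta, fitted, error = fit(DATA)
```

The last entry of beta is the intercept, because the column of ones was appended last, mirroring the bias column added with mapBlock.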
In R
all = matrix(
c(2, 2, 10.5, 10, 29.509541,
1, 2, 12, 12, 18.042851,
1, 1, 12, 13, 22.736446,
2, 1, 11, 13, 32.207582,
1, 2, 12, 11, 21.871292,
2, 1, 16, 8, 36.187559,
6, 2, 17, 1, 50.764999,
3, 2, 13, 7, 40.400208,
3, 3, 13, 4, 45.811716), byrow=T, ncol=5)
More R
# split the matrix defined above into features and target
a = all[, 1:4]
y = all[, 5]
a1 = cbind(a, 1)
ata = t(a1) %*% a1
aty = t(a1) %*% y
x1 = solve(a=ata, b=aty)
Well, Actually
df = data.frame(all)
m = lm(X5 ~ X1 + X2 + X3 + X4, df)
plot(df$X5, predict(m))
abline(lm(y ~ x, data.frame(x=df$X5, y=predict(m))), col='red')
R Wins
R Wins … For Now
R Wins … For Now … at Small Scale
Recommendation
Behavior of a crowd helps us understand what individuals will do
Recommendation
Alice got an apple and a puppy
Charles got a bicycle
Bob got an apple
Recommendation
Alice got an apple and a puppy
Charles got a bicycle
Bob got an apple. What else would Bob like?
Recommendation
Alice got an apple and a puppy
Charles got a bicycle
Bob: A puppy!
You get the idea of how recommenders work…
By the way, like me, Bob also wants a pony…
Recommendation
Users: Alice, Bob, Charles, Amelia (?)
What if everybody gets a pony?
What else would you recommend for new user Amelia?
Recommendation
Users: Alice, Bob, Charles, Amelia (?)
If everybody gets a pony, it’s not a very good indicator of what else to predict...
What we want is anomalous co-occurrence
Get Useful Indicators from Behaviors
• Use log files to build history matrix of users x items
– Remember: this history of interactions will be sparse compared to all potential combinations
• Transform to a co-occurrence matrix of items x items
• Look for useful co-occurrence by looking for anomalous co-occurrences to make an indicator matrix
– Log Likelihood Ratio (LLR) can be helpful to judge which co-occurrences can with confidence be used as indicators of preference
– ItemSimilarityJob in Apache Mahout uses LLR
• (the pony book said RowSimilarityJob, which is not as good)
Model uses three matrices…
© 2014 MapR Technologies 34
History Matrix: Users x Items
Alice: ✔ ✔ ✔
Bob: ✔ ✔
Charles: ✔ ✔
Co-Occurrence Matrix: Items x Items
[Figure: items x items matrix of co-occurrence counts (0, 1, or 2 per item pair); diagonal left blank]
Use LLR test to turn co-occurrence into indicators of interesting co-occurrence
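The step from history matrix to co-occurrence matrix can be sketched in a few lines. The example below is plain Python, not Mahout code. The item names are an assumption pieced together from the earlier slides (Alice: apple and puppy, Bob: apple, Charles: bicycle, and a pony for everybody), and the function name cooccurrence is made up for illustration.

```python
# Hypothetical toy history, reconstructed from the Alice/Bob/Charles slides.
HISTORY = {
    "Alice": {"apple", "puppy", "pony"},
    "Bob": {"apple", "pony"},
    "Charles": {"bicycle", "pony"},
}


def cooccurrence(history):
    """Count, for each pair of distinct items, how many users have both.

    This is the off-diagonal part of A'A for a binary users x items matrix A.
    """
    items = sorted(set().union(*history.values()))
    counts = {a: {b: 0 for b in items if b != a} for a in items}
    for basket in history.values():
        for a in basket:
            for b in basket:
                if a != b:
                    counts[a][b] += 1
    return counts


COOC = cooccurrence(HISTORY)
for item, row in COOC.items():
    print(item, row)
```

With this toy data the pony co-occurs with every other item, which illustrates why raw counts are a poor indicator on their own and a test such as LLR is applied afterwards.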
Indicator Matrix: Anomalous Co-Occurrence
✔
✔
Which one is the anomalous co-occurrence?
        A      not A
B       13     1000
not B   1000   100,000
Score: 0.90

        A      not A
B       1      0
not B   0      10,000
Score: 4.52

        A      not A
B       10     0
not B   0      100,000
Score: 14.3

        A      not A
B       1      0
not B   0      2
Score: 1.95
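The four scores on this slide (0.90, 1.95, 4.52, 14.3) are consistent with the square root of the log-likelihood ratio statistic computed on each 2x2 table. The sketch below is a plain-Python version of that computation using the entropy form of the G² statistic; the function names are chosen for illustration and are not claimed to be Mahout's API.

```python
import math


def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy: N ln N - sum(k ln k), where N = sum(counts)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: A and B together, k12: B without A,
    k21: A without B,      k22: neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    return math.sqrt(llr(k11, k12, k21, k22))


# The four tables from the slide.
TABLES = [
    (13, 1000, 1000, 100000),
    (1, 0, 0, 10000),
    (10, 0, 0, 100000),
    (1, 0, 0, 2),
]

for table in TABLES:
    print(table, round(root_llr(*table), 2))
```

The first table has the largest raw co-occurrence count (13) yet the lowest score, because A and B are both common enough that 13 joint events is close to what chance predicts. The table with 10 joint events and no solo events scores highest.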
Collection of Documents: Insert Meta-Data
Search technology; item meta-data
Document for “puppy”:
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
Ingest easily via NFS
A Quick Simplification
• Users who do h
• Also do Ah
User-centric recommendations: Aᵀ(Ah)
Item-centric recommendations: (AᵀA)h
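The two groupings give the same vector because matrix multiplication is associative; what changes is which product can be computed ahead of time. The sketch below checks this on a small binary matrix in plain Python. The matrix A, the item order, and the choice of Bob's history as h are assumptions made for illustration.

```python
# Hypothetical binary history matrix A: rows are users, columns are items.
ITEMS = ["apple", "puppy", "bicycle", "pony"]
A = [
    [1, 1, 0, 1],  # Alice
    [1, 0, 0, 1],  # Bob
    [0, 0, 1, 1],  # Charles
]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


h = [1, 0, 0, 1]  # one user's history over ITEMS (here: apple and pony)

# User-centric: find users similar to h first, then sum up their items.
user_centric = matvec(transpose(A), matvec(A, h))

# Item-centric: precompute the items x items matrix A'A, then apply it to h.
item_centric = matvec(matmul(transpose(A), A), h)

print(dict(zip(ITEMS, item_centric)))
```

The item-centric form is the one that suits offline computation: AᵀA depends only on the logs, so it can be built in a batch job and then applied to any user's history h at query time.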
val drmA = sampleDownAndBinarize(
drmARaw, randomSeed, maxNumInteractions).checkpoint()
val numUsers = drmA.nrow.toInt
// Compute number of interactions per thing in A
val csums = drmBroadcast(drmA.colSums)
// Compute co-occurrence matrix A'A
val drmAtA = drmA.t %*% drmA
Sandbox
Going Further: Multi-Modal Recommendation
Better Long-Term Recommendations
• Anti-flood: avoid having too much of a good thing
• Dithering: “When making it worse makes it better”
Why Use Dithering?
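The slides do not give a formula for dithering. One simple formulation, assumed here purely for illustration, is to re-sort a ranked list by the logarithm of each item's rank plus Gaussian noise, so the top results mostly stay put while deeper results occasionally surface and get a chance to collect feedback. The function name dither and the noise_scale parameter are hypothetical.

```python
import math
import random


def dither(ranked_items, noise_scale=0.5, rng=None):
    """Return a noisy re-ordering of ranked_items (best item first).

    Each item gets a synthetic score log(rank) + gaussian noise, and the list
    is re-sorted by that score. noise_scale = 0 leaves the order unchanged.
    """
    rng = rng or random.Random()
    scored = [
        (math.log(rank) + rng.gauss(0.0, noise_scale), item)
        for rank, item in enumerate(ranked_items, start=1)
    ]
    scored.sort(key=lambda pair: pair[0])
    return [item for _, item in scored]


# Usage: a fixed seed makes the shuffle repeatable, e.g. per user and per day.
results = ["item-%d" % i for i in range(1, 21)]
print(dither(results, noise_scale=0.5, rng=random.Random(2014)))
```

Because the noise is added on a log scale, swapping ranks 1 and 2 is about as likely as swapping ranks 10 and 20, so the disturbance grows with depth in the list.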
Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout
Sample Music Log Files
13 START 10113 2182654281
23 BEACON 10113 2182654281
24 START 10113 79600611935028
34 BEACON 10113 79600611935028
44 BEACON 10113 79600611935028
54 BEACON 10113 79600611935028
64 BEACON 10113 79600611935028
74 BEACON 10113 79600611935028
84 BEACON 10113 79600611935028
94 BEACON 10113 79600611935028
104 BEACON 10113 79600611935028
109 FINISH 10113 79600611935028
111 START 10113 58999912011972
121 BEACON 10113 58999912011972
Fields: time, event type, user ID, artist ID, track ID
Documents for Music Recommendation
id: 1710
mbid: 592a3b6d-c42b-4567-99c9-ecf63bd66499
name: Chuck Berry
area: United States
gender: Male
indicator_artists: 386685,875994,637954,3418,1344,789739,1460, …

id: 541902
mbid: 983d4f8f-473e-4091-8394-415c105c4656
name: Charlie Winston
area: United Kingdom
gender: None
indicator_artists: 997727,815,830794,59588,900,2591,1344,696268, …
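With indicators stored as a field on each artist document, recommending becomes a search: the query is the user's recent artist IDs, matched against indicator_artists. The sketch below imitates that with a plain-Python overlap count. A real search engine would apply its own relevance weighting, so this is only a stand-in. The indicator lists on the slide are truncated, so only the visible IDs are used, and the function name recommend is made up for illustration.

```python
# Artist documents with the indicator IDs visible on the slide
# (the slide truncates both lists, so these sets are incomplete).
ARTIST_DOCS = [
    {
        "id": 1710,
        "name": "Chuck Berry",
        "indicator_artists": {386685, 875994, 637954, 3418, 1344, 789739, 1460},
    },
    {
        "id": 541902,
        "name": "Charlie Winston",
        "indicator_artists": {997727, 815, 830794, 59588, 900, 2591, 1344, 696268},
    },
]


def recommend(recent_artist_ids, docs, limit=10):
    """Rank artist documents by how many of their indicators the user has played."""
    history = set(recent_artist_ids)
    scored = []
    for doc in docs:
        overlap = len(history & doc["indicator_artists"])
        # Skip documents with no matching indicator and artists already played.
        if overlap > 0 and doc["id"] not in history:
            scored.append((overlap, doc["name"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Usage: a listener whose recent history contains artists 1344 and 3418.
print(recommend([1344, 3418], ARTIST_DOCS))
```

Artist 1344 appears in both indicator lists, so it matches both documents; 3418 appears only in Chuck Berry's list, which is what breaks the tie in this toy query.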
Practical Machine Learning:
Innovations in Recommendation
28 April 2014 NoSQL Matters Conference #NoSQLMatters
Ted Dunning, Chief Applications Architect MapR Technologies

Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 

Mahout 1.0 Preview: Goodbye MapReduce, Hello Spark

  • 1. © 2014 MapR Technologies
  • 2. What’s New in Apache Mahout: A Preview of Mahout 1.0
      21 May 2014 Boulder/Denver Big Data Meet-up #BDBDM
      Ted Dunning, Chief Applications Architect MapR Technologies
      Twitter @Ted_Dunning
      Email tdunning@mapr.com tdunning@apache.org
  • 3. There was just an explosion in Apache Mahout…
  • 4. Apache Mahout up to now…
      • Open source Apache project http://mahout.apache.org/
      • Mahout version is 0.9 released Feb 2014; included Scala
          – Summary 0.9 blog at http://bit.ly/1rirUUL
      • Library of scalable algorithms for machine learning
          – Some run on Apache Hadoop distributions; others do not require Hadoop
          – Some can be run at small scale
          – Some are run in parallel; others are sequential
      • Includes the following main areas:
          – Clustering & related techniques
          – Classification
          – Recommendation
          – Mahout Math Library
  • 5. Roadmap to Mahout 1.0
      • Say good-bye to MapReduce
          – New MR algorithms will not be accepted
          – Support for existing ones will continue for now
      • Support for Apache Spark
          – Under construction; some features already available
      • Support for h2o being explored
      • Support for Apache Stratosphere possibly in future
  • 6. Roadmap: Apache Mahout 1.0
  • 7. Apache Spark
      • Apache Spark http://spark.apache.org/
          – Open source “fast and general engine for large scale data processing”
          – Especially fast in-memory
          – Made top level open Apache project
              • Feb 2014
              • http://spark.apache.org/
              • over 100 committers
          – Original developers have started company called Databricks (Berkeley CA) http://databricks.com/
  • 8. Mahout and Scala
      • Scala http://www.scala-lang.org/
          – Open source; appeared in 2003
          – Wiki describes as “object-functional programming and scripting language”
      • Scala provides functional style
          – Makes lazy evaluation much safer
          – Notationally compact
          – Minor syntax extensions allowed
          – Makes math much easier
  • 9. Here’s what DSL & Spark will mean for Mahout
      • Scala DSL provides convenient notation for expressing parallel machine learning
      • Spark (and other engines) provide execution environment
      • Overview of Scala and Apache Spark bindings in Mahout can be found at https://mahout.apache.org/users/sparkbindings/home.html
  • 10. What do clusters, Cap’n Crunch and Coco Puffs have in common?
  • 11. They’re part of the data in the new Mahout Spark shell tutorial…
  • 12. And you shouldn’t be eating them.
  • 13. Tutorial: Mahout-Spark Shell
      • Find it here http://bit.ly/RSTeMr
      • Early stage code - play with Mahout Scala’s DSL for linear algebra and Mahout-Spark shell
          – Uses publicly available breakfast cereal data set
          – Challenge: Fit linear model that infers customer ratings from ingredients
          – Toy data set but load with Mahout to mimic a huge data set
      • Mahout's linear algebra DSL has an abstraction called DistributedRowMatrix (DRM)
          – models a matrix that is partitioned by rows and stored in the memory of a cluster of machines
  • 14. Dissecting the Model
      • Components
          – Cereal ingredients are the features
          – Ratings are the target variables
      • Linear regression assumes that target variable y is generated by linear combination of feature matrix X with parameter vector β plus the noise ε
          y = Xβ + ε
      • Goal: Find estimate of parameter vector β that explains data
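The model on slide 14 leads to the quantities that slide 17 computes (XtX and Xty). As a short worked derivation, assuming the estimate is chosen by least squares and that XᵀX is invertible (neither assumption is stated on the slides):

```latex
\begin{aligned}
\hat{\beta} &= \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 \\
0 &= \nabla_{\beta} \lVert y - X\beta \rVert_2^2 \Big|_{\beta = \hat{\beta}}
   = -2\, X^{\top} \bigl( y - X\hat{\beta} \bigr) \\
X^{\top} X \, \hat{\beta} &= X^{\top} y \\
\hat{\beta} &= \bigl( X^{\top} X \bigr)^{-1} X^{\top} y
\end{aligned}
```

The third line is the linear system that a call like solve(XtX, Xty) addresses: XᵀX has one row and column per feature, so it stays small even when X has a very large number of rows.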
  • 15. What do you see in this matrix?
      val drmData = drmParallelize(dense(
        (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
        (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
        (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
        (2, 1, 11,   13, 32.207582),  // Froot Loops
        (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
        (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
        (6, 2, 17,   1,  50.764999),  // Cheerios
        (3, 2, 13,   7,  40.400208),  // Clusters
        (3, 3, 13,   4,  45.811716)), // Great Grains Pecan
        numPartitions = 2);
  • 16. Add Bias Column
      val drmX1 = drmX.mapBlock(ncol = drmX.ncol + 1) {
        case(keys, block) =>
          // create a new block with an additional column
          val blockWithBiasColumn = block.like(block.nrow, block.ncol + 1)
          // copy data from current block into the new block
          blockWithBiasColumn(::, 0 until block.ncol) := block
          // last column consists of ones
          blockWithBiasColumn(::, block.ncol) := 1
          keys -> blockWithBiasColumn
      }
  • 17. Solve Linear System, Compute Error
      val XtX = (drmX1.t %*% drmX1).collect
      val Xty = (drmX1.t %*% y).collect(::, 0)
      val beta = solve(XtX, Xty)
      val fittedY = (drmX1 %*% beta).collect(::, 0)
      val error = (y - fittedY).norm(2)
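The steps on slides 15 to 17 (load the data, add a bias column, solve the normal equations, measure the error) can be traced at small scale without a cluster. The following is a minimal sketch in plain Scala, assuming no Mahout or Spark at all: the helper names (addBias, matMul, matVec, solve) are hypothetical stand-ins written for this illustration, not the Mahout DSL operators, and the solver is ordinary Gaussian elimination rather than whatever Mahout uses internally. It is written with top-level definitions, so it assumes Scala 3 or a Scala script/REPL session.

```scala
// Plain-Scala sketch (no Mahout, no Spark): least squares via the normal equations.
type Matrix = Vector[Vector[Double]]

// Rows copied from slide 15: four ingredient features, then the rating.
val cereal: Matrix = Vector(
  Vector(2.0, 2.0, 10.5, 10.0, 29.509541), // Apple Cinnamon Cheerios
  Vector(1.0, 2.0, 12.0, 12.0, 18.042851), // Cap'n'Crunch
  Vector(1.0, 1.0, 12.0, 13.0, 22.736446), // Cocoa Puffs
  Vector(2.0, 1.0, 11.0, 13.0, 32.207582), // Froot Loops
  Vector(1.0, 2.0, 12.0, 11.0, 21.871292), // Honey Graham Ohs
  Vector(2.0, 1.0, 16.0, 8.0, 36.187559),  // Wheaties Honey Gold
  Vector(6.0, 2.0, 17.0, 1.0, 50.764999),  // Cheerios
  Vector(3.0, 2.0, 13.0, 7.0, 40.400208),  // Clusters
  Vector(3.0, 3.0, 13.0, 4.0, 45.811716))  // Great Grains Pecan

// Hypothetical helper: append a constant 1.0 to every row (the bias column).
def addBias(m: Matrix): Matrix = m.map(_ :+ 1.0)

// Hypothetical helper: matrix-matrix product.
def matMul(a: Matrix, b: Matrix): Matrix = {
  val bt = b.transpose
  a.map(row => bt.map(col => row.zip(col).map { case (p, q) => p * q }.sum))
}

// Hypothetical helper: matrix-vector product.
def matVec(a: Matrix, v: Vector[Double]): Vector[Double] =
  a.map(row => row.zip(v).map { case (p, q) => p * q }.sum)

// Hypothetical helper: solve a * x = b by Gaussian elimination with partial pivoting.
def solve(a: Matrix, b: Vector[Double]): Vector[Double] = {
  val n = b.length
  val m = Array.tabulate(n)(i => (a(i) :+ b(i)).toArray) // augmented matrix [a | b]
  for (col <- 0 until n) {
    // Swap in the row with the largest pivot for numerical stability.
    val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
    val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
    for (r <- col + 1 until n) {
      val f = m(r)(col) / m(col)(col)
      for (c <- col to n) m(r)(c) -= f * m(col)(c)
    }
  }
  // Back substitution on the upper-triangular system.
  val sol = Array.fill(n)(0.0)
  for (i <- n - 1 to 0 by -1) {
    val s = (i + 1 until n).map(j => m(i)(j) * sol(j)).sum
    sol(i) = (m(i)(n) - s) / m(i)(i)
  }
  sol.toVector
}

val x: Matrix = cereal.map(_.take(4))    // features
val y: Vector[Double] = cereal.map(_(4)) // ratings
val x1 = addBias(x)
val xt = x1.transpose
val beta = solve(matMul(xt, x1), matVec(xt, y)) // solves (X'X) beta = X'y
val fitted = matVec(x1, beta)
val residualNorm =
  math.sqrt(y.zip(fitted).map { case (obs, fit) => (obs - fit) * (obs - fit) }.sum)
```

Here beta has five entries (four ingredient weights plus the intercept) and residualNorm plays the role of error on slide 17. The point of the Mahout version is that the two products involving the tall matrix run distributed, while the 5 x 5 solve happens in memory on one machine.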
  • 18. In R
      all = matrix(
        c(2, 2, 10.5, 10, 29.509541,
          1, 2, 12, 12, 18.042851,
          1, 1, 12, 13, 22.736446,
          2, 1, 11, 13, 32.207582,
          1, 2, 12, 11, 21.871292,
          2, 1, 16, 8, 36.187559,
          6, 2, 17, 1, 50.764999,
          3, 2, 13, 7, 40.400208,
          3, 3, 13, 4, 45.811716),
        byrow=T, ncol=5)
  • 19. More R
      a1 = cbind(a, 1)
      ata = t(a1) %*% a1
      aty = t(a1) %*% y
      x1 = solve(a=ata, b=aty)
  • 20. Well, Actually
      df = data.frame(all)
      m = lm(X5 ~ X1 + X2 + X3 + X4, df)
      plot(df$X5, predict(m))
      abline(lm(y ~ x, data.frame(x=df$X5, y=predict(m))), col='red')
  • 21. R Wins
  • 22. R Wins … For Now
  • 23. R Wins … For Now … at Small Scale
  • 24. Recommendation
      Behavior of a crowd helps us understand what individuals will do
  • 25. Recommendation
      Alice got an apple and a puppy. Charles got a bicycle. Bob got an apple.
  • 26. Recommendation
      Alice got an apple and a puppy. Charles got a bicycle. Bob got an apple. What else would Bob like?
  • 27. Recommendation
      Alice got an apple and a puppy. Charles got a bicycle. Bob: A puppy!
  • 28. You get the idea of how recommenders work…
  • 29. By the way, like me, Bob also wants a pony…
  • 30. Recommendation
      Users: Alice, Bob, Charles, Amelia (?)
      What if everybody gets a pony? What else would you recommend for new user Amelia?
  • 31. Recommendation
      Users: Alice, Bob, Charles, Amelia (?)
      If everybody gets a pony, it’s not a very good indicator of what else to predict... What we want is anomalous co-occurrence
  • 32. Get Useful Indicators from Behaviors
      • Use log files to build history matrix of users x items
          – Remember: this history of interactions will be sparse compared to all potential combinations
      • Transform to a co-occurrence matrix of items x items
      • Look for useful co-occurrence by looking for anomalous co-occurrences to make an indicator matrix
          – Log Likelihood Ratio (LLR) can be helpful to judge which co-occurrences can with confidence be used as indicators of preference
          – ItemSimilarityJob in Apache Mahout uses LLR
              • (pony book said RowSimilarityJob, not as good)
  • 33. Model uses three matrices…
  • 34. History Matrix: Users x Items
      [Figure: check-mark grid of users Alice, Bob and Charles against items]
  • 35. Co-Occurrence Matrix: Items x Items
      [Figure: item-by-item count grid; extracted entries: - 1 2 1 1 1 1 2 1 0 0 0 0]
      Use LLR test to turn co-occurrence into indicators of interesting co-occurrence
  • 36. Indicator Matrix: Anomalous Co-Occurrence
      [Figure: indicator grid with two check marks]
  • 37. Which one is the anomalous co-occurrence?
                 A       not A
      B          13      1000
      not B      1000    100,000      score 0.90

                 A       not A
      B          1       0
      not B      0       10,000       score 4.52

                 A       not A
      B          10      0
      not B      0       100,000      score 14.3

                 A       not A
      B          1       0
      not B      0       2            score 1.95
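The four numbers on slide 37 are consistent with the square root of the log-likelihood ratio (LLR) statistic computed on each 2 x 2 table. A minimal sketch in plain Scala, assuming the entropy formulation of LLR: the function names below are chosen for this illustration, and the code is not presented as Mahout's implementation. It uses only top-level definitions, so it assumes Scala 3 or a Scala script/REPL session.

```scala
// x * ln(x), with the convention 0 * ln(0) = 0.
def xLogX(x: Long): Double = if (x == 0L) 0.0 else x * math.log(x.toDouble)

// N times the Shannon entropy (in nats) of a set of counts summing to N.
def entropy(counts: Long*): Double =
  xLogX(counts.sum) - counts.map(xLogX).sum

// LLR for a 2 x 2 contingency table:
//   k11 = A and B together, k12 = B without A,
//   k21 = A without B,      k22 = neither.
def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
  val rowEntropy = entropy(k11 + k12, k21 + k22)
  val colEntropy = entropy(k11 + k21, k12 + k22)
  val matEntropy = entropy(k11, k12, k21, k22)
  // Clamp tiny negative values caused by floating-point rounding.
  math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
}

// Square root of LLR, which matches the scale of the numbers on the slide.
def rootLlr(k11: Long, k12: Long, k21: Long, k22: Long): Double =
  math.sqrt(llr(k11, k12, k21, k22))

// The four tables from slide 37, in slide order.
val slideScores: Vector[Double] = Vector(
  rootLlr(13, 1000, 1000, 100000), // slide shows 0.90
  rootLlr(1, 0, 0, 10000),         // slide shows 4.52
  rootLlr(10, 0, 0, 100000),       // slide shows 14.3
  rootLlr(1, 0, 0, 2))             // slide shows 1.95
```

The first table has the largest raw co-occurrence count (13) but the lowest score, because A and B are both so common that 13 joint occurrences is close to what chance would produce. The third table has fewer joint occurrences yet scores highest, since A and B never appear apart.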
  • 38. Collection of Documents: Insert Meta-Data
      Search Technology
      Item meta-data
      Document for “puppy”
          id: t4
          title: puppy
          desc: The sweetest little puppy ever.
          keywords: puppy, dog, pet
      Ingest easily via NFS
  • 39. A Quick Simplification
      • Users who do h
      • Also do Ah
      User-centric recommendations: Aᵀ(Ah)
      Item-centric recommendations: (AᵀA)h
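Slide 39's point is that the two groupings give the same scores, while only the second lets the expensive product be computed ahead of time. A minimal sketch in plain Scala using the Alice, Bob and Charles example from the earlier slides; the item order (apple, puppy, bicycle) and the helper names are assumptions made for this illustration. It uses only top-level definitions, so it assumes Scala 3 or a Scala script/REPL session.

```scala
type Matrix = Vector[Vector[Double]]

// Hypothetical helper: matrix-vector product.
def matVec(a: Matrix, v: Vector[Double]): Vector[Double] =
  a.map(row => row.zip(v).map { case (p, q) => p * q }.sum)

// Hypothetical helper: matrix-matrix product.
def matMul(a: Matrix, b: Matrix): Matrix = {
  val bt = b.transpose
  a.map(row => bt.map(col => row.zip(col).map { case (p, q) => p * q }.sum))
}

// History matrix A (users x items).
// Rows: Alice, Bob, Charles. Columns: apple, puppy, bicycle.
val history: Matrix = Vector(
  Vector(1.0, 1.0, 0.0), // Alice got an apple and a puppy
  Vector(1.0, 0.0, 0.0), // Bob got an apple
  Vector(0.0, 0.0, 1.0)) // Charles got a bicycle

// h: the current user's recent history (here, Bob's: just the apple).
val h: Vector[Double] = Vector(1.0, 0.0, 0.0)

// User-centric: find similar users first (Ah), then what they did (A' times that).
val userCentric = matVec(history.transpose, matVec(history, h))

// Item-centric: the co-occurrence matrix A'A does not depend on h,
// so it can be computed offline and only multiplied by h at request time.
val cooccurrence = matMul(history.transpose, history)
val itemCentric = matVec(cooccurrence, h)
```

Both vectors come out as (2, 1, 0) over (apple, puppy, bicycle): after setting aside the apple Bob already has, the puppy is the top candidate. These are raw co-occurrence counts; the deck's actual pipeline filters them with the LLR test before using them as indicators.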
  • 40.
      val drmA = sampleDownAndBinarize(
        drmARaw, randomSeed, maxNumInteractions).checkpoint()
      val numUsers = drmA.nrow.toInt
      // Compute number of interactions per thing in A
      val csums = drmBroadcast(drmA.colSums)
      // Compute co-occurrence matrix A'A
      val drmAtA = drmA.t %*% drmA
  • 41. What’s New in Apache Mahout: A Preview of Mahout 1.0
      21 May 2014 Boulder/Denver Big Data Meet-up #BDBDM
      Ted Dunning, Chief Applications Architect MapR Technologies
      Twitter @Ted_Dunning
      Email tdunning@mapr.com tdunning@apache.org
  • 42.
  • 43. Sandbox
  • 44. Going Further: Multi-Modal Recommendation
  • 45. Going Further: Multi-Modal Recommendation
  • 46. Better Long-Term Recommendations
      • Anti-flood: Avoid having too much of a good thing
      • Dithering: “When making it worse makes it better”
  • 47. Why Use Dithering?
  • 48. What’s New in Apache Mahout? A Preview of Mahout 1.0
      21 May 2014 #BDBDM
      Ted Dunning, Chief Applications Architect MapR Technologies
      Twitter @Ted_Dunning
      Email tdunning@mapr.com tdunning@apache.org
      Apache Mahout https://mahout.apache.org/
      Twitter @ApacheMahout
  • 49. Sample Music Log Files
      13  START  10113 2182654281
      23  BEACON 10113 2182654281
      24  START  10113 79600611935028
      34  BEACON 10113 79600611935028
      44  BEACON 10113 79600611935028
      54  BEACON 10113 79600611935028
      64  BEACON 10113 79600611935028
      74  BEACON 10113 79600611935028
      84  BEACON 10113 79600611935028
      94  BEACON 10113 79600611935028
      104 BEACON 10113 79600611935028
      109 FINISH 10113 79600611935028
      111 START  10113 58999912011972
      121 BEACON 10113 58999912011972
      Field labels: Time, Event type, User ID, Artist ID, Track ID
  • 50. Documents for Music Recommendation
      id 1710
      mbid 592a3b6d-c42b-4567-99c9-ecf63bd66499
      name Chuck Berry
      area United States
      gender Male
      indicator_artists 386685,875994,637954,3418,1344,789739,1460, …

      id 541902
      mbid 983d4f8f-473e-4091-8394-415c105c4656
      name Charlie Winston
      area United Kingdom
      gender None
      indicator_artists 997727,815,830794,59588,900,2591,1344,696268, …
  • 51. Practical Machine Learning: Innovations in Recommendation
      28 April 2014 NoSQL Matters Conference #NoSQLMatters
      Ted Dunning, Chief Applications Architect MapR Technologies
      Twitter @Ted_Dunning
      Email tdunning@mapr.com tdunning@apache.org
      Apache Mahout https://mahout.apache.org/
      Twitter @ApacheMahout
  • 52.

Editor's notes

  1. Ted: Is “Revolution” a better word? Want to imply exciting change but not dissension
  2. Talk track: Apache Mahout is an open-source project with international contributors and a vibrant community of users and developers. A new version – 0.8 – was recently released. Mahout is a library of scalable algorithms used for clustering, classification and recommendation. Mahout also includes a math library that is low level, flexible, scalable and makes certain functions very easy to carry out. Talk track: First let’s make a quick comparison of the three main areas of Mahout machine learning…
  3. Ted: I included this as intro slide to set up the content, but I think save details for each following slide
  4. TED: NO Idea???
  5. Ted: Is “Revolution” a better word? Want to imply exciting change but not dissension
  6. Ted: Is “Revolution” a better word? Want to imply exciting change but not dissension
  7. Ted: Is “Revolution” a better word? Want to imply exciting change but not dissension
  8. The first four columns represent the ingredients (our features) and the last column (the rating) is the target variable for our regression. Linear regression assumes that the target variable y is generated by the linear combination of the feature matrix X with the parameter vector β plus the noise ε, summarized in the formula y = Xβ + ε. Our goal is to find an estimate of the parameter vector β that explains the data very well.
  9. Ted: Is “Revolution” a better word? Want to imply exciting change but not dissension
  10. Ted: Is “Revolution” a better word? Want to imply exciting change but not dissension
  11. Ted: Is “Revolution” a better word? Want to imply exciting change but not dissension
  12. TED: consider using the word “interesting” instead of “anomalous”… people may think you are talking about anomaly detection…
  13. TED: Likely this can be skipped
  14. Notes to trainer: A lot of work to do a grid, so represent it by math. A is the history matrix. h is the vector of items for one (new, current) user. Ah finds users who do the same things as in h. A transpose times Ah gives you the things that these users do. These have the shape of matrix multiplications and many of the same properties. Sometimes they have weights etc. Had they been exactly the same, we could just move the parentheses. Our recommender does the item-centric version. General relationships in data don’t change fast (what is related to what; nothing happens to change how Mozart is related to Haydn overnight). What does change fast is what the user did in the last five minutes. In the first case, we have to compute Ah first. The input to that computation (h) is only available now, in real time, so nothing can be computed ahead of time. The second case (A transpose A) only involves things that change slowly, so pre-compute it. That makes it possible to do this offline. This is significant because we move a lot of computation for all users into an overnight process. Each real-time recommendation then involves only a small part: only one big matrix multiply in real time. Result: you get a fast response for the recommendations. The second form runs on one machine for one user (the real-time part).
  15. Talk track: Here are documents for two different artists with indicator IDs that are part of the recommendation model. When recommendations are needed, the web-site uses recent visitor behavior to query against the indicators in these documents.