Collaborative Filtering in Map/Reduce


                             Ole-Martin Mørk - Open AdExchange

Tuesday, 14 September 2010
Vision

                     • Learn that Map/Reduce is simple
                     • Learn that Map/Reduce may be powerful
                     • Collaborative Filtering is fun!


Agenda

                     • Map/Reduce
                     • Collaborative Filtering
                     • Collaborative Filtering with Map/Reduce
                     • Amazon Elastic MapReduce

Map/Reduce


                     • Very scalable algorithm
                     • Inspired by map and reduce from
                             functional programming.
                     • Everything is based on key/value


6 phases
                     • Reader
                     • Map
                     • Partition
                     • Comparison
                     • Reduce
                     • Writer
Map
functional map


   List("hello", "dude").map { x => x.substring(0, 1) }




Map/Reduce map

                     • Input is key/value

                     • Output is key/value


Simple Example, Map

                     • Count occurrences of words in a document
                     • Input is: <linenumber>, <content of line>
                     • For each word on the line, the output is
                             <word>, <count>




Reduce
functional reduce


                val sum=List(32,40,23).reduceLeft{_+_}




Map/Reduce reduce

                     • Input is key/list of values

                     • Output is key/value


Simple Example, Reduce

                     • Reduce input is <word>, <list of counts>
                     • For each value, we add it to a running sum
                     • Output is <word>, <sum of counts>
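
The word-count map and reduce steps described above can be sketched in plain Java, without Hadoop. This is a minimal illustration, not the presenter's code: the names `WordCountSketch`, `map`, `reduce` and `run` are hypothetical, lower-casing the words is an added assumption, and the grouping inside `run` stands in for the partition and comparison phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the word-count job; no Hadoop classes are used.
final class WordCountSketch {

    // Map: input is <line number>, <content of line>; output is one <word>, 1 pair per word.
    static List<Map.Entry<String, Integer>> map(long lineNumber, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // Assumption: words are compared case-insensitively.
                out.add(Map.entry(word.toLowerCase(), 1));
            }
        }
        return out;
    }

    // Reduce: input is <word>, <list of counts>; output is the sum of the counts.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: groups the map output by key, then calls reduce once per word.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> pair : map(i, lines.get(i))) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        // prints {dude=1, hello=2, map=1, reduce=1}
        System.out.println(run(List.of("hello dude", "hello map reduce")));
    }
}
```

Because `map` looks at one line at a time and `reduce` at one word at a time, both can run on many machines in parallel.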


Collaborative Filtering
Amazon




Last.fm




Sceneami.com




User based

                     • Useful when we have
                      • Small number of users
                      • High correlation between users
                      • Data that changes often

Item based

                     • Useful for big sites like Amazon etc.
                     • Small overlap between users
                     • Mostly static data


Euclidean Distance

                     [Figure: plot with "Rating" and "Match" axis labels; plotted presentations include "Min drømmeapplikasjon" and "Pattern Matching in Scala"]

                     • Alf's presentations: 1, 25, 56, 57, 58, 98 (6)
                     • Kari's presentations: 2, 25, 98, 99 (4)

                     • Equal presentations: 25 and 98 (2)
                     • Unmatched presentations: (6 - 2) + (4 - 2) = 6
                     • Distance score: 1 / (1 + sqrt(6)) = 0.29
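
The arithmetic on this slide can be checked with a few lines of Java. This is a sketch under the assumption that the score is 1 / (1 + sqrt(unmatched)), as the numbers on the slide suggest; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class DistanceScoreSketch {

    // Score = 1 / (1 + sqrt(unmatched)), where unmatched counts the
    // presentations that only one of the two users has chosen.
    static double score(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b); // presentations both users picked
        int unmatched = (a.size() - common.size()) + (b.size() - common.size());
        return 1.0 / (1.0 + Math.sqrt(unmatched));
    }

    public static void main(String[] args) {
        Set<Integer> alf = Set.of(1, 25, 56, 57, 58, 98);
        Set<Integer> kari = Set.of(2, 25, 98, 99);
        // prints 0.29
        System.out.println(String.format(Locale.ROOT, "%.2f", score(alf, kari)));
    }
}
```

Two users with identical choices get a score of 1.0, and the score falls toward 0 as the number of unmatched presentations grows.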
Recommended sessions
                     • Me: 1,2,5,6,7
                     • Kate (0.31): 5,6,8,9
                     • Paul (0.41): 1,2,4,5,6
                     • Mary (0.31): 1,5,8,9

                     • Recommended: 8 (0.62), 9 (0.62), 4 (0.41)
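
The recommended scores appear to be the sum of the similarity scores of every user who picked a session that I have not picked (8 and 9 come from Kate and Mary, 4 from Paul). A minimal Java sketch of that reading follows; `RecommendationSketch` and `Neighbour` are hypothetical names, and the summing rule is inferred from the numbers on the slide rather than stated by it.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class RecommendationSketch {

    // Hypothetical holder for a similar user: name, distance score, chosen sessions.
    static final class Neighbour {
        final String name;
        final double score;
        final Set<Integer> sessions;

        Neighbour(String name, double score, Set<Integer> sessions) {
            this.name = name;
            this.score = score;
            this.sessions = sessions;
        }
    }

    // For every session I have not chosen, add up the scores of the neighbours who chose it.
    static Map<Integer, Double> recommend(Set<Integer> mine, List<Neighbour> neighbours) {
        Map<Integer, Double> totals = new TreeMap<>();
        for (Neighbour n : neighbours) {
            for (int session : n.sessions) {
                if (!mine.contains(session)) {
                    totals.merge(session, n.score, Double::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Set<Integer> me = Set.of(1, 2, 5, 6, 7);
        List<Neighbour> neighbours = List.of(
                new Neighbour("Kate", 0.31, Set.of(5, 6, 8, 9)),
                new Neighbour("Paul", 0.41, Set.of(1, 2, 4, 5, 6)),
                new Neighbour("Mary", 0.31, Set.of(1, 5, 8, 9)));
        // Sessions 4, 8 and 9 with their summed scores, ordered by session number.
        System.out.println(recommend(me, neighbours));
    }
}
```

The result is keyed by session number; sorting it by descending score would give the order shown on the slide.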
Demo



More Map/Reduce




Several iterations

                     [Figure: Iteration 1, Iteration 2 and Iteration 3 shown one after another]

Several iterations

                     [Figure: Iteration 1 and Iteration 2 side by side, followed by Iteration 3]

Partitioning

                     • Keys: Paul, Mary, Kate, Lea, Jeff, Ali
                     • Reducer 1 receives: Ali, Lea, Paul
                     • Reducer 2 receives: Jeff, Kate, Mary
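
One common way to decide which reducer receives a key is to hash the key and take the remainder by the number of reducers. The Java sketch below illustrates that idea only; `PartitionSketch` is a hypothetical name, and the split it produces for these six names need not match the split drawn on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionSketch {

    // Picks a reducer index in the range [0, numReducers) for a key.
    static int partition(String key, int numReducers) {
        // Mask the sign bit so the result is never negative, then take the remainder.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups the keys by the reducer they are assigned to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReducers) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            byReducer.computeIfAbsent(partition(key, numReducers), r -> new ArrayList<>()).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        System.out.println(assign(List.of("Paul", "Mary", "Kate", "Lea", "Jeff", "Ali"), 2));
    }
}
```

The important property is that the same key always lands on the same reducer, so all values for one key end up together.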

Comparison

                     • Input: Pres 1 with Paul, Lea, Ali; Pres 2 with Jeff, Mary, Kate
                     • Reducer 1 receives Pres 1: Paul, Ali, Lea
                     • Reducer 2 receives Pres 2: Kate, Jeff, Mary

Guidelines

                • Never access external sources during
                        computation.
                • Your functions should be small and fast
                • You might not have all the data available


Hadoop
                     • Hadoop reuses objects, so remember to
                             clone them if you plan to keep them.
                     • You can read and write any object that
                             implements org.apache.hadoop.io.WritableComparable:
                             • write(DataOutput)
                             • readFields(DataInput)
                             • compareTo(Object)
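
A key type with these three methods might look like the Java sketch below. It is an illustration only: `UserPair` is a hypothetical class, and it deliberately declares `Comparable` instead of Hadoop's `WritableComparable` so that it compiles with nothing but the JDK. In a real job the class would implement the Hadoop interface with the same method bodies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key holding two user names; Hadoop itself is not on the classpath here.
final class UserPair implements Comparable<UserPair> {
    private String first = "";
    private String second = "";

    UserPair() { }  // empty instance for readFields to fill in

    UserPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    // Serialises the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(first);
        out.writeUTF(second);
    }

    // Reads the fields back in the same order, overwriting this instance.
    public void readFields(DataInput in) throws IOException {
        first = in.readUTF();
        second = in.readUTF();
    }

    // Orders by first name, then by second name.
    @Override
    public int compareTo(UserPair other) {
        int byFirst = first.compareTo(other.first);
        return byFirst != 0 ? byFirst : second.compareTo(other.second);
    }

    // Take a copy before storing a key, because the framework may reuse the instance.
    UserPair copy() {
        return new UserPair(first, second);
    }

    @Override
    public String toString() {
        return first + "," + second;
    }

    // Writes a pair to bytes and reads it back, as a serialisation round trip.
    static UserPair roundTrip(UserPair pair) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            pair.write(new DataOutputStream(bytes));
            UserPair restored = new UserPair();
            restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return restored;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because `readFields` overwrites the fields of an existing instance, a reference kept from an earlier call would silently change; `copy()` shows the kind of defensive clone the first bullet recommends.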
Collaborative Filtering, the Map/Reduce way


Overview
                     • Create an application that recommends
                             JavaZone presentations.
                     • Overall goal: Scalable performance

                     • 4 iterations
                     • Reading input from text file
Iteration 1

                     • Map input: <user>, <presentations>
                     • Map output: <presentation>, <user>

                     • Reduce output: <presentation>, <userList>
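
Iteration 1 amounts to inverting the input: from each user's list of presentations to each presentation's list of users. The plain-Java sketch below shows that inversion in memory; `IterationOneSketch` and `invert` are hypothetical names, and the actual job classes are not shown in the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IterationOneSketch {

    // Input: <user>, <presentations>. Output: <presentation>, <userList>.
    static Map<Integer, List<String>> invert(Map<String, List<Integer>> chosen) {
        Map<Integer, List<String>> usersByPresentation = new TreeMap<>();
        chosen.forEach((user, presentations) -> {
            for (int presentation : presentations) {
                // Map step: emit <presentation>, <user>.
                // Reduce step: collect every user emitted under the same presentation.
                usersByPresentation
                        .computeIfAbsent(presentation, k -> new ArrayList<>())
                        .add(user);
            }
        });
        return usersByPresentation;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> chosen = new TreeMap<>();
        chosen.put("Alf", List.of(1, 25, 56, 57, 58, 98));
        chosen.put("Kari", List.of(2, 25, 98, 99));
        // Presentations 25 and 98 end up with both Alf and Kari.
        System.out.println(invert(chosen));
    }
}
```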

Iteration 2
                     • Map input: <presentation>, <userList>
                     • Map output: <user>, <userList>

                     • Reduce input: <user>, <list of userList>
                     • Reduce output: <userTuplet>, <match count>
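
One way to read Iteration 2: every user receives the user lists of the presentations they chose, and the reducer counts how often each other user appears in those lists. The Java sketch below follows that reading. It is an assumption about the job, not the presenter's code; `IterationTwoSketch` is a hypothetical name, and a user pair is represented as the string "a,b" with the names in sorted order so that each pair is emitted once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IterationTwoSketch {

    // Input: <presentation>, <userList>. Output: <user pair>, <match count>.
    static Map<String, Integer> matchCounts(Map<Integer, List<String>> usersByPresentation) {
        // Map step: for each user of a presentation, emit <user>, <userList>.
        Map<String, List<List<String>>> listsByUser = new TreeMap<>();
        for (List<String> users : usersByPresentation.values()) {
            for (String user : users) {
                listsByUser.computeIfAbsent(user, k -> new ArrayList<>()).add(users);
            }
        }
        // Reduce step: per user, count how often each other user shows up.
        Map<String, Integer> counts = new TreeMap<>();
        listsByUser.forEach((user, lists) -> {
            for (List<String> list : lists) {
                for (String other : list) {
                    if (user.compareTo(other) < 0) { // keep each pair once, in sorted order
                        counts.merge(user + "," + other, 1, Integer::sum);
                    }
                }
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> input = new TreeMap<>();
        input.put(1, List.of("Alf"));
        input.put(2, List.of("Kari"));
        input.put(25, List.of("Alf", "Kari"));
        input.put(98, List.of("Alf", "Kari"));
        // Alf and Kari share presentations 25 and 98, so their match count is 2.
        System.out.println(matchCounts(input));
    }
}
```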


Iteration 3

                     • Map input: <userTuplet>, <match count>
                     • Map output: <userTuplet>, <diff>
                     • Map output: <userTuplet reversed>, <diff>

                     • Reduce output: <user>, <similaruser>

Iteration 4

                     • Map input: <user>, <similaruser>
                     • Map output: <user>, <presentation with score>


                     • Reduce output: <user>, <presentations>

Demo



Map/Reduce on EC2




Elastic Map/Reduce

                     • Same code
                     • Same input
                     • Different configuration


Upload files
    s3cmd put oax-jz10:jar/oax-jz10.jar target/oax.jz10.jar

    s3cmd.rb put oax-jz10:input/data.txt data.txt




Create job flow

    elastic-mapreduce --create --alive --log-uri s3n://oax-jz10/log




Register iterations
    elastic-mapreduce \
     --jobflow j-1NLAIW45QUN4B \
     --jar s3n://oax-jz10/jar/oax-jz10.jar \
     --arg com.openadex.pres.iterations.Iteration1 \
     --arg s3n://oax-jz10/input \
     --arg s3n://oax-jz10/output1




Download output


     s3cmd.rb get oax-jz10:output4/part-00000 out




Demo

Summary

                     • Map/Reduce may be simple
                     • Map/Reduce can be really powerful
                     • Collaborative filtering is fun :-)


Thank you

                               Ole-Martin Mørk
                             olemartin@gmail.com
                             twitter.com/olemartin



                             del.icio.us/olemartin/jz10


                             All images are licensed with Creative Commons.
                             See http://bit.ly/mr-photos for details.
