SlideShare una empresa de Scribd logo
1 de 37
Descargar para leer sin conexión
DiscoRank: Optimizing Discoverability
on SoundCloud
Amélie Anglade
• Developer at SoundCloud
• SoundCloud is the
world’s largest social
sound platform
• Academic background in
Music Information
Retrieval (MIR)
• Design, prototype and
implement Machine
Learning algorithms for
music discovery
DISCOVERABILITY ?
PAGERANK
• The web is a graph:
• nodes = web pages
• edges = hyperlinks
• The (Page)rank of a node depends on the link
structure of the graph
WEB AND PAGERANK
RANDOM SURFER
RANDOM SURFER
A
B
C
D
1/3
1/3
1/3
RANDOM SURFER
A
B
C
D
1/3
1/3
1/3
Nodes visited more often:
• Nodes with many links
• Coming from frequently visited nodes
RANDOM SURFER
A
B
C
D
E
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
TELEPORT
A
B
C
D
E
TELEPORT
A
B
C
D
E
TELEPORT
A
B
C
D
E
If N nodes in graph,
probability to teleport
to any other node
(including self) = 1/N
TELEPORT
A
B
C
D
E
1/N
1/N
1/N
1/N
1/N
TELEPORT
A
B
C
D
E
1/N
1/N
1/N
1/N
α
?
1-α
1/N
At regular node: invoke
teleport operation with
probability α and
standard random walk
with probability (1 - α)
Probability distribution of the surfer at any time is a vector.
COMPUTING THE PAGERANK
That vector converges to a steady state:
the PageRank vector.
PAGERANK EQUATION
SOUNDCLOUD
DISCORANK
DISCORANK
A
B
C
D
EUser
User
Track
Playlist
favorite
follow
featured in
• Search across People, Sounds, Sets, Groups
• One unique rank vector that contains all entities
• Weight the links based on the type of event:
• User favorites Track
• Track is featured in Playlist
...
• New big (but sparse)
adjacency matrix:
UNIVERSAL SEARCH
• How do we identify content that is trending?
• The more recent a listen, favorite, etc. (event) the
higher the weight
• Multiply each event (=edge) by a time decay:
• New adjacency matrix:
BACK TO EXPLORE
PERFORMANCE
OPTIMIZATION
• Millions of entities(=nodes) and events(=edges)
• First DiscoRank: several hours of computation
• Trimmed down to a few minutes using:
• Sparse matrix
• Optimized storage of the graph in memory
• Versioned copies of the DiscoRank
• So technically we could compute the DiscoRank
realtime
A VERY LARGE GRAPH
•
• Re-mapping entity ids
• Memory optimization so the graph holds in memory:
• All edges details are stored in memory in a byte[]
• buffer the byte[] into an opaque byte block pool
• no object
• sort the buffered byte[] in place
• On disk and when computing the DiscoRank:
• Delta encoded ordered adjacency lists:
• One “from” node, several “to” nodes
• Delta encode the “to” node ids
USING SPARSITY
• We keep versioned copies of:
• the DiscoRank vector of results
• the DiscoRank graph
• We rebuild the entire DiscoRank graph from scratch
once a week
• In between:
• we create additional graph segments with new
entities and events
• and use as prior for the DiscoRank computation
the results of the previous DiscoRank run
• Side effect:
• Also allows for experimentation
VERSIONED DISCORANK
• MySQL batch jobs
• DiscoRank results stored in
HDFS
• At the end of every
DiscoRank run we re-load it
in ElasticSearch:
• For each item we combine
its Lucene score with its
DiscoRank
INTEGRATION IN
OUR INFRASTRUCTURE
Amélie Anglade
Sound/Music Information Retrieval Engineer
about.me/utstikkar
@utstikkar
We’re hiring!
www.soundcloud.com

Más contenido relacionado

La actualidad más candente

Tutorial on People Recommendations in Social Networks - ACM RecSys 2013,Hong...
Tutorial on People Recommendations in Social Networks -  ACM RecSys 2013,Hong...Tutorial on People Recommendations in Social Networks -  ACM RecSys 2013,Hong...
Tutorial on People Recommendations in Social Networks - ACM RecSys 2013,Hong...
Anmol Bhasin
 
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
DataStax
 

La actualidad más candente (20)

Tutorial on People Recommendations in Social Networks - ACM RecSys 2013,Hong...
Tutorial on People Recommendations in Social Networks -  ACM RecSys 2013,Hong...Tutorial on People Recommendations in Social Networks -  ACM RecSys 2013,Hong...
Tutorial on People Recommendations in Social Networks - ACM RecSys 2013,Hong...
 
Netflix Recommendations - Beyond the 5 Stars
Netflix Recommendations - Beyond the 5 StarsNetflix Recommendations - Beyond the 5 Stars
Netflix Recommendations - Beyond the 5 Stars
 
Multi-Task Learning for NLP
Multi-Task Learning for NLPMulti-Task Learning for NLP
Multi-Task Learning for NLP
 
Contextualization at Netflix
Contextualization at NetflixContextualization at Netflix
Contextualization at Netflix
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms Reliable
 
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
 
Recent Trends in Personalization: A Netflix Perspective
Recent Trends in Personalization: A Netflix PerspectiveRecent Trends in Personalization: A Netflix Perspective
Recent Trends in Personalization: A Netflix Perspective
 
Reward Innovation for long-term member satisfaction
Reward Innovation for long-term member satisfactionReward Innovation for long-term member satisfaction
Reward Innovation for long-term member satisfaction
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
 
Anomaly Detection Using Isolation Forests
Anomaly Detection Using Isolation ForestsAnomaly Detection Using Isolation Forests
Anomaly Detection Using Isolation Forests
 
Model storming
Model stormingModel storming
Model storming
 
Personalized Playlists at Spotify
Personalized Playlists at SpotifyPersonalized Playlists at Spotify
Personalized Playlists at Spotify
 
Introduction to Statista
Introduction to Statista Introduction to Statista
Introduction to Statista
 
What’s next for deep learning for Search?
What’s next for deep learning for Search?What’s next for deep learning for Search?
What’s next for deep learning for Search?
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
Sequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsSequential Decision Making in Recommendations
Sequential Decision Making in Recommendations
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix Scale
 

Similar a DiscoRank: optimizing discoverability on SoundCloud

Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Ontico
 
Implementing a VO archive for datacubes of galaxies
Implementing a VO archive for datacubes of galaxiesImplementing a VO archive for datacubes of galaxies
Implementing a VO archive for datacubes of galaxies
Jose Enrique Ruiz
 

Similar a DiscoRank: optimizing discoverability on SoundCloud (20)

Cassandra and Spark
Cassandra and SparkCassandra and Spark
Cassandra and Spark
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
 
«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large Graphs
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Balboa Park Commons: Collaborative Digitization for a Public Resource
Balboa Park Commons: Collaborative Digitization for a Public ResourceBalboa Park Commons: Collaborative Digitization for a Public Resource
Balboa Park Commons: Collaborative Digitization for a Public Resource
 
JavaScript History
JavaScript HistoryJavaScript History
JavaScript History
 
Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3
Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3
Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3
 
TinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBsTinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBs
 
WebServices_Grid.ppt
WebServices_Grid.pptWebServices_Grid.ppt
WebServices_Grid.ppt
 
LiveCoding Package for Pharo
LiveCoding Package for PharoLiveCoding Package for Pharo
LiveCoding Package for Pharo
 
Implementing a VO archive for datacubes of galaxies
Implementing a VO archive for datacubes of galaxiesImplementing a VO archive for datacubes of galaxies
Implementing a VO archive for datacubes of galaxies
 
Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data Product
 
Maa
MaaMaa
Maa
 
RDA for Music: Scores
RDA for Music: ScoresRDA for Music: Scores
RDA for Music: Scores
 
Playlist Recommendations @ Spotify
Playlist Recommendations @ SpotifyPlaylist Recommendations @ Spotify
Playlist Recommendations @ Spotify
 
Azure storage deep dive
Azure storage deep diveAzure storage deep dive
Azure storage deep dive
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 

DiscoRank: optimizing discoverability on SoundCloud

  • 1. DiscoRank: Optimizing Discoverability on SoundCloud Amélie Anglade
  • 2. • Developer at SoundCloud • SoundCloud is the world’s largest social sound platform • Academic background in Music Information Retrieval (MIR) • Design, prototype and implement Machine Learning algorithms for music discovery
  • 4.
  • 5.
  • 6.
  • 8. • The web is a graph: • nodes = web pages • edges = hyperlinks • The (Page)rank of a node depends on the link structure of the graph WEB AND PAGERANK
  • 12. Nodes visited more often: • Nodes with many links • Coming from frequently visited nodes RANDOM SURFER A B C D E
  • 13. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 14. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 15. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 16. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 17. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 18. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 22. If N nodes in graph, probability to teleport to any other node (including self) = 1/N TELEPORT A B C D E 1/N 1/N 1/N 1/N 1/N
  • 23. TELEPORT A B C D E 1/N 1/N 1/N 1/N α ? 1-α 1/N At regular node: invoke teleport operation with probability α and standard random walk with probability (1 - α)
  • 24. Probability distribution of the surfer at any time is a vector. COMPUTING THE PAGERANK That vector converges to a steady state: the PageRank vector.
  • 27.
  • 29. • Search across People, Sounds, Sets, Groups • One unique rank vector that contains all entities • Weight the links based on the type of event: • User favorites Track • Track is featured in Playlist ... • New big (but sparse) adjacency matrix: UNIVERSAL SEARCH
  • 30.
  • 31. • How do we identify content that is trending? • The more recent a listen, favorite, etc. (event) the higher the weight • Multiply each event (=edge) by a time decay: • New adjacency matrix: BACK TO EXPLORE
  • 33. • Millions of entities(=nodes) and events(=edges) • First DiscoRank: several hours of computation • Trimmed down to a few minutes using: • Sparse matrix • Optimized storage of the graph in memory • Versioned copies of the DiscoRank • So technically we could compute the DiscoRank realtime A VERY LARGE GRAPH
  • 34. • • Re-mapping entity ids • Memory optimization so the graph holds in memory: • All edges details are stored in memory in a byte[] • buffer the byte[] into an opaque byte block pool • no object • sort the buffered byte[] in place • On disk and when computing the DiscoRank: • Delta encoded ordered adjacency lists: • One “from” node, several “to” nodes • Delta encode the “to” node ids USING SPARSITY
  • 35. • We keep versioned copies of: • the DiscoRank vector of results • the DiscoRank graph • We rebuild the entire DiscoRank graph from scratch once a week • In between: • we create additional graph segments with new entities and events • and use as prior for the DiscoRank computation the results of the previous DiscoRank run • Side effect: • Also allows for experimentation VERSIONED DISCORANK
  • 36. • MySQL batch jobs • DiscoRank results stored in HDFS • At the end of every DiscoRank run we re-load it in ElasticSearch: • For each item we combine its Lucene score with its DiscoRank INTEGRATION IN OUR INFRASTRUCTURE
  • 37. Amélie Anglade Sound/Music Information Retrieval Engineer about.me/utstikkar @utstikkar We’re hiring! www.soundcloud.com