DiscoRank: Optimizing Discoverability on SoundCloud
Amélie Anglade
• Developer at SoundCloud
• SoundCloud is the world’s largest social sound platform
• Academic background in Music Information Retrieval (MIR)
• Design, prototype and implement Machine Learning algorithms for music discovery
DISCOVERABILITY ?
PAGERANK
WEB AND PAGERANK
• The web is a graph:
• nodes = web pages
• edges = hyperlinks
• The (Page)rank of a node depends on the link structure of the graph
RANDOM SURFER
[Figure: graph with nodes A, B, C and D; the surfer follows each of three outgoing links with probability 1/3]
Nodes visited more often:
• Nodes with many incoming links
• Nodes whose incoming links come from frequently visited nodes
[Figure: graph with nodes A, B, C, D and E]
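The random surfer can be simulated directly. The sketch below is only an illustration: the four-node graph in `LINKS` is a hypothetical example (loosely modelled on a node A with three outgoing links), and `random_surf` is a name introduced here, not something from the talk.

```python
import random
from collections import Counter

# Hypothetical toy graph: each node lists the nodes it links to.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def random_surf(links, start, steps, seed=0):
    """Follow a uniformly random outgoing link at every step and count visits."""
    rng = random.Random(seed)
    visits = Counter()
    node = start
    for _ in range(steps):
        visits[node] += 1
        node = rng.choice(links[node])
    return visits

visits = random_surf(LINKS, "A", 10_000)
total = sum(visits.values())
for node in sorted(visits):
    # Fraction of steps spent on each node.
    print(node, round(visits[node] / total, 3))
```

In this toy graph C receives links from three nodes, one of which (A) is itself visited often, so it ends up among the most visited nodes.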
COMPUTING THE PAGERANK
[Figure: graph with nodes A, B, C, D and E]
• Adjacency matrix A
• Transition probability matrix M
• Probability distribution of surfer’s position
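The three objects named on this slide can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical five-node graph and the row convention (row i of A lists the links leaving node i); the helper names `transition_matrix` and `step` are introduced here for the example.

```python
def transition_matrix(adjacency):
    """Row-normalise a 0/1 adjacency matrix: M[i][j] = A[i][j] / out-degree(i)."""
    matrix = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            raise ValueError("dangling node: no outgoing links")
        matrix.append([value / out_degree for value in row])
    return matrix

def step(distribution, matrix):
    """One step of the walk: x'[j] = sum over i of x[i] * M[i][j]."""
    n = len(matrix)
    return [sum(distribution[i] * matrix[i][j] for i in range(n)) for j in range(n)]

# Hypothetical 5-node graph (A..E); A[i][j] = 1 if node i links to node j.
A = [
    [0, 1, 1, 1, 0],  # A -> B, C, D
    [0, 0, 1, 0, 0],  # B -> C
    [1, 0, 0, 0, 1],  # C -> A, E
    [0, 0, 1, 0, 0],  # D -> C
    [1, 0, 0, 0, 0],  # E -> A
]
M = transition_matrix(A)
x = [1.0, 0.0, 0.0, 0.0, 0.0]  # the surfer starts at A
for _ in range(3):
    x = step(x, M)
    print([round(p, 3) for p in x])
```

Each call to `step` moves the probability distribution of the surfer's position forward by one click.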
TELEPORT
[Figure: graph with nodes A, B, C, D and E]
If the graph has N nodes, the probability of teleporting to any given node (including the current one) is 1/N.
[Figure: teleport edges to nodes A, B, C, D and E, each labelled 1/N]
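[Figure: at a node the surfer either teleports (probability α, then 1/N per target node) or follows a link (probability 1 - α)]
At a regular node, the surfer invokes the teleport operation with probability α and the standard random walk with probability (1 - α).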
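The teleport rule can be folded into a power iteration. The sketch below is one possible reading, not the talk's implementation: it assumes that a node without outgoing links (a dead end) always teleports, that α = 0.15 (a commonly used value, not one given in the slides), and it uses a hypothetical five-node graph in which E is a dead end.

```python
def pagerank(matrix, alpha=0.15, tol=1e-10, max_iter=1000):
    """Power iteration with teleport probability alpha.

    matrix[i][j] is the probability of walking from node i to node j;
    a row of zeros marks a dead end, where the surfer always teleports.
    """
    n = len(matrix)
    dangling = [sum(row) == 0 for row in matrix]
    x = [1.0 / n] * n
    for _ in range(max_iter):
        # Mass that teleports this step: all of it at dead ends,
        # a fraction alpha of it at regular nodes.
        teleport_mass = sum(x[i] if dangling[i] else alpha * x[i] for i in range(n))
        new_x = [
            teleport_mass / n
            + (1 - alpha) * sum(x[i] * matrix[i][j] for i in range(n))
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical graph: A -> B, C, D; B -> C; C -> A, E; D -> C; E is a dead end.
M = [
    [0, 1 / 3, 1 / 3, 1 / 3, 0],
    [0, 0, 1, 0, 0],
    [0.5, 0, 0, 0, 0.5],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
ranks = pagerank(M)
for name, rank in zip("ABCDE", ranks):
    print(name, round(rank, 3))
```

Because every node can be reached through teleporting, the iteration settles on a single distribution regardless of where the surfer starts.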
COMPUTING THE PAGERANK
The probability distribution of the surfer’s position at any time is a vector.
That vector converges to a steady state: the PageRank vector.
PAGERANK EQUATION
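With the symbols used so far (N nodes, teleport probability α, transition probability matrix M), one standard way to write the update and its steady state is sketched below. The notation is an assumption: the surfer's distribution is written as a row vector x_t, the steady state as π, and every node is assumed to have at least one outgoing link.

```latex
\[
x_{t+1} = (1-\alpha)\, x_t M + \frac{\alpha}{N}\,\mathbf{1}^{\top},
\qquad
\pi = (1-\alpha)\, \pi M + \frac{\alpha}{N}\,\mathbf{1}^{\top},
\qquad
\sum_{i=1}^{N} \pi_i = 1 .
\]
```

The first term is the standard random walk taken with probability (1 - α); the second spreads the teleporting mass α evenly, 1/N per node.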
SOUNDCLOUD
DISCORANK
[Figure: DiscoRank graph with nodes A to E typed as User, Track and Playlist, and edges labelled favorite, follow and featured in]
UNIVERSAL SEARCH
• Search across People, Sounds, Sets, Groups
• One unique rank vector that contains all entities
• Weight the links based on the type of event:
• User favorites Track
• Track is featured in Playlist
...
• New big (but sparse) adjacency matrix:
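A weighted, sparse adjacency structure over mixed entity types could be assembled as in the sketch below. The weights in `EVENT_WEIGHTS`, the entity ids and the `build_adjacency` helper are all hypothetical; the talk does not state the actual values or data layout.

```python
from collections import defaultdict

# Hypothetical weight per event type; the real values are not given in the talk.
EVENT_WEIGHTS = {"favorite": 1.0, "follow": 2.0, "featured_in": 0.5}

def build_adjacency(events):
    """Accumulate weighted edges into a sparse {source: {target: weight}} map."""
    adjacency = defaultdict(lambda: defaultdict(float))
    for source, event_type, target in events:
        adjacency[source][target] += EVENT_WEIGHTS[event_type]
    return adjacency

# Users, tracks and playlists share one graph, so they end up in one rank vector.
events = [
    ("user:1", "favorite", "track:7"),
    ("user:1", "follow", "user:2"),
    ("user:2", "favorite", "track:7"),
    ("track:7", "featured_in", "playlist:3"),
]
adjacency = build_adjacency(events)
for source, targets in adjacency.items():
    print(source, dict(targets))
```

Only pairs that actually share an event are stored, which is what keeps a matrix over millions of entities manageable.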
BACK TO EXPLORE
• How do we identify content that is trending?
• The more recent an event (a listen, favorite, etc.), the higher the weight
• Multiply each event (= edge) by a time decay:
• New adjacency matrix:
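The decay function itself is not preserved in the slide text. As an illustration only, the sketch below assumes an exponential decay with a hypothetical half-life of 7 days; `decayed_weight` and its parameters are names introduced for this example.

```python
def decayed_weight(base_weight, event_age_days, half_life_days=7.0):
    """Assumed exponential decay: the weight halves every half_life_days."""
    return base_weight * 0.5 ** (event_age_days / half_life_days)

# The same favorite counts for less the older it is.
for age in (0, 7, 14, 28):
    print(age, "days old ->", decayed_weight(1.0, age))
```

Applying the decay to every edge before building the adjacency matrix means that recent activity dominates the resulting ranks, which is the trending behaviour described above.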
PERFORMANCE OPTIMIZATION
A VERY LARGE GRAPH
• Millions of entities (= nodes) and events (= edges)
• First DiscoRank: several hours of computation
• Trimmed down to a few minutes using:
• Sparse matrix
• Optimized storage of the graph in memory
• Versioned copies of the DiscoRank
• So technically we could compute the DiscoRank in real time
USING SPARSITY
• Re-mapping entity ids
• Memory optimization so the graph fits in memory:
• All edge details are stored in memory in a byte[]
• Buffer the byte[] into an opaque byte block pool
• No objects
• Sort the buffered byte[] in place
• On disk and when computing the DiscoRank:
• Delta-encoded ordered adjacency lists:
• One “from” node, several “to” nodes
• Delta encode the “to” node ids
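Delta encoding of an ordered adjacency list can be sketched as follows. The node ids are made up, and the functions `delta_encode` and `delta_decode` are illustrative names; the byte-level layout used in production is not described in the slides.

```python
def delta_encode(sorted_ids):
    """Store the first id, then the gap to each following id."""
    deltas = []
    previous = 0
    for node_id in sorted_ids:
        deltas.append(node_id - previous)
        previous = node_id
    return deltas

def delta_decode(deltas):
    """Rebuild the original ids by accumulating the gaps."""
    ids = []
    current = 0
    for delta in deltas:
        current += delta
        ids.append(current)
    return ids

# One "from" node with several "to" nodes, sorted before encoding.
to_nodes = sorted([1000042, 1000003, 1000017, 1000040])
encoded = delta_encode(to_nodes)
print(encoded)  # [1000003, 14, 23, 2]
print(delta_decode(encoded) == to_nodes)  # True
```

After the first entry the gaps are small numbers, so they can be stored in far fewer bytes than the full ids, for instance with a variable-length integer encoding.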
VERSIONED DISCORANK
• We keep versioned copies of:
• the DiscoRank vector of results
• the DiscoRank graph
• We rebuild the entire DiscoRank graph from scratch once a week
• In between:
• we create additional graph segments with new entities and events
• and use the results of the previous DiscoRank run as the prior for the DiscoRank computation
• Side effect:
• Also allows for experimentation
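Using the previous run as a prior amounts to starting the iteration from the old rank vector instead of a uniform one. The sketch below illustrates this under several assumptions: α = 0.15, every node has at least one outgoing link, the set of nodes is unchanged between runs, and the two four-node graphs are hypothetical.

```python
def power_iteration(matrix, start, alpha=0.15, tol=1e-10, max_iter=1000):
    """Iterate from `start` (which must sum to 1) until the L1 change is below tol.

    Returns the rank vector and the number of iterations used.
    """
    n = len(matrix)
    x = list(start)
    for iteration in range(1, max_iter + 1):
        new_x = [
            alpha / n + (1 - alpha) * sum(x[i] * matrix[i][j] for i in range(n))
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x, iteration
        x = new_x
    return x, max_iter

# Hypothetical graph before and after one new event (D now also links to A).
M_OLD = [[0, 0.5, 0.5, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
M_NEW = [[0, 0.5, 0.5, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0.5, 0, 0.5, 0]]

uniform = [0.25] * 4
previous_ranks, _ = power_iteration(M_OLD, uniform)
cold_ranks, cold_iterations = power_iteration(M_NEW, uniform)
warm_ranks, warm_iterations = power_iteration(M_NEW, previous_ranks)
print("cold start:", cold_iterations, "iterations")
print("warm start:", warm_iterations, "iterations")
```

Both starts reach the same ranks; the warm start simply begins closer to them when the graph has changed only a little since the last run.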
INTEGRATION IN OUR INFRASTRUCTURE
• MySQL batch jobs
• DiscoRank results stored in HDFS
• At the end of every DiscoRank run we reload the results into ElasticSearch:
• For each item we combine its Lucene score with its DiscoRank
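How the two scores are combined is not specified in the slides. Purely as an illustration, the sketch below assumes a multiplicative boost; the `combined_score` function, the `rank_weight` parameter and the sample hits are all hypothetical and do not reflect the ElasticSearch API.

```python
def combined_score(text_score, discorank, rank_weight=2.0):
    """Hypothetical blend: scale the text relevance score by (1 + weight * rank)."""
    return text_score * (1.0 + rank_weight * discorank)

# (item id, text relevance score, DiscoRank); all values are made up.
hits = [
    ("track:7", 2.0, 0.30),
    ("track:9", 2.5, 0.05),
    ("user:2", 1.0, 0.40),
]
ranked = sorted(hits, key=lambda hit: combined_score(hit[1], hit[2]), reverse=True)
for item_id, text_score, discorank in ranked:
    print(item_id, round(combined_score(text_score, discorank), 3))
```

In this made-up example track:7 overtakes track:9 despite a lower text score, because its DiscoRank is higher.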
Amélie Anglade
Sound/Music Information Retrieval Engineer
about.me/utstikkar
@utstikkar
We’re hiring!
www.soundcloud.com
