DiscoRank: optimizing discoverability on SoundCloud

These are the slides of the presentation I gave at the Realtime Conf EU on 23rd April 2013. The full abstract of the talk can be found here: http://lanyrd.com/2013/realtime-conf-europe/scdtyf/

DiscoRank: Optimizing Discoverability on SoundCloud
Amélie Anglade

• Developer at SoundCloud
• SoundCloud is the world’s largest social sound platform
• Academic background in Music Information Retrieval (MIR)
• Design, prototype and implement Machine Learning algorithms for music discovery

WEB AND PAGERANK
• The web is a graph:
  • nodes = web pages
  • edges = hyperlinks
• The (Page)rank of a node depends on the link structure of the graph

RANDOM SURFER
Nodes visited more often:
• Nodes with many incoming links
• Nodes whose incoming links come from frequently visited nodes
[Diagram: example graph with nodes A, B, C, D, E]

COMPUTING THE PAGERANK
Adjacency matrix A
Transition probability matrix M
Probability distribution of surfer’s position
[Diagram: example graph with nodes A, B, C, D, E]
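As a rough sketch of how an adjacency matrix becomes a transition probability matrix, the Python below row-normalizes a small adjacency matrix and advances the surfer's probability distribution by one step. The five-node edge list is invented for illustration (the diagram's actual edges are not captured in the text), and the names `transition_matrix` and `step` are assumptions, not names from the talk.

```python
# Hypothetical 5-node graph; these edges are made up for the example.
NODES = ["A", "B", "C", "D", "E"]
ADJ = [
    # A  B  C  D  E
    [0, 1, 1, 1, 0],  # A links to B, C, D
    [0, 0, 1, 0, 0],  # B links to C
    [1, 0, 0, 0, 1],  # C links to A, E
    [0, 0, 1, 0, 1],  # D links to C, E
    [1, 0, 0, 0, 0],  # E links to A
]


def transition_matrix(adj):
    """Row-normalize an adjacency matrix so that each row sums to 1."""
    matrix = []
    for row in adj:
        out_degree = sum(row)
        if out_degree == 0:
            # Node without outgoing links: jump uniformly (assumed convention).
            matrix.append([1.0 / len(row)] * len(row))
        else:
            matrix.append([value / out_degree for value in row])
    return matrix


def step(dist, m):
    """One step of the walk: new[j] = sum over i of dist[i] * m[i][j]."""
    n = len(m)
    return [sum(dist[i] * m[i][j] for i in range(n)) for j in range(n)]


M = transition_matrix(ADJ)
start = [1.0, 0.0, 0.0, 0.0, 0.0]  # the surfer starts at node A
after_one = step(start, M)         # probability 1/3 each for B, C and D
```

Starting at A, which has three outgoing links in this made-up graph, the surfer lands on B, C or D with probability 1/3 each.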

TELEPORT
If the graph has N nodes, the probability of teleporting to any given node (including the current one) is 1/N.
[Diagram: nodes A, B, C, D, E, each reached by teleport with probability 1/N]

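TELEPORT
At a regular node: invoke the teleport operation with probability α and the standard random walk with probability (1 - α).
[Diagram: nodes A, B, C, D, E; teleport branch with probability α (1/N per target node), random walk branch with probability 1 - α]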
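The teleport rule can be sketched as a blend of two matrices: with probability α every node is reachable with probability 1/N, and with probability 1 - α the surfer follows the transition matrix. The 3-node matrix, the value of `ALPHA`, and the function name `add_teleport` below are assumptions made for illustration; the slides do not give a value for α.

```python
def add_teleport(m, alpha):
    """Blend a row-stochastic matrix with uniform teleportation.

    With probability alpha the surfer jumps to one of the n nodes uniformly
    (1/n each); with probability 1 - alpha it follows a link according to m.
    """
    n = len(m)
    return [
        [alpha / n + (1.0 - alpha) * m[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 3-node transition matrix: A -> B or C, B -> C, C -> A.
M = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
ALPHA = 0.15  # assumed value, chosen only for the example

G = add_teleport(M, ALPHA)
```

After blending, every entry of `G` is strictly positive, so the surfer can reach any node from any other node in a single step.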

COMPUTING THE PAGERANK
The probability distribution of the surfer’s position at any time is a vector.
That vector converges to a steady state: the PageRank vector.
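One common way to reach that steady state is power iteration: multiply the distribution vector by the transition matrix repeatedly until it stops changing. The sketch below is a minimal version under stated assumptions: the 2-node matrix `P` is a toy example chosen because its steady state, (5/6, 1/6), can be checked by hand, and the tolerance and iteration cap are arbitrary.

```python
def power_iteration(p, tol=1e-12, max_iter=1000):
    """Iterate x <- x P from a uniform start until x stops changing."""
    n = len(p)
    x = [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        nxt = [sum(x[i] * p[i][j] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if change < tol:
            break
    return x, iteration


# Hypothetical 2-node chain; solving pi = pi P by hand gives pi = (5/6, 1/6).
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
steady, iterations = power_iteration(P)
```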

DISCORANK
[Diagram: graph with nodes A to E; node types: User, Track, Playlist; edge types: favorite, follow, featured in]

UNIVERSAL SEARCH
• Search across People, Sounds, Sets, Groups
• A single rank vector that contains all entities
• Weight the links based on the type of event:
  • User favorites Track
  • Track is featured in Playlist
  • ...
• New big (but sparse) adjacency matrix [matrix not captured in the text]
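A minimal sketch of the idea of one weighted, sparse graph over all entity types: each event becomes an edge whose weight depends on the event type, and only non-zero entries are stored. The weights in `EVENT_WEIGHTS`, the id format such as `"user:1"`, and the function name are all hypothetical; the talk does not give the real values.

```python
from collections import defaultdict

# Hypothetical weights per event type, invented for the example.
EVENT_WEIGHTS = {"follow": 1.0, "favorite": 2.0, "featured_in": 1.5}


def build_sparse_adjacency(events):
    """Turn (source, event_type, target) triples into {source: {target: weight}}.

    Only edges that actually occur are stored, which keeps the structure
    sparse. Repeated events between the same pair accumulate weight.
    """
    adjacency = defaultdict(lambda: defaultdict(float))
    for source, event_type, target in events:
        adjacency[source][target] += EVENT_WEIGHTS[event_type]
    return {source: dict(targets) for source, targets in adjacency.items()}


events = [
    ("user:1", "follow", "user:2"),
    ("user:1", "favorite", "track:10"),
    ("user:2", "favorite", "track:10"),
    ("track:10", "featured_in", "playlist:7"),
]
adjacency = build_sparse_adjacency(events)
```

Because users, tracks and playlists share one id space here, a single rank vector computed over this graph covers every entity type.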

BACK TO EXPLORE
• How do we identify content that is trending?
• The more recent an event (a listen, favorite, etc.), the higher its weight
• Multiply each event (= edge) by a time decay [formula not captured in the text]
• New adjacency matrix [matrix not captured in the text]
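The decay formula itself is not captured in the text, so the sketch below assumes an exponential decay with a half-life, which is one common choice and not necessarily the one used in DiscoRank. The function name, the `half_life_days` parameter and its default of 7 days are hypothetical.

```python
SECONDS_PER_DAY = 86400.0


def decayed_weight(base_weight, event_time, now, half_life_days=7.0):
    """Scale an edge weight down by the age of the event.

    Assumed form: the weight halves every `half_life_days` days.
    Timestamps are in seconds; events in the future are treated as age 0.
    """
    age_days = max(0.0, (now - event_time) / SECONDS_PER_DAY)
    return base_weight * 0.5 ** (age_days / half_life_days)


now = 30 * 86400                                   # arbitrary reference time
fresh = decayed_weight(2.0, now, now)              # event just happened
week_old = decayed_weight(2.0, now - 7 * 86400, now)
month_old = decayed_weight(2.0, 0, now)
```

With these assumed settings a favorite from today keeps its full weight, one from a week ago counts half, and older events fade further, so recently active content ranks higher.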

A VERY LARGE GRAPH
• Millions of entities (= nodes) and events (= edges)
• First DiscoRank: several hours of computation
• Trimmed down to a few minutes using:
  • Sparse matrix
  • Optimized storage of the graph in memory
  • Versioned copies of the DiscoRank
• So technically we could compute the DiscoRank in real time

USING SPARSITY
• Re-mapping entity ids
• Memory optimization so the graph fits in memory:
  • All edge details are stored in memory in a byte[]
  • Buffer the byte[] into an opaque byte block pool
  • No objects
  • Sort the buffered byte[] in place
• On disk and when computing the DiscoRank:
  • Delta-encoded ordered adjacency lists:
    • One “from” node, several “to” nodes
    • Delta encode the “to” node ids
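Delta encoding an ordered adjacency list can be sketched as follows: the sorted "to" ids of one "from" node are replaced by the gaps between consecutive ids, which are small numbers that pack into few bytes. The variable-length byte packing in `varint` is an assumed companion technique, not something the slides mention, and the node ids are made up.

```python
def delta_encode(sorted_ids):
    """Replace each id by its gap to the previous one (first id kept as is)."""
    deltas, previous = [], 0
    for node_id in sorted_ids:
        deltas.append(node_id - previous)
        previous = node_id
    return deltas


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original ids."""
    ids, current = [], 0
    for delta in deltas:
        current += delta
        ids.append(current)
    return ids


def varint(value):
    """Variable-length encoding, 7 payload bits per byte (assumed scheme)."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


to_nodes = sorted([1500, 1003, 1000, 1010])  # hypothetical "to" node ids
deltas = delta_encode(to_nodes)              # [1000, 3, 7, 490]
packed = b"".join(varint(d) for d in deltas)
```

In this example the four ids pack into 6 bytes, compared with 16 bytes for four fixed 32-bit integers.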

VERSIONED DISCORANK
• We keep versioned copies of:
  • the DiscoRank vector of results
  • the DiscoRank graph
• We rebuild the entire DiscoRank graph from scratch once a week
• In between:
  • we create additional graph segments with new entities and events
  • and use the results of the previous DiscoRank run as the prior for the DiscoRank computation
• Side effect:
  • Also allows for experimentation
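Using the previous run as a prior can be sketched as building the starting vector for the next computation from the last results, so that the iteration starts close to the steady state. How new entities are initialized is not stated in the talk; giving them the uniform share 1/N and then renormalizing is an assumption, as are the function name and the ids.

```python
def prior_from_previous(previous_rank, current_nodes):
    """Build a starting distribution for the next run from the last results.

    Nodes seen before keep their previous score; nodes that are new since
    the last run get the uniform share 1/n (assumed). The vector is then
    renormalized so that it sums to 1 again.
    """
    n = len(current_nodes)
    raw = {node: previous_rank.get(node, 1.0 / n) for node in current_nodes}
    total = sum(raw.values())
    return {node: value / total for node, value in raw.items()}


previous_rank = {"track:1": 0.5, "user:1": 0.5}                # last run
current_nodes = ["track:1", "user:1", "track:2", "user:2"]     # two new nodes
prior = prior_from_previous(previous_rank, current_nodes)
```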

INTEGRATION IN OUR INFRASTRUCTURE
• MySQL batch jobs
• DiscoRank results stored in HDFS
• At the end of every DiscoRank run we reload the results into ElasticSearch:
  • For each item we combine its Lucene score with its DiscoRank
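The slides do not say how the Lucene score and the DiscoRank are combined. Purely as an illustration, the sketch below multiplies the text relevance score by a factor that grows with the DiscoRank and re-sorts the hits; the formula, the `boost` parameter, the function names and the sample scores are all hypothetical, and no ElasticSearch API is used.

```python
def combined_score(lucene_score, discorank, boost=1.0):
    """Hypothetical blend: scale text relevance by a popularity factor."""
    return lucene_score * (1.0 + boost * discorank)


def rerank(hits, discorank_by_id, boost=1.0):
    """Re-sort search hits by the combined score, best first.

    hits: list of (item_id, lucene_score) pairs.
    Items without a DiscoRank are treated as having rank 0.
    """
    scored = [
        (combined_score(score, discorank_by_id.get(item_id, 0.0), boost), item_id)
        for item_id, score in hits
    ]
    return [item_id for _, item_id in sorted(scored, reverse=True)]


hits = [("track:1", 2.0), ("track:2", 1.8)]             # made-up text scores
discorank_by_id = {"track:1": 0.01, "track:2": 0.30}    # made-up ranks
ranking = rerank(hits, discorank_by_id)
```

In this made-up example the second track overtakes the first: its text match is slightly weaker, but its higher rank lifts its combined score.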

Amélie Anglade
Sound/Music Information Retrieval Engineer
about.me/utstikkar
@utstikkar
We’re hiring!
www.soundcloud.com
