These are the slides of the presentation I gave at the Realtime Conf EU on 23rd April 2013.
The full abstract of the talk can be found here: http://lanyrd.com/2013/realtime-conf-europe/scdtyf/
2. • Developer at SoundCloud
• SoundCloud is the world’s largest social sound platform
• Academic background in Music Information Retrieval (MIR)
• Design, prototype and implement Machine Learning algorithms for music discovery
8. • The web is a graph:
• nodes = web pages
• edges = hyperlinks
• The (Page)rank of a node depends on the link
structure of the graph
WEB AND PAGERANK
24. The probability distribution of the surfer's position at any time is a vector.
That vector converges to a steady state: the PageRank vector.
COMPUTING THE PAGERANK
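The convergence described on this slide can be sketched with a plain power iteration. This is a minimal illustration, not the code used in the talk: the function name pagerank, the damping factor of 0.85 (the conventional textbook value), the tolerance, and the handling of nodes without out-links are all assumptions.

```python
def pagerank(edges, num_nodes, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a directed graph given as (src, dst) pairs.

    damping, tol and max_iter are illustrative defaults (assumptions).
    """
    out_links = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        out_links[src].append(dst)

    # The surfer starts at a node chosen uniformly at random.
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(max_iter):
        # Probability mass on nodes without out-links is spread over all nodes.
        dangling = sum(rank[n] for n in range(num_nodes) if not out_links[n])
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        for src, targets in enumerate(out_links):
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        # L1 distance between successive distributions.
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    # Pages 0 -> 1 -> 2 -> 0 form a cycle; page 3 links to page 0.
    print(pagerank([(0, 1), (1, 2), (2, 0), (3, 0)], num_nodes=4))
```

Each pass redistributes the surfer's probability mass along the out-links, and the loop stops once two successive vectors are closer than tol.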
29. • Search across People, Sounds, Sets, Groups
• One unique rank vector that contains all entities
• Weight the links based on the type of event:
• User favorites Track
• Track is featured in Playlist
...
• New big (but sparse)
adjacency matrix:
UNIVERSAL SEARCH
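A weighted adjacency structure over mixed entity types can be sketched as a dictionary of dictionaries, which only stores the non-zero entries. The weights in EVENT_WEIGHTS, the event type names, and the edge directions below are hypothetical; the slides do not give the real values.

```python
from collections import defaultdict

# Hypothetical weight per event type (assumption, not from the slides).
EVENT_WEIGHTS = {"favorite": 1.0, "featured_in_playlist": 0.5}


def build_adjacency(events):
    """Build a sparse weighted adjacency map.

    events: iterable of (from_entity, to_entity, event_type), where an
    entity is any hashable key such as ("user", 1) or ("track", 7).
    """
    adjacency = defaultdict(lambda: defaultdict(float))
    for src, dst, event_type in events:
        # Several events on the same link add up.
        adjacency[src][dst] += EVENT_WEIGHTS[event_type]
    return adjacency


if __name__ == "__main__":
    events = [
        (("user", 1), ("track", 7), "favorite"),
        (("user", 2), ("track", 7), "favorite"),
        (("playlist", 3), ("track", 7), "featured_in_playlist"),
    ]
    for src, targets in build_adjacency(events).items():
        print(src, dict(targets))
```

Keying nodes by (type, id) lets People, Sounds, Sets and Groups share one graph, and therefore one rank vector.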
31. • How do we identify content that is trending?
• The more recent an event (listen, favorite, etc.), the higher its weight
• Multiply each event (=edge) by a time decay:
• New adjacency matrix:
BACK TO EXPLORE
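The decay formula itself is not reproduced in this transcript. As a minimal sketch, assuming an exponential decay with a hypothetical half-life of one week, the weight of an edge could be computed like this (decayed_weight and half_life_seconds are illustrative names):

```python
ONE_WEEK_SECONDS = 7 * 24 * 3600


def decayed_weight(base_weight, event_time, now, half_life_seconds=ONE_WEEK_SECONDS):
    """Scale an event weight by an exponential time decay.

    The exponential form and the one-week half-life are assumptions.
    Times are Unix timestamps in seconds.
    """
    age = max(0, now - event_time)  # events in the future are not boosted
    return base_weight * 0.5 ** (age / half_life_seconds)


if __name__ == "__main__":
    now = 1_366_700_000
    for days in (0, 7, 14):
        print(days, decayed_weight(1.0, now - days * 24 * 3600, now))
```

With this form an event loses half of its weight for every half-life that has passed, so recent listens and favorites dominate the adjacency matrix.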
33. • Millions of entities (=nodes) and events (=edges)
• First DiscoRank: several hours of computation
• Trimmed down to a few minutes using:
• Sparse matrix
• Optimized storage of the graph in memory
• Versioned copies of the DiscoRank
• So technically we could compute the DiscoRank
in real time
A VERY LARGE GRAPH
34.
• Re-mapping entity ids
• Memory optimization so the graph holds in memory:
• All edge details are stored in memory in a byte[]
• Buffer the byte[] into an opaque byte block pool
• No per-edge objects
• Sort the buffered byte[] in place
• On disk and when computing the DiscoRank:
• Delta encoded ordered adjacency lists:
• One “from” node, several “to” nodes
• Delta encode the “to” node ids
USING SPARSITY
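Delta encoding of an ordered adjacency list can be sketched as follows. The variable-length byte encoding (to_varint_bytes) is an assumption added to show why small gaps save space; the slides do not say how the deltas are serialized.

```python
def delta_encode(sorted_ids):
    """Replace each id by its gap from the previous id (first gap is from 0)."""
    deltas, previous = [], 0
    for node_id in sorted_ids:
        deltas.append(node_id - previous)
        previous = node_id
    return deltas


def delta_decode(deltas):
    """Rebuild the original sorted ids by summing the gaps."""
    ids, current = [], 0
    for gap in deltas:
        current += gap
        ids.append(current)
    return ids


def to_varint_bytes(values):
    """Encode non-negative ints with 7 payload bits per byte (assumed format)."""
    out = bytearray()
    for value in values:
        while value >= 0x80:
            out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
            value >>= 7
        out.append(value)
    return bytes(out)


if __name__ == "__main__":
    to_nodes = [1000, 1003, 1010, 1500]  # "to" nodes of one "from" node
    gaps = delta_encode(to_nodes)
    print(gaps, len(to_varint_bytes(to_nodes)), len(to_varint_bytes(gaps)))
```

Because the "to" ids are sorted, the gaps are small numbers that fit in fewer bytes than the ids themselves.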
35. • We keep versioned copies of:
• the DiscoRank vector of results
• the DiscoRank graph
• We rebuild the entire DiscoRank graph from scratch
once a week
• In between:
• we create additional graph segments with new
entities and events
• and use the results of the previous DiscoRank run
as the prior for the DiscoRank computation
• Side effect:
• Also allows for experimentation
VERSIONED DISCORANK
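Using the previous results as a prior amounts to warm-starting the iteration. The sketch below is an assumption-laden illustration on an unweighted graph: iterate_rank, extend_prior, the damping factor, the tolerance, and the way new entities receive an initial share are all hypothetical.

```python
def iterate_rank(out_links, prior, damping=0.85, tol=1e-10, max_iter=500):
    """Run power iteration from `prior`; return (rank, iterations used)."""
    n = len(out_links)
    rank = list(prior)
    for iteration in range(1, max_iter + 1):
        dangling = sum(rank[i] for i in range(n) if not out_links[i])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = [base] * n
        for src, targets in enumerate(out_links):
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            return rank, iteration
    return rank, max_iter


def extend_prior(previous_rank, num_nodes):
    """Pad the previous rank vector for newly added nodes and renormalize."""
    missing = num_nodes - len(previous_rank)
    padded = list(previous_rank) + [1.0 / num_nodes] * missing
    total = sum(padded)
    return [value / total for value in padded]


if __name__ == "__main__":
    old_graph = [[1], [2], [0], [0]]
    old_rank, cold_iters = iterate_rank(old_graph, [0.25] * 4)
    # A new segment adds node 4, which links to node 1.
    new_graph = old_graph + [[1]]
    _, warm_iters = iterate_rank(new_graph, extend_prior(old_rank, 5))
    print(cold_iters, warm_iters)
```

Starting from a vector that is already close to the steady state leaves less distance to cover, which is why reusing the previous run can shorten the computation between weekly rebuilds.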
36. • MySQL batch jobs
• DiscoRank results stored in HDFS
• At the end of every DiscoRank run we reload the results into ElasticSearch:
• For each item we combine its Lucene score with its DiscoRank
INTEGRATION IN
OUR INFRASTRUCTURE
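The slides do not say how the two scores are combined. Purely as an illustration, assuming a multiplicative boost with a hypothetical boost parameter, a re-ranking step could look like this (combined_score and rerank are made-up names, not ElasticSearch or Lucene APIs):

```python
def combined_score(lucene_score, discorank, boost=1.0):
    """Hypothetical combination: text relevance boosted by the DiscoRank."""
    return lucene_score * (1.0 + boost * discorank)


def rerank(hits, discorank_by_id, boost=1.0):
    """Sort search hits by combined score, best first.

    hits: list of (item_id, lucene_score) pairs.
    discorank_by_id: mapping from item id to DiscoRank; missing ids count as 0.
    """
    scored = [
        (combined_score(score, discorank_by_id.get(item_id, 0.0), boost), item_id)
        for item_id, score in hits
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored]


if __name__ == "__main__":
    hits = [("track:a", 1.0), ("track:b", 0.9)]
    print(rerank(hits, {"track:a": 0.0, "track:b": 0.5}))
```

A multiplicative form keeps items with no text match at a score of zero, so the DiscoRank only reorders results that are already relevant to the query.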