DiscoRank: optimizing discoverability on SoundCloud


These are the slides of the presentation I gave at the Realtime Conf EU on 23rd April 2013. The full abstract of the talk can be found here: http://lanyrd.com/2013/realtime-conf-europe/scdtyf/


DiscoRank: Optimizing Discoverability
on SoundCloud
Amélie Anglade

• Developer at SoundCloud
• SoundCloud is the world’s largest social sound platform
• Academic background in Music Information Retrieval (MIR)
• Design, prototype and implement Machine Learning algorithms for music discovery

WEB AND PAGERANK
• The web is a graph:
  • nodes = web pages
  • edges = hyperlinks
• The (Page)rank of a node depends on the link structure of the graph

RANDOM SURFER
Nodes visited more often:
• Nodes with many incoming links
• Nodes whose incoming links come from frequently visited nodes
[Figure: example graph with nodes A, B, C, D, E]

COMPUTING THE PAGERANK
• Adjacency matrix A
• Transition probability matrix M
• Probability distribution of the surfer’s position
[Figure: example graph with nodes A, B, C, D, E]
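
As a minimal sketch of the step from adjacency matrix A to transition probability matrix M, assuming the random surfer picks each outgoing link with equal probability. The five-node graph and its edges are made up for illustration and are not the graph shown on the slide.

```python
# Hypothetical 5-node graph; these edges are illustrative, not the slide's graph.
NODES = ["A", "B", "C", "D", "E"]
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"),
         ("C", "A"), ("D", "E"), ("E", "A")]


def adjacency_matrix(nodes, edges):
    """A[i][j] = 1 if there is a link from node j to node i, else 0."""
    index = {name: k for k, name in enumerate(nodes)}
    a = [[0] * len(nodes) for _ in nodes]
    for src, dst in edges:
        a[index[dst]][index[src]] = 1
    return a


def transition_matrix(a):
    """Normalise each column of A so that it sums to 1.

    M[i][j] is then the probability that a surfer on node j moves to node i.
    Columns with no outgoing link are left as all zeros in this sketch.
    """
    n = len(a)
    out_degree = [sum(a[i][j] for i in range(n)) for j in range(n)]
    return [
        [a[i][j] / out_degree[j] if out_degree[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


A = adjacency_matrix(NODES, EDGES)
M = transition_matrix(A)
for name, row in zip(NODES, M):
    print(name, [round(p, 2) for p in row])
```

Storing the source node in the column keeps the later update a plain matrix-vector product M r; the opposite (row) convention works equally well if the product is transposed.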

TELEPORT
If the graph has N nodes, the probability of teleporting to any given node (including the current one) is 1/N.
[Figure: example graph with nodes A, B, C, D, E; each teleport edge is labelled 1/N]

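TELEPORT
At a regular node: invoke the teleport operation with probability α and the standard random walk with probability (1 - α).
[Figure: example graph with nodes A, B, C, D, E; teleport edges labelled 1/N, teleport taken with probability α, link followed with probability 1-α]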

COMPUTING THE PAGERANK
The probability distribution of the surfer at any time is a vector.
That vector converges to a steady state: the PageRank vector.
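
The convergence to a steady state can be sketched with a plain power iteration. This is an illustration under stated assumptions, not the code from the talk: the three-node graph, the tolerance and the teleport probability `alpha=0.15` are all hypothetical choices.

```python
def pagerank(m, alpha=0.15, tol=1e-10, max_iter=1000):
    """Iterate r <- alpha/N + (1 - alpha) * M r until r stops changing.

    m is a column-stochastic transition matrix (every column sums to 1) and
    alpha is the teleport probability. alpha=0.15 is a common choice, not a
    value taken from the talk.
    """
    n = len(m)
    r = [1.0 / n] * n  # start with the surfer equally likely to be anywhere
    for _ in range(max_iter):
        new_r = [
            alpha / n + (1 - alpha) * sum(m[i][j] * r[j] for j in range(n))
            for i in range(n)
        ]
        change = sum(abs(x - y) for x, y in zip(new_r, r))
        r = new_r
        if change < tol:
            break
    return r


# Hypothetical 3-node graph: A -> B, A -> C, B -> C, C -> A (column = source).
M = [
    [0.0, 0.0, 1.0],  # probability of moving into A
    [0.5, 0.0, 0.0],  # probability of moving into B
    [0.5, 1.0, 0.0],  # probability of moving into C
]
rank = pagerank(M)
print([round(x, 3) for x in rank])
```

The returned vector sums to 1, so each entry can be read as the long-run share of time the surfer spends on that node.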

DISCORANK
[Figure: example graph with nodes A, B, C, D, E; node types User, Track and Playlist; edge types "favorite", "follow" and "featured in"]

UNIVERSAL SEARCH
• Search across People, Sounds, Sets, Groups
• One unique rank vector that contains all entities
• Weight the links based on the type of event:
  • User favorites Track
  • Track is featured in Playlist
  • ...
• New big (but sparse) adjacency matrix: [matrix shown on the original slide]
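
One way to picture a single weighted graph that spans several entity types is sketched below. The event-type weights, the id format and the event list are hypothetical; the talk does not give the real values.

```python
from collections import defaultdict

# Hypothetical weights per event type; the talk does not give the real values.
EVENT_WEIGHTS = {"follow": 1.0, "favorite": 2.0, "featured_in": 3.0}

# Hypothetical events as (source entity, event type, target entity).
EVENTS = [
    ("user:1", "follow", "user:2"),
    ("user:1", "favorite", "track:7"),
    ("user:2", "favorite", "track:7"),
    ("track:7", "featured_in", "playlist:3"),
]


def build_weighted_graph(events, weights):
    """Return {source: {target: weight}}, storing only edges that exist.

    Users, tracks and playlists share one node space, so a single rank
    vector can cover every entity type.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for src, kind, dst in events:
        graph[src][dst] += weights[kind]
    return graph


graph = build_weighted_graph(EVENTS, EVENT_WEIGHTS)
for src, targets in graph.items():
    print(src, dict(targets))
```

A dictionary of dictionaries is only the simplest sparse layout; it keeps memory proportional to the number of events rather than to the square of the number of entities.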

BACK TO EXPLORE
• How do we identify content that is trending?
• The more recent an event (listen, favorite, etc.), the higher its weight
• Multiply each event (= edge) by a time decay: [formula shown on the original slide]
• New adjacency matrix: [matrix shown on the original slide]
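
The decay formula itself is not preserved in this transcript. The sketch below assumes an exponential decay with a fixed half-life, which is one common choice; both the functional form and the seven-day half-life are assumptions.

```python
import math

HALF_LIFE_DAYS = 7.0  # hypothetical: an event loses half its weight every 7 days


def time_decay(age_days, half_life_days=HALF_LIFE_DAYS):
    """Exponential decay: 1.0 for a brand-new event, 0.5 after one half-life."""
    return math.exp(-math.log(2) * age_days / half_life_days)


def decayed_weight(base_weight, age_days):
    """Weight of one event (= edge) after applying the time decay."""
    return base_weight * time_decay(age_days)


for age in (0, 7, 14, 28):
    print(age, round(decayed_weight(2.0, age), 3))
```

With this form, a favorite from two weeks ago counts a quarter as much as one made today, so recently active content rises in the rank.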

A VERY LARGE GRAPH
• Millions of entities (= nodes) and events (= edges)
• First DiscoRank: several hours of computation
• Trimmed down to a few minutes using:
  • Sparse matrix
  • Optimized storage of the graph in memory
  • Versioned copies of the DiscoRank
• So technically we could compute the DiscoRank in real time
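
Why sparsity helps can be sketched as one PageRank-style update that visits only the edges that exist, so its cost grows with the number of events rather than with the square of the number of entities. The graph, the weights and the handling of nodes without outgoing edges are assumptions made for illustration.

```python
def propagate(out_edges, rank, alpha=0.15):
    """One PageRank-style update that only touches existing edges.

    out_edges maps a node id to a list of (target id, weight) pairs.
    Nodes without outgoing edges spread their rank uniformly, which is one
    common convention and an assumption here.
    """
    n = len(rank)
    new_rank = [0.0] * n
    dangling = 0.0
    for src in range(n):
        targets = out_edges.get(src, [])
        total = sum(w for _, w in targets)
        if total == 0:
            dangling += rank[src]
            continue
        for dst, w in targets:
            new_rank[dst] += rank[src] * w / total
    return [alpha / n + (1 - alpha) * (x + dangling / n) for x in new_rank]


# Hypothetical graph with 4 nodes and 4 weighted edges; node 3 has no out-edges.
OUT_EDGES = {0: [(1, 1.0), (2, 2.0)], 1: [(2, 1.0)], 2: [(3, 3.0)]}
rank = [0.25] * 4
for _ in range(50):
    rank = propagate(OUT_EDGES, rank)
print([round(x, 3) for x in rank])
```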

USING SPARSITY
• Re-mapping entity ids
• Memory optimization so the graph fits in memory:
  • All edge details are stored in memory in a byte[]
  • Buffer the byte[] into an opaque byte block pool
  • No objects
  • Sort the buffered byte[] in place
• On disk and when computing the DiscoRank:
  • Delta-encoded ordered adjacency lists:
    • One “from” node, several “to” nodes
    • Delta encode the “to” node ids
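
Delta encoding of an ordered adjacency list can be sketched as follows. The node ids are made up, and the variable-length byte packing is an assumption added to show why small gaps save space; the talk only mentions delta encoding and byte[] storage.

```python
def delta_encode(sorted_ids):
    """Replace each id by its gap to the previous id (the first id is kept)."""
    deltas, previous = [], 0
    for node_id in sorted_ids:
        deltas.append(node_id - previous)
        previous = node_id
    return deltas


def delta_decode(deltas):
    """Rebuild the original ids by summing the gaps."""
    ids, current = [], 0
    for gap in deltas:
        current += gap
        ids.append(current)
    return ids


def to_varint_bytes(values):
    """Pack non-negative ints into bytes, 7 bits per byte (small gaps -> 1 byte).

    Variable-length packing is an assumption, not something stated in the talk.
    """
    out = bytearray()
    for value in values:
        while value >= 0x80:
            out.append((value & 0x7F) | 0x80)
            value >>= 7
        out.append(value)
    return bytes(out)


# Hypothetical adjacency list: one "from" node and its sorted "to" node ids.
from_node, to_nodes = 42, [1000000, 1000003, 1000010, 1000250]
deltas = delta_encode(to_nodes)
print(deltas)
# The comparison assumes fixed 4-byte integers for the uncompressed ids.
print(len(to_varint_bytes(deltas)), "bytes instead of", 4 * len(to_nodes))
```

Because the "to" ids are sorted, the gaps are small numbers, and small numbers need fewer bytes than full ids.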

VERSIONED DISCORANK
• We keep versioned copies of:
  • the DiscoRank vector of results
  • the DiscoRank graph
• We rebuild the entire DiscoRank graph from scratch once a week
• In between:
  • we create additional graph segments with new entities and events
  • and use the results of the previous DiscoRank run as the prior for the DiscoRank computation
• Side effect:
  • Also allows for experimentation
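
Using the previous results as a prior amounts to starting the iteration from the last rank vector instead of from a uniform one. The sketch below shows the mechanics on two hypothetical three-node matrices; the matrices, tolerance and `alpha` are assumptions, and on such a tiny graph the saving in iterations may be small or absent.

```python
def power_iteration(m, start, alpha=0.15, tol=1e-10, max_iter=1000):
    """Run r <- alpha/N + (1 - alpha) * M r from `start`; return (r, iterations)."""
    n = len(m)
    r = list(start)
    for iteration in range(1, max_iter + 1):
        new_r = [
            alpha / n + (1 - alpha) * sum(m[i][j] * r[j] for j in range(n))
            for i in range(n)
        ]
        change = sum(abs(x - y) for x, y in zip(new_r, r))
        r = new_r
        if change < tol:
            return r, iteration
    return r, max_iter


# Hypothetical column-stochastic matrices: last week's graph and the same
# graph after a new segment added one edge (B -> A).
OLD_M = [[0.0, 0.0, 1.0], [0.5, 0.0, 0.0], [0.5, 1.0, 0.0]]
NEW_M = [[0.0, 0.5, 1.0], [0.5, 0.0, 0.0], [0.5, 0.5, 0.0]]

uniform = [1 / 3] * 3
previous_rank, _ = power_iteration(OLD_M, uniform)
_, cold_iterations = power_iteration(NEW_M, uniform)
_, warm_iterations = power_iteration(NEW_M, previous_rank)
print("cold start:", cold_iterations, "warm start:", warm_iterations)
```

When the graph changes little between runs, the previous vector is already close to the new steady state, which is why a warm start tends to need fewer iterations on large graphs.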

INTEGRATION IN OUR INFRASTRUCTURE
• MySQL batch jobs
• DiscoRank results stored in HDFS
• At the end of every DiscoRank run we reload it into ElasticSearch:
  • For each item we combine its Lucene score with its DiscoRank
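
The talk does not say how the Lucene score and the DiscoRank are combined. Purely as an illustration, the sketch below boosts a text relevance score by a log-damped rank; the formula, the `rank_weight` parameter, the rescaled rank values and the hits are all hypothetical.

```python
import math


def combined_score(text_score, discorank, rank_weight=1.0):
    """Blend a text relevance score with a popularity rank.

    Hypothetical formula: the text score is boosted by a log-damped rank so
    that very high ranks cannot completely drown out text relevance.
    """
    return text_score * (1.0 + rank_weight * math.log1p(discorank))


# Hypothetical search hits: (item id, text relevance score, rescaled rank value).
HITS = [("track:1", 2.0, 0.1), ("track:2", 1.8, 5.0), ("user:9", 0.5, 50.0)]
ranked = sorted(HITS, key=lambda hit: combined_score(hit[1], hit[2]), reverse=True)
print([item_id for item_id, _, _ in ranked])
```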

Amélie Anglade
Sound/Music Information Retrieval Engineer
about.me/utstikkar
@utstikkar
We’re hiring!
www.soundcloud.com
