SlideShare una empresa de Scribd logo
1 de 51
Descargar para leer sin conexión
Scala Data
Pipelines @
Spotify
Neville Li
@sinisa_lyh
Who am I?
‣ SpotifyNYCsince2011
‣ FormerlyYahoo!Search
‣ Musicrecommendations
‣ Datainfrastructure
‣ Scalasince2013
Spotify in numbers
• Started in 2006, 58 markets
• 75M+ active users, 20M+ paying
• 30M+ songs, 20K new per day
• 1.5 billion playlists
• 1 TB logs per day
• 1200+ node Hadoop cluster
• 10K+ Hadoop jobs per day
Music recommendation @ Spotify
• Discover Weekly
• Radio
• RelatedArtists
• Discover Page
Recommendation systems
A little teaser
PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn,
CombineFn<K,V> reduceFn)
Crunch: CombineFns are used to represent the associative operations…
Grouped[K, +V]::reduce[U >: V](fn: (U, U) U)
Scalding: reduce with fn which must be associative and commutative…
PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
Spark: Merge the values for each key using an associative reduce function…
Monoid!
enables map side reduce
Actually it’s a semigroup
One more teaser
Linear equation inAlternate Least Square (ALS) Matrix factorization
xu = (YTY + YT(Cu − I)Y)−1YTCup(u)
vectors.map { case (id, v) => (id, v * v) }.map(_._2).reduce(_ + _) // YtY
ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY
.map { case (_, (r, op)) =>
(solveKey(r), op * (r.rating * alpha))
}.reduceByKey(_ + _)
ratings.keyBy(fixedKey).join(vectors) // YtCupu
.map { case (_, (r, v)) =>
val (Cui, pui) = (r.rating * alpha + 1, if (Cui > 0.0) 1.0 else 0.0)
(solveKey(r), v * (Cui * pui))
}.reduceByKey(_ + _)
http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
Success story
• Mid 2013: 100+ Python Luigi M/R jobs, few tests
• 10+ new hires since, most fresh grads
• Few with Java experience, none with Scala
• Now: 300+ Scalding jobs, 400+ tests
• More ad-hoc jobs untracked
• Spark also taking off
First 10 months
……
Activity over time
Guess how many jobs
written by yours truly?
Performance vs. Agility
https://nicholassterling.wordpress.com/2012/11/16/scala-performance/
Let’sdiveinto
something
technical
To join or not to join?
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.join(tgp)
.values // (user, genre)
.group
.mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
Hash join
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.hashJoin(tgp.forceToDisk) // tgp replicated to all mappers
.values // (user, genre)
.group
.mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
CoGroup
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.cogroup(tgp) { case (_, users, genres) =>
users.map((_, genres.toSet))
} // (track, (user, genres))
.values // (user, genres)

.group
.reduce(_ ++ _) // map-side reduce!
CoGroup
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.cogroup(tgp) { case (_, users, genres) =>
users.map((_, genres.toSet))
} // (track, (user, genres))
.values // (user, genres)

.group
.sum // SetMonoid[Set[T]] from Algebird
* sum[U >:V](implicit sg: Semigroup[U])
Key-value file as distributed cache
val streams: TypedPipe[(String, String)] = _ // (gid, user)
val tgp: SparkeyManager = _ // tgp replicated to all mappers
streams
.map { case (track, user) =>
(user, tgp.get(track).split(",").toSet)
}
.group
.sum
https://github.com/spotify/sparkey
SparkeyManagerwraps DistributedCacheFile
Joins and CoGroups
• Require shuffle and reduce step
• Some ops force everything to reducers

e.g. mapGroup, mapValueStream
• CoGroup more flexible for complex logic
• Scalding flattens a.join(b).join(c)…

into MultiJoin(a, b, c, …)
Distributed cache
• Fasterwith off-heap binary files
• Building cache = more wiring
• Memory mapping may interfere withYARN
• E.g. 64GB nodes with 48GB for containers (no cgroup)
• 12 × 2GB containers each with 2GB JVM heap + mmap cache
• OOM and swap!
• Keep files small (< 1GB) or fallback to joins…
Analyze your jobs
• Concurrent Driven
• Visualize job execution
• Workflow optimization
• Bottlenecks
• Data skew
Notenough
math?
Recommending tracks
• User listened to Rammstein - Du Hast
• Recommend 10 similartracks
• 40 dimension feature vectors fortracks
• Compute cosine similarity between all pairs
• O(n) lookup per userwhere n ≈ 30m
• Trythat with 50m users * 10 seed tracks each
ANNOY - cheat by approximation
• Approximate Nearest Neighbor OhYeah
• Random projections and binarytree search
• Build index on single machine
• Load in mappers via distribute cache
• O(log n) lookup
https://github.com/spotify/annoy
https://github.com/spotify/annoy-java
ANN Benchmark
https://github.com/erikbern/ann-benchmarks
Filtering candidates
• Users don’t like seeing artist/album/tracks they already know
• But may forget what they listened long ago
• 50m * thousands of items each
• Over 5 years of streaming logs
• Need to update daily
• Need to purge old items per user
Options
• Aggregate all logs daily
• Aggregate last x days daily
• CSVof artist/album/track ids
• Bloom filters
Decayed value with cutoff
• Compute new user-item score daily
• Weighted on context, e.g. radio, search, playlist
• score’ = score + previous * 0.99
• half life = log0.99
0.5 = 69 days
• Cut off at top 2000
• Items that users might remember seeing recently
Bloom filters
• Probabilistic data structure
• Encoding set of items with m bits and k hash functions
• No false negative
• Tunable false positive probability
• Size proportional to capacity & FP probability
• Let’s build one per user-{artists,albums,tracks}
• Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR
Size versus max items & FP prob
• User-item distribution is uneven
• Assuming same setting for all users
• # items << capacity → wasting space
• # items > capacity → high FP rate
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
n=1k
item
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
n=1k n=10k
item
full
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
item
n=1k n=10k n=100k
fullfull
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
n=1k n=10k n=100k n=1m
item
fullfullfull
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep smallest one with capacity > items inserted
• Expensive to build
• Cheap to store and lookup
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep smallest one with capacity > items inserted
• Expensive to build
• Cheap to store and lookup
n=1k
 
80%
n=10k
 
8%
n=100k
 
0.8%
n=1m
 
0.08%
item
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to  N max possible items
• Keep smallest one with capacity  items inserted
• Expensive to build
• Cheap to store and lookup
n=1k
 
100%
n=10k
 
70%
n=100k
 
7%
n=1m
 
0.7%
item
full
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to  N max possible items
• Keep smallest one with capacity  items inserted
• Expensive to build
• Cheap to store and lookup
n=1k
 
100%
n=10k
 
100%
n=100k
 
60%
n=1m

Más contenido relacionado

La actualidad más candente

CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyVidhya Murali
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...Hakka Labs
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupAndy Sloane
 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with SparkChris Johnson
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyChing-Wei Chen
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyChris Johnson
 
The Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainThe Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainRafał Wojdyła
 
Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsChris Johnson
 
Spotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendationsSpotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendationsSophia Ciocca
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at SpotifyNeville Li
 
눈으로 듣는 음악 추천 시스템
눈으로 듣는 음악 추천 시스템눈으로 듣는 음악 추천 시스템
눈으로 듣는 음악 추천 시스템if kakao
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn confluent
 
How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyJosh Baer
 
Cohort Analysis at Scale
Cohort Analysis at ScaleCohort Analysis at Scale
Cohort Analysis at ScaleBlake Irvine
 
Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014Erik Bernhardsson
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At SpotifyVidhya Murali
 
Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Esh Vckay
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Zhenxiao Luo
 

La actualidad más candente (20)

CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At Spotify
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data Meetup
 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with Spark
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at Spotify
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
 
The Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainThe Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and Pain
 
Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music Recommendations
 
Spotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendationsSpotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendations
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at Spotify
 
눈으로 듣는 음악 추천 시스템
눈으로 듣는 음악 추천 시스템눈으로 듣는 음악 추천 시스템
눈으로 듣는 음악 추천 시스템
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
 
How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At Spotify
 
Cohort Analysis at Scale
Cohort Analysis at ScaleCohort Analysis at Scale
Cohort Analysis at Scale
 
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and ParquetFile Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
 
Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At Spotify
 
Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
 

Destacado

Playlist Recommendations @ Spotify
Playlist Recommendations @ SpotifyPlaylist Recommendations @ Spotify
Playlist Recommendations @ SpotifyNikhil Tibrewal
 
Mugo one pager
Mugo one pagerMugo one pager
Mugo one pagerori segal
 
Jackdaw research music survey report
Jackdaw research music survey reportJackdaw research music survey report
Jackdaw research music survey reportJan Dawson
 
How We Listen to Music - SXSW 2015
How We Listen to Music - SXSW 2015How We Listen to Music - SXSW 2015
How We Listen to Music - SXSW 2015Paul Lamere
 

Destacado (6)

Playlist Recommendations @ Spotify
Playlist Recommendations @ SpotifyPlaylist Recommendations @ Spotify
Playlist Recommendations @ Spotify
 
Music survey results (2)
Music survey results (2)Music survey results (2)
Music survey results (2)
 
Music & interaction
Music & interactionMusic & interaction
Music & interaction
 
Mugo one pager
Mugo one pagerMugo one pager
Mugo one pager
 
Jackdaw research music survey report
Jackdaw research music survey reportJackdaw research music survey report
Jackdaw research music survey report
 
How We Listen to Music - SXSW 2015
How We Listen to Music - SXSW 2015How We Listen to Music - SXSW 2015
How We Listen to Music - SXSW 2015
 

Similar a Scala Data Pipelines @ Spotify

London devops logging
London devops loggingLondon devops logging
London devops loggingTomas Doran
 
CPANTS: Kwalitative website and its tools
CPANTS: Kwalitative website and its toolsCPANTS: Kwalitative website and its tools
CPANTS: Kwalitative website and its toolscharsbar
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Dan Halperin
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent SearchTed Dunning
 
Vertica architecture
Vertica architectureVertica architecture
Vertica architectureZvika Gutkin
 
Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)LivePerson
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a clusterGal Marder
 
Tuning Your Engine
Tuning Your EngineTuning Your Engine
Tuning Your Enginejoelbradbury
 
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Kris Jack
 
Scala in practice - 3 years later
Scala in practice - 3 years laterScala in practice - 3 years later
Scala in practice - 3 years laterpatforna
 
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...Thoughtworks
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
 
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...Paul Leclercq
 
Message:Passing - lpw 2012
Message:Passing - lpw 2012Message:Passing - lpw 2012
Message:Passing - lpw 2012Tomas Doran
 

Similar a Scala Data Pipelines @ Spotify (20)

London devops logging
London devops loggingLondon devops logging
London devops logging
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent Search
 
CPANTS: Kwalitative website and its tools
CPANTS: Kwalitative website and its toolsCPANTS: Kwalitative website and its tools
CPANTS: Kwalitative website and its tools
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
 
Let's Get to the Rapids
Let's Get to the RapidsLet's Get to the Rapids
Let's Get to the Rapids
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent Search
 
Vertica architecture
Vertica architectureVertica architecture
Vertica architecture
 
Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a cluster
 
Akka streams
Akka streamsAkka streams
Akka streams
 
Tuning Your Engine
Tuning Your EngineTuning Your Engine
Tuning Your Engine
 
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
 
Scala in practice - 3 years later
Scala in practice - 3 years laterScala in practice - 3 years later
Scala in practice - 3 years later
 
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...
 
Hive at Last.fm
Hive at Last.fmHive at Last.fm
Hive at Last.fm
 
Graphite
GraphiteGraphite
Graphite
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
 
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
 
Message:Passing - lpw 2012
Message:Passing - lpw 2012Message:Passing - lpw 2012
Message:Passing - lpw 2012
 

Más de Neville Li

Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify StoryNeville Li
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamNeville Li
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...Neville Li
 
Why functional why scala
Why functional  why scala Why functional  why scala
Why functional why scala Neville Li
 

Más de Neville Li (6)

Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
 
Scio
ScioScio
Scio
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...
 
Why functional why scala
Why functional  why scala Why functional  why scala
Why functional why scala
 

Último

Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 

Último (20)

Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 

Scala Data Pipelines @ Spotify

  • 2. Who am I? ‣ SpotifyNYCsince2011 ‣ FormerlyYahoo!Search ‣ Musicrecommendations ‣ Datainfrastructure ‣ Scalasince2013
  • 3. Spotify in numbers • Started in 2006, 58 markets • 75M+ active users, 20M+ paying • 30M+ songs, 20K new per day • 1.5 billion playlists • 1 TB logs per day • 1200+ node Hadoop cluster • 10K+ Hadoop jobs per day
  • 4. Music recommendation @ Spotify • Discover Weekly • Radio • RelatedArtists • Discover Page
  • 6. A little teaser PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn, CombineFn<K,V> reduceFn) Crunch: CombineFns are used to represent the associative operations… Grouped[K, +V]::reduce[U >: V](fn: (U, U) U) Scalding: reduce with fn which must be associative and commutative… PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V) Spark: Merge the values for each key using an associative reduce function…
  • 7. Monoid! enables map side reduce Actually it’s a semigroup
  • 8. One more teaser Linear equation inAlternate Least Square (ALS) Matrix factorization xu = (YTY + YT(Cu − I)Y)−1YTCup(u) vectors.map { case (id, v) => (id, v * v) }.map(_._2).reduce(_ + _) // YtY ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY .map { case (_, (r, op)) => (solveKey(r), op * (r.rating * alpha)) }.reduceByKey(_ + _) ratings.keyBy(fixedKey).join(vectors) // YtCupu .map { case (_, (r, v)) => val (Cui, pui) = (r.rating * alpha + 1, if (Cui > 0.0) 1.0 else 0.0) (solveKey(r), v * (Cui * pui)) }.reduceByKey(_ + _) http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
  • 9. Success story • Mid 2013: 100+ Python Luigi M/R jobs, few tests • 10+ new hires since, most fresh grads • Few with Java experience, none with Scala • Now: 300+ Scalding jobs, 400+ tests • More ad-hoc jobs untracked • Spark also taking off
  • 12. Guess how many jobs written by yours truly?
  • 15. To join or not to join? val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .join(tgp) .values // (user, genre) .group .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
  • 16. Hash join val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .hashJoin(tgp.forceToDisk) // tgp replicated to all mappers .values // (user, genre) .group .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
  • 17. CoGroup val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .cogroup(tgp) { case (_, users, genres) => users.map((_, genres.toSet)) } // (track, (user, genres)) .values // (user, genres)
 .group .reduce(_ ++ _) // map-side reduce!
  • 18. CoGroup val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .cogroup(tgp) { case (_, users, genres) => users.map((_, genres.toSet)) } // (track, (user, genres)) .values // (user, genres)
 .group .sum // SetMonoid[Set[T]] from Algebird * sum[U >:V](implicit sg: Semigroup[U])
  • 19. Key-value file as distributed cache val streams: TypedPipe[(String, String)] = _ // (gid, user) val tgp: SparkeyManager = _ // tgp replicated to all mappers streams .map { case (track, user) => (user, tgp.get(track).split(",").toSet) } .group .sum https://github.com/spotify/sparkey SparkeyManagerwraps DistributedCacheFile
  • 20. Joins and CoGroups • Require shuffle and reduce step • Some ops force everything to reducers
 e.g. mapGroup, mapValueStream • CoGroup more flexible for complex logic • Scalding flattens a.join(b).join(c)…
 into MultiJoin(a, b, c, …)
  • 21. Distributed cache • Fasterwith off-heap binary files • Building cache = more wiring • Memory mapping may interfere withYARN • E.g. 64GB nodes with 48GB for containers (no cgroup) • 12 × 2GB containers each with 2GB JVM heap + mmap cache • OOM and swap! • Keep files small (< 1GB) or fallback to joins…
  • 22. Analyze your jobs • Concurrent Driven • Visualize job execution • Workflow optimization • Bottlenecks • Data skew
  • 24. Recommending tracks • User listened to Rammstein - Du Hast • Recommend 10 similartracks • 40 dimension feature vectors fortracks • Compute cosine similarity between all pairs • O(n) lookup per userwhere n ≈ 30m • Trythat with 50m users * 10 seed tracks each
  • 25. ANNOY - cheat by approximation • Approximate Nearest Neighbor OhYeah • Random projections and binarytree search • Build index on single machine • Load in mappers via distribute cache • O(log n) lookup https://github.com/spotify/annoy https://github.com/spotify/annoy-java
  • 27. Filtering candidates • Users don’t like seeing artist/album/tracks they already know • But may forget what they listened long ago • 50m * thousands of items each • Over 5 years of streaming logs • Need to update daily • Need to purge old items per user
  • 28. Options • Aggregate all logs daily • Aggregate last x days daily • CSVof artist/album/track ids • Bloom filters
  • 29. Decayed value with cutoff • Compute new user-item score daily • Weighted on context, e.g. radio, search, playlist • score’ = score + previous * 0.99 • half life = log0.99 0.5 = 69 days • Cut off at top 2000 • Items that users might remember seeing recently
  • 30. Bloom filters • Probabilistic data structure • Encoding set of items with m bits and k hash functions • No false negative • Tunable false positive probability • Size proportional to capacity & FP probability • Let’s build one per user-{artists,albums,tracks} • Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR
  • 31. Size versus max items & FP prob • User-item distribution is uneven • Assuming same setting for all users • # items << capacity → wasting space • # items > capacity → high FP rate
  • 32. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead
  • 33. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead n=1k item
  • 34. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead n=1k n=10k item full
  • 35. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead item n=1k n=10k n=100k fullfull
  • 36. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead n=1k n=10k n=100k n=1m item fullfullfull
  • 37. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to << N max possible items • Keep smallest one with capacity > items inserted • Expensive to build • Cheap to store and lookup
  • 38. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to << N max possible items • Keep smallest one with capacity > items inserted • Expensive to build • Cheap to store and lookup n=1k
  • 43. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to N max possible items • Keep smallest one with capacity items inserted • Expensive to build • Cheap to store and lookup n=1k
  • 48. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to N max possible items • Keep smallest one with capacity items inserted • Expensive to build • Cheap to store and lookup n=1k
  • 53. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to N max possible items • Keep smallest one with capacity items inserted • Expensive to build • Cheap to store and lookup n=1k
  • 60. Track metadata • Label dump → content ingestion • Third partytrack genres, e.g. GraceNote • Audio attributes, e.g. tempo, key, time signature • Cultural data, e.g. popularity, tags • Latent vectors from collaborative filtering • Many sources for album, artist, user metadata too
  • 61. Multiple data sources • Big joins • Complex dependencies • Wide rows with few columns accessed • Wasting I/O
  • 62. Apache Parquet • Pre-join sources into mega-datasets • Store as Parquet columnar storage • Column projection • Predicate pushdown • Avro within Scalding pipelines
  • 63. Projection pipe.map(a = (a.getName, a.getAmount)) versus Parquet.project[Account](name, amount) • Strings → unsafe and error prone • No IDE auto-completion → finger injury • my_fancy_field_name → .getMyFancyFieldName • Hard to migrate existing code
  • 64. Predicate pipe.filter(a = a.getName == Neville a.getAmount 100) versus FilterApi.and( FilterApi.eq(FilterApi.binaryColumn(name), Binary.fromString(Neville)), FilterApi.gt(FilterApi.floatColumn(amount), 100f.asInstnacesOf[java.lang.Float]))
  • 65. Macro to the rescue Code →AST→ (pattern matching) → (recursion) → (quasi-quotes) → Code Projection[Account](_.getName, _.getAmount) Predicate[Account](x = x.getName == “Neville x.getAmount 100) https://github.com/nevillelyh/parquet-avro-extra http://www.lyh.me/slides/macros.html
  • 66. What else? ‣ Analytics ‣ Adstargeting,prediction ‣ Metadataquality ‣ Zeppelin ‣ Morecoolstuffintheworks