Analyze radio stations broadcasts
with Apache Spark SQL,
Spotify, and Databricks
Spark User Group Paris - May 2017
Galenki, Russia
1. Spark SQL
a. Dataset API
b. Parquet
c. Databricks
2. Data extraction
3. Data exploration
Paul Leclercq
@polomarcus
Ad tech for 3 years at :
Data Engineer
● Spark : Streaming, SQL, MLLib
● Scala
● Kafka
● NoSQL
Looking for his dream job in data, in the music/sport/cool-stuff industry :)
Data people
Engineer: stores and indexes high volumes of raw data, implements machine learning algorithms
Tools: Hadoop, Amazon S3, Kafka, RabbitMQ, Spark, Flink, Beam, Drill, Druid, NoSQL databases (Cassandra, Redis, Aerospike)
Scientist: PhD or mathematics degree; builds machine learning algorithms that can predict business actions
Machine learning/statistics tools: Scikit-learn, MLlib
Business Analyst: uses the data provided for business purposes
Tools with a UI: Excel, Chart.io, Talend, Superset, Pivot
Why I love Spark
“Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.”
Scalable: both ops and code
Unified distributed engine for batch, streaming, and ML data processing
Spark Usage 2016 Survey
Spark SQL
“Apache Spark's module for working with structured data”
● Access a variety of data sources: Hive, JSON, Avro, Parquet, ORC, JDBC.
● Plug in Tableau, Chart.io, Power BI, Excel… through the JDBC or ODBC driver
● ~ ANSI SQL:2003
● DataFrame / Dataset
○ Since Spark 2.0, the primary Machine Learning API
○ Also used in Structured Streaming (still ALPHA in Spark 2.1)
Spark SQL - RDD and Dataset (and Dataframe)
RDD = strong typing, lambda functions, DAG
Dataset = built on top of RDDs + optimized execution engine + in-memory columnar storage + convenient access to data by column name: ds.map(_.myColumn)
DataFrame = Dataset[Row]
From “High Performance Spark” by Holden Karau
Databricks’ blog
Spark SQL - RDD and Dataset
Plain SQL Query or Dataset API
spark.sql("""
SELECT title, artist
FROM datasetTable
"""
)
dataset.select($"title",$"artist")
Spark SQL - Catalyst Queries Optimizer
● General tree transformation framework: queries are represented as abstract syntax trees (AST) of Scala objects
● Let the optimizer do the hard work: optimizations happen as late as possible
● Read as little data as possible: partitions, columnar formats, statistics metadata (min, max, dictionary), predicate pushdown into the storage system (e.g. a Postgres-specific query)
Protip: spark.sql(SQL_QUERY).explain(extended = true) or the SQL page of the Spark UI
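To make the "tree transformation" idea concrete, here is a toy sketch in plain Scala. The `Expr`, `Literal`, `Attribute` and `Add` types and the `transform` helper are hypothetical stand-ins written for this example; they imitate the style of an optimizer rule but are not Spark's actual Catalyst classes.

```scala
object TreeTransformSketch {
  // Toy expression tree; hypothetical stand-ins, not Spark's Catalyst classes.
  sealed trait Expr
  case class Literal(value: Int) extends Expr
  case class Attribute(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Apply a rewrite rule bottom-up: children first, then the node itself.
  def transform(expr: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = expr match {
      case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(withNewChildren, (e: Expr) => e)
  }

  // Constant folding: replace an addition of two literals by its result.
  val constantFolding: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b)) => Literal(a + b)
  }

  def main(args: Array[String]): Unit = {
    // x + (1 + 2)
    val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
    println(transform(tree)(constantFolding)) // Add(Attribute(x),Literal(3))
  }
}
```

The rule only describes the pattern to rewrite; walking the tree is the framework's job, which is what lets an optimizer stack many small rules.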
Spark SQL - Catalyst Queries Optimizer
● No language jealousy: all of Spark’s Dataset APIs get the same performance, whatever the language
Storage: Parquet
● Columnar storage
● Optimized I/O
○ Column pruning
○ Predicate pushdown (statistics filters: size, max, min, dictionary)
● Popular and interoperable: supported by many other data processing systems
● Supports schema evolution, nullable=true
● Simple to use with Spark
○ df.write.format("parquet").save("nrjnovavirginskyrock.parquet")
○ spark.read.parquet("nrjnovavirginskyrock.parquet")
○ df.write.partitionBy("radio").parquet("radioPartitionedByRadio.parquet")
Protips:
● For your test jobs:
○ df.write.mode(SaveMode.Overwrite).save("test.parquet")
○ Otherwise they can fail because the file already exists
● Learn from the best
○ Parquet’s Julien Le Dem: How to use Parquet
○ Netflix’s Ryan Blue: Parquet performance tuning: the missing guide
What’s awesome about Databricks?
● Collaboration via notebooks
● Free community edition with a 6 GB RAM server, ready to go: https://community.cloud.databricks.com/
● Awesome and simple data viz
And also:
● Mixing languages in a notebook, including Markdown (see demo later)
● Cost management (AWS Spot instances)
● REST API, Jobs, Security...
What about an open source solution?
● Notebooks: Apache Zeppelin
● Managed Spark clusters on AWS or GCP
Getting the radio stations data - Scala scraper
From “what was this title?” HTML pages or REST API:
● http://www.nrj.fr/chansons-diffusees?__postedForm=broadcastedhitdate&date=1970/01/01 00:00
● http://www.novaplanet.com/radionova/cetaitquoicetitre/$timestamp
● https://www.virginradio.fr/cetait-quoi-ce-titre?date=1970-01-01&hour=00&minute=00
● http://skyrock.fm/api/v3/sound?search_date=1970-01-01&search_hour=00:00
Good real-life experience of extracting data:
● Slow or fast servers
● Different semantics: multiple artists are joined with “&”, “AND”, or “/”
● Different formats: HTML page / JSON
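The "different semantics" point can be handled with a small normalization step before loading the data. The sketch below is an assumption about how that could look, not the scraper's actual code: `ArtistNormalizer` and `splitArtists` are hypothetical names, and the separator list only covers the three cases named above.

```scala
object ArtistNormalizer {
  // Separators named on the slide: "&", "AND" (any case), "/".
  // Assumption: "&" and "AND" are surrounded by whitespace, "/" may not be.
  private val separators = """\s+&\s+|\s+(?i:and)\s+|\s*/\s*"""

  /** Splits a raw artist field into trimmed, lower-cased artist names. */
  def splitArtists(raw: String): List[String] =
    raw.split(separators).map(_.trim.toLowerCase).filter(_.nonEmpty).toList
}
```

A naive split like this one also cuts band names that contain a separator ("Earth Wind & Fire"), so real data would need a list of exceptions.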
Data from the radio stations
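case class Song(timestamp: Int, humanDate: Long, year: Int, month: Int, day: Int,
hour: Int, minute: Int, artist: String, allArtists: String, title: String, radio: String)
// "myformat" stands for the reader of your file format: json, parquet, csv...
import spark.implicits._ // provides the Encoder that .as[Song] needs
val dataset = spark.read.myformat("myfile").as[Song]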
dataset.show() or display(dataset) on Databricks:
Data from the radio stations
dataset.show()
dataset.show(numberOfRows, truncate = false)
https://developer.spotify.com/web-api/console/
● Audio features of a track : danceability, positiveness, energy
● Artist : music genre
● Search a track
Positiveness/Valence: September — Earth Wind & Fire, Ska-Boo-Da-Ba — The Skatalites or Hey Ya! — OutKast
Danceability: Trick Me — Kelis, Around the world — Daft Punk or Anaconda — Nicki Minaj
Energy : We Are Your Friends - JUSTICE, Steppin’ stone - Davy Jones, Jerk It Out — Caesars
Number of songs * (artist + track + audio features) = 24K requests
→ Avoid surprises: always think about how large your data is before performing an action
● Is the destination server’s disk big enough? Is the server powerful enough?
● Third-party rate limit? Will other applications need this service too?
● Network cost?
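The 24K figure is simple arithmetic: about 8K distinct songs times three calls each (artist, track, audio features). A minimal sketch of that back-of-the-envelope check follows; the rate of 10 requests per second is a made-up value for illustration, not Spotify's actual limit.

```scala
object RequestBudget {
  /** Total API calls: one call per endpoint for every distinct song. */
  def totalRequests(distinctSongs: Long, endpointsPerSong: Int): Long =
    distinctSongs * endpointsPerSong

  /** Minimum duration in minutes when throttled to `requestsPerSecond`. */
  def minutesAtRate(requests: Long, requestsPerSecond: Double): Double =
    requests / requestsPerSecond / 60.0

  def main(args: Array[String]): Unit = {
    val requests = totalRequests(distinctSongs = 8000L, endpointsPerSong = 3)
    println(requests)                      // 24000
    // Hypothetical throttle of 10 requests per second
    println(minutesAtRate(requests, 10.0)) // 40.0
  }
}
```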
Data from Spotify
dataframe.show() / display(dataframe) on Databricks
Why a DataFrame and not a Dataset? → dataframe.printSchema
root
|-- tracks: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- album: struct (nullable = true)
| | | |-- album_type: string (nullable = true)
| | | |-- artists: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- external_urls: struct (nullable = true)
| | | | | | |-- spotify: string (nullable = true)
| | | | | |-- href: string (nullable = true)
| | | | | |-- id: string (nullable = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- uri: string (nullable = true)
| | | |-- available_markets: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- external_urls: struct (nullable = true)
| | | | |-- spotify: string (nullable = true)
| | | |-- href: string (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- images: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- height: long (nullable = true)
| | | | | |-- url: string (nullable = true)
| | | | | |-- width: long (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uri: string (nullable = true)
dataframe.printSchema
Not really big data… and I am ok with that!
300K+ broadcast rows for 8K different songs
● Nova: 95K broadcasts of 5000 different songs
● NRJ: 50K broadcasts of 800 different songs
● Virgin: 60K broadcasts of 1200 different songs
● Skyrock: 100K broadcasts of 1000 different songs
Protip: dataset.sample(withReplacement, fraction)
How many songs by day ?
SELECT COUNT(*) AS number_songs_broadcasted,
DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd') AS date, radio
FROM nrjnova
GROUP BY DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd'), radio
ORDER BY date
Dataframe API
import org.apache.spark.sql.functions.date_format
// 'yyyy' is the calendar year; 'Y' would be the week-based year
nrjnova.select(date_format($"timestamp".cast("timestamp"), "yyyy-MM-dd").alias("date"), $"radio")
.groupBy($"date", $"radio") // group on the aliased column
.count()
.orderBy($"date".asc) // sort after aggregating
How many songs by day ?
How many different songs by month?
Radio brainwashing ?
Same song by day
Music genres by radio
Genre information is available per artist only → ["alternative dance","chamber pop","dance-punk","electronic","garage rock","indie pop","indie r&b","indie rock","indietronica","new rave","synthpop"]
import org.apache.spark.sql.functions.explode
// explode() emits one row per element of the artist's genres array
val genres = TrackArtistAudioFeature
.select($"name", explode($"genres"), $"tracks.name", $"radio")
.toDF("artist", "genres", "title", "radio")
genres.createOrReplaceTempView("genres")
genres.cache()
Music genres by radio
SELECT COUNT(DISTINCT genres) AS number_of_genres, radio
FROM genres
GROUP BY radio
ORDER BY number_of_genres DESC
Music genres by radio
Is Skyrock really “first on rap” ?
SELECT COUNT(genres) AS number_of_hip_hop_songs, genres, radio
FROM genres
WHERE genres LIKE '%rap%' OR genres LIKE '%hip%' OR genres LIKE '%hop%'
GROUP BY genres, radio
HAVING COUNT(genres) > 50
ORDER BY number_of_hip_hop_songs DESC
Is Skyrock really “first on rap” ?
Songs duration distribution
SELECT ROUND((COUNT(t.*) / subTotal.total_radio * 100), 2) AS percentage_of_songs, subTotal.total_radio,
FLOOR((duration_ms / 1000) / 60) AS minute, ROUND(((duration_ms / 1000) % 60) / 10) * 10 AS second,
t.radio
FROM AudioFeatureArtistTrackRadios t
JOIN (
SELECT COUNT(*) AS total_radio, radio
FROM AudioFeatureArtistTrackRadios
GROUP BY radio
) AS subTotal
ON subTotal.radio = t.radio
GROUP BY 2, 3, 4, 5 -- every non-aggregated column; position 1 is the aggregate
ORDER BY minute, second
Songs duration distribution
Percentage of music by day
SELECT AVG(number_songs_broadcasted) * 3.3 / (24 * 60) * 100 AS percent_of_music,
radio
FROM (
SELECT COUNT(*) AS number_songs_broadcasted,
DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd') AS date, radio
FROM nrjnova
GROUP BY DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd'), radio
HAVING COUNT(*) > 0 -- avoid radio stations’ system bug
ORDER BY date
)
GROUP BY radio
Spark SQL - Percentage of music by day
● 3.3 = average song duration in minutes
● 24 * 60 = total minutes by day
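The formula in the query can be checked by hand. Below is a minimal sketch of the same arithmetic in plain Scala; `MusicShare` and `percentOfMusic` are hypothetical names, and the 300 songs per day used in `main` is an invented input, not a figure from the dataset. With it, 300 * 3.3 = 990 minutes out of 1440, so roughly 69% of the day.

```scala
object MusicShare {
  val MinutesPerDay: Int = 24 * 60

  /** Share of a day filled with music, in percent. */
  def percentOfMusic(avgSongsPerDay: Double, avgSongMinutes: Double = 3.3): Double =
    avgSongsPerDay * avgSongMinutes / MinutesPerDay * 100

  def main(args: Array[String]): Unit = {
    // Hypothetical station airing 300 songs per day on average
    println(f"${percentOfMusic(300)}%.2f %%")
  }
}
```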
What’s an average monday ?
SELECT ROUND(AVG(number_of_tracks)) AS number_of_tracks, radio, hour
FROM (
SELECT COUNT(*) AS number_of_tracks, weekofyear( CAST(timestamp as timestamp)) AS
week_number, CAST(DATE_FORMAT(CAST(timestamp as timestamp),'k') AS int) AS hour, radio
FROM nrjnova
WHERE DATE_FORMAT(CAST(timestamp as timestamp),'EEEE') = "Monday"
GROUP BY weekofyear( CAST(timestamp as timestamp)), DATE_FORMAT(CAST(timestamp as
timestamp),'k'), radio
HAVING COUNT(*) > 0 -- avoid radio stations’ system bug
)
GROUP BY hour, radio
ORDER BY hour
What’s an average monday ?
How many minutes of advertising?
Windowing query example - Most broadcasted songs
SELECT COUNT(*), n.title, n.artist, n.radio, rank, month, year
FROM (
SELECT title, artist, radio,number_of_broadcast, dense_rank() OVER (PARTITION BY radio ORDER BY
number_of_broadcast DESC) AS rank
FROM (
SELECT COUNT(*) AS number_of_broadcast, title, artist, radio
FROM nrjnova
GROUP BY title, artist, radio
) tmp
) top10
JOIN nrjnova n
ON top10.title = n.title AND top10.artist = n.artist AND top10.radio = n.radio
WHERE rank <= 2
GROUP BY n.title, n.artist, n.radio, rank, month, year
ORDER BY month
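What dense_rank() computes inside the window can be reproduced on plain Scala collections. This is only a sketch of the ranking semantics, not how Spark executes the window; `Play` and `denseRank` are hypothetical names introduced for the example.

```scala
object DenseRankSketch {
  case class Play(title: String, radio: String, broadcasts: Int)

  /** Mimics dense_rank() OVER (PARTITION BY radio ORDER BY broadcasts DESC). */
  def denseRank(plays: Seq[Play]): Seq[(Play, Int)] =
    plays.groupBy(_.radio).toSeq.flatMap { case (_, rows) =>
      // Distinct counts, highest first: ties share a rank and no rank is skipped.
      val rankOf = rows.map(_.broadcasts).distinct.sortBy(-_).zipWithIndex.toMap
      rows.map(p => p -> (rankOf(p.broadcasts) + 1))
    }
}
```

Because ties share a rank, a filter such as rank <= 2 can return more than two songs per radio.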
Windowing query example - Most broadcasted songs
Similarities between radio stations with unidirectional inequality
SELECT COUNT(DISTINCT n1.artist, n1.title) AS number_of_similar_songs, CONCAT(n1.radio, "-",
n2.radio) AS radios, n1.radio AS radio_1, ROUND(COUNT(DISTINCT n1.artist, n1.title) /
number_of_song_radio_1 * 100) AS percent_radio_1, number_of_song_radio_1, n2.radio as radio_2,
ROUND(COUNT(DISTINCT n1.artist, n1.title) / number_of_song_radio_2 * 100) as percent_radio_2,
number_of_song_radio_2
FROM nrjnova n1
JOIN nrjnova n2
ON n1.radio < n2.radio AND LOWER(n1.artist)=LOWER(n2.artist) AND LOWER(n1.title)=LOWER(n2.title)
GROUP BY n1.radio, n2.radio, number_of_song_radio_1, number_of_song_radio_2
ORDER BY number_of_similar_songs DESC
Similarities between radio stations with unidirectional inequality
JOIN radio n2 ON n1.radio != n2.radio → each pair appears twice
● (nova, virgin)
● (virgin, nova)
JOIN radio n2 ON n1.radio < n2.radio → each pair appears once
● (nova, virgin)
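The effect of the strict inequality can be checked on plain Scala collections by comparing a self-join filtered with `!=` against one filtered with `<`. This is a sketch of the join condition only; `RadioPairs` is a hypothetical name.

```scala
object RadioPairs {
  val radios = List("nova", "nrj", "skyrock", "virgin")

  // Self-join with "!=": every pair appears twice, once per direction.
  val bothDirections: List[(String, String)] =
    for (a <- radios; b <- radios if a != b) yield (a, b)

  // Self-join with "<": each unordered pair appears exactly once.
  val oneDirection: List[(String, String)] =
    for (a <- radios; b <- radios if a < b) yield (a, b)
}
```

With 4 radios the first join yields 12 pairs and the second 6, so the inequality halves the rows to compare.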
Similarities between radio stations with unidirectional inequality
Common songs between our 4 radios ?
4 joins ??? → Nope
Common songs between our 4 radios ?
SELECT LOWER(title) AS Title, LOWER(artist) AS Artist, COUNT(DISTINCT (radio))
FROM nrjnova
GROUP BY LOWER(title), LOWER(artist)
HAVING COUNT(DISTINCT (radio)) = ( -- 4, because we have 4 different radios
SELECT MAX(count)
FROM (
SELECT COUNT(DISTINCT (radio)) AS count, LOWER(title), LOWER(artist)
FROM nrjnova
GROUP BY LOWER(title), LOWER(artist)
)
)
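The same "group once, then compare distinct radio counts" idea, without any join, can be sketched on plain Scala collections. `Broadcast` and `airedEverywhere` are hypothetical names introduced for this example; the logic follows the query's approach of comparing each song's count with the maximum count.

```scala
object CommonSongs {
  case class Broadcast(title: String, artist: String, radio: String)

  /** Songs aired on the largest number of distinct radios, compared case-insensitively. */
  def airedEverywhere(rows: Seq[Broadcast]): Set[(String, String)] = {
    // Distinct radios per (title, artist), like COUNT(DISTINCT radio) ... GROUP BY
    val radiosPerSong = rows
      .groupBy(b => (b.title.toLowerCase, b.artist.toLowerCase))
      .map { case (song, plays) => song -> plays.map(_.radio).distinct.size }
    if (radiosPerSong.isEmpty) Set.empty
    else {
      val max = radiosPerSong.values.max // the scalar subquery: MAX(count)
      radiosPerSong.collect { case (song, n) if n == max => song }.toSet
    }
  }
}
```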
Common songs between radios ?
Prince — Kiss
C2C — Happy
Stromae — Formidable
Spark SQL - Case statement
SELECT CASE artist
WHEN "Drake"
THEN "New drake name"
ELSE artist END AS artist,
title, radio
FROM nrjnova
Resources
Demo’s Notebook available here
“Terra Data” exhibition at Cité des sciences, Paris
EPFL Spark Intro from Heather Miller
Deep Dive into Spark SQL’s Catalyst Optimizer
Mastering Apache Spark 2 by Jacek Laskowski
Unsplash: platform for copyright-free HD pictures
Bonus - Spotify Playlists
~200 most broadcasted songs in 2016 for each radio :
● “Radio Nova Top 2016” with Calypso Rose, Kaytranada, The Roots, M.I.A...
● “Skyrock Top 2016” with Drake, Major Lazer, Timberlake, Soprano, PNL, Jul…
● “Virgin Top 2016” with Imany, Twenty One Pilots, Sia, Kungs, Julian Perretta…
● “NRJ top 2016” with Enrique Iglesias, Soprano, Coldplay, Kungs, Amir, MHD, Tal

Más contenido relacionado

La actualidad más candente

First impressions of SparkR: our own machine learning algorithm
First impressions of SparkR: our own machine learning algorithmFirst impressions of SparkR: our own machine learning algorithm
First impressions of SparkR: our own machine learning algorithmInfoFarm
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkRKien Dang
 
Scalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduceScalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduceLivePerson
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...Duyhai Doan
 
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...guest5b1607
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Chris Fregly
 
How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyJosh Baer
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Chris Fregly
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easyVictor Sanchez Anguix
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceDuyhai Doan
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scalejeykottalam
 
Get started with Lua programming
Get started with Lua programmingGet started with Lua programming
Get started with Lua programmingEtiene Dalcol
 

La actualidad más candente (15)

First impressions of SparkR: our own machine learning algorithm
First impressions of SparkR: our own machine learning algorithmFirst impressions of SparkR: our own machine learning algorithm
First impressions of SparkR: our own machine learning algorithm
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkR
 
Scalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduceScalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduce
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
 
Scala 20140715
Scala 20140715Scala 20140715
Scala 20140715
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
 
How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At Spotify
 
Om nom nom nom
Om nom nom nomOm nom nom nom
Om nom nom nom
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practice
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scale
 
Get started with Lua programming
Get started with Lua programmingGet started with Lua programming
Get started with Lua programming
 

Similar a Analyze one year of radio station songs aired with Spark SQL, Spotify, and Databricks

Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Roger Huang
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
The openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query LanguageThe openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query LanguageNeo4j
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScyllaDB
 
Mapping, Interlinking and Exposing MusicBrainz as Linked Data
Mapping, Interlinking and Exposing MusicBrainz as Linked DataMapping, Interlinking and Exposing MusicBrainz as Linked Data
Mapping, Interlinking and Exposing MusicBrainz as Linked DataPeter Haase
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and OutTravis Oliphant
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Databricks
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Matthias Niehoff
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Cypher and apache spark multiple graphs and more in open cypher
Cypher and apache spark  multiple graphs and more in  open cypherCypher and apache spark  multiple graphs and more in  open cypher
Cypher and apache spark multiple graphs and more in open cypherNeo4j
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...Andrew Lamb
 

Similar a Analyze one year of radio station songs aired with Spark SQL, Spotify, and Databricks (20)

Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
The openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query LanguageThe openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query Language
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Mapping, Interlinking and Exposing MusicBrainz as Linked Data
Mapping, Interlinking and Exposing MusicBrainz as Linked DataMapping, Interlinking and Exposing MusicBrainz as Linked Data
Mapping, Interlinking and Exposing MusicBrainz as Linked Data
 
Presentation
PresentationPresentation
Presentation
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Cypher and apache spark multiple graphs and more in open cypher
Cypher and apache spark  multiple graphs and more in  open cypherCypher and apache spark  multiple graphs and more in  open cypher
Cypher and apache spark multiple graphs and more in open cypher
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
 

Último

HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Databricks

  • 1. Analyze radio stations broadcasts with Apache Spark SQL, Spotify, and Databricks Spark User Group Paris - May 2017 Galenki, Russia
  • 2. 1. Spark SQL a. Dataset API b. Parquet c. Databricks 2. Data extraction 3. Data exploration
  • 3. Paul Leclercq @polomarcus Ad tech for 3 years at : Data Engineer ● Spark : Streaming, SQL, MLLib ● Scala ● Kafka ● NoSQL Looking for his dream job in data, in the music/sport/cool stuff industry :) 3
  • 4. 4
  • 5. Data people ● Engineer: stores and indexes high volumes of raw data, implements machine learning algorithms. Tools: Hadoop, Amazon S3, Kafka, RabbitMQ, Spark, Flink, Beam, Drill, Druid, NoSQL DB : Cassandra, Redis, Aerospike ● Scientist: PhD or Mathematics degree, builds machine learning algorithms that can predict business actions. Machine learning/Statistics tools: Scikit-learn, MLLib ● Business Analyst: uses the data provided for business purposes. Tools with UI: Excel, Chart.io, Talend, Superset, Pivot 5
  • 6. Why I love Spark “Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.” Scalable : ops and code Batch, Streaming, ML unified distributed engine to process data 6
  • 8. Spark SQL “Apache Spark's module for working with structured data” ● Access a variety of data sources : Hive, Avro, Parquet, ORC, JSON, JDBC. ● Plug Tableau, Chart.io, Power BI, Excel… thanks to JDBC or ODBC driver ● ~ ANSI SQL:2003 ● Dataframe / Dataset ○ Since Spark 2.0, the primary Machine Learning API ○ Also used in Structured Streaming (still ALPHA in Spark 2.1) 8
  • 9. Spark SQL - RDD and Dataset (and Dataframe) ● RDD = strong typing, lambda functions, DAG ● Dataset = built on top of RDDs + optimized execution engine + in-memory columnar storage + convenient access to data by column name : ds.map(_.myColumn) ● Dataframe = Dataset[Row] From “High performance Spark” by Holden Karau Databricks’ blog 9
  • 10. Spark SQL - RDD and Dataset Plain SQL Query or Dataset API spark.sql(""" SELECT title, artist FROM datasetTable """ ) dataset.select($"title",$"artist") 10
  • 11. Spark SQL - Catalyst Query Optimizer ● General tree transformation framework : Scala’s abstract syntax tree (AST) ● Let the optimizer do the hard work : optimizations happen as late as possible ● Read as little data as possible : partitions, columnar format, statistics metadata (min, max, dictionary), pushing predicates down into the storage system (e.g. a Postgres-specific query) Protip: spark.sql(SQL_QUERY).explain(extended = true) or Spark UI SQL page 11
  • 12. Spark SQL - Catalyst Query Optimizer ● No language jealousy : all of Spark’s Dataset language APIs go through the same optimizer and have the same performance 12
  • 13. Storage : Parquet ● Columnar storage ● Optimized I/O ○ Column pruning ○ Predicate pushdown (Stats filter : size, max, min, dictionary) ● Popular and interoperable, supported by many other data processing systems ● Supports schema evolution, nullable=true ● Simple use with Spark ○ df.write.format("parquet").save("nrjnovavirginskyrock.parquet") ○ spark.read.parquet("nrjnovavirginskyrock.parquet") ○ df.write.partitionBy("radio").parquet("radioPartitionedByRadio.parquet") 13
  • 14. Protips: ● For your test jobs: ○ df.write.mode(SaveMode.Overwrite).save("test.parquet") ○ Otherwise they can fail because the file already exists ● Learn from the best ○ Parquet’s Julien Le Dem : How to use Parquet ○ Netflix’s Ryan Blue : Parquet performance tuning: the missing guide 14
  • 15. What’s awesome about Databricks? ● Collaboration via notebooks ● Free community edition with a 6 GB RAM server, ready to go : https://community.cloud.databricks.com/ ● Awesome and simple data viz And also: ● Mixing languages in a notebook, including Markdown (see demo later) ● Cost management (AWS Spot instances) ● REST API, Jobs, Security... What about an open source solution? ● Notebooks : Apache Zeppelin ● Managed Spark clusters on AWS or GCP 15
  • 16. 16
  • 17. Getting the radio stations data - Scala scraper From “what was this title?” HTML pages or REST API: ● http://www.nrj.fr/chansons-diffusees?__postedForm=broadcastedhitdate&date=1970/01/01 00:00 ● http://www.novaplanet.com/radionova/cetaitquoicetitre/$timestamp ● https://www.virginradio.fr/cetait-quoi-ce-titre?date=1970-01-01&hour=00&minute=00 ● http://skyrock.fm/api/v3/sound?search_date=1970-01-01&search_hour=00:00 Good real-life experience of extracting data : ● Slow or fast servers ● Different conventions for multiple artists : “Artist 1 & Artist 2”, “Artist 1 AND Artist 2” or “Artist 1 / Artist 2” ● Different formats : HTML page / JSON 17
  • 18. Data from the radio stations case class Song(timestamp:Int, humanDate:Long, year:Int, month:Int, day:Int, hour:Int, minute: Int, artist:String, allArtists: String, title:String, radio:String) val dataset = spark.read.myformat("myfile").as[Song] dataset.show() or display(dataset) on Databricks: 18
  • 19. Data from the radio stations dataset.show() dataset.show(numberOfRows, truncate = false) 19
  • 20. https://developer.spotify.com/web-api/console/ ● Audio features of a track : danceability, positiveness, energy ● Artist : music genre ● Search a track Positiveness/Valence: September — Earth Wind & Fire, Ska-Boo-Da-Ba — The Skatalites or Hey Ya! — OutKast Danceability: Trick Me — Kelis, Around the world — Daft Punk or Anaconda — Nicki Minaj Energy : We Are Your Friends - JUSTICE, Steppin’ stone - Davy Jones, Jerk It Out — Caesars 20
  • 21. Number of songs * (Artist + track + audiofeatures) = 24K requests → Avoid surprises : always think about how large your data is before performing an action ● Is the destination server’s disk big enough? Is the server powerful enough? ● Is there a 3rd party rate limit? Will other applications need this service too? ● Network cost? 21
  • 22. Data from dataframe.show() / display(dataframe) on Databricks Why dataframe and not data? → dataframe.printSchema 22
  • 23. root |-- tracks: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- album: struct (nullable = true) | | | |-- album_type: string (nullable = true) | | | |-- artists: array (nullable = true) | | | | |-- element: struct (containsNull = true) | | | | | |-- external_urls: struct (nullable = true) | | | | | | |-- spotify: string (nullable = true) | | | | | |-- href: string (nullable = true) | | | | | |-- id: string (nullable = true) | | | | | |-- name: string (nullable = true) | | | | | |-- type: string (nullable = true) | | | | | |-- uri: string (nullable = true) | | | |-- available_markets: array (nullable = true) | | | | |-- element: string (containsNull = true) | | | |-- external_urls: struct (nullable = true) | | | | |-- spotify: string (nullable = true) | | | |-- href: string (nullable = true) | | | |-- id: string (nullable = true) | | | |-- images: array (nullable = true) | | | | |-- element: struct (containsNull = true) | | | | | |-- height: long (nullable = true) | | | | | |-- url: string (nullable = true) | | | | | |-- width: long (nullable = true) | | | |-- name: string (nullable = true) | | | |-- type: string (nullable = true) | | | |-- uri: string (nullable = true) dataframe.printSchema 23
  • 24. Not really big data… and I am ok with that! +300K rows of broadcasts of 8K different songs ● Nova : 95K broadcasts of 5000 different songs ● NRJ : 50K broadcasts of 800 different songs ● Virgin: 60K broadcasts of 1200 different songs ● Skyrock: 100K broadcasts of 1000 different songs Protip: dataset.sample(withReplacement, percentage) 24
  • 25. How many songs by day ? SELECT COUNT(*) as number_songs_broadcasted, DATE_FORMAT(CAST(timestamp as timestamp),'yyyy-MM-dd') AS date, radio FROM nrjnova GROUP BY DATE_FORMAT(CAST(timestamp as timestamp),'yyyy-MM-dd'), radio ORDER BY date Dataframe API nrjnova.select(date_format($"timestamp".cast("timestamp"),"yyyy-MM-dd").alias("date"), $"radio") .groupBy($"date", $"radio") .count() .orderBy($"date".asc) 25
  • 26. How many songs by day ? 26
  • 27. How many different songs by month? 27
  • 28. Radio brainwashing ? Same song by day 28
  • 29. Music genres by radio Genre info by artist only → ["alternative dance","chamber pop","dance-punk","electronic","garage rock","indie pop","indie r&b","indie rock","indietronica","new rave","synthpop"] import org.apache.spark.sql.functions.explode val genres = TrackArtistAudioFeature.select($"name", explode($"genres"), $"tracks.name",$"radio").toDF("artist", "genres","title","radio") genres.createOrReplaceTempView("genres") genres.cache() 29
  • 30. Music genres by radio SELECT COUNT(DISTINCT genres) AS number_of_genres, radio FROM genres GROUP BY radio ORDER BY number_of_genres DESC 30
  • 31. Music genres by radio 31
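The explode step and the distinct-genre count above can be sketched on an ordinary in-memory collection. This is a plain-Scala illustration, not the deck's Spark code: `ArtistRow`, `explodeGenres`, `genresPerRadio` and the sample rows are hypothetical names and made-up data. `flatMap` plays the role of `explode` (one output row per array element), and the `groupBy` mirrors `COUNT(DISTINCT genres) ... GROUP BY radio`.

```scala
object GenresByRadio {
  // Hypothetical row shape, loosely modelled on the columns named in the slide.
  final case class ArtistRow(artist: String, genres: Seq[String], title: String, radio: String)

  // One output row per (input row, genre), like explode() on an array column.
  def explodeGenres(rows: Seq[ArtistRow]): Seq[(String, String, String, String)] =
    rows.flatMap(r => r.genres.map(g => (r.artist, g, r.title, r.radio)))

  // Number of distinct genres per radio, highest first.
  def genresPerRadio(rows: Seq[ArtistRow]): Seq[(String, Int)] =
    explodeGenres(rows)
      .groupBy(_._4)
      .map { case (radio, exploded) => radio -> exploded.map(_._2).distinct.size }
      .toSeq
      .sortBy { case (radio, n) => (-n, radio) }

  // Made-up sample data, for illustration only.
  val sample: Seq[ArtistRow] = Seq(
    ArtistRow("Artist A", Seq("indie pop", "synthpop"), "Song 1", "nova"),
    ArtistRow("Artist B", Seq("synthpop", "electronic"), "Song 2", "nova"),
    ArtistRow("Artist C", Seq("pop"), "Song 3", "nrj")
  )
}
```

An artist with an empty genre list simply disappears after the `flatMap`, which is also how `explode` treats empty arrays.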
  • 32. 32
  • 33. Is Skyrock really “first on rap” ? SELECT COUNT(genres) AS number_of_hip_hop_songs, genres, radio FROM genres WHERE genres LIKE '%rap%' OR genres LIKE '%hip%' OR genres LIKE '%hop%' GROUP BY genres, radio HAVING COUNT(genres) > 50 ORDER BY number_of_hip_hop_songs DESC 33
  • 34. Is Skyrock really “first on rap” ? 34
  • 35. Songs duration distribution SELECT ROUND( (COUNT(t.*) / subTotal.total_radio * 100),2) AS percentage_of_songs, subTotal.total_radio, FLOOR((duration_ms / 1000 ) / 60) AS minute, ROUND( (((duration_ms / 1000 ) % 60)) / 10) * 10 AS second, t.radio FROM AudioFeatureArtistTrackRadios t JOIN ( SELECT count(*) AS total_radio, radio FROM AudioFeatureArtistTrackRadios GROUP BY radio ) AS subTotal ON subTotal.radio = t.radio GROUP BY 1, 2, 3, 4 ORDER BY minute, second 35
  • 37. Percentage of music by day SELECT AVG(number_songs_broadcasted) * 3.3 / (24 * 60) * 100 AS percent_of_music, radio FROM ( SELECT COUNT(*) AS number_songs_broadcasted, DATE_FORMAT(CAST(timestamp AS timestamp),'yyyy-MM-dd') AS date, radio FROM nrjnova GROUP BY DATE_FORMAT(CAST(timestamp AS timestamp),'yyyy-MM-dd'), radio HAVING COUNT(*) > 0 -- avoid radio stations’ system bug ORDER BY date ) GROUP BY radio (3.3 = average song duration in minutes, 24 * 60 = total minutes by day) 37
  • 38. Spark SQL - Percentage of music by day 38
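The arithmetic behind the percentage can be checked by hand with a small plain-Scala sketch. The 3.3-minute average song duration comes from the slide; `MusicShare`, `percentOfMusic` and the daily counts used below are hypothetical, not figures from the dataset.

```scala
object MusicShare {
  val AvgSongMinutes: Double = 3.3 // average song duration in minutes, from the slide
  val MinutesPerDay: Int = 24 * 60 // total minutes in a day

  // songsPerDay: number of broadcasts counted for each day of one radio station.
  def percentOfMusic(songsPerDay: Seq[Int]): Double = {
    // Mirrors HAVING COUNT(*) > 0: ignore days without any recorded broadcast.
    val validDays = songsPerDay.filter(_ > 0)
    require(validDays.nonEmpty, "need at least one day with broadcasts")
    val avgSongsPerDay = validDays.sum.toDouble / validDays.size
    avgSongsPerDay * AvgSongMinutes / MinutesPerDay * 100
  }
}
```

With a hypothetical average of 300 songs per day, this gives 300 * 3.3 / 1440 * 100, roughly 69% of the day spent playing music.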
  • 39. What’s an average Monday ? SELECT ROUND(AVG(number_of_tracks)) AS number_of_tracks, radio, hour FROM ( SELECT COUNT(*) AS number_of_tracks, weekofyear( CAST(timestamp as timestamp)) AS week_number, CAST(DATE_FORMAT(CAST(timestamp as timestamp),'k') AS int) AS hour, radio FROM nrjnova WHERE DATE_FORMAT(CAST(timestamp as timestamp),'EEEE') = "Monday" GROUP BY weekofyear( CAST(timestamp as timestamp)), DATE_FORMAT(CAST(timestamp as timestamp),'k'), radio HAVING COUNT(*) > 0 -- avoid radio stations’ system bug ) GROUP BY hour, radio ORDER BY hour 39
  • 40. What’s an average Monday ? 40
  • 41. How many minutes of advertising? 41
  • 42. Windowing query example - Most broadcasted songs SELECT COUNT(*), n.title, n.artist, n.radio, rank, month, year FROM ( SELECT title, artist, radio,number_of_broadcast, dense_rank() OVER (PARTITION BY radio ORDER BY number_of_broadcast DESC) AS rank FROM ( SELECT COUNT(*) AS number_of_broadcast, title, artist, radio FROM nrjnova GROUP BY title, artist, radio ) tmp ) top10 JOIN nrjnova n ON top10.title = n.title AND top10.artist = n.artist AND top10.radio = n.radio WHERE rank <= 2 GROUP BY n.title, n.artist, n.radio, rank, month, year ORDER BY month 42
  • 43. Windowing query example - Most broadcasted songs 43
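To see what `dense_rank() OVER (PARTITION BY radio ORDER BY number_of_broadcast DESC)` computes, here is a minimal plain-Scala sketch of the same logic on an in-memory collection. It does not use Spark; `TopSongs`, `Play`, `Ranked` and the sample plays are hypothetical. Songs with the same play count share a rank, and the next distinct count gets the next rank with no gap.

```scala
object TopSongs {
  final case class Play(title: String, artist: String, radio: String)
  final case class Ranked(radio: String, title: String, artist: String, plays: Int, rank: Int)

  def denseRankByRadio(plays: Seq[Play]): Seq[Ranked] = {
    // Inner query: COUNT(*) ... GROUP BY title, artist, radio
    val counts: Seq[((String, String, String), Int)] =
      plays.groupBy(p => (p.radio, p.title, p.artist)).map { case (k, v) => (k, v.size) }.toSeq

    counts
      .groupBy { case ((radio, _, _), _) => radio } // PARTITION BY radio
      .toSeq
      .flatMap { case (radio, songs) =>
        // Distinct play counts, highest first: position + 1 is the dense rank.
        val levels = songs.map(_._2).distinct.sorted(Ordering[Int].reverse)
        songs.map { case ((_, title, artist), n) =>
          Ranked(radio, title, artist, n, levels.indexOf(n) + 1)
        }
      }
      .sortBy(r => (r.radio, r.rank, r.title))
  }

  // Equivalent of WHERE rank <= maxRank
  def topRanked(plays: Seq[Play], maxRank: Int): Seq[Ranked] =
    denseRankByRadio(plays).filter(_.rank <= maxRank)

  // Made-up sample: two songs tied at 3 plays on one station.
  val sample: Seq[Play] =
    Seq.fill(3)(Play("Song A", "Artist X", "nova")) ++
      Seq.fill(3)(Play("Song B", "Artist Y", "nova")) ++
      Seq(Play("Song C", "Artist Z", "nova"), Play("Song D", "Artist W", "nrj"))
}
```

In the sample, Song A and Song B both get rank 1 and Song C gets rank 2, so a filter such as `rank <= 2` can return more than two songs per station when there are ties.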
  • 44. Similarities between radio stations with unidirectional inequality SELECT COUNT(DISTINCT n1.artist, n1.title) AS number_of_similar_songs, CONCAT(n1.radio, "-", n2.radio) AS radios, n1.radio AS radio_1, ROUND(COUNT(DISTINCT n1.artist, n1.title) / number_of_song_radio_1 * 100) AS percent_radio_1, number_of_song_radio_1, n2.radio as radio_2, ROUND(COUNT(DISTINCT n1.artist, n1.title) / number_of_song_radio_2 * 100) as percent_radio_2, number_of_song_radio_2 FROM nrjnova n1 JOIN nrjnova n2 ON n1.radio < n2.radio AND LOWER(n1.artist)=LOWER(n2.artist) AND LOWER(n1.title)=LOWER(n2.title) GROUP BY n1.radio, n2.radio, number_of_song_radio_1, number_of_song_radio_2 ORDER BY number_of_similar_songs DESC 44
  • 45. Similarities between radio stations with unidirectional inequality JOIN radio n2 ON n1.radio <> n2.radio → each pair appears twice ● (nova, virgin) ● (virgin, nova) JOIN radio n2 ON n1.radio < n2.radio → each pair appears once ● (nova, virgin) 45
  • 46. Similarities between radio stations with unidirectional inequality 46
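The effect of the join condition on the station pairs can be reproduced with a plain-Scala for-comprehension. This is only a sketch of the pairing logic, with hypothetical names (`RadioPairs`, `bothDirections`, `oneDirection`); the station names are the four from the deck.

```scala
object RadioPairs {
  val radios: Seq[String] = Seq("nova", "nrj", "skyrock", "virgin")

  // n1.radio <> n2.radio : every ordered pair, so each combination appears twice.
  val bothDirections: Seq[(String, String)] =
    for (a <- radios; b <- radios if a != b) yield (a, b)

  // n1.radio < n2.radio : one pair per combination, and never a station with itself.
  val oneDirection: Seq[(String, String)] =
    for (a <- radios; b <- radios if a < b) yield (a, b)
}
```

With 4 stations the first condition yields 12 pairs and the second yields 6, which halves the join output and removes mirrored rows from the result.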
  • 47. Common songs between our 4 radios ? 4 joins ??? → Nope 47
  • 48. Common songs between our 4 radios ? SELECT LOWER(title) as Title, LOWER(artist) as Artist, COUNT(DISTINCT (radio)) FROM nrjnova GROUP BY LOWER(title), LOWER(artist) HAVING COUNT(DISTINCT (radio)) = ( -- 4, because we have 4 different radios SELECT MAX (count) FROM ( SELECT COUNT(DISTINCT (radio)) as count, LOWER(title), LOWER(artist) FROM nrjnova GROUP BY LOWER(title), LOWER(artist) ) ) 48
  • 49. Common songs between radios ? Prince — Kiss C2C — Happy Stromae — Formidable 49
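The "no joins" approach can be sketched in plain Scala: group by lower-cased title and artist, count the distinct stations per song, and keep the songs that reach the maximum. This is an illustration on a hypothetical in-memory collection (`CommonSongs`, `Play`, `onMostRadios` and the sample are made up), not the deck's Spark job.

```scala
object CommonSongs {
  final case class Play(title: String, artist: String, radio: String)

  // Songs (lower-cased title, artist) broadcast on the largest number of distinct radios.
  def onMostRadios(plays: Seq[Play]): Set[(String, String)] = {
    // GROUP BY LOWER(title), LOWER(artist) with COUNT(DISTINCT radio)
    val radiosPerSong: Map[(String, String), Int] =
      plays
        .groupBy(p => (p.title.toLowerCase, p.artist.toLowerCase))
        .map { case (song, ps) => song -> ps.map(_.radio).distinct.size }

    if (radiosPerSong.isEmpty) Set.empty
    else {
      val maxRadios = radiosPerSong.values.max // the subquery's MAX(count)
      radiosPerSong.collect { case (song, n) if n == maxRadios => song }.toSet
    }
  }

  // Made-up sample: "Song X" is on all four stations, with inconsistent casing.
  val sample: Seq[Play] = Seq(
    Play("Song X", "Artist X", "nova"),
    Play("SONG X", "artist x", "nrj"),
    Play("song x", "Artist X", "virgin"),
    Play("Song X", "Artist X", "skyrock"),
    Play("Song Y", "Artist Y", "nova"),
    Play("Song Y", "Artist Y", "nrj")
  )
}
```

Comparing against the maximum rather than a hard-coded 4 keeps the logic valid if a station is added, at the cost of returning the "most shared" songs even when no song is on every station.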
  • 50. Spark SQL - Case statement SELECT CASE artist WHEN "Drake" THEN "New drake name" ELSE artist END AS artist, title, radio FROM nrjnova 50
  • 51. Resources ● Demo’s Notebook available here ● “Terra Data” exhibition at Cité des sciences, Paris ● EPFL Spark Intro from Heather Miller ● Deep Dive into Spark SQL’s Catalyst Optimizer ● Mastering Apache Spark 2 by Jacek Laskowski ● Unsplash: copyright-free HD picture platform 51
  • 52. Bonus - Spotify Playlists ~200 most broadcasted songs in 2016 for each radio : ● “Radio Nova Top 2016” with Calypso Rose, Kaytranada, The Roots, M.I.A... ● “Skyrock Top 2016” with Drake, Major Lazer, Timberlake, Soprano, PNL, Jul… ● “Virgin Top 2016” with Imany, Twenty One Pilots, Sia, Kungs, Julian Perretta… ● “NRJ top 2016” with Enrique Iglesias, Soprano, Coldplay, Kungs, Amir, MHD, Tal 52