Paris Spark Meetup - May 2017
Video : https://www.youtube.com/watch?v=w5Zd-1wIJrU
Ad hoc analysis of radio station broadcasts stored in Parquet files, using plain SQL and the Dataframe API.
The aim was to spot radio stations' habits and differences, and to find out whether radio station brainwashing is a thing.
This talk's Databricks notebook can be found here : https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6937750999095841/3645330882010081/6197123402747553/latest.html
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Databricks
1. Analyze radio station broadcasts with Apache Spark SQL, Spotify, and Databricks
Spark User Group Paris - May 2017
Galenki, Russia
2. 1. Spark SQL
      a. Dataset API
      b. Parquet
      c. Databricks
   2. Data extraction
   3. Data exploration
3. Paul Leclercq
@polomarcus
Ad tech for 3 years at :
Data Engineer
● Spark : Streaming, SQL, MLlib
● Scala
● Kafka
● NoSQL
Looking for his dream job in data, in the music/sport/cool stuff industry :)
5. Data people
Engineer: stores and indexes high volumes of raw data, implements machine learning algorithms
Tools: Hadoop, Amazon S3, Kafka, RabbitMQ, Spark, Flink, Beam, Drill, Druid, NoSQL DBs (Cassandra, Redis, Aerospike)
Scientist: PhD or mathematics degree; builds machine learning algorithms that can predict business actions
Machine learning/statistics tools: Scikit-learn, MLlib
Business Analyst: uses the data provided for business purposes
Tools with a UI: Excel, Chart.io, Talend, Superset, Pivot
6. Why I love Spark
“Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.”
Scalable : ops and code
A unified distributed engine to process data : batch, streaming, ML
8. Spark SQL
“Apache Spark's module for working with structured data”
● Access a variety of data sources : Hive, Avro, Parquet, ORC, JSON, JDBC.
● Plug in Tableau, Chart.io, Power BI, Excel… thanks to the JDBC or ODBC driver
● Close to ANSI SQL:2003
● Dataframe / Dataset
○ Since Spark 2.0, the primary Machine Learning API
○ Also used in Structured Streaming (still ALPHA in Spark 2.1)
9. Spark SQL - RDD and Dataset (and Dataframe)
RDD = strong typing, lambda functions, DAG
Dataset = built on top of RDDs + optimized execution engine + in-memory columnar storage + convenient access to data by column name : ds.map(_.myColumn)
Dataframe = Dataset[Row]
From “High performance Spark” by Holden Karau
Databricks’ blog
10. Spark SQL - RDD and Dataset
Plain SQL Query or Dataset API
spark.sql("""
  SELECT title, artist
  FROM datasetTable
""")
dataset.select($"title", $"artist")
11. Spark SQL - Catalyst Query Optimizer
● General tree transformation framework : Scala’s abstract syntax tree (AST)
● Let the optimizer do the hard work : optimizations happen as late as possible
● Read as little data as possible : partitions, columnar format, statistics metadata (min, max, dictionary), predicate pushdown into the storage system (e.g. a Postgres-specific query)
Protip: spark.sql(SQL_QUERY).explain(extended = true) or the Spark UI SQL page
12. Spark SQL - Catalyst Query Optimizer
● No language jealousy : all of Spark’s Dataset APIs, whatever the language, have the same performance
13. Storage : Parquet
● Columnar storage
● Optimized I/O
○ Column pruning
○ Predicate pushdown (stats filter : size, max, min, dictionary)
● Popular and interoperable, supported by many other data processing systems
● Supports schema evolution, nullable=true
● Simple to use with Spark
○ df.write.format("parquet").save("nrjnovavirginskyrock.parquet")
○ spark.read.parquet("nrjnovavirginskyrock.parquet")
○ df.write.partitionBy("radio").parquet("radioPartitionedByRadio.parquet")
14. Protips:
● For your test jobs:
○ df.write.mode(SaveMode.Overwrite).save("test.parquet") (needs import org.apache.spark.sql.SaveMode)
○ Otherwise they can fail because the file already exists
● Learn from the best
○ Parquet’s Julien Le Dem : How to use Parquet
○ Netflix’s Ryan Blue : Parquet performance tuning: the missing guide
15. Databricks - What’s awesome about it?
● Collaboration via notebooks
● Free community edition with a 6 GB RAM server, ready to go : https://community.cloud.databricks.com/
● Awesome and simple data viz
And also:
● Mixing languages in a notebook, including Markdown (see demo later)
● Cost management (AWS Spot instances)
● REST API, Jobs, Security...
What about an open source solution?
● Notebooks : Apache Zeppelin
● Managed Spark clusters on AWS or GCP
17. Getting the radio stations data - Scala scraper
From “what was this title?” HTML pages or REST API:
● http://www.nrj.fr/chansons-diffusees?__postedForm=broadcastedhitdate&date=1970/01/01 00:00
● http://www.novaplanet.com/radionova/cetaitquoicetitre/$timestamp
● https://www.virginradio.fr/cetait-quoi-ce-titre?date=1970-01-01&hour=00&minute=00
● http://skyrock.fm/api/v3/sound?search_date=1970-01-01&search_hour=00:00
A good real-life experience of data extraction :
● Slow or fast servers
● Different semantics : "Artist 1 & Artist 2", "Artist 1 AND Artist 2" or "Artist 1 / Artist 2"
● Different formats : HTML page / JSON
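The separator differences listed above have to be normalized before artists can be compared across stations. Here is a minimal sketch in plain Scala, assuming only the three variants named on this slide ("&", "AND", "/"). `ArtistSplitter` and `splitArtists` are hypothetical names for illustration, not the scraper's actual code.

```scala
object ArtistSplitter {
  // Assumed separators: "&" or "and" surrounded by spaces (any case), or a slash.
  private val separator = """(?i)\s+(?:&|and)\s+|\s*/\s*""".r

  /** Splits a raw artist credit into trimmed, non-empty artist names. */
  def splitArtists(raw: String): List[String] =
    separator.split(raw).map(_.trim).filter(_.nonEmpty).toList
}
```

A band name that itself contains a separator, such as Earth Wind & Fire, would be split wrongly by this rule, so a real scraper would probably need an exception list.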
18. Data from the radio stations
case class Song(timestamp:Int, humanDate:Long, year:Int, month:Int, day:Int,
hour:Int, minute: Int, artist:String, allArtists: String, title:String, radio:String)
val dataset = spark.read.myformat("myfile").as[Song]
dataset.show() or display(dataset) on Databricks:
19. Data from the radio stations
dataset.show()
dataset.show(numberOfRows, truncate = false)
20. Spotify Web API : https://developer.spotify.com/web-api/console/
● Audio features of a track : danceability, positiveness, energy
● Artist : music genre
● Search a track
Positiveness/Valence: September — Earth Wind & Fire, Ska-Boo-Da-Ba — The Skatalites or Hey Ya! — OutKast
Danceability: Trick Me — Kelis, Around the world — Daft Punk or Anaconda — Nicki Minaj
Energy : We Are Your Friends - JUSTICE, Steppin’ stone - Davy Jones, Jerk It Out — Caesars
21. Number of songs * (artist + track + audio features) = 24K requests
→ Avoid surprises : always think about how large your data is before performing an action
● Is the destination server’s disk big enough? Is the server powerful enough?
● Third-party rate limits? Will other applications need this service too?
● Network cost?
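The request count above can be checked with a few lines of plain Scala before launching the extraction. This is only a sketch: the 8K distinct songs come from the dataset figures in this deck, while the rate of 10 requests per second is a made-up value for illustration, not Spotify's actual limit. `RequestBudget` is a hypothetical helper name.

```scala
object RequestBudget {
  /** Total calls when each distinct song needs one call per lookup kind. */
  def totalRequests(distinctSongs: Long, lookupsPerSong: Int): Long =
    distinctSongs * lookupsPerSong

  /** Minutes needed to send all requests at a fixed rate (requests per second). */
  def minutesAtRate(requests: Long, requestsPerSecond: Double): Double =
    requests / requestsPerSecond / 60.0
}

// 8K songs, 3 lookups each (artist, track, audio features)
// RequestBudget.totalRequests(8000L, 3)       // 24000
// RequestBudget.minutesAtRate(24000L, 10.0)   // 40.0 minutes at the assumed rate
```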
22. Data from Spotify
dataframe.show() / display(dataframe) on Databricks
Why dataframe and not dataset? → dataframe.printSchema
24. Not really big data… and I am ok with that!
300K+ rows of broadcasts of 8K different songs
● Nova : 95K broadcasts of 5000 different songs
● NRJ : 50K broadcasts of 800 different songs
● Virgin : 60K broadcasts of 1200 different songs
● Skyrock : 100K broadcasts of 1000 different songs
Protip : dataset.sample(withReplacement, fraction)
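25. How many songs per day?
-- 'yyyy' is the calendar year ('Y' would be the week-based year)
SELECT COUNT(*) AS number_songs_broadcasted,
       DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd') AS date,
       radio
FROM nrjnova
GROUP BY DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd'), radio
ORDER BY date
Dataframe API
import org.apache.spark.sql.functions.date_format
nrjnova.select(date_format($"timestamp".cast("timestamp"), "yyyy-MM-dd").alias("date"), $"radio")
  .groupBy($"date", $"radio") // group on the derived date column
  .count()
  .orderBy($"date".asc) // sort after aggregating, like the SQL ORDER BY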
29. Music genres by radio
Genre info by artist only → ["alternative dance","chamber pop","dance-punk","electronic","garage rock","indie pop","indie r&b","indie rock","indietronica","new rave","synthpop"]
import org.apache.spark.sql.functions.explode
// explode turns each element of the genres array into its own row
val genres = TrackArtistAudioFeature
  .select($"name", explode($"genres"), $"tracks.name", $"radio")
  .toDF("artist", "genres", "title", "radio")
genres.createOrReplaceTempView("genres")
genres.cache()
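What explode does to the genres array can be pictured with plain Scala collections: it behaves much like a flatMap that emits one row per array element. The sketch below is an analogy only and does not use Spark; `ExplodeSketch`, `Artist` and `explodeGenres` are hypothetical names.

```scala
object ExplodeSketch {
  case class Artist(name: String, genres: List[String])

  /** One (artist, genre) row per genre, like explode on an array column. */
  def explodeGenres(artists: List[Artist]): List[(String, String)] =
    artists.flatMap(a => a.genres.map(g => (a.name, g)))
}
```

As with Spark's explode, an artist with an empty genre list produces no row at all, which is worth remembering when counting songs per genre afterwards.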
30. Music genres by radio
SELECT COUNT(DISTINCT genres) AS number_of_genres, radio
FROM genres
GROUP BY radio
ORDER BY number_of_genres DESC
33. Is Skyrock really “first on rap” ?
SELECT COUNT(genres) AS number_of_hip_hop_songs, genres, radio
FROM genres
WHERE genres LIKE '%rap%' OR genres LIKE '%hip%' OR genres LIKE '%hop%'
GROUP BY genres, radio
HAVING COUNT(genres) > 50
ORDER BY number_of_hip_hop_songs DESC
35. Songs duration distribution
SELECT ROUND((COUNT(t.*) / subTotal.total_radio * 100), 2) AS percentage_of_songs,
       subTotal.total_radio,
       FLOOR((duration_ms / 1000) / 60) AS minute,
       ROUND(((duration_ms / 1000) % 60) / 10) * 10 AS second, -- seconds rounded to the nearest 10
       t.radio
FROM AudioFeatureArtistTrackRadios t
JOIN (
  SELECT COUNT(*) AS total_radio, radio
  FROM AudioFeatureArtistTrackRadios
  GROUP BY radio
) AS subTotal
ON subTotal.radio = t.radio
GROUP BY 2, 3, 4, 5 -- every non-aggregated column (position 1 is the aggregate)
ORDER BY minute, second
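The minute and second expressions in the query turn each duration into a (minute, 10-second) bucket. A plain Scala sketch of the same arithmetic, with the hypothetical helper name `DurationBuckets.bucket`, makes the bucketing easier to check by hand:

```scala
object DurationBuckets {
  /** Returns (minute, second) where second is rounded to the nearest multiple of 10. */
  def bucket(durationMs: Long): (Int, Int) = {
    val totalSeconds = durationMs / 1000.0
    val minute = math.floor(totalSeconds / 60).toInt
    val second = (math.round((totalSeconds % 60) / 10) * 10).toInt
    (minute, second)
  }
}

// DurationBuckets.bucket(215000L) // (3, 40): 3 min 35 s rounds up to the 40 s bucket
// DurationBuckets.bucket(238000L) // (3, 60): 3 min 58 s lands in a "60 s" bucket
```

The second call shows a side effect of rounding rather than flooring: durations in the last five seconds of a minute fall into a bucket labelled 60 instead of rolling over to the next minute.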
37. Percentage of music by day
-- 3.3 = average song duration in minutes, 24 * 60 = total minutes by day
SELECT AVG(number_songs_broadcasted) * 3.3 / (24 * 60) * 100 AS percent_of_music,
       radio
FROM (
  SELECT COUNT(*) AS number_songs_broadcasted,
         DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd') AS date,
         radio
  FROM nrjnova
  GROUP BY DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd'), radio
  HAVING COUNT(*) > 0 -- avoid radio stations’ system bug
  ORDER BY date
)
GROUP BY radio
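The formula used in the query (average songs per day, times 3.3 minutes per song, divided by the 1440 minutes of a day) can be reproduced in plain Scala. This is a sketch only: the 300 songs per day in the usage comment is a made-up input, not a figure from the dataset, and `MusicShare` is a hypothetical name.

```scala
object MusicShare {
  val AverageSongMinutes = 3.3     // same constant as in the query
  val MinutesPerDay      = 24 * 60 // 1440

  /** Percentage of a day filled with music, given the average number of songs aired per day. */
  def percentOfMusic(avgSongsPerDay: Double): Double =
    avgSongsPerDay * AverageSongMinutes / MinutesPerDay * 100
}

// MusicShare.percentOfMusic(300) // about 68.75 percent of the day
```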
39. What’s an average Monday?
SELECT ROUND(AVG(number_of_tracks)) AS number_of_tracks, radio, hour
FROM (
  SELECT COUNT(*) AS number_of_tracks,
         weekofyear(CAST(timestamp AS timestamp)) AS week_number,
         CAST(DATE_FORMAT(CAST(timestamp AS timestamp), 'k') AS int) AS hour,
         radio
  FROM nrjnova
  WHERE DATE_FORMAT(CAST(timestamp AS timestamp), 'EEEE') = "Monday"
  GROUP BY weekofyear(CAST(timestamp AS timestamp)), DATE_FORMAT(CAST(timestamp AS timestamp), 'k'), radio
  HAVING COUNT(*) > 0 -- avoid radio stations’ system bug
)
GROUP BY hour, radio
ORDER BY hour
42. Windowing query example - Most broadcasted songs
SELECT COUNT(*), n.title, n.artist, n.radio, rank, month, year
FROM (
  SELECT title, artist, radio, number_of_broadcast,
         dense_rank() OVER (PARTITION BY radio ORDER BY number_of_broadcast DESC) AS rank
  FROM (
    SELECT COUNT(*) AS number_of_broadcast, title, artist, radio
    FROM nrjnova
    GROUP BY title, artist, radio
  ) tmp
) top10
JOIN nrjnova n
ON top10.title = n.title AND top10.artist = n.artist AND top10.radio = n.radio
WHERE rank <= 2
GROUP BY n.title, n.artist, n.radio, rank, month, year
ORDER BY month
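The dense_rank() window function gives tied rows the same rank and never skips a rank, which is why `rank <= 2` can return more than two songs per radio. Below is a plain Scala sketch of that ranking for a single partition (one radio). `DenseRank.denseRank` is a hypothetical helper, not part of Spark.

```scala
object DenseRank {
  /** Dense rank by descending count: ties share a rank and no rank is skipped. */
  def denseRank(counts: Seq[(String, Int)]): Seq[(String, Int, Int)] = {
    // Map each distinct count to its 1-based position in descending order.
    val rankOf: Map[Int, Int] = counts.map(_._2).distinct
      .sorted(Ordering[Int].reverse)
      .zipWithIndex
      .map { case (count, idx) => count -> (idx + 1) }
      .toMap
    counts.sortBy(c => -c._2).map { case (title, count) => (title, count, rankOf(count)) }
  }
}

// DenseRank.denseRank(Seq("a" -> 10, "b" -> 10, "c" -> 7, "d" -> 3))
// ranks: a = 1, b = 1, c = 2, d = 3, so "rank <= 2" keeps a, b and c
```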
44. Similarities between radio stations with unidirectional inequality
SELECT COUNT(DISTINCT n1.artist, n1.title) AS number_of_similar_songs,
       CONCAT(n1.radio, "-", n2.radio) AS radios,
       n1.radio AS radio_1,
       ROUND(COUNT(DISTINCT n1.artist, n1.title) / number_of_song_radio_1 * 100) AS percent_radio_1,
       number_of_song_radio_1,
       n2.radio AS radio_2,
       ROUND(COUNT(DISTINCT n1.artist, n1.title) / number_of_song_radio_2 * 100) AS percent_radio_2,
       number_of_song_radio_2
FROM nrjnova n1
JOIN nrjnova n2
ON n1.radio < n2.radio AND LOWER(n1.artist) = LOWER(n2.artist) AND LOWER(n1.title) = LOWER(n2.title)
GROUP BY n1.radio, n2.radio, number_of_song_radio_1, number_of_song_radio_2
ORDER BY number_of_similar_songs DESC
45. Similarities between radio stations with unidirectional inequality
JOIN radio n2 ON n1.radio <> n2.radio →
● (nova, virgin)
● (virgin, nova)
JOIN radio n2 ON n1.radio < n2.radio →
● (nova, virgin)
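The effect of the strict inequality in the self-join can be reproduced with plain Scala collections: comparing station names with `<` keeps only one orientation of each pair. This is a sketch using the four station names from this deck; `RadioPairs` is a hypothetical name.

```scala
object RadioPairs {
  val radios = List("nova", "nrj", "skyrock", "virgin")

  // Every ordered pair of different stations: (nova, virgin) AND (virgin, nova).
  val orderedPairs: List[(String, String)] =
    for { a <- radios; b <- radios if a != b } yield (a, b)

  // Strict "<" keeps one orientation only, so each couple of stations appears once.
  val uniquePairs: List[(String, String)] =
    for { a <- radios; b <- radios if a < b } yield (a, b)
}

// RadioPairs.orderedPairs.size // 12
// RadioPairs.uniquePairs.size  // 6
```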
48. Common songs between our 4 radios ?
SELECT LOWER(title) AS Title, LOWER(artist) AS Artist, COUNT(DISTINCT (radio))
FROM nrjnova
GROUP BY LOWER(title), LOWER(artist)
HAVING COUNT(DISTINCT (radio)) = ( -- 4, because we have 4 different radios
  SELECT MAX(count)
  FROM (
    SELECT COUNT(DISTINCT (radio)) AS count, LOWER(title), LOWER(artist)
    FROM nrjnova
    GROUP BY LOWER(title), LOWER(artist)
  )
)
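The same idea, keeping the songs whose number of distinct radios equals the number of stations, can be sketched with plain Scala collections. Note that this is a variation on the query: it compares against the total number of distinct radios in the data rather than the MAX of per-song counts. `CommonSongs`, `Broadcast` and `airedEverywhere` are hypothetical names.

```scala
object CommonSongs {
  case class Broadcast(title: String, artist: String, radio: String)

  /** Lower-cased (title, artist) pairs that were aired on every radio present in rows. */
  def airedEverywhere(rows: Seq[Broadcast]): Set[(String, String)] = {
    val radioCount = rows.map(_.radio).distinct.size
    rows.groupBy(b => (b.title.toLowerCase, b.artist.toLowerCase))
      .filter { case (_, plays) => plays.map(_.radio).distinct.size == radioCount }
      .keySet
  }
}
```

Lower-casing the title and artist before grouping mirrors the LOWER() calls in the query, so "Hey Ya!" and "HEY YA!" count as the same song.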
50. Spark SQL - Case statement
SELECT CASE artist
WHEN "Drake"
THEN "New drake name"
ELSE artist END AS artist,
title, radio
FROM nrjnova
51. Resources
Demo’s Notebook available here
“Terra Data” exposition at Cité des sciences, Paris
EPFL Spark Intro from Heather Miller
Deep Dive into Spark SQL’s Catalyst Optimizer
Mastering Apache Spark 2 by Jacek Laskowski
Unsplash: copyrightless-HD-picture platform
52. Bonus - Spotify Playlists
~200 most broadcasted songs in 2016 for each radio :
● “Radio Nova Top 2016” with Calypso Rose, Kaytranada, The Roots, M.I.A...
● “Skyrock Top 2016” with Drake, Major Lazer, Timberlake, Soprano, PNL, Jul…
● “Virgin Top 2016” with Imany, Twenty One Pilots, Sia, Kungs, Julian Perretta…
● “NRJ top 2016” with Enrique Iglesias, Soprano, Coldplay, Kungs, Amir, MHD, Tal