Jere Nieminen
Service Architect – Elisa
Jere is an experienced architect specializing in video streaming technologies. He is currently working on making video streaming as smooth as possible for Elisa Viihde customers.
1. 19.12.2018
Anomaly Detection using ML in Elisa Viihde CDN
Jere Nieminen
13.12.2018
Elisa and Elisa Viihde
• Elisa
• Telecommunications, ICT and digital service company operating mainly in Finland and Estonia
• Over 2.8 million customers who have over 6.2 million subscriptions
• Elisa Viihde
• Finland’s most popular entertainment service
• Several original series and exclusive distribution rights for certain movies and series
• Linear TV channels, Network PVR, Catchup, TVOD/SVOD/EST
• More than 300 000 household subscribers
Elisa Viihde CDN
• Features focused on
• Streaming Video
• Cache/Network Optimization
• Team with 6 members focused on
• SW Development and integrations
• Daily operations
• QoS and QoE
High Level Architecture
Background - Elastic Stack 101
• Elasticsearch
• JSON data store with a RESTful API
• Beats & Logstash
• Ingest data to Elasticsearch
• Kibana
• Search and Visualize data in Elasticsearch
• Machine Learning (X-Pack)
• Anomaly Detection
Terminology
• Anomaly
• A deviation from normal behaviour
• Machine Learning
• Making predictions or decisions without being explicitly programmed to perform the task
• Unsupervised Anomaly Detection
• Searching an unlabeled data set for the instances that fit the rest of the data least, assuming that most of the data is normal
• In our use case, we let the machine learn from the data and detect anomalies, but do not allow it to carry out any "smart" tasks based on them
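To make the "unsupervised" idea concrete, here is a minimal Python sketch that flags outliers in unlabeled data using a robust z-score (median and median absolute deviation). This is not the algorithm Elastic ML uses; the function names, the threshold of 3.5 and the sample error counts are assumptions made purely for illustration.

```python
from statistics import median


def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or at least half of them) are identical: nothing stands out.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores
    # when the data is roughly normally distributed.
    return [0.6745 * (v - med) / mad for v in values]


def find_anomalies(values, threshold=3.5):
    """Return the indexes of values whose robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, s in enumerate(scores) if abs(s) > threshold]


# Hypothetical error counts per time bucket; no labels are provided.
errors_per_bucket = [12, 15, 11, 14, 13, 95, 12, 16, 13, 14]
print(find_anomalies(errors_per_bucket))  # [5]
```

The sketch only reports which buckets look unusual. Deciding what to do about them is left to people, in line with the last bullet above.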
History of Detecting Streaming Issues
[Timeline figure covering 2016, 2017 and 2018 by quarter (Q1 to Q4). Milestone labels: Early days; Logging Trials; Elastic with Access Logs; Streaming Session; Anomaly Detection Trials; Notifications]
Anomaly Detection in Action
Visual Dashboard - Incorrect caching configuration
Increasing daily error rate
Fix deployed
Reaction time
ML Detection Example - Broken Content
• Fragmented MP4 asset
• 1920x1080 @ 7 Mbps
• 1280x720 @ 4.5 Mbps
• 1024x576 @ 2 Mbps
• 640x360 @ 800 kbps
• 480x270 @ 300 kbps
Timeline
Timecode drift
ML Job Config
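The job configuration on this slide is a screenshot and is not reproduced in this text. As a rough idea of what an Elastic ML anomaly detection job body can look like, the Python sketch below builds one as a dictionary and prints it as JSON. The keys (analysis_config, bucket_span, detectors, influencers, data_description) and the high_count function come from the Elastic ML job format. The field names asset_id and cache_node, the 15 minute bucket span and the description are hypothetical and are not taken from the Elisa Viihde setup.

```python
import json


def build_job_config(bucket_span="15m"):
    """Build a hypothetical anomaly detection job body as a plain dict."""
    return {
        "description": "Hypothetical example: unusually high event counts per asset",
        "analysis_config": {
            # Length of the time buckets the events are aggregated into.
            "bucket_span": bucket_span,
            "detectors": [
                {
                    # Flags buckets with an unusually high number of events,
                    # modelled separately for each value of the partition field.
                    "function": "high_count",
                    "partition_field_name": "asset_id",  # assumed field name
                }
            ],
            # Fields reported as possible contributors to an anomaly.
            "influencers": ["asset_id", "cache_node"],  # assumed field names
        },
        "data_description": {"time_field": "@timestamp"},
    }


print(json.dumps(build_job_config(), indent=2))
```

A body like this would be sent to the Elasticsearch create anomaly detection job API, or the same settings entered through the Kibana job wizard.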
ML Detection Example - Network Issue
ML Job Config
ML Detection Example – RR Performance
ML Job Config
Production v1.0-52
Canary v1.0-53
Ask the Right Questions / Survivorship Bias
Image credit to Daniel G. Siegel
https://www.dgsiegel.net/talks/the-bullet-hole-misconception
Based on the server-side metrics, can we answer the following questions?
• Is the CDN performing well?
• Are the clients getting the best quality of experience?
ML Example - Anomalies from Client QoE data
ML Example – How to get fooled
Anomaly New normal
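One possible reading of this slide is that a detector which keeps learning from recent data will flag a sudden change once and then absorb it as the new baseline. The Python sketch below illustrates that effect with a simple rolling mean and standard deviation. It is not how Elastic ML models data; the window size, the threshold and the sample series are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def rolling_anomalies(values, window=5, threshold=3.0):
    """Flag points that deviate strongly from the preceding `window` points."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        # Guard against a zero standard deviation on perfectly flat history.
        sigma = max(stdev(history), 1e-9)
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical metric that jumps from about 10 to about 30 and stays there.
series = [10, 11, 9, 10, 10, 11, 30, 31, 29, 30, 30, 31, 30]
print(rolling_anomalies(series))  # [6]
```

Only the first point of the jump is flagged. After that the higher level is part of the history and counts as normal, so a persistent problem can go quiet unless someone investigates the initial anomaly.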
Key Takeaways
• Focus on the Data
• Logs
• Usually made for humans to read
• Log also the successful events
• Do all the processing (splitting, parsing, etc.) before storing
• Logging vs. Monitoring
• Needless battle
• Manual thresholds are still relevant
• Creating ML jobs is easy, but…
• Understanding the events is sometimes really hard
• Have a process for investigating all the anomalies
• Enhance the data set
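The advice about splitting and parsing logs before storing them can be sketched in Python as follows. The example assumes an access log in the common log format; the regular expression, the field names and the sample line are hypothetical and do not describe the actual Elisa Viihde CDN log format. In an Elastic Stack pipeline this kind of work would typically be done in Logstash or an ingest pipeline rather than in custom code.

```python
import re

# Assumed layout: client, two unused fields, [timestamp], "request", status, bytes.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_log_line(line):
    """Turn one human-readable log line into a structured document, or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    # Store numbers as numbers so they can be aggregated later.
    doc["status"] = int(doc["status"])
    doc["bytes"] = 0 if doc["bytes"] == "-" else int(doc["bytes"])
    # Successful requests are kept too; this flag makes error rates easy to compute.
    doc["is_error"] = doc["status"] >= 400
    return doc


line = (
    '192.0.2.10 - - [13/Dec/2018:10:15:32 +0200] '
    '"GET /vod/asset123/seg_42.m4s HTTP/1.1" 200 734003'
)
print(parse_access_log_line(line))
```

Keeping successful requests alongside errors gives the ML jobs a baseline to compare against, which is the point of the "Log also the successful events" bullet.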