Big Data Science
Hype?
Levente Török
Blinkbox Music Ltd ... GE Hungary
Disclaimer
All statements appearing in the slides or in the presentation represent my personal
opinion. They are not connected to any company or person I have or had a
connection with.
I make these statements with the reservation that they may be wrong.
Summary
- Big data? Data Science? Hype?
- Continuous improvement of Online Systems
- A/B testing
Data Science, hype?
Harvard Business Review in 2012
Data Science, hype?
Forbes in 2015
Whether employers know or don’t know what data scientists do, they have been
using—in rapidly-growing numbers—the term“data scientist” in job
descriptions in the past two years as Indeed.com’s data demonstrates.
Developers, developers ...
“Data Science” in media
Yahoo Finance:
“If you take a cue from the Harvard Business Review, the title goes to data
scientists. That’s right. Data scientist, as in, the type of person who can write
computer code and harness the power of big data to come up with innovative
ways to use it for companies like Google (GOOG), Amazon (AMZN), and
Yahoo! (YHOO).”
“Data Science” in media
Nature Jobs:
Data Science, what is this?
Wikipedia
“Data Science is the extraction of knowledge from data, which is a continuation
of the field data mining and predictive analytics”
Data? Science... ?
1) Big Data Engineer
- Hive, Yarn, Spark, Impala
2) Data Miner
- SAS, Knime, Rapid Miner, Weka,
IBM Clementine
3) Big Data & Data Miner
- Apache - Mahout
- Spark - MLlib, Spark - GraphX
- Apache - Giraph
- GraphLab ?
Data Scientist?
Big data - big failure:
If an algorithm doesn't work on small data, it won't work on big data.
4) Data Scientist is a real scientist:
Follows scientific principles in data modeling:
- conjectures a hypothesis about the statistical structure of the data
- validates it offline and online
- improves model iteratively
Tools: R / Python / C++
http://bit.ly/1B3bSS1
Tools: verdict
other -> R -> python = 0.44 * 0.26 = 0.11
other -> python -> R = 0.23 * 0.18 = 0.04
Is this correct?
However ... what?
Improving Online Systems
Examples
Recommender systems (i.e. RecSys)
What to listen to next?
What ad to display?
Anomaly detection:
Is this user/system behaviour “normal”?
Is this system going to fail soon?
Data Flow in Online Sys
Online sys -> log -> daily aggregation -> long-term storage -> batch model building
queue -> async online model updates
near-optimal online data model
The major difficulty
[Diagram labels: datasource, queue, daily aggregates, batch model training, online model training]
1. Batch model training starts at 4:00 and finishes at 4:30.
2. The new online model update starts at 4:30 and would finish at 5:10 with all the events
from 0:00 to 4:30, but new events have arrived in the meantime.
... -> streaming architectures
Offline data modelling
[Diagram labels: Train, Test, Model, Prediction, Parameters]
Offline modeling
1. Data splits for train / test / quiz
- time based: e.g. 2 weeks / 1 day / 1 day
- entity based: a set of users
- session based: a set of user sessions
Test data preparation:
- manual pos/neg sample data points labeled, or injected
2. Train by batch training
Given a data set, we try to fit the model to the data set controlled by model
parameters.
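The time-based split above could be sketched as follows. This is a minimal illustration, not the speaker's code: the function name `time_based_split`, the `(timestamp, payload)` event shape and the synthetic events are all assumptions; only the 2 weeks / 1 day / 1 day proportions come from the slide.

```python
from datetime import datetime, timedelta


def time_based_split(events, start, train_days=14, test_days=1, quiz_days=1):
    """Split (timestamp, payload) events into train / test / quiz by time."""
    train_end = start + timedelta(days=train_days)
    test_end = train_end + timedelta(days=test_days)
    quiz_end = test_end + timedelta(days=quiz_days)

    train, test, quiz = [], [], []
    for ts, payload in events:
        if start <= ts < train_end:
            train.append((ts, payload))
        elif train_end <= ts < test_end:
            test.append((ts, payload))
        elif test_end <= ts < quiz_end:
            quiz.append((ts, payload))
        # Events outside the 16-day window are ignored.
    return train, test, quiz


# Hypothetical data: one event every 6 hours for 16 days.
start = datetime(2015, 1, 1)
events = [(start + timedelta(hours=6 * i), f"event-{i}") for i in range(16 * 4)]
train, test, quiz = time_based_split(events, start)
print(len(train), len(test), len(quiz))
```

Splitting by time rather than at random keeps future events out of the training set, which matters when the model is later judged on online behaviour.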
Offline data modelling
3. Prediction phase: Given a model
- for each user seen in the training set, we give predictions
- for each event in the test set, we predict its likelihood
4. Evaluation phase: prediction and test data similarity is measured
- RecSys: NDCG, Recall, Precision, AUC, ... 20 different metrics
- Artificially labelled data set for anomaly detection: C2B (AUC),
weighted AUC ...
- Sanity check! -> Q/A team
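As an illustration of one of the ranking metrics named above, here is a small NDCG sketch. It uses the common definition DCG = sum of rel_i / log2(i + 1) over 1-based ranks and normalises by the DCG of the same list sorted ideally; whether the talk used this exact variant is not stated, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items of a ranked list."""
    # enumerate() is 0-based, so rank 0 is discounted by log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideally ordered list (0.0 if nothing is relevant)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of the recommended items in the order they were shown (hypothetical).
print(ndcg_at_k([1, 1, 0, 0], k=4))  # relevant items ranked first
print(ndcg_at_k([1, 0, 1, 0], k=4))  # one relevant item pushed down
```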
Offline data modelling
4. Parameter search in parallel
The output of the search is the parameter vector (+ model id) that
gives the best offline result, as far as we can tell.
NB: usually we are unsure which offline measure will best reflect the
online results, so we keep a number of optimal parameter vectors, one for
each offline measure.
A/B testing
Train_A -> Model_A -> Online pred_A -> Performance_A
Train_B -> Model_B -> Online pred_B -> Performance_B
??
Online performance tuning
Train_A -> Model_A -> Online pred_A -> Performance_A
Train_B -> Model_B -> Online pred_B -> Performance_B
Parameters
Online traffic split adj.
Train_A -> Model_A -> Online pred_A -> Performance_A
Train_B -> Model_B -> Online pred_B -> Performance_B
Offline-Online matching
Model   NDCG   AUC   ...   Avg Sess Len
A       1      1           1
B       2      3           3
C       3      2           2
Offline measures: NDCG, AUC, ...   Online measure: Avg Sess Len
Compare with Pearson's correlation coefficient.
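A minimal sketch of this comparison, assuming the table values are model ranks and that the three numbers per row belong to NDCG, AUC and Avg Sess Len in that order (the table was flattened in extraction, so this column assignment is an assumption). The helper `pearson` is written out by hand to keep the example dependency-free; Pearson's coefficient computed on ranks is Spearman's rank correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Ranks of models A, B, C under each measure (assumed reading of the table).
offline_ranks = {"NDCG": [1, 2, 3], "AUC": [1, 3, 2]}
online_avg_sess_len = [1, 3, 2]

scores = {name: pearson(ranks, online_avg_sess_len)
          for name, ranks in offline_ranks.items()}
best_offline_metric = max(scores, key=scores.get)
print(scores, best_offline_metric)
```

Under this reading, AUC orders the models exactly as the online measure does, while NDCG agrees only partly, so AUC would be the offline metric to optimise first. Three models are far too few to trust such a correlation in practice.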
On-line testing
5. A/B testing
- control model
- tested model (model with an offline optimal parameter set)
6. Evaluation of online results:
Measures:
- Session length, station length
- Return rate, CLTV
Filter and compare models -> wow!
On-line testing
7. Run many models one by one according to phase 4.
8. Figure out the best offline metrics:
Compare order statistics of offline and online models
(i.e. Pearson's correlation) to figure out which of the offline metrics matters the
most for online performance.
Model comparisons
Problems:
1. Day 1 A is better, Day 2 B is better
2. The version with the longest session length != the version with the highest
full play ratio of tracks
3. Outliers dominate the session length average:
- A number of users listen to the service “forever”
- Bouncing users pollute the session length average with high noise
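The outlier problem can be seen on a toy example. The session lengths below are invented for illustration (three bounces, six ordinary sessions and one user who left the player on for 24 hours), and the median is shown only as one possible robust alternative, not as the remedy proposed in the talk.

```python
from statistics import mean, median

# Hypothetical session lengths in minutes.
session_minutes = [0.1, 0.2, 0.1, 12, 15, 18, 20, 22, 25, 1440]

avg = mean(session_minutes)
med = median(session_minutes)
print(f"mean   = {avg:.2f} min")   # pulled far up by the single 24-hour session
print(f"median = {med:.2f} min")   # stays close to a typical session
```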
A/B testing
1. Version A: Control group
2. Version B: Treatment group
With n_A, n_B users, we have successes of k_A, k_B.
Is it enough if I compare k_A / n_A with k_B / n_B ?
A/B testing?
Questions:
- What if one day A wins and the next day B wins?
- How many users should I use for testing?
- How long should I run the test?
- What if we have A, B, C ... versions we want to test?
Classical Statistics
Hypothesis testing:
- Does treatment B have any effect?
- up to probability: (1-alpha)
- given: a sample size of N
Even the most well-known A/B testing platforms can lead you to illusory results.
Command: “Sample size estimator”
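What a sample size estimator computes can be sketched with the textbook normal-approximation formula for comparing two proportions. This is not the estimator of any particular platform; the function name, the default power of 0.8 and the example rates are assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_a, p_b, alpha=0.05, power=0.8):
    """Users needed per group to detect p_a vs p_b with a two-sided test.

    n = (z_{1-alpha/2} + z_{power})^2 * (p_a(1-p_a) + p_b(1-p_b)) / (p_a - p_b)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_a - p_b) ** 2)


# Hypothetical rates: detecting a lift from 10% to 12% success.
n_needed = sample_size_per_group(0.10, 0.12)
print(n_needed)
```

The required N grows with the inverse square of the difference to be detected, which is why small lifts need far more users than intuition suggests.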
Binomial ?
Note that:
Binomial distribution:
Beta distribution: where
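The formulas on this slide did not survive extraction. The standard definitions, which are presumably what was shown, are given below; the symbols k, n, p, a, b are the usual textbook ones and are an assumption about the slide's notation.

```latex
% Binomial distribution: probability of k successes in n trials with success rate p
P(k \mid n, p) = \binom{n}{k}\, p^{k} (1-p)^{n-k}

% Beta distribution: density over the success rate p
f(p; a, b) = \frac{p^{a-1} (1-p)^{b-1}}{B(a, b)},
\qquad \text{where } B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}

% Read as a function of p, the binomial likelihood has the shape of a Beta density:
P(k \mid n, p) \propto p^{(k+1)-1} (1-p)^{(n-k+1)-1}
\quad\Longrightarrow\quad
p \mid k, n \sim \mathrm{Beta}(k+1,\; n-k+1) \text{ under a uniform prior}
```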
New statistics
n_A = 150, k_A = 18
n_B = 145, k_B = 14
The major question:
Chance2beat:
[Plot: densities f_A(x;...) and f_B(x;...) over x]
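One way to estimate the chance that B beats A for the numbers above is Monte Carlo sampling from Beta distributions. The sketch assumes a uniform prior, so each success rate follows Beta(k + 1, n - k + 1); the prior, the function name and the number of draws are assumptions, and the slide's own formula is not recoverable.

```python
import random


def chance_to_beat(k_a, n_a, k_b, n_b, draws=100_000, seed=0):
    """Estimate P(p_B > p_A) by sampling both success rates from Beta posteriors."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(k_a + 1, n_a - k_a + 1)
        p_b = rng.betavariate(k_b + 1, n_b - k_b + 1)
        wins += p_b > p_a
    return wins / draws


c2b = chance_to_beat(k_a=18, n_a=150, k_b=14, n_b=145)
print(f"P(B beats A) ~ {c2b:.3f}")
```

With these counts the estimate lands well below 0.5, since B's observed rate (14/145) is lower than A's (18/150), but it is far from 0, so the data do not rule B out either.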
Chance 2 beat
- This is a probability that we want to increase by testing. For example:
- Can be:
- Gaussians,
- distributions w/ priors
- empirical distributions, or
- small sample size data sets directly
- Sometimes it is not enough: use bootstrapping!
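A bootstrap version of chance-to-beat works directly on the raw samples without assuming a distribution. The sketch below resamples each group with replacement and counts how often B's statistic exceeds A's; the function name, the choice of the mean as the statistic and the sample data are assumptions for illustration.

```python
import random
from statistics import mean


def bootstrap_chance_to_beat(sample_a, sample_b, stat=mean, rounds=2000, seed=0):
    """Fraction of bootstrap rounds in which stat(B resample) > stat(A resample)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(rounds):
        resample_a = rng.choices(sample_a, k=len(sample_a))  # draw with replacement
        resample_b = rng.choices(sample_b, k=len(sample_b))
        wins += stat(resample_b) > stat(resample_a)
    return wins / rounds


# Hypothetical session lengths (minutes) for control A and treatment B.
sessions_a = [10, 12, 11, 13, 9, 14, 10, 12]
sessions_b = [20, 22, 19, 25, 21, 23, 20, 24]
c2b_bootstrap = bootstrap_chance_to_beat(sessions_a, sessions_b)
print(c2b_bootstrap)
```

Passing `stat=median` instead of the default would make the comparison less sensitive to the "forever" listeners mentioned earlier.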
Thanks
