Morning class summary
Mercè Martín
BigML
Day 2
The Future of ML
José David Martín-Guerrero (IDAL, UV)
Machine learning project
All steps are connected and feedback is essential to succeed
Society has drifted to the Machine Learning way
social networks, data acquisition, technologies...
Feature engineering challenges
High space dimensionality (#features >>> #samples)
Input preparation: selection, transformation, or attacking the model directly
Modelling strategies: the paradox of choice
Too many algorithms and structures, no general-purpose one?
Too many configuration options, no automatic choice?
Select your model by its structure, parameters (tuning) or search algorithm
(e.g. deep learning: no feature engineering but hectic tuning; Azure: many choices)
Wish list: more automation
Workflows, model selection, tuning, representation, prediction strategies
The Future of ML
Existing techniques: Reinforcement Learning
Can the environment be defined as a state space?
Is the evolution of this space driven by a set of actions?
Is there a goal to be maximized in the long term?
If so, the problem is suitable for RL.
Key ingredients: prior experience, interaction, environment adaptation, policy.
So far applied mostly to synthetic problems and robotics, but also suitable for
marketing or medicine, and more to come!
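To make these ingredients concrete, here is a minimal tabular Q-learning sketch on an invented toy environment (the states, actions and reward are purely illustrative, not from the talk):

```python
import random

# Toy, invented environment: 5 states in a line, reward 1 only at the rightmost state.
N_STATES, ACTIONS = 5, [0, 1]            # action 0 = left, action 1 = right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration rate

def step(state, action):
    """Environment dynamics: move left/right; the episode ends at the goal state."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0), nxt == N_STATES - 1

def greedy(state):
    """Pick the best-known action, breaking ties at random."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for episode in range(500):
    state, done = 0, False
    for _ in range(100):                 # cap the episode length
        action = random.choice(ACTIONS) if random.random() < epsilon else greedy(state)
        nxt, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward the long-term (discounted) return
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt
        if done:
            break

print({s: greedy(s) for s in range(N_STATES)})   # learned policy: always move right
```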
Evaluating ML Algorithms II
GOLDEN RULE: never use the same example for training the model and evaluating it!!
What if you don't have so much data? Sample and repeat!
José Hernández-Orallo (UPV)
Over-fitting: too specific. Under-fitting: too general.
How can we detect them? By evaluating.
Evaluating ML Algorithms II
[Diagram: the data is split into training and test n times (n folds); each split learns a hypothesis h1…hn, which is evaluated on the held-out test part]
Cross-validation
o We take all possible combinations with n−1 folds for training and the remaining fold for test.
o The error (or any other metric) is calculated n times and then averaged.
o A final model is trained with all the data.
Bootstrapping
o We draw n samples with replacement for training and evaluate on the rest (out-of-bag). A quick sketch of both schemes follows below.
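A sketch of both resampling schemes, assuming scikit-learn and a placeholder dataset and model:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# n-fold cross-validation: each fold is held out once, the metric is averaged
scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold CV accuracy: %.3f" % scores.mean())

# Bootstrap evaluation: sample training instances with replacement,
# evaluate on the out-of-bag instances, and repeat
rng = np.random.default_rng(0)
boot_scores = []
for _ in range(50):
    train_idx = rng.integers(0, len(X), size=len(X))
    oob_idx = np.setdiff1d(np.arange(len(X)), train_idx)
    boot_scores.append(model.fit(X[train_idx], y[train_idx]).score(X[oob_idx], y[oob_idx]))
print("Bootstrap accuracy: %.3f" % np.mean(boot_scores))

# A final model is then trained with all the data
final_model = model.fit(X, y)
```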
Evaluating ML Algorithms II
Cost-sensitive evaluations: not all errors are equally costly
Total cost = sum of the Hadamard (element-wise) product: Cost matrix ∘ Confusion matrix
Cost matrix (rows = actual, columns = predicted):
              open        close
  OPEN        0€          100€
  CLOSE       2,000€      0€

Confusion matrices (rows = actual, columns = predicted):
  c1:  OPEN 300 | 500       CLOSE 200 | 99,000
  c2:  OPEN 0 | 0           CLOSE 500 | 99,500
  c3:  OPEN 400 | 5,400     CLOSE 100 | 94,100

Resulting matrices (cost ∘ confusion):
  c1:  OPEN 0€ | 50,000€    CLOSE 400,000€ | 0€      → TOTAL COST: 450,000€
  c2:  OPEN 0€ | 0€         CLOSE 1,000,000€ | 0€    → TOTAL COST: 1,000,000€
  c3:  OPEN 0€ | 540,000€   CLOSE 200,000€ | 0€      → TOTAL COST: 740,000€
External context: the set of classes (class distribution) and the cost estimation.
Confusion matrix & cost matrix can be characterized by just one number: the slope. The cost computation above is sketched in code below.
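The slide's cost computation, reproduced with NumPy (the matrices are the ones shown above):

```python
import numpy as np

# Cost matrix: rows = actual (OPEN, CLOSE), columns = predicted (open, close)
cost = np.array([[0, 100],
                 [2000, 0]])

# Confusion matrices for the three classifiers shown above
confusions = {
    "c1": np.array([[300, 500], [200, 99000]]),
    "c2": np.array([[0, 0], [500, 99500]]),
    "c3": np.array([[400, 5400], [100, 94100]]),
}

for name, conf in confusions.items():
    # Hadamard (element-wise) product, then sum all cells for the total cost
    total = int((cost * conf).sum())
    print(f"{name}: total cost = {total:,} EUR")
# c1: 450,000   c2: 1,000,000   c3: 740,000
```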
Evaluating ML Algorithms II
ROC (Receiver Operating Characteristic) analysis
Dynamic context (class distribution & cost matrix)
[ROC diagram: TPR on the y axis vs FPR on the x axis, both from 0 to 1]
o Given several classifiers:
 – We add the trivial (0,0) and (1,1) classifiers and construct the convex hull of their (FPR, TPR) points. The points on the edges are linear combinations of classifiers: p * Ca + (1 − p) * Cb.
 – The classifiers below the ROC convex hull are discarded.
 – The best classifier (from those remaining) will be selected at application time, according to the slope.
Probabilistic context: soft ROC analysis
A single classifier with probability-weighted predictions can generate a ROC curve by changing the score threshold (each threshold gives a new classifier on the ROC curve), as sketched below.
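A sketch of soft ROC analysis for a single probabilistic classifier, sweeping the score threshold; the data, model and slope value are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Probability-weighted predictions on the test set
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Each threshold on the score gives one (FPR, TPR) point, i.e. one crisp classifier
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))

# At application time, the operating condition (class distribution & costs) fixes a
# slope; the best point on the curve maximises TPR - slope * FPR
slope = 2.0
best = int(np.argmax(tpr - slope * fpr))
print("best threshold for slope %.1f: %.3f" % (slope, thresholds[best]))
```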
Evaluating ML Algorithms II
AUC (Area Under the ROC Curve)
For crisp classifiers AUC is equivalent to the macro-averaged accuracy.
AUC is a good metric for classifiers and rankers:
A classifier with high AUC is a good ranker.
It is also good for a (uniform) range of operating conditions:
A model with very good AUC will have good accuracy for all operating conditions.
A model with very good accuracy for one operating condition can have very bad accuracy for another operating condition.
A classifier with high AUC can have poor calibration (probability estimation).
Multidimensional classification? ROC becomes problematic, but AUC has been extended.
Regression? ROC has been extended, and the AUC corresponds to the error variance. (A quick check of the ranking interpretation of AUC follows below.)
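The "good ranker" reading of AUC can be checked directly: AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative. A small sketch with invented scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)                   # invented labels
scores = y * 0.5 + rng.normal(0, 0.5, size=500)    # invented, imperfect scores

# AUC as reported by scikit-learn
print("AUC:", roc_auc_score(y, scores))

# AUC as the probability that a positive is ranked above a negative (ties count half)
pos, neg = scores[y == 1], scores[y == 0]
rank_prob = ((pos[:, None] > neg[None, :]).mean()
             + 0.5 * (pos[:, None] == neg[None, :]).mean())
print("P(positive ranked above negative):", rank_prob)
```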
Cluster Analysis
K-means clustering (K=3)
Poul Petersen (BigML)
Unsupervised problem (unlabelled data).
Uses: customer segmentation, item discovery (types), association (profiles), recommenders, active learning (group and label)
Cluster Analysis
Distance and centers define the groups: K-means, but... things you need to tackle:
• What is the distance to a “missing value”? Replace missing values with defaults.
• What is the distance between categorical values? Map it to [0,1].
• What is the distance between text features? Vectorize and use cosine distance.
• Does it have to be Euclidean distance?
• Unknown “K”?
Problems: convergence (initial conditions), scaling of dimensions.
K-means: starting from a subset of K points, iteratively compute the distances of all data points to them and associate each point with the closest. Define the center of each group as the new set of K points and repeat until there is no improvement (a minimal sketch follows below).
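A minimal version of that loop, on synthetic data (no safeguards beyond keeping a center when its cluster empties):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # start from K data points
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign every point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # the center of each group becomes the new set of K points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                   # no improvement: stop
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.default_rng(i).normal(i * 5, 1.0, size=(100, 2)) for i in range(3)])
centers, labels = kmeans(X, k=3)
print(centers)
```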
Cluster Analysis
[Example on the slide: the same data clustered with K=5]
g-means clustering: increment K, checking whether each cluster's data looks Gaussian (a rough sketch follows below).
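A rough sketch of the g-means idea: split any cluster whose points, projected on the axis between its two tentative children, fail a Gaussianity test. The Anderson-Darling test, the minimum cluster size and the data below are assumptions for illustration:

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def g_means(X, max_k=20, crit_idx=2):
    """Grow K while some cluster's 1-D projection fails an Anderson-Darling
    normality test (crit_idx=2 -> the 5% critical value)."""
    centers = np.array([X.mean(axis=0)])
    while len(centers) < max_k:
        labels = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X).labels_
        new_centers, split = [], False
        for j in range(len(centers)):
            pts = X[labels == j]
            if len(pts) < 8:                       # too small to test, keep as is
                new_centers.append(pts.mean(axis=0))
                continue
            # split the cluster in two and project its points on the axis
            # joining the two child centers
            children = KMeans(n_clusters=2, n_init=5).fit(pts).cluster_centers_
            proj = pts @ (children[0] - children[1])
            result = anderson(proj)
            if result.statistic > result.critical_values[crit_idx]:
                new_centers.extend(children)       # not Gaussian: keep the split
                split = True
            else:
                new_centers.append(pts.mean(axis=0))
        centers = np.array(new_centers)
        if not split:
            break
    return centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i * 6, 1.0, size=(200, 2)) for i in range(5)])
print(len(g_means(X)))   # should end up close to 5 clusters
```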
Unsupervised Data: Rank by dissimilarity
Why? Unusual instances, intrusion detection, fraud, incorrect data
• Given a group, try to single out the odd ones: remove outliers from the data
Dataset → Anomaly Detector → score → remove outliers
Can be used at different layers and combined with clustering
• Improve model competence: score test predictions to spot new instances dissimilar to the training instances (where the model is not competent)
• Compare against usual distributions: Gaussian, Benford's Law
Anomaly Detection
Poul Petersen (BigML)
Anomaly Detection
[Example on the slide: instances described by features such as “Round”, “Skinny”, “Smooth”, “Corners”; one is “Skinny” but not “smooth”, one has no “Corners”, one is not “Round”, and one is the most unusual]
Different according to grouping features (prior knowledge)
Anomaly Detection
Grow a random decision tree (random features and random splits) until each instance is in its own leaf: “easy”-to-isolate instances end up at shallow depths, “hard”-to-isolate instances end up deep.
Now repeat the process several times and assign an anomaly score (0 = similar, 1 = dissimilar) to any input data by comparing its average isolation depth with the average depth over the training set; see the sketch below.
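This isolation idea is what scikit-learn's IsolationForest implements; a small sketch on invented data, rescaling its scores to the 0 = similar, 1 = dissimilar convention above:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 2))              # "usual" instances
X_all = np.vstack([X, [[8.0, 8.0]]])             # one clearly dissimilar point

# Many random trees with random features/splits; short average depth => anomalous
detector = IsolationForest(n_estimators=100, random_state=0).fit(X_all)

# score_samples is higher for normal points; flip and rescale to 0..1
raw = detector.score_samples(X_all)
anomaly_score = (raw.max() - raw) / (raw.max() - raw.min())
print(anomaly_score[-1], anomaly_score[:5])      # the outlier gets the largest score
```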
Machine Learning Black Art
Charles Parker (BigML)
Even when you follow the
yellow brick road...
Different models
Feature engineering
Evaluation metrics
The house of horrors awaits you
around the corner:
Huge Hypothesis Space
Poorly Picked Loss Function
Cross Validation
Drifting Domain
Reliance on Research Results
Machine Learning Black Art
● Huge hypothesis space: the possible classifiers you could build with an
algorithm given the data. Choice!
Triple trade-off
Use non-parametric methods
As data scales simpler models are desirable
Big data often trumps modelling!
● Poorly picked Loss function: standard loss functions (entropy, distance in
formal space) are mathematically convenient but not always enough for real problems
No info about the classes or the costs
False positive in disease diagnosis
False positive in face detection
False positive in thumbprint identification
Path dependence
Game playing
Let developers apply their own loss function: SVMlight, plug-ins in the splitting code, customized gradient descent... (a small weighted-loss sketch follows below)
OR
Hack the prediction (cascade classifiers)
Change the problem setting (time-based limits on the classifier, max loss)
Keep error down with a certain probability
More complex: you need more data
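One way to "apply your own loss" without touching library internals is to reweight the errors you care about. A minimal sketch of logistic regression trained by gradient descent on an asymmetric (false-negative-heavy) log-loss; all costs and data are invented:

```python
import numpy as np

def fit_cost_sensitive_logreg(X, y, fp_cost=1.0, fn_cost=10.0, lr=0.1, n_iter=500):
    """Gradient descent on a log-loss where each class's errors are weighted by its cost."""
    w = np.zeros(X.shape[1])
    sample_weight = np.where(y == 1, fn_cost, fp_cost)    # penalise missed positives more
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (sample_weight * (p - y)) / len(y)    # gradient of the weighted log-loss
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(1000, 2)), np.ones((1000, 1))])   # bias column
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 1.0).astype(float)   # imbalanced labels
w = fit_cost_sensitive_logreg(X, y)
print("decision boundary weights:", w)
```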
Machine Learning Black Art
● Cross-validation
Hold-outs can lead to leakage: features or instances can be correlated across the test and train sets, giving optimistic performance estimates.
Law of averages and being off by one
Features correlated with the prediction target can bias predictions
Photo dating: colors, borders...
Beware of the group the instances belong to (see the group-aware split sketched below)
Aggregates and timestamps
Instances at close moments in time are highly correlated
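A sketch of group-aware splitting to avoid that kind of leakage: keep every instance of the same group (same photo, same user, same time window) on one side of the split. Data and model are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n = 600
groups = rng.integers(0, 60, size=n)                     # e.g. the photo or session of each row
X = rng.normal(size=(n, 5)) + groups[:, None] * 0.05     # rows correlated within a group
y = groups % 2                                            # label correlated with the group

model = RandomForestClassifier(random_state=0)

# A plain KFold would put rows from the same group in both train and test: leakage
# and optimistic scores. GroupKFold keeps whole groups together.
scores = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(n_splits=5))
print("group-aware CV accuracy:", scores.mean())
```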
Machine Learning Black Art
● Drifting Domain
Domain changes (document classification, sales prediction)
Adverse selection of training data (market data predictions, spam)
➢ The prior p(input) is changing → covariate shift
➢ The mapping p(output | input) is changing → concept drift
Symptoms: lots of errors, distribution changes. Compare to old data (a quick distribution-shift check is sketched below)!
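A quick way to "compare to old data": test whether each input feature's distribution has shifted between the training window and the current window, e.g. with a two-sample Kolmogorov-Smirnov test (feature names and threshold are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
old = rng.normal(0.0, 1.0, size=(5000, 3))     # data the model was trained on
new = rng.normal(0.0, 1.0, size=(1000, 3))     # data seen in production
new[:, 2] += 0.8                                # simulate covariate shift in one feature

for j, name in enumerate(["age", "amount", "clicks"]):   # hypothetical feature names
    stat, p_value = ks_2samp(old[:, j], new[:, j])
    flag = "DRIFT?" if p_value < 0.01 else "ok"
    print(f"{name:8s} KS={stat:.3f} p={p_value:.3g} {flag}")
```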
● Reliance on Research results
Reality does not comply with theorems' initial assumptions (error, sample complexity, convergence): the assumptions are often unrealistic.
Rule of thumb: use academia as your starting point, but don't think it will solve all your problems. Keep learning.
Useful Things about ML
Charles Parker (BigML)
Advice from Dijkstra
● Killing Ambitious Projects - identify sub-problems you can tackle (hard vs. easy; hacking is all right). Good candidates:
No human expert can predict in complex environments (protein folding)
Humans can't explain how they know f(x) (character recognition)
f(x) is changing all the time (market data)
f(x) must be specialized many times (anything user-specific)
● Ignoring the Lure of Complexity
Look for simplicity (remove spaghetti code, processes, drudgery)
Push around complexity (clever compression)
Raw data might have the information; sometimes it is the right way
● Finding Your Own Humility
Know and embrace your own limits
Continuously learn
Do A/B tests: improve on an existing system
● Avoiding Useless Projects
Look for the best combination of easy and big win
Define metrics with experts but don't rely on them: monitor
Useful Things about ML
Advice From Dijkstra (continued)
● Creating a good story
Explain why and summarize your model and your data
Stories are more valuable than models
● Continuing to Learn
Don't get comfortable; work at the edge of your abilities
Understand your limitations
Learn from your errors
Summary:
ML can be of value for every organization: find where
Locating the right problem, Executing, Showing the proof
When you win we all win, so good luck!!!