Intel HPC Developer Convention Salt Lake City 2016
Machine Learning Track
Franz J. Király
Data Analytics, Machine Learning
and HPC in today’s changing
application environment
An overview of data analytics
DATA Scientific
Questions
Exploration
Statistical
Questions
Methods
Quantitative Modelling
Predictive/Inferential
Descriptive/Explanatory
Statistical Programming
R
python
The Scientific Method
Scientific and Statistical Validation
Knowledge
(practical)
Data analytics and data science
in a broader context
Data analytics
Data mining,
Machine learning
Statistics, Modelling,
Raw
data
Clean
data
A lot of problems and subtleties arise at these stages already.
Often, most of the manpower in a „data“ project needs to go here first, before one can attempt reliable data analytics.
Knowledge
Relevant findings and underlying arguments need to be explained well and properly.
Big Data?
What „Big Data“ may mean in practice
Kernel methods, OLS
10.000
Solution strategies
Number of data samples
Strategies that
stop working
in reasonable time
Number of features
10.000.000
10.000.000.000
1.000
Reading in all the data
Random forests
100
L1, LASSO
(around the same order)
Manual exploratory
data analysis
1.000
Super-linear algorithms
Linear algorithms, including
Sub-sampling
On-line models
Feature extraction
Large-scale strategies
for super-linear algorithms
Feature selection
Distributed computing
Large-scale motifs in data science
= where high-performance computing is helpful/impactful
„Big data“
= what it says, a lot of data (ca 1 million samples or more)
Computational challenge arises from processing all of the data
Example: histogram or linear regression with huge amounts of data
Common HPC motif: divide/conquer training/fitting of model, e.g. batchwise/epoch fitting
„Big models“
= the „classic“, beloved by everyone
Not necessarily a lot of data, but computationally intensive models
Classical example: finite elements and other numerical models
New fancy example: large neural networks aka „deep learning“
Common HPC motif: divide/conquer in parts-of-model, e.g. neurons/nodes
Model validation and model selection = this talk's focus
Answers the question: which model is best for your data?
Demanding even for simple models and small amounts of data!
Example: is deep learning better than logistic regression, or guessing?
Meta-modelling: stylized case studies
Customer: Hospital specializing in treatment of patients with a certain disease.
Patients with this disease are at risk of experiencing an adverse event (e.g. death).
Scientific question: depending on patient characteristics, predict the event risk.
Data set: complete clinical records of 1.000 patients, including event if occurred
Customer: Retailer who wants to accurately model behaviour of customers.
Customers can buy (or not buy) any of a number of products, or churn.
Scientific question: predict future customer behaviour given past behaviour
Data set: complete customer and purchase records of 100.000 customers
Customer: Manufacturer who wishes to find the best parameter setting for machines.
Parameters influence amount/quality of product (or whether machine breaks)
Scientific question: find parameter settings which optimize the above
Data set: outcomes for 10.000 parameter settings on those machines
Of interest: model interpretability; how accurate the predictions are expected to be;
whether the algorithm/model is (easily) deployable in the „real world“
Not of interest: which algorithm/strategy, out of many, exactly solves the task
Model validation and model selection
= data-centric and data-dependent modelling
a scientific necessity implied by the scientific method and the following:
1. There is no model that is good for all data.
(otherwise the concept of a model would be unnecessary)
2. For given data, there is no a-priori reason to believe that a certain type of model will be the best one.
(any such belief is not empirically justified hence pseudoscientific)
3. No model can be trusted unless its validity has been verified by a model-independent argument.
(otherwise the justification of validity is circular hence faulty)
Machine learning provides algorithms & theory for meta-modelling
and powerful algorithms motivated by meta-modelling optimality.
Machine Learning
and Meta-Modelling
in a Nutshell
Leitmotifs of Machine Learning
from the intersection of engineering, statistics and computer science
Engineering & statistics idea:
Statistical models are objects in their own right
„learning machines“ (modelling strategies)
Engineering & computer science idea:
Any abstract algorithm can be a modelling strategy/learning machine
Possibly non-explicit
„computational learning“
Computer science & statistics idea:
Future performance of algorithm/learning machine can be estimated (and should)
„model validation“, „model selection“
Problem types in Machine Learning
Supervised Learning:
some data is labelled by expert/oracle
Task: predict label from covariates
statistical models are usually discriminative
Examples: regression, classification
Unsupervised Learning:
the training data is not pre-labelled
Task: find „structure“ or „pattern“ in data
statistical models are usually generative
Examples: clustering, dimension reduction
Advanced learning tasks
Complications in the labelling
Semi-supervised learning
some training data are labelled, some are not
Anomaly detection
all or most data are „positive examples“, the task is to flag „test negatives“
Reinforcement learning
data are not directly labelled, only indirect gain/loss
Complications through correlated data and/or time
On-line learning
the data is revealed with time, models need to update
Forecasting
each data point has a time stamp, predict the temporal future
Transfer learning
the data comes in dissimilar batches, train and test may be distinct
What is a Learning Machine?
… an algorithm that solves, e.g., the previous tasks:
Examples: generalized linear model, linear regression, support vector machine,
neural networks (= „deep learning“), random forests, gradient boosting, …
Illustration: supervised learning machine
[Diagram: observations („training data“) → model fitting (“learning”), controlled by model tuning parameters → fitted model; new data → prediction with the fitted model → predictions, e.g., to base decisions on]
Example: Linear Regression
[Diagram: same workflow as above; model tuning parameter: fit intercept or not?]
Model validation: does the model make sense?
[Diagram: „the truth“ yields „training data“ („in-sample“) and „test data“ („out-of-sample“, „hold-out“); model learning on the training data (e.g. regression, GLM, advanced methods) gives the learnt model; prediction on the test data gives predictions; compare & quantify predictions against the „test labels“, e.g. evaluating the regression model, prediction strategy, or learning machine]
Predictive models need to be validated on unseen data!
The only (general) way to test goodness of prediction is actually observing prediction!
Which means the part of data for testing has not been seen by the algorithm before!
(note: this includes the case where machine = linear regression, deep learning, etc)
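The hold-out idea can be illustrated with a minimal sketch in plain Python. The synthetic data, the 150/50 split, and the helper names `fit_line` and `rmse` are assumptions made for illustration and are not taken from the talk; the point is only that the test part is never seen during fitting.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(model, xs, ys):
    """Root mean squared error of the fitted line on the given data."""
    intercept, slope = model
    squared = [(intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic data (assumption): linear signal plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

# Hold-out split: the test indices are not used for fitting.
indices = list(range(len(xs)))
rng.shuffle(indices)
train_idx, test_idx = indices[:150], indices[150:]

model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
in_sample = rmse(model, [xs[i] for i in train_idx], [ys[i] for i in train_idx])
out_of_sample = rmse(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
print(f"in-sample RMSE: {in_sample:.2f}, out-of-sample RMSE: {out_of_sample:.2f}")
```

Only the out-of-sample number estimates how well the fitted model predicts new data; the in-sample number is typically optimistic.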
„Re-sampling“:
Multiple algorithms are compared on multiple data splits/sub-datasets
[Diagram: all data is split into training data 1/test data 1, training data 2/test data 2, training data 3/test data 3; Predictors 1, 2, 3 are fitted and evaluated on each split, giving errors 1,2,3 per split; the errors are aggregated for comparison]
State-of-the-art principle in model validation, model comparison and meta-modelling
Types of re-sampling: how to obtain training/test splits, and pros/cons
k-fold cross-validation (often: k=5)
how to obtain training/test splits:
1. divide data in k (almost) equal parts
2. obtain k train/test splits via: each part is test data exactly once, the rest of data is the training set
pros/cons: good compromise between runtime and accuracy when k is small compared to data size
leave-one-out = [number of data points]-fold c.v.
pros/cons: very accurate, high run-time
repeated sub-sampling (parameters: training/test size, # of repetitions)
how to obtain training/test splits:
1. obtain a random sub-sample of training/test data of specified sizes (train/test need not cover all data)
2. repeat 1. desired number of times
pros/cons: can be arbitrarily quick, can be arbitrarily inaccurate (depending on parameter choice); can be combined with k-fold
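The k-fold recipe (divide into k almost equal parts, use each part as test data exactly once) can be sketched in a few lines of plain Python. The function name `k_fold_splits` and the shuffling seed are illustrative assumptions; leave-one-out is the special case where k equals the number of data points.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation on n points."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding gives k parts whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_splits(n=23, k=5))
for train, test in splits:
    print(len(train), len(test))

# Leave-one-out: as many folds as data points.
loo_splits = list(k_fold_splits(n=7, k=7))
```

Every index appears in exactly one test set, so each data point is predicted exactly once by a model that has not seen it.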
Quantitative model comparison
a „benchmarking experiment“ results in a table like this
(models were shown as icons on the slide and are numbered here)
model    RMSE          MAE
1        15.3 ± 1.2    12.3 ± 1.1
2         9.5 ± 0.9     7.3 ± 0.8
3        13.6 ± 0.7    11.4 ± 0.9
4        20.1 ± 1.4    18.1 ± 1.7
Confidence regions (or paired tests) to compare models to each other:
A is better than B / B is better than A / A and B are equally good
Uninformed model (stupid model/random guess) needs to be included
otherwise a statement „is better than an uninformed guess“ cannot be made.
„useful model“ = (significantly) better than uninformed baseline
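A benchmarking table of this kind could be produced along the following lines with scikit-learn. This is a sketch under assumptions: the synthetic dataset, the three candidate models, and the use of the fold-wise standard error as the „±“ column are illustrative choices, not the setup behind the numbers on the slide. Fold scores are not independent, so the standard error is only a rough indicator.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption, for illustration only).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "uninformed baseline (mean)": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn reports negated errors, so flip the sign to get RMSE per fold.
    fold_rmse = -cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    mean = fold_rmse.mean()
    std_err = fold_rmse.std(ddof=1) / np.sqrt(len(fold_rmse))
    results[name] = (mean, std_err)
    print(f"{name:28s} RMSE {mean:7.1f} ± {std_err:.1f}")
```

Including the `DummyRegressor` row is what allows the statement that a model is better than an uninformed guess.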
Meta-model: automated parameter tuning
Re-sampling is used to determine [best parameter setting]
[Diagram: the whole training data is re-sampled into training data/test data splits; Parameters 1, 2, 3 are compared on these splits by model goodness (a benchmark table as in „Quantitative model comparison“); the best parameters are selected]
New choices: which measure of predictive goodness, which inner re-sampling scheme
Methods are usually less sensitive to these „new“ tuning parameters
Important caveat:
the „inner“ training/test splits need to be part of any „outer“ training set
otherwise validation is not out-of-sample!
For validation, new unseen data needs to be used:
[Diagram: all data is split into training data and „real“ test data; the training data is re-sampled into tuning train/tuning test splits, on which model goodness is compared per parameter setting; the model w. best parameter is fit to all training data; predict & quantify on the „real“ test data]
Multi-fold-schemes are nested:
„splits within splits“
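Nested re-sampling („splits within splits“) can be sketched with scikit-learn by wrapping a tuned model in an outer cross-validation. The dataset, the ridge model, and the parameter grid are assumptions chosen for illustration; the structure is what matters: `GridSearchCV` only ever sees the outer training part, so the outer scores stay out-of-sample.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic regression data (assumption, for illustration only).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # tuning train/tuning test
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)  # training data/"real" test

# The tuned model is itself a learning machine: fitting it runs the inner
# re-sampling, picks the best alpha, and refits on the whole (outer) training set.
tuned_ridge = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="neg_mean_squared_error",
)

# Outer loop: each outer test fold is unseen by both fitting and tuning.
outer_mse = -cross_val_score(
    tuned_ridge, X, y, cv=outer_cv, scoring="neg_mean_squared_error"
)
print("outer-fold MSE:", outer_mse.round(1))
```

This already costs 5 x 3 x 4 fits plus refits for a single small model, which is the multiplicative effect the runtime estimate in the talk describes.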
Meta-Strategies in ML
„Model tuning“
Model with tuning parameters
Best tuning parameters are determined using data-driven tuning algorithm
„Ensemble learning“
a number of (possibly „weak“) models A, B, C, D
are combined into a „strong“ ensemble model
Object dependencies in the ML workflow
One interesting dataset (N = 100-100.000 data points, „small data“)
is re-sampled into multiple train/test splits („typical number of“: 5-10 outer splits, 3-5 nested splits)
on each of which the strategies 1, 2, …, M are compared (M = 5-20)
most of which are parameter-tuned by the same principle (10-10.000 parameter combinations)
Ensembles: further nesting (10-1.000 base learners)
Runtime = 10 x 10 x 5 x 1.000 (x 100) x one run on N samples (usually O(N²) or O(N³))
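The runtime product can be worked through as plain arithmetic. The assignment of the factors to outer splits, strategies, nested splits, parameter combinations and base learners is an assumption read off the typical ranges listed in this part of the talk; the slide itself only gives the numbers.

```python
# Assumed meaning of the factors in "10 x 10 x 5 x 1.000 (x 100)".
outer_splits = 10
strategies = 10
nested_splits = 5
parameter_combinations = 1_000
base_learners = 100  # optional factor for ensembles

fits = outer_splits * strategies * nested_splits * parameter_combinations
fits_with_ensembles = fits * base_learners
print(f"model fits without ensembles: {fits:,}")
print(f"model fits with ensembles:    {fits_with_ensembles:,}")


def relative_cost(n_samples, exponent, n_fits):
    """Total cost in arbitrary units if one fit scales as n_samples ** exponent."""
    return n_fits * n_samples ** exponent


# Even "small data" (N = 1.000) with a quadratic-time learner is substantial.
print(f"relative cost, N=1000, O(N^2): {relative_cost(1_000, 2, fits):.1e}")
```

The count of fits, not the size of the dataset, is what makes benchmarking on „small data“ computationally demanding.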
Machine Learning
Toolboxes
An incomplete list of influential toolboxes
[Table: toolboxes, among them caret, compared by language (R, python, Java, multi-interface, 3rd party wrappers), GUI, common models, modular API (e.g., methods), model tuning/meta-methods, and model validation and comparison]
scikit-learn is perhaps the most widely used ML toolbox
The object-oriented ML Toolbox API
as found in the R/mlr or scikit-learn packages
Leading principles: encapsulation, modularization
modular structure, object orientation
„learning machine“ object, e.g. Linear regression:
fit(traindata)
predict(testdata)
plus metadata & model info
Abstraction models objects with unified API:
Concept abstracted     Public interface                        in R/mlr           in sklearn
Learning machines      fitting, predicting, set parameters     Learner            estimator
Re-sampling schemes    sample, apply & get results             ResampleDesc       splitter classes in model_selection
Evaluation metrics     compute from results, tabulate          Measure            metrics classes in metrics
Meta-modelling         wrapping machines by strategy           various wrappers   various wrappers, fused classes, Pipeline
(tuning, ensembling, pipelining)
Learning task          benchmark, list strategies/measures     Task               Implicit, not encapsulated
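The unified interface can be illustrated by a toy „learning machine“ object in plain Python. The class `MeanRegressor`, its `shrink` parameter, and the helper `holdout_mae` are hypothetical names invented for this sketch; they mimic the fit/predict/set-parameters pattern of the toolboxes but are not part of mlr or scikit-learn.

```python
class MeanRegressor:
    """Hypothetical learning machine: predicts the (shrunken) training mean."""

    def __init__(self, shrink=1.0):
        self.shrink = shrink  # tuning parameter

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        # Fitted state is stored on the object ("metadata & model info").
        self.mean_ = self.shrink * sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def holdout_mae(machine, X_train, y_train, X_test, y_test):
    """Works for any object exposing fit and predict, whatever the model is."""
    predictions = machine.fit(X_train, y_train).predict(X_test)
    return sum(abs(p - t) for p, t in zip(predictions, y_test)) / len(y_test)


X = [[0], [1], [2], [3]]
y = [1.0, 2.0, 3.0, 6.0]
mae = holdout_mae(MeanRegressor(), X[:3], y[:3], X[3:], y[3:])
print(mae)  # training mean is 2.0, test label is 6.0, so the error is 4.0
```

Because validation code such as `holdout_mae` depends only on the interface, re-sampling, tuning and ensembling can wrap any learning machine without knowing its internals.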
HPC for benchmarking/validation today
[Diagram: the levels of „Object dependencies in the ML workflow“, numbered 1-4: re-sampling of all data (N = 100-100.000 data points, „small data“) into 5-10 outer splits with 3-5 nested splits; strategies 1, 2, …, M (M = 5-20); 10-10.000 parameter combinations; 10-1.000 base learners]
At the selected level (one of 1-4): distribute to clusters/cores
mlr: parallelMap
Scikit-learn: joblib
Plus algorithm-specific HPC interfaces, e.g. deep learning (mutually exclusive)
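Distributing one selected level to cores can be sketched with joblib, the library named for scikit-learn. Here the parallel level is the set of repeated sub-sampling splits; the data, the mean predictor, and the helper name `evaluate_split` are assumptions for illustration only.

```python
import random

from joblib import Parallel, delayed


def evaluate_split(y, seed, test_size=50):
    """One unit of work: random train/test split, mean predictor, test MAE."""
    rng = random.Random(seed)
    indices = list(range(len(y)))
    rng.shuffle(indices)
    test, train = indices[:test_size], indices[test_size:]
    mean = sum(y[i] for i in train) / len(train)
    return sum(abs(y[i] - mean) for i in test) / len(test)


# Synthetic labels (assumption, for illustration only).
data_rng = random.Random(0)
y = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

# Each split is independent of the others, so the splits can run on separate cores.
errors = Parallel(n_jobs=2)(delayed(evaluate_split)(y, seed) for seed in range(8))
print([round(e, 2) for e in errors])
```

Only one level is parallelized here; the other levels (strategies, parameters, base learners) would run sequentially inside each unit of work, which matches the „one of 1-4“ restriction described above.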
HPC support tomorrow?
Layer 1: Data/task pipeline
full graph of dependencies: re-samples, algorithms, parameters, …
DATA (e.g. Hadoop)
Layer 2: Scheduler for algorithms and meta-algorithms
Combining (?) MapReduce, DAAL, dask, joblib -> TBB?
(image source: continuum analytics)
Layer 3: Optimized Primitives
e.g. MKL, CUDA, BLAS
Linear systems, convex optimization, stoch. gradient descent
(image source: Intel math kernel library)
Layer 4: Hardware API
e.g. distributed, multi-core, multi-type/heterogeneous
Challenges in ML APIs and HPC
Surprisingly few resources have been invested in ML toolboxes
Most advanced toolboxes are currently open-source & academic
Features that would be desirable to the practitioner
but not available without mid-scale software development:
Integration of (a) data management, (b) exploration and (c) modelling
especially challenging: integration in large scale scenarios
Non-standard modelling tasks, structured data (incl time series)
data heterogeneity, multiple datasets, time series, spatial features, images etc
forecasting, on-line learning, anomaly detection, change point detection
meta-modelling and re-sampling for these is an order of magnitude more costly
Full HPC integration on granular level for distributed ML benchmarking
e.g. MapReduce for divide/conquer over data, model parts, and models
making full use of parallelism for nesting and computational redundancies
complete HPC architecture for whole model benchmarking workflow

RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
 
AIDC India - AI on IA
AIDC India  - AI on IAAIDC India  - AI on IA
AIDC India - AI on IA
 
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesAIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino Slides
 
AIDC India - AI Vision Slides
AIDC India - AI Vision SlidesAIDC India - AI Vision Slides
AIDC India - AI Vision Slides
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
 

Recently uploaded

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 

Recently uploaded (20)

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 

Data Analytics, Machine Learning, and HPC in Today’s Changing Application Environment

  • 1. Intel HPC Developer Convention, Salt Lake City 2016, Machine Learning Track. Franz J. Király: Data Analytics, Machine Learning and HPC in today’s changing application environment
  • 2. An overview of data analytics. Diagram labels: DATA; Scientific Questions; Exploration; Statistical Questions; Methods; Quantitative Modelling (Predictive/Inferential, Descriptive/Explanatory); Statistical Programming (R, python); The Scientific Method; Scientific and Statistical Validation; Knowledge (practical)
  • 3. Data analytics and data science in a broader context. Pipeline: Raw data -> Clean data -> Data analytics (Statistics, Modelling, Data mining, Machine learning) -> Knowledge. There are a lot of problems and subtleties at the raw-data and clean-data stages already: often, most of the manpower in a "data" project needs to go here first before one can attempt reliable data analytics. Relevant findings and underlying arguments need to be explained well and properly.
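  • 5. What "Big Data" may mean in practice
      - [Chart: scales of the number of data samples and the number of features, the strategies that stop working in reasonable time at each scale, and the corresponding solution strategies. The assignment of entries to scales is not recoverable from the transcript.]
      - Scale values shown: 100; 1,000; 10,000; 10,000,000; 10,000,000,000 (samples and features around the same order)
      - Strategies that stop working in reasonable time: manual exploratory data analysis; super-linear algorithms (kernel methods, OLS); random forests; L1, LASSO; linear algorithms, including reading in all the data
      - Solution strategies: feature selection; feature extraction; sub-sampling; on-line models; large-scale strategies for super-linear algorithms; distributed computing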
  • 6. Large-scale motifs in data science (= where high-performance computing is helpful/impactful)
      - "Big data" = what it says, a lot of data (ca. 1 million samples or more). Computational challenge arises from processing all of the data. Example: histogram or linear regression with huge amounts of data. Common HPC motif: divide/conquer training/fitting of model, e.g. batchwise/epoch fitting.
      - "Big models" = the "classic", beloved by everyone. Not necessarily a lot of data, but computationally intensive models. Classical example: finite elements and other numerical models. New fancy example: large neural networks aka "deep learning". Common HPC motif: divide/conquer in parts-of-model, e.g. neurons/nodes.
      - Model validation and model selection = this talk's focus. Answers the question: which model is best for your data? Demanding even for simple models and small amounts of data! Example: is deep learning better than logistic regression, or guessing?
  • 7. Meta-modelling: stylized case studies
      - Customer: Hospital specializing in treatment of patients with a certain disease. Patients with this disease are at risk of experiencing an adverse event (e.g. death). Scientific question: depending on patient characteristics, predict the event risk. Data set: complete clinical records of 1,000 patients, including event if occurred.
      - Customer: Retailer who wants to accurately model behaviour of customers. Customers can buy (or not buy) any of a number of products, or churn. Scientific question: predict future customer behaviour given past behaviour. Data set: complete customer and purchase records of 100,000 customers.
      - Customer: Manufacturer who wishes to find the best parameter setting for machines. Parameters influence amount/quality of product (or whether the machine breaks). Scientific question: find parameter settings which optimize the above. Data set: outcomes for 10,000 parameter settings on those machines.
      - Of interest: model interpretability; how accurate the predictions are expected to be; whether the algorithm/model is (easily) deployable in the "real world".
      - Not of interest: which algorithm/strategy, out of many, exactly solves the task.
  • 8. Model validation and model selection = data-centric and data-dependent modelling: a scientific necessity implied by the scientific method and the following:
      1. There is no model that is good for all data. (Otherwise the concept of a model would be unnecessary.)
      2. For given data, there is no a-priori reason to believe that a certain type of model will be the best one. (Any such belief is not empirically justified, hence pseudoscientific.)
      3. No model can be trusted unless its validity has been verified by a model-independent argument. (Otherwise the justification of validity is circular, hence faulty.)
      Machine learning provides algorithms & theory for meta-modelling, and powerful algorithms motivated by meta-modelling optimality.
  • 10. Leitmotifs of Machine Learning, from the intersection of engineering, statistics and computer science
      - Engineering & statistics idea: statistical models are objects in their own right: "learning machines" (modelling strategy = learning machine).
      - Engineering & computer science idea: any abstract algorithm can be a modelling strategy/learning machine, possibly non-explicit: "computational learning".
      - Computer science & statistics idea: future performance of an algorithm/learning machine can be estimated (and should): "model validation", "model selection".
  • 11. Problem types in Machine Learning: Supervised Learning. Some data is labelled by an expert/oracle. Task: predict label from covariates. Statistical models are usually discriminative. Examples: regression, classification.
  • 12. Problem types in Machine Learning: Unsupervised Learning. The training data is not pre-labelled. Task: find "structure" or "pattern" in data. Statistical models are usually generative. Examples: clustering, dimension reduction.
  • 13. Advanced learning tasks
      - Complications in the labelling:
          Semi-supervised learning: some training data are labelled, some are not.
          Anomaly detection: all or most data are "positive examples", the task is to flag "test negatives".
          Reinforcement learning: data are not directly labelled, only indirect gain/loss.
      - Complications through correlated data and/or time:
          On-line learning: the data is revealed with time, models need to update.
          Forecasting: each data point has a time stamp, predict the temporal future.
          Transfer learning: the data comes in dissimilar batches, train and test may be distinct.
  • 14. What is a Learning Machine? … an algorithm that solves, e.g., the previous tasks.
      - Examples: generalized linear model, linear regression, support vector machine, neural networks (= "deep learning"), random forests, gradient boosting, …
      - Illustration (supervised learning machine): observations ("training data") -> model fitting ("learning"), governed by model tuning parameters -> fitted model; new data -> prediction with the fitted model -> predictions, e.g., to base decisions on.
  • 15. Example: Linear Regression. Observations ("training data") -> model fitting ("learning") -> fitted model; new data -> prediction -> predictions. Tuning parameter: fit intercept or not?
  • 16. Model validation: does the model make sense?
      - Diagram: "the truth" yields "training data" and "test data". Model learning (the learning machine; e.g. regression, GLM, advanced methods) turns the training data into a learnt model (the prediction strategy). Prediction on the test data gives predictions, which are compared & quantified against the "test labels" (e.g. evaluating the regression model).
      - "In-sample" refers to the training data; "out-of-sample" or "hold-out" refers to the test data.
      - Predictive models need to be validated on unseen data! Which means the part of data for testing has not been seen by the algorithm before! (Note: this includes the case where machine = linear regression, deep learning, etc.)
      - The only (general) way to test goodness of prediction is actually observing prediction!
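A minimal sketch of hold-out validation as described on this slide, using only the Python standard library. The synthetic data, the 75/25 split and the helper names `fit_line` and `rmse` are illustrative assumptions, not part of the talk; the one-feature least-squares line merely stands in for "the learning machine".

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rmse(model, xs, ys):
    """Root mean squared error of the fitted line on the given data."""
    intercept, slope = model
    errors = [(intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)]
    return (sum(errors) / len(errors)) ** 0.5

# Synthetic data (assumption): y = 2 + 3x plus Gaussian noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Hold out the last 25% of the (already randomly ordered) samples as test data
split = int(0.75 * len(xs))
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

# The model only ever sees the training part
model = fit_line(train_x, train_y)
in_sample = rmse(model, train_x, train_y)
out_of_sample = rmse(model, test_x, test_y)
print(f"in-sample RMSE: {in_sample:.3f}, out-of-sample RMSE: {out_of_sample:.3f}")
```

Only the out-of-sample figure is an estimate of predictive performance on new data; the in-sample figure is reported for contrast.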
  • 17. "Re-sampling": state-of-art principle in model validation, model comparison and meta-modelling. Multiple algorithms are compared on multiple data splits/sub-datasets.
      - Diagram: all data is split into training data 1/test data 1, training data 2/test data 2, training data 3/test data 3. Predictors 1, 2, 3 are fitted on each split, giving errors 1,2,3 per split; these are aggregated into errors 1,2,3 for comparison.
      - k-fold cross-validation. How to obtain training/test splits: 1. divide data in k (almost) equal parts; 2. obtain k train/test splits via: each part is test data exactly once, the rest of data is the training set. Pros/cons: often k=5; good compromise between runtime and accuracy when k is small compared to data size.
      - leave-one-out = [number of data points]-fold c.v. Pros/cons: very accurate, high run-time.
      - repeated sub-sampling. Parameters: training/test size, # of repetitions. How to obtain training/test splits: 1. obtain a random sub-sample of training/test data of specified sizes (train/test need not cover all data); 2. repeat 1. the desired number of times. Pros/cons: can be arbitrarily quick; can be arbitrarily inaccurate (depending on parameter choice); can be combined with k-fold.
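The k-fold recipe on this slide can be sketched as an index generator. This is an illustrative standard-library version, assuming the data has already been shuffled; the function name `k_fold_splits` is hypothetical and not taken from mlr or scikit-learn.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one, so the parts are (almost) equal
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]                 # this part is test data exactly once
        train = indices[:start] + indices[start + size:]   # the rest is the training set
        yield train, test
        start += size

splits = list(k_fold_splits(n_samples=10, k=3))
for train, test in splits:
    print("train:", train, "test:", test)

# Leave-one-out is the special case k = number of data points
loo = list(k_fold_splits(n_samples=4, k=4))
```

Each of the k fits uses a fraction (k-1)/k of the data, which is why leave-one-out needs as many fits as there are data points.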
  • 18. Quantitative model comparison. A "benchmarking experiment" results in a table like this (the model names are not recoverable from the slide; one entry is shown as "?"):
      model   RMSE          MAE
      (1)     15.3 ± 1.2    12.3 ± 1.1
      (2)      9.5 ± 0.9     7.3 ± 0.8
      (3)     13.6 ± 0.7    11.4 ± 0.9
      (4)     20.1 ± 1.4    18.1 ± 1.7
      - Confidence regions (or paired tests) are used to compare models to each other: A is better than B / B is better than A / A and B are equally good.
      - An uninformed model (stupid model/random guess) needs to be included, otherwise a statement "is better than an uninformed guess" cannot be made.
      - "Useful model" = (significantly) better than the uninformed baseline.
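A small benchmarking sketch in the spirit of this slide: a least-squares line is compared against an uninformed baseline (always predicting the training mean) over 5 folds. The data are synthetic, all function names are hypothetical, and the standard error across folds is only a rough heuristic because folds share training data.

```python
import random
import statistics

rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 2) for x in xs]

def fit_mean(train_x, train_y):
    """Uninformed baseline: ignores x and always predicts the training mean."""
    mean_y = statistics.fmean(train_y)
    return lambda x: mean_y

def fit_line(train_x, train_y):
    """Least-squares line y = a + b * x."""
    mx, my = statistics.fmean(train_x), statistics.fmean(train_y)
    sxx = sum((x - mx) ** 2 for x in train_x)
    sxy = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def rmse(predict, test_x, test_y):
    squared = [(predict(x) - y) ** 2 for x, y in zip(test_x, test_y)]
    return statistics.fmean(squared) ** 0.5

k = 5
fold_size = len(xs) // k
results = {"baseline": [], "linear": []}
for fold in range(k):
    lo, hi = fold * fold_size, (fold + 1) * fold_size
    test_x, test_y = xs[lo:hi], ys[lo:hi]
    train_x, train_y = xs[:lo] + xs[hi:], ys[:lo] + ys[hi:]
    # Every strategy is evaluated on the same splits, so results are paired
    for name, fit in (("baseline", fit_mean), ("linear", fit_line)):
        results[name].append(rmse(fit(train_x, train_y), test_x, test_y))

for name, scores in results.items():
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    print(f"{name:8s} RMSE {mean:.2f} +/- {sem:.2f}")

# Paired comparison: per-fold difference, baseline minus linear
diffs = [b - l for b, l in zip(results["baseline"], results["linear"])]
print("per-fold improvement over baseline:", [round(d, 2) for d in diffs])
```

Without the baseline row there would be nothing to anchor the claim that the linear model is useful at all.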
  • 19. Meta-model: automated parameter tuning
      - Re-sampling is used to determine the best parameter setting: the whole training data is re-sampled into training/test splits; Parameters 1, 2, 3 are each evaluated for model goodness on these splits; the best parameters are selected. [Figure residue: the model goodness tables repeat the illustrative values 15.3 ± 1.2, 9.5 ± 0.9, 13.6 ± 0.7, 20.1 ± 1.4.]
      - This introduces "new" tuning parameters: which measure of predictive goodness, and which inner re-sampling scheme. Methods are usually less sensitive to these "new" tuning parameters.
      - Important caveat: for validation, new unseen data needs to be used. The "inner" training/test splits need to be part of any "outer" training set, otherwise validation is not out-of-sample!
      - Diagram: all data -> training data (split into tuning train and tuning test) and test data (the "real" test). The model with the best parameter is fit to all training data, then used to predict & quantify on the test data.
      - Multi-fold schemes are nested: "splits within splits".
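The "splits within splits" idea can be sketched as nested cross-validation. This is an illustrative standard-library version under several assumptions: synthetic data, a one-feature ridge regression as the tunable model, RMSE as the goodness measure, 5 outer and 3 inner folds, and hypothetical helper names throughout.

```python
import random

def k_fold(indices, k):
    """Split a list of indices into k (train, test) pairs."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f_idx, f in enumerate(folds) if f_idx != i for j in f]
        yield train, folds[i]

def fit_ridge(xs, ys, lam):
    """One-feature ridge regression; lam is the tuning parameter."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / (sxx + lam)
    return my - slope * mx, slope

def rmse(model, xs, ys):
    a, b = model
    return (sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

def cv_error(xs, ys, idx, lam, k):
    """Mean test RMSE of one parameter value over k folds of idx."""
    errs = []
    for tr, te in k_fold(idx, k):
        model = fit_ridge([xs[i] for i in tr], [ys[i] for i in tr], lam)
        errs.append(rmse(model, [xs[i] for i in te], [ys[i] for i in te]))
    return sum(errs) / len(errs)

rng = random.Random(2)
xs = [rng.uniform(-1, 1) for _ in range(120)]
ys = [0.5 + 1.5 * x + rng.gauss(0, 0.5) for x in xs]
grid = [0.0, 0.1, 1.0, 10.0, 100.0]   # candidate parameter settings (assumption)

outer_errors, chosen = [], []
for outer_train, outer_test in k_fold(list(range(len(xs))), 5):
    # Inner loop: tuning only ever sees indices from the outer training set
    best_lam = min(grid, key=lambda lam: cv_error(xs, ys, outer_train, lam, 3))
    chosen.append(best_lam)
    # Refit with the best parameter on all outer training data, test on unseen data
    model = fit_ridge([xs[i] for i in outer_train],
                      [ys[i] for i in outer_train], best_lam)
    outer_errors.append(rmse(model, [xs[i] for i in outer_test],
                             [ys[i] for i in outer_test]))

print("chosen parameters per outer split:", chosen)
print("outer (out-of-sample) RMSE:", sum(outer_errors) / len(outer_errors))
```

The outer test indices never enter `cv_error`, which is what keeps the reported error out-of-sample with respect to the tuning step as well.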
  • 20. Meta-Strategies in ML
      - "Model tuning": a model with tuning parameters; the best tuning parameters are determined using a data-driven tuning algorithm.
      - "Ensemble learning": a number of (possibly "weak") models (A, B, C, D) are combined into a "strong" ensemble model.
  • 21. Object dependencies in the ML workflow ("typical number of" at each level)
      - One interesting dataset (all data): N = 100-100,000 data points ("small data")
      - is re-sampled into multiple train/test splits: 5-10 outer splits
      - on each of which the strategies are compared: M = 5-20 strategies
      - most of which are parameter-tuned by the same principle: 3-5 nested splits, 10-10,000 parameter combinations
      - Ensembles, further nesting: 10-1,000 base learners
      - Runtime = 10 x 10 x 5 x 1,000 (x 100) x one run on N samples (usually O(N²) or O(N³))
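The runtime product on this slide can be read as a count of single model fits. The sketch below works through that arithmetic; the assignment of each factor to a workflow level is one plausible reading of the slide, and `seconds_per_fit` is a purely hypothetical figure.

```python
# Factor labels are an interpretation of the slide's runtime product (assumption)
outer_splits = 10
strategies = 10
inner_splits = 5
parameter_combinations = 1000
base_learners = 100  # the optional extra factor for ensembles

fits = outer_splits * strategies * inner_splits * parameter_combinations
fits_with_ensembles = fits * base_learners
print(fits)                 # 500000
print(fits_with_ensembles)  # 50000000

# Hypothetical cost of one run on N samples, to get a feeling for wall-clock time
seconds_per_fit = 0.1
hours = fits * seconds_per_fit / 3600
print(f"{hours:.1f} hours on a single core")
```

Even with a fast single fit, the nesting multiplies into half a million runs, which is why validation is demanding for simple models and small data.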
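  • 23. An incomplete list of influential toolboxes
      - [Table: toolboxes, mostly shown as logos, compared by: Language; Modular API (e.g., methods); Model tuning, meta-methods; Model validation and comparison; GUI; Common models. The assignment of cells to toolboxes is not recoverable from the transcript.]
      - Languages/interfaces listed: R (caret), python, Java, multi-interface, 3rd party wrappers.
      - Cell entries shown: "Not entirely"; "mostly kernels"; "some"; "Few, mostly classifiers"; "few".
      - scikit-learn is perhaps the most widely used ML toolbox.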
  • 24. The object-oriented ML Toolbox API, as found in the R/mlr or scikit-learn packages
      - Leading principles: encapsulation, modularization (modular structure, object orientation).
      - Example: a Linear regression "learning machine" object offers fit(traindata) and predict(testdata), plus metadata & model info.
      - Abstraction models objects with unified API:
          Concept abstracted | Public interface | in R/mlr | in sklearn
          Learning machines | fitting, predicting, set parameters | Learner | estimator
          Re-sampling schemes | sample, apply & get results | ResampleDesc | splitter classes in model_selection
          Evaluation metrics | compute from results, tabulate | Measure | metrics classes in metrics
          Meta-modelling (tuning, ensembling, pipelining) | wrapping machines by strategy | various wrappers | various wrappers, fused classes, Pipeline
          Learning task | benchmark, list strategies/measures | Task | implicit, not encapsulated
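The unified fit/predict interface can be illustrated with two toy learners. This is a sketch of the design principle only: the classes and the `evaluate` helper below are hypothetical and are not the actual mlr or scikit-learn classes, although the `fit`/`predict` method names follow the convention named on the slide.

```python
class MeanRegressor:
    """Hypothetical uninformed learner: predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class NearestNeighbourRegressor:
    """Hypothetical 1-nearest-neighbour learner on a single feature."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            nearest = min(range(len(self.X_)), key=lambda i: abs(self.X_[i] - x))
            preds.append(self.y_[nearest])
        return preds


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(learner, X_train, y_train, X_test, y_test, metric):
    """Works for any object exposing fit and predict, whatever the model is."""
    return metric(y_test, learner.fit(X_train, y_train).predict(X_test))


X_train, y_train = [0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0]
X_test, y_test = [0.9, 2.2], [1.8, 4.4]
for learner in (MeanRegressor(), NearestNeighbourRegressor()):
    score = evaluate(learner, X_train, y_train, X_test, y_test, mean_absolute_error)
    print(type(learner).__name__, round(score, 3))
```

Because `evaluate` depends only on the shared interface, re-sampling, tuning and benchmarking code can be written once and applied to every learner.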
  • 25. HPC for benchmarking/validation today
      - Same workflow as in the object dependencies diagram, with "typical number of": N = 100-100,000 data points ("small data"); 5-10 outer splits; M = 5-20 strategies; 3-5 nested splits; 10-10,000 parameter combinations; 10-1,000 base learners. The nesting levels are numbered 1-4.
      - At the selected level: distribute to clusters/cores (one of 1-4, mutually exclusive).
      - mlr: parallelMap. Scikit-learn: joblib.
      - Plus algorithm-specific HPC interfaces, e.g. deep learning.
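Distributing one level of the workflow can be sketched with Python's standard `concurrent.futures`. This stands in for joblib or parallelMap and is not their API; a thread pool is used only to keep the example portable, and for CPU-bound fits a process pool or joblib would normally be the better choice. Here the parallelized level is the set of train/test splits, and the "model" is a trivial mean predictor.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_split(split):
    """One independent unit of work: fit on train, score on test."""
    train, test = split
    prediction = sum(train) / len(train)   # "fit" a mean predictor
    return sum((y - prediction) ** 2 for y in test) / len(test)

# Toy data and 5 interleaved folds (assumption, for illustration only)
data = [float(i % 7) for i in range(70)]
k = 5
splits = []
for fold in range(k):
    test = data[fold::k]
    train = [y for i, y in enumerate(data) if i % k != fold]
    splits.append((train, test))

# Distribute exactly one level (the splits) to the workers
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_scores = list(pool.map(evaluate_split, splits))

# The serial loop gives the same result, only without distribution
serial_scores = [evaluate_split(s) for s in splits]
print(parallel_scores)
```

The splits are independent of each other, so they can be mapped over workers without any communication; nesting a second parallel level inside `evaluate_split` is what current toolboxes typically do not coordinate.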
  • 26. HPC support tomorrow?
      - Layer 1: Data/task pipeline. DATA (e.g. Hadoop).
      - Layer 2: Scheduler for algorithms and meta-algorithms. Full graph of dependencies: re-samples, algorithms, parameters, … Combining (?) MapReduce, DAAL, dask, joblib -> TBB? (image source: continuum analytics)
      - Layer 3: Optimized primitives. Linear systems, convex optimization, stoch. gradient descent; e.g. MKL, CUDA, BLAS. (image source: Intel math kernel library)
      - Layer 4: Hardware API. E.g. distributed, multi-core, multi-type/heterogeneous.
  • 27. Challenges in ML APIs and HPC
      - Surprisingly few resources have been invested in ML toolboxes. Most advanced toolboxes are currently open-source & academic.
      - Features that would be desirable to the practitioner but are not available without mid-scale software development:
          Integration of (a) data management, (b) exploration and (c) modelling. Especially challenging: integration in large scale scenarios.
          Non-standard modelling tasks, structured data (incl. time series): data heterogeneity, multiple datasets, time series, spatial features, images etc; forecasting, on-line learning, anomaly detection, change point detection. Meta-modelling and re-sampling for these is an order of magnitude more costly.
          Full HPC integration on granular level for distributed ML benchmarking: e.g. MapReduce for divide/conquer over data, model parts, and models; making full use of parallelism for nesting and computational redundancies; complete HPC architecture for the whole model benchmarking workflow.