Machine Learning
Data science for beginners, session 6
Machine Learning: your 5-7 things
Defining machine learning
The Scikit-Learn library
Machine learning algorithms
Choosing an algorithm
Measuring algorithm performance
Defining Machine Learning
Machine Learning = learning models from data
Which advert is the user most likely to click on?
Who’s most likely to win this election?
Which wells are most likely to fail in the next 6 months?
Machine Learning as Predictive Analytics...
Machine Learning Process
● Get data
● Select a model
● Select hyperparameters for that model
● Fit model to data
● Validate model (and change model, if necessary)
● Use the model to predict values for new data
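The steps above can be sketched end to end as follows. This is a minimal illustration, assuming the Iris dataset and a k nearest neighbours classifier (both appear later in the session); the 80/20 split, the variable names and the `new_flower` measurements are made up for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Get data, and hold some of it back for validation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Select a model and its hyperparameters (here: the number of neighbours)
model = KNeighborsClassifier(n_neighbors=5)

# Fit model to data
model.fit(X_train, y_train)

# Validate the model on data it has not seen
validation_score = model.score(X_test, y_test)
print("Validation accuracy: {:.2f}".format(validation_score))

# Use the model to predict values for new data (hypothetical measurements)
new_flower = [[5.0, 3.5, 1.4, 0.2]]
predicted = model.predict(new_flower)
print("Predicted class index: {}".format(predicted[0]))
```

If the validation score is poor, this is the point where you would go back and change the hyperparameters or the model.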
Today’s library: Scikit-Learn (sklearn)
Scikit-Learn’s example datasets
● Iris
● Digits
● Diabetes
● Boston (removed from Scikit-Learn in version 1.2)
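As a rough sketch of how these example datasets are loaded, the loop below prints the shape of three of them. Boston is left out on purpose, because `load_boston` was removed in scikit-learn 1.2; the `loaders` and `shapes` names are only for this illustration.

```python
from sklearn import datasets

# Map a readable name to each loader function
loaders = {
    "iris": datasets.load_iris,
    "digits": datasets.load_digits,
    "diabetes": datasets.load_diabetes,
}

shapes = {}
for name, loader in loaders.items():
    dataset = loader()
    # .data holds the features: one row per sample, one column per feature
    shapes[name] = dataset.data.shape
    print("{}: {} samples, {} features".format(name, *dataset.data.shape))
```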
Select a Model
Algorithm Types
Supervised learning
Regression: learning numbers
Classification: learning classes
Unsupervised learning
Clustering: finding groups
Dimensionality reduction: finding efficient representations
Linear Regression: fit a line to (numerical) data
Linear Regression: First, get your data
import numpy as np
import pandas as pd
gen = np.random.RandomState(42)
num_samples = 40
x = 10 * gen.rand(num_samples)
y = 3 * x + 7 + gen.randn(num_samples)
X = pd.DataFrame(x)
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(x, y)
Linear Regression: Fit model to data
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print('Slope: {}, Intercept: {}'.format(model.coef_, model.intercept_))
Linear Regression: Check your model
Xtest = pd.DataFrame(np.linspace(-1, 11))
predicted = model.predict(Xtest)
plt.scatter(x, y)
plt.plot(Xtest, predicted)
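Besides plotting, you can check the fit numerically. The sketch below regenerates the same synthetic data and compares the learned slope and intercept with the values used to create it (3 and 7). The R² score from `model.score` is an extra check that the slides do not use.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as above: points scattered around y = 3x + 7
gen = np.random.RandomState(42)
num_samples = 40
x = 10 * gen.rand(num_samples)
y = 3 * x + 7 + gen.randn(num_samples)
X = x.reshape(-1, 1)  # sklearn expects a 2D array of features

model = LinearRegression(fit_intercept=True).fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)  # 1.0 would be a perfect fit

print("Slope: {:.2f} (true value 3)".format(slope))
print("Intercept: {:.2f} (true value 7)".format(intercept))
print("R^2 on the training data: {:.3f}".format(r_squared))
```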
Reality can be a little more like this…
Classification: Predict classes
● Well pump: [working, broken]
● CV: [accept, reject]
● Gender: [male, female, others]
● Iris variety: [iris setosa, iris virginica, iris versicolor]
Classification: The Iris Dataset
[Image: iris flower, with petal and sepal labelled]
Classification: first get your data
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
Y = iris.target
Classification: Split your data
ntest=10
np.random.seed(0)
indices = np.random.permutation(len(X))
iris_X_train = X[indices[:-ntest]]
iris_Y_train = Y[indices[:-ntest]]
iris_X_test = X[indices[-ntest:]]
iris_Y_test = Y[indices[-ntest:]]
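The manual shuffle-and-slice above can also be done with scikit-learn's `train_test_split` helper. This is an alternative sketch, not the code from the slides: it also holds out 10 points, but it will not pick exactly the same 10 as the permutation above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# An integer test_size is read as a number of samples, not a fraction
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=10, random_state=0)

print("{} training points, {} test points".format(len(X_train), len(X_test)))
```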
Classifier: Fit Model to Data
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
knn.fit(iris_X_train, iris_Y_train)
Classifier: Check your model
predicted_classes = knn.predict(iris_X_test)
print('kNN predicted classes: {}'.format(predicted_classes))
print('Real classes: {}'.format(iris_Y_test))
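Comparing two printed arrays by eye gets hard quickly, so a single number helps. The sketch below repeats the split and fit from the previous slides and then uses `accuracy_score` (an addition, not shown in the slides) to report the fraction of test points classified correctly.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Same holdout split as above: shuffle, then keep the last 10 points for testing
ntest = 10
np.random.seed(0)
indices = np.random.permutation(len(X))
iris_X_train, iris_Y_train = X[indices[:-ntest]], Y[indices[:-ntest]]
iris_X_test, iris_Y_test = X[indices[-ntest:]], Y[indices[-ntest:]]

knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
knn.fit(iris_X_train, iris_Y_train)
predicted_classes = knn.predict(iris_X_test)

# Fraction of test points where the predicted class matches the real class
accuracy = accuracy_score(iris_Y_test, predicted_classes)
print("Accuracy on {} test points: {:.2f}".format(ntest, accuracy))
```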
Clustering: Find groups in your data
Clustering: get your data
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
Y = iris.target
print("Xs: {}".format(X))
Clustering: Fit model to data
from sklearn import cluster
k_means = cluster.KMeans(3)
k_means.fit(iris.data)
Clustering: Check your model
print("Generated labels: \n{}".format(k_means.labels_))
print("Real labels: \n{}".format(Y))
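One caveat when comparing the two printouts: k-means numbers its clusters arbitrarily, so cluster 0 need not correspond to class 0. A possible way around this, sketched below, is the adjusted Rand index, which scores agreement between two groupings regardless of how the groups are numbered. The `n_init` and `random_state` settings are assumptions added to make the run repeatable.

```python
from sklearn import cluster, datasets
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()

# Fit k-means with 3 clusters; fixed random_state for repeatable results
k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(iris.data)

# 1.0 means identical groupings; values near 0 mean no better than chance
ari = adjusted_rand_score(iris.target, k_means.labels_)
print("Adjusted Rand index: {:.2f}".format(ari))
```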
Dimensionality Reduction
Dimensionality reduction: Get your data
Dimensionality reduction: Fit model to data
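The code for these two slides is not included in the transcript. As a stand-in, here is a minimal sketch that assumes the Digits example dataset and principal component analysis (PCA); the algorithm actually used in the session may differ.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

# Get your data: each digit image is 8x8 pixels, flattened to 64 features
digits = datasets.load_digits()

# Fit model to data: squeeze the 64 features down to 2 components
pca = PCA(n_components=2)
projected = pca.fit_transform(digits.data)

# Fraction of the original variance that the 2 components keep
explained = pca.explained_variance_ratio_.sum()

print("Original shape: {}".format(digits.data.shape))
print("Reduced shape: {}".format(projected.shape))
print("Variance kept: {:.0%}".format(explained))
```

With only two components left, each digit becomes a point that you can draw on a scatter plot.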
Recap: Choosing an Algorithm
Have: data and expected outputs
Want numbers? Try regression algorithms
Want classes? Try classification algorithms
Have: just data
Want to find structure? Try clustering algorithms
Want to look at it? Try dimensionality reduction
Model Validation
How well does the model fit new data?
“Holdout sets”:
split your data into training and test sets
learn your model with the training set
get a validation score for your test set
Models are rarely perfect… you might have to change parameters or model
● underfitting: model not complex enough to fit the training data
● overfitting: model too complex: fits the training data well, does badly on test
Overfitting and underfitting
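One way to see both effects is to fit polynomials of increasing degree to noisy data and compare training and test scores. The sketch below is an illustration on made-up data (a sine curve plus noise); the degrees 1, 4 and 15 and the 50/50 split are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: a sine curve with some noise added
gen = np.random.RandomState(0)
x = gen.rand(40)
y = np.sin(2 * np.pi * x) + 0.2 * gen.randn(40)
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

scores = {}
for degree in (1, 4, 15):
    # Polynomial regression: expand x into powers of x, then fit a linear model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    scores[degree] = (model.score(X_train, y_train),
                      model.score(X_test, y_test))
    print("degree {:2d}: train R^2 = {:.2f}, test R^2 = {:.2f}".format(
        degree, *scores[degree]))
```

A straight line typically scores poorly on both sets (underfitting). A very high degree scores best on the training half, but may do much worse on the test half (overfitting).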
The Confusion Matrix
                   Actually true     Actually false
Predicted true     True positive     False positive
Predicted false    False negative    True negative
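Scikit-learn can count these four cells for you. The sketch below uses a small made-up set of binary labels. Note that `confusion_matrix` puts the real classes in rows and the predicted classes in columns, with class 0 first, so for binary labels the flattened order is tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels: 1 = "true", 0 = "false"
real = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are real classes, columns are predicted classes
matrix = confusion_matrix(real, predicted)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print("True positives: {}, false positives: {}".format(tp, fp))
print("False negatives: {}, true negatives: {}".format(fn, tn))
```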
Test Metrics
Precision:
of all the results the classifier marked as “true”, how many were actually “true”?
Precision = tp / (tp + fp)
Recall:
of all the things that were really “true”, how many did the classifier mark as “true”?
Recall = tp / (tp + fn)
F1 score:
harmonic mean of precision and recall
F1_score = 2 * precision * recall / (precision + recall)
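The three formulas can be tried out directly on counts from a confusion matrix. The counts below are made up for illustration, and the function name is just a label for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f1_score


# Made-up counts: 6 true positives, 2 false positives, 4 false negatives
precision, recall, f1_score = precision_recall_f1(tp=6, fp=2, fn=4)

print("Precision: {:.2f}".format(precision))  # 6 / 8
print("Recall: {:.2f}".format(recall))        # 6 / 10
print("F1 score: {:.2f}".format(f1_score))
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a classifier cannot get a good F1 score by doing well on only one of precision or recall.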
Iris classification: metrics
from sklearn import metrics
print(metrics.classification_report(iris_Y_test, predicted_classes))
Exercises
Explore some algorithms
Notebooks 6.x contain examples of machine learning algorithms. Run them,
play with the numbers in them, break them, think about why they might have
broken.

Editor's notes

  1. What you’re learning isn’t the data, but a model that will help you understand (and possibly also explain) it.
  2. We bother making models because we want to start asking questions, and (hopefully) making changes in our world. Image from http://www.rosebt.com/blog/descriptive-diagnostic-predictive-prescriptive-analytics
  3. AKA import-instantiate-fit-predict. Hyperparameter: things like “how many clusters of data do I think there are in this dataset?”
  4. Lots of great tutorials on http://scikit-learn.org/stable/. You import from this library, which is called “sklearn” in Python code.
  5. Iris image from Nociveglia https://www.flickr.com/photos/40385177@N07/.
  6. Supervised versus unsupervised learning: supervised = give the algorithm both input data and the answers for that data (kinda like teaching), and it learns the connection between data and answers; unsupervised = give the algorithm just the data, and it finds the structure in that data. Semi-supervised learning (where you only have a few answers) does exist, but isn’t talked about much. There’s also reinforcement learning, where you know if a result is better or worse, but not how much it’s better or worse.
  7. Fit a line to a set of datapoints. Use that line to predict new values
  8. This will give you 40 random samples around the line y = 3x + 7. rand draws from a uniform distribution between 0 and 1; randn draws from a standard normal distribution.
  9. Note the hyperparameter (fit_intercept). This says that the fitted line doesn’t have to pass through (0,0): the model learns an intercept as well as a slope.
  10. predicted_slope = model.coef_
      predicted_intercept = model.intercept_
  11. 1-feature linear regression on the Diabetes dataset. This is where you need to change your model. In this case, you’d start by trying more features, then adapting the model hyperparameters (e.g. it might not be a straight line that you need to fit) or the model that you use (e.g. linear regression might not be the best model type to use on this dataset).
  12. When there are just two classifications, it’s called binary classification.
  13. Classification: finding the link between data and classes. This is the Iris dataset. It’s one of Scikit-learn’s example datasets.
  14. print("Targets: {}".format(iris['target_names']))
      print("Target data: {}".format(Y))
      print("Features: {}".format(iris['feature_names']))
      print("Feature data: {}".format(X))
  15. Why do we split into training and test sets? This is called a “holdout” set… we save some of our data, so we can use it to check how well our classifier does on data it hasn’t seen before.
      print('{} training points, {} test points'.format(len(iris_X_train), len(iris_X_test)))
  16. This is the k nearest neighbours algorithm. For every new datapoint, it looks at the k nearest datapoints it has classifications for, and assigns the new datapoint the class that’s most common amongst them. Here, we’re using 5 neighbours. We’re also using the Minkowski distance (https://machinelearning1.wordpress.com/2013/03/25/three-famous-metrics-manhattan-euclidean-minkowski/): this tells the algorithm how to compute the distance between two points, so we can define which points are ‘closest’. Common distance metrics you’ll see in machine learning include:
      Manhattan, or “city block” distance: add the distance along the x axis to the distance along the y axis (“city block” because that’s how you navigate in Manhattan).
      Euclidean distance: calculate the straight-line distance between the two points (e.g. sqrt(dx^2 + dy^2), where dx and dy are the differences along each axis).
      Minkowski distance: a generalisation of both, with a power parameter p. p=1 gives Manhattan distance and p=2 gives Euclidean distance. Scikit-learn’s KNeighborsClassifier defaults to p=2, so the classifier here is effectively using Euclidean distance.
  17. This is the digits example dataset.
  18. This is all in notebook 6.5
  19. There’s no “best” algorithm for every problem. This is also known as the “no free lunch” theorem. If you have data and an estimate of better/worse, try reinforcement learning. There are lots of variants on these algorithms: the Scikit-learn cheat sheet will help you choose between them: http://scikit-learn.org/stable/tutorial/machine_learning_map/
  20. Overfitting: matches the training data well, performs badly on new data… has high variance. Underfitting: doesn’t match the training data well, and usually performs badly on new data too… has high bias. Bias/variance tradeoff: adjust your hyperparameters until the model performs well on held-out data as well as on the training data. See e.g. http://scott.fortmann-roe.com/docs/BiasVariance.html
  21. This is all about your parameters, e.g. the difference between fitting a straight line, a quadratic curve or a higher-degree polynomial. Figures from Jake VanderPlas’ Python Data Science Handbook. We’ll talk about the bias-variance tradeoff later.
  22. False positive is also known as a “type 1 error”; false negative is also known as a “type 2 error”.
  23. These numbers are always between 0 and 1. If you want to play with F1, try it in Python, e.g.:
      import numpy as np
      p = np.array([.25, .25, .125, .5, .75])
      r = np.array([.001, .10, .7, .9, .3])
      2*p*r / (p + r)
  24. Support: how many things that are actually this class did we use to calculate these metrics? Precision: of all the results the classifier marked as “true”, how many were actually “true”? Recall: of all the things that were really “true”, how many did the classifier mark as “true”? F1: harmonic mean of precision and recall.
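
The distance metrics described in note 16 can be sketched with one small function, since Manhattan and Euclidean distance are both special cases of Minkowski distance. The function name and the two example points are made up for this illustration.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance between points a and b with power parameter p."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum the absolute differences raised to the power p, then take the p-th root
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


point_a = [0, 0]
point_b = [3, 4]

manhattan = minkowski_distance(point_a, point_b, p=1)  # 3 + 4
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)

print("Manhattan distance: {}".format(manhattan))
print("Euclidean distance: {}".format(euclidean))
```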