Dr. Parinaz Ameri
Intro to Machine Learning for non-Data Scientists
Agenda
● 1.5 hours: Introduction to ML algorithms
● 1.5 hours: Implementing algorithms for different use-cases
● 1 hour: Working on a recommendation mini-project
Machine Learning in Daily Life
Source: [xkcd_1838]
Machine Learning Definition
Arthur Samuel (1959):
“Field of study that gives computers the ability to learn without being explicitly
programmed.” [ML_Awad]
Source: [fortune]
Email Spam Filter
A Machine Learning Model
Machine Learning Definition
Tom Mitchell (1997):
“A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P if its performance at tasks in T, as measured by P, improves
with experience E.” [ML_Mitchell]
E, T and P in a Spam Filter Example
● Task T:
○ Classify emails as Spam or Ham.
● Experience E:
○ Watching you label emails as Spam or Ham.
● Performance measure P:
○ The number (or fraction) of emails that are correctly classified as Spam or Ham.
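The performance measure P can be sketched in a few lines of Python. This is only an illustration: the label lists are made-up data, and `fraction_correct` is a hypothetical helper name that does not come from the slides.

```python
def fraction_correct(true_labels, predicted_labels):
    """Return the fraction of emails whose predicted label matches the true one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up example: the labels you assigned (experience E) vs. the filter's guesses.
true_labels = ["spam", "ham", "ham", "spam", "ham"]
predicted_labels = ["spam", "ham", "spam", "spam", "ham"]

p = fraction_correct(true_labels, predicted_labels)
print(f"P = {p:.2f}")
```

If P rises as the filter sees more labeled emails, the program is learning in Mitchell's sense.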
Machine Learning Definition
Peter Flach (2012):
“Machine learning is the systematic study of algorithms and systems that improve their
knowledge or performance with experience.” [ML_Flach]
Source: [towardsdatascience]
Machine Learning Main Ingredients
1. Tasks:
○ An abstract representation of a problem we want to solve regarding the domain objects.
2. Models:
○ Many tasks can be represented as a mapping from data points to outputs.
○ A model is produced as the output of a machine learning algorithm applied to training data.
3. Features:
○ A language in which we describe the relevant objects in our domain.
Source: [ML_Flach]
Machine Learning Main Ingredients
Machine Learning Pipeline
[Diagram stages: Data Preparation, Training Data, Test Data, Feature Selection]
Source: [Medium]
Machine Learning Pipeline
[Diagram stages: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection]
Tasks & Learning Algorithms
● Supervised Learning
○ Regression
○ Classification
● Unsupervised Learning
○ Clustering
● Reinforcement Learning
● Recommendation systems
Supervised Learning Algorithms
Data is Labeled = Right Answers are Given
Housing Price Prediction
Regression: predict a continuous-valued output
Breast Cancer (Malignant, Benign)
Classification: predict a discrete-valued output (0, 1)
Features in Classification
Other Features:
- Clump thickness
- Uniformity of cell size
- Uniformity of cell shape
- ...
Exercise 1
Should you treat the following problems with regression or classification?
Problem 1: You want to develop a learning algorithm to examine individual customer accounts
and determine if each account has been hacked.
Problem 2: You have a huge list of identical items and want to predict how many of
them will be sold over the next 3 months.
Unsupervised Learning Algorithms
Data is Not Labeled
[Figure: Supervised Learning vs. Unsupervised Learning, each plotted on axes X1 and X2]
Clustering
Clustering in Biology
Source: [researchgate]
More Clustering Applications
Social Network Analysis
Organizing Computing Clusters
Market Segmentation
Exercise 2
Which of the following problems would you address with Unsupervised Learning
algorithms?
1. Given a dataset of patients diagnosed as either having diabetes or not, learn
to classify new patients as having diabetes or not.
2. Given a database of customer data, automatically discover market segments
and group customers into different market segments.
3. Given a dataset of news articles found on the web, group them into sets of
articles about the same story.
4. Given emails labeled as spam/ham, learn a spam filter.
Example of Supervised learning
Source:[radimrehurek]
Machine Learning Pipeline
[Diagram stages: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection, Building a model]
Models
| Predictive model | Descriptive model
Supervised learning | Classification, Regression | Subgrouping
Unsupervised learning | Predictive clustering | Clustering, Association Rule discovery
Model Types
● Geometric
● Probabilistic
● Logical
Building a Linear Regression Model
Mean Squared Error (MSE): measures the average of the squares of the errors.
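As a minimal sketch of what this slide describes, the snippet below fits a one-feature line by ordinary least squares and scores it with MSE = (1/n) * sum((y_i - yhat_i)^2). The function names and the size/price numbers are invented for illustration; the workshop itself may have used scikit-learn instead.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up housing data: size in square metres vs. price in thousands.
sizes = [50.0, 70.0, 90.0, 110.0]
prices = [210.0, 280.0, 390.0, 440.0]

slope, intercept = fit_line(sizes, prices)
predicted = [slope * x + intercept for x in sizes]
print("MSE on the training data:", mean_squared_error(prices, predicted))
```

A lower MSE means the line sits closer to the points it was fitted on; it says nothing yet about new data.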
Machine Learning Pipeline
[Diagram stages: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection, Building a model, Model Evaluation]
Model Validation
● Goodness of fit (fit error)
● Goodness of prediction (prediction error): generalization error
Overfitting: an unnecessary increase in model complexity
Underfitting: a model that is too simple will not fit the data properly
k-Fold Cross Validation
k=4 Cross Validation
Source: [wiki]
Mean Squared Prediction Error: computed on q data points that were not used in estimating the model
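The idea of k-fold cross validation can be sketched in plain Python: split the data into k folds, fit on k-1 of them, and compute the mean squared prediction error on the held-out fold. Everything below (function names, the straight-line model, the data) is an assumption made for illustration; in practice the data is usually shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k consecutive folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def cross_validated_mspe(xs, ys, k=4):
    """Average, over k folds, of the squared error on the held-out points."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        # Error is measured only on points the model has not seen.
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9]
print("4-fold MSPE:", cross_validated_mspe(xs, ys, k=4))
```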
Machine Learning Pipeline
[Diagram stages: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection, Building a model, Model Evaluation, New Data, Prediction Result]
Get your hands dirty
Source: [karlstratos]
Installing Docker with the Anaconda image
1. Install Docker:
> sudo apt install docker.io
2. Add your current user to the docker group:
> sudo usermod -a -G docker $USER
3. Restart your computer
4. Register and proceed at https://hub.docker.com/_/anaconda
5. Download the Anaconda Docker image:
> docker pull continuumio/anaconda
6. Run a container from the image:
> docker run -i -t continuumio/anaconda /bin/bash
7. Test your conda environment:
(base) root@9b9e483ba80e:/opt/conda# conda info
Running Jupyter Notebook
Run the following command on one line from the host machine:
> docker run -i -t -p 8888:8888 continuumio/miniconda /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip=0.0.0.0 --port=8888 --no-browser --allow-root"
- Open your Notebook in the browser
- Open a terminal and install: numpy, pandas, matplotlib, scipy and scikit-learn (imported as sklearn)
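Once the libraries are installed, a quick check from a notebook cell can confirm that Python can find them. This is a small sketch using only the standard library; `check_installed` is a hypothetical helper name, not part of the workshop material.

```python
import importlib.util

# Import names of the workshop libraries. The scikit-learn package
# is imported under the name "sklearn".
REQUIRED = ["numpy", "pandas", "matplotlib", "scipy", "sklearn"]


def check_installed(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_installed(REQUIRED)
for name, found in status.items():
    print(f"{name:12s} {'OK' if found else 'missing'}")
```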
Local Download server
172.90.0.161
Python Libraries for Machine Learning
● NumPy (http://www.numpy.org/):
○ Introduces objects for multidimensional arrays and matrices
○ Provides vectorization of mathematical operations on arrays and matrices
● SciPy (https://www.scipy.org/scipylib/):
○ Collection of algorithms for linear algebra, statistics, optimization, etc.
○ Built on NumPy
● Pandas (http://pandas.pydata.org/):
○ Provides tools for data manipulation and handling missing data
● SciKit-Learn (https://scikit-learn.org/stable/):
○ Provides machine learning algorithms: classification, regression, clustering, model validation, etc.
● Matplotlib (https://matplotlib.org/):
○ Python 2D plotting library
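To make "vectorization" concrete, here is a small NumPy sketch with made-up numbers. The arithmetic is applied to whole arrays at once instead of looping over elements in Python.

```python
import numpy as np

prices = np.array([200.0, 300.0, 400.0])
predictions = np.array([210.0, 290.0, 430.0])

# Vectorized: subtraction and squaring apply to every element at once,
# with no explicit Python loop.
squared_errors = (prices - predictions) ** 2
mse = squared_errors.mean()

# Two-dimensional arrays (matrices) work in the same style.
matrix = np.arange(6).reshape(2, 3)   # rows: [0, 1, 2] and [3, 4, 5]
column_sums = matrix.sum(axis=0)      # sum down each column

print(mse)
print(column_sums)
```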
Pandas DataFrame Data Types
Pandas type | Python native type | Description
object | string | The most general dtype. Will be assigned to your column if it contains mixed types (numbers and strings).
int64 | int | Whole numbers. 64 refers to the number of bits allocated to hold each value.
float64 | float | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal.
datetime64, timedelta[ns] | N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments.
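The dtypes in this table can be observed on a small, made-up DataFrame. The column names are invented for illustration, and the exact time resolution that pandas reports (for example `[ns]`) may differ between versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mixed": ["a", 1, 2.5],          # numbers and strings together
    "count": [1, 2, 3],              # whole numbers
    "with_gap": [1, 2, np.nan],      # whole numbers plus a missing value
    "when": pd.to_datetime(["2019-04-22", "2019-04-23", "2019-04-24"]),
})
# Subtracting datetimes yields a timedelta column.
df["elapsed"] = df["when"] - df["when"].iloc[0]

print(df.dtypes)
```

Note how `with_gap` becomes a float column even though its non-missing values are whole numbers.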
DataFrame Attributes
df.attribute | description
dtypes | list the types of the columns
columns | list the column names
axes | list the row labels and column names
ndim | number of dimensions
size | number of elements
shape | return a tuple representing the dimensionality
values | NumPy representation of the data
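A short sketch of these attributes on a made-up three-row DataFrame (the names and ages are invented). Attributes are read without parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cleo"],
    "age": [34, 29, 41],
})

print(df.dtypes)    # type of each column
print(df.columns)   # column names
print(df.axes)      # row labels and column names
print(df.ndim)      # number of dimensions
print(df.size)      # number of elements (rows * columns)
print(df.shape)     # (rows, columns)
print(df.values)    # NumPy representation of the data

n_records = df.shape[0]   # the first entry of shape is the number of records
```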
Exercise with DataFrame Attributes
1. How many records does this data frame have?
2. How many elements are there?
3. What are the column names?
4. What types of columns do we have in this data frame?
DataFrame Methods
df.method() | description
head([n]), tail([n]) | first/last n rows
describe() | generate descriptive statistics (for numeric columns only)
max(), min() | return max/min values for all numeric columns
mean(), median() | return mean/median values for all numeric columns
std() | standard deviation
sample([n]) | returns a random sample of the data frame
dropna() | drop all the records with missing values
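The methods in this table can be tried on a small, made-up DataFrame. One assumption to flag: recent pandas releases no longer silently skip text columns in `mean()` and `std()`, so this sketch passes `numeric_only=True` explicitly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Berlin", "Bonn", "Kiel", "Ulm"],
    "rooms": [2, 3, np.nan, 4],
    "price": [300.0, 450.0, 280.0, 600.0],
})

first_two = df.head(2)                    # first 2 rows
summary = df.describe()                   # statistics for the numeric columns
means = df.mean(numeric_only=True)        # missing values are skipped
spread = df.std(numeric_only=True)        # standard deviation per numeric column
one_row = df.sample(1, random_state=0)    # random sample of 1 row
complete = df.dropna()                    # rows without missing values

# Methods can be chained, e.g. the mean of only the first rows:
head_means = df.head(2).mean(numeric_only=True)
print(summary)
```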
Exercise with DataFrame Methods
1. Give the summary for the numeric columns in the dataset
2. Calculate standard deviation for all numeric columns
3. What are the mean values of the first 50 records in the dataset?
Hint: use head() method to subset the first 50 records and then calculate the mean
Handling Missing Values
● NaN ("Not a Number") marks missing values
● Often replaced by arbitrarily chosen values, such as -1 in a feature with only positive numbers, 0, or
the median (the most common choice)
● Be aware that replacing values changes the data
● Could also ignore the sample or feature with missing values
Missing Values in Pandas
● Missing values are excluded in GroupBy methods
● Many descriptive statistics methods have a ‘skipna’ option to control whether missing data should
be excluded. This option is set to True by default.
Dealing with Missing Values in DF
df.method() | description
dropna() | Drop missing observations
dropna(how='all') | Drop observations where all cells are NA
dropna(axis=1, how='all') | Drop a column if all its values are missing
dropna(thresh=5) | Drop rows that contain fewer than 5 non-missing values
fillna(0) | Replace missing values with zeros
isnull() | Returns True if the value is missing
notnull() | Returns True for non-missing values
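The calls in this table, sketched on a tiny made-up DataFrame. The sketch uses `thresh=2` rather than 5 because the example has only three columns, and the GroupBy part shows one reading of the earlier bullet: rows whose group key is missing are left out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, np.nan],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

complete_rows = df.dropna()                       # rows with no missing value at all
not_all_missing = df.dropna(how="all")            # drop rows where every cell is missing
no_empty_columns = df.dropna(axis=1, how="all")   # drop columns that are entirely missing
at_least_two = df.dropna(thresh=2)                # keep rows with >= 2 non-missing values
zero_filled = df.fillna(0)                        # replace missing values with zeros
missing_mask = df.isnull()                        # True where a value is missing

mean_skipping = df["a"].mean()                    # skipna=True is the default
mean_strict = df["a"].mean(skipna=False)          # NaN, because a value is missing

groups = pd.DataFrame({"key": ["x", np.nan, "x"], "value": [1, 2, 3]})
sums = groups.groupby("key")["value"].sum()       # the row with a missing key is left out
print(sums)
```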
Source: [Print_Lego]
Building a Linear Regression Model
Mean Squared Error (MSE): measures the average of the squares of the errors.
R-Squared
Here, ŷ_i is the fitted value for observation i and ȳ is the mean of Y.
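The slide's formulas did not survive in the text. A common definition, assumed here rather than taken from the slide, is R^2 = 1 - SS_res / SS_tot, with SS_res = sum((y_i - yhat_i)^2) and SS_tot = sum((y_i - ybar)^2). A minimal sketch under that assumption, with made-up numbers:

```python
def r_squared(y_true, y_fitted):
    """R^2 = 1 - SS_res / SS_tot (assumed definition, see the text above)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_fitted))   # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)                # total variation
    return 1 - ss_res / ss_tot


y = [3.0, 5.0, 7.0, 9.0]
fitted = [3.5, 4.5, 7.5, 8.5]
r2 = r_squared(y, fitted)
print(f"R-squared = {r2:.2f}")
```

Under this definition a perfect fit gives 1, and a model that always predicts the mean of Y gives 0.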
k-Nearest Neighbors
Distance Measurements
KNN Algorithm
Accuracy
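Only the titles of these slides survive, so the following is a generic sketch of k-nearest neighbors rather than the workshop's own code. It assumes Euclidean distance and a majority vote among the k closest training points, and it measures accuracy as the fraction of correct predictions. All data and names are made up.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up two-feature data with two well-separated classes.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

test_points = [(1.2, 1.4), (6.4, 6.2)]
test_labels = ["benign", "malignant"]

predictions = [knn_predict(train_points, train_labels, q, k=3) for q in test_points]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```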
K-Means Clustering
K-Means Clustering Algorithm
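The slide content for the algorithm is not in the text, so this is a generic sketch of the usual k-means loop: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. Starting from the first k points is a simplifying assumption to keep the example deterministic; real implementations usually pick starting centroids at random.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, n_iterations=10):
    centroids = list(points[:k])   # assumption: simple deterministic start
    clusters = [[] for _ in range(k)]
    for _ in range(n_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up, unlabeled two-feature data.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
```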
Future Plans?
Further Learning
● Kaggle: the place to do data science projects
● Seeing Theory: a visual introduction to probability and statistics.
● KDnuggets: Machine Learning, Data Science, Data Mining, Big Data, Analytics, AI.
Software
Reading Recommendations
● Machine Learning: The art and science of algorithms that make sense of data by Peter Flach
● Python for Data Analysis by Wes McKinney
● https://www.kdnuggets.com/2018/12/feature-engineering-explained.html
References
[ML_Awad] Awad, M., Khanna, R. (2015). Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA.
[xkcd_1838] https://xkcd.com/1838/
[fortune] http://fortune.com/2018/06/25/ai-business-breakthrough/
[ML_Flach] Flach, P. (2012). Machine Learning: The art and science of algorithms that make sense of data. Cambridge University Press.
[ML_Mitchell] Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill. ISBN: 978-0-07-042807-2
[Medium_Sharma] https://medium.com/datadriveninvestor/how-to-built-a-recommender-system-rs-616c988d64b2
[karlstratos] http://karlstratos.com/drawings/drawings.html
[Print_Lego] https://www.pinterest.com/pin/422071796300372061/
[Medium] https://medium.com/@mehulved1503/feature-selection-and-feature-extraction-in-machine-learning-an-overview-57891c595e96
[researchgate] https://www.researchgate.net/figure/Hierarchical-clustering-of-the-181-genes-corresponding-to-zinc-biology-related-functional_fig6_26688269
[radimrehurek] https://radimrehurek.com/data_science_python/
[wiki] https://en.wikipedia.org/wiki/Cross-validation_(statistics)
Icon References
● Icons made by: Freepik from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Pixel perfect from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Vectors Market from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Smashicons from www.flaticon.com is licensed by CC 3.0 BY
We organize IT
24.04.2019
Your Contact
Dr. Hamzeh Alavirad
Founder, oranIT GmbH
alavirad@oranit.de
0049-176-8080-7585
Dr. Parinaz Ameri
Co-Founder, oranIT GmbH
ameri@oranit.de
0049-176-3497-0683

More Related Content

What's hot

Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data
萍華 楊
 
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.aiAutomatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Sri Ambati
 

What's hot (20)

LR1. Summary Day 1
LR1. Summary Day 1LR1. Summary Day 1
LR1. Summary Day 1
 
L13. Cluster Analysis
L13. Cluster AnalysisL13. Cluster Analysis
L13. Cluster Analysis
 
Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data
 
L4. Ensembles of Decision Trees
L4. Ensembles of Decision TreesL4. Ensembles of Decision Trees
L4. Ensembles of Decision Trees
 
Session 06 machine learning.pptx
Session 06 machine learning.pptxSession 06 machine learning.pptx
Session 06 machine learning.pptx
 
Ppt shuai
Ppt shuaiPpt shuai
Ppt shuai
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)
 
sentiment analysis using support vector machine
sentiment analysis using support vector machinesentiment analysis using support vector machine
sentiment analysis using support vector machine
 
L3. Decision Trees
L3. Decision TreesL3. Decision Trees
L3. Decision Trees
 
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.aiAutomatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai
 
Supervised Machine Learning in R
Supervised  Machine Learning  in RSupervised  Machine Learning  in R
Supervised Machine Learning in R
 
Machine Learning Lecture 2 Basics
Machine Learning Lecture 2 BasicsMachine Learning Lecture 2 Basics
Machine Learning Lecture 2 Basics
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
 
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno CandelH2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
 
Machine learning Lecture 1
Machine learning Lecture 1Machine learning Lecture 1
Machine learning Lecture 1
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to Stacking
 
VSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly DetectionVSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly Detection
 
Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models
 
WEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedWEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been Learned
 
A Fairness-aware Machine Learning Interface for End-to-end Discrimination Dis...
A Fairness-aware Machine Learning Interface for End-to-end Discrimination Dis...A Fairness-aware Machine Learning Interface for End-to-end Discrimination Dis...
A Fairness-aware Machine Learning Interface for End-to-end Discrimination Dis...
 

Similar to Intro to Machine Learning for non-Data Scientists

Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
my6305874
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...
Simplilearn
 

Similar to Intro to Machine Learning for non-Data Scientists (20)

Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptx
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
 
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai University
 
Ml programming with python
Ml programming with pythonMl programming with python
Ml programming with python
 
.NET Fest 2017. Игорь Кочетов. Классификация результатов тестирования произво...
.NET Fest 2017. Игорь Кочетов. Классификация результатов тестирования произво....NET Fest 2017. Игорь Кочетов. Классификация результатов тестирования произво...
.NET Fest 2017. Игорь Кочетов. Классификация результатов тестирования произво...
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptx
 
1. Demystifying ML.pdf
1. Demystifying ML.pdf1. Demystifying ML.pdf
1. Demystifying ML.pdf
 
Python for data analysis
Python for data analysisPython for data analysis
Python for data analysis
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine Learning
 
Lec1 intoduction.pptx
Lec1 intoduction.pptxLec1 intoduction.pptx
Lec1 intoduction.pptx
 
Machine Learning - Lecture1.pptx.pdf
Machine Learning - Lecture1.pptx.pdfMachine Learning - Lecture1.pptx.pdf
Machine Learning - Lecture1.pptx.pdf
 
Application of Machine Learning in Agriculture
Application of Machine  Learning in AgricultureApplication of Machine  Learning in Agriculture
Application of Machine Learning in Agriculture
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...
 
LR2. Summary Day 2
LR2. Summary Day 2LR2. Summary Day 2
LR2. Summary Day 2
 
Getting started with Machine Learning
Getting started with Machine LearningGetting started with Machine Learning
Getting started with Machine Learning
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
 
Machine Learning and Deep Learning 4 dummies
Machine Learning and Deep Learning 4 dummies Machine Learning and Deep Learning 4 dummies
Machine Learning and Deep Learning 4 dummies
 
Machine learning4dummies
Machine learning4dummiesMachine learning4dummies
Machine learning4dummies
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 

Recently uploaded

Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
JohnnyPlasten
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
shambhavirathore45
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 

Intro to Machine Learning for non-Data Scientists

  • 1. Dr. Parinaz Ameri Intro to Machine Learning for non-Data Scientists
  • 2. Agenda ● 1.5 hours: Introduction to ML algorithms ● 1.5 hours: Implementing algorithms for different use-cases ● 1 hour: Working on a recommendation mini-project
  • 3. Machine Learning in Daily Life
  • 5. Machine Learning Definition Arthur Samuel (1959): “Field of study that gives computers the ability to learn without being explicitly programmed.” [ML_Awad] Source: [fortune]
  • 6. Email Spam Filter A Machine Learning Model
  • 7. Machine Learning Definition Tom Mitchell (1998): “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.” [ML_Mitchell]
  • 8. E, T and P in a Spam Filter Example ● Task T: ○ Classify emails as Spam or Ham. ● Experience E: ○ Monitor you labeling emails as Spam or Not spam. ● Performance measure P: ○ The Number (or fraction) of emails that are correctly classified as Spam or Ham.
  • 9. Machine Learning Definition Peter Flach (2012): “Machine learning is the systematic study of algorithms and systems that improve their knowledge or performance with experience.” [ML_Flach]
  • 11. Machine Learning Main Ingredients 1. Tasks: ○ An abstract representation of a problem we want to solve regarding the domain objects 2. Models: ○ Representation of many tasks as a model from data points to outputs. ○ Produces as the output of a machine learning algorithm applied to training data. 3. Features: ○ A language definition in which we describe the relevant objects in our domain.
  • 16. Tasks & Learning Algorithms ● Supervised Learning ○ Regression ○ Classification ● Unsupervised Learning ○ Clustering ● Reinforcement Learning ● Recommendation systems
  • 17. Supervised Learning Algorithms Data is Labeled = Right Answers are Given
  • 18. Housing Price Prediction 750 Regression : Predict a continuous valued output
  • 19. Breast Cancer (Malignant, Benign) Classification : Predict discrete valued output (0,1)
  • 20. Features in Classification Other Features: - Clump thickness - Uniformity of cell size - Uniformity of cell shape - ...
  • 21. Exercise 1 Should you treat the following problems with regression or classification? Problem 1: You want to develop a learning algorithm to examine individual customer accounts and determine if each account has been hacked. Problem 2: You have a huge list of identical items and want to predict which how many of them will be sold over next 3 months.
  • 26. More Clustering Applications Social Network Analysis Organizing Computing Clusters Market Segmentation
  • 27. Exercise 2 Which of the following problems would you address with Unsupervised Learning algorithms? 1. Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not. 2. Given a database of customer data, automatically discover market segments and group customers into different market segments. 3. Given a dataset of news articles found on the web, group them into set of articles about the same story. 4. Given email labeled as spam/ham, learn spam filter.
  • 28. Example of Supervised learning Source:[radimrehurek]
  • 29. Machine Learning Pipeline Data Preparation Training Data Test Data Feature Selection ML Algorithm Selection Building a model
  • 30. Models Predictive model Descriptive model Supervised learning Classification, Regression Subgrouping Unsupervised learning Predictive clustering Clustering, Association Rule discovery
  • 31. Model Types ● Geometric ● Probabilistic ● Logical
  • 32. Building a Linear Regression Model Mean Squared Error (MSE): Measures the average of the squares of the errors
  • 33. Machine Learning Pipeline Data Preparation Training Data Test Data Feature Selection ML Algorithm Selection Building a model Model Evaluation
  • 34. Model Validation ● Goodness of fit (fit error) ● Goodness of prediction (prediction error): generalization error
  • 36. Underfitting: too simple model will not fit data properly
  • 38. k=4 Cross Validation Source: [wiki] Mean Squared Prediction Error: computed on q data points that were not used in estimating the model
  • 39. Machine Learning Pipeline Data Preparation Training Data Test Data Feature Selection ML Algorithm Selection Building a model Model Evaluation New Data Prediction Result
  • 40. Get your hands dirty Source: [karlstratos]
  • 41. Installing docker with Anaconda image 1. Install docker with : > sudo apt install docker.io 2. Add your current user to the docker group with the following command: > sudo usermod -a -G docker $USER 3. Restart your computer 4. Register and proceed at https://hub.docker.com/_/anaconda 5. Download the docker of anaconda with the following command: > docker pull continuumio/anaconda 6. Run docker: > docker run -i -t continuumio/anaconda /bin/bash 7. Test your conda environment: (base) root@9b9e483ba80e:/opt/conda# conda info
  • 42. Running Jupyter Notebook Run the following command in one line from host machine: > docker run -i -t -p 8888:8888 continuumio/miniconda /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip=0.0.0.0 -- port=8888 --no-browser --allow-root" - Open your Notebook in the browser - Open a terminal and install: numpy pandas matplotlib scipy and sklearn
  • 44. Python Libraries for Machine Learning ● NumPy (http://www.numpy.org/ ): ○ Introduce objects for multidimensional arrays and matrices ○ Provides vectorization of mathematical operations on arrays and matrices ● SciPy(https://www.scipy.org/scipylib/ ): ○ Collection of algorithms for linear algebra, statistics, optimization and etc. ○ Build on NumPy ● Pandas(http://pandas.pydata.org/ ): ○ Provide tools for data manipulation and handling missing data ● SciKit-Learn(https://scikit-learn.org/stable/ ): ○ Provide machine learning algorithms: classification, regression, clustering, model validation etc. ● Matplotlib(https://matplotlib.org/ ): ○ Python 2D plotting library
  • 45. Pandas DataFrame Data Types
    Pandas type | Python native type | Description
    object | string | The most general dtype. Assigned to a column that contains mixed types (numbers and strings).
    int64 | int | Integer values. 64 refers to the number of bits used to store each value.
    float64 | float | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas defaults to float64, in case the missing value has a decimal.
    datetime64, timedelta64[ns] | N/A (but see the datetime module in Python's standard library) | Values meant to hold time data. Look into these for time series experiments.
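The dtype rules above can be observed directly on a small DataFrame. The column names and values below are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mixed": [1, "two", 3.0],            # mixed numbers and strings -> object
    "age": [34, 28, 45],                 # whole numbers -> int64
    "age_with_gap": [34, np.nan, 45],    # numbers plus NaN -> float64
    "joined": pd.to_datetime(["2019-01-05", "2019-02-11", "2019-04-24"]),
})

print(df.dtypes)
```

The exact resolution of the datetime column (for example `datetime64[ns]`) can differ between pandas versions.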
  • 46. DataFrame Attributes
    df.attribute | description
    dtypes | list the types of the columns
    columns | list the column names
    axes | list the row labels and column names
    ndim | number of dimensions
    size | number of elements
    shape | return a tuple representing the dimensionality
    values | NumPy representation of the data
  • 47. Exercise with DataFrame Attributes
    1. How many records does this data frame have?
    2. How many elements are there?
    3. What are the column names?
    4. What types of columns do we have in this data frame?
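One possible way to approach the exercise, sketched on a small stand-in DataFrame. The workshop dataset is not included in this transcript, so the columns and values here are assumed.

```python
import pandas as pd

# Stand-in data (assumed); replace with the workshop data frame.
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AssocProf"],
    "service": [18, 3, 25, 9],
    "salary": [139750.0, 79750.0, 150500.0, 103450.0],
})

n_records = df.shape[0]          # 1. number of records (rows)
n_elements = df.size             # 2. number of elements (rows * columns)
column_names = list(df.columns)  # 3. column names
column_types = df.dtypes         # 4. dtype of each column

print(n_records, n_elements, column_names)
print(column_types)
```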
  • 48. DataFrame Methods
    df.method() | description
    head([n]), tail([n]) | first/last n rows
    describe() | generate descriptive statistics (for numeric columns only)
    max(), min() | return max/min values for all numeric columns
    mean(), median() | return mean/median values for all numeric columns
    std() | standard deviation
    sample([n]) | returns a random sample of the data frame
    dropna() | drop all the records with missing values
  • 49. Exercise with DataFrame Methods
    1. Give the summary for the numeric columns in the dataset
    2. Calculate the standard deviation for all numeric columns
    3. What are the mean values of the first 50 records in the dataset?
    Hint: use the head() method to subset the first 50 records and then calculate the mean
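A sketch of the three steps on synthetic data. The DataFrame below is an assumption standing in for the workshop dataset; `numeric_only=True` restricts the statistics to numeric columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumed): 100 rows, one text and two numeric columns.
df = pd.DataFrame({
    "group": ["a", "b"] * 50,
    "x": np.arange(100, dtype=float),       # 0.0 .. 99.0
    "y": np.arange(100, dtype=float) * 2,   # 0.0 .. 198.0
})

summary = df.describe()                                # 1. summary of numeric columns
std_all = df.std(numeric_only=True)                    # 2. standard deviation per column
mean_first_50 = df.head(50).mean(numeric_only=True)    # 3. mean of the first 50 records

print(summary)
print(std_all)
print(mean_first_50)
```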
  • 50. Handling Missing Values
    ● 'NaN' (Not a Number) marks missing values
    ● Often replaced by an arbitrarily chosen value, such as -1 in a feature with only positive numbers, 0, or the median (most common)
    ● Be aware that the data has been changed by doing so
    ● Alternatively, ignore the sample or feature with missing values
  • 51. Missing Values in Pandas
    ● Missing values are excluded in the GroupBy method
    ● Many descriptive statistics methods have a 'skipna' option to control whether missing data is excluded. It is set to True by default.
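Both behaviours can be checked on a tiny DataFrame. The data is assumed for illustration; the example reads the GroupBy statement as "rows whose group key is missing are left out".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", np.nan],        # last row has no group key
    "points": [10.0, np.nan, 30.0, 40.0],   # second row has a missing value
})

mean_default = df["points"].mean()              # skipna=True by default: NaN is ignored
mean_strict = df["points"].mean(skipna=False)   # NaN propagates into the result
by_team = df.groupby("team")["points"].sum()    # the row with a missing key is excluded

print(mean_default, mean_strict)
print(by_team)
```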
  • 52. Dealing with Missing Values in DF
    df.method() | description
    dropna() | Drop missing observations
    dropna(how='all') | Drop observations where all cells are NA
    dropna(axis=1, how='all') | Drop a column if all its values are missing
    dropna(thresh=5) | Drop rows that contain fewer than 5 non-missing values
    fillna(0) | Replace missing values with zeros
    isnull() | Returns True if the value is missing
    notnull() | Returns True for non-missing values
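The methods in the table can be tried on a small DataFrame with assumed values. A threshold of 2 is used instead of 5 only because the example has three columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

complete_rows = df.dropna()                        # rows without any missing value
not_all_missing = df.dropna(how="all")             # drop rows where every cell is NA
without_empty_cols = df.dropna(axis=1, how="all")  # drop columns that are entirely NA
at_least_two = df.dropna(thresh=2)                 # keep rows with >= 2 non-missing values
filled = df.fillna(0)                              # replace missing values with 0
missing_mask = df.isnull()                         # True where a value is missing

print(without_empty_cols)
print(filled)
```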
  • 54. Building a Linear Regression Model Mean Squared Error (MSE): Measures the average of the squares of the errors
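As a sketch, MSE can be written as the mean of (y_i - ŷ_i)² over all n observations. The example below fits a straight line by closed-form least squares on assumed toy data and computes the MSE of the fit; `fit_line` and `mean_squared_error` are illustrative helper names in plain Python, not the scikit-learn API.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]  # toy data (assumed)
slope, intercept = fit_line(x, y)
y_hat = [slope * xi + intercept for xi in x]
mse = mean_squared_error(y, y_hat)
print(slope, intercept, mse)
```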
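  • 55. R-Squared: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ and $SS_{tot} = \sum_i (y_i - \bar{y})^2$. Here, $\hat{y}_i$ is the fitted value for observation $i$ and $\bar{y}$ is the mean of $Y$.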
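A small sketch of the computation, assuming the standard definition of R-squared as one minus the ratio of the residual sum of squares to the total sum of squares. The data and the `r_squared` helper are illustrative.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for observed values and fitted values."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot


y = [2.0, 4.0, 5.0, 8.0]        # observed values (assumed toy data)
y_hat = [1.9, 3.8, 5.7, 7.6]    # fitted values from a straight-line fit (assumed)
r2 = r_squared(y, y_hat)
print(r2)
```

A perfect fit gives 1, while always predicting the mean of Y gives 0.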
  • 63. Further Learning
    ● Kaggle: the place to do data science projects
    ● Seeing Theory: a visual introduction to probability and statistics
    ● KDnuggets: Machine Learning, Data Science, Data Mining, Big Data, Analytics, AI, Software
  • 64. Reading Recommendations
    ● Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter Flach
    ● Python for Data Analysis by Wes McKinney
    ● https://www.kdnuggets.com/2018/12/feature-engineering-explained.html
  • 65. References
    [ML_Awad] Awad, M., Khanna, R. (2015). Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA.
    [xkcd_1838] https://xkcd.com/1838/
    [fortune] http://fortune.com/2018/06/25/ai-business-breakthrough/
    [ML_Flach] Flach, P. (2012). Machine Learning: The art and science of algorithms that make sense of data. Cambridge University Press.
    [ML_Mitchell] Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill. ISBN: 978-0-07-042807-2
    [Medium_Sharma] https://medium.com/datadriveninvestor/how-to-built-a-recommender-system-rs-616c988d64b2
    [karlstratos] http://karlstratos.com/drawings/drawings.html
    [Print_Lego] https://www.pinterest.com/pin/422071796300372061/
    [Medium] https://medium.com/@mehulved1503/feature-selection-and-feature-extraction-in-machine-learning-an-overview-57891c595e96
    [researchgate] https://www.researchgate.net/figure/Hierarchical-clustering-of-the-181-genes-corresponding-to-zinc-biology-related-functional_fig6_26688269
  • 66. References (2)
    [redimrehurek] https://radimrehurek.com/data_science_python/
    [wiki] https://en.wikipedia.org/wiki/Cross-validation_(statistics)
  • 67. Icon References
    ● Icons made by: Freepik from www.flaticon.com is licensed by CC 3.0 BY
    ● Icons made by: Pixel perfect from www.flaticon.com is licensed by CC 3.0 BY
    ● Icons made by: Vectors Market from www.flaticon.com is licensed by CC 3.0 BY
    ● Icons made by: Smashicons from www.flaticon.com is licensed by CC 3.0 BY
  • 68. We organize IT, 24.04.2019
    Your Contact
    Dr. Hamzeh Alavira, Founder, oranIT GmbH, alavirad@oranit.de, 0049-176-8080-7585
    Dr. Parinaz Ameri, Co-Founder, oranIT GmbH, ameri@oranit.de, 0049-176-3497-0683