2. Agenda
● 1.5 hours: Introduction to ML algorithms
● 1.5 hours: Implementing algorithms for different use-cases
● 1 hour: Working on a recommendation mini-project
5. Machine Learning Definition
Arthur Samuel (1959):
“Field of study that gives computers the ability to learn without being explicitly
programmed.” [ML_Awad]
Source: [fortune]
7. Machine Learning Definition
Tom Mitchell (1997):
“A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P if its performance at tasks in T, as measured by P, improves
with experience E.” [ML_Mitchell]
8. E, T and P in a Spam Filter Example
● Task T:
○ Classify emails as Spam or Ham.
● Experience E:
○ Watching you label emails as Spam or Ham.
● Performance measure P:
○ The number (or fraction) of emails that are correctly classified as Spam or Ham.
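A minimal sketch of the performance measure P, assuming it is the fraction of correctly classified emails. The labels and predictions are made-up illustration data, and `accuracy` is a hypothetical helper name, not something taken from the slides.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical labels supplied by the user (experience E) ...
true_labels = ["spam", "ham", "ham", "spam", "ham"]
# ... and what an imaginary filter predicted for the same emails (task T).
predicted_labels = ["spam", "ham", "spam", "spam", "ham"]

# Performance measure P: 4 of the 5 predictions agree with the labels.
p = accuracy(true_labels, predicted_labels)
print(f"P = {p:.2f}")
```

In Mitchell's terms, the filter "learns" if this number goes up as more labeled emails become available.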
9. Machine Learning Definition
Peter Flach (2012):
“Machine learning is the systematic study of algorithms and systems that improve their
knowledge or performance with experience.” [ML_Flach]
11. Machine Learning Main Ingredients
1. Tasks:
○ An abstract representation of a problem we want to solve regarding the domain objects.
2. Models:
○ Many tasks can be represented as a mapping from data points to outputs.
○ This mapping (the model) is produced as the output of a machine learning algorithm applied to training data.
3. Features:
○ Define a language in which we describe the relevant objects in our domain.
21. Exercise 1
Should you treat the following problems with regression or classification?
Problem 1: You want to develop a learning algorithm to examine individual customer accounts
and determine if each account has been hacked.
Problem 2: You have a large stock of identical items and want to predict how many of
them will be sold over the next 3 months.
27. Exercise 2
Which of the following problems would you address with Unsupervised Learning
algorithms?
1. Given a dataset of patients diagnosed as either having diabetes or not, learn
to classify new patients as having diabetes or not.
2. Given a database of customer data, automatically discover market segments
and group customers into different market segments.
3. Given a dataset of news articles found on the web, group them into sets of
articles about the same story.
4. Given emails labeled as spam/ham, learn a spam filter.
41. Installing Docker with the Anaconda image
1. Install Docker with:
> sudo apt install docker.io
2. Add your current user to the docker group with the following command:
> sudo usermod -a -G docker $USER
3. Restart your computer
4. Register and proceed at https://hub.docker.com/_/anaconda
5. Download the Anaconda Docker image with the following command:
> docker pull continuumio/anaconda
6. Run a container from the image:
> docker run -i -t continuumio/anaconda /bin/bash
7. Test your conda environment:
(base) root@9b9e483ba80e:/opt/conda# conda info
42. Running Jupyter Notebook
Run the following command as a single line from the host machine:
> docker run -i -t -p 8888:8888 continuumio/miniconda /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip=0.0.0.0 --port=8888 --no-browser --allow-root"
- Open your Notebook in the browser
- Open a terminal and install numpy, pandas, matplotlib, scipy and scikit-learn (imported in Python as sklearn)
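To check that the installation worked, a notebook cell along the following lines can report which of the libraries are importable. This is only a sketch: `REQUIRED` and `missing_modules` are hypothetical names chosen for illustration, and the list uses the import names (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names of the libraries used in this workshop.
REQUIRED = ["numpy", "pandas", "matplotlib", "scipy", "sklearn"]

def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are available.")
```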
44. Python Libraries for Machine Learning
● NumPy (http://www.numpy.org/):
○ Introduces objects for multidimensional arrays and matrices
○ Provides vectorization of mathematical operations on arrays and matrices
● SciPy (https://www.scipy.org/scipylib/):
○ Collection of algorithms for linear algebra, statistics, optimization, etc.
○ Built on NumPy
● Pandas (http://pandas.pydata.org/):
○ Provides tools for data manipulation and handling missing data
● SciKit-Learn (https://scikit-learn.org/stable/):
○ Provides machine learning algorithms: classification, regression, clustering, model validation,
etc.
● Matplotlib (https://matplotlib.org/):
○ Python 2D plotting library
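The idea of vectorization in NumPy can be illustrated with a small sketch. The arrays below are made-up illustration data; the point is only that arithmetic applies to whole arrays at once, without an explicit Python loop.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])   # 1-D array
quantities = np.array([1, 2, 3])

# Vectorized arithmetic: the multiplication is applied element by element.
totals = prices * quantities

matrix = np.array([[1, 2], [3, 4]])     # 2-D array (matrix)
doubled = matrix * 2                    # every element is multiplied by 2

print(totals, totals.sum())
print(doubled, matrix.shape)
```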
45. Pandas DataFrame Data Types
Pandas type | Python native type | Description
object | string | The most general dtype. Will be assigned to your column if it contains mixed types (numbers and strings).
int64 | int | Whole numbers. 64 refers to the number of bits allocated to hold each value.
float64 | float | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, because NaN is itself a floating-point value.
datetime64[ns], timedelta64[ns] | N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments.
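A short sketch of how pandas assigns these types. The column names and values are invented for illustration; the exact time unit of the datetime column may differ between pandas versions, so the sketch only checks that it is a datetime type.

```python
import pandas as pd

df = pd.DataFrame({
    "mixed": [1, "two", 3.0],      # numbers and strings -> object
    "count": [1, 2, 3],            # whole numbers -> int64
    "with_gap": [1, None, 3],      # whole numbers plus a missing value -> float64
    "when": pd.to_datetime(["2019-04-22", "2019-04-23", "2019-04-24"]),
})

print(df.dtypes)

# The missing value is stored as NaN, which forces the column to float64.
print(df["with_gap"])
```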
46. DataFrame Attributes
df.attribute | description
dtypes | list the types of the columns
columns | list the column names
axes | list the row labels and column names
ndim | number of dimensions
size | number of elements
shape | return a tuple representing the dimensionality
values | NumPy representation of the data
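The attributes above can be tried on a small frame. The data below is hypothetical and only serves to show what each attribute returns.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [31, 45, 27],
    "salary": [52000.0, 61000.0, 48000.0],
})

print(df.shape)          # (number of rows, number of columns)
print(df.size)           # rows * columns
print(df.ndim)           # 2 for a DataFrame
print(list(df.columns))  # column names
print(df.dtypes)         # one dtype per column
print(df.axes)           # [row labels, column names]
print(df.values)         # the data as a NumPy array
```

Note that attributes are accessed without parentheses, unlike methods.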
47. Exercise with DataFrame Attributes
1. How many records does this data frame have?
2. How many elements are there?
3. What are the column names?
4. What types of columns do we have in this data frame?
48. DataFrame Methods
df.method() | description
head([n]), tail([n]) | first/last n rows
describe() | generate descriptive statistics (for numeric columns only)
max(), min() | return max/min values for all numeric columns
mean(), median() | return mean/median values for all numeric columns
std() | standard deviation
sample([n]) | returns a random sample of the data frame
dropna() | drop all the records with missing values
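A brief sketch of these methods on invented data. The frame contains only numeric columns; in recent pandas versions, calling `mean()` or `std()` on a frame that also has text columns may require `numeric_only=True`.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [31, 45, 27, None, 38],
    "salary": [52000.0, 61000.0, None, 58000.0, 50000.0],
})

print(df.head(2))                     # first 2 rows
print(df.tail(2))                     # last 2 rows
print(df.describe())                  # count, mean, std, min, quartiles, max
means = df.mean()                     # column means, missing values are skipped
spread = df.std()                     # sample standard deviation per column
print(df.sample(2, random_state=0))   # 2 random rows, seeded for repeatability
complete = df.dropna()                # keep only rows without missing values
print(means, spread, complete, sep="\n")
```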
49. Exercise with DataFrame Methods
1. Give a summary of the numeric columns in the dataset
2. Calculate the standard deviation for all numeric columns
3. What are the mean values of the first 50 records in the dataset?
Hint: use the head() method to subset the first 50 records and then calculate the mean
50. Handling Missing Values
● NaN (“Not a Number”) marks missing values
● Often replaced by an arbitrarily chosen value, such as -1 in a feature with only positive numbers, 0, or the
median (most common)
● But be aware that the data has been changed
● Could also ignore the samples or features with missing values
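The options above can be sketched with pandas as follows. The series is made-up illustration data, and the indicator `was_missing` is just one possible way to stay aware of what has been changed.

```python
import pandas as pd

ages = pd.Series([31.0, None, 27.0, 45.0, None])

# Option 1: a sentinel value that cannot occur in the real data.
with_sentinel = ages.fillna(-1)

# Option 2: the median of the observed values (31.0 here).
with_median = ages.fillna(ages.median())

# Remember which entries were filled in, so the change is not forgotten.
was_missing = ages.isna()

# Option 3: ignore the samples with missing values.
observed_only = ages.dropna()

print(with_sentinel.tolist(), with_median.tolist(), was_missing.tolist())
```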
51. Missing Values in Pandas
● Missing values are excluded by the GroupBy method
● Many descriptive statistics methods have a ‘skipna’ option to control whether missing data should
be excluded. This option is set to True by default.
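A small sketch of both behaviours, using invented data. The column names `segment` and `spend` are hypothetical.

```python
import math
import pandas as pd

df = pd.DataFrame({
    "segment": ["a", "a", None, "b"],
    "spend": [10.0, None, 30.0, 50.0],
})

mean_default = df["spend"].mean()              # skipna=True: NaN is ignored
mean_strict = df["spend"].mean(skipna=False)   # NaN propagates to the result

# Rows whose group key is missing are left out of the groups;
# missing values inside a group are skipped by the aggregation.
per_segment = df.groupby("segment")["spend"].mean()

print(mean_default, mean_strict)
print(per_segment)
```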
52. Dealing with Missing Values in DF
df.method() | description
dropna() | Drop observations (rows) with any missing value
dropna(how='all') | Drop observations where all cells are NA
dropna(axis=1, how='all') | Drop a column if all its values are missing
dropna(thresh=5) | Drop rows that contain fewer than 5 non-missing values
fillna(0) | Replace missing values with zeros
isnull() | Returns True if the value is missing
notnull() | Returns True for non-missing values
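The calls in the table can be compared on one small frame. The data is invented, and `thresh=2` is used instead of 5 only because the example frame has three columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

rows_complete = df.dropna()                     # every row has a gap, so nothing is left
rows_not_empty = df.dropna(how="all")           # drops only the fully empty last row
cols_not_empty = df.dropna(axis=1, how="all")   # drops the fully empty column "c"
rows_two_plus = df.dropna(thresh=2)             # keeps rows with at least 2 non-missing values
zero_filled = df.fillna(0)                      # missing values replaced by 0
missing_mask = df.isnull()                      # True where a value is missing
present_mask = df.notnull()                     # True where a value is present

print(rows_not_empty, cols_not_empty, rows_two_plus, sep="\n")
```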
63. Further Learning
● Kaggle: the place to do data science projects
● Seeing Theory: a visual introduction to probability and statistics
● KDnuggets: Machine Learning, Data Science, Data Mining, Big Data, Analytics, AI, Software
64. Reading Recommendations
● Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter
Flach
● Python for Data Analysis by Wes McKinney
● https://www.kdnuggets.com/2018/12/feature-engineering-explained.html
65. References
[ML_Awad] Awad M., Khanna R. (2015) Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA
[xkcd_1838] https://xkcd.com/1838/
[fortune] http://fortune.com/2018/06/25/ai-business-breakthrough/
[ML_Flach] Flach, P. (2012). Machine Learning: The art and science of algorithms that make sense of data. Cambridge University Press.
[ML_Mitchell] Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill. ISBN: 978-0-07-042807-2
[Medium_Sharma] https://medium.com/datadriveninvestor/how-to-built-a-recommender-system-rs-616c988d64b2
[karlstratos] http://karlstratos.com/drawings/drawings.html
[Print_Lego] https://www.pinterest.com/pin/422071796300372061/
[Medium] https://medium.com/@mehulved1503/feature-selection-and-feature-extraction-in-machine-learning-an-overview-57891c595e96
[researchgate] https://www.researchgate.net/figure/Hierarchical-clustering-of-the-181-genes-corresponding-to-zinc-biology-related-functional_fig6_26688269
67. Icon References
● Icons made by: Freepik from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Pixel perfect from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Vectors Market from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Smashicons from www.flaticon.com is licensed by CC 3.0 BY
68. We organize IT
24.04.2019
Your Contact
Dr. Hamzeh Alavirad
Founder, oranIT GmbH
alavirad@oranit.de
0049-176-8080-7585
Dr. Parinaz Ameri
Co-Founder, oranIT GmbH
ameri@oranit.de
0049-176-3497-0683