ML Programming
1. Get data (and labels, for supervised learning)
2. Create a classifier
3. Train the classifier
4. Predict test data
5. Evaluate predictor accuracy
*Configure and improve by repeating steps 2-5
Get Data & Labels
• Sources
–Open data sources
–Collect on your own
• Verify data validity and correctness
• Wrangle data
–make it readable by computer
–Filter it
• Remove Outliers
The pandas Python library can assist in pre-processing
& data manipulation before ML
http://pandas.pydata.org/
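A minimal sketch of the wrangling steps above with pandas, using made-up survey data (the column names and values are illustrative, not from a real data set):

```python
import pandas as pd

# Hypothetical survey data for illustration
df = pd.DataFrame({
    "age": [3, 7, 76, 11, 250, 37],            # 250 is an obvious outlier
    "loves_nutella": ["T", "T", "F", "T", None, "F"],
})

# Verify validity: drop rows with missing labels
df = df.dropna()

# Filter: remove outliers outside a plausible age range
df = df[(df["age"] >= 0) & (df["age"] <= 120)]

print(df)
```

The same `dropna`/boolean-mask pattern scales to data loaded from files with `pd.read_csv`.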
Data Partitioning
• Data and labels
–{[data], [labels]}
–{[3,7, 76, 11, 22, 37, 56,2],[T, T, F, T, F, F, F, T]}
–Data: [Age, Do you love Nutella?]
• Partitioning will create
–{[train data], [train labels],[test data], [test labels]}
–We usually split the data at a ratio of 9:1 (train:test)
–There is a tradeoff: more test data gives a
better evaluation, but leaves less data for the
classifier to learn from
• We will look at a partitioning function later
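The partitioning step can be sketched with scikit-learn's `train_test_split` (in current versions it lives in `sklearn.model_selection`), here on the toy Nutella data above with the 9:1 ratio:

```python
from sklearn.model_selection import train_test_split

# Toy data from the slide: ages, and "Do you love Nutella?" labels
data = [[3], [7], [76], [11], [22], [37], [56], [2]]
labels = ['T', 'T', 'F', 'T', 'F', 'F', 'F', 'T']

# 9:1 split -> test_size=0.1; random_state fixes the shuffle for reproducibility
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.1, random_state=0)

print(len(train_data), len(test_data))  # 7 train samples, 1 test sample
```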
Learn (The “Smart Part”)
Classification
If the output is discrete, from a limited
set of classes (groups)
Regression
If the output is continuous
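In scikit-learn the two cases map to different estimator families. A minimal sketch contrasting them, using k-nearest-neighbours estimators on tiny made-up data:

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[1], [2], [3], [4]]

# Classification: discrete output from a limited set of classes
clf = KNeighborsClassifier(n_neighbors=1).fit(X, ['T', 'T', 'F', 'F'])
print(clf.predict([[1.2]]))   # -> ['T'], a class label

# Regression: continuous output
reg = KNeighborsRegressor(n_neighbors=1).fit(X, [1.0, 2.1, 2.9, 4.2])
print(reg.predict([[1.2]]))   # -> [1.0], a real number
```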
Create Classifier
For most SUPERVISED LEARNING
algorithms this would be
C = ClassifyAlg(Params)
It's up to us (the ML practitioners) to set
the best params
How?
1. We could develop a hunch for it
2. Perform an exhaustive search
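The "exhaustive search" option can be sketched with scikit-learn's `GridSearchCV`, which tries every combination in a parameter grid and keeps the one with the best cross-validated score (the grid values here are arbitrary examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Exhaustive search over a small parameter grid, with 5-fold cross-validation
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [1, 3, 5, 7]},
    cv=5)
search.fit(X, y)

print(search.best_params_)  # the winning parameter setting
```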
Train the classifier
We assigned
C = ClassifyAlg(Params)
This is a general algorithm with some
initializer and configurations.
In this stage we train it using:
C.fit(Data, Labels)
Predictor Evaluation
We are not done yet:
we need to evaluate the predictor's
accuracy, in comparison to other predictors
and to the system requirements
We will learn several methods for this
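One common method is to score accuracy on the held-out test set; a sketch with scikit-learn's `accuracy_score` on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
predictions = clf.predict(X_test)

# Fraction of test samples predicted correctly
print(accuracy_score(y_test, predictions))
```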
The Environment
• There are many existing environments and
tools we could use
–Matlab with Machine learning toolbox
–Apache Mahout
–Python with Scikit-learn
• Additional tools
–Hadoop / Map-Reduce to accelerate and
parallelize large data set processing
–Amazon ML tools
–NVIDIA Tools
Scikit-learn
• Installation Instructions in
http://scikit-learn.org/stable/install.html#install-official-release
• Depends on two other libraries:
numpy and scipy
• Easiest way to install on Windows:
Install WinPython
http://sourceforge.net/projects/winpython/files/WinPython_2.7/2.7.9.4/
–Let's install this together
For Linux / Mac computers, just install the
three libs separately using pip
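Whichever route you take, a quick way to verify that all three libraries installed correctly is to import them and print their versions:

```python
# Verify the installation: all three imports should succeed
import numpy
import scipy
import sklearn

print(numpy.__version__, scipy.__version__, sklearn.__version__)
```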
Data sets
There are many data sets to work on
One of them is the Iris data set: classification
into three groups. It has an interesting story
you could google later
We'll work on the Iris
data
Lab A – Plot the Iris data
Plot sepal length vs. sepal width with labels
ONLY
How? Google the Iris data and the scikit-learn
environment
Try to understand the second part of the
program with the PCA
Iris Data
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features (sepal length/width)
Y = iris.target

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

# Scatter the two features, coloured by class label
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()