9. No available PC? Use cloud computers
For example, Amazon EC2 or Microsoft Azure
Virtualized PCs on the Internet
"Cloud Computing from Scratch" (「ゼロから始めるクラウドコンピューティング」)
http://aws.amazon.com/jp/ec2/
https://azure.microsoft.com/ja-jp/
10. Today’s agenda
How can we start?
Which algorithm should you choose?
How can you find real data?
Getting the most out of SVM
12. In short
BST-DT > RF > BAG-DT > SVMs > ANN > KNN > BST-STMP > DT > LOGREG > NB
Boosted trees (BST-DT):
Tree ensembles like RF, trained with a boosting technique
Note:
Feature dimensions in this comparison were 10-100.
RF usually requires about 10 × (feature dimension) training vectors
[Caruana, ICML06]
13. Random Forests and Boosted Trees
RF
www.habe-lab.org/habe/RFtutorial/SSII2013_RFtutorial_Slides.pdf
http://www.slideshare.net/HitoshiHabe/ss-58784309
Boosted Trees
https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
14. What should you use anyway?
http://scikit-learn.org/stable/modules/kernel_approximation.html
15. Today’s agenda
How can we start?
Which algorithm should you choose?
How can you find real data?
Getting the most out of SVM
16. If you want real data to play with
https://www.kaggle.com/
17. Today’s agenda
How can we start?
Which algorithm should you choose?
How can you find real data?
Getting the most out of SVM
18. How do you usually use SVM?
Through Python/MATLAB/R/…
In many cases, you are actually using libSVM underneath
Usually by downloading a prebuilt binary
Why not download the source code instead?
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
19. Which kernel should you use?
The Gaussian (RBF) kernel works best in many cases
But training takes a lot of time
A linear kernel often performs as well as the Gaussian kernel
When the data size is large
When the feature dimension is large
In those cases, also consider liblinear
What else?
You can use your own kernel
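As a rough illustration of the two kernels compared on this slide, here is a minimal sketch in plain Python. The function names are illustrative only (not part of libSVM), and `gamma` is assumed to play the role of libSVM's g parameter.

```python
import math


def linear_kernel(x, y):
    """Linear kernel: the plain dot product <x, y>."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

The linear kernel needs no extra parameter, which is one reason it is cheaper to tune when the data or the feature dimension is large.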
21. Optimize the parameters
For the binaries or MATLAB, use grid.py/grid.m
It searches for the best C and g (gamma) for the Gaussian kernel
It is worth reading the source code
Use n-fold cross-validation
Sometimes, split the data into training, validation, and test sets
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
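The idea behind a grid search with n-fold cross-validation can be sketched as follows. This is not the code of grid.py; it is a simplified illustration in plain Python. The callback `cv_accuracy` is a hypothetical, user-supplied function that trains an SVM on the training indices and returns the accuracy on the validation indices.

```python
import itertools


def kfold_indices(n_samples, n_folds):
    """Yield (train, validation) index lists for n-fold cross-validation.

    Folds are taken with a stride, so the data is assumed to be
    shuffled beforehand.
    """
    folds = [list(range(k, n_samples, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val


def grid_search(n_samples, n_folds, cv_accuracy, log2c_values, log2g_values):
    """Return (best mean accuracy, C, gamma) over an exponential grid.

    cv_accuracy(C, gamma, train_idx, val_idx) is a hypothetical callback
    supplied by the caller.
    """
    best = None
    for log2c, log2g in itertools.product(log2c_values, log2g_values):
        C, gamma = 2.0 ** log2c, 2.0 ** log2g
        scores = [cv_accuracy(C, gamma, train, val)
                  for train, val in kfold_indices(n_samples, n_folds)]
        mean_acc = sum(scores) / len(scores)
        if best is None or mean_acc > best[0]:
            best = (mean_acc, C, gamma)
    return best
```

Searching over powers of two keeps the grid small while still covering several orders of magnitude for both parameters.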
24. Non-numerical data
libSVM can handle only numerical data
Bad: Sun: 0, Mon: 1, Tue: 2
(The numeric order and magnitude carry no meaning)
Convert to a categorical (one-hot) encoding
Sun: (1, 0, 0, 0…)
Mon: (0, 1, 0, 0…)
Tue: (0, 0, 1, 0…)
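The one-hot conversion above can be sketched in a few lines of plain Python. The helper name `one_hot` and the `DAYS` list are illustrative assumptions, not part of libSVM.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


# Assumed category order for the day-of-week example on this slide.
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
```

For example, `one_hot("Mon", DAYS)` gives a 7-dimensional vector with a single 1 in the second position, so no day is treated as "larger" than another.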
25. Missing data?
There is no golden rule
Eliminate such vectors
Use average or median value
Use the most frequently appearing value
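The options listed above can be sketched in plain Python as follows. Missing values are assumed to be represented by `None`, and the helper names are illustrative.

```python
import statistics
from collections import Counter


def drop_incomplete(rows):
    """Eliminate every vector that contains a missing value."""
    return [row for row in rows if None not in row]


def impute(column, strategy="mean"):
    """Fill missing entries of one feature column.

    strategy: "mean", "median", or "most_frequent".
    """
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in column]
```

Which strategy works best depends on the data, so it is reasonable to try more than one and compare them with cross-validation.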
30. You can obtain probabilities
You can get probability estimates instead of
+1/-1 labels or continuous decision values
Use the "-b 1" option (for both training and prediction)
They are useful for further processing
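To give a feeling for how a decision value can be turned into a probability, here is a sketch of a sigmoid mapping in the style of Platt scaling, which, to my understanding, is the kind of mapping libSVM fits for two-class problems. The parameters `A` and `B` are assumed to have been fitted already; any concrete values used with this function are made up for illustration.

```python
import math


def platt_probability(decision_value, A, B):
    """Map a decision value f to P(y = +1 | f) = 1 / (1 + exp(A * f + B))."""
    z = A * decision_value + B
    # Evaluate the sigmoid in a numerically stable way for large |z|.
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))
```

With a negative `A`, larger decision values map to probabilities closer to 1, and a decision value of 0 with `B = 0` maps to 0.5.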
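31. Look at the model file
When using a linear kernel or liblinear
The weight vector w is available (liblinear saves it directly; for libSVM, compute it from the support vectors and their coefficients)
You can also inspect the support vectors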
http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f433
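For a linear kernel, the weight vector can be recovered from the support vectors as w = Σ coef_i · x_i. The sketch below assumes the coefficients and support vectors have already been read from the model file, with each support vector given as a sparse dict using 1-based feature indices. The function names are illustrative, and the sign of w depends on the internal label order of the model.

```python
def linear_weights(sv_coefs, support_vectors, n_features):
    """Compute w = sum_i coef_i * x_i from sparse support vectors."""
    w = [0.0] * n_features
    for coef, sv in zip(sv_coefs, support_vectors):
        for index, value in sv.items():
            w[index - 1] += coef * value  # feature indices are 1-based
    return w


def decision_value(w, rho, x):
    """Linear decision value f(x) = <w, x> - rho for a sparse vector x."""
    return sum(w[index - 1] * value for index, value in x.items()) - rho
```

Looking at the largest entries of w is a simple way to see which features the classifier relies on.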
32. You can use your own kernel
README in libSVM
In some cases, other kernels are recommended
χ² (chi-squared), histogram intersection, etc.
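As a sketch of how a custom kernel might be used, the code below defines a histogram intersection kernel and one common form of the χ² kernel, then writes kernel values in the precomputed-kernel line layout described in the libSVM README (`<label> 0:<serial number> 1:K(x, x1) ... L:K(x, xL)`). The function names are illustrative; please check the README for the exact rules before relying on this.

```python
import math


def histogram_intersection(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def chi2_kernel(x, y, gamma=1.0):
    """One common chi-squared kernel for non-negative features:
    exp(-gamma * sum((a - b)^2 / (a + b)))."""
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * dist)


def precomputed_lines(labels, rows, train_rows, kernel):
    """Format kernel values as text lines for a precomputed kernel.

    Each line is "<label> 0:<serial> 1:K(x, t1) 2:K(x, t2) ...", where
    t1, t2, ... are the training vectors and serial numbers start at 1.
    """
    lines = []
    for serial, (label, x) in enumerate(zip(labels, rows), start=1):
        values = " ".join(f"{j}:{kernel(x, t)}"
                          for j, t in enumerate(train_rows, start=1))
        lines.append(f"{label} 0:{serial} {values}")
    return lines
```

For the training file, `rows` and `train_rows` are the same list; for a test file, `rows` holds the test vectors while `train_rows` stays the training set.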