2. About
• Computer scientist with a background in ML.
• London Machine Learning Meetup.
• Founder of Math.NET numerical library.
• Previously @ Microsoft Research.
• Data science team lead at Rangespan.
4. Taxonomy Classification
• Input: raw product data
• Output: classification models, classified product data
ROOT
  Electronics
    Audio
      Audio Cables
      Amps
      …
    Computers
    …
  Clothing
    Pants
    T-Shirts
    …
  Toys
    Model Rockets
    …
  …
13. Logistic Regression - Model

word        printer-ink   printer-hardware
cartridge   4.0           0.3
the         0.0           0.0
samsung     0.5           0.5
black       0.5           0.3
printer     -1.0          2.0
ink         5.0           -1.7
…           …             …
Scoring a product:
• For each class:
  o For each feature, add the weight.
• Exponentiate & normalize.
e.g. Σ = 10.0 (printer-ink) and Σ = -0.6 (printer-hardware)
→ Pr = 0.99997 and Pr = 0.00003
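A minimal sketch of this scoring step, with hypothetical weights taken from the table above (a softmax over per-class feature-weight sums):

```python
import math

# Hypothetical per-class feature weights, as in the table above.
WEIGHTS = {
    "printer-ink":      {"cartridge": 4.0, "the": 0.0, "samsung": 0.5,
                         "black": 0.5, "printer": -1.0, "ink": 5.0},
    "printer-hardware": {"cartridge": 0.3, "the": 0.0, "samsung": 0.5,
                         "black": 0.3, "printer": 2.0, "ink": -1.7},
}

def classify(words):
    # For each class: for each feature, add the weight.
    scores = {c: sum(w.get(word, 0.0) for word in words)
              for c, w in WEIGHTS.items()}
    # Exponentiate & normalize (softmax); subtract the max for stability.
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

print(classify("samsung black printer ink cartridge".split()))
```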
[Pipeline: Data Collection → Feature Extraction → Training → Testing → Labelling]
14. Logistic Regression - Inference
• Optimise using Wapiti.
• Hyperparameter optimisation using grid search (sketch below).
• Use a development set to stop training early?
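A minimal sketch of the grid search, assuming Wapiti's L1/L2 penalties (rho1, rho2) as the hyperparameters; train_and_eval is a hypothetical stand-in for one Wapiti training run plus evaluation:

```python
import itertools

def train_and_eval(rho1, rho2):
    # Hypothetical stand-in for: train with Wapiti at penalties
    # (rho1, rho2) and return the cross-validated error.
    # The dummy score below just keeps the sketch runnable.
    return (rho1 - 0.5) ** 2 + (rho2 - 1.0) ** 2

# Exhaustive grid search: evaluate every (rho1, rho2) pair, keep the best.
grid = itertools.product([0.0, 0.1, 0.5, 1.0, 5.0],
                         [0.0, 0.1, 0.5, 1.0, 5.0, 10.0])
best = min(grid, key=lambda pair: train_and_eval(*pair))
print("best (rho1, rho2):", best)
```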
[Pipeline: Data Collection → Feature Extraction → Training → Testing → Labelling]
17. Cross-Validation & Calibration
• Estimate classifier errors.
• DO NOT:
  o Test on training data.
  o Leave data aside.
• Are my probability estimates correct?
• Computation (sketch below):
  o Take the data points with p(·|x) ≈ 0.9,
  o Check that about 90% of their labels were correct.
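A minimal sketch of that calibration check: bin held-out predictions by predicted probability and compare each bin's average prediction with its observed accuracy (names hypothetical):

```python
from collections import defaultdict

def calibration_report(predictions, n_bins=10):
    """predictions: list of (predicted_probability, was_correct) pairs.
    Prints, per probability bin, the mean predicted probability
    against the empirical accuracy; for a calibrated classifier
    the two columns should roughly match."""
    bins = defaultdict(list)
    for p, correct in predictions:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, correct))
    for b in sorted(bins):
        items = bins[b]
        mean_p = sum(p for p, _ in items) / len(items)
        accuracy = sum(1 for _, c in items if c) / len(items)
        print("predicted %.2f -> observed %.2f (n=%d)"
              % (mean_p, accuracy, len(items)))

# e.g. well-calibrated predictions at p ≈ 0.9 should be right ~90% of the time:
calibration_report([(0.9, True)] * 9 + [(0.9, False)])
```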
[Pipeline: Data Collection → Feature Extraction → Training → Testing → Labelling]
Training data split into 5 folds; per-fold errors:
1.2%, 1.1%, 1.2%, 1.2%, 1.3%
= averaged cross-validation error of 1.2%
22. • High-probability datapoints:
  o Upload to production.
• Low-probability datapoints (routing sketch below):
  o Subsample.
  o Acquire more labels.
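A minimal sketch of that routing, with a hypothetical confidence threshold and subsampling rate:

```python
import random

THRESHOLD = 0.9     # hypothetical confidence cut-off
SAMPLE_RATE = 0.2   # hypothetical subsampling rate for relabelling

def route(classified):
    """classified: list of (product, probability) pairs."""
    to_production, to_labelling = [], []
    for product, p in classified:
        if p >= THRESHOLD:
            to_production.append(product)   # confident: upload to production
        elif random.random() < SAMPLE_RATE:
            to_labelling.append(product)    # uncertain: acquire more labels,
                                            # e.g. via Mechanical Turk
    return to_production, to_labelling

print(route([("tv", 0.99), ("shirt", 0.95), ("widget", 0.3)]))
```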
[Pipeline: Data Collection → Feature Extraction → Training → Testing → Labelling]
[Diagram: at ROOT, a product with p(electronics|{text}) = 0.1 is low-confidence, so it is sent for labelling, e.g. via Mechanical Turk.]
24. Implementation
Stores: MongoDB, S3 Raw, S3 Training Data, S3 Models
1. JSON export (sketch below)
2. Feature Extraction
3. Training
4. Classification
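A minimal sketch of step 1, the JSON export to S3; boto3, the bucket name, and the key are assumptions for illustration, not the original stack:

```python
import json
import boto3  # assumption: boto3 for S3 access; the original stack may differ

def export_products(docs, bucket, key):
    """docs: an iterable of product dicts (e.g. a pymongo cursor).
    Writes them to S3 as newline-delimited JSON."""
    body = "\n".join(json.dumps(doc, default=str) for doc in docs)
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=body.encode("utf-8"))

# Hypothetical usage:
# export_products(db.products.find(), "rangespan-raw", "products/export.json")
```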
25. Training
MapReduce
• Dumbo on Hadoop
• 2000 classifiers
• 5-fold CV (+ full)
• 20 hyperparameter settings on the grid
= 200,000 training runs (2000 × 5 × 20); see the sketch below
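A minimal sketch of fanning those runs out as a Dumbo job, assuming dumbo's mapper/reducer conventions; the input format, parse_example, and train_one are hypothetical:

```python
def parse_example(value):
    # Hypothetical input format: "node \t fold \t features…" per line.
    node, fold, features = value.split("\t", 2)
    return node, int(fold), features

def train_one(examples, hyper):
    # Hypothetical stand-in for a single logistic-regression training run.
    return "model(%d examples, hyper=%d)" % (len(examples), hyper)

def mapper(key, value):
    node, fold, features = parse_example(value)
    # Key by (node, fold, hyper) so each reduce call is one training run:
    # 2000 nodes x 5 folds x 20 hyperparameter settings = 200,000 runs.
    for hyper in range(20):
        yield (node, fold, hyper), features

def reducer(key, values):
    yield key, train_one(list(values), key[2])

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
```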
26. Labelling
• 128 chunks
• Full cascade on each chunk (sketch below)
[Diagram: the full classifier cascade (nodes A-E of the tree) is applied to each of Chunk 1, Chunk 2, Chunk 3, … Chunk N.]
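A minimal sketch of running the full cascade over a chunk: classify at the root, descend to the most probable child, repeat down to a leaf. The tree and the per-node classifier here are hypothetical:

```python
TREE = {"ROOT": ["Electronics", "Clothing"],
        "Electronics": ["Audio", "Computers"],
        "Clothing": [], "Audio": [], "Computers": []}

def classify_at(node, text):
    # Hypothetical per-node classifier: returns {child: probability}.
    children = TREE[node]
    return {c: 1.0 / len(children) for c in children}

def cascade(text, node="ROOT"):
    path, prob = [node], 1.0
    while TREE[node]:                      # descend until we hit a leaf
        child, p = max(classify_at(node, text).items(),
                       key=lambda kv: kv[1])
        node = child
        prob *= p
        path.append(node)
    return path, prob

def label_chunk(chunk):
    # Full cascade on every product description in the chunk.
    return [cascade(text) for text in chunk]

print(label_chunk(["samsung black printer ink cartridge"]))
```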
27. Thoughts
• Extras:
  o Partial labelling: stop descending when the probability becomes low.
  o Data ensemble learning.
• Most time was spent on feature engineering.
• Tie the parameters of the classifiers?
  o "Frustratingly Easy Domain Adaptation", Hal Daumé III (sketch below).
• Partially flatten the hierarchy for training?
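A minimal sketch of Daumé's feature augmentation applied to tying classifiers across taxonomy nodes: each feature gets one shared copy and one node-specific copy, so weights can be shared across nodes while still specialising per node (all names hypothetical):

```python
def augment(features, node):
    """features: iterable of feature names for one example.
    Returns shared + node-specific copies of each feature, in the
    style of Daume III's feature augmentation."""
    out = []
    for f in features:
        out.append("shared:" + f)   # weight tied across all nodes
        out.append(node + ":" + f)  # weight specific to this node
    return out

print(augment(["printer", "ink"], "Electronics"))
# ['shared:printer', 'Electronics:printer', 'shared:ink', 'Electronics:ink']
```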