2. About
• Computer scientist with a background in ML.
• London Machine Learning Meetup.
• Founder of Math.NET numerical library.
• Previously @ Microsoft Research.
• Data science team lead at Rangespan.
4. Taxonomy Classification
• Input: raw product data
• Output: classification models, classified product data
ROOT
  Electronics
    Audio
      Audio Cables
      Amps
      …
    Computers
    …
  Clothing
    Pants
    T-Shirts
    …
  Toys
    Model Rockets
    …
  …
13. Logistic Regression - Model

word        printer-ink   printer-hardware
cartridge   4.0           0.3
the         0.0           0.0
samsung     0.5           0.5
black       0.5           0.3
printer     -1.0          2.0
ink         5.0           -1.7
…           …             …
Scoring a product:
• For each class:
  o For each feature, add the weight.
• Exponentiate & normalize.
e.g. Σ = 10.0 (printer-ink) and Σ = -0.6 (printer-hardware)
→ Pr = 0.99997 and Pr = 0.00003
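A minimal sketch of this scoring step, with hypothetical weights taken from the table above (a softmax over per-class feature-weight sums):

```python
import math

# Hypothetical per-class feature weights, as in the table above.
WEIGHTS = {
    "printer-ink":      {"cartridge": 4.0, "the": 0.0, "samsung": 0.5,
                         "black": 0.5, "printer": -1.0, "ink": 5.0},
    "printer-hardware": {"cartridge": 0.3, "the": 0.0, "samsung": 0.5,
                         "black": 0.3, "printer": 2.0, "ink": -1.7},
}

def classify(words):
    # For each class: for each feature, add the weight.
    scores = {c: sum(w.get(word, 0.0) for word in words)
              for c, w in WEIGHTS.items()}
    # Exponentiate & normalize (softmax); subtract the max for stability.
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

print(classify("samsung black printer ink cartridge".split()))
```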
[Pipeline: Data Collection → Feature Extraction → Training → Testing → Labelling]
14. Logistic Regression - Inference
• Optimise using Wapiti.
• Hyperparameter optimisation using grid search (sketch below).
• Use a development set to stop training early?
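A minimal sketch of the grid search, assuming Wapiti's L1/L2 penalties (rho1, rho2) as the hyperparameters; train_and_eval is a hypothetical stand-in for one Wapiti training run plus evaluation:

```python
import itertools

def train_and_eval(rho1, rho2):
    # Hypothetical stand-in for: train with Wapiti at penalties
    # (rho1, rho2) and return the cross-validated error.
    # The dummy score below just keeps the sketch runnable.
    return (rho1 - 0.5) ** 2 + (rho2 - 1.0) ** 2

# Exhaustive grid search: evaluate every (rho1, rho2) pair, keep the best.
grid = itertools.product([0.0, 0.1, 0.5, 1.0, 5.0],
                         [0.0, 0.1, 0.5, 1.0, 5.0, 10.0])
best = min(grid, key=lambda pair: train_and_eval(*pair))
print("best (rho1, rho2):", best)
```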
[Pipeline: Data Collection → Feature Extraction → Training → Testing → Labelling]
17. Cross-Validation & Calibration
• Estimate classifier errors.
• DO NOT:
  o Test on training data.
  o Leave data aside.
• Are my probability estimates correct?
• Computation (sketch below):
  o Take the data points with p(·|x) ≈ 0.9,
  o Check that about 90% of their labels were correct.
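A minimal sketch of that calibration check: bin held-out predictions by predicted probability and compare each bin's average prediction with its observed accuracy (names hypothetical):

```python
from collections import defaultdict

def calibration_report(predictions, n_bins=10):
    """predictions: list of (predicted_probability, was_correct) pairs.
    Prints, per probability bin, the mean predicted probability
    against the empirical accuracy; for a calibrated classifier
    the two columns should roughly match."""
    bins = defaultdict(list)
    for p, correct in predictions:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, correct))
    for b in sorted(bins):
        items = bins[b]
        mean_p = sum(p for p, _ in items) / len(items)
        accuracy = sum(1 for _, c in items if c) / len(items)
        print("predicted %.2f -> observed %.2f (n=%d)"
              % (mean_p, accuracy, len(items)))

# e.g. well-calibrated predictions at p ≈ 0.9 should be right ~90% of the time:
calibration_report([(0.9, True)] * 9 + [(0.9, False)])
```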
[Pipeline: Data Collection → Feature Extraction → Training → Testing → Labelling]
Training data split into 5 folds; per-fold errors:
1.2%, 1.1%, 1.2%, 1.2%, 1.3%
= averaged cross-validation error of 1.2%
22. • High-probability datapoints:
  o Upload to production.
• Low-probability datapoints (routing sketch below):
  o Subsample.
  o Acquire more labels.
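A minimal sketch of that routing, with a hypothetical confidence threshold and subsampling rate:

```python
import random

THRESHOLD = 0.9     # hypothetical confidence cut-off
SAMPLE_RATE = 0.2   # hypothetical subsampling rate for relabelling

def route(classified):
    """classified: list of (product, probability) pairs."""
    to_production, to_labelling = [], []
    for product, p in classified:
        if p >= THRESHOLD:
            to_production.append(product)   # confident: upload to production
        elif random.random() < SAMPLE_RATE:
            to_labelling.append(product)    # uncertain: acquire more labels,
                                            # e.g. via Mechanical Turk
    return to_production, to_labelling

print(route([("tv", 0.99), ("shirt", 0.95), ("widget", 0.3)]))
```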
[Pipeline: Data Collection → Feature Extraction → Training → Testing → Labelling]
[Diagram: at ROOT, a product with p(electronics|{text}) = 0.1 is low-confidence, so it is sent for labelling, e.g. via Mechanical Turk.]
24. Implementation
Stores: MongoDB, S3 Raw, S3 Training Data, S3 Models
1. JSON export (sketch below)
2. Feature Extraction
3. Training
4. Classification
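A minimal sketch of step 1, the JSON export to S3; boto3, the bucket name, and the key are assumptions for illustration, not the original stack:

```python
import json
import boto3  # assumption: boto3 for S3 access; the original stack may differ

def export_products(docs, bucket, key):
    """docs: an iterable of product dicts (e.g. a pymongo cursor).
    Writes them to S3 as newline-delimited JSON."""
    body = "\n".join(json.dumps(doc, default=str) for doc in docs)
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=body.encode("utf-8"))

# Hypothetical usage:
# export_products(db.products.find(), "rangespan-raw", "products/export.json")
```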
25. Training
MapReduce
• Dumbo on Hadoop
• 2000 classifiers
• 5-fold CV (+ full)
• 20 hyperparameter settings on the grid
= 200,000 training runs (2000 × 5 × 20); see the sketch below
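A minimal sketch of fanning those runs out as a Dumbo job, assuming dumbo's mapper/reducer conventions; the input format, parse_example, and train_one are hypothetical:

```python
def parse_example(value):
    # Hypothetical input format: "node \t fold \t features…" per line.
    node, fold, features = value.split("\t", 2)
    return node, int(fold), features

def train_one(examples, hyper):
    # Hypothetical stand-in for a single logistic-regression training run.
    return "model(%d examples, hyper=%d)" % (len(examples), hyper)

def mapper(key, value):
    node, fold, features = parse_example(value)
    # Key by (node, fold, hyper) so each reduce call is one training run:
    # 2000 nodes x 5 folds x 20 hyperparameter settings = 200,000 runs.
    for hyper in range(20):
        yield (node, fold, hyper), features

def reducer(key, values):
    yield key, train_one(list(values), key[2])

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
```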
26. Labelling
• 128 chunks
• Full cascade on each chunk (sketch below)
[Diagram: the full classifier cascade (nodes A-E of the tree) is applied to each of Chunk 1, Chunk 2, Chunk 3, … Chunk N.]
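A minimal sketch of running the full cascade over a chunk: classify at the root, descend to the most probable child, repeat down to a leaf. The tree and the per-node classifier here are hypothetical:

```python
TREE = {"ROOT": ["Electronics", "Clothing"],
        "Electronics": ["Audio", "Computers"],
        "Clothing": [], "Audio": [], "Computers": []}

def classify_at(node, text):
    # Hypothetical per-node classifier: returns {child: probability}.
    children = TREE[node]
    return {c: 1.0 / len(children) for c in children}

def cascade(text, node="ROOT"):
    path, prob = [node], 1.0
    while TREE[node]:                      # descend until we hit a leaf
        child, p = max(classify_at(node, text).items(),
                       key=lambda kv: kv[1])
        node = child
        prob *= p
        path.append(node)
    return path, prob

def label_chunk(chunk):
    # Full cascade on every product description in the chunk.
    return [cascade(text) for text in chunk]

print(label_chunk(["samsung black printer ink cartridge"]))
```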
27. Thoughts
• Extras:
  o Partial labelling: stop descending when the probability becomes low.
  o Data ensemble learning.
• Most time was spent on feature engineering.
• Tie the parameters of the classifiers?
  o "Frustratingly Easy Domain Adaptation", Hal Daumé III (sketch below).
• Partially flatten the hierarchy for training?
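A minimal sketch of Daumé's feature augmentation applied to tying classifiers across taxonomy nodes: each feature gets one shared copy and one node-specific copy, so weights can be shared across nodes while still specialising per node (all names hypothetical):

```python
def augment(features, node):
    """features: iterable of feature names for one example.
    Returns shared + node-specific copies of each feature, in the
    style of Daume III's feature augmentation."""
    out = []
    for f in features:
        out.append("shared:" + f)   # weight tied across all nodes
        out.append(node + ":" + f)  # weight specific to this node
    return out

print(augment(["printer", "ink"], "Electronics"))
# ['shared:printer', 'Electronics:printer', 'shared:ink', 'Electronics:ink']
```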