Hierarchical Classification
Jurgen Van Gael
About	
•  Computer Scientist w/ background in ML.
•  London Machine Learning Meetup.
•  Founder of Math.NET numerical library.
•  Previously @ Microsoft Research.
•  Data science team lead at Rangespan.
Taxonomy Classification
•  Input: raw product data
•  Output: classification models, classified product data

ROOT
  Electronics
    Audio
      Audio Cables
      Amps
      …
    Computers
      …
  Clothing
    Pants
    T-Shirts
    …
  Toys
    Model Rockets
    …
  …
Data Collection -> Feature Extraction -> Training -> Testing -> Labelling
Feature Extraction
Name: INK-M50 Black Ink Cartridge (600 pages)
Manufacturer: Samsung
Description: null
Label: toner-inkjet-cartridges

"category": "toner-inkjet-cartridges",
"features": ["cartridge", "samsung", "black", "ink", "ink-m50", "pages"]
Feature Extraction:
•  Text cleaning (stopword removal, lexicalisation)
•  Unigram + Bigram Features
•  LDA Topic Features
http://radimrehurek.com/gensim
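The cleaning and n-gram steps above might look roughly like the following sketch. The stopword list, the tokenisation regex and the `extract_features` helper are assumptions made for illustration; the talk does not show its own code, and the LDA topic features are left out here.

```python
import re

# Hypothetical stopword list; the talk does not say which one was used.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "with"}

# Assumed tokenisation: runs of letters/digits, optionally joined by hyphens.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text, split it into tokens and drop stopwords."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]


def extract_features(product):
    """Return the sorted set of unigram and bigram features of a product."""
    features = set()
    for field in ("name", "manufacturer", "description"):
        value = product.get(field)
        if not value:  # skip missing or null fields
            continue
        tokens = tokenize(value)
        features.update(tokens)  # unigrams
        # bigrams are built within a field, never across fields
        features.update(a + "_" + b for a, b in zip(tokens, tokens[1:]))
    return sorted(features)


product = {
    "name": "INK-M50 Black Ink Cartridge (600 pages)",
    "manufacturer": "Samsung",
    "description": None,
}
features = extract_features(product)
```

Unlike the feature list on the slide, this sketch keeps the token 600; dropping pure numbers would be one more cleaning rule.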
Training, Testing & Labelling

Hierarchical Classification
[Diagram: taxonomy with nodes A, B, C, D, E]
4 (5) way multiclass classification

Hierarchical Classification
[Diagram: the same taxonomy with nodes A, B, C, D, E]
2 + 3 way multiclass classification
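To make the two counts concrete, here is a small sketch under an assumed tree shape: the root has two children, A and B, and B has three children, C, D and E. The slides do not spell out the shape, so this layout is only a guess that happens to reproduce both captions.

```python
# Assumed taxonomy (parent -> children); the exact shape is not given in the slides.
TAXONOMY = {
    "ROOT": ["A", "B"],
    "B": ["C", "D", "E"],
}


def leaves(tree, node="ROOT"):
    """Return the leaf categories below a node."""
    children = tree.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves(tree, child)]


def classifier_sizes(tree):
    """One multiclass classifier per internal node; size = number of children."""
    return {node: len(children) for node, children in tree.items()}


# Flat approach: a single classifier over all leaves (4 way) ...
flat_classes = leaves(TAXONOMY)
# ... or 5 way if the internal node B is also allowed as a label.
flat_with_internal = sorted(set(flat_classes) | {"B"})
# Hierarchical approach: a 2 way classifier at ROOT and a 3 way classifier at B.
per_node = classifier_sizes(TAXONOMY)
```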
Naïve Bayes, Neural Network, Logistic Regression, Support Vector Machines, … ?
Logistic Regression - Model

word         printer-ink   printer-hardware
cartridge     4.0           0.3
the           0.0           0.0
samsung       0.5           0.5
black         0.5           0.3
printer      -1.0           2.0
ink           5.0          -1.7
…             …             …

For each class
  For each feature
    Add the weight
Exponentiate & Normalize

Example with the features cartridge, samsung, black, ink:
Σ =          10.0          -0.6
Pr =          0.99997       0.00003
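The scoring recipe above can be sketched in a few lines. The weights are the ones from the table; treating features that are missing from the table (here ink-m50 and pages) as weight 0 is an assumption, as is the function name.

```python
import math

# Weights copied from the table on the slide.
WEIGHTS = {
    "printer-ink": {
        "cartridge": 4.0, "the": 0.0, "samsung": 0.5,
        "black": 0.5, "printer": -1.0, "ink": 5.0,
    },
    "printer-hardware": {
        "cartridge": 0.3, "the": 0.0, "samsung": 0.5,
        "black": 0.3, "printer": 2.0, "ink": -1.7,
    },
}


def class_probabilities(features, weights):
    """Sum the weights per class, then exponentiate and normalise (softmax)."""
    # Assumption: a feature without a weight contributes 0 to the score.
    scores = {
        label: sum(w.get(f, 0.0) for f in features)
        for label, w in weights.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return scores, {label: e / total for label, e in exps.items()}


features = ["cartridge", "samsung", "black", "ink", "ink-m50", "pages"]
scores, probs = class_probabilities(features, WEIGHTS)
```

For the ink cartridge example this gives scores of 10.0 and -0.6, which normalise to roughly 0.999975 and 0.000025.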
Logistic Regression - Inference
•  Optimise using Wapiti.
•  Hyperparameter optimisation using grid search.
•  Using development set to stop training?
http://wapiti.limsi.fr/
[Diagram: ROOT node with children Electronics and Clothing]
Cross Validation
•  Estimate classifier errors.
•  DO NOT
o  Test on training data.
o  Leave data aside.

Calibration
•  Are my probability estimates correct?
•  Computation:
o  Take x data points with p(.|x) = 0.9,
o  Check that about 90% of labels were correct.
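The calibration check can be sketched as a binned comparison between predicted probability and observed accuracy. The function name, the number of bins and the toy data are assumptions for illustration only.

```python
def calibration_table(predictions, n_bins=10):
    """Group predictions by predicted probability and compare with accuracy.

    predictions: iterable of (predicted_probability, was_correct) pairs.
    Returns a list of (mean_probability, accuracy, count), one per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for prob, correct in predictions:
        index = min(int(prob * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[index].append((prob, bool(correct)))
    table = []
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        table.append((mean_prob, accuracy, len(members)))
    return table


# Toy data: ten predictions made with probability 0.9, nine of them correct.
toy = [(0.9, True)] * 9 + [(0.9, False)]
table = calibration_table(toy)
```

A well-calibrated classifier has accuracy close to the mean probability in every bin, as in the toy data where both are 0.9.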
[Diagram: Training Data split into 5 folds]
Fold errors: 1.2%, 1.1%, 1.2%, 1.2%, 1.3%
Average error = 1.2%
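A minimal sketch of the k-fold procedure behind these numbers follows. The majority-label "model" and the helper names are stand-ins chosen so the example runs on its own; the talk trains logistic regression models instead.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_items))
        yield train, test
        start = stop


def cross_validated_error(data, k, train_fn, error_fn):
    """Train on k-1 folds, test on the held-out fold, and average the errors."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx])
        errors.append(error_fn(model, [data[i] for i in test_idx]))
    return sum(errors) / len(errors), errors


# Toy stand-ins (assumptions): the "model" is just the majority label.
def train_majority(labels):
    return max(set(labels), key=labels.count)


def error_rate(model, labels):
    return sum(label != model for label in labels) / len(labels)


data = ["a"] * 8 + ["b"] * 2  # in practice, shuffle before splitting
mean_error, fold_errors = cross_validated_error(data, 5, train_majority, error_rate)
```

Every data point is used for testing exactly once, so no data has to be set aside permanently.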
Using Bayes rule to chain the classifiers along the tree (ROOT, then Electronics or Clothing).
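The formula on the slide did not survive as text. One common reading, assumed here, is the product rule: the probability of a node is the product of the per-classifier probabilities on the path from ROOT, for example p(Audio | x) = p(Audio | Electronics, x) · p(Electronics | x). The node names below the first level and all numbers are made up for illustration.

```python
def path_probability(path, local_probs):
    """Multiply the per-node classifier outputs along a root-to-node path.

    path: list of node names, e.g. ["ROOT", "Electronics", "Audio"].
    local_probs: dict mapping parent -> {child: p(child | parent, x)}.
    """
    probability = 1.0
    for parent, child in zip(path, path[1:]):
        probability *= local_probs[parent][child]
    return probability


# Hypothetical classifier outputs for one product.
local_probs = {
    "ROOT": {"Electronics": 0.9, "Clothing": 0.1},
    "Electronics": {"Audio": 0.8, "Computers": 0.2},
}
p_audio = path_probability(["ROOT", "Electronics", "Audio"], local_probs)
```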
Active Learning
[Diagram: ROOT node with children Electronics and Clothing]
p(electronics|{text}) = 0.1
•  High probability datapoints
o  Upload to production
•  Low probability datapoints
o  Subsample
o  Acquire more labels
e.g. Mechanical Turk
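The routing rule on this slide can be sketched as follows. The threshold of 0.9, the sample size and the tuple layout are assumptions; the talk does not state the values it used.

```python
import random


def route(predictions, threshold=0.9, sample_size=100, seed=0):
    """Split predictions into confident ones and a sample to send for labelling.

    predictions: list of (item_id, predicted_label, probability) tuples.
    Returns (confident, to_label).
    """
    confident = [p for p in predictions if p[2] >= threshold]
    uncertain = [p for p in predictions if p[2] < threshold]
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    to_label = rng.sample(uncertain, min(sample_size, len(uncertain)))
    return confident, to_label


predictions = [
    ("a", "electronics", 0.99),
    ("b", "electronics", 0.10),
    ("c", "clothing", 0.40),
    ("d", "clothing", 0.95),
]
confident, to_label = route(predictions, threshold=0.9, sample_size=1)
```

Items in `confident` would go to production, while `to_label` is the subsample sent to human annotators.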
Implementation
Stores: MongoDB, S3 Raw, S3 Training Data, S3 Models
Steps:
1. JSON export
2. Feature Extraction
3. Training
4. Classification
Training MapReduce
•  Dumbo on Hadoop
•  2000 classifiers
•  5 fold CV (+ full)
•  20 hyperparameter settings on grid
= 200,000 training runs (2000 x 5 x 20, before the full-data fits)
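The count above is simply the size of the cross product of classifiers, folds and grid points. The sketch below enumerates such jobs as records that could serve as MapReduce input; the record layout and the grid values are hypothetical, and no Dumbo or Hadoop API is used.

```python
from itertools import product


def training_jobs(classifier_ids, n_folds, grid):
    """Yield one job description per (classifier, fold, hyperparameter) triple."""
    for classifier_id, fold, hyper in product(classifier_ids, range(n_folds), grid):
        yield {"classifier": classifier_id, "fold": fold, "hyper": hyper}


classifier_ids = range(2000)               # one classifier per taxonomy node
grid = [0.001 * 2 ** i for i in range(20)]  # 20 made-up regularisation strengths
n_jobs = sum(1 for _ in training_jobs(classifier_ids, 5, grid))
```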
Labelling
•  128 chunks
•  Full Cascade each chunk
[Diagram: Chunk 1, Chunk 2, Chunk 3, …, Chunk N, each passed through the full tree of classifiers (nodes A to E)]
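Splitting the data into chunks and running the whole cascade on each chunk might look like this sketch. The round-robin split and the `toy_cascade` function are assumptions; the real cascade would be the tree of trained classifiers.

```python
def split_into_chunks(items, n_chunks=128):
    """Deal the items round-robin into n_chunks lists of near-equal size."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def label_chunk(chunk, cascade):
    """Run the full classifier cascade on every item of one chunk."""
    return [(item, cascade(item)) for item in chunk]


# Hypothetical stand-in for the real cascade of per-node classifiers.
def toy_cascade(item):
    return "Electronics" if "ink" in item else "Clothing"


items = ["black ink cartridge", "t-shirt", "ink refill"] * 100
chunks = split_into_chunks(items)
labelled = [pair for chunk in chunks for pair in label_chunk(chunk, toy_cascade)]
```

Because each chunk is labelled independently of the others, the chunks can be processed in parallel.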
Thoughts
•  Extras:
o  Partial labelling: stop when probability becomes low.
o  Data ensemble learning.
•  Most time spent feature engineering.
•  Tie the parameters of the classifiers?
o  Frustratingly easy domain adaptation, Hal Daumé III
•  Partially flattening the hierarchy for training?
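The partial labelling idea from the first bullet could be sketched as a greedy descent that stops as soon as the chained probability falls below a threshold. The threshold of 0.5, the greedy choice of the best child and the example numbers are assumptions.

```python
def partial_label(local_probs, root="ROOT", min_prob=0.5):
    """Descend the tree while the chained probability stays above min_prob.

    local_probs: dict mapping node -> {child: p(child | node, x)} for one item.
    Returns (path, probability of the last node on the path).
    """
    path, probability, node = [root], 1.0, root
    while node in local_probs:
        child, p_child = max(local_probs[node].items(), key=lambda kv: kv[1])
        if probability * p_child < min_prob:
            break  # too uncertain: keep the coarser label
        probability *= p_child
        node = child
        path.append(node)
    return path, probability


# Hypothetical outputs: confident at the top level, undecided one level down.
uncertain_item = {
    "ROOT": {"Electronics": 0.9, "Clothing": 0.1},
    "Electronics": {"Audio": 0.5, "Computers": 0.5},
}
path, probability = partial_label(uncertain_item)
```

Here the item is labelled only as Electronics, since going one level deeper would drop the probability to 0.45.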
