Machine Learning to
moderate Classifieds
Vaibhav Singh, Machine Learning Scientist
Content Moderation & Quality, OLX
Agenda
➔ Scale and Problem
➔ Feature generation
➔ Model Generation Pipeline
➔ Model Performance
➔ Architecture
➔ Model Validation and Management
Scale of business at OLX
4.4 app rating
#1 app in 22+ countries (1)
(1) Google Play Store; shopping/lifestyle categories
Note: excludes Letgo. Associates at proportionate share
→ People spend more than twice as long in
OLX apps as in competitors' apps
became one of the top 3 classifieds apps in the US
less than a year after its launch
130 Countries
+60 million monthly listings
+18 million monthly sellers
+52 million cars are listed every year on our platforms;
77% of the total number of cars manufactured!
+160,000 properties are listed daily
Listed on OLX every second:
•  2 houses
•  2 cars
•  3 fashion items
•  2.5 mobile phones
Problem with User Posted Ads
●  Change the title and description of an ad in a paid category to
avoid buying another ad post
●  Duplicate ads to get a higher ranking and a better chance of
selling
●  Put phone numbers and company information in the image rather
than in the description
●  Create multiple accounts to bypass the free-ad-per-user limit
●  Try to sell forbidden items with a title and description that may
evade keyword filters
Feature Engineering
“Feature engineering is the process of transforming raw
data into features that better represent the underlying
problem to the predictive models, resulting in improved
model accuracy on unseen data”
Data Leakage
➔  Remove obvious fields,
e.g. IDs, account numbers
➔  Remove variance and
standardize
➔  Cross Validation
➔  Add Noise
Feature hashing
➔  Good when dealing with
high-dimensional, sparse features;
also acts as dimensionality reduction
➔  Memory efficient
➔  Cons - Mapping hashed indices
back to feature names is difficult
➔  Cons - Hash collisions can
have negative effects
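The hashing trick described above can be sketched in a few lines of standard-library Python. This is only an illustration, not the pipeline used at OLX: the function name `hash_features`, the choice of MD5 and the bucket count are all assumptions made for the example.

```python
import hashlib

def hash_features(tokens, n_buckets=16):
    """Map string tokens to a sparse {bucket_index: count} vector.

    Hypothetical helper for illustration; the hash function and bucket
    count are arbitrary choices, not taken from the talk.
    """
    vec = {}
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as an integer hash value.
        idx = int.from_bytes(digest[:8], "big") % n_buckets
        # Tokens that collide share a bucket, so their counts add up.
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

ad_tokens = ["iphone", "for", "sale", "iphone"]
print(hash_features(ad_tokens, n_buckets=8))
```

The vector size is fixed by `n_buckets` regardless of vocabulary size, which is where the memory saving comes from. The trade-offs listed above are visible too: the bucket index alone does not reveal which token produced it, and a smaller `n_buckets` makes collisions more likely.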
SVM Light Data Format
➔  Memory efficient.
Features can be created
on one machine; no
huge cluster is
required
➔  Cons - The number of
features is not stored in the file
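For readers unfamiliar with the format, each line is a label followed by sparse `index:value` pairs in ascending index order. The sketch below writes and reads such lines with plain Python; the helper names are hypothetical and integer labels are assumed.

```python
def to_svmlight_line(label, features):
    """Format one example as '<label> <index>:<value> ...'.

    `features` is a sparse {index: value} dict; zero values are skipped
    and indices are written in ascending order.
    """
    parts = [str(label)]
    for idx in sorted(features):
        value = features[idx]
        if value != 0:
            parts.append(f"{idx}:{value:g}")
    return " ".join(parts)

def parse_svmlight_line(line):
    """Parse a line back into (label, {index: value}); assumes integer labels."""
    label, *pairs = line.split()
    feats = {}
    for pair in pairs:
        idx, value = pair.split(":")
        feats[int(idx)] = float(value)
    return int(label), feats

line = to_svmlight_line(1, {3: 1.0, 1: 0.5})
print(line)
print(parse_svmlight_line(line))
```

Because only non-zero entries are written, nothing in a line says how wide the full feature vector is. A reader either has to be told the dimensionality or infer it from the largest index seen.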
Lessons Learnt
➔  Choose your tech based on
data size. Avoid hype-driven
development
➔  Spend time on Feature
Generation and selection
➔  Increase relevance and minimize
redundancy
➔  Use the same Feature
Generation pipeline for both
training and prediction
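One way to follow the last point is to keep a single feature function and call it from both the training and the serving code. The sketch below assumes ads are dicts with `title` and `description` keys; the two features and all function names are made up for illustration.

```python
import re

def extract_features(ad):
    """Single feature function shared by the training and prediction paths."""
    text = f"{ad.get('title', '')} {ad.get('description', '')}".lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return {
        "n_tokens": len(tokens),
        # Illustrative feature: a run of 7 or more digits looks like a phone number.
        "has_phone_like_number": int(bool(re.search(r"\d{7,}", text))),
    }

def build_training_rows(labelled_ads):
    """Training path: (ad, label) pairs become (features, label) rows."""
    return [(extract_features(ad), label) for ad, label in labelled_ads]

def features_for_prediction(ad):
    """Serving path: reuses exactly the same function as training."""
    return extract_features(ad)

ad = {"title": "iPhone 6", "description": "Call 5551234567"}
print(features_for_prediction(ad))
```

With one shared function, a change to tokenization or a new feature reaches both paths at once, so the model never sees features at prediction time that were computed differently from those it was trained on.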
Model Generation Pipeline
Lessons Learnt
➔  Automate and make
things deterministic
➔  Airflow, Luigi and many
others are good choices
for job dependency
management
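The core idea behind job dependency management is running each step only after the steps it depends on. The sketch below illustrates that with Python's standard `graphlib` module (Python 3.9+). It does not use the Airflow or Luigi APIs, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it starts.
# Job names are hypothetical placeholders for pipeline steps.
dependencies = {
    "generate_features": {"extract_ads"},
    "train_model": {"generate_features"},
    "evaluate_model": {"train_model"},
}

def run_order(deps):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

for job in run_order(dependencies):
    print("running", job)
```

A workflow tool adds scheduling, retries and tracking of completed outputs on top of this ordering, which is what makes reruns repeatable.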
Measuring Classifier Performance
➔  Accuracy is not always the best metric
➔  Precision-recall (PR) is good for measuring classifier performance
➔  ROC can be used for general classifier performance
➔  Choose one evaluation metric
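A small worked example shows why accuracy can mislead when bad ads are rare. The data is invented for illustration: 5 bad ads out of 100, and a classifier that never flags anything.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means 'bad ad'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Invented, imbalanced sample: 95 good ads, 5 bad ads.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a classifier that approves everything

print(accuracy(y_true, y_pred), precision(y_true, y_pred), recall(y_true, y_pred))
```

The do-nothing classifier scores 95% accuracy while catching none of the bad ads, which precision and recall expose immediately.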
Architecture
[Architecture diagram. Components: Flask API; Queue; Prediction Module; Mongo; Monitoring & Stats (Graphite, Grafana); Learning Module; Scikit; XGBoost; Luigi. Arrow labels: Ask Prediction, Return Prediction, Learning Ads.]
Lessons Learnt
➔  Always batch
Batching reduces CPU utilization, so the
same machines can handle many more
requests
➔  Modularize, Dockerize and
Orchestrate
Containerize your code so that it does not
depend on machine configuration
➔  Monitoring
Use a monitoring service
➔  Choose simple and easy tech
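The batching advice can be illustrated with a minimal sketch: group incoming ads and make one prediction call per group rather than one per ad. `predict_batch` is a hypothetical stand-in for a real model call and the batch size is arbitrary.

```python
def make_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_batch(ads):
    """Hypothetical stand-in for a model that scores a whole batch in one call."""
    return [len(ad) % 2 for ad in ads]

def predict_all(ads, batch_size=32):
    """Score every ad, calling the model once per batch."""
    predictions = []
    for batch in make_batches(ads, batch_size):
        predictions.extend(predict_batch(batch))
    return predictions

print(list(make_batches([1, 2, 3, 4, 5], 2)))
print(predict_all(["ab", "abc", "abcd"], batch_size=2))
```

Per-call overhead, such as deserializing the request or building the feature matrix, is paid once per batch, which is typically where the saving comes from.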
Validating Models
➔  Sample predictions and
manually verify
➔  Measure error rate
➔  Modify thresholds to
achieve desired error rate
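The three validation steps above can be sketched as follows. The sample of manually verified predictions is invented, and "error rate" is assumed here to mean the fraction of flagged ads that reviewers found to be fine; other definitions are possible.

```python
def error_rate_at(threshold, scored_sample):
    """Fraction of ads flagged at this threshold that were actually fine.

    `scored_sample` holds (model_score, verified_label) pairs, where the
    label is 1 if a human reviewer confirmed the ad was bad.
    """
    flagged = [label for score, label in scored_sample if score >= threshold]
    if not flagged:
        return 0.0
    wrong = sum(1 for label in flagged if label == 0)
    return wrong / len(flagged)

def pick_threshold(scored_sample, max_error_rate):
    """Lowest observed score that keeps the error rate within the target."""
    for candidate in sorted({score for score, _ in scored_sample}):
        if error_rate_at(candidate, scored_sample) <= max_error_rate:
            return candidate
    return None

# Invented, manually verified sample.
sample = [(0.95, 1), (0.9, 1), (0.8, 0), (0.7, 1), (0.4, 0), (0.2, 0)]
print(pick_threshold(sample, max_error_rate=0.25))
print(pick_threshold(sample, max_error_rate=0.1))
```

Choosing the lowest acceptable threshold flags as many ads as possible while staying inside the error budget. A tighter budget pushes the threshold up and reduces how many ads are flagged.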
Model Management
PyParis2017 / Machine learning to moderate classifieds, by Vaibhav Singh
