The document discusses machine learning models used to moderate classified ads on OLX's platform. It covers the scale of OLX's business with over 60 million monthly listings, feature engineering to better represent the underlying moderation problem, building a model generation pipeline using tools like scikit-learn and XGBoost, measuring model performance, the system architecture, validating models on sample predictions, and managing models over time.
3. Scale of business at OLX
→ 4.4 app rating; #1 app in 22+ countries (1)
(1) Google Play Store, shopping/lifestyle categories. Note: excludes Letgo; associates at proportionate share.
→ People spend more than twice as long in OLX apps versus competitors
→ Became one of the top 3 classifieds apps in the US less than a year after its launch
→ 130 countries
→ 60+ million monthly listings
→ 18+ million monthly sellers
→ 52+ million cars are listed every year on our platforms; 77% of all cars manufactured!
→ 160,000+ properties are listed daily
At OLX, every second the following are listed:
• 2 houses
• 2 cars
• 3 fashion items
• 2.5 mobile phones
4. Problems with User Posted Ads
● Change the title and description of an ad in a paid category so they don't need to buy another ad post
● Duplicate ads to get a higher ranking and a higher chance of selling
● Add phone numbers and company information on the image rather than in the description
● Create multiple accounts to bypass the free-ad-per-user limit
● Try to sell forbidden items with a title and description crafted to evade keyword filters
5. Feature Engineering
“Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.”
7. Feature Hashing
➔ Good when dealing with high-dimensional, sparse features -- provides dimensionality reduction
➔ Memory efficient
➔ Con: getting back to feature names is difficult
➔ Con: hash collisions can have negative effects
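The hashing trick the slide describes can be sketched in plain Python (scikit-learn's `FeatureHasher`/`HashingVectorizer` do the same at scale); the bucket count and signing scheme below are illustrative assumptions, not OLX's settings.

```python
import hashlib

def hash_features(tokens, n_buckets=2 ** 10):
    """Map sparse string features into a fixed-size vector via the hashing trick."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        # Use a stable digest: Python's built-in hash() is salted per process.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % n_buckets
        # Signed hashing lets colliding features partially cancel instead of
        # always inflating the same bucket.
        sign = 1.0 if digest[-1] % 2 == 0 else -1.0
        vec[idx] += sign
    return vec

v = hash_features(["iphone", "unlocked", "iphone"])
```

Note the two cons from the slide are visible here: the mapping token → index is one-way (no feature names), and two distinct tokens can land in the same bucket.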
8. SVMlight Data Format
➔ Memory efficient: features can be created on one machine and do not require huge clusters
➔ Con: the number of features is unknown
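The format itself is just one sparse example per line, `<label> <index>:<value> ...` with increasing indices, which is why it stays small in memory; a minimal writer might look like this (scikit-learn's `dump_svmlight_file` handles whole matrices):

```python
def to_svmlight(label, features):
    """Serialize one example as an SVMlight line: '<label> <idx>:<value> ...'.
    Feature indices must appear in strictly increasing order."""
    parts = [str(label)]
    for idx in sorted(features):
        parts.append(f"{idx}:{features[idx]:g}")
    return " ".join(parts)

line = to_svmlight(1, {3: 0.5, 1: 2.0})
# → "1 1:2 3:0.5"
```

Because only non-zero entries are written, nothing in the file says how many features exist in total, which is exactly the con the slide mentions.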
9. Lessons Learnt
➔ Choose your tech depending on data size; do not go for hype-driven development
➔ Spend time on feature generation and selection
➔ Increase relevance and minimize redundancy
➔ Use the same feature generation pipeline for both training and prediction
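The last lesson can be made concrete with a single shared featurizer called from both the training job and the serving path; the ad fields and features below are illustrative assumptions, not OLX's actual schema:

```python
def featurize(ad):
    """Single feature-generation step shared by training and prediction,
    so both paths see identical features (no train/serve skew)."""
    title = ad.get("title", "").lower()
    return {
        "title_len": len(title),
        # Crude proxy for a phone number embedded in the title.
        "has_phone_digits": sum(c.isdigit() for c in title) >= 7,
        "n_words": len(title.split()),
    }

# Training and serving both call the exact same function.
train_rows = [featurize(a) for a in [{"title": "iPhone 12 call 5551234567"}]]
serve_row = featurize({"title": "iPhone 12 call 5551234567"})
```

Duplicating this logic in two codebases is a classic source of silent skew; packaging it once and importing it in both places avoids the problem.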
12. Lessons Learnt
➔ Automate and make things deterministic
➔ Airflow, Luigi and many others are good choices for job dependency management
13. Measuring Classifier Performance
➔ Accuracy is not always the best metric, especially on imbalanced data such as moderation
➔ Precision-recall (PR) curves are good for measuring classifier performance
➔ ROC can be used for general classifier performance
➔ Choose one evaluation metric and stick to it
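At a single decision threshold, precision and recall reduce to three confusion counts; a minimal sketch (scikit-learn's `precision_recall_curve` sweeps all thresholds for the full PR curve):

```python
def precision_recall(y_true, y_score, threshold=0.5):
    """Precision and recall at a fixed decision threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, y_score):
        pred = s >= threshold
        if pred and y:
            tp += 1          # flagged and truly bad
        elif pred and not y:
            fp += 1          # flagged but fine (moderator wasted effort)
        elif not pred and y:
            fn += 1          # bad ad that slipped through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

p, r = precision_recall([1, 0, 1, 1, 0], [0.9, 0.8, 0.4, 0.7, 0.1])
```

Accuracy would reward a model that never flags anything when bad ads are rare; precision and recall expose that failure directly.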
15. Lessons Learnt
➔ Always batch
Batching reduces CPU utilization, so the same machines can handle many more requests
➔ Modularize, Dockerize and orchestrate
Containerize your code so that it is independent of machine configurations
➔ Monitoring
Use a monitoring service
➔ Choose simple and easy tech
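The "always batch" lesson boils down to scoring many ads per model call instead of one call per ad; a minimal sketch, where `score_batch` is a stand-in for a vectorized model call:

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield fixed-size batches so the model scores many ads per call."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def score_batch(ads):
    # Placeholder for a vectorized model call, e.g. model.predict_proba(X);
    # the per-call overhead is paid once per batch, not once per ad.
    return [0.0 for _ in ads]

scores = [s for batch in batched(range(10), 4) for s in score_batch(batch)]
```

With XGBoost and scikit-learn models, a single `predict` on a matrix of N rows is far cheaper than N single-row predictions, which is where the CPU savings come from.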