The document discusses machine learning models used to moderate classified ads on OLX's platform. It covers the scale of OLX's business with over 60 million monthly listings, feature engineering to better represent the underlying moderation problem, building a model generation pipeline using tools like scikit-learn and XGBoost, measuring model performance, the system architecture, validating models on sample predictions, and managing models over time.
3. Scale of business at OLX
→ 4.4 app rating; #1 app in 22+ countries (1)
(1) Google Play Store, shopping/lifestyle categories. Note: excludes Letgo; associates at proportionate share.
→ People spend more than twice as long in OLX apps versus competitors
→ Became one of the top 3 classifieds apps in the US less than a year after its launch
→ 130 countries
→ 60+ million monthly listings
→ 18+ million monthly sellers
→ 52+ million cars are listed every year on our platforms; 77% of all cars manufactured!
→ 160,000+ properties are listed daily
At OLX, every second the following are listed:
• 2 houses
• 2 cars
• 3 fashion items
• 2.5 mobile phones
4. Problems with User Posted Ads
● Change the title and description of an ad in a paid category so they don't need to buy another ad post
● Duplicate ads to get a higher ranking and a higher chance of selling
● Add phone numbers and company information on the image rather than in the description
● Create multiple accounts to bypass the free-ad-per-user limit
● Try to sell forbidden items with a title and description crafted to evade keyword filters
5. Feature Engineering
“Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.”
7. Feature Hashing
➔ Good when dealing with high-dimensional, sparse features -- provides dimensionality reduction
➔ Memory efficient
➔ Con: getting back to feature names is difficult
➔ Con: hash collisions can have negative effects
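The hashing trick the slide describes can be sketched in plain Python (scikit-learn's `FeatureHasher`/`HashingVectorizer` do the same at scale); the bucket count and signing scheme below are illustrative assumptions, not OLX's settings.

```python
import hashlib

def hash_features(tokens, n_buckets=2 ** 10):
    """Map sparse string features into a fixed-size vector via the hashing trick."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        # Use a stable digest: Python's built-in hash() is salted per process.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % n_buckets
        # Signed hashing lets colliding features partially cancel instead of
        # always inflating the same bucket.
        sign = 1.0 if digest[-1] % 2 == 0 else -1.0
        vec[idx] += sign
    return vec

v = hash_features(["iphone", "unlocked", "iphone"])
```

Note the two cons from the slide are visible here: the mapping token → index is one-way (no feature names), and two distinct tokens can land in the same bucket.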
8. SVMlight Data Format
➔ Memory efficient: features can be created on one machine and do not require huge clusters
➔ Con: the number of features is unknown
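The format itself is just one sparse example per line, `<label> <index>:<value> ...` with increasing indices, which is why it stays small in memory; a minimal writer might look like this (scikit-learn's `dump_svmlight_file` handles whole matrices):

```python
def to_svmlight(label, features):
    """Serialize one example as an SVMlight line: '<label> <idx>:<value> ...'.
    Feature indices must appear in strictly increasing order."""
    parts = [str(label)]
    for idx in sorted(features):
        parts.append(f"{idx}:{features[idx]:g}")
    return " ".join(parts)

line = to_svmlight(1, {3: 0.5, 1: 2.0})
# → "1 1:2 3:0.5"
```

Because only non-zero entries are written, nothing in the file says how many features exist in total, which is exactly the con the slide mentions.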
9. Lessons Learnt
➔ Choose your tech depending on data size; do not go for hype-driven development
➔ Spend time on feature generation and selection
➔ Increase relevance and minimize redundancy
➔ Use the same feature generation pipeline for both training and prediction
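The last lesson can be made concrete with a single shared featurizer called from both the training job and the serving path; the ad fields and features below are illustrative assumptions, not OLX's actual schema:

```python
def featurize(ad):
    """Single feature-generation step shared by training and prediction,
    so both paths see identical features (no train/serve skew)."""
    title = ad.get("title", "").lower()
    return {
        "title_len": len(title),
        # Crude proxy for a phone number embedded in the title.
        "has_phone_digits": sum(c.isdigit() for c in title) >= 7,
        "n_words": len(title.split()),
    }

# Training and serving both call the exact same function.
train_rows = [featurize(a) for a in [{"title": "iPhone 12 call 5551234567"}]]
serve_row = featurize({"title": "iPhone 12 call 5551234567"})
```

Duplicating this logic in two codebases is a classic source of silent skew; packaging it once and importing it in both places avoids the problem.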
12. Lessons Learnt
➔ Automate and make things deterministic
➔ Airflow, Luigi and many others are good choices for job dependency management
13. Measuring Classifier Performance
➔ Accuracy is not always the best metric, especially on imbalanced data such as moderation
➔ Precision-recall (PR) curves are good for measuring classifier performance
➔ ROC can be used for general classifier performance
➔ Choose one evaluation metric and stick to it
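At a single decision threshold, precision and recall reduce to three confusion counts; a minimal sketch (scikit-learn's `precision_recall_curve` sweeps all thresholds for the full PR curve):

```python
def precision_recall(y_true, y_score, threshold=0.5):
    """Precision and recall at a fixed decision threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, y_score):
        pred = s >= threshold
        if pred and y:
            tp += 1          # flagged and truly bad
        elif pred and not y:
            fp += 1          # flagged but fine (moderator wasted effort)
        elif not pred and y:
            fn += 1          # bad ad that slipped through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

p, r = precision_recall([1, 0, 1, 1, 0], [0.9, 0.8, 0.4, 0.7, 0.1])
```

Accuracy would reward a model that never flags anything when bad ads are rare; precision and recall expose that failure directly.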
15. Lessons Learnt
➔ Always batch
Batching reduces CPU utilization, so the same machines can handle many more requests
➔ Modularize, Dockerize and orchestrate
Containerize your code so that it is independent of machine configurations
➔ Monitoring
Use a monitoring service
➔ Choose simple and easy tech
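The "always batch" lesson boils down to scoring many ads per model call instead of one call per ad; a minimal sketch, where `score_batch` is a stand-in for a vectorized model call:

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield fixed-size batches so the model scores many ads per call."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def score_batch(ads):
    # Placeholder for a vectorized model call, e.g. model.predict_proba(X);
    # the per-call overhead is paid once per batch, not once per ad.
    return [0.0 for _ in ads]

scores = [s for batch in batched(range(10), 4) for s in score_batch(batch)]
```

With XGBoost and scikit-learn models, a single `predict` on a matrix of N rows is far cheaper than N single-row predictions, which is where the CPU savings come from.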