In a diverse country like India, socio-economic factors such as religion, caste, language and income, along with physical and professional attributes, play a vital role in the search for a spouse. With the surge in Internet connectivity, online matrimonial websites have become hugely popular in catering to these needs. Most users registered on these portals genuinely intend to find a life partner; however, the portals also attract a few people with no such intention. Such users are known as fake/spam profiles. They cause a bad user experience as well as revenue loss for the online matrimony business. In this thesis we present an approach to identify such users using machine learning techniques. Because few labelled examples of fake/suspicious users are available, we treat the problem as an anomaly detection problem and use an autoencoder, a model widely used for anomaly detection. We capture a user's behaviour, profile information and edit history to classify the profile as fake or genuine. We frame this as a reconstruction task: the autoencoder is trained on the features of a set of genuine profiles. At prediction time, the autoencoder shows a small reconstruction error for genuine profiles and a very high reconstruction error for fake users, which allows them to be detected. The proposed system achieves 91.76% accuracy with 90.2% recall for the fake class. To the best of our knowledge, this is the first study to detect fake/spam user profiles in the online matrimony domain.
1. Detecting Fake Profiles On Online Matrimony
Vaibhav Garg
Dr. Ponnurangam Kumaraguru (Chair)
linkedin.com/in/vaibhav-garg-0a708899
facebook.com/in/vaibhav.garg.104203
@rk_check
2. Thesis Committee
◆ Dr. Arun Balaji Buduru, IIIT Delhi
◆ Dr. Siddhartha Asthana, United Health Group (Optum)
◆ Dr. Ponnurangam Kumaraguru, IIIT Delhi
5. Demo
* Due to the company's privacy policy, we cannot give a demo on the company's actual portal.
6. Outline
◆ About Online Matrimony
◆ About the Data
◆ Characteristics of a fake profile
◆ Using only Behaviour Trends
◆ Using Behaviour, Edit and Profile Information
◆ Incorporating Community features
◆ Feature Engineering: Proposed Full length feature vector
◆ Final Results
◆ Conclusion
9. About the Data
◆ To dig into the problem, we chose a use case from India’s leading matrimony website.
◆ Ground Truth: 5,40,737 genuine profiles and very few fake profiles.
◆ Categorical attributes in the data: age, body type, caste, city, country, education, height, income, manglik, marital status, mother tongue, occupation, religion.
10. Categorical Data
Attribute | Number of Categories | Example Categories
Caste | 470 | Hindu: Arora, Hindu: Aggarwal, Hindu: Brahmin etc.
Height | 37 | 5’0, 5’1, 5’2, 5’3 etc.
Income | 25 | Rs. 0-1 Lakh, Rs. 1-2 Lakh etc.
Mother Tongue | 42 | Telugu, Bengali, Hindi-Delhi etc.
Occupation | 69 | Doctor, Analyst, IT-Engineer etc.
11. Categorical Data
Attribute | Number of Categories | Example Categories
Religion | 10 | Hindu, Muslim, Christian etc.
Body Type | 4 | Slim, Average, Athletic, Heavy
Country | 214 | India, Afghanistan, Australia etc.
City | 3683 | Delhi, UP, Ahmedabad etc.
Manglik | 2 | Manglik, Non-Manglik
12. Categorical Data
Attribute | Number of Categories | Example Categories
Marital Status | 4 | Never Married, Divorcee, Separated and Widowed
Education | 53 | B.A, B.Com, B.Tech etc.
18. Behavioural Trend for Caste Attribute
Experimented on 100 fake and 100 genuine profiles belonging to the Aggarwal community.
19. Behavioural Trend for Marital Status Attribute
Experimented on 100 fake and 100 genuine profiles belonging to the Non Married community.
20. Static Windows
User’s first 8 days of activity (Day 0 to Day 7), with each day split into two 12-hour windows:
0th window: first 12 hours of Day 0
1st window: last 12 hours of Day 0
...
15th window: first 12 hours of Day 7
16th window: last 12 hours of Day 7
24. Offline Results on Behaviour Features
Confusion Matrix | Predicted Fake | Predicted Clean
Actual Fake | 2953 | 852
Actual Clean | 168 | 17799
The above results were obtained on 3805 fake profiles and 17967 clean profiles.
Drawback: A user must have been on the portal for 8 days before they can be scrutinized with this approach.
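The slide reports only raw counts. As a worked check, the sketch below derives precision, recall and accuracy for the fake class from those four counts. The function name `classification_metrics` is illustrative and not part of the original work.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (precision, recall, accuracy), treating 'fake' as the positive class."""
    precision = tp / (tp + fp)                  # flagged profiles that are truly fake
    recall = tp / (tp + fn)                     # fake profiles that were flagged
    accuracy = (tp + tn) / (tp + fn + fp + tn)  # all correct predictions
    return precision, recall, accuracy


# Counts taken from the confusion matrix on this slide.
precision, recall, accuracy = classification_metrics(tp=2953, fn=852, fp=168, tn=17799)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

On these counts the behaviour-only model reaches roughly 0.946 precision, 0.776 recall and 0.953 accuracy on the offline set.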
26. LIVE Results: False Negatives
Edit and Profile features need to be incorporated.
28. Edit Summary for Mother Tongue Attribute
Experimented on 100 fake and 100 genuine profiles that registered with the Hindi-UP category.
29. Edit Summary for Income Attribute
Experimented on 100 fake and 100 genuine profiles that registered with the Rs 5-7.5 Lakh category.
30. Concept of Dynamic Windows
User’s active lifetime on portal = T seconds
User’s total initiates = N
Number of windows selected = W
0th window: time period of the first N/W initiates
1st window: time period of the next N/W initiates
...
Last window: time period of the last N/W initiates
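A minimal sketch of the dynamic-window idea follows. It assumes each initiate carries a timestamp in seconds since registration. Two details are not stated on the slide and are assumptions here: a window's time period runs from the first to the last initiate in its group, and when N is not divisible by W the groups differ in size by at most one. The function name `dynamic_windows` and the sample timestamps are hypothetical.

```python
def dynamic_windows(initiate_times, num_windows):
    """Split initiates into num_windows consecutive groups of roughly N/W each.

    Returns one (start_time, end_time, initiate_count) tuple per window.
    """
    times = sorted(initiate_times)
    n = len(times)
    windows = []
    for w in range(num_windows):
        lo = w * n // num_windows        # index of first initiate in window w
        hi = (w + 1) * n // num_windows  # one past the last initiate in window w
        chunk = times[lo:hi]
        if chunk:
            windows.append((chunk[0], chunk[-1], len(chunk)))
        else:
            windows.append((None, None, 0))  # fewer initiates than windows
    return windows


# Made-up timestamps (seconds since registration) for N = 9 initiates, W = 3.
times = [5, 20, 40, 100, 900, 950, 4000, 4100, 9000]
windows = dynamic_windows(times, 3)
print(windows)
```

Unlike the static 12-hour windows, these window boundaries follow the user's own activity, so a profile does not need a fixed age on the portal before it can be scored.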
31. Feature Designing
◆ Profile Features: One-hot vector of profile attributes.
◆ Behaviour Features: In dynamic time windows, each feature stores the proportion of initiates sent to a particular category of an attribute.
◆ Edit Features: In dynamic time windows, each feature stores the proportion of time the user has spent in a particular category of an attribute.
◆ Other Raw Features: In each window, we also store the total interests sent and the time duration of that window.
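The profile and behaviour features described above can be sketched as follows, using the four Body Type categories listed earlier in the deck. The helper names `one_hot` and `initiate_proportions` and the sample data are illustrative assumptions; the original feature pipeline is not shown in the slides.

```python
from collections import Counter

BODY_TYPES = ["Slim", "Average", "Athletic", "Heavy"]


def one_hot(value, categories):
    """Profile feature: 1.0 at the position of the user's own category, else 0.0."""
    return [1.0 if value == c else 0.0 for c in categories]


def initiate_proportions(recipient_values, categories):
    """Behaviour feature for one window: share of initiates sent to each category."""
    total = len(recipient_values)
    if total == 0:
        return [0.0] * len(categories)
    counts = Counter(recipient_values)
    return [counts[c] / total for c in categories]


# A user whose own body type is Athletic and who sent four initiates in a window.
profile_vec = one_hot("Athletic", BODY_TYPES)
behaviour_vec = initiate_proportions(["Slim", "Slim", "Average", "Slim"], BODY_TYPES)
print(profile_vec, behaviour_vec)
```

Edit features could be built the same way, with time spent in each category taking the place of initiate counts.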
33. Experimenting with number of dynamic windows
No. of Windows | Precision | Recall | Accuracy
5 windows | 0.170 | 0.510 | 0.8830
4 windows | 0.192 | 0.635 | 0.8891
3 windows | 0.230 | 0.780 | 0.8977
2 windows | 0.242 | 0.804 | 0.8975
1 window | 0.266 | 0.866 | 0.8972
34. Feature Selection on Best Model
Method | Precision | Recall | Accuracy
Best Model | 0.266 | 0.866 | 0.8972
Best Model + Feature Selection | 0.269 | 0.894 | 0.9083
Criterion used = ((Entropy for fake) - (Entropy for clean)) / (Entropy for fake)
Precision is still low.
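The slide gives the selection criterion as (Entropy for fake - Entropy for clean) / (Entropy for fake) but does not say how the entropies are computed or how the score is used to select features. The sketch below assumes Shannon entropy (in bits) of a feature's value distribution within each class, and returns 0.0 when the fake-class entropy is zero. Both choices, and the names `shannon_entropy` and `selection_score`, are assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of values."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def selection_score(fake_values, clean_values):
    """(H_fake - H_clean) / H_fake for one feature."""
    h_fake = shannon_entropy(fake_values)
    h_clean = shannon_entropy(clean_values)
    if h_fake == 0:
        return 0.0  # assumption: treat the undefined ratio as zero
    return (h_fake - h_clean) / h_fake


# Toy feature: spread evenly over 4 values among fakes (2 bits),
# but over only 2 values among clean profiles (1 bit).
fake = ["A", "B", "C", "D"]
clean = ["A", "A", "B", "B"]
score = selection_score(fake, clean)
print(score)
```

Under this reading, the toy feature scores (2 - 1) / 2 = 0.5.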
36. Affinity Features along with Behaviour Features
◆ An affinity score between two categories i and j is the likelihood that a person with category i sends interests to a user with category j.
◆ When combined with behaviour features, affinity scores let us compare how a user is expected to behave with how he/she actually behaves on the platform.
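One way to read the affinity definition above is as a conditional frequency: of all interests sent by users in category i, the fraction that went to users in category j. The sketch below estimates it from a log of (sender category, recipient category) pairs, then measures how far one user's actual proportions sit from that expectation using a sum of absolute differences. The estimator, the gap measure, the function names and the sample log are all assumptions; the slides do not specify them.

```python
from collections import Counter


def affinity_scores(interests):
    """Map (sender_cat, recipient_cat) to the share of sender_cat's interests sent to recipient_cat."""
    pair_counts = Counter(interests)
    sender_totals = Counter()
    for (sender, _recipient), count in pair_counts.items():
        sender_totals[sender] += count
    return {pair: count / sender_totals[pair[0]] for pair, count in pair_counts.items()}


def behaviour_gap(sender_cat, actual_proportions, scores):
    """Sum of absolute differences between actual and expected proportions."""
    expected = {r: s for (snd, r), s in scores.items() if snd == sender_cat}
    categories = set(actual_proportions) | set(expected)
    return sum(abs(actual_proportions.get(c, 0.0) - expected.get(c, 0.0)) for c in categories)


# Made-up interest log over mother-tongue categories.
log = [
    ("Hindi-Delhi", "Hindi-Delhi"),
    ("Hindi-Delhi", "Hindi-Delhi"),
    ("Hindi-Delhi", "Bengali"),
    ("Telugu", "Telugu"),
]
scores = affinity_scores(log)

# A Hindi-Delhi user who sends every interest to Telugu profiles.
gap = behaviour_gap("Hindi-Delhi", {"Telugu": 1.0}, scores)
print(scores[("Hindi-Delhi", "Hindi-Delhi")], gap)
```

A large gap indicates behaviour that departs from what is typical for the user's own community.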
39. Proposed Full length Feature Vector
Profile Features + Behaviour Features in Time Windows + Affinity Features + Edit Features in Time Windows
42. Final Results
Method | Precision | Recall | Accuracy
Proposed Features + Autoencoder | 0.341 | 0.902 | 0.9176
The product team required 25% precision at 60% recall.
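The abstract states that the autoencoder, trained on genuine profiles, gives a small reconstruction error for genuine users and a large one for fake users. The sketch below shows only the decision step that follows from this. It assumes mean squared error as the reconstruction error and a threshold set at a quantile of the errors observed on genuine profiles. The 0.95 quantile, the function names and the sample errors are assumptions, and the autoencoder itself is not implemented here.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between a feature vector and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def pick_threshold(genuine_errors, quantile=0.95):
    """Error value at the given quantile of errors seen on genuine profiles (assumed rule)."""
    ordered = sorted(genuine_errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def flag_fake(errors, threshold):
    """Flag a profile as fake when its reconstruction error exceeds the threshold."""
    return [e > threshold for e in errors]


# Made-up reconstruction errors on ten genuine training profiles.
genuine_errors = [0.01, 0.02, 0.015, 0.03, 0.012, 0.018, 0.025, 0.011, 0.02, 0.04]
threshold = pick_threshold(genuine_errors)

# Two new profiles: one reconstructed well, one reconstructed poorly.
flags = flag_fake([0.02, 0.35], threshold)
example_error = reconstruction_error([1.0, 0.0, 0.0], [0.9, 0.1, 0.0])
print(threshold, flags, example_error)
```

Moving the threshold trades precision against recall, which is the lever for meeting a target such as 25% precision at 60% recall.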
44. Conclusion
◆ We first studied the differences in behaviour, profile and edit patterns between genuine and fake users.
◆ We incorporated these characteristics as features using dynamic time windows.
◆ We then trained an autoencoder model to detect fake profiles on online matrimony.
47. Limitations and Future Work
◆ More training samples for the autoencoder could lead to better generalisation.
◆ We detected fake profiles using categorical attributes only. Text spamming can be explored.