Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Aniket Kulkarni

  1. Merchant Churn Prediction using SparkML at PayPal
  2. Who are we? • Data Engineers: ETL pipelines using Spark • Like all great projects, we started from a hack! • Data Engineering to Machine Learning
  3. Agenda • Scale at PayPal • Understanding Merchant Churn • Machine Learning Workflow • Learnings • Spark ML
  4. Scale at PayPal • 200 Countries • 25 Currencies • 19 Million Merchants • 237 Million Active Users • 8 Billion Transactions per Year • 6 Billion Events per Day
  5. Understanding Merchant Churn: Compliance Use Case for CLAC (Story, Triggers, Impact, Insights) • Increase in Compliance Limitations for CLAC in 2017 • Regulations mandate that merchants complete Compliance verification • Applicable to merchants exceeding $$ in a 12-month period • Might lead to the merchant’s account being suspended • Merchant not aware of limitation • Merchant did not understand how to resolve limitation • High impact for small merchants • Biggest churn driver for CLAC in 2017 • $M in payments
  6. Churn Recovery Efforts: Existing pipeline • Limited success • Reactive process • Account managers reach out to merchants who have already churned • Reversing the limitation and relaunching merchants takes time • Large set of merchants for reach-outs • Diagram stages: New Merchants, Merchants Get Limitation, Account Manager, Relaunched Merchants, Merchants Churn
  7. Churn Recovery Efforts: Enhanced pipeline • Proactive process • Use machine learning pipeline to predict time to reach $$ • Reach out to merchants before limitation is reached • Mitigate restriction and churn • Diagram stages: New Merchants, Merchants Likely to Get Limitation, Account Manager, Merchants complete regulation, ML Model: Predict Revenue and Timelines
  8. ML Platform • Layers: Data, Models, Integration, Channel • Metadata (Segment, Geo, Capacity, Priority, Channel, etc.) • Models: Model 1, Model 2, Model N • Channels: Salesforce Alerts, Salesforce SSO, E-mail, … • Feedback Data (Optimization & Learnings) • Performance Tracking
  9. So where do we start from?
  10. Learning to do Machine learning (roadmap with “We’re here” marker) • Explore data: Let’s analyze what kind of data we have • Stop churn: We’re done & merchants are happy!
  11. Select Training Data: Ask questions • What datasets do we use for training the model? • Should we focus only on initial transactions? • What data is relevant for new merchants? • Should we consider inflation and currency conversion? • Which merchants should we use to train the model?
  12. Data: Analyze our datasets • Categories: Payments, Activity, Consumers, Merchants • Attributes shown: Demographics, Consumer Spending, Low/mid/high shopper, Country, Visits data, Payment attempts, Successful transactions, Account Identity, Currency, Country, Industry, Cross border, PayPal products, Transaction Amount, Frequency, New users, Repeat users, Transaction Type
  13. Data Transformation Strategies: Transforming data into features • Raw Features: Merchant Profile • Binning: Transaction & Revenue Data • Trendlines: Weekly trends in transactions • Binarization: Payment Methods / Cross Border • Seasonality: Tune weights for Transaction data
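A minimal sketch of three of the strategies on slide 13 (binning, binarization, trendlines), written in plain Python rather than Spark ML so it runs anywhere. The bucket boundaries, field names, and sample merchant are hypothetical and not from the talk. In Spark ML, the `Bucketizer` and `Binarizer` transformers cover the first two steps.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries for monthly revenue (not from the talk).
REVENUE_SPLITS = [100.0, 1_000.0, 10_000.0]


def bin_index(value, splits):
    """Binning: bucket 0 is below splits[0]; bucket i is [splits[i-1], splits[i])."""
    return bisect_right(splits, value)


def binarize(value, threshold=0.0):
    """Binarization: 1 if value is strictly above threshold, else 0."""
    return 1 if value > threshold else 0


def weekly_trend(counts):
    """Trendline: least-squares slope of weekly counts (needs at least 2 weeks)."""
    n = len(counts)
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Hypothetical raw record for one merchant.
merchant = {
    "monthly_revenue": 2_500.0,
    "cross_border_txns": 3,
    "weekly_txn_counts": [10, 12, 14, 16],
}

features = {
    "revenue_bin": bin_index(merchant["monthly_revenue"], REVENUE_SPLITS),
    "has_cross_border": binarize(merchant["cross_border_txns"]),
}
trend = weekly_trend(merchant["weekly_txn_counts"])
print(features, trend)
```

The slope is positive when weekly transaction volume is growing, which is the kind of signal a time-to-threshold model could use.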
  14. Learning to do Machine learning (roadmap with “We’re here” marker) • Explore data: Let’s analyze what kind of data we have • Data Prep: Let’s prepare the data for machine learning • Stop churn: We’re done & merchants are happy!
  15. Feature Engineering: Transforming data into features • Multiple Source Stitching • Indicator Variables • Normalization • Feature Selection • Outlier Removal
  16. Data Preparation: Multiple source stitching (Industry & Sub-industry enrichment) • Source 1: Coverage 20% • Source 2: Coverage 30% • Source 3: Coverage 30% • Stitch attribute values based on accuracy • Enriched feature: Coverage 70%
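One way to read "stitch attribute values based on accuracy" is a priority coalesce: for each merchant, take the value from the most accurate source that has one. The sketch below assumes that reading. The merchant IDs, industry values, and source ordering are hypothetical, chosen only so the coverage figures match the slide (20%, 30%, 30%, enriched 70%).

```python
def stitch(sources_by_accuracy, merchant_ids):
    """Take each merchant's value from the first (most accurate) source that has it."""
    enriched = {}
    for mid in merchant_ids:
        for source in sources_by_accuracy:
            value = source.get(mid)
            if value is not None:
                enriched[mid] = value
                break
    return enriched


def coverage(values, merchant_ids):
    """Fraction of merchants with a non-missing value."""
    return sum(1 for m in merchant_ids if values.get(m) is not None) / len(merchant_ids)


merchant_ids = [f"m{i}" for i in range(10)]

# Hypothetical industry attribute from three sources, most accurate first.
source_1 = {"m0": "retail", "m1": "retail"}                   # 20% coverage
source_2 = {"m1": "apparel", "m2": "travel", "m3": "gaming"}  # 30% coverage
source_3 = {"m4": "food", "m5": "retail", "m6": "travel"}     # 30% coverage

stitched = stitch([source_1, source_2, source_3], merchant_ids)
print(coverage(stitched, merchant_ids))  # 0.7
print(stitched["m1"])                    # retail (source_1 outranks source_2)
```

Coverage adds up to less than the sum of the sources because `m1` appears in two of them; the more accurate source wins the conflict.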
  17. Data Preparation: Indicator variables • Attribute X with Type 1, Type 2, Type 3, each with a count • 3 features: Type 1 count, Type 2 count, Type 3 count
  18. Data Preparation: Indicator variables • Attribute X with Type 1, Type 2, Type 3, each with a count • Calculate most active type • 1 feature: Most Active Type • E.g. Gender, Monthly transaction count
  19. Data Preparation: Indicator variables • Attribute X • Calculate buckets and assign indicator • Attribute bucket indicator • E.g. Age, Income
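The three indicator-variable patterns on slides 17 to 19 can be sketched in plain Python as follows. The payment-type names, age boundaries, and labels are hypothetical stand-ins for "Attribute X" and its buckets.

```python
from bisect import bisect_right
from collections import Counter


def type_counts(events, types):
    """Slide 17: one count feature per type."""
    counts = Counter(events)
    return {f"{t}_count": counts.get(t, 0) for t in types}


def most_active_type(events):
    """Slide 18: collapse the counts into a single 'most active type' feature."""
    return Counter(events).most_common(1)[0][0] if events else None


def bucket_indicators(value, edges, labels):
    """Slide 19: bucket a continuous attribute and emit one 0/1 flag per bucket."""
    hit = bisect_right(edges, value)
    return {label: int(i == hit) for i, label in enumerate(labels)}


# Hypothetical values of "Attribute X" for one merchant.
events = ["card", "card", "balance", "bank"]

print(type_counts(events, ["card", "balance", "bank"]))
print(most_active_type(events))
print(bucket_indicators(30, [25, 45], ["under_25", "25_to_44", "45_plus"]))
```

The first form keeps the full distribution at the cost of one feature per type; the second keeps a single categorical feature, which matters when a feature selector later caps the feature count.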
  20. Data Preparation: Hypothesis testing • All features • Chi-square Selector (pValue) • Top 30 features
  21. Data Preparation: Outliers • Dormant Merchants • Restriction placed to not receive funds • Account locked
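A rough sketch of chi-square feature selection in plain Python: compute the Pearson chi-square statistic between each categorical feature and the label, then keep the top k. Two assumptions to note: the sketch ranks by the statistic rather than the p-value shown on the slide (the two rankings agree only when all features have the same degrees of freedom), and the feature names and data are hypothetical. In Spark ML this step is available as `ChiSqSelector`.

```python
from collections import Counter


def chi_square(feature, label):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, label))
    feature_totals = Counter(feature)
    label_totals = Counter(label)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n
            stat += (observed.get((f, l), 0) - expected) ** 2 / expected
    return stat


def top_k(features, label, k):
    """Names of the k features with the largest chi-square statistic."""
    ranked = sorted(features, key=lambda name: chi_square(features[name], label), reverse=True)
    return ranked[:k]


# Hypothetical toy data: 'cross_border' tracks the label, 'currency' does not.
label = [1, 1, 0, 0]
features = {
    "currency": ["usd", "eur", "usd", "eur"],
    "cross_border": ["yes", "yes", "no", "no"],
}

print(chi_square(features["cross_border"], label))  # 4.0
print(chi_square(features["currency"], label))      # 0.0
print(top_k(features, label, 1))                    # ['cross_border']
```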
  22. Learning to do Machine learning (roadmap with “We’re here” marker) • Explore data: Let’s analyze what kind of data we have • Data Prep: Let’s prepare the data for machine learning • Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model • Stop churn: We’re done & merchants are happy!
  23. Model selection: Choosing the right label (the right ‘y’) • Week • Quarter • Year • No. of days • Classification • Regression
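The slide lists two framings without saying how they pair up. A plausible reading, assumed here, is that "No. of days" is a regression target and "Week / Quarter / Year" are classification buckets over the same quantity (days until the merchant reaches the threshold). The bucket boundaries of 7, 90, and 365 days and the `beyond_year` class are hypothetical.

```python
def regression_label(days_to_threshold):
    """Regression framing: predict the number of days directly."""
    return float(days_to_threshold)


def classification_label(days_to_threshold):
    """Classification framing: predict a coarse time bucket (assumed boundaries)."""
    if days_to_threshold <= 7:
        return "week"
    if days_to_threshold <= 90:
        return "quarter"
    if days_to_threshold <= 365:
        return "year"
    return "beyond_year"  # hypothetical catch-all class, not on the slide


for days in (3, 45, 200, 500):
    print(days, regression_label(days), classification_label(days))
```

The coarser label gives up precision but can be enough when the only decision is when an account manager should reach out.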
  24. Model selection: Choosing the right model (Classification) • Logistic Regression: low accuracy • Decision tree: accuracy improved; overfitting • Naïve Bayes: add more categorical features; accuracy improved; overfitting persisted • Gradient boosting tree: accuracy improved; overfitting reduced; high time to train • Random forests: accuracy improved; overfitting reduced; low time to train
  25. Learning to do Machine learning (roadmap with “We’re here” marker) • Explore data: Let’s analyze what kind of data we have • Data Prep: Let’s prepare the data for machine learning • Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model • Cross validation & hyperparameter tuning: Fine-tune and re-verify the model • Stop churn: We’re done & merchants are happy!
  26. Hyper-parameter tuning and Cross validation: Hyper-parameter values for Random Forest model • Number of trees: 5, 10, 15, 20, 25 • Max Bins: 5, 10, 20, 30 • Impurity: Gini, Entropy • Max Depth: 5, 10, 20, 30 • Feature Subset Strategy: auto • Folds: 3
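To get a feel for the size of this search, the grid on slide 26 can be expanded in plain Python. The values are transcribed from the slide; the dictionary keys are illustrative names, not Spark parameter names. In Spark ML the same grid would typically be built with `ParamGridBuilder` and run through `CrossValidator` with `numFolds` set to 3.

```python
from itertools import product

# Values transcribed from slide 26; key names are illustrative.
GRID = {
    "num_trees": [5, 10, 15, 20, 25],
    "max_bins": [5, 10, 20, 30],
    "impurity": ["gini", "entropy"],
    "max_depth": [5, 10, 20, 30],
    "feature_subset_strategy": ["auto"],
}
FOLDS = 3


def expand(grid):
    """Return every combination of grid values as a list of parameter dicts."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


candidates = expand(GRID)
fold_fits = len(candidates) * FOLDS

print(len(candidates))  # 160 parameter combinations
print(fold_fits)        # 480 model fits across the 3 folds
```

With 5 × 4 × 2 × 4 × 1 = 160 combinations and 3 folds, an exhaustive search trains 480 random forests, which is why time to train shows up among the learnings.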
  27. Hyper-parameter tuning and Cross validation: How do we measure whether we have the right model? • Accuracy • AUC ROC • Precision • Recall • AUC PR • F1
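For reference, here is how several of these metrics are computed for a binary classifier, as a plain-Python sketch. The confusion-matrix counts and scores are made up for illustration and are not results from the talk. The `au_roc` helper uses the pairwise definition (the probability that a random positive is scored above a random negative), which is quadratic in the sample size and only suitable for small examples.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def au_roc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts: 30 true positives, 10 false positives,
# 20 false negatives, 40 true negatives.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
print(metrics)

# Hypothetical model scores for positive and negative examples.
print(au_roc([0.9, 0.8, 0.4], [0.7, 0.3]))
```

Accuracy alone can look fine on imbalanced churn data, which is one reason to track precision, recall, and the area-under-curve metrics alongside it.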
  28. Hyper-parameter tuning and Cross validation: Model comparison [bar chart, scale 0 to 1: Accuracy, auROC, auPR for Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree, Random Forests]
  29. Hyper-parameter tuning and Cross validation: Model comparison [bar chart, scale 0 to 1: Accuracy, auROC, auPR, Best F1 for Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree, Random Forests]
  30. Learning to do Machine learning (roadmap with “We’re here” marker) • Explore data: Let’s analyze what kind of data we have • Data Prep: Let’s prepare the data for machine learning • Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model • Cross validation & hyperparameter tuning: Fine-tune and re-verify the model • Pipeline and learnings: View the final pipeline and learnings • Stop churn: We’re done & merchants are happy!
  31. Pipeline and Learnings: Final pipeline view • Transaction Data, Merchant Profile, Customer Behavior, Demo-social • Existing Merchants, New Merchants • Revenue Prediction Algo, Timeline Prediction Algo • ML Model, ML Model • Revenue Prediction, Timeline Prediction • Time-based merchant selection • Channel: Salesforce Alerts, Email Notification
  32. Pipeline and Learnings: Learnings • Hypothesis testing • Outlier removal • Hyperparameter tuning • Categorical features vs continuous features • Time to train • Accuracy
  33. Learning to do Machine learning (roadmap with “We’re here” marker) • Explore data: Let’s analyze what kind of data we have • Data Prep: Let’s prepare the data for machine learning • Model selection: Let’s discuss the approach to decide the ‘y’ and choose a model • Cross validation & hyperparameter tuning: Fine-tune and re-verify the model • Pipeline and learnings: View the final pipeline and learnings • Stop churn: We’re done & merchants are happy!
  34. Thank you, SparkML! You’re awesome • Spark ETL -> Spark ML • Supports many models out of the box • Scalable for large data • Easy cross-validation • Extensive feature transformation suite • ...many more
  35. QUESTIONS?