Random Forests Best Practices for the Business World

By Gabby Shklovsky
PyData New York City 2017

This talk will explain best practices for successfully using random forests in the business world. It will focus on (1) best practices for preparing training data for random forests so that random forests can do what they do best and (2) best practices for interpreting random forest results to address concerns of business leaders that may not trust black box algorithms.

Published in: Technology

  1. Random Forests: Best Practices for the Business World
     PyData NYC 2017
     Gabby Shklovsky, Revenue Analytics
     CONFIDENTIAL © 2017 Revenue Analytics, Inc.
  2. Agenda
     • Random Forest Overview
     • Best Practices for Training Random Forests
       ‒ Get the best predictions possible from the data available
     • Best Practices for Interpreting Random Forests
       ‒ Sanity-check model outputs to catch data or code issues
       ‒ Break inside the black box to provide insights to skeptical business leaders
  3. Random Forest Overview
  4. Quick Overview of Random Forest Models
     • Random forests are a type of supervised machine learning algorithm known as an ensemble method
       ‒ Ensemble methods use multiple ML models which together give better predictive performance than any of the component models would give on its own
       ‒ Random forests are ensembles of decision trees
         – Train multiple (e.g. 30) tree models and blend the predictions of each tree to determine the combined prediction
       ‒ Random forests are considered “black box” models
         – The underlying tree structure allows for more human-interpretable insight than a typical black-box algorithm
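
The "blend the predictions of each tree" step can be made concrete with a small sketch. It is not from the talk: it assumes scikit-learn's `RandomForestRegressor` and synthetic data from `make_regression`, and it only illustrates that, for regression, the forest's prediction is the average of its trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, purely for illustration (not the talk's data).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# 30 trees, matching the "e.g. 30" on the slide.
forest = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

# Each fitted tree is available in forest.estimators_.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# For regression, blending means averaging the per-tree predictions.
blended = per_tree.mean(axis=0)
forest_pred = forest.predict(X)

print(np.allclose(blended, forest_pred))
```
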
  5. What’s random about them?
     • Decision tree algorithms are deterministic
     • Randomization is needed to produce an ensemble of trees that are each somewhat different
       ‒ Random sampling of data to create various input data sets
         – Typically uses bootstrapping (sampling with replacement)
         – Each tree fits to slightly different patterns that appear in different samples
       ‒ Random sampling of possible split variables
         – Restrict the set of potential split variables so the “best” one is not always chosen
         – Disrupting the greediness of split variable selection could actually yield a more accurate tree
  6. What’s random about them? (continued: repeats the text of slide 5 and adds a diagram)
     [Diagram: two example trees, labeled “Decision Tree” and “Tree in Random Forest”, each shown with the candidate split variables Customer type, Num employees, Sell Amt Last Year, Region. One tree splits on Num employees < 1000 (Pred = 100, n = 1000; Pred = 200, n = 1000); the other splits on Customer type == A (Pred = 120, n = 500; Pred = 160, n = 1500). The diagram also carries the label “Prediction: 100”.]
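
The two sources of randomness described on slides 5 and 6 map onto two parameters of scikit-learn's `RandomForestRegressor`. The sketch below is an illustration under assumptions (synthetic data, `max_features=0.5` chosen arbitrarily), not the speaker's code. It checks which feature each tree uses for its first split, to show that restricting the candidate split variables makes the trees differ.

```python
from collections import Counter

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with four predictors, purely for illustration.
X, y = make_regression(n_samples=500, n_features=4, noise=5.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=30,
    bootstrap=True,    # each tree is trained on a sample drawn with replacement
    max_features=0.5,  # each split considers a random half of the predictors
    random_state=0,
).fit(X, y)

# Index of the feature used at the root split of each tree.
root_features = [int(tree.tree_.feature[0]) for tree in forest.estimators_]

# With restricted split candidates, the trees do not all start with the same variable.
print(Counter(root_features))
```
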
  7. Best Practices for Training Random Forests
  8. Best Practices for Training Random Forests
     Best Practice #1: Factor out linear relationships between predictor and response
     • A strong linear relationship often overpowers subtler effects
     • Let each model do what it does best
     Example: B2B company wants to predict customer spend next year
     • Available data:
       ‒ Customer spend for the past few years
       ‒ Customer attributes (e.g. size, industry)
  9. Best Practice #1: Factor out linear relationships
     • If last year's spend is used to predict next year's spend, last year's spend will dominate the tree splits
       ‒ Trees approximate a linear relationship in a non-linear way (inefficient)
       ‒ Trees are unlikely to pick up second-order effects (usually the most interesting)
  10. Best Practice #1: Factor out linear relationships
      • Instead, change the target variable to: Sell Amt Diff = Sell Amt - Sell Amt LY
        ‒ Now other factors (customer type, num employees) are predictive
        ‒ Last year's spend is still predictive, but in a different way
  11. Best Practice #1: Factor out linear relationships
      • Note that R^2 will be misleading when comparing a model with a linear component against one without
        ‒ Scikit-learn's RandomForestRegressor.score() returns R^2 (the same quantity as r2_score)
      • When the model's target variable is a transformation of the actual target value, make sure to measure accuracy on the actual target value
        ‒ For Model B: Sell Amt Pred = Sell Amt LY Actual + Sell Amt Diff Predicted

      Model                  R^2 score (RF)   MAPE (final pred)
      Model A (linear)       0.86             23.7%
      Model B (difference)   0.01             17.5%
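
A minimal sketch of Best Practice #1 follows. The data-generating process, column names (`sell_amt`, `sell_amt_ly`, `num_employees`) and the `mape` helper are assumptions made for illustration, so the numbers it prints have no relation to the table above. The point is the mechanics: Model B is trained on the difference, and its prediction is added back to last year's actual spend before accuracy is measured on the actual target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "sell_amt_ly": rng.uniform(1_000, 100_000, n),
    "num_employees": rng.integers(10, 5_000, n),
})
# Hypothetical process: next year's spend is mostly last year's spend plus a size effect.
df["sell_amt"] = (
    df["sell_amt_ly"] - 2.0 * df["num_employees"] + rng.normal(0, 2_000, n)
).clip(lower=100)

features = ["sell_amt_ly", "num_employees"]
train, test = train_test_split(df, test_size=0.25, random_state=0)


def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)))


# Model A: predict next-year spend directly.
model_a = RandomForestRegressor(n_estimators=30, random_state=0)
model_a.fit(train[features], train["sell_amt"])
pred_a = model_a.predict(test[features])

# Model B: predict the year-over-year difference, then add last year's actual spend back.
model_b = RandomForestRegressor(n_estimators=30, random_state=0)
model_b.fit(train[features], train["sell_amt"] - train["sell_amt_ly"])
pred_b = test["sell_amt_ly"].to_numpy() + model_b.predict(test[features])

# Both models are compared on the actual target, not on the transformed one.
mape_a = mape(test["sell_amt"], pred_a)
mape_b = mape(test["sell_amt"], pred_b)
print(f"Model A MAPE: {mape_a:.3f}  Model B MAPE: {mape_b:.3f}")
```

Calling `model_b.score(...)` here would report R^2 on the difference target, which is why it is not comparable with Model A's R^2.
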
  12. Best Practices for Training Random Forests
      Best Practice #2: Feature engineering is key
        i. Use domain expertise / business knowledge as a guide
        ii. Explicitly define interaction effects as new predictors
        iii. Use multiple metrics that are proxies for the same concept as predictors
      Example: B2B company wants to predict customer spend next year
      • Available data:
        ‒ Customer spend for the past few years
        ‒ Customer attributes (e.g. size, industry)
      • Focus on iii: use multiple metrics as proxies
  13. Best Practice #2: Feature engineering is key
      • Want to add the previous year's spend difference as a predictor
        ‒ Should it be modeled in dollars or as a percentage?
        ‒ Why not both?
          – Often leads to better predictive performance
          – Can also make the model less vulnerable to outliers: if some trees split on % and others on $, then when one value is extreme but the other is not, the extreme effect will be somewhat muted

      Model     Previous Year Spend Diff Variables Included   MAPE
      Model B   None                                          17.5%
      Model C   Sell Amt LY Diff $                            16.8%
      Model D   Sell Amt LY Diff %                            16.9%
      Model E   Sell Amt LY Diff $, Sell Amt LY Diff %        16.7%
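
As a small illustration of "why not both?", the sketch below derives the dollar and percentage versions of the previous-year spend difference with pandas. The column names (`sell_amt_ly`, `sell_amt_2y`) and the three customers are hypothetical. The rows are chosen so that the dollar change is identical while the percentage change differs widely, which is the kind of complementary signal the slide describes.

```python
import pandas as pd

# Hypothetical columns: spend last year and spend the year before that.
spend = pd.DataFrame({
    "sell_amt_ly": [1_200.0, 50_000.0, 300.0],
    "sell_amt_2y": [1_000.0, 49_800.0, 100.0],
})

# Same concept (previous-year growth) expressed as two proxy metrics.
spend["sell_amt_ly_diff_dollars"] = spend["sell_amt_ly"] - spend["sell_amt_2y"]
spend["sell_amt_ly_diff_pct"] = spend["sell_amt_ly_diff_dollars"] / spend["sell_amt_2y"]

print(spend)
```
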
  14. Best Practices for Training Random Forests (recap)
      Best Practice #1: Factor out linear relationships between predictor and response
      • A strong linear relationship often overpowers subtler effects
      • Let each model do what it does best
      Best Practice #2: Feature engineering is key
        i. Use domain expertise / business knowledge as a guide
        ii. Explicitly define interaction effects as new predictors
        iii. Use multiple metrics that are proxies for the same concept as predictors
  15. Best Practices for Interpreting Random Forests
  16. Best Practices for Interpreting Random Forests
      Best Practice #3: Check that feature importances align with expectations/intuition
      • Confirm which features are actually adding predictive value
      • If features expected to be important are not, this may point to data or model issues
      • Can reveal new business insights
      Example: B2B company wants to predict customer spend next year
      • Evaluate Model E from the previous example (predicting Sell Amt Diff)

      Feature                Importance
      Num employees          0.342
      Sell Amt LY            0.278
      Sell Amt LY Diff Pct   0.230
      Sell Amt LY Diff $     0.086
      Customer Type == A     0.061
      Customer Type == B     0
      Customer Type == C     0
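
A table like the one above can be produced from a fitted scikit-learn forest through its `feature_importances_` attribute. The sketch below uses synthetic data and a deliberately uninformative `noise` column (both assumptions, not the talk's data), so the ranking can be sanity-checked against what is known about how the data was generated.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "num_employees": rng.integers(10, 5_000, n),
    "sell_amt_ly": rng.uniform(1_000, 100_000, n),
    "noise": rng.normal(size=n),  # should receive little importance
})
# Hypothetical response that depends only on num_employees.
y = -0.5 * X["num_employees"] + rng.normal(0, 50, n)

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```
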
  17. Best Practices for Interpreting Random Forest Results
      Best Practice #4: Check directional relationship between top predictors and response
        i. Manually step through 4-5 levels of a few trees to detect patterns
        ii. "Stress test" the model with synthetic data that varies the value of one predictor holding all else equal
      Example: B2B company wants to predict customer spend next year
      • Evaluate Model E from the previous example (predicting Sell Amt Diff)
  18. Best Practice #4: Check directional relationships
      • Manually stepping through trees is a good way to sanity-check model results
      [Figure: Model E, tree 0]

      Feature              Relationship with Response
      Sell Amt LY Diff %   Higher Sell Amt LY Diff % -> lower predicted Sell Amt Diff (regression to mean)
      Num employees        Higher Num Employees -> lower predicted Sell Amt Diff
      Sell Amt LY          Higher Sell Amt LY -> higher predicted Sell Amt Diff
      Customer Type        A has lower predicted Sell Amt Diff than B or C
  19. Best Practice #4: Check directional relationships
      • Manually stepping through trees is a good way to sanity-check model results
        ‒ Walk through at least 2-3 trees in the forest to see whether the patterns are consistent
        ‒ May not always see a clear directional relationship (↑ in some places, ↓ in others)
      [Figure: Model E, tree 6]
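
One way to step through the top levels of a few trees without drawing them is scikit-learn's `export_text`, which prints a tree's splits as indented rules. The sketch below is an assumption about tooling, since the talk does not say how the trees were inspected. It uses synthetic data and prints the first four levels of trees 0 and 6, echoing the tree numbers on the slides.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "num_employees": rng.integers(10, 5_000, n),
    "sell_amt_ly": rng.uniform(1_000, 100_000, n),
})
# Hypothetical response: larger customers have a lower spend difference.
y = -0.5 * X["num_employees"] + rng.normal(0, 50, n)

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

# Text rendering of the top levels of two individual trees in the forest.
reports = {
    i: export_text(model.estimators_[i], feature_names=list(X.columns), max_depth=4)
    for i in (0, 6)
}
for i, report in reports.items():
    print(f"--- tree {i} ---")
    print(report)
```

Reading the leaf values from left to right along a split on one feature shows whether higher values of that feature lead to higher or lower predictions in that tree.
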
  20. Best Practice #4: Check directional relationships
      • “Stress testing” the model can give a more holistic view of particular relationships
        ‒ Create synthetic data that varies one feature while holding all else equal
        ‒ Because the model is non-linear, use multiple “all else equal” data sets
      • Stress test Sell Amt LY Diff $, because it did not show up in the manual tree checks
        ‒ Each line represents a different set of values of all other variables
        ‒ Shows that predicted Sell Amt Diff decreases as Sell Amt LY Diff $ increases, but only for high values of Sell Amt LY Diff $
        ‒ In a few places the relationship is ↑ rather than ↓
  21. Best Practice #4: Check directional relationships
      • “Stress testing” the model can give a more holistic view of particular relationships
        ‒ Create synthetic data that varies one feature while holding all else equal
        ‒ Because the model is non-linear, use multiple “all else equal” data sets
      • Stress test Num Employees
        ‒ Confirms the consistent ↓ relationship seen in manual tree stepping
        ‒ Size of the ↓ relationship varies significantly based on the values of other variables
        ‒ A somewhat consistent inflection point appears between 1200 and 1400 employees
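
The stress test described on slides 20 and 21 can be sketched as follows. The `stress_test` helper, the synthetic data and the choice of five baseline rows are all assumptions made for illustration. Each baseline row plays the role of one "all else equal" data set, and each row of the result corresponds to one line on the slide's plot.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "num_employees": rng.integers(10, 5_000, n).astype(float),
    "sell_amt_ly": rng.uniform(1_000, 100_000, n),
})
# Hypothetical response, used only so there is a model to stress test.
y = -0.5 * X["num_employees"] + 0.01 * X["sell_amt_ly"] + rng.normal(0, 50, n)
model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)


def stress_test(model, baselines, feature, grid):
    """Return one row of predictions per baseline and one column per grid value."""
    curves = []
    for _, row in baselines.iterrows():
        # Repeat the baseline row, then overwrite only the feature under test.
        synthetic = pd.DataFrame([row] * len(grid)).reset_index(drop=True)
        synthetic[feature] = grid
        curves.append(model.predict(synthetic[baselines.columns]))
    return np.vstack(curves)


# Several "all else equal" settings, because the model is non-linear.
baselines = X.sample(5, random_state=0)
grid = np.linspace(X["num_employees"].min(), X["num_employees"].max(), 25)
curves = stress_test(model, baselines, "num_employees", grid)
print(curves.round(1))
```

Plotting each row of `curves` against `grid` gives one line per baseline, so both the direction of the relationship and how much it varies with the other variables can be read off.
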
  22. Best Practices for Interpreting Random Forests (recap)
      Best Practice #3: Check that feature importances align with expectations/intuition
      • Confirm which features are actually adding predictive value
      • If features expected to be important are not, this may point to data or model issues
      • Can reveal new business insights
      Best Practice #4: Check directional relationship between top predictors and response
        i. Manually step through 4-5 levels of a few trees to detect patterns
        ii. "Stress test" the model with synthetic data that varies the value of one predictor holding all else equal
  23. Questions?
      Contact info
      • Email: gshklovsky@revenueanalytics.com
      • Twitter: @GabbyShklovsky
      • LinkedIn: https://www.linkedin.com/in/shklovskyg/
