Automatic Model Documentation with H2O

This presentation was made on June 18, 2020.

Video recording of the session can be viewed here:

For many companies, model documentation is a requirement for any model to be used in the business. For other companies, model documentation is part of a data science team’s best practices. Model documentation includes how a model was created, training and test data characteristics, what alternatives were considered, how the model was evaluated, and information on model performance.

Collecting and documenting this information can take a data scientist days to complete for each model. The model document needs to be comprehensive and consistent across various projects. The process of creating this documentation is tedious for the data scientist and wasteful for the business because the data scientist could be using that time to build additional models and create more value. Inconsistent or inaccurate model documentation can be an issue for model validation, governance, and regulatory compliance.

In this virtual meetup, we will learn how to create comprehensive, high-quality model documentation in minutes that saves time, increases productivity, and improves model governance.

Speaker's Bio:

Nikhil Shekhar: Nikhil is a Machine Learning Engineer at H2O.ai. He is currently working on our automatic machine learning platform, Driverless AI. He graduated from the University of Buffalo, majoring in Artificial Intelligence, and is interested in developing scalable machine learning algorithms.


  1. ML AutoDoc: Auto Documentation for ML Models
      Nikhil Shekhar (Machine Learning Engineer)
  2. Machine Learning Documentation Challenges
      • Tedious for Data Scientist
      • Time Consuming
        - Iterations and reviews
      • Inconsistent
      • Incomplete
      • Error prone
      • Compliance Requirements
        - Banks
        - Healthcare
  3. ML AutoDoc Overview
      • Automatically generates editable Word doc to document model creation (algos, techniques, data, etc.)
      • Save Data Science Resources: automatically build required documentation
      • Customize to your business needs
      • Simple to use
  4. AutoDoc Support Products
      Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
  5. H2O-3
  6. H2O-3 ML AutoDoc
      Supported Algorithms
      • AutoDocs for Supervised Learning Models (H2O-3 and XGBoost)
        - XGBoost
        - Gradient Boosting Machine
        - Generalized Linear Model
        - Deep Learning
        - Distributed Random Forest (including Extremely Randomized Forest)
        - Stacked Ensembles
      Packaging
      • Integrated in H2O Steam
      • Python Package
      Docs
  7. Prediction Stats, Partial Dependence, Feature Importance, Actual vs. Predicted
  8. Generation Code Example (4 Lines of Code)
  9. Steam: ML AutoDoc
      • Steam exposes a service to generate model report (ML AutoDoc) for OSS H2O-3 models (e.g., GBM, AutoML)
        - The same service is used in Driverless AI to generate documentation for DAI models
      AutoDoc for OSS H2O-3 AutoML model
  10. Scikit-learn Examples
  11. Scikit-learn ML AutoDocs
      • ML AutoDoc has initial support of 3rd party models: Scikit-learn
      • AutoDocs for Supervised Learning Models
      • Scikit-learn Linear Models:
        - LogisticRegression
      • Scikit-learn Ensemble Methods:
        - RandomForestClassifier
        - GradientBoostingClassifier
        - GradientBoostingRegressor
      AutoDoc for OSS Scikit Gradient Boosting Model
  12. Partial Dependence, Response Rate, AUC, Shift Detection, Confusion Matrix
  13. Driverless AI Examples
  14. AutoDocs in Driverless AI
      Algorithms Supported
      • XGBoost, LightGBM, TensorFlow
      Additional Features
      • Included in Driverless AI
      • Explainability
      • Customized Reports
  15. Feature Importance, Scoring Pipeline, Actual v. Predicted, PDP/ICE
  16. Driverless AI
  17. Driverless AI AutoDocs
  18. Experiment Summary
  19. Data Overview
  20. Shift Detection
  21. Methodology
  22. Validation Strategy
  23. Model Tuning: All Models, More Details
  24. Features and Feature Engineering
  25. Final Model
  26. Driverless AI AutoDoc
  27. Experiment Overview
  28. Data Overview
  29. Methodology
  30. Model Tuning: All Models, More Details
  31. Features
  32. Final Model
  33. Customization
      Model Diagnostics
      • Additional Performance Metrics
        - Population Stability Index
        - Prediction Statistics per Quantile
        - Actual vs Predicted Plots
        - GINI Plot
      • Diagnostics on New Datasets
        - Perform Model Diagnostics on a list of new datasets
      Model Interpretability
      • Partial Dependence Plots
        - Generate partial dependence plots on the n most important features
        - Includes histogram with frequency of each feature
      • Individual Conditional Expectation Plot
        - Add partial dependence plot for specific records only
      • Variable Importance
        - Calculate variable importance on original features using Permutation Importance
        - Filter variables to top relative importance or top n features
  34. Population Stability Index
      # Calculate PSI
      config_overrides += "\nautodoc_population_stability_index=true"
  35. Prediction Statistics per Quantile
      # Enable the Prediction Statistics for each dataset split
      config_overrides += "\nautodoc_prediction_stats=true"
  36. Variable Importance
      # Enable the permutation feature importance table and plot
      config_overrides += "\nautodoc_include_permutation_feature_importance=true"
  37. Partial Dependence Plots
  38. Partial Dependence Plots with ICE
  39. Response Rate Plots (Binary Use Cases Only)
      # Enable the response rate plot for each dataset split
      config_overrides += "\nautodoc_response_rate=true"
  40. GINI Plot (Binary Use Cases Only)
      # Show the Gini Plot
      config_overrides += "\nautodoc_gini_plot=true"
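
The customization slides (34 to 36, 39 and 40) each append one `autodoc_*` setting to a `config_overrides` string. The sketch below collects those settings in one place, assuming they are newline-separated `key=true` entries. The flag names come from the slides; the helper `build_config_overrides` and the `AUTODOC_FLAGS` tuple are hypothetical and not part of any H2O API. The slides do not show how the resulting string is passed to Driverless AI, so that step is left out.

```python
# Flag names as they appear on slides 34-36, 39 and 40.
AUTODOC_FLAGS = (
    "autodoc_population_stability_index",
    "autodoc_prediction_stats",
    "autodoc_include_permutation_feature_importance",
    "autodoc_response_rate",
    "autodoc_gini_plot",
)


def build_config_overrides(enabled, base=""):
    """Append one newline-prefixed `key=true` entry per enabled flag to `base`.

    Hypothetical helper: it only mirrors the string concatenation shown
    on the slides and rejects names that the slides do not mention.
    """
    unknown = [name for name in enabled if name not in AUTODOC_FLAGS]
    if unknown:
        raise ValueError(f"flags not shown in the slides: {unknown}")

    config_overrides = base
    for name in enabled:
        config_overrides += f"\n{name}=true"
    return config_overrides


overrides = build_config_overrides(
    ["autodoc_population_stability_index", "autodoc_gini_plot"]
)
print(overrides)
```

Keeping the allowed names in one tuple makes a mistyped flag fail loudly instead of being silently ignored.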
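
Slide 34 enables the Population Stability Index (PSI) without defining it. PSI is commonly computed by binning a reference sample and a new sample the same way and summing `(actual - expected) * ln(actual / expected)` over the bins. The sketch below is a generic pure-Python version under stated assumptions: equal-width bins taken from the reference sample's range, and a small floor on empty bins to avoid taking the log of zero. It is not AutoDoc's own calculation, and the function name and parameters are illustrative.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two numeric samples.

    Assumptions (not taken from the slides): equal-width bins over the
    range of `expected`; out-of-range values fall into the first or last
    bin; empty bins are floored at `eps`.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant samples

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clip into the valid bin range
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]

    exp_frac = bin_fractions(expected)
    act_frac = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))


# Toy data: scores on the training split and on a shifted new dataset.
train_scores = [i / 100 for i in range(100)]
new_scores = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(train_scores, train_scores)
psi_shifted = population_stability_index(train_scores, new_scores)
print(psi_same, psi_shifted)
```

Every term in the sum is non-negative, so PSI is zero only when the two binned distributions match and grows as they drift apart.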