Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.

Introduction & Hands-on with H2O Driverless AI

316 visualizaciones

Publicado el

These slides were presented by Marios Michailids and John Spooner at Dive into H2O: London on June 17, 2019.

Marios's session can be found here: https://youtu.be/GMtgT-3hENY

John's session can be found here: https://youtu.be/5t2zw4bVfsw

Publicado en: Software
  • Sé el primero en comentar

  • Sé el primero en recomendar esto

Introduction & Hands-on with H2O Driverless AI

  1. 1. Confidential1
  2. 2. Welcome to #DiveIntoH2O London 4:00 PM – Tea & biscuits & networking 4:30 PM - Welcome by H2O.ai Team 4:32 PM - Session by Vladimir Brenner, Partner Account Manager, Disruptors & AI, Intel AI 4:50 PM - Introduction to H2O Driverless AI with Marios Michailidis, Kaggle Grandmaster, H2O.ai 5:30 PM - Pizza break! 5:45 PM - Hands-on with H2O Driverless AI with John Spooner, H2O.ai 7:30 PM - Q&A and networking 8:00 PM - Session ends Presented By
  3. 3. Introduction to H2O Driverless AI Marios Michailidis
  4. 4. Confidential4 Background • Competitive data scientist at H2O.ai • PhD in ensemble methods at UCL • Former kaggle #1 – over 140+ competitions
  5. 5. Confidential5 H2O is the open source leader in AI. Democratize AI for Everyone. AI for good. 5 Confidential
  6. 6. Confidential6 Growing Worldwide Open Source Community 14,000 Companies Using H2O 155,000 Data Scientists 120K Meetup Members H2O World – NYC, London, SF Thousands attending live and online
  7. 7. Confidential7 H2O.ai Product Suite Automatic feature engineering, machine learning and interpretability • 100% open source – Apache V2 licensed • Built for data scientists – interface using R, Python on H2O Flow (interactive notebook interface) • Enterprise support subscriptions • Enterprise software • Built for domain users, analysts and data scientists – GUI-based interface for end-to-end data science • Fully automated machine learning from ingest to deployment • User licenses on a per seat basis (annual subscription) H2O AI open source engine integration with Spark Lightning fast machine learning on GPUs In-memory, distributed machine learning algorithms with H2O Flow GUI Open Source
  8. 8. Confidential8 Driverless AI Advantage • Simple interface • Automatic feature engineering to accuracy • Automatic recipes for solving wide variety use-cases • Automatic tuning to and tune the right ensemble of models
  9. 9. Confidential9 Confidential9 Driverless AI Features Target Data Quality and Transformation Modeling Table Model Building Model Data Integration + Typical Enterprise ML Workflow
  10. 10. Confidential10 AUC, Accuracy, Precision… Time, hardware Insight-visualization Feature Engineering Predictions Model Interpretability • Some input data • A target variable • An objective (or a success metric) • Some allocated resources H2O Driverless AI process x1 x2 x3 x4 y 0.14 0.69 0.01 0.71 1 0.22 0.44 0.45 0.69 1 0.12 0.35 0.51 0.23 0 0.22 0.42 0.79 0.60 1 0.93 0.82 0.72 0.50 1 0.32 0.58 0.28 0.22 0 0.95 0.59 0.68 0.09 1 0.34 0.58 0.35 0.81 0 0.05 0.80 0.28 0.86 1 0.23 0.49 0.63 0.03 0 0.05 0.34 0.53 0.73 1 0.74 0.02 0.33 0.56 0 Regression: How much will a customers spend? Classification: Will a customer make a purchase? Yes or No X y xi xj yes no
  11. 11. Confidential11 Automatic Visualization (AutoViz)
  12. 12. Confidential12 Visualizations outlier detection Correlation graph Heat map
  13. 13. Confidential13 Animal Dog Dog Cat Fish Cat Dog Fish Cat Cat Fish L.Encoding 2 2 1 3 1 2 3 1 1 3 Frequency 3 3 4 3 4 3 3 4 4 3 Is_Dog? Is_Cat? Is_Fish? 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 Cost 100 150 300 50 250 75 25 400 225 40 Animal Avg.cost Cat 293.75 Dog 108.333 Fish 38.3333 108.33 108.33 293.75 38.33 293.75 108.33 38.33 293.75 293.75 38.33 Feature engineering: Categorical features
  14. 14. Confidential14 Age 40 45 45 45 49 51 56 61 62 71 73 74 77 78 78 Age 40-51 52-56 57-77 78+ Age 40 45 45 45 missing 51 56 61 62 71 73 74 77 78 78 Age 40 45 45 45 61.14 51 56 61 62 71 73 74 77 78 78 Age 40-51 52-56 57-77 78+ missing Age 40-51 40-51 40-51 40-51 40-51 40-51 52-56 57-77 57-77 57-77 57-77 57-77 57-77 78+ 78+ Age 40-51 40-51 40-51 40-51 missing 40-51 52-56 57-77 57-77 57-77 57-77 57-77 57-77 78+ 78+ Mean= 61.14 Feature engineering: Numerical features Age Income
  15. 15. Confidential15 Age income 40 40 45 14 45 100 45 33 49 20 51 44 56 65 61 80 62 55 71 15 73 18 74 33 77 32 78 48 78 41 * + 1600 80 630 59 4500 145 1485 78 980 69 2244 95 3640 121 4880 141 3410 117 1065 86 1314 91 2442 107 2464 109 3744 126 3198 119 Animal color Dog brown Dog white Cat black Fish gold Cat mixed Dog black Fish white Cat white Cat black Fish brown Animal_color Dog_brown Dog_white Cat_black Fish_gold Cat_mixed Dog_black Fish_white Cat_white Cat_black Fish_brown Age Animal 13 Dog 12 Dog 15 Cat 3 Fish 16 Cat 10 Dog 6 Fish 20 Cat 13 Cat 2 Fish derived 11.66 11.66 16 3.66 16 11.66 3.66 16 16 3.66 Animal Age Dog 11.66 Cat 16 Fish 3.66 Feature engineering: Interactions
  16. 16. Confidential16 This dog is cute! I love dogs dog this is jedi love cute emperor bingo ! I 2 1 1 0 1 1 0 0 1 1 • TF(IDF) • Stemming, Spellchecking , n-grams, stop words • SVD • Word2vec • Other pre-trained embeddings (Glove, FasText) word f1 f2 f3 f4 f5 dog 0.8 0.2 0.1 -0.1 -0.4 this -0.2 0 0.1 0.8 0.3 love 0.1 0.2 0.7 -0.5 0.1 cute 0.2 0.2 0.8 -0.4 0 King – Man = Queen Feature engineering:Text
  17. 17. Confidential17 Date 1/1/2018 2/1/2018 3/1/2018 4/1/2018 5/1/2018 6/1/2018 7/1/2018 8/1/2018 9/1/2018 10/1/2018 Day Month Year weekday weeknum 1 1 2018 2 1 2 1 2018 3 1 3 1 2018 4 1 4 1 2018 5 1 5 1 2018 6 1 6 1 2018 7 1 7 1 2018 1 2 8 1 2018 2 2 9 1 2018 3 2 10 1 2018 4 2 Date Sales 1/1/2018 100 2/1/2018 150 3/1/2018 160 4/1/2018 200 5/1/2018 210 6/1/2018 150 7/1/2018 160 8/1/2018 120 9/1/2018 80 10/1/2018 70 Lag1 Lag2 - - 100 - 150 100 160 150 200 160 210 200 150 210 160 150 120 160 80 120 MA2 - 100 125 155 180 205 180 155 140 100 Feature engineering: Time series
  18. 18. Confidential18 Common open source packages used in ML
  19. 19. Confidential19 • Optimizing maximum depth • Tree expansion (loss vs depth) • Learning rate • Number of trees 1 2 3 3 2 Hyper parameter tuning
  20. 20. Validation approach Train data Past Future x0 x1 x2 x3 y 0.94 0.27 0.80 0.34 1 0.02 0.22 0.17 0.84 0 0.83 0.11 0.23 0.42 1 0.74 0.26 0.03 0.41 0 0.08 0.29 0.76 0.37 0 0.71 0.76 0.43 0.95 1 0.08 0.72 0.97 0.04 0 0.84 0.79 0.89 0.05 1 0.94 0.27 0.80 0.34 1 0.02 0.22 0.17 0.84 0 0.83 0.11 0.23 0.42 1 0.74 0.26 0.03 0.41 0 0.08 0.29 0.76 0.37 0 0.71 0.76 0.43 0.95 1 0.08 0.72 0.97 0.04 0 0.84 0.79 0.89 0.05 1 K=4 pred 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Fold : 1 pred 0.96 0.03 0.00 0.00 0.00 0.00 0.00 0.00 Train Predict pred 0.96 0.03 0.90 0.12 0.00 0.00 0.00 0.00 Fold : 2Fold : 3 pred 0.96 0.03 0.90 0.12 0.03 0.77 0.00 0.00 Fold : 4 pred 0.96 0.03 0.90 0.12 0.03 0.77 0.18 0.91
  21. 21. Confidential21 Genetic algorithm approach x1 x2 x3 x4 y 0.14 0.69 0.01 0.71 1 0.22 0.44 0.45 0.69 1 0.12 0.35 0.51 0.23 0 0.22 0.42 0.79 0.60 1 0.93 0.82 0.72 0.50 1 0.32 0.58 0.28 0.22 0 0.95 0.59 0.68 0.09 1 0.34 0.58 0.35 0.81 0 0.05 0.80 0.28 0.86 1 0.23 0.49 0.63 0.03 0 0.05 0.34 0.53 0.73 1 0.74 0.02 0.33 0.56 0 Iteration1/10 X% accuracy Feature Importance x4 1 x2 0.5 x3 0.2 x1 0.01 Tuned Iteration2/10 + Random Transformation sqrt(x2) Z% accuracy Feature Importance x2+x4 1 x4 0.4 sqrt(x2) 0.3 x3 0.2 x2 0.05
  22. 22. Confidential22 Feature ranking based on permutations 0.5 0.52 0.69 0.2 1 0.18 0.94 0.57 0.68 0 0.67 0.6 0.76 0.91 1 0.25 0.68 0.57 0.07 1 0.83 0.07 0.76 0.56 0 0.49 0.92 0.64 0.02 0 0.17 0.7 0.57 1 0.96 0.75 0.59 1 0.96 0.14 0.09 0 0.61 0.38 0.07 0 0.07 0.13 0.87 1 0.66 0.27 0.48 1 0.41 0.56 0.21 0.61 0.26 0.02 0.02 0.56 0.21 0.41 0.61 0.26 Train Validation fit score 80% accuracy 70% accuracy (drop) Shuffle The 10% difference is how important that feature is Bring back to normal Repeat for all features
  23. 23. Confidential23 Stacking X0 x1 x2 xn y 0.17 0.25 0.93 0.79 1 0.35 0.61 0.93 0.57 0 0.44 0.59 0.56 0.46 0 0.37 0.43 0.74 0.28 1 0.96 0.07 0.57 0.01 1 A X0 x1 x2 xn y 0.89 0.72 0.50 0.66 0 0.58 0.71 0.92 0.27 1 0.10 0.35 0.27 0.37 0 0.47 0.68 0.30 0.98 0 0.39 0.53 0.59 0.18 1 B X0 x1 x2 xn y 0.29 0.77 0.05 0.09 ? 0.38 0.66 0.42 0.91 ? 0.72 0.66 0.92 0.11 ? 0.70 0.37 0.91 0.17 ? 0.59 0.98 0.93 0.65 ? C pred0 pred1 pred2 y 0.24 0.72 0.70 0 0.95 0.25 0.22 1 0.64 0.80 0.96 0 0.89 0.58 0.52 0 0.11 0.20 0.93 1 B1 pred0 pred1 pred2 y 0.50 0.50 0.39 ? 0.62 0.59 0.46 ? 0.22 0.31 0.54 ? 0.90 0.47 0.09 ? 0.20 0.09 0.61 ? C1 Train algorithm 0 on A and make predictions for B and C and save to B1, C1 Train algorithm 1 on A and make predictions for B and C and save to B1, C1 Train algorithm 2 on A and make predictions for B and C and save to B1, C1 Train algorithm 3 on B1 and make predictions for C1 Preds3 0.45 0.23 0.99 0.34 0.05 Consider datasets A,B,C.Target variable (y) is known for A,B
  24. 24. Confidential24 “The ability to explain or to present in understandable terms to a human.” – Finale Doshi-Velez and Been Kim. “Towards a rigorous science of interpretable machine learning.” arXiv preprint. 2017. https://arxiv.org/pdf/1702.08608.pdf FAT*: https://www.fatml.org/resources/principles-for-accountable-algorithms XAI: https://www.darpa.mil/program/explainable-artificial-intelligence Machine Learning Interpretability?
  25. 25. Confidential25 Interpretability and accuracy trade-off Linear Models Exact explanations for approximate models. Machine Learning Approximate explanations for exact models.
  26. 26. Confidential26 Decision Tree Surrogate Models Partial Dependence and Individual Conditional Expectation https://github.com/jphall663/interpretable_machine_learning_with_python Machine Learning Interpretability examples
  27. 27. Confidential27 Scoring Pipeline • Python • MOJO (Java)
  28. 28. Confidential28 Thank you marios@h2o.ai in/mariosmichailidis/ @StackNet_ https://www.facebook.com/StackNet/
  29. 29. Confidential29 Introduction to H2O Driverless AI John Spooner Director of Solution Engineering - EMEA
  30. 30. DRIVERLESS AI HANDS ON WORKSHOP
  31. 31. CONFIDENTIAL CONFIDENTIAL The Game Plan Set up Driverless AI Environment Introduction into the data sets Showtime – Time to get hands on
  32. 32. CONFIDENTIAL CONFIDENTIAL Please Create an account on Aquarium Navigate to aquarium.h2o.ai Create a new account
  33. 33. CONFIDENTIAL CONFIDENTIAL Enter your details Create a new account
  34. 34. CONFIDENTIAL CONFIDENTIAL Sign-in to aquarium Go to Browse Labs Choose David Whiting’s Lab
  35. 35. CONFIDENTIAL CONFIDENTIAL Sign-in to aquarium Go to Browse Labs Choose David Whiting’s Lab And click start lab
  36. 36. CONFIDENTIAL CONFIDENTIAL Prediction problem
  37. 37. CONFIDENTIAL CONFIDENTIAL Time Series Forecasting Problem
  38. 38. Confidential38 Experiment Settings ACCURACY • Relative accuracy – higher values should lead to higher confidence in model performance (accuracy) • Impacts things such as level of data sampling, how many models are used in the final ensemble, parameter tuning level, among others Accuracy Time Interpretability • Relative time for completing the experiment • Higher settings mean: – More iterations are performed to find the best set of features – Longer “early stopping” threshold • Relative interpretability – higher values favor more interpretable models • The higher the interpretability setting, the lower the complexity of the engineered features and of the final model(s)
  39. 39. CONFIDENTIAL CONFIDENTIAL Accuracy Accuracy Max Rows x Cols Ensemble Level Target Transformation Parameter Tuning Level Num Folds Only First Fold Model Distribution Check 1 100K 0 False 0 3 True No 2 1M 0 False 0 3 True No 3 50M 0 True 1 3 True No 4 100M 0 True 1 3-4 True No 5 200M 1 True 1 3-4 True Yes 6 500M 2 True 1 3-5 True Yes 7 750M <=3 True 2 3-10 Auto Yes 8 1B <=3 True 2 4-10 Auto Yes 9 2B <=3 True 3 4-10 Auto Yes 10 10B <=4 True 3 4-10 Auto Yes
  40. 40. CONFIDENTIAL CONFIDENTIAL Time Time Iterations Early Stopping Rounds 1 1-5 None 2 10 5 3 30 5 4 40 5 5 50 10 6 100 10 7 150 15 8 200 20 9 300 30 10 500 50
  41. 41. CONFIDENTIAL CONFIDENTIAL Interpretability Interpretability Ensemble Level Target Transformation Feature Engineering Feature Pre- Pruning Monotonicity Constraints 1 - 3 <= 3 None Disabled 4 <= 3 Inverse None Disabled 5 <= 3 Anscombe Clustering (ID, distance) Truncated SVD None Disabled 6 <= 2 Logit Sigmoid Feature selection Disabled 7 <= 2 Frequency Encoding Feature selection Enabled 8 <= 1 4th Root Feature selection Enabled 9 <= 1 Square Square Root Bulk Interactions (add, subtract, multiply, divide) Weight of Evidence Feature selection Enabled 10 0 Identity Unit Box Log Date Decompositions Number Encoding Target Encoding Text (TF-IDF, Frequency) Feature selection Enabled Good start
  42. 42. CONFIDENTIAL CONFIDENTIAL Showtime Data Set Review Visualisation Automated Machine Learning Machine Learning Interpribility Free Style
  43. 43. CONFIDENTIAL CONFIDENTIAL Showtime Customer Churn Prediction Sales Forecasting Train Dataset – card_train Test Dataset – card_test Train Dataset – Cannabis_train Test Dataset – Cannabis_test Target: target (Default) Features: Credit Limit, Sex, Education, Marriage, Age, Status 1-6, Bill Amt 1-6, PayAmt1-6 Target – Weekly_sales Features – Is holiday, type, temperature, fuel_price, markdown,CPI, unemployment,sample_weight Groups: Sales_Date, Organisation Tasks - Review data - Visualistion of data - Build a predictive model. - Correctly specify training and test dataset - Apply the following knob settings: 1,1,10 - Once experiment has finished, explore results. Download the experiment summary - Interpret the model Tasks - Review data - Visualistion of data - Build a predictive model. Remember to: - Correctly specify training and test dataset - Set the time column to sales_date - Set group columns to Sales_Date, Organisation - Apply the following settings: 4,4,1 - Once experiment has finished, explore results. Download the experiment summary - Interpret the model Additional tasks - Modify the experiment settings to improve the model accuracy - Modify the expert settings to understand advanced capability Additional tasks - Modify the experiment settings to improve the model accuracy - Modify the expert settings to understand advanced capability
  44. 44. CONFIDENTIAL CONFIDENTIAL Showtime Data Set Review Visualisation Automated Machine Learning Machine Learning Interpribility Free Style
  45. 45. The Data Column Description ID ID of each client CreditLimit Amount of credit in NT dollars Sex Gender (M, F) Education (1: graduate school, 2: university, 3: high school, 4: others, 5-6: unknown) Marriage Marital status (M, S, D, O) Age Age in years Status1 … Status6 Repayment status in September, 2005 – April, 2005 BillAmt1 … BillAmt6 Amount of bill statement in September, 2005 – April, 2005 (NT dollar) PayAmt1 … PayAmt6 Amount of previous payment in September, 2005 – April, 2005 (NT dollar) Default Default payment (1=yes, 0=no)
  46. 46. Payment History Data 1 Month Ago Status1: ≤0, 1 BillAmt1 PayAmt1 2 Months Ago Status2: ≤0, 1, 2 BillAmt2 PayAmt2 3 Months Ago Status3: ≤0, 1, 2, 3 BillAmt3 PayAmt3 ... 6 Months Ago Status6: ≤0, 1, ..., 6 BillAmt6 PayAmt6 Status: -2: No balance -1: Paid in full 0: Minimum balance paid 1: One month late 2: Two months late etc.

×