
A Look Under the Hood of H2O Driverless AI

Driverless AI is H2O.ai's latest flagship product for automatic machine learning. It fully automates some of the most challenging and productive tasks in applied data science, such as feature engineering, model tuning, model ensembling and production deployment. Driverless AI turns Kaggle-winning grandmaster recipes into production-ready code (Java and C++) and is specifically designed to avoid common mistakes such as under- or overfitting, data leakage or improper model validation, some of the hardest challenges in data science. Other industry-leading capabilities include automatic data visualization and machine learning interpretability.
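To make the data-leakage point concrete, here is a minimal sketch of out-of-fold target encoding (target encoding is one of the feature-engineering techniques listed on slide 12). It is not Driverless AI's implementation; the function name, the round-robin fold assignment and the toy data are assumptions chosen for illustration. The idea is that a row's encoded value never uses that row's own target.

```python
from collections import defaultdict


def out_of_fold_target_encode(categories, targets, n_folds=3):
    """Encode each row's category by the mean target of rows in the OTHER folds.

    Hypothetical helper for illustration; folds are assigned round-robin by index.
    """
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        # Accumulate per-category statistics from rows outside the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        # Encode rows inside the current fold with those statistics only.
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                # Categories unseen outside this fold fall back to the global mean.
                encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded


cats = ["a", "a", "b", "a", "b", "c"]
ys = [1, 0, 1, 1, 0, 1]
enc = out_of_fold_target_encode(cats, ys, n_folds=3)
```

A naive encoding would give row 0 the mean of all "a" targets, including its own label; the out-of-fold version only sees the labels of the other folds.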

With Driverless AI, data scientists of all proficiency levels can train and deploy modeling pipelines with just a few clicks from the GUI. Advanced users can use the client API from Python or R. Driverless AI builds hundreds or thousands of models under the hood to select the best feature engineering and modeling pipeline for every specific problem such as churn prediction, fraud detection, real-estate pricing, store sales prediction, marketing ad campaigns and many more.
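The slides name evolutionary strategies as part of the pipeline search (slide 13). The sketch below shows the general idea on a toy problem: keep the best candidates, mutate them, and repeat. It is not the Driverless AI algorithm; `evolve`, `toy_score` and the set of "useful" features are hypothetical stand-ins, and a real system would score each candidate by training and validating a model.

```python
import random

USEFUL = {0, 2}  # hypothetical: the only features that carry signal in this toy


def toy_score(mask):
    """Stand-in for a validation score: reward useful features, penalize size."""
    return sum(1.0 for i in USEFUL if mask[i]) - 0.1 * sum(mask)


def evolve(score, n_features, population=8, generations=20, seed=0):
    """Elitist evolutionary search over boolean feature masks."""
    rng = random.Random(seed)
    pop = [tuple(rng.random() < 0.5 for _ in range(n_features))
           for _ in range(population)]
    best = max(pop, key=score)
    history = [score(best)]
    for _ in range(generations):
        # Selection: the better half survives unchanged.
        survivors = sorted(pop, key=score, reverse=True)[: population // 2]
        # Mutation: each survivor spawns one child with a single flipped bit.
        children = []
        for parent in survivors:
            flip = rng.randrange(n_features)
            children.append(tuple((not g) if i == flip else g
                                  for i, g in enumerate(parent)))
        pop = survivors + children
        best = max(pop + [best], key=score)
        history.append(score(best))
    return best, history


best, history = evolve(toy_score, n_features=5)
```

Because survivors are carried over unchanged, the best score in `history` never decreases from one generation to the next.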

With Bring-Your-Own-Recipe, domain experts and advanced data scientists can now write their own recipes and seamlessly extend Driverless AI with their favorite tools from the rich ecosystem of open-source data science and machine learning libraries.
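One caveat for recipe authors, which the slides raise with the log(EDUCATION) example on slide 20: a strictly monotonic transform preserves the order of the values, so for models that split on thresholds (such as decision trees) it cannot produce any new partition of the rows. The sketch below checks this on toy data; `split_partitions` and the sample values are hypothetical and only illustrate the argument.

```python
import math


def split_partitions(values):
    """Return every set of row indices that a single threshold can send left."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    partitions = set()
    for k in range(1, len(order)):
        # A threshold can only separate rows with distinct values.
        if values[order[k - 1]] < values[order[k]]:
            partitions.add(frozenset(order[:k]))
    return partitions


education = [1, 2, 2, 3, 4, 6]
log_education = [math.log(v) for v in education]    # strictly monotonic
squared_dist = [(v - 3) ** 2 for v in education]    # not monotonic

same_as_raw = split_partitions(education) == split_partitions(log_education)
differs_from_raw = split_partitions(education) != split_partitions(squared_dist)
```

Here `same_as_raw` is true because the log transform keeps the ordering, while the non-monotonic transform reorders the rows and therefore offers different splits.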

In this talk, we explain how Driverless AI works and demonstrate it with live demos.

Arno's Bio:

Arno Candel is the Chief Technology Officer at H2O.ai. He is the main committer of H2O-3 and Driverless AI and has been designing and implementing high-performance machine-learning algorithms since 2012. Previously, he spent a decade in supercomputing at ETH and SLAC and collaborated with CERN on next-generation particle accelerators.

Arno holds a PhD and Masters summa cum laude in Physics from ETH Zurich, Switzerland. He was named “2014 Big Data All-Star” by Fortune Magazine and featured by ETH GLOBE in 2015. Follow him on Twitter: @ArnoCandel.

Published in: Technology

  1. A Look Under the Hood of H2O Driverless AI. Arno Candel, CTO, @ArnoCandel. Plano, TX, 5/1/19.
  2. Why Driverless AI? LinkedIn Workforce Report | United States | August 2018
  3. Driverless AI: AutoML for the Enterprise. Put models in production in days vs. months.
 Tabular Data with Outcomes
 Automatic ML & DS (ML: machine learning, DS: data science) Grandmaster Recipes: • Feature Engineering • Time Series • Model Tuning / Ensembling • Overfitting Protection • Bring Your Own Recipe
 Powered by datatable, H2O-3 and H2O4GPU
 ML Interpretability (reason codes in production)
 Automatic Report
 Scoring Pipeline (Python & Java, C++ soon)
 AutoVis, Scores, Diagnostics, Debugging
  4. Driverless AI: Used Across Many Industries. Industry Use Cases: Save Time. Save Money. Gain a Competitive Advantage.
 Financial Services
 Wholesale / Commercial Banking: • Know Your Customers (KYC) • Anti-Money Laundering (AML)
 Card / Payments Business: • Transaction frauds • Collusion fraud • Real-time targeting • Credit risk scoring • In-context promotion
 Retail Banking: • Deposit fraud • Customer churn prediction • Auto-loan
 Healthcare: • Early cancer detection • Product recommendations • Personalized prescription matching • Medical claim fraud detection • Flu season prediction • Drug discovery • ER and hospital management • Remote patient monitoring • Medical test predictions
 Telecom: • Predictive maintenance • Avoidable truck-rolls • Customer churn prediction • Improved customer viewing experience • Master data management • In-context promotions • Intelligent ad placements • Personalized program recommendations
 Marketing and Retail: • Funnel predictions • Personalized ads • Credit scoring • Fraud detection • Next best offer • Next best customer • Smart profiling • Prediction • Customer recommendations • Ad predictions and spend
  5. Driverless AI: Customer Feedback
 “Driverless AI is giving amazing results in terms of feature and model performance” Venkatesh Ramanathan, Senior Data Scientist, PayPal
 “Driverless AI helped us gain an edge with our Intelligent Marketing Cloud for our clients. AI to do AI, truly is improving our system on a daily basis.” Martin Stein, Chief Product Officer, G5
 “H2O Driverless AI feature engineering is better than anything I've seen out there right now. And the scoring pipeline generation is probably one of the bigger pluses for me. These features alone have provided us with a true competitive edge in agile manufacturing. It's a massive time saver.” Dr. Robert Coop, AI and ML Manager, Stanley Black & Decker
 “Driverless AI powers our data science team to operate efficiently and experiment at scale… with this latest innovation, we have the opportunity to impact care at large.” Bharath Sudarshan, Director of Data Science, Armada Health
 “[…] is doing a great job in enhancing the product at such a rapid rate. Each release provides significant increases in usability and value. Driverless AI gives startups like ours an effective alternative to large data science teams and their outsized cost. It can dramatically reduce the time needed to deliver first-rate ML models for a wide range of markets.” Marc Stein, CEO
  6. Driverless AI Architecture. InfoWorld Tech of the Year Award: 2018 & 2019
  7. Driverless AI: top 10 in BNP Paribas Kaggle competition. 2 months for Grandmasters vs. 2 hours for Driverless AI. Single run, fully automated: 2h on DGX Station, 6h on PC. Driverless AI: 10th place on the private leaderboard (LB) at Kaggle (out of 2926).
  8. Driverless AI: Teamwork and Maker’s Culture
  9. Driverless AI Roadmap (v1.7.0: May ’19)
 Version columns: v1.0, v1.1, v1.2, v1.3, v1.4, v1.5, v1.6 LTS, v1.7, v1.8 LTS, v2.0
 Features:
 • Kaggle Grandmaster Recipes for i.i.d. data, XGBoost Models
 • Automatic Visualization
 • Machine Learning Interpretability
 • Standalone Python Scoring Pipeline
 • Hardware acceleration: NVIDIA GPUs (DGX-1 etc.)
 • User Management and Security (LDAP/Kerberos)
 • Data Connectors: NFS/HDFS/S3/GCS/BigQuery, CSV/Excel/Parquet/Feather
 • Native Installer (RPM/DEB) and Cloud Neutral: Amazon/Microsoft/Google
 • Kaggle Grandmaster Recipes for Time-Series
 • Automatic Documentation
 • Deep Learning TensorFlow Models (CPU/GPU)
 • Standalone Java Scoring Pipeline (MOJO)
 • Deep Learning for NLP / Text (CPU/GPU)
 • LightGBM Models (CPU/GPU)
 • Improved Time-Series Recipes (Multiple Windows, MLI for Time-Series Local Feature Brain
 • Improved Scalability, FTRL Models, Model Diagnostics, Data Splitting, Retrain Final Model, etc.
 • C++ Scoring Pipeline (Runtime for MOJO), with Python and R bindings
 • Improved Time-Series Recipes (backtesting, test-time augmentation, single time-series)
 • Project Workspace
 • Bring Your Own Recipe (Transformers, Models, Scorers) - Custom Python Code
 • Data Augmentation
 • Model Monitoring
 • R client API
 • Multi-Node and Multi-User Deployment
  10. MLI - Machine Learning Interpretation. Gain confidence in models before deploying them! Shapley values, partial dependence, ICE, original and transformed features.
  11. Automatic Visualization. Scalable outlier detection (no sampling). Contains novel statistical algorithms to only show “relevant” aspects of the data (soon: actionable recipes and interactive visualization).
  12. Secret Sauce: 1) Grandmaster Feature Engineering. Numerical/Categorical Interactions, Target Encoding, Clustering, Dimensionality Reduction, Weight of Evidence, etc. Time-Series: lags and historical aggregates with causality constraints.
  13. Secret Sauce: 2) Grandmaster Pipeline Tuning + Validation. 19,000 features tested; 1,000 models trained; 1 final optimal scoring pipeline. Reliable generalization estimates (overfitting avoidance); evolutionary strategies; massively parallel processing (multi-CPU, multi-GPU). Example: Driverless AI, BNP Paribas, on a 3-GPU workstation. DOI: 10.1126/science.aaa9375. MTV
  14. Secret Sauce: 3) Statistical Learning & Deep Learning. Typically better for structured data (CSV, SQL, Transactional): GLM/CART/RF/GBM/XGBoost, K-Means/PCA/SVD. Typically better for unstructured data (Images, Video, Audio, Text): TensorFlow Deep Learning.
  15. Time Series in Driverless AI. Automatic Selection or Manual Control for: • Forecast Horizon • Gap between Training and Production. [Figure: timeline of periods 1 to 12 with Gap=1 and Forecast Horizon=2, marking invalid lag sizes (no information available) and valid lag sizes (information available), with train, valid and test segments separated by gaps.]
  16. Text / Natural Language Processing in Driverless AI. Now also CharCNN and Bi-GRU LSTM, and custom embeddings!
  17. 1.7.0: BYOR (Bring Your Own Recipe)!
  18. Open-Source Recipes - Makers Gonna Make! Bring Your Own Recipe!
  19. Bring Your Own Recipes At Full Speed! BYOR is a first-class citizen: native integration, no performance penalty, no memory overhead, no restrictions, even MOJOs possible. Dev API = BYOR API
  20. With Freedom Comes Responsibility. Now some of the responsibility is with the creator and user of the Recipe. Example: the user disables all but 3 specific custom transformers, {MyLog, MyRound, MyRandom}, and Identity for numerical columns. Features like log(EDUCATION) will show up, even though there is no statistical benefit (same signal:noise as EDUCATION). Solution: DAI needs more statistical checks (WIP).
  21. AutoDoc - Automatic Documentation of Experiments. Full transparency into the automation process: validation scheme, model tuning, feature selection, ensembling, metrics, diagnostics. Includes custom recipes; fully editable/customizable Word document.
  23. Live Demo
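Slide 15 distinguishes valid from invalid lag sizes for a given gap and forecast horizon. A minimal sketch of one way to read that rule follows. It assumes (this is an interpretation, not a statement of Driverless AI's exact logic) that the last training period is T, the test periods are T+gap+1 through T+gap+horizon, and a lag is valid only if it points at or before T for every test period, which requires lag >= gap + horizon. The helper names are hypothetical.

```python
def valid_lags(candidate_lags, gap, forecast_horizon):
    """Keep lags that only reference periods at or before the end of training.

    Assumes test periods T+gap+1 .. T+gap+forecast_horizon, where T is the
    last training period; the furthest test period then needs lag >= gap + horizon.
    """
    min_lag = gap + forecast_horizon
    return [lag for lag in candidate_lags if lag >= min_lag]


def lag_feature(series, lag):
    """Shift a series by `lag` periods, padding the start with None."""
    n = len(series)
    return [None] * min(lag, n) + list(series[: max(n - lag, 0)])


lags = valid_lags(range(1, 13), gap=1, forecast_horizon=2)
shifted = lag_feature([10, 20, 30, 40], lag=3)
```

With Gap=1 and Forecast Horizon=2, as on the slide, this reading rejects lags 1 and 2 and keeps lags 3 through 12.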