Automated ML is an approach that minimizes the need for data science effort by enabling domain experts to build ML models without deep knowledge of algorithms, mathematics or programming. End users simply provide data, and the system automatically determines how to perform the particular ML task. At first this may sound discouraging to those aiming at the “sexiest job of the 21st century” - the data scientist. However, AutoML should be seen as the democratization of ML, rather than automated data science.
In this session we will talk about how AutoML works, how it is implemented by Microsoft, and how it can improve the productivity of even professional data scientists.
3. • Software Architect @
o 17+ years professional experience
• Microsoft Azure MVP
• External Expert Horizon 2020, Eurostars-Eureka
• External Expert InnoFund Denmark, RIF Cyprus
• Business Interests
o Web Development, SOA, Integration
o IoT, Machine Learning, Computer Intelligence
o Security & Performance Optimization
• Contact
ivelin.andreev@icb.bg
www.linkedin.com/in/ivelin
www.slideshare.net/ivoandreev
About me
4. Contents
1. Machine Learning Workflow
2. Visual Interface for Azure ML Service
3. Automated ML
4. Advanced ML with Azure Monitor
5. Deep Learning with Tensorflow
6. AI Ops
7. Cognitive Vision Services
8. Insights with Text Analytics and Vision
9. Cognitive Decision Service
10. Cognitive Search Service
11. Version Control for ML
12. VS Code for Python ML
13. Bot Framework
14. Search Bots with Cognitive Services
15. Bot Architecture Best Practices
16. AI and Cognitive Services in Power BI
17. Form Processing with AI Builder
6. ML is a Process
• Iterative data science process:
o Business problem understanding
o Data collection, cleaning, exploration
o Model building
o Performance evaluation
o Deployment
• Auto ML: Automate environment, data preparation, experimentation, deployment
7. AutoML is not Auto Data Science
• Any ML Task = {data} + {problem type} + {loss function}
• ML project effort and budget
o 80% data preparation, 15% modeling and evaluation
o Repetitive effort (react to changes in objectives and data)
• AutoML as a tool
o A recommender system for ML pipelines that achieves accuracy in less time
• Objective
o Offload data scientists from repetitive tasks
o Automatically find a solution on the data with minimal loss
8. AutoML fills the gap between “supply” and “demand” on the ML market
AutoML outperforms an average Data Scientist
9. Auto ML Builds ML Pipelines
User Input: Dataset, Performance goals, Constraints (CPU, RAM, time)
Auto ML Magic
Results: Automatically determine a pipeline structure with minimal loss on the
validation set within CPU/Memory constraints
Auto ML Steps
1. Determine pipeline structure
2. Select algorithm for each step
3. Tune hyper-parameters
Performance Evaluation
• All 3 steps must be completed
• Iterate until the performance goals are reached
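The three AutoML steps above can be sketched as a toy search loop. This is a hypothetical simplification using plain scikit-learn (real AutoML uses meta-learning and far smarter search): sample a pipeline structure and algorithm, tune a hyper-parameter randomly, and keep the configuration with the lowest validation loss.

```python
# Toy sketch of the 3 AutoML steps: pick a pipeline structure, pick an
# algorithm, tune hyper-parameters -- here via naive random search.
import random
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
random.seed(0)

# Steps 1 + 2: candidate pipeline structures with an algorithm per step
candidates = [
    lambda: Pipeline([("scale", StandardScaler()),
                      ("clf", LogisticRegression(C=10 ** random.uniform(-2, 2),
                                                 max_iter=500))]),
    lambda: Pipeline([("clf", RandomForestClassifier(
        n_estimators=random.randint(10, 200), random_state=0))]),
]

# Step 3: sample configurations, keep the best validation score
best_score, best_pipe = -1.0, None
for _ in range(5):
    pipe = random.choice(candidates)()          # hyper-params drawn here
    score = cross_val_score(pipe, X, y, cv=3).mean()
    if score > best_score:
        best_score, best_pipe = score, pipe

print(f"best CV accuracy: {best_score:.3f}")
```

In a real AutoML system the loop also respects CPU/memory/time constraints and stops once the performance goal is met.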
10. ML Pipeline Steps
An ML pipeline is a technical solution that stitches the ML phases together and automates the workflow
• Data
o Select preprocessing strategy (imbalanced and missing data, normalization, outliers)
o Features (feature extraction, engineering, selection)
• Modeling
o Select algorithm
o Tune hyperparameters (e.g. number of trees)
o Train multiple models, create ensemble
o Score, evaluate, select the best model
• Training & Deployment
o Parallel training on a cluster, Maintain versioning
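The data and modeling steps above map directly onto scikit-learn's `Pipeline` abstraction, a minimal stand-in for an Azure ML pipeline: preprocessing (imputation, normalization) is stitched to a model and the whole chain is trained and scored as one unit.

```python
# A pipeline stitching preprocessing and modeling into one trainable unit
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X = X.copy()
X[::15, 0] = np.nan  # simulate missing data

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # data step: missing values
    ("scale", StandardScaler()),                  # data step: normalization
    ("model", LogisticRegression(max_iter=500)),  # modeling step
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipe.fit(X_tr, y_tr)                              # one call runs every step
acc = pipe.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```

Because the pipeline is a single object, it can be versioned, reused as a template, and trained unattended, which is exactly the list of benefits on the next slide.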
11. ML Pipeline Benefits
• Advantages of ML Pipelines
o Parallel and unattended execution
o Reusability through pipeline templates for specific scenarios
o Versioning data and results using pipeline SDK
o Modularity separating areas of concern
o Collaboration among data scientists across ML design process
o Scalability – single ML pipeline can be trained on multiple machines;
different ML pipelines can be tested in parallel on many nodes
• Open Issue
How do pipelines “learn” what to do?
12. “No free lunch” theorem simplified
(David Wolpert, 1996)
1. Model is simplification of reality
2. Simplification is based on bias
3. Bias fails in some situations
Conclusion 1: No algorithm or parameter set is always the best.
Conclusion 2: Use knowledge about the data and context.
13. Automated Data Preparation
Step 1: Data Ingestion
• Requires data storage (Azure Blob mounted by default)
• Data quality issues are common (missing data, mixed units and formats)
• Evaluate quality, select initial features (statistical analysis and visualization)
Rule of Thumb: No algorithm can achieve good results with bad input data
Step 2: Data profiling and cleansing
• AutoML provides a variety of statistics to verify dataset is ready for modelling
o Non-numeric (Min, Max, Count)
o Numeric (Mean, StdDev, Variance, Distribution histogram)
• Cleansing cannot be done in the GUI
o Python SDK: azureml.dataprep
o Or turn on the “Automatic preprocessing” option
14. Auto ML Guardrails
What it is: safeguards that protect users against common data issues and make corrections
Missing Values
• Strategies: Drop rows; intelligently replace missing values based on other data
Class Imbalances
• Most ML algorithms assume equal distribution, majority classes add more bias
• Strategies: Oversampling (add instances to minority class); Undersampling (majority)
Data Leakage
• Dataset includes information that would not be available at time of prediction
• The actual outcome is effectively already known, so model performance looks deceptively perfect
• Strategies: Remove leaky features; Add noise; Hold back unseen test data
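Two of the guardrail strategies above, sketched with plain pandas/scikit-learn as stand-ins for what AutoML applies automatically: mean imputation for missing values and oversampling of the minority class.

```python
# Guardrail sketch: impute missing values, then oversample the minority class
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "amount": [10.0, None, 12.0, 11.0, 200.0, None],
    "label":  [0, 0, 0, 0, 1, 1],   # imbalanced: 4 vs 2
})

# Missing values: replace with the column mean (an "intelligent" replacement
# would instead model the value from the other columns)
df["amount"] = df["amount"].fillna(df["amount"].mean())

# Class imbalance: oversample the minority class up to the majority size
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])

print(balanced["label"].value_counts().to_dict())
```

Undersampling is the mirror image: draw a subset of the majority class down to the minority size instead.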
15. Automated Data Preparation
Step 3: Feature Engineering
• Impute missing values (mode for categorical, mean for numerical)
• Create categorical features from numeric columns with low diversity
• YYYY, MM, dd, HH, mm, ss, day of week, day of year, quarter, week number from dates
• One-hot encode low-cardinality categorical variables (e.g. Gender -> IsMale, IsFemale)
• K-means clustering on each numeric column for a distance-to-centroid feature
• Term frequency for text variables
• Outlier treatment
Note: These general-purpose steps are not domain specific (e.g. no income/debt ratio feature)
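Two of the feature-engineering steps above, sketched with pandas: decomposing a date column into its parts and one-hot encoding a low-cardinality categorical variable.

```python
# Feature-engineering sketch: date decomposition + one-hot encoding
import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2019-07-01 10:30", "2019-12-24 18:00"]),
    "gender": ["M", "F"],
})

# Date parts: year, month, day of week, quarter, week number
df["year"] = df["when"].dt.year
df["month"] = df["when"].dt.month
df["dow"] = df["when"].dt.dayofweek
df["quarter"] = df["when"].dt.quarter
df["week"] = df["when"].dt.isocalendar().week

# One-hot encode the low-cardinality categorical (gender -> Is_M, Is_F)
df = pd.concat([df, pd.get_dummies(df["gender"], prefix="Is")], axis=1)

print(sorted(df.columns))
```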
16. Automated Data Preparation
Step 3 may have just gotten you into a problem
• Feature engineering can generate too many features
• The solution needs to avoid overfitting and reduce model training time
• No domain knowledge was applied
Step 4: Feature Selection (limited in AutoML)
• Drop high cardinality variables (noise)
• Drop no variance variables (non-informative)
Possible future improvements
• Drop highly correlated fields
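The selection rules above (drop high-cardinality identifiers, zero-variance columns, and the possible improvement of dropping highly correlated pairs) can be sketched in a few lines of pandas. The columns and the 0.95 correlation threshold are illustrative choices, not AutoML's actual rules.

```python
# Feature-selection sketch: high cardinality, zero variance, high correlation
import pandas as pd

df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],  # high cardinality: unique per row
    "const": [1, 1, 1, 1],                # zero variance: non-informative
    "km": [1.0, 2.0, 3.0, 4.0],
    "miles": [0.62, 1.24, 1.86, 2.49],    # near-perfectly correlated with km
    "age": [25, 60, 30, 51],
})

# Drop unique-per-row text columns (noise) and constant columns
drop = [c for c in df.columns
        if (df[c].nunique() == len(df) and df[c].dtype == object)
        or df[c].nunique() == 1]

# Drop one of each highly correlated numeric pair (|r| > 0.95)
num = df.drop(columns=drop).select_dtypes("number")
corr = num.corr().abs()
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.loc[a, b] > 0.95 and b not in drop:
            drop.append(b)

selected = df.drop(columns=drop)
print(sorted(selected.columns))
```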
17. Algorithm Selection and Hyperparametrization
Challenges of Configuration Space
• High-dimensionality (multiple continuous, categorical, binary variables)
• Conditionality (some parameter values are relevant in combination)
• No Gradient (loss function has no gradient, expensive evaluation)
Opt1: Grid Search / Brute Force
• Cartesian product of hyperparameter combinations
• The simplest method; suffers from the curse of dimensionality
Opt2: Random Search
• Random configurations within a certain budget
• Good baseline, no assumptions, easy parallelization
Opt3: Bayesian Optimization
• Builds a model of the loss function to select promising configurations
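The first two strategies have direct scikit-learn implementations; this sketch contrasts them on a toy problem. Grid search enumerates the full Cartesian product, while random search samples a fixed budget of configurations from a distribution.

```python
# Opt1 vs Opt2: grid search enumerates, random search samples within a budget
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=500)

# Opt1: grid search -- all 4 listed configurations are evaluated
grid = GridSearchCV(model, {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3).fit(X, y)

# Opt2: random search -- only n_iter=3 configurations, drawn log-uniformly
rand = RandomizedSearchCV(model, {"C": loguniform(1e-3, 1e2)},
                          n_iter=3, cv=3, random_state=0).fit(X, y)

print("grid:", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

With many hyperparameters the grid explodes combinatorially, while random search's budget stays fixed, which is why it makes a good baseline.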
18. Meta Learning in AutoML
Challenges
• Avoid starting from scratch on new ML tasks
• Learn from experience, efficiently and in systematic data-driven way
Prerequisite
• Collect meta-data to describe previous tasks (parameters, pipeline structure, evaluations)
Result
• Meta-learner to recommend promising configurations w/o exhaustive search
Notes
• If two datasets give similar results on a few pipelines => expect similar results on the remaining pipelines
• Operates similarly to recommender systems
• Privacy: AML has no need to access customer data, only pipeline results
19. Cross-Validation and Ensembling
Cross Validation
• Divide the training data into k subsets
• Repeat k times: train on k-1 subsets, validate on the held-out subset ki
• Average the error estimate across the k runs
Ensembling (bagging, boosting, stacking)
• Combine a few of the best ML models for improved accuracy at little extra cost
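Both ideas above can be sketched with scikit-learn: k-fold cross-validation via `cross_val_score`, and a simple majority-vote ensemble via `VotingClassifier` (one of the simplest ensembling schemes, alongside bagging, boosting and stacking).

```python
# Cross-validation and a voting ensemble, sketched with sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# k=5 cross-validation: train on 4 folds, validate on the held-out fold,
# then average the 5 error estimates
lr = LogisticRegression(max_iter=2000)
cv_scores = cross_val_score(lr, X, y, cv=5)
print(f"mean CV accuracy: {cv_scores.mean():.3f}")

# Ensembling: combine a few models by majority vote
ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=2000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
])
ens_scores = cross_val_score(ensemble, X, y, cv=5)
print(f"ensemble CV accuracy: {ens_scores.mean():.3f}")
```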
21. Azure ML Designer vs Azure ML Studio
• ML Studio – collaborative drag-drop workspace to build, test and deploy ML
• Azure ML – designer, SDK and CLI for data prep., train and deploy ML at scale
Feature            | Azure ML Designer                  | ML Studio (Classic)
Availability       | Preview (2019)                     | Generally available (2015)
Drag-drop UI       | Yes                                | Yes
Scalability        | With compute target                | Up to 10 GB training data limit
Modules            | Key modules only                   | Many
Compute            | AML compute, CPU/GPU               | Proprietary compute, CPU only
ML Pipeline        | Authoring, publishing              | N/A
ML Ops             | Flexible deployment and versioning | Basic management and deployment
Model portability  | Portable                           | Proprietary, non-portable
Auto ML            | Through SDK                        | N/A
22. Azure ML
What is: cloud-based environment to rapidly build and deploy machine learning
models, by auto-scaling powerful CPU or GPU clusters
How to:
1. Four development environments for AML: cloud-based notebook VM (easiest),
local (with Azure subscription), Data Science VM, and Azure Databricks
2. Create a workspace (Python SDK or Azure Portal)
3. Use the azureml.dataprep Python package to explore, cleanse and transform
4. Choose a training target (local PC, Azure Linux VM, HDInsight for Spark)
5. azureml.train recommends a pipeline based on target metrics
6. Register models for tagging, search and deployment (even models trained outside AML)
7. Deploy to Azure Container Instances (serverless containers)
23. Interpreting Learning Results (Classification)
• Confusion Matrix
o Rows – true class, Columns – predicted class
o Good model = most values along the diagonal
• Precision-Recall Chart
o Precision = TP / (TP + FP), ability to label correctly
o Recall = TP / (TP + FN), ability to find all instances
o Macro Average PR – independent PR average
o Micro Average PR – weighted PR average (imbalanced)
o The PR chart is drawn at different threshold values
• ROC Chart – TP Rate / FP Rate over different thresholds
FPR = FP / (FP + TN) (best is close to 0), TPR = TP / (TP + FN) (best is close to 1)
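The formulas above computed directly, from a toy set of true labels and predictions, and checked against scikit-learn's built-in metrics.

```python
# Classification metrics from the confusion matrix, on toy data
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# Rows = true class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # ability to label correctly
recall = tp / (tp + fn)      # ability to find all positive instances
fpr = fp / (fp + tn)         # x-axis of the ROC chart

assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} FPR={fpr:.2f}")
```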
24. Lift, Gain and Calibration Charts
• Lift Chart – How many times the model is better than random
o Ratio of gain%/random expectation% at a given decile level
o Green line – baseline random guess
• Gain Chart – how much to sample to get target sensitivity (TPR)
o X – percentile addressed, Y - portion positive responses
o Green line - baseline random guess
• Calibration Chart
o Confidence of a predictive model
o Predicted vs actual probability
o Good model: y=x
o Overly confident: y=0 and y=1
Note: perfectly calibrated classifier != perfect classifier
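The calibration chart above can be produced with scikit-learn's `calibration_curve`: bin the predicted probabilities and compare each bin's mean prediction to the actual fraction of positives (a well-calibrated model lies near y = x). The dataset and bin count here are illustrative.

```python
# Calibration-chart sketch: predicted vs actual probability per bin
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# frac_pos: actual probability per bin; mean_pred: predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=5)
for p, a in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> actual {a:.2f}")
```

Plotting `mean_pred` against `frac_pos` and comparing to the diagonal gives the chart described on the slide.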
25. Containers meet Machine Learning
• Steps: (from Portal or AML SDK management API)
o Add model (from local workspace or upload model)
o Add driver script
o Add package dependency file (YML)
o The system creates a Docker image and registers it in the Workspace
• Deployment
o Azure Container Instance (ACI) - test, Azure Kubernetes Service (AKS) - prod
o Azure ML Compute, Azure IoT Edge
• Operationalization
o REST API is created automatically
26. Operationalization
• REST APIs
o Deploying an AML model as a web service creates single and batch REST APIs
o APIs are consumed via azureml.core.webservice
• Performance Degradation
o Real-life performance may differ from performance during training
o Data drift - change in characteristics of input data over time
• Monitoring and Drift Analysis
o Input data changes over time and leads to performance degradation
o Configure inference data to snapshot and profile against baseline
o ML model trained to detect differences
o Model performance converted to drift coefficient
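The "ML model trained to detect differences" idea above can be sketched as follows: train a classifier to distinguish the baseline snapshot from current inference data. If it succeeds (AUC well above 0.5), the input distribution has drifted. The synthetic data and the drift-coefficient formula here are illustrative, not Azure ML's actual implementation.

```python
# Drift-detection sketch: classify baseline vs current data; AUC => drift
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=(500, 3))  # training-time snapshot
current = rng.normal(0.8, 1.0, size=(500, 3))   # shifted inference data

X = np.vstack([baseline, current])
y = np.array([0] * 500 + [1] * 500)             # 0 = baseline, 1 = current

auc = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                      X, y, cv=3, scoring="roc_auc").mean()

# Illustrative rescaling: AUC 0.5 (indistinguishable) -> 0, AUC 1.0 -> 1
drift_coefficient = max(0.0, 2 * auc - 1)
print(f"AUC={auc:.2f} drift={drift_coefficient:.2f}")
```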
27. Takeaways
• Books
o AI MVP Book: Automated Machine Learning
https://www.amazon.com/gp/aw/d/B082P5MK8Y
o Practical Automated ML on Azure
• The No Free Lunch Theorem
https://www.kdnuggets.com/2019/09/no-free-lunch-data-science.html
• Azure ML Studio vs Azure ML Services designer
https://www.codit.eu/blog/azure-machine-learning-studio-vs-services/
https://docs.microsoft.com/en-us/azure/machine-learning/compare-azure-ml-to-studio-classic
• Bayes Theorem
https://towardsdatascience.com/understanding-bayes-theorem-7e31b8434d4b