Introduction to Machine Learning with Azure & Databricks

CCG:
Upcoming Workshops
Data Modernization in a Day | March 30th | 9:00 AM – 12:00 PM EST
• This workshop will cover everything from whiteboarding migration strategies to hands-on experiences with data migration tools.
Analytics in a Day Ft. Synapse Workshop | April 20th | 9:00 AM – 1:00 PM EST
• Learn how to simplify and accelerate your journey towards the modern data warehouse.
Data Governance Workshop with CCG + Profisee | May 4th | 9:00 AM – 12:00 PM EST
• Learn how leveraging an MDM strategy within the context of Data Governance drives organizational alignment, ensures data quality, and accelerates Digital Transformation.
Read more and register at ccganalytics.com/events
Follow us on LinkedIn @CCGAnalytics to stay up to date on events

PLEASE POST QUESTIONS IN THE CHAT!
PLEASE MUTE YOUR LINE WHEN NOT SPEAKING!
CCG WILL CONTROL MUTING AND UNMUTING.
THIS SESSION WILL BE RECORDED.
WE WILL SHARE SLIDES WITH YOU.
TO MAKE PRESENTATION LARGER, DRAG THE BOTTOM HALF OF SCREEN ‘UP’
Housekeeping

Intro to Machine Learning
with Azure & Databricks

CCG is a full-service cloud analytics provider.
Strategy and Governance
• Data Governance Solution
• Data Privacy Solution
• Strategy Roadmap Solution
Services
• Health Assessments
• Roadmaps
• Data Governance
• Data Privacy
• Master Data Management with Profisee
• Metadata Management
Analytics and Insights
• Customer Intelligence Solution
• Visualization & Reporting Solutions
Services
• Dashboards & Visualizations
• Operational Reporting
• Data Exploration
• Customer Insights
• Marketing Analytics
• Power BI
• D365 Customer Insights
AI and Data Science
• Machine Learning Solution
• Model As A Service
Services
• Prescriptive Analytics
• Azure Cognitive Services
• Natural Language Processing
• Computer Vision / Image
• ML Ops
• Data Mining
• Data Science Enablement
• Data Science Roadmap
Data and Infrastructure
• Platform Modernization Solution
• Cloud Migration and Management
Services
• DR/BC
• Security
• Azure Governance
• Data Warehousing
• Data Integration
• Data Architecture
• PowerApps
• Synapse DW

Brian Beesley
Data Science Practice Director
At CCG since 2017, Brian has led Machine Learning and AI initiatives with a
broad range of clients from numerous industries, including financial
services, healthcare, industrial goods, retail, professional services,
construction, and entertainment.
Spent 2012-2017 mostly focused on large banks and insurance clients on
both regulatory compliance and discretionary projects doing:
– Model Process & Governance
– Data Aggregation & Reporting
– Data Management & Governance
– Business Analysis
– Program Management & Org Change Management
Master’s Cert in Financial Services Analytics from Stevens Institute of
Technology ’16
Bachelor’s in Business Economics and Public Policy from Indiana University Kelley School of Business ’09
TODAY’S SPEAKER

Machine Learning 101 | 9:00-9:45
Azure Machine Learning Overview & Demo | 9:45-10:30
Databricks Overview | 10:30-11:00
Current section: Machine Learning 101

An introduction to Machine Learning and its uses in business
Machine Learning 101

Why should anyone care about Machine Learning?
What is Machine Learning?
How does Machine Learning work?
Ok, but how does it really work?
How can an organization use Machine Learning?
Agenda

The concepts in Machine Learning are not new.
https://www.quantinsti.com/blog/machine-learning-basics

Though the concepts have been around, Machine Learning has just started
getting buzz in recent years because the barriers to entry are much lower.
• Flood of data and decreasing costs of storage
• Increasing computational power
• Increased attention from researchers
• Growth of open source technologies
• Support from industries

Machine Learning has tons of useful applications you already encounter or hear about every day.
• Analyzing Images
• Understanding Language
• Forming & Executing Strategy
• Personalized Recommendations
• Autonomous Decisions
• Predicting Asset Values

Sales/Marketing
• Price Optimization
• Inventory Forecasting
• Customer Segmentation
• Cross Sell / Upsell / Recommendation Engines
• Customer Churn Predictions
• Customer Lifetime Value
Finance
• Asset Pricing
• Risk Analysis
• Fraud detection
• Market Forecasting
• Anti Money Laundering
Operations
• Inventory Forecasting
• Robotics
• Automated Workflows
• Predictive Maintenance
• Schedule Optimization
• IoT Production Line Monitoring
Service
• Single View of Customer
• Customer Service Analysis
• Chat Bots / Digital Assistants
• Social Media Analysis
• Lead Scoring
Machine Learning doesn’t just have to be the realm of high tech.
There are practical ways to incorporate it across the business.

Machine Learning lends several benefits to enterprise decision support.

Machine Learning is a discipline that supports the data science process.
It is a technique, and its value is in the outputs it drives.
Discipline > Process > Decision > Actions
Discipline
• Statistics: A branch of math for generating descriptions or inferences about a population, often based on samples of the population. Inferences may take the form of “models,” which are equations that approximate the data’s inherent relationships.
• Machine Learning: Combines computer science with math concepts to generate models by rapidly iterating on large datasets.
• Other Analytics Disciplines: High Performance Computing, Data Engineering, Visualization, etc.
Process
• Data Science: A broad process for generating insights that may involve data ingestion from one or many sources (including external data, streaming data, or big data), data processing and cleansing, model generation using either statistical or machine learning approaches, model selection, model deployment and maintenance, and visualization of data.
Decision
• Advanced Analytics: Apply data science to predictive (what will happen?) or prescriptive (what should we do?) business use cases.
• Artificial Intelligence / Cognitive Computing: Apply data science to approximate human intuition and decision making (e.g. strategy, creativity, planning) or human sensory functions (e.g. computer vision, natural language understanding, etc.)
Actions
• Automation / Robotics / Intelligent Devices
• Strategy / Operations

Advanced Analytics can enable predictive and prescriptive uses of data.
Traditional analytics focus on understanding and explaining the data that has been collected.
Advanced Analytics focus on generating new data in the form of predictions or decisions, and going the extra step to automate decision-making when possible.

Simply put, machine learning is the science of making best guesses by
iterative trial and error.

Machine Learning works by using “algorithms” to generate “models”.
A model is a repeatable, data-driven approach to making a best guess.
It does this by formalizing mathematical relationships between data in the form of either:
– Rules (e.g. predict applicants will default on a loan if Credit Score < 700 and Debt to Income Ratio > 30%)
– Or an equation (e.g. predict Home Price = 100*Square Footage + 2*Average Income in the Area)
Note that this is different from other types of models, like operating models or data models
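The two forms of model described above can be written directly as code. This is a minimal sketch that reuses the example numbers from the slide. The function names are made up for illustration, and treating Debt to Income Ratio as a fraction (0.30 for 30%) is an assumption.

```python
def predict_default(credit_score, debt_to_income):
    """Rule-style model: predict a default when both conditions hold.

    debt_to_income is a fraction (0.30 means 30%), an assumption of this sketch.
    """
    return credit_score < 700 and debt_to_income > 0.30


def predict_home_price(square_footage, average_income):
    """Equation-style model using the example coefficients from the slide."""
    return 100 * square_footage + 2 * average_income


print(predict_default(credit_score=650, debt_to_income=0.35))          # True
print(predict_home_price(square_footage=2000, average_income=50000))   # 300000
```

In both cases the model is just a repeatable way to turn inputs into a best guess. What machine learning adds is that the thresholds and coefficients are found from data rather than typed in by hand.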
Figure: three kinds of models (Statistical Model, Data Model, Operating Model). The Operating Model diagram shows People, Process, Technology, and Data, linked by Guide, Support, and Enable.
What’s a model?
Data
• Prior month sales: $4MM
• 2 months prior: $3MM
Program / Model
This month’s sales = (prior month + 2 months prior + 3 months prior) / 3
Answer
This month’s sales = $3MM?
In the past we’ve told computers how to use data to answer our questions.

But we’ve found that if we give the machine historic facts, we can let it find the right program / model to plug in for future answers.
Historic facts: pairs of Data and Answer (for example, Answer: Last month’s sales: $2MM)
Program / Model
This month’s sales = 1/8 * Prior month + 1/3 * 2 months prior + 1/4 * 3 months prior
What’s a model?

Once we have our machine-defined program, we can use it with new data to make better predictions.
Historic facts: pairs of Data and Answer
Program / Model
This month’s sales = 1/8 * Prior month + 1/3 * 2 months prior + 1/4 * 3 months prior
New Data
Answer
This month’s sales = $5MM
What’s a model?
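The contrast between a hand-written program and a machine-defined one can be sketched in a few lines. The example below is an illustration under assumptions: the historic rows are invented, the answers are generated from the weights shown on the slide, and ordinary least squares (solved with plain Gaussian elimination) stands in for whatever fitting method a real tool would use. The value of $2MM for "3 months prior" is also an assumption, chosen because it makes the hand-written average come out to $3MM.

```python
def hand_written_model(prior_1, prior_2, prior_3):
    """The program a person wrote: a simple three-month average."""
    return (prior_1 + prior_2 + prior_3) / 3


def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_weights(rows, answers):
    """Least squares: find the weights that best map each data row to its answer."""
    n = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(rows, answers)) for i in range(n)]
    return solve_linear_system(xtx, xty)


def predict(weights, row):
    """Apply the machine-defined program to one row of data."""
    return sum(w * x for w, x in zip(weights, row))


# Hypothetical history: (prior month, 2 months prior, 3 months prior) in $MM.
rows = [(4.0, 3.0, 2.0), (3.0, 2.0, 5.0), (2.0, 5.0, 4.0), (5.0, 4.0, 6.0)]
slide_weights = (1 / 8, 1 / 3, 1 / 4)
answers = [predict(slide_weights, row) for row in rows]  # generated, not real sales

learned = fit_weights(rows, answers)
print(hand_written_model(4.0, 3.0, 2.0))  # the person's program
print(learned)                            # the machine's program
print(predict(learned, (6.0, 4.0, 8.0)))  # new data in, new answer out
```

Because the invented answers follow the slide's weights exactly, the fit recovers them. With real sales history the learned weights would only approximate the underlying pattern.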

What is an algorithm?
The word algorithm gets used a lot, but it isn’t always defined.
A defined set of steps for solving a problem
Often involves repeating steps
In Machine Learning, it may or may not have an ending condition. Some common ending conditions are:
– The problem is solved to our satisfaction
• For example – stop when the last 4 iterations have been 95% accurate or better
– The problem hasn’t been solved but we don’t seem to be getting any closer to solving it
• For example – stop if the last 10 iterations have not seen any improvement in accuracy
– The process has run for a long time
• For example – stop after the program has run for 12 hours, regardless of whether progress is still being made
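The three ending conditions above can be combined in one loop. This is a minimal sketch, not a prescribed implementation: `step` is a hypothetical function that runs one iteration of training and returns the current accuracy, and the default thresholds simply reuse the examples from the slide.

```python
import time


def run_until_done(step, target=0.95, streak=4, patience=10, max_seconds=12 * 3600):
    """Call step() repeatedly and stop on the first ending condition that is met."""
    start = time.monotonic()
    history = []
    best = float("-inf")
    rounds_without_improvement = 0
    while True:
        accuracy = step()
        history.append(accuracy)
        if accuracy > best:
            best = accuracy
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
        # 1. The problem is solved to our satisfaction.
        if len(history) >= streak and all(a >= target for a in history[-streak:]):
            return "satisfied", history
        # 2. We don't seem to be getting any closer.
        if rounds_without_improvement >= patience:
            return "no improvement", history
        # 3. The process has run for a long time.
        if time.monotonic() - start > max_seconds:
            return "time limit", history


# Usage with a scripted sequence of accuracies standing in for real training.
scripted = iter([0.50, 0.90, 0.96, 0.97, 0.95, 0.99])
reason, history = run_until_done(lambda: next(scripted))
print(reason, len(history))
```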

Collect the data and randomly create initial decision rules.
Design a method for measurably evaluating how good or bad your hypothesis is.
Update your hypothesis in a way that marginally improves the performance of your decision rules.
Continue this process until either you are satisfied with the results, or your hypothesis can’t improve
anymore with the data available.
What is an algorithm?
Create a hypothesis → Evaluate the hypothesis → Adjust the hypothesis → Repeat until convergence
Almost all machine learning algorithms follow the same general pattern.
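The create / evaluate / adjust / repeat pattern can be shown with the smallest possible model: a single weight in `y = weight * x`. This sketch uses gradient descent as the "adjust" step, which is one common choice among many. The data, learning rate, and tolerance are assumptions, and the starting hypothesis is fixed at zero instead of random so the result is reproducible.

```python
def mean_squared_error(weight, xs, ys):
    """Evaluate the hypothesis: how far off are the guesses, on average?"""
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_by_trial_and_error(xs, ys, learning_rate=0.05, tolerance=1e-12, max_rounds=1000):
    weight = 0.0                                 # create a hypothesis
    error = mean_squared_error(weight, xs, ys)   # evaluate the hypothesis
    for _ in range(max_rounds):
        # Adjust the hypothesis a little in the direction that reduces the error.
        gradient = sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        candidate = weight - learning_rate * gradient
        candidate_error = mean_squared_error(candidate, xs, ys)
        # Repeat until convergence: stop when the error no longer improves.
        if error - candidate_error < tolerance:
            break
        weight, error = candidate, candidate_error
    return weight


# Hypothetical data where the true relationship is y = 2 * x.
weight = fit_by_trial_and_error([1, 2, 3, 4], [2, 4, 6, 8])
print(round(weight, 3))
```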

There are two main families of algorithms to choose from.
Supervised Learning
We know the “right answers” for some of the scenarios.
– We may have history we can look back on
– We may be hoping to replicate human decision making
Examples:
• Predict our profits next quarter.
• Identify the number written on a check.
• Predict a user’s rating for a given product.
Unsupervised Learning
There aren’t necessarily “right answers,” we just want to get a better understanding of our data.
Examples:
• Group our customers into segments.
• Find the most important variables in a dataset.
• Identify credit card transactions that are out of the ordinary.
Image credit: Gowthamy Vaseekaran via Medium.com, available at https://medium.com/@gowthamy/machine-learning-supervised-learning-vs-unsupervised-learning-f1658e12a780

Now let’s walk through two of the most popular machine learning approaches
and discuss how the algorithms are applied.
Classification Clustering

Use classification when you want to guess a non-numeric value, like a yes/no answer. We will take a decision tree approach.
Create a hypothesis: Everyone will repay their loan.
Data: 20 outstanding loans

Evaluate the hypothesis: Calculate accuracy as the % of predictions that are correct based on your current set of rules.
12 repaid, 8 defaulted
Accuracy = 12/20 = 60%

Adjust the hypothesis: Find the next branch by looking for the data split that would have the biggest impact on the purity of each node. There are several ways to do this mathematically (Gini Index, Information Gain, Chi-Square).
Candidate splits:
• Income > 60k vs. Income < 60k
• Credit Score > 700 vs. Credit Score < 700
• DTI > 40% vs. DTI < 40%
Branch accuracies shown on the slide:
• 70% and 50% (60% weighted)
• 71% and 53% (59% weighted)
• 80% and 73% (75% weighted)
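Choosing the next branch comes down to scoring each candidate split and keeping the best one. The sketch below scores splits two ways: by weighted accuracy (predict the majority outcome in each branch) and by weighted Gini impurity. The ten loans are invented for illustration and are not the 20 loans from the slide, so the percentages differ. The field names and thresholds are assumptions too.

```python
FIELDS = ("income_k", "credit_score", "dti_pct", "repaid")
RAW_LOANS = [  # hypothetical loans: 6 repaid, 4 defaulted
    (80, 720, 25, True), (75, 710, 30, True), (65, 650, 20, True),
    (90, 690, 35, True), (50, 730, 30, True), (45, 705, 44, True),
    (70, 640, 45, False), (40, 660, 50, False), (55, 715, 55, False),
    (35, 600, 42, False),
]
LOANS = [dict(zip(FIELDS, row)) for row in RAW_LOANS]


def majority_accuracy(labels):
    """Accuracy if we predict the most common outcome for everyone in the node."""
    if not labels:
        return 0.0
    repaid = sum(labels)
    return max(repaid, len(labels) - repaid) / len(labels)


def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for an even two-class mix."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1 - p) * (1 - p)


def split_labels(loans, feature, threshold):
    """Outcomes for the two branches of a split: above and at-or-below the threshold."""
    above = [loan["repaid"] for loan in loans if loan[feature] > threshold]
    below = [loan["repaid"] for loan in loans if loan[feature] <= threshold]
    return above, below


def weighted(score, groups):
    """Average a node score across branches, weighted by branch size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) * score(g) for g in groups) / total


CANDIDATES = [("income_k", 60), ("credit_score", 700), ("dti_pct", 40)]
best_split = max(
    CANDIDATES,
    key=lambda s: weighted(majority_accuracy, split_labels(LOANS, *s)),
)
print("baseline accuracy:", majority_accuracy([loan["repaid"] for loan in LOANS]))
for feature, threshold in CANDIDATES:
    groups = split_labels(LOANS, feature, threshold)
    print(feature, weighted(majority_accuracy, groups), weighted(gini, groups))
print("best split:", best_split)
```

Higher weighted accuracy and lower weighted Gini both point to purer branches. On this toy data they agree on the same split, which is not guaranteed in general.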

Repeat until convergence: Repeat the process for each of your new “leaf” nodes. Stop when you reach an acceptable level of accuracy, or when your accuracy begins getting worse with independent data.
Splits in the resulting tree:
• DTI > 40% vs. DTI < 40%
• Credit Score > 700 vs. Credit Score < 700
• Income > $60k vs. Income < $60k
Leaf accuracies shown on the slide: 100%, 50%, 100%, 100% (80% weighted)

Classification is used for lots of problems that copy human intuition.
Think about how you classify information to identify these images!
These use cases are obviously more complex than our simple decision tree.
But with more advanced approaches like convolutional neural networks these pictures can definitely be classified by a machine.

Imagine Marketing has asked you to split these customers into 3 groups. How would you do it?
Use clustering when there’s no “correct” classification, but you still want to assign
individuals to groups. This algorithm is called k-means clustering.

Create a hypothesis: I can segment my customers by assigning them to 3 groups. We’ll set down 3 random “anchors” and assign each customer to its closest anchor.

Evaluate the hypothesis: Move the anchors to the center of each cluster. Count how many customers are actually closer to one of the other anchors.

Adjust the hypothesis: Re-assign each customer to the group corresponding to the center they’re closest to.

Repeat until convergence: Move the anchors again. Continue re-assigning customers and moving the anchors until the anchors stop moving.
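The anchor-moving steps above translate almost line for line into code. This is a minimal k-means sketch in plain Python. The customer coordinates are invented, and the starting anchors are passed in explicitly (instead of being placed at random, as on the slide) so the run is reproducible.

```python
import math


def closest(point, anchors):
    """Index of the anchor nearest to this customer."""
    return min(range(len(anchors)), key=lambda i: math.dist(point, anchors[i]))


def k_means(points, anchors, max_rounds=100):
    anchors = [tuple(a) for a in anchors]
    for _ in range(max_rounds):
        # Assign each customer to its closest anchor.
        assignment = [closest(p, anchors) for p in points]
        # Move each anchor to the center of the customers assigned to it.
        moved = []
        for i, old in enumerate(anchors):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                moved.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                moved.append(old)  # an anchor with no customers stays put
        # Repeat until the anchors stop moving.
        if moved == anchors:
            break
        anchors = moved
    return anchors, [closest(p, anchors) for p in points]


# Hypothetical customers described by two numeric attributes.
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 9), (2, 9), (1, 8)]
anchors, groups = k_means(customers, anchors=[(0, 0), (10, 10), (0, 10)])
print(groups)
print(anchors)
```

With real customer data the result depends on where the anchors start, so it is common to run the algorithm several times from different random starts and keep the tightest grouping.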

This is just the tip of the iceberg.
There are several algorithms available for various types of problems.

Engaging with Machine Learning
Image inspired by Microsoft
Delivering analytics with Machine Learning requires
alignment across people, process, technology, and data.

The sources of data for machine learning can be quite broad.
Data Warehouses
• Curated & governed data
• Big data
• Cloud or on-prem
Data Lakes
• Unstructured & semi-structured data
• Streaming data
• Partially curated
Externally Procured Data
• May be purchased from 3rd party providers
• May be scraped from the web
• May require designing research experiments
Data science teams typically have the programming and data integration skills to use data from anywhere it can be found.

Data scientists combine broad skills to integrate
data, build models, and drive business value.

Let’s look at the Microsoft Team Data Science Process to see how models are built.
Start
Business Understanding
Data Acquisition & Understanding
• Data Source: On-Premises vs Cloud, Database vs Files
• Pipeline: Streaming vs Batch, Low vs High Frequency
• Environment: On-premises vs Cloud, Database vs Data Lake vs…, Small vs Medium vs Big Data
• Wrangling, Exploration & Cleaning: Structured vs Unstructured, Data Validation and Cleanup, Visualization
Modeling
• Feature Engineering: Transform, Binning, Temporal, Text, Image, Feature Selection
• Model Training: Algorithms, Ensemble, Parameter Tuning, Retraining, Model Management
• Model Evaluation: Cross Validation, Model Reporting, A/B Testing
Deployment
• Scoring, Performance monitoring, etc.
• Model Store, Web Services, Intelligent Applications
Customer Acceptance
End

The outputs of the data science process can be used in traditional analytics, analyzed directly, or fed into automated decision-making.
Traditional Analytics
Store and access data. Filter and aggregate it. Visualize it. Show it to the business so they can take action.
Machine Learning
Filter and aggregate it. Create a model. Generate new data (predictions, etc.).
• The new data can be stored with the rest of the data for use in analytics.
• Or it can be visualized directly to gain insights.
• Or it can automate decisions or actions, allowing better processes to run faster and 24/7.

We’ll spend the rest of the workshop talking about the tools that enable all this
to happen.
Azure Machine Learning: Python-based machine learning service
• Develop models faster with automated machine learning
• Use any Python environment and ML frameworks
• Manage models across the cloud and the edge
+
Azure Databricks: Apache Spark-based big-data service
• Prepare and clean data at massive scale
• Enable collaboration between data scientists and data engineers
• Access machine learning optimized clusters

Current section: Azure Machine Learning Overview & Demo | 9:45-10:30

Azure Machine Learning offers a suite of tools for managing the Machine Learning lifecycle.
Train and evaluate model → Organize model assets → Deploy and manage model

Automated ML provides an easy way to quickly iterate through multiple models.
How much is this car worth?

Which features? Mileage, Condition, Car brand, Year of make, Regulations, …
Which algorithm? Gradient Boosted, Nearest Neighbors, SGD, Bayesian Regression, LGBM, …
Which parameters? Parameter 1, Parameter 2, Parameter 3, Parameter 4, …
Example combination:
• Features: Mileage, Car brand, Year of make
• Algorithm: Gradient Boosted
• Parameters: Criterion, Loss, Min Samples Split, Min Samples Leaf
• Result: XYZ Model

Iterate
First combination:
• Features: Mileage, Car brand, Year of make
• Algorithm: Gradient Boosted
• Parameters: Criterion, Loss, Min Samples Split, Min Samples Leaf
• Result: XYZ Model
Next combination:
• Features: Car brand, Year of make, Condition
• Algorithm: Nearest Neighbors
• Parameters: N Neighbors, Weights, Metric, P
• Result: ZYX Model


Input: Enter data, Define goals, Apply constraints
Intelligently test multiple models in parallel
Output: Optimized model
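The idea behind automated ML (try many combinations of features and algorithms, then keep whichever scores best on held-out data) can be sketched without any cloud service. The example below is a toy illustration, not Azure Machine Learning's implementation. The car data is invented, and the "algorithms" are limited to a mean baseline and a one-feature straight-line fit.

```python
from statistics import mean


def fit_mean(rows, feature=None):
    """Baseline 'algorithm': always predict the average training price."""
    average = mean(r["price"] for r in rows)
    return lambda row: average


def fit_line(rows, feature):
    """One-feature least squares line: price = intercept + slope * feature."""
    xs = [r[feature] for r in rows]
    ys = [r["price"] for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda row: intercept + slope * row[feature]


def mean_abs_error(model, rows):
    return mean(abs(model(r) - r["price"]) for r in rows)


def auto_select(candidates, train, validation):
    """Train every candidate and return (error, name, model) for the best one."""
    results = []
    for name, fit, feature in candidates:
        model = fit(train, feature)
        results.append((mean_abs_error(model, validation), name, model))
    return min(results, key=lambda r: r[0])


def car(mileage, year):
    # Hypothetical pricing pattern used only to generate example data.
    return {"mileage": mileage, "year": year, "price": 30000 - 0.1 * mileage}


train = [car(20000, 2018), car(60000, 2012), car(40000, 2016),
         car(100000, 2014), car(80000, 2010)]
validation = [car(50000, 2019), car(90000, 2011)]
candidates = [
    ("mean baseline", fit_mean, None),
    ("line on mileage", fit_line, "mileage"),
    ("line on year", fit_line, "year"),
]
error, name, model = auto_select(candidates, train, validation)
print(name, round(error, 2))
```

A real automated ML service searches a far larger space (many algorithms, parameter settings, and feature transformations) and runs the candidates in parallel, but the select-by-validation-score loop is the same.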

For those who prefer a no-code experience, there’s a drag-n-drop interface in Azure Machine Learning Designer.

Azure Machine Learning allows you to take advantage of cloud compute through local tools or Azure Notebooks.

For image data, you can also train custom object detection models with the intuitive Labeling interface.

Experiments allow you to capture training metrics to run side-by-side comparisons and easily select the best model.

Pipelines can organize multiple data preparation and modeling steps into a single resource.

Explain machine learning models to support business users and compliance processes.

And apply fairness assessments when needed.

Apply version control in a centralized model registry.

Models can be deployed to containers and shipped to the edge or accessed via REST APIs.
Proactively manage model performance
• Identify and promote your best models
• Capture model telemetry
• Retrain models with APIs
Deploy models closer to your data
• Deploy models anywhere
• Scale out to containers
• Infuse intelligence into the IoT edge
Bring models to life quickly
• Build and deploy models in minutes
• Iterate quickly on serverless infrastructure
• Easily change environments
Diagram: Azure ML service (Train and evaluate models; Model MGMT, experimentation, and run history) → Docker containers → AKS, ACI, IoT edge

Monitor data drift over time to know when your model may require re-training.
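One simple way to think about data drift is to compare a feature's recent values against the values the model was trained on. The sketch below uses a standardized shift in the mean as the drift score. This is a stand-in heuristic chosen for illustration, not the metric Azure Machine Learning computes. The threshold of 2.0 and the sample values are assumptions.

```python
from statistics import mean, pstdev


def drift_score(baseline, recent):
    """How many baseline standard deviations the recent mean has moved."""
    spread = pstdev(baseline)
    if spread == 0:
        return 0.0 if mean(recent) == mean(baseline) else float("inf")
    return abs(mean(recent) - mean(baseline)) / spread


def needs_retraining_review(baseline, recent, threshold=2.0):
    """Flag the feature when the shift exceeds the (assumed) threshold."""
    return drift_score(baseline, recent) > threshold


# Hypothetical values of one input feature at training time and in production.
training_values = [10, 12, 11, 9, 10, 8]
stable_week = [10, 11, 9, 10]
shifted_week = [15, 16, 14, 15]

print(needs_retraining_review(training_values, stable_week))
print(needs_retraining_review(training_values, shifted_week))
```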

That was a high-level overview of the
tools and utilities.
Let’s dive in to see the citizen data
science tool in action.

Current section: Databricks Overview | 10:30-11:00

What is Databricks?
Why scale out vs. scale up?
What is Spark?
Why Databricks?
Agenda

What is Databricks, in a nutshell?
Databricks is a unified platform powered by Apache Spark, capable of abstracting complex cluster management to scale out your data processing and machine learning workloads, with intelligent optimizations to dynamically reallocate workers given computational demands.

Scaling Out with Distributed Processing vs. Scaling Up
Imagine this… I need to find every entry in the phone book with my first name. I’d like to hire someone to read through the entire phone book and pick them out.
Option A: A-G, H-N, O-T, U-Z
Option B

Option A: A-D, E-I, J-M, N-R, S-V, W-Z
Option B
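The phone book analogy maps onto a simple pattern: split the data into sections, give each section to a separate worker, and combine the results. The sketch below shows that pattern on one machine with Python's standard thread pool. It illustrates the idea only: Spark distributes the sections across separate machines, and the names in this phone book are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def find_first_name(entries, first_name):
    """One reader scanning one section of the phone book."""
    return [entry for entry in entries if entry.split()[0] == first_name]


def split_into_partitions(entries, workers):
    """Cut the phone book into roughly equal consecutive sections."""
    size = max(1, -(-len(entries) // workers))  # ceiling division
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def scaled_out_search(entries, first_name, workers=4):
    """Hand each section to its own worker, then combine what they found."""
    partitions = split_into_partitions(entries, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(
            lambda section: find_first_name(section, first_name), partitions
        )
        return [entry for section in partial_results for entry in section]


PHONE_BOOK = ["Brian Adams", "Ana Gomez", "Brian Hall",
              "Uma Patel", "Olga Novak", "Brian Zhou"]
print(scaled_out_search(PHONE_BOOK, "Brian"))
```

Adding more readers (or more worker nodes) shortens the job without needing any single reader to be faster, which is the core argument for scaling out over scaling up.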

Imagine this… I need to extract every numeric column in my dataset and normalize the values in each.
I need to perform a grid search of hyperparameters to improve the accuracy of my classification model.
I need to train an algorithm to make correct classifications based on several features.
Option A
Option B
These common pieces of machine learning pipelines may sound simple, but in working with big data, tasks like these can add hours, days, or weeks to your timeline, or be too cost-inefficient to complete at all.
More flexible
More easily scalable
With Databricks and Spark, easy to spin up and manage

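A hyperparameter grid search is a natural fit for scaling out because every combination can be evaluated independently. The sketch below runs a small grid in parallel with a local thread pool. The `validation_score` function is a made-up stand-in for "train a model with these settings and measure it", and the parameter names and values are assumptions. On Databricks the same shape of work would be handed to a cluster instead of local threads.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_score(params):
    """Hypothetical stand-in for training a model and scoring it on held-out data."""
    depth, learning_rate = params
    return -((depth - 4) ** 2) - 100 * (learning_rate - 0.1) ** 2


def grid_search(depths, learning_rates, workers=4):
    """Score every combination in parallel and return the best one."""
    grid = list(product(depths, learning_rates))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(validation_score, grid))
    best_score, best_params = max(zip(scores, grid))
    return best_params, best_score


best_params, best_score = grid_search(depths=[2, 4, 8], learning_rates=[0.01, 0.1, 1.0])
print(best_params, best_score)
```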
What is Spark?
Apache Spark is an open source framework enabling distributed cluster computing for large scale data processing. The Spark architecture works to scale processing out across compute resources with a managing driver node assigning processing tasks to worker nodes.
Spark was founded with the singular goal to “democratize” the “superpower” of big data by offering high-level APIs and a unified engine to complete processing at all steps of the data pipeline.
Since then, thousands of contributors have developed Spark projects that improve the accessibility and versatility of the Spark framework and distributed processing.
Timeline
• 2010: Started at UC Berkeley
• 2013: Databricks started & donated to ASF
• 2014: Spark 1.0 and additions to Spark Core (SQL, ML, GraphX)
• 2015: DataFrames/Datasets, Tungsten, ML Pipelines
• 2016: Apache Spark 2.0 (Easier, Smarter, Faster)
• 2018: Continued feature development to further support distributed ML
• 2020: Apache Spark 3.0 released, Adaptive Query Execution, new Pandas function APIs

Scaling Out with Databricks
Databricks brings scaling out to your workloads in a way that’s easy to spin up,
familiar to work with, and integrates with tools you already use every day.

Let’s take a brief tour of Databricks

Familiar Options & Distributed Frameworks on Databricks
In an accessible setting
• Multiple languages in Databricks Notebooks (Python, R, Scala, SQL)
• Databricks Connect: connect external tools with Databricks (IDEs, RStudio, Jupyter…)
• Work on a single node and utilize the most common ML frameworks
Distributed machine learning
• Spark MLlib for distributed models
• Migrate Single Node to distributed with just a few lines of code changes
• Distributed hyperparameter search (Hyperopt, Gridsearch)
• Pandas UDF to distribute models over subsets of data or hyperparameters
• Koalas: pandas DataFrame API on Spark
• Deep Learning distributed training (HorovodRunner)

Enhanced Accessibility on Azure Databricks
Azure Databricks is a first party service on Azure.
• Not an Azure Marketplace or a 3rd party hosted service
• PaaS: Platform as a Service
Azure Databricks is integrated seamlessly with Azure services.
• Azure Storage Services: Directly access data in Azure Blob Storage and Azure Data Lake Store
• Azure Active Directory: For user authentication, eliminate the need to maintain two separate sets of users in Databricks and Azure.

MLflow is an open source platform for managing the end-to-end machine learning
lifecycle. MLflow offers an integrated experience for tracking and securing machine
learning model training runs and running machine learning projects.
What is MLflow?

MLflow’s Five Key Components
Tracking
• Record and query experiments: code, data, configuration, results
Projects
• Package data science code in a format to reproduce runs on any platform
Models
• Deploy machine learning models in diverse serving environments
Registry
• Store, annotate, discover, and manage models in a central repository
Serving
• Host ML models as REST endpoints that are updated automatically


We can work with your business to deliver custom predictive and prescriptive
analytics across the lifecycle.
Machine Learning Strategy
• Develop a backlog of predictive and prescriptive use cases
• Refine and prioritize use cases by value
• Develop a predictive roadmap
Model Development / Data Mining
• Aggregate data from across internal and external data sources
• Perform correlation analyses, develop models, and find new relationships in your data
Model Maintenance
• Monitor and maintain statistical models to sustain predictive power
• Develop a model telemetry dashboard
• Test model design changes to improve predictive power
Model Governance & Operating Model
• Assess existing Data Science & Artificial Intelligence maturity
• Develop standards and processes to help guide data science output
• Build a Data Science Center of Excellence
Model Deployment / MLOps
• Customize and deploy pre-existing models from Azure Cognitive Services
• Deploy custom model as an API or batch job, or support deployment in existing systems
RapidInsight Prototype Offering
Model as a Service Subscription Offering
Elastic AI Research & Development
MLOps POC
Managed Services
Accelerators

THANK YOU
www.ccganalytics.com | (813) 968-3238
