Alberto Diaz Martin
alberto.diaz@encamina.com - @adiazcan
Alberto Diaz has more than 15 years of experience in the IT industry, all of them working
with Microsoft technologies. He is currently Chief Technology Innovation Officer at ENCAMINA,
where he leads software development with Microsoft technology, and a member of the
management team.
In the community, he organizes and speaks at the most relevant Microsoft conferences in
Spain, where he is one of the leading voices on SharePoint, Office 365 and Azure. The author
of several books and of articles in professional magazines and blogs, in 2013 he joined the
management team of CompartiMOSS, a digital magazine about Microsoft technologies.
He has been a Microsoft MVP since 2011, a recognition he has renewed for the seventh
consecutive year. He describes himself as a geek, a smartphone lover and a developer.
He is the founder of TenerifeDev (www.tenerifedev.com), a .NET user group in Tenerife, and
coordinator of SUGES (the SharePoint User Group of Spain, www.suges.es).
Challenges for Data Scientists
• Infrastructure management
• Data exploration and visualization at scale
• Time to value - From model iterations to intelligence
• Integrating with various ML tools to stitch a solution together
• Operationalize ML models to integrate them into applications
Machine Learning on Azure
Sophisticated pretrained models
To simplify solution development
Vision, Speech, Language, Search, …
Popular frameworks
To build advanced deep learning solutions
TensorFlow, Keras, PyTorch, ONNX
Productive services
To empower data science and development teams
Azure Databricks, Azure Machine Learning, Machine Learning VMs
Powerful infrastructure
To accelerate deep learning
Flexible deployment
To deploy and manage models on intelligent cloud and edge
On-premises, Cloud, Edge
Recommended architecture to build e2e ML solutions
Ingest: batch data through Azure Data Factory; streaming data through Azure Event Hubs
Store: Azure Data Lake Storage
Prep and train: Azure Databricks, Azure Machine Learning service
Serve:
• Model serving: Azure Kubernetes Service, feeding apps
• Ad-hoc analysis: Azure SQL Data Warehouse, Azure Analysis Services, Power BI
• Operational databases: Cosmos DB, SQL DB
What is Azure Databricks?
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks
• Designed in collaboration with the founders of Apache Spark
• One-click setup; streamlined workflows
• Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts
Best of Microsoft
• Native integration with Azure services (Power BI, SQL DW, Cosmos DB, ADLS, Azure Storage, Azure Data Factory, Azure AD, Event Hub, IoT Hub, HDInsight Kafka, SQL DB)
• Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
[Diagram: Azure Databricks platform.
Inputs: cloud storage, data warehouses, Hadoop storage, IoT / streaming data.
Optimized Databricks Runtime Engine: Apache Spark, Databricks I/O, High Concurrency.
Collaborative Workspace: data engineer, data scientist, business analyst.
Deploy Production Jobs & Workflows: multi-stage pipelines, job scheduler, notification & logs.
Outputs: REST APIs, machine learning models, BI tools, data exports, data warehouses.
Enhance productivity. Build on secure & trusted cloud. Scale without limits.]
AZURE DATABRICKS
• SQL, Python, Scala & R Support
• Code in your favorite language
• Source data from File System, Object stores, HDFS, Database, Pub-Sub
systems & Others
• Read and write data from/to multiple sources
• Optimized for Azure Blob Store, ADLS, SQLDW, Event Hubs & Cosmos DB
• File Formats
• CSV, JSON, Parquet, Text, ORC, XML & More
Your Language, Your Data (Anywhere), Your Format
PREP & TRAIN
A. Collect and prepare data
B. Train and evaluate model
C. Operationalize and manage
Services: Azure Data Factory, Azure Databricks, Azure ML Services
Collect and prepare all of your data at scale
Ingest: Azure Data Factory
Store: Azure Blob Storage
Understand and transform: Azure Databricks
Leverage best-in-class analytics capabilities
• Leverage open source technologies
• Collaborate within teams
• Use ML on batch streams
Scale without limits
• Build in the language of your choice
• Leverage scale out topology
• Scale compute and storage separately
Connect to data from any source
• Integrate with all of your data sources
• Create hybrid pipelines
• Orchestrate in a code-free environment
Train and evaluate machine learning models
Scale compute resources to meet your needs
• Easily scale up or scale out
• Autoscale on serverless infrastructure
• Leverage commodity hardware
Quickly determine the right model for your data
• Determine the best algorithm
• Tune hyperparameters to optimize models
• Rapidly prototype in agile environments
Simplify model development
• Collaborate in interactive workspaces
• Access a library of battle-tested models
• Automate job execution
Automated ML capabilities: Azure ML Services (Automated ML)
Infrastructure: Azure Databricks (scale out clusters)
Tools: Azure Databricks (machine learning)
SPARK MACHINE LEARNING (ML) OVERVIEW
Offers a set of parallelized machine learning algorithms (see next
slide)
Supports Model Selection (hyperparameter tuning) using Cross
Validation and Train-Validation Split.
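Model selection by cross-validation can be illustrated without a cluster. The block below is a minimal plain-Python sketch of the idea, not the Spark ML API: the one-feature ridge model, the helper names (`k_fold_indices`, `cross_validate`, `select_model`) and the toy data are all assumptions made for illustration.

```python
from statistics import mean

def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k folds (round-robin assignment)."""
    return [list(range(fold, n_rows, k)) for fold in range(k)]

def fit_ridge(rows, reg):
    """Fit y ~ w * x with L2 penalty `reg`; closed form for a single feature."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + reg)

def mse(w, rows):
    """Mean squared error of weight w on (x, y) rows."""
    return mean((y - w * x) ** 2 for x, y in rows)

def cross_validate(rows, reg, k=3):
    """Average validation error of one hyperparameter value over k folds."""
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train = [r for i, r in enumerate(rows) if i not in held]
        valid = [rows[i] for i in held_out]
        scores.append(mse(fit_ridge(train, reg), valid))
    return mean(scores)

def select_model(rows, grid, k=3):
    """Pick the grid value with the lowest CV error, then refit on all rows."""
    best = min(grid, key=lambda reg: cross_validate(rows, reg, k))
    return best, fit_ridge(rows, best)

# Hypothetical toy data: exactly y = 2x, so the unpenalised fit should win.
data = [(float(x), 2.0 * x) for x in range(1, 10)]
best_reg, weight = select_model(data, grid=[0.0, 1.0, 10.0])
print(best_reg, weight)
```

A train-validation split can be seen as the cheaper variant of the same loop: each candidate is scored once on a single held-out slice instead of being averaged over k folds.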
Supports Java, Scala or Python apps using DataFrame-based API (as
of Spark 2.0). Benefits include:
• A uniform API across ML algorithms and across multiple languages
• Facilitates ML pipelines (enables combining multiple algorithms into a
single pipeline).
• Optimizations through Tungsten and Catalyst
• Spark MLlib comes pre-installed on Azure Databricks
• 3rd-party libraries supported include: H2O Sparkling Water, scikit-learn
and XGBoost
Enables Parallel, Distributed ML for large datasets on Spark Clusters
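The pipeline idea mentioned above, several stages chained behind one fit/transform interface, can be sketched in plain Python. This is an illustrative stand-in rather than Spark's `Pipeline` class: the stage classes `StandardScaler` and `Clip` and the sample values are hypothetical, and unlike Spark (where fitting returns a separate fitted model) this sketch simply returns the fitted pipeline itself.

```python
class StandardScaler:
    """Stage that learns mean and spread from data, then rescales values."""
    def fit(self, values):
        self.mean = sum(values) / len(values)
        var = sum((v - self.mean) ** 2 for v in values) / len(values)
        self.std = var ** 0.5 or 1.0  # avoid dividing by zero on constant input
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

class Clip:
    """Stage with nothing to learn: caps values to [-limit, limit]."""
    def __init__(self, limit):
        self.limit = limit

    def fit(self, values):
        return self

    def transform(self, values):
        return [max(-self.limit, min(self.limit, v)) for v in values]

class Pipeline:
    """Runs stages in order; each stage is fitted on the previous stage's output."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values

# Fit once on training values, then reuse the same chain on new values.
pipeline = Pipeline([StandardScaler(), Clip(limit=1.0)]).fit([1.0, 2.0, 3.0, 4.0, 5.0])
print(pipeline.transform([3.0, 1000.0]))
```

Because the whole chain is a single object, it can be tuned, saved or applied to new data as one unit, which is the benefit the slide attributes to ML pipelines.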
Why use Azure Databricks for Machine learning?
• Complete platform in one (Data ingestion, exploration,
transformation, featurization, model building, model tuning, and
even model serving).
• No need to copy the data in our system to do ML on it.
• Data scientists like the ease of use of our platform.
• Deep learning algorithms are now available!
• Productionization features built in.
Operationalize and manage models with ease
Proactively manage model performance
• Identify and promote your best models
• Capture model telemetry
• Retrain models with APIs
Deploy models closer to your data
• Deploy models anywhere
• Scale out to containers
• Infuse intelligence into the IoT edge
Bring models to life quickly
• Build and deploy models in minutes
• Iterate quickly on serverless infrastructure
• Easily change environments
Train and evaluate models: Azure Databricks
Model MGMT, experimentation, and run history: Azure ML Services
Containers: AKS, ACI
IoT edge
Docker
ML Export
• ML Model Export allows you to export models and full ML
pipelines
• Exported models can then be imported into Spark and non-Spark platforms
for scoring and predictions
• Targeted at low-latency, lightweight ML-powered applications
• We recommend using MLeap, an open source solution for
ML Model Export that works well in Azure Databricks
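The export-then-score flow can be pictured with a small plain-Python sketch. It does not use MLeap or its bundle format; the JSON layout, the function names `export_model` and `load_scorer`, and the model parameters are assumptions chosen only to show that a scoring service needs the exported parameters, not the training framework.

```python
import json
import math

def export_model(weights, bias, feature_names):
    """Serialise a fitted logistic model to a framework-neutral JSON string."""
    return json.dumps({
        "type": "logistic_regression",
        "features": feature_names,
        "weights": weights,
        "bias": bias,
    })

def load_scorer(payload):
    """Rebuild a scoring function from the exported JSON; no training library needed."""
    spec = json.loads(payload)
    names, weights, bias = spec["features"], spec["weights"], spec["bias"]

    def score(row):
        z = bias + sum(w * row[name] for name, w in zip(names, weights))
        return 1.0 / (1.0 + math.exp(-z))

    return score

# Training side (hypothetical parameters standing in for a fitted model).
bundle = export_model([0.8, -0.4], 0.0, ["visits", "days_idle"])

# Serving side: only the bundle is required to make predictions.
score = load_scorer(bundle)
print(score({"visits": 1.0, "days_idle": 2.0}))
```

Keeping the serving side this small is what makes the approach attractive for the low-latency, lightweight applications the slide mentions.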
Build and deploy deep learning models
• Choose VMs for your modeling needs
• Process video using GPU-based VMs
• Run experiments in parallel
• Provision resources automatically
• Leverage popular deep learning toolkits
• Develop in your language of choice
Scale compute resources in any environment
Quickly evaluate and identify the right model
Streamline AI development efforts
Azure ML Services
Scale out clusters: Azure Databricks
Notebooks: Azure Databricks
Scale out clusters: Batch AI
MS Cognitive Toolkit, Keras, TensorFlow, PyTorch
Azure Databricks for deep learning modeling
Tools, Infrastructure, Frameworks
• Leverage powerful GPU-enabled VMs pre-configured for deep neural network training
• Use HorovodEstimator via a native runtime to build deep learning models with a few lines of code
• Full Python and Scala support for transfer learning on images
• Automatically store metadata in Azure Database with geo-replication for fault tolerance
• Use built-in hyperparameter tuning via Spark MLlib to quickly optimize the model
• Simultaneously collaborate within notebook environments to streamline model development
• Load images natively in Spark DataFrames to automatically decode them for manipulation at scale with distributed DNN training on Spark
• Improve performance 10x-100x over traditional Spark deployments with an optimized environment
• Seamlessly use TensorFlow, Microsoft Cognitive Toolkit, Caffe2, Keras, and more
Ready-to-use clusters with Azure Databricks Runtime for ML
Deep Learning
Azure Databricks supports and integrates with a number of Deep Learning libraries and
frameworks to make it easy to build and deploy Deep Learning applications
Supports Deep Learning libraries/frameworks including:
• Microsoft Cognitive Toolkit (CNTK)
o Article explains how to install CNTK on Azure Databricks.
• TensorFlowOnSpark
• BigDL
Offers Spark Deep Learning Pipelines, a suite of tools for working with and
processing images with deep learning, using transfer learning. It includes
high-level APIs for common aspects of deep learning so they can be done
efficiently in a few lines of code:
• Distributed Hyperparameter Tuning
• Transfer Learning
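Transfer learning, as named above, reuses a pretrained network as a fixed feature extractor and trains only a small new output layer. The following plain-Python sketch shows that split under loud assumptions: `FROZEN_LAYER` is a made-up stand-in for pretrained weights, the four samples are toy data, and none of this is the Deep Learning Pipelines API.

```python
import math

# Hypothetical "pretrained" layer: fixed weights that are never updated below.
FROZEN_LAYER = ((1.0, 0.0), (0.0, 1.0), (0.5, 0.5))

def featurize(x):
    """Frozen feature extractor: linear layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_LAYER]

def predict(head, feats):
    """Probability of class 1 from the trainable logistic head."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], feats))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(head, samples):
    """Mean binary cross-entropy of the head on (features, label) samples."""
    total = 0.0
    for feats, label in samples:
        p = predict(head, feats)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(samples)

def train_head(samples, epochs=200, lr=0.5):
    """Fit only the new output layer with batch gradient descent."""
    head = {"weights": [0.0] * len(samples[0][0]), "bias": 0.0}
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(head["weights"])
        grad_b = 0.0
        for feats, label in samples:
            err = predict(head, feats) - label
            grad_b += err
            for i, f in enumerate(feats):
                grad_w[i] += err * f
        head["bias"] -= lr * grad_b / n
        head["weights"] = [w - lr * g / n for w, g in zip(head["weights"], grad_w)]
    return head

# Toy task: small inputs are class 0, large inputs are class 1.
raw = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.9, 0.8), 1), ((0.8, 0.9), 1)]
samples = [(featurize(x), y) for x, y in raw]
untrained = {"weights": [0.0, 0.0, 0.0], "bias": 0.0}
head = train_head(samples)
print(log_loss(untrained, samples), log_loss(head, samples))
```

Only the handful of head parameters are learned, which is why transfer learning can work with little labeled data and modest compute compared with training a full network.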
Fast, easy, and collaborative Apache Spark™-based analytics platform
Azure Databricks
Built with your needs in mind
Role-based access controls
Effortless autoscaling
Live collaboration
Enterprise-grade SLAs
Best-in-class notebooks
Simple job scheduling
Seamlessly integrated with the Azure Portfolio
Increase productivity
Build on a secure, trusted cloud
Scale without limits