1. Deep Learning On Spark
Using BigDL on Qubole
Dash Desai
Technology Evangelist
@iamontheinet
3. Copyright 2017 © Qubole
What is Machine Learning?
Gives ‘computers the ability to learn without
being explicitly programmed’ - Wikipedia
What is Deep Learning?
A form of ML that uses a model of computing
inspired by the structure of the brain
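As a rough illustration of the "model of computing" mentioned above, the sketch below passes an input through two small layers of artificial neurons, each computing a weighted sum followed by a sigmoid. The weights, layer sizes, and function names are hypothetical and chosen only for illustration; this is not code from the talk or from any particular library.

```python
import math

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One layer: each output neuron is a weighted sum of all inputs plus a bias, passed through a sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Feed the inputs through each (weights, biases) layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations

# Hypothetical hand-picked weights: 2 inputs -> 2 hidden neurons -> 1 output neuron.
LAYERS = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [0.0]),
]

output = forward([1.0, 0.0], LAYERS)
```

In a real deep learning system the weights are not hand-picked; they are learned from data by an optimizer.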
Deep Learning Applications
Computer Vision / Image Recognition / Object Detection
Speech Recognition / Natural Language Processing (NLP)
Recommendation Systems (Products, Matchmaking, etc.)
Prediction (Stock Market, Healthcare, etc.)
Anomaly Detection (Cybersecurity, etc.)
What is Apache Spark?
A fast and general-purpose engine
for large-scale, distributed data
processing
MLlib
Spark’s scalable machine learning library
High-quality algorithms; up to 100x faster than MapReduce
Usable in Java, Scala, Python, and R
Deep Learning: Other Popular Non-Spark Options
TensorFlow* (Google)
• Natively distributed out-of-the-box
Keras
• Naturally runs on distributed frameworks/back-ends
• Theano, MXNet (CMU, MIT, NYU), TensorFlow, CNTK (Microsoft)
*Not to be confused with TensorFlow On Spark (TFOS) by Yahoo
What is BigDL?
Distributed deep learning library
for Apache Spark
Open sourced by Intel (in Dec 2016)
Feature parity with DL frameworks such as Caffe and Torch
Integrates with Spark ML pipeline and Spark Streaming
Supports model snapshots
Uses Intel MKL (Math Kernel Library) and multi-threading within
each Spark task
What is BigDL? (cont.)
Includes 100+ layers (highest-level building block in DL)
Includes 20+ loss functions (help with model fitting)
Optimization methods include SGD, Adagrad, LBFGS
Numeric computing via Tensor and high-level neural networks
Scaling: synchronous mini-batch SGD and all-reduce
communication on Spark
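To make "synchronous mini-batch SGD and all-reduce" more concrete, here is a minimal single-process sketch of the idea. It is a conceptual simulation under stated assumptions, not BigDL's implementation: the "workers" are plain Python lists standing in for Spark partitions, the model is a one-variable linear fit with a mean-squared-error loss, and all function names and hyperparameters are hypothetical.

```python
import random

def local_gradient(w, b, batch):
    """Gradient of the mean squared error for y ~ w*x + b on one worker's mini-batch."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b

def all_reduce_mean(grads):
    """Average the per-worker gradients; conceptually every worker receives this same result."""
    k = len(grads)
    return tuple(sum(g[i] for g in grads) / k for i in range(len(grads[0])))

def train(partitions, steps=2000, batch_size=8, lr=0.05, seed=0):
    """Synchronous loop: all workers compute gradients, then one shared update is applied."""
    rng = random.Random(seed)
    # A single copy of (w, b) is kept because, with synchronous updates,
    # every worker's replica would hold identical values after each step.
    w, b = 0.0, 0.0
    for _ in range(steps):
        grads = [local_gradient(w, b, rng.sample(part, batch_size))
                 for part in partitions]
        grad_w, grad_b = all_reduce_mean(grads)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, noise-free data from y = 3x + 1, split across 4 simulated workers.
data = [(i / 100.0, 3 * (i / 100.0) + 1) for i in range(100)]
partitions = [data[i::4] for i in range(4)]
w, b = train(partitions)
```

The key property is that no worker moves ahead on its own: each step waits for every worker's gradient, combines them, and applies one identical update everywhere.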
BigDL vs TensorFrames
TensorFrames can call TensorFlow (TF) from individual
partitions of a DataFrame or an RDD (in PySpark).
However, since TF is not natively integrated
with Spark, TensorFrames does not support distributed deep
learning, such as model training or fine-tuning.
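The per-partition calling pattern described above can be sketched in plain Python. This is only an illustration of the pattern, assuming a hypothetical `pretrained_model` function and using nested lists in place of a real RDD or DataFrame; it does not use the TensorFrames or Spark APIs.

```python
def pretrained_model(x):
    """Hypothetical stand-in for an already-trained model."""
    return 2 * x + 1

def predict_partition(rows, model):
    """Apply the model to one partition; nothing is shared with other partitions."""
    return [model(x) for x in rows]

def map_partitions(partitions, fn):
    """Stand-in for a per-partition map over an RDD or DataFrame."""
    return [fn(part) for part in partitions]

partitions = [[0, 1], [2, 3], [4]]
scored = map_partitions(
    partitions, lambda rows: predict_partition(rows, pretrained_model)
)
```

Because each partition is processed independently, this pattern suits scoring with a fixed model. Training would additionally require partitions to exchange gradients or parameters, which is the part the slide says is missing.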
BigDL vs TensorFlow on Spark (TFOS), Caffe on Spark
TensorFlow on Spark* (TFOS) and Caffe on Spark
use Spark executors to launch TF or Caffe instances
on the cluster.
However, model training, predictions, etc. are
performed outside of Spark across multiple TF or
Caffe instances:
• Run as standalone jobs outside of the pipeline
• Very fine-grained/limited interaction with
analytics pipeline
*Not to be confused with natively distributed TensorFlow by Google
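Demo: Recognize Handwritten Digits
[Slide graphic: use a model, trained on a dataset, with everything running on …, to recognize handwritten digits. The model, dataset, and platform names appeared as images and did not survive extraction.]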
Qubole
Qubole automates, controls and orchestrates all big data workloads including Data
Science so that you can optimize for performance, cost and scale.
Built for Anyone Who Uses Data
• Analysts
• Data Scientists
• Data Engineers
• Data Admins
A Single Platform for Any Use Case
• ETL & Reporting
• Ad Hoc Queries
• Machine Learning
• Streaming
• Vertical Apps
Open Source Engines, Optimized for the Cloud
• Cloud-Native
• Cloud-Optimized
• Cloud-Agnostic
Cluster LifeCycle Management on Qubole
Note: Available on Apache Spark, Hadoop, and Presto as a service on Qubole
Auto-scaling Clusters
• Policy-driven
• One-time setup; Runtime modifications
• Workload-aware upscaling and downscaling
• No wasted resources, resulting in lower TCO
Heterogeneous Clusters
• Mix-and-match instance types
• On-Demand and Spot instances (on AWS)
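The idea of policy-driven, workload-aware scaling on a mixed cluster can be sketched as follows. This is a hypothetical illustration only: the function names, the policy parameters (`tasks_per_node`, `min_nodes`, `max_nodes`, `spot_fraction`), and the sizing rule are assumptions made for the example and do not describe Qubole's actual algorithm or configuration options.

```python
import math

def desired_nodes(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Nodes needed for the pending workload, clamped to the policy's bounds."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))

def plan_instances(total_nodes, spot_fraction):
    """Split a node count between On-Demand and Spot instances (Spot share rounded down)."""
    spot = int(total_nodes * spot_fraction)
    return {"on_demand": total_nodes - spot, "spot": spot}

# Hypothetical policy: 8 tasks per node, between 2 and 10 nodes, half Spot.
busy = plan_instances(desired_nodes(100, 8, 2, 10), 0.5)
idle = plan_instances(desired_nodes(0, 8, 2, 10), 0.5)
```

With this policy a heavy backlog scales the cluster up to its configured maximum, while an empty queue scales it back down to the minimum, so capacity follows the workload.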
Qubole: High-level View
[Architecture diagram; labels recovered from the slide]
Columns: User Access, Qubole Tier, Customer’s Azure Account
• Qubole UI via browser, SDK, ODBC
• REST API (HTTPS)
• Ephemeral web tier: web servers
• Default Hive Metastore
• RDS (Qubole): user and account configurations (encrypted credentials)
• Encrypted result cache
• (Optional) Custom Hive Metastore
• (Optional) Other RDS
• Ephemeral cluster, managed by Qubole: master and encrypted HDFS slaves
• Data flow within customer’s cloud
25. Thank you!
Dash Desai
Technology Evangelist
@iamontheinet
Getting Started
Install BigDL on Qubole + Demo App: http://bit.ly/deep_learning_bigdl_qubole
BigDL: https://github.com/intel-analytics/BigDL