Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio

Ultra Fast Deep Learning in Hybrid Cloud
Using Intel Analytics Zoo & Alluxio
Jennie Wang, Intel
Louie Tsai, Intel
Bin Fan Alluxio
04/23/2020 1

Agenda
• Part1: Deep Learning & Analytics Zoo
• Part2: Challenges in Hybrid Environment
• Architecture: Analytics Zoo + Alluxio
• Part3: Experimental Result
2

Deep Learning & Analytics Zoo
3

Data Scale Driving
Deep Learning Process
“Machine Learning Yearning”,
Andrew Ng, 2016
4

Real-World ML/DL Systems Are
Complex Big Data Analytics Pipelines
“Hidden Technical Debt in Machine Learning Systems”,
Sculley et al., Google, NIPS 2015 Paper
5

Analytics Zoo: End-to-End DL Pipeline
Made Easy for Big Data
Prototype on laptop
using sample data
Experiment on clusters
with history data
Deployment with
production, distributed big
data pipelines
• “Zero” code change from laptop to distributed cluster
• Directly accessing production big data (Hadoop/Hive/HBase)
• Easily prototyping the end-to-end pipeline
• Seamlessly deployed on production big data clusters
6

Analytics Zoo
Recommendation
Distributed TensorFlow & PyTorch on Spark
Spark Dataframes & ML Pipelines for DL
RayOnSpark
Inference Model
Models &
Algorithms
Integrated
Analytics & AI
Pipelines
Time Series Computer Vision NLP
Unified Data Analytics and AI Platform
https://github.com/intel-analytics/analytics-zoo
Automated ML
Workflow
AutoML for Time Series Automatic Cluster Serving
Compute
Environment
K8s Cluster Spark Cluster
Python Libraries
(Numpy/Pandas/sklearn/…)
DL Frameworks
(TF/PyTorch/OpenVINO/…)
Distributed Analytics
(Spark/Flink/Ray/…)
Laptop Hadoop Cluster
Powered by oneAPI
7

Distributed TensorFlow on Spark in Analytics Zoo
#pyspark code
train_rdd = spark.hadoopFile(…).map(…)
dataset = TFDataset.from_rdd(train_rdd,…)
#tensorflow code
import tensorflow as tf
slim = tf.contrib.slim
images, labels = dataset.tensors
with slim.arg_scope(lenet.lenet_arg_scope()):
logits, end_points = lenet.lenet(images, …)
loss = tf.reduce_mean(
tf.losses.sparse_softmax_cross_entropy(
logits=logits, labels=labels))
#distributed training on Spark
optimizer = TFOptimizer.from_loss(loss,
Adam(…))
optimizer.optimize(end_trigger=MaxEpoch(5))
Write TensorFlow inline with Spark code
Analytics Zoo API in blue 8

Spark Dataframe & ML Pipeline for DL
#Spark dataframe code
parquetfile = spark.read.parquet(…)
train_df = parquetfile.withColumn(…)
#Keras API
model = Sequential()
.add(Convolution2D(32, 3, 3))
.add(MaxPooling2D(pool_size=(2, 2)))
.add(Flatten()).add(Dense(10)))
#Spark ML pipeline code
estimater = NNEstimater(model,
CrossEntropyCriterion())
.setMaxEpoch(5)
.setFeaturesCol("image")
nnModel = estimater.fit(train_df)

RayOnSpark
Run Ray programs directly on YARN/Spark/K8s cluster
“RayOnSpark: Running Emerging AI Applications on Big Data Clusters with Ray and Analytics Zoo”
https://medium.com/riselab/rayonspark-running-emerging-ai-applications-on-big-data-clusters-with-ray-and-analytics-zoo-923e0136ed6a
Analytics Zoo API in blue
sc = init_spark_on_yarn(...)
ray_ctx = RayContext(sc=sc, ...)
ray_ctx.init()
#Ray code
@ray.remote
class TestRay():
def hostname(self):
import socket
return socket.gethostname()
actors = [TestRay.remote() for i in range(0,
100)]
print([ray.get(actor.hostname.remote())
for actor in actors])
ray_ctx.stop()
10

Distributed Cluster Serving
P5
P4
P3
P2
P1
R4
R3
R2
R1
R5
Input Queue for requests
Output Queue (or files/DB tables)
for prediction results
Local node or
Docker container Hadoop/Yarn/K8s cluster
Network
connection
Model
Simple
Python script
https://software.intel.com/en-u
s/articles/distributed-inference
-made-easy-with-analytics-zoo
-cluster-serving#enqueue request
input = InputQueue()
img = cv2.imread(path)
img = cv2.resize(img, (224,
224))
input.enqueue_image(id, img)
#dequeue response
output = OutputQueue()
result = output.dequeue()
for k in result.keys():
print(k + “: “ +
json.loads(result[k]))
√ Users freed from complex distributed inference solutions
√ Distributed, real-time inference automatically managed Analytics Zoo
− TensorFlow, PyTorch, Caffe, BigDL, OpenVINO, …
− Spark Streaming, Flink, …

Scalable AutoML for Time Series Prediction
“Scalable AutoML for Time Series Prediction using Ray and Analytics Zoo”
https://medium.com/riselab/scalable-automl-for-time-series-prediction-usin
g-ray-and-analytics-zoo-b79a6fd08139
Automated feature selection, model selection and hyper parameter tuning using Ray
tsp = TimeSequencePredictor(
dt_col="datetime",
target_col="value")
pipeline = tsp.fit(train_df,
val_df, metric="mse",
recipe=RandomRecipe())
pipeline.predict(test_df)

Production Deployment with Analytics Zoo for
Spark and BigDL
http://mp.weixin.qq.com/s/xUCkzbHK4K06-v5qUsaNQQ
https://software.intel.com/en-us/articles/building-large-scale-image-feature-extraction-with-bigdl-at-jdcom
• Reuse existing Hadoop/Spark clusters for deep learning with no changes (image search, IP protection, etc.)
• Efficiently scale out on Spark with superior performance (3.83x speed-up vs. GPU severs) as benchmarked by JD
13

Technology End UsersCloud Service
Providers
And Many More
*Other names and brands may be claimed as the property of others.
software.intel.com/AIonBigData
Not a full list

Hybrid Cloud & Alluxio
An Open Source Data Orchestration Layer
www.alluxio.io
15

Co-located
Co-located
compute & HDFS
on the same cluster
Disaggregated
compute & HDFS
on the same cluster
MR / Hive
HDFS
Hive
HDFS
Disaggregated
Burst HDFS data in
the cloud,
public or private
Enable & accelerate
access big data across
data centers
Support analytics across
datacenters
HDFS for Hybrid Cloud
Big data journey & innovation
16

Challenge: Data Gets Increasingly Remote
from Compute
▪ Challenging Scenarios
▪ Data-driven initiatives in need of more compute
▪ Hadoop system on-prem, but it’s remote
▪ Object data growth in a cloud region, but it’s remote
▪ How to make remote data local to the compute without
copies?
▪ Business benefits
▪ Faster data-driven insights: data immediately available for
compute
▪ More elastic computing power to solve problems quicker
▪ Up to 80% lower egress costs
Datacenter
17

Solution: “Zero-copy” bursting to scale to
the cloud
AnalyticsZoo
Alluxio
Accelerate big data frameworks
on the public cloud
AnalyticsZoo
Alluxio
Burst big data workloads in
hybrid cloud environments
On premise
18

The Alluxio Story
Originated as Tachyon project, at UC Berkley AMPLab by
Ph.D. student Haoyuan (H.Y.) Li - now Alluxio CTO2013
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Orchestrate Data at Memory Speed for the Cloud
for data driven apps such as Big Data Analytics, ML and AI.
19

Alluxio is Open-Source Data Orchestration
Data Orchestration for the Cloud
Java File API HDFS Interface S3 Interface REST APIPOSIX Interface
HDFS Driver GCS Driver S3 Driver Azure Driver
20

Zero-Copy Burst: View the I/O Stack
FAST
104
- 105
MB/s
MODERATE 103
- 104
MB/s
SLOW 10 - 103
MB/s
Only when necessary
Limited
Often
SSD
HDD
Mem
21

Environments for performance results
EC2 Instance Type r5.8xlarge
Number of vCPU per instance 32
Size of memory per instance 256GB
Network speed 10Gbps
Disk space 100GB
Operation System Ubuntu 18.04
Apache Spark version 2.4.3
BigDL version 0.10.0
Analytics Zoo version 0.7.0
Alluxio version 2.2.0

Environments for performance results
Application : Inception Model on Imagenet
https://github.com/intel-analytics/analytics-zoo/tree/master/zoo/src/main/scala/com/intel/an
alytics/zoo/examples/inception
Used 6 “r5.8xlarge”
instances. One worker
per instance.
Have 6 executors

Performance measurement
Measure data loading
time for training and
test data set
Job0 : load training data set
Job1 : load testing data set
Two stages :
stage 0 and stage 1 in Job 0
Two stages :
stage 2 and stage 3 in Job 1

Performance measurement
Using S3 data Using Alluxio data

Performance Results
Achieve 1.5X
speedup by using
Alluxio
Standard deviation is small
for both w & w/o testings

Legal Disclaimers
• Intel technologies’ features and benefits depend on system configuration and may require enabled
hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
• No computer system can be absolutely secure.
• Tests document performance of components on a particular test, in specific systems. Differences in
hardware, software, or configuration will affect actual performance. Consult other sources of
information to evaluate performance as you consider your purchase. For more complete information
about performance and benchmark results, visit http://www.intel.com/performance.
Intel, the Intel logo, Xeon, Xeon phi, Lake Crest, etc. are trademarks of Intel Corporation in the U.S.
and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2019 Intel Corporation
28

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio

Similar to Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio (20)

More from Alluxio, Inc.

More from Alluxio, Inc. (20)

Recently uploaded

Recently uploaded (20)

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio