Shanghai Apache Spark+AI Online Meetup (https://www.meetup.com/Shanghai-Apache-Spark-AI-Meetup/events/269342169/) on Mar 13, 2020
Topic: Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo (https://github.com/intel-analytics/analytics-zoo)
Speaker: Shan Yu, Intel
1. Mar 13, 2020
Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
Shan Yu, Shengsheng Huang, Jason Dai
2. Mar 13, 2020
Agenda
• Background
• Introduction to Analytics Zoo
• Background on Time Series Analysis
• Background on AutoML and Ray
• Time Series Analysis using AutoML and Ray on Analytics Zoo
• Use Case Sharing
4. Mar 13, 2020
What is Analytics Zoo
Accelerating Data Analytics + AI Solutions At Scale
• BigDL: Distributed, High-Performance Deep Learning Framework for Apache Spark (https://github.com/intel-analytics/bigdl)
• Analytics Zoo: Unified Analytics + AI Platform; Distributed TensorFlow, Keras, PyTorch and BigDL on Apache Spark (https://github.com/intel-analytics/analytics-zoo)
5. Mar 13, 2020
Seamless Scaling from Laptop to Production
Unified Big Data Analytics and AI Platform
Figure: production data pipeline in three stages: (1) prototype on laptop using sample data, (2) experiment on clusters with historical data, (3) production deployment w/ distributed data pipeline
• Easily prototype the integrated data analytics & AI solution
• “Zero” code change from laptop to distributed cluster
• Directly access production data (Hadoop/Hive/HBase) without data copy
• Seamless deployment on production big data clusters
6. Mar 13, 2020
Analytics Zoo
Unified Big Data Analytics and AI Platform
https://github.com/intel-analytics/analytics-zoo
Architecture layers (top to bottom):
• Models & Algorithms: Recommendation, Time Series, Computer Vision, NLP
• ML Workflow: AutoML for Time Series, Automatic Cluster Serving
• Integrated Analytics & AI Pipelines: Distributed TensorFlow & PyTorch on Spark, Spark Dataframes & ML Pipelines for DL, RayOnSpark, Model Serving
• Library & Framework: Python Libraries (Numpy/Pandas/…), DL Frameworks (TF/PyTorch/…), Distributed Analytics (Spark/Flink/Ray/…), Distributions (Cloudera/Databricks/…)
7. Mar 13, 2020
Time Series Analysis
• Time series data
• A series of data points observed sequentially in time
• Numerical & unstructured
• Stock prices, sales volume, CPU/IO monitoring data, etc.
• Examples of time series analysis
• Product demand prediction
• Network quality analysis
• Predictive maintenance for high-value equipment
Figure: Total volume of taxi passengers in NYC from 2014/07 to 2015/02 (source: https://github.com/intel-analytics/analytics-zoo/blob/master/apps/anomaly-detection/anomaly-detection-nyc-taxi.ipynb)
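Forecasting models usually consume a series like this as fixed-length windows of past values paired with the values to predict. The following is a minimal sketch in plain Python, assuming a univariate series; the function name `make_windows` and the parameters `past_len` and `future_len` are illustrative and are not part of Analytics Zoo.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], past_len: int, future_len: int = 1
) -> Tuple[List[List[float]], List[List[float]]]:
    """Split a series into (past window, future window) training pairs."""
    if past_len < 1 or future_len < 1:
        raise ValueError("past_len and future_len must be positive")
    inputs, targets = [], []
    # The last valid start index leaves room for both windows.
    for start in range(len(series) - past_len - future_len + 1):
        split = start + past_len
        inputs.append(list(series[start:split]))
        targets.append(list(series[split:split + future_len]))
    return inputs, targets


# Toy passenger counts (made-up numbers, for illustration only).
counts = [10, 12, 15, 14, 18, 21]
x, y = make_windows(counts, past_len=3, future_len=1)
```

Each row of `x` holds three consecutive observations and the matching row of `y` holds the observation that follows them. A series shorter than `past_len + future_len` yields no pairs.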
8. Mar 13, 2020
AutoML Overview
Taking the Human out of Learning Applications: A Survey on Automated Machine Learning. Yao, Q., Wang, et al.
9. Mar 13, 2020
Ray and Ray On Spark
https://medium.com/riselab/rayonspark-running-emerging-ai-applications-on-big-data-clusters-with-ray-and-analytics-zoo-923e0136ed6a
• Ray
• A distributed framework for emerging AI applications
• RayOnSpark
• Directly run Ray programs on big data clusters
• Seamlessly integrate Ray into the Spark data processing pipeline
10. Mar 13, 2020
Time Series Analysis using AutoML and Ray on Analytics Zoo
11. Mar 13, 2020
Time Series Solution In Analytics Zoo
Figure: Analytics Zoo stack (labels as recoverable from the slide):
• Time Series Solution: Trend Prediction, Anomaly Detection, …
• ML Workflow: Cluster Serving, AutoML Framework (Hyper-Parameter Tuning, Feature Generation, Model Selection)
• Built-in Algorithms and Models: Recommendation, Computer Vision, NLP, TimeSeries Algorithms, …, User Models
• Integrated Analytics and AI Pipelines: Distributed TensorFlow & PyTorch on Spark, Spark Dataframes & ML Pipelines for DL, RayOnSpark, Model Inference
• Environments: Laptop, Spark Cluster, YARN Cluster, K8s Cluster
• Time series applications
• Time series forecasting
• Anomaly detection
• Time series clustering
• etc.
• AutoML
• Seamless scaling
• Full-stack Intel SW+HW optimization w/ Analytics Zoo
12. Mar 13, 2020
AutoML + Time Series Analysis Framework In Analytics Zoo
• AutoML Framework
• FeatureTransformer
• Model
• SearchEngine
• Pipeline
• Time Series Prediction w/ AutoML
• TimeSequencePredictor
• TimeSequencePipeline
https://medium.com/riselab/scalable-automl-for-time-series-prediction-using-ray-and-analytics-zoo-b79a6fd08139
13. Mar 13, 2020
Typical Workflow of Training w/ AutoML
Workflow implemented in TimeSequencePredictor:
• Inputs to the SearchEngine: FeatureTransformer (with tunable parameters), Model (with tunable parameters), search presets
• The SearchEngine runs trial jobs on Ray Tune; each trial runs a different combination of hyperparameters
• The search returns the best model/parameters
• Output: a Pipeline configured with the best parameters/model
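The workflow can be illustrated with a small, self-contained sketch. Everything here is a hypothetical stand-in written for this example: `lag_features`, `predict`, `run_trial`, `search`, and `make_pipeline` are not Analytics Zoo APIs, the model is a trivial weighted average rather than a neural network, and the trials run sequentially in one process rather than as Ray Tune jobs.

```python
import itertools
import statistics


def lag_features(series, past_len):
    """Stand-in feature transformer: past_len lagged values predict the next one."""
    count = len(series) - past_len
    rows = [list(series[i:i + past_len]) for i in range(count)]
    targets = [series[i + past_len] for i in range(count)]
    return rows, targets


def predict(row, weight_recent):
    """Stand-in model: blend the latest value with the window mean."""
    return weight_recent * row[-1] + (1.0 - weight_recent) * statistics.fmean(row)


def run_trial(series, config):
    """One trial: build features, score the model, return mean squared error."""
    rows, targets = lag_features(series, config["past_len"])
    errors = [
        (predict(row, config["weight_recent"]) - target) ** 2
        for row, target in zip(rows, targets)
    ]
    return statistics.fmean(errors)


def search(series, space):
    """Stand-in search engine: try every combination, keep the lowest error."""
    keys = list(space)
    configs = [dict(zip(keys, values)) for values in itertools.product(*space.values())]
    return min(configs, key=lambda config: run_trial(series, config))


def make_pipeline(config):
    """Stand-in pipeline: feature step plus model, frozen at the best config."""
    def forecast(history):
        window = list(history[-config["past_len"]:])
        return predict(window, config["weight_recent"])
    return forecast


series = [float(v) for v in range(1, 21)]  # toy, steadily increasing data
space = {"past_len": [2, 4], "weight_recent": [0.0, 0.5, 1.0]}
best = search(series, space)
pipeline = make_pipeline(best)
next_value = pipeline(series)
```

The search space plays the role of the search presets, each call to `run_trial` corresponds to one trial, and the returned `pipeline` is the feature step and model fixed at the winning configuration. A grid search is used only to keep the sketch short; other search strategies fit the same structure.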
14. Mar 13, 2020
General API Usage
• Training a Predictor
• fit (w/ automl)
• recipe
• distributed
• Using a Pipeline
• save/load
• evaluate/predict
• fit (incremental)
18. Mar 13, 2020
Time Series Based Network Quality Prediction in SK Telecom
Figure: pipeline architecture. Recoverable labels: Spark-SQL; Data Loading; Data Loader; Data Source APIs (File, HTTP, Kafka); forked; DRAM Store; customized; Flash Store; tiering; Preprocess; RDD of Tensor; Model Code of TF; DL Training & Inferencing; Data; Model; SIMD Acceleration
https://databricks.com/session_eu19/apache-spark-ai-use-case-in-telco-network-quality-analysis-and-prediction-with-geospatial-visualization
19. Mar 13, 2020
Unsupervised Time Series Anomaly Detection for Baosight
https://software.intel.com/en-us/articles/lstm-based-time-series-anomaly-detection-using-analytics-zoo-for-apache-spark-and-bigdl
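One common way to turn forecasts into anomaly flags is to mark the points where the observed value is far from the predicted value. The sketch below assumes that forecasts are already available and that a fixed absolute-error threshold is acceptable; the function name `detect_anomalies`, the threshold value, and the data are illustrative, and this is not necessarily the method used in the Baosight deployment.

```python
from typing import List, Sequence


def detect_anomalies(
    actual: Sequence[float], predicted: Sequence[float], threshold: float
) -> List[int]:
    """Return indices where |actual - predicted| exceeds the threshold."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [
        i
        for i, (a, p) in enumerate(zip(actual, predicted))
        if abs(a - p) > threshold
    ]


# Made-up sensor readings and forecasts, for illustration only.
actual = [20.1, 20.3, 35.0, 20.2, 19.9]
predicted = [20.0, 20.2, 20.4, 20.3, 20.0]
anomalies = detect_anomalies(actual, predicted, threshold=5.0)
```

No labeled anomalies are needed to apply the threshold, which is what makes the approach unsupervised. In practice the threshold would be chosen from the error distribution on normal data rather than fixed by hand.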
21. Mar 13, 2020
More Information about AutoML + Time Series in Analytics Zoo
• Scalable AutoML for Time Series Analysis
• Source code as a branch of the analytics-zoo repo: https://github.com/intel-analytics/analytics-zoo/tree/automl
• README: https://github.com/intel-analytics/analytics-zoo/blob/automl/pyzoo/zoo/automl/README.md
• Blog: https://medium.com/riselab/scalable-automl-for-time-series-prediction-using-ray-and-analytics-zoo-b79a6fd08139
• Anomaly Detection Reference Examples
• Time Series Forecast w/ AutoML: https://github.com/intel-analytics/analytics-zoo/blob/automl/apps/automl
• Anomaly Detection based on Forecast: https://github.com/intel-analytics/analytics-zoo/tree/master/apps/anomaly-detection
• Anomaly Detection based on AutoEncoder: https://github.com/intel-analytics/analytics-zoo/tree/master/apps/anomaly-detection-hd
• Real-world Customer Applications
• Baosight’s anomaly detection for intelligent equipment management. For details, see http://software.intel.com/en-us/articles/lstm-based-time-series-anomaly-detection-using-analytics-zoo-for-apache-spark-and-bigdl
• Yunda’s anomaly detection for AIOps: https://www.intel.cn/content/www/cn/zh/analytics/artificial-intelligence/yunda-brings-quality-change-to-the-express-delivery-industry.html
22. Mar 13, 2020
More Information on Analytics Zoo
• Project websites
• https://github.com/intel-analytics/analytics-zoo
• https://github.com/intel-analytics/bigdl
• Tutorials
• CVPR 2018: https://jason-dai.github.io/cvpr2018/
• AAAI 2019: https://jason-dai.github.io/aaai2019/
• “BigDL: A Distributed Deep Learning Framework for Big Data”
• In Proceedings of the ACM Symposium on Cloud Computing 2019 (SoCC ’19)
• Use cases
• Azure, CERN, MasterCard, Office Depot, Tencent, Midea, etc.
• https://analytics-zoo.github.io/master/#powered-by/