Distributed deep learning to predict Amazon review ratings in Spark using Analytics Zoo on AWS, published as "Rating Prediction using Deep Learning and Spark" at The 11th International Conference on Internet (ICONI 2019), Hanoi, Vietnam, Dec 15-18, 2019
1. Jongwook Woo
HiPIC
CalStateLA
The 11th International Conference on Internet (ICONI) 2019
Dec 16 2019
Jongwook Woo, PhD, jwoo5@calstatela.edu
Monika Mishra, Mingoo Kang+
+Hanshin University, Korea
Big Data AI Center (BigDAI)
California State University Los Angeles
Rating Prediction using Deep Learning and Spark
2. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Deep Learning in Big Data
Rating Prediction in Distributed Deep Learning
3. Big Data Artificial Intelligence Center (BigDAI)
Myself
Experience:
Since 2002, Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
4. Big Data Artificial Intelligence Center (BigDAI)
Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits
5. Big Data Artificial Intelligence Center (BigDAI)
Collaboration with HDP, CDH, Oracle, Amazon
using Hadoop Big Data
https://www.cloudera.com/more/customers/csula.html
6. Big Data Artificial Intelligence Center (BigDAI)
Contents
Myself
Introduction To Big Data
Deep Learning in Big Data
Rating Prediction in Distributed Deep Learning
7. Big Data Artificial Intelligence Center (BigDAI)
New Technology: Big Data
What is Big Data? Data or Systems?
Large Scale Data?
– Many people see it only from the data point of view
– 3 Vs, 5Vs
Systems?
– YES
Big Data [1]
New Computer Systems to store and compute Large-Scale data
– How to Store and Process large scale data
• Not only for computing power
• Parallel Distributed Computing Systems => Data Intensive Super Computer
– The data set does not need to be tera- or petabytes in size
• Linearly Scalable
8. Big Data Artificial Intelligence Center (BigDAI)
Data Handling: Traditional Way
9. Big Data Artificial Intelligence Center (BigDAI)
Data Handling: Traditional Way
Becomes too Expensive
10. Big Data Artificial Intelligence Center (BigDAI)
Data Handling: Another Way
Not Expensive
From the 2017 Korean blockbuster movie “The Fortress” (남한산성)
11. Big Data Artificial Intelligence Center (BigDAI)
Data Handling: Another Way
Not Expensive
http://blog.naver.com/PostView.nhn?blogId=dosims&logNo=221127053677
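1409 (9th year of King Taejong's reign): Choe Hae-san (崔海山), whose father was Choe Mu-seon (崔茂宣)
[Source] Joseon's Secret Weapon: the Chongtonggi Hwacha (銃筒機火車) | Author: Dosim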
12. Big Data Artificial Intelligence Center (BigDAI)
Super Computer vs Big Data vs Cloud
Traditional Super Computer
– Cluster for Compute
– Cluster for Store (Parallel File Systems: Lustre, PVFS, GPFS)
Big Data (Hadoop, Spark, Distributed Deep Learning)
– Cluster for Compute and Store (Distributed File Systems: HDFS, GFS)
However, Cloud Computing adopts the separated architecture
– with high-speed N/W and object storage
13. Big Data Artificial Intelligence Center (BigDAI)
Big Data: Linearly Scalable
Some people question whether a system that handles only 1 ~ 3 GB of data counts as Big Data
On a Big Data platform, simply add more servers as the data grows in the future
– it is linearly scalable once built
– ideally n times more computing power with n servers
[Figure: data size < 3 GB; add n servers; data size > 200 TB]
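The linear-scalability claim above can be put into numbers. This is a minimal sketch of the ideal case only: the function names and the 1,000 GB-per-server capacity are illustrative assumptions, not figures from the talk, and real clusters lose some of the gain to coordination and network overhead.

```python
import math


def ideal_runtime(single_server_seconds: float, n_servers: int) -> float:
    """Runtime under perfect linear scaling: n servers give n times the computing power."""
    if n_servers < 1:
        raise ValueError("n_servers must be at least 1")
    return single_server_seconds / n_servers


def servers_needed(data_gb: float, gb_per_server: float) -> int:
    """Smallest number of servers whose combined capacity covers the data."""
    return max(1, math.ceil(data_gb / gb_per_server))


# Hypothetical capacity of 1,000 GB per server (an assumption for illustration).
small = servers_needed(3, 1000)        # a 3 GB data set fits on one server
large = servers_needed(200_000, 1000)  # 200 TB needs 200 such servers
one_hour_job_on_three = ideal_runtime(3600, 3)
```

The same platform serves both data sizes; only the server count changes.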
14. Big Data Artificial Intelligence Center (BigDAI)
Contents
Myself
Introduction To Big Data
Deep Learning in Big Data
Rating Prediction in Distributed Deep Learning
15. Big Data Artificial Intelligence Center (BigDAI)
Data Scale Driving: Deep Learning Process
Deep Learning and Massive Data [3]
“Machine Learning Yearning” Andrew Ng 2016
16. Big Data Artificial Intelligence Center (BigDAI)
Big Data and Deep Learning
Deep Learning and Massive Data
The larger the data set
– the higher the accuracy
A single GPU server is needed alongside the following services
– Data Engineering, Data Analysis, Data Science
• Do new systems need to be built for these services?
– Deep Learning is isolated from Big Data
• For Data Engineering, Data Analysis, Data Science
17. Big Data Artificial Intelligence Center (BigDAI)
Another Gap between Deep Learning and Big Data Communities [6]
[Figure: "The Gap" between deep learning experts and Big Data engineers, scientists, analysts, etc.]
18. Big Data Artificial Intelligence Center (BigDAI)
Leveraging Big Data Cluster
An existing Big Data cluster holds the massive data set, but a single GPU server for Deep Learning does not use the Big Data cluster
– Data migration is too slow, and the single server fails
[Figure: Big Data Cluster and a single GPU server for Deep Learning?]
19. Big Data Artificial Intelligence Center (BigDAI)
Distributed Deep Learning in Big Data
Big Data Cluster
Already built on site
– and mature for Data Engineering, Data Analysis, Data Science
Can we use the existing Big Data cluster for Deep Learning?
– Can we integrate Deep Learning into this Big Data cluster?
20. Big Data Artificial Intelligence Center (BigDAI)
Leveraging Big Data Cluster
Existing Big Data cluster
Big Data Engineering
Big Data Analysis
Big Data Science
Distributed Deep Learning
– Integrate Deep Learning into the cluster
No data migration is needed, and the cluster's parallel computing and existing large-scale data can be leveraged
Big Data Cluster
21. Big Data Artificial Intelligence Center (BigDAI)
Distributed Deep Learning in Big Data
Integrate deep learning into the Big Data cluster
Leverage existing Hadoop/Spark clusters to run deep learning applications
– Use the existing SW tools and HW infrastructure of Hadoop/Spark
Big Data Hadoop/Spark cluster
– Where the data are stored
• Computing Engines: Spark, MapReduce
– Integrate deep learning
• Add deep learning functionalities to the Big Data (Spark) cluster
• Can use the existing Big Data service
– ETL, data warehouse, data analysis, data prediction
22. Big Data Artificial Intelligence Center (BigDAI)
Contents
Myself
Introduction To Big Data
Deep Learning in Big Data
Rating Prediction in Distributed Deep Learning
23. Big Data Artificial Intelligence Center (BigDAI)
AWS Review Dataset
Dataset: https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
Products reviewed between 2005 and 2015 are analyzed
Countries considered: US, UK, FR, DE
Total no. of product reviews : 9.57 million
File Size : 5.26 GB
Number of Files : 7
File Format : TSV (Tab Separated Values), CSV (Comma Separated Values)
Predictive Analysis
Prediction of rating
– an important measure for purchasing and selling
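The filtering described above (TSV files, four countries, reviews from 2005 to 2015) can be sketched in plain Python. The column names are assumed to match the public dataset's header, the year range is assumed to be inclusive, and the sample rows are made up for illustration; the talk's own pipeline ran on Spark, not on this code.

```python
import csv
import io

# Made-up rows using a subset of the columns assumed to be in the dataset's header.
SAMPLE_TSV = (
    "marketplace\tcustomer_id\tproduct_id\tstar_rating\treview_date\n"
    "US\t1001\tB00A\t5\t2014-03-01\n"
    "UK\t1002\tB00B\t3\t2004-07-15\n"
    "DE\t1003\tB00A\t4\t2015-08-30\n"
    "JP\t1004\tB00C\t1\t2012-01-10\n"
)

COUNTRIES = {"US", "UK", "FR", "DE"}


def load_ratings(tsv_text):
    """Return (customer_id, product_id, rating) triples for the selected countries and years."""
    # QUOTE_NONE keeps stray quote characters in review text from merging fields.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for rec in reader:
        year = int(rec["review_date"][:4])
        if rec["marketplace"] in COUNTRIES and 2005 <= year <= 2015:
            rows.append((rec["customer_id"], rec["product_id"], int(rec["star_rating"])))
    return rows


ratings = load_ratings(SAMPLE_TSV)
```

The (customer, product, rating) triple is the shape that both a collaborative-filtering baseline and a neural recommender take as input.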
24. Big Data Artificial Intelligence Center (BigDAI)
Experimental System Specification
Cluster
Amazon AWS EMR: Hadoop Spark
Analytics Zoo
AWS EMR:
Instances: r3.2xlarge
Number of Nodes: 3
Memory size:
– 183 GB (= 61 GB x 3)
CPU:
– 8 vCPU
– CPU speed: 3.1 GHz
Storage:
– 960 GB (= 2 x 160GB x 3)
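The cluster totals above follow from the per-node figures. A small sketch of that arithmetic, assuming the 8 vCPU figure is per node and that each node has two 160 GB disks as the storage line implies; `NodeSpec` and `cluster_totals` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    memory_gb: int
    vcpus: int
    disks: int
    disk_gb: int


def cluster_totals(node: NodeSpec, n_nodes: int) -> dict:
    """Aggregate memory, vCPUs, and storage over identical nodes."""
    return {
        "memory_gb": node.memory_gb * n_nodes,
        "vcpus": node.vcpus * n_nodes,
        "storage_gb": node.disks * node.disk_gb * n_nodes,
    }


# Per-node values as listed on the slide for r3.2xlarge.
r3_2xlarge = NodeSpec(memory_gb=61, vcpus=8, disks=2, disk_gb=160)
totals = cluster_totals(r3_2xlarge, n_nodes=3)
```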
25. Big Data Artificial Intelligence Center (BigDAI)
Big Data Storage Flow Diagram
26. Big Data Artificial Intelligence Center (BigDAI)
Big Data Prediction with DDL
DDL: Distributed Deep Learning
TensorFlow
Distributed Training and Inference in Spark cluster
DDL
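One common way to run distributed training on a cluster is synchronous data-parallel SGD: each partition computes a gradient on its own shard, the gradients are averaged, and every worker applies the same update. The sketch below assumes that scheme and simulates the workers sequentially on a toy one-parameter model; it is not the Analytics Zoo implementation, and all names are illustrative.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for the model y ~ w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def train(shards, lr=0.01, steps=200):
    """Synchronous data-parallel SGD over equally sized shards."""
    w = 0.0
    for _ in range(steps):
        # On a cluster, each shard's gradient would be computed by a separate task.
        grads = [local_gradient(w, shard) for shard in shards]
        # Averaging the gradients gives every worker the same update.
        w -= lr * sum(grads) / len(grads)
    return w


# Toy data generated from y = 3x, split into two shards ("partitions").
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0::2], data[1::2]]
learned_w = train(shards)
```

Because the shards are the same size, the averaged gradient equals the full-batch gradient, so the result matches single-machine training while the data stays where it is stored.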
27. Big Data Artificial Intelligence Center (BigDAI)
Spark ML and DDL [2-5]
[Figure: Deep Learning in Spark cluster; Distributed Deep Learning (DDL) with DDL libs]
28. Big Data Artificial Intelligence Center (BigDAI)
Spark ML and DDL
Spark ML
ALS (Alternating Least Squares) algorithm
– Feature and Parameters Engineering
– Generalization
• Train Validation Split
• Cross Validation
– Performance
• 5 – 36 minutes
• MAE: 1.55, 1.574
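The two generalization strategies listed for Spark ML differ in how many times the model is fitted. The sketch below shows both splits in plain Python under assumed parameters (an 80/20 split and 3 folds, neither taken from the talk); Spark ML provides its own `TrainValidationSplit` and `CrossValidator` for this.

```python
import random


def train_validation_split(rows, train_ratio=0.8, seed=42):
    """Shuffle once and cut into a training part and a validation part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def k_fold(rows, k=3):
    """Yield (train, validation) pairs; each fold is used for validation exactly once."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]


rows = list(range(10))
train_part, validation_part = train_validation_split(rows)
fold_pairs = list(k_fold(rows, k=3))
```

A single split fits the model once per parameter setting, while k-fold cross validation fits it k times, which is one plausible reason for a wide range of running times.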
DDL: Distributed Deep Learning
Neural Collaborative Filtering(NCF)
– a neural network recommendation system
– Various batch sizes
– Performance
• 16 – 33 minutes
• MAE: 0.693 ~ 0.7036
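To make the NCF idea concrete, here is a minimal forward-pass sketch: a user embedding and an item embedding are concatenated and passed through a small neural network that outputs a probability for each rating from 1 to 5. The layer sizes, the treatment of rating as a 5-class problem, and the class name `TinyNCF` are assumptions for illustration; the weights are random and untrained, and the matrix-factorization branch of the original NCF design is omitted.

```python
import math
import random


def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dense(vec, weights, bias):
    """Fully connected layer; weights has shape (outputs, inputs)."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def softmax(vec):
    peak = max(vec)
    exps = [math.exp(v - peak) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]


class TinyNCF:
    """Embedding lookup + one hidden layer + softmax over rating classes."""

    def __init__(self, n_users, n_items, embed=4, hidden=8, n_classes=5, seed=0):
        rng = random.Random(seed)
        self.user_embed = make_matrix(n_users, embed, rng)
        self.item_embed = make_matrix(n_items, embed, rng)
        self.w1, self.b1 = make_matrix(hidden, 2 * embed, rng), [0.0] * hidden
        self.w2, self.b2 = make_matrix(n_classes, hidden, rng), [0.0] * n_classes

    def predict_proba(self, user, item):
        x = self.user_embed[user] + self.item_embed[item]  # concatenate embeddings
        h = relu(dense(x, self.w1, self.b1))
        return softmax(dense(h, self.w2, self.b2))

    def predict_rating(self, user, item):
        probs = self.predict_proba(user, item)
        return 1 + probs.index(max(probs))  # class 0 maps to rating 1


model = TinyNCF(n_users=3, n_items=4)
probs = model.predict_proba(0, 1)
rating = model.predict_rating(2, 3)
```

Training would adjust the embeddings and layer weights in mini-batches, which is where the batch size listed above comes in.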
29. Big Data Artificial Intelligence Center (BigDAI)
Summary: Performance
30. Big Data Artificial Intelligence Center (BigDAI)
Summary: Mean Absolute Error
31. Big Data Artificial Intelligence Center (BigDAI)
Summary
Introduction to Big Data
Distributed Deep Learning on Spark
Higher performance
– More accurate
• 55% more accurate (MAE 0.693 vs. 1.55 for Spark ML)
– Much faster than the single server
• Faster than Spark ML (CV)
Leveraging Big Data Engineering, Analysis, Science
– Great
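The accuracy comparison rests on Mean Absolute Error (MAE), where lower is better. The sketch below defines MAE and checks that the reported best values (1.55 for Spark ML ALS, 0.693 for NCF) are consistent with a relative reduction of about 55%. Reading "55% more accurate" as a relative MAE reduction is an assumption, and the three-element example data is made up.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between true and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline, improved):
    """Fraction by which the error drops relative to the baseline."""
    return (baseline - improved) / baseline


# Made-up example: errors of 1, 0, and 2 stars average to 1.0.
example_mae = mean_absolute_error([5, 3, 4], [4, 3, 2])

# Best MAE values reported in the talk.
als_mae, ncf_mae = 1.55, 0.693
reduction = relative_reduction(als_mae, ncf_mae)
```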
32. Big Data Artificial Intelligence Center (BigDAI)
Questions?
33. Big Data Artificial Intelligence Center (BigDAI)
References
1. “Rating Prediction using Deep Learning and Spark”, Monika Mishra, Mingoo Kang, Jongwook Woo, The 11th International Conference on Internet (ICONI 2019), Dec 15-18, 2019, Hanoi, Vietnam
2. “BigDL: Bringing Ease of Use of Deep Learning for Apache Spark”, Jason Dai, Radhika Rangarajan, Databricks, Spark Summit 2017
3. “Building Deep Learning Applications for Big Data: An Introduction to Analytics Zoo: Distributed TensorFlow, Keras and BigDL on Apache Spark”, Jennie Wang, Guoqiong Song, CVPR 2018, June 18-22, 2018, Salt Lake City, Utah
4. “Building Deep Learning Applications for Big Data: An Introduction to Analytics Zoo: Distributed TensorFlow, Keras and BigDL on Apache Spark”, Jason Dai, AAAI 2019 Tutorial Forum, Thirty-Third Conference on Artificial Intelligence, January 27-28, 2019, Honolulu, Hawaii, USA
5. “User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL”, Luyang Wang, Guoqiong Song, Jing (Nicole) Kong, Maneesha Bhalla, Strata Data Conference 2019, March 25-28, 2019, San Francisco, CA
6. “Leveraging NLP and Deep Learning for Document Recommendation in the Cloud”, Guoqiong Song, Spark + AI Summit 2019