Jongwook Woo
HiPIC
CalStateLA
The 11th International Conference on
Internet (ICONI) 2019
Dec 16 2019
Jongwook Woo, PhD, jwoo5@calstatela.edu
Monika Mishra, Mingoo Kang+
+Hanshin University, Korea
Big Data AI Center (BigDAI)
California State University Los Angeles
Rating Prediction using Deep Learning and Spark
Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
• Myself
• Introduction To Big Data
• Deep Learning in Big Data
• Rating Prediction in Distributed Deep Learning
Myself
Experience:
Since 2002, Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits
Collaboration with HDP, CDH, Oracle, Amazon using Hadoop Big Data
https://www.cloudera.com/more/customers/csula.html
New Technology: Big Data
What is Big Data? Data or Systems?
Large-Scale Data?
– Many people see Big Data only from the data point of view
– 3 Vs, 5 Vs
Systems?
– YES
Big Data [1]
New computer systems to store and compute large-scale data
– How to store and process large-scale data
• Not only about computing power
• Parallel Distributed Computing Systems => Data-Intensive Super Computer
– The data set does not need to be tera- or petabytes in size
• Linearly scalable
Data Handling: Traditional Way
Becomes too Expensive
Data Handling: Another Way
Not Expensive
From the 2017 Korean blockbuster movie “The Fortress” (남한산성)
http://blog.naver.com/PostView.nhn?blogId=dosims&logNo=221127053677
1409 (9th year of King Taejong): Choe Hae-san (崔海山); father: Choe Mu-seon (崔茂宣)
[Source] Joseon's Secret Weapon: Chongtonggi Hwacha (銃筒機火車) | Author: Dosim (도심)
Super Computer vs Big Data vs Cloud
Traditional Super Computer
– Cluster for Compute
– Cluster for Store (Parallel File Systems: Lustre, PVFS, GPFS)
Big Data (Hadoop, Spark, Distributed Deep Learning)
– Cluster for Compute and Store (Distributed File Systems: HDFS, GFS)
However, Cloud Computing adopts the separated compute/store architecture, with a high-speed network and object storage
Big Data: Linearly Scalable
Some people question whether a system that handles only a 1 ~ 3 GB data set is Big Data
Well… add more servers to the Big Data platform as the data grows in the future
– It is linearly scalable once built
– Ideally, n times more computing power
Data Size: < 3 GB => add n servers => Data Size: > 200 TB
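The "n times more computing power, ideally" point can be sketched as simple arithmetic. This is a minimal illustration under the assumption of perfect linear scaling (no coordination or network overhead); the function names and the example numbers are hypothetical and do not come from the talk.

```python
import math


def ideal_runtime(base_minutes: float, base_servers: int, servers: int) -> float:
    """Runtime under perfect linear scaling: work is split evenly across servers."""
    if base_servers < 1 or servers < 1:
        raise ValueError("server counts must be positive")
    return base_minutes * base_servers / servers


def servers_for_data(base_servers: int, base_gb: float, target_gb: float) -> int:
    """Servers needed to keep runtime constant when the data grows (ideal case)."""
    if base_gb <= 0 or target_gb <= 0:
        raise ValueError("data sizes must be positive")
    return math.ceil(base_servers * target_gb / base_gb)


# Hypothetical numbers: a job that takes 30 minutes on 3 servers.
doubled_cluster_minutes = ideal_runtime(30.0, base_servers=3, servers=6)
# Hypothetical growth: data doubles from 3 GB to 6 GB on a 3-server cluster.
needed_servers = servers_for_data(3, base_gb=3.0, target_gb=6.0)
```

Real clusters fall short of this ideal, which is why the slide says "ideally".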
Data Scale Driving: Deep Learning Process
Deep Learning and Massive Data [3]
“Machine Learning Yearning” Andrew Ng 2016
Big Data and Deep Learning
Deep Learning and Massive Data
– The larger the data set, the higher the accuracy
A single GPU server is needed, along with the following services
– Data Engineering, Data Analysis, Data Science
• Do we need to build new systems for these services?
– Deep Learning is isolated from Big Data
• For Data Engineering, Data Analysis, Data Science
Another Gap between Deep Learning and Big Data Communities [6]
Deep learning experts <= The Gap => Big Data Engineers, Scientists, Analysts, etc.
Leveraging Big Data Cluster
An existing Big Data cluster holds a massive data set, but Deep Learning is done without using the Big Data cluster
Single GPU server for Deep Learning?
– Too slow in data migration from the Big Data cluster
– The single server can fail
Distributed Deep Learning in Big Data
Big Data Cluster
Already built on site
– and mature for Data Engineering, Data Analysis, Data Science
Can we use the existing Big Data cluster for Deep Learning?
– Can we integrate Deep Learning into this Big Data cluster?
Leveraging Big Data Cluster
Existing Big Data cluster
– Big Data Engineering
– Big Data Analysis
– Big Data Science
Distributed Deep Learning
– Integrate Deep Learning into the cluster
– No data migration is needed, and it can leverage the cluster's parallel computing and the existing large-scale data
Distributed Deep Learning in Big Data
Integrate deep learning into the Big Data cluster
Leverage existing Hadoop/Spark clusters to run deep learning applications
– Use the existing SW tools and HW infrastructure of Hadoop/Spark
Big Data Hadoop/Spark cluster
– Where the data are stored
• Computing Engines: Spark, MapReduce
– Integrate deep learning
• Add deep learning functionality to the Big Data (Spark) cluster
• Can use the existing Big Data services
– ETL, data warehouse, data analysis, data prediction
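To make "training where the data are stored" concrete, here is a toy sketch of synchronous data-parallel training: each data partition computes a gradient locally, and only the small gradient values are combined, so the raw rows never leave their partition. This is plain Python, not the Spark or Analytics Zoo API, and the slides do not state that the talk's stack works exactly this way; the model (a one-parameter linear fit), the data, and all names are assumptions made for illustration.

```python
from typing import List, Tuple

Partition = List[Tuple[float, float]]  # (x, y) rows held by one worker


def local_gradient(w: float, part: Partition) -> Tuple[float, int]:
    """Sum of d/dw (w*x - y)^2 over one partition, plus the partition's row count."""
    grad_sum = sum(2.0 * (w * x - y) * x for x, y in part)
    return grad_sum, len(part)


def train(partitions: List[Partition], lr: float = 0.05, steps: int = 200) -> float:
    """Synchronous data-parallel gradient descent on the model y = w * x."""
    w = 0.0
    for _ in range(steps):
        # On a real cluster each call would run on the worker that stores the partition.
        results = [local_gradient(w, part) for part in partitions]
        total_grad = sum(g for g, _ in results)
        total_rows = sum(n for _, n in results)
        # The driver averages the gradients and updates the shared weight.
        w -= lr * total_grad / total_rows
    return w


# Made-up data following y = 2x, split across three hypothetical workers.
partitions = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(4.0, 8.0), (5.0, 10.0)],
]
learned_w = train(partitions)
```

The design point is that only gradients and weights cross the network, which is what removes the data-migration step the earlier slides complain about.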
AWS Review Dataset
Dataset: https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
Products reviewed between 2005 and 2015 are analyzed
Countries considered: US, UK, FR, DE
Total no. of product reviews: 9.57 million
File Size: 5.26 GB
Number of Files: 7
File Format: TSV (Tab Separated Values), CSV (Comma Separated Values)
Predictive Analysis
Prediction of rating
– An important measure for purchasing and selling
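A rating-prediction model needs (user, item, rating) triples, so a first step is to pull those out of the TSV files. The sketch below does that with the standard library on a tiny inline sample. The column names `customer_id`, `product_id`, and `star_rating` are assumptions about the file header, and the sample rows are invented for illustration; neither is taken from the slides.

```python
import csv
import io
from typing import List, TextIO, Tuple

# Invented sample rows; the real files are far larger and have more columns.
SAMPLE_TSV = (
    "marketplace\tcustomer_id\tproduct_id\tstar_rating\n"
    "US\t1001\tB000A\t5\n"
    "UK\t1002\tB000B\t3\n"
    "DE\t1001\tB000B\tnot_a_number\n"
)


def load_ratings(handle: TextIO) -> List[Tuple[str, str, float]]:
    """Return (customer, product, rating) triples, skipping malformed rows."""
    # QUOTE_NONE: treat quote characters in review text as ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    triples = []
    for row in reader:
        try:
            triple = (row["customer_id"], row["product_id"], float(row["star_rating"]))
        except (KeyError, TypeError, ValueError):
            continue  # missing column or non-numeric rating
        triples.append(triple)
    return triples


ratings = load_ratings(io.StringIO(SAMPLE_TSV))
```

On a cluster the same extraction would be done by the distributed engine rather than a single process; this only shows the shape of the data the models consume.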
Experimental System Specification
Cluster
– Amazon AWS EMR: Hadoop, Spark
– Analytics Zoo
AWS EMR:
Instances: r3.2xlarge
Number of Nodes: 3
Memory size:
– 183 GB (= 61 GB x 3)
CPU:
– 8 vCPU
– CPU speed: 3.1 GHz
Storage:
– 960 GB (= 2 x 160 GB x 3)
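The cluster totals above are per-node figures multiplied by the node count, which can be checked with a few lines. The per-node values come from the slide; treating "8 vCPU" as a per-node figure (and therefore 24 vCPUs in total) is an assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class NodeSpec:
    memory_gb: int
    vcpus: int      # assumed to be per node
    disks: int
    disk_gb: int


def cluster_totals(node: NodeSpec, nodes: int) -> Dict[str, int]:
    """Aggregate per-node resources over a homogeneous cluster."""
    return {
        "memory_gb": node.memory_gb * nodes,
        "vcpus": node.vcpus * nodes,
        "storage_gb": node.disks * node.disk_gb * nodes,
    }


# Per-node values as listed on the slide, for a 3-node cluster.
totals = cluster_totals(NodeSpec(memory_gb=61, vcpus=8, disks=2, disk_gb=160), nodes=3)
```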
Big Data Storage Flow Diagram
Big Data Prediction with DDL
DDL: Distributed Deep Learning
– TensorFlow
– Distributed training and inference in the Spark cluster
Spark ML and DDL [2-5]
Deep Learning in Spark cluster
– Distributed Deep Learning (DDL)
– DDL lib
Spark ML and DDL
Spark ML
ALS (Alternating Least Squares) algorithm
– Feature and Parameter Engineering
– Generalization
• Train Validation Split
• Cross Validation
– Performance
• 5 – 36 minutes
• MAE: 1.55, 1.574
DDL: Distributed Deep Learning
Neural Collaborative Filtering (NCF)
– A neural network recommendation system
– Various batch sizes
– Performance
• 16 – 33 minutes
• MAE: 0.693 ~ 0.7036
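Both models are compared by MAE (mean absolute error), the average of |actual rating - predicted rating| over all test reviews; lower is better. A minimal sketch of the metric follows. The example ratings are made up, and this is not the evaluation code used in the study.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up star ratings: absolute errors are 1, 0, 2, 1, so the mean is 1.0.
example_mae = mean_absolute_error([5, 3, 4, 1], [4, 3, 2, 2])
```

On a 1 to 5 star scale, an MAE near 0.7 means predictions are off by less than one star on average, while an MAE near 1.55 means they are off by about a star and a half.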
Summary: Performance
Summary: Mean Absolute Error
Summary
Introduction to Big Data
Distributed Deep Learning on Spark
Higher performance
– More accurate
• MAE about 55% lower than Spark ML ALS (0.693 ~ 0.7036 vs 1.55 ~ 1.574)
– Much faster than the single server
• Faster than Spark ML with Cross Validation (CV)
Leveraging Big Data Engineering, Analysis, Science
– Great
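The "55% more accurate" figure is consistent with the relative reduction in MAE between the two approaches. The sketch below checks this from the MAE values reported on the earlier slide; which ALS run is paired with which NCF run is an assumption, so both extremes are computed.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# MAE values from the slides: ALS 1.55 and 1.574, NCF 0.693 to 0.7036.
largest_gain = relative_reduction(1.574, 0.693)    # worst ALS vs best NCF
smallest_gain = relative_reduction(1.55, 0.7036)   # best ALS vs worst NCF
```

Both values come out at roughly 0.55 to 0.56, in line with the 55% on the summary slide.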
Questions?
References
1. “Rating Prediction using Deep Learning and Spark”, Monika Mishra, Mingoo Kang, Jongwook Woo, The 11th International Conference on Internet (ICONI 2019), Dec 15-18 2019, Hanoi, Vietnam
2. “BigDL: Bringing Ease of Use of Deep Learning for Apache Spark”, Jason Dai, Radhika Rangarajan, Databricks, Spark Summit 2017
3. “Building Deep Learning Applications for Big Data: An Introduction to Analytics Zoo: Distributed TensorFlow, Keras and BigDL on Apache Spark”, Jennie Wang, Guoqiong Song, CVPR 2018, Salt Lake City, Utah, June 18-22 2018
4. “Building Deep Learning Applications for Big Data: An Introduction to Analytics Zoo: Distributed TensorFlow, Keras and BigDL on Apache Spark”, Jason Dai, AAAI 2019 Tutorial Forum, Thirty-Third Conference on Artificial Intelligence, January 27 – 28, 2019, Honolulu, Hawaii, USA
5. “User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL”, Luyang Wang, Guoqiong Song, Jing (Nicole) Kong, Maneesha Bhalla, Strata Data Conference 2019, March 25-28, 2019, San Francisco, CA
6. “Leveraging NLP and Deep Learning for Document Recommendation in the Cloud”, Guoqiong Song, Spark + AI Summit 2019

Rating Prediction using Deep Learning and Spark

  • 1. Jongwook Woo HiPIC CalStateLA The 11th International Conference on Internet (ICONI) 2019 Dec 16 2019 Jongwook Woo, PhD, jwoo5@calstatela.edu Monika Mishra, Mingoo Kang+ +Hanshin University, Korea Big Data AI Center (BigDAI) California State University Los Angeles Rating Prediction using Deep Learning and Spark
  • 2. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Contents  Myself  Introduction To Big Data  Deep Learning in Big Data  Rating Prediction in Distributed Deep Learning
  • 3. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Myself Experience: Since 2002, Professor at California State University Los Angeles – PhD in 2001: Computer Science and Engineering at USC
  • 4. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Myself: S/W Development Lead http://www.mobygames.com/game/windows/matrix-online/credits
  • 5. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Collaboration with HDP, CDH, Oracle, Amazon using Hadoop Big Data https://www.cloudera.com/more/customers/csula.html
  • 6. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Contents  Myself  Introduction To Big Data  Deep Learning in Big Data  Rating Prediction in Distributed Deep Learning
  • 7. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA New Technology: Big Data What is Big Data? Data or Systems? Large Scale Data? – Many people only see the data point of view – 3 Vs, 5Vs Systems? – YES Big Data [1] New Computer Systems to store and compute Large-Scale data – How to Store and Process large scale data • Not only for computing power • Parallel Distributed Computing Systems => Data Intensive Super Computer – Does not need to be Tera-/Peta-Bytes of data set • Linearly Scalable
  • 8. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Data Handling: Traditional Way
  • 9. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Data Handling: Traditional Way Becomes too Expensive
  • 10. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Data Handling: Another Way Not Expensive From 2017 Korean Blockbuster Movie, “The Fortress” (남한산성)
  • 11. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Data Handling: Another Way Not Expensive http://blog.naver.com/PostView.nhn?blogId=dosims&logNo=221127053677 1409년(태종 9) 최해산(崔海山), 아버지 최무선(崔茂宣) [출처] 조선의 비밀 병기 : 총통기 화차(銃筒機火車)|작성자 도심
  • 12. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Super Computer vs Big Data vs Cloud Traditional Super Computer (Parallel File Systems: Lustre, PVFS, GPFS) Cluster for Store Big Data (Hadoop, Spark, Distributed Deep Learning) Cluster for Compute and Store (Distributed File Systems: HDFS, GFS) However, Cloud Computing adopts this separated architecture: with High Speed N/W and Object Storage Cluster for Compute
  • 13. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Big Data: Linearly Scalable  Some people questions that the system to handle 1 ~ 3GB of data set is not Big Data Well…. add more servers as more data in the future in Big Data platform – it is linearly scalable once built – n time more computing power ideally Data Size: < 3 GB Data Size: 200 TB > Add n servers
  • 14. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Contents  Myself  Introduction To Big Data Deep Learning in Big Data Rating Prediction in Distributed Deep Learning
  • 15. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Data Scale Driving: Deep Learning Process Deep Learning and Massive Data [3] “Machine Learning Yearning” Andrew Ng 2016
  • 16. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Big Data and Deep Learning Deep Learning and Massive Data the larger the data set –The higher the accuracy is Single GPU server needed with the following service –Data Engineering, Data Analysis, Data Science • Need to build new systems for the services? –Deep Learning is isolated from Big Data • For Data Engineering, Data Analysis, Data Science
  • 17. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Deep learning experts The Gap Big Data Engineers, Scientists, Analysts, etc. Another Gap between Deep Learning and Big Data Communities [6]
  • 18. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Leveraging Big Data Cluster  Existing Big Data cluster with massive data set without using Big Data Too slow in data migration and single server fails Single GPU server for Deep Learning? Big Data Cluster
  • 19. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Distributed Deep Learning in Big Data Big Data Cluster Already built in the site –and matured for Data Engineering, Data Analysis, Data Science Can we use the existing Big Data cluster for Deep Learning? –Can we integrate Deep Learning to this Big Data Cluster?
  • 20. Leveraging Big Data Cluster. Existing Big Data cluster: Big Data Engineering, Big Data Analysis, Big Data Science. Distributed Deep Learning: integrate Deep Learning into the cluster. This needs no data migration and can leverage the cluster's parallel computing and its existing large-scale data.
  • 21. Distributed Deep Learning in Big Data. Integrate deep learning into the Big Data cluster: leverage existing Hadoop/Spark clusters to run deep learning applications, using the existing software tools and hardware infrastructure of Hadoop/Spark. The Big Data Hadoop/Spark cluster is where the data are stored (computing engines: Spark, MapReduce). Integrating deep learning adds deep learning functionality to the Big Data (Spark) cluster and can use the existing Big Data services: ETL, data warehouse, data analysis, data prediction.
  • 22. Contents: Myself; Introduction to Big Data; Deep Learning in Big Data; Rating Prediction in Distributed Deep Learning
  • 23. AWS Review Dataset. Dataset: https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt. Products reviewed between 2005 and 2015 are analyzed. Countries considered: US, UK, FR, DE. Total number of product reviews: 9.57 million. File size: 5.26 GB. Number of files: 7. File format: TSV (Tab-Separated Values), CSV (Comma-Separated Values). Predictive analysis: prediction of ratings, an important measure for purchasing and selling.
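As a rough sketch of how review rows in TSV form could be reduced to (customer, product, rating) triples for rating prediction: the column names `marketplace`, `customer_id`, `product_id`, and `star_rating` are an assumption about the file header, and the sample rows are made up. The talk itself runs this step on a Spark cluster, not in plain Python.

```python
import csv
import io

# Hypothetical sample rows; the column names are an assumed TSV header.
SAMPLE_TSV = (
    "marketplace\tcustomer_id\tproduct_id\tstar_rating\n"
    "US\t1001\tB00A\t5\n"
    "UK\t1002\tB00B\t3\n"
    "JP\t1003\tB00C\t4\n"
)

# Countries considered in the slides.
COUNTRIES = {"US", "UK", "FR", "DE"}


def load_ratings(tsv_text: str) -> list:
    """Parse TSV text and keep (customer, product, rating) for chosen countries."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    ratings = []
    for row in reader:
        if row["marketplace"] in COUNTRIES:
            ratings.append(
                (row["customer_id"], row["product_id"], float(row["star_rating"]))
            )
    return ratings


ratings = load_ratings(SAMPLE_TSV)
print(ratings)
```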
  • 24. Experimental System Specification. Cluster: Amazon AWS EMR with Hadoop, Spark, and Analytics Zoo. AWS EMR instances: r3.2xlarge. Number of nodes: 3. Memory size: 183 GB (= 61 GB x 3). CPU: 8 vCPU; CPU speed: 3.1 GHz. Storage: 960 GB (= 2 x 160 GB x 3).
  • 25. Big Data Storage Flow Diagram
  • 26. Big Data Prediction with DDL. DDL: Distributed Deep Learning. TensorFlow distributed training and inference in a Spark cluster.
  • 27. Spark ML and DDL [2-5]. Deep Learning in a Spark cluster: Distributed Deep Learning (DDL). Diagram labels: DDL, DDL lib, Deep Learning in Spark.
  • 28. Spark ML and DDL. Spark ML: ALS (Alternating Least Squares) algorithm; feature and parameter engineering; generalization with Train Validation Split and Cross Validation; performance: 5 to 36 minutes; MAE: 1.55, 1.574. DDL (Distributed Deep Learning): Neural Collaborative Filtering (NCF), a neural network recommendation system; various batch sizes; performance: 16 to 33 minutes; MAE: 0.693 to 0.7036.
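MAE (mean absolute error) is the metric used to compare the two approaches: the average absolute gap between predicted and actual ratings, so lower is better. A minimal sketch of the standard definition follows; the sample ratings are made up, and this is not the code used in the talk.

```python
def mean_absolute_error(actual: list, predicted: list) -> float:
    """Average of |actual - predicted| over all rating pairs."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total / len(actual)


# Hypothetical star ratings and model predictions.
actual_ratings = [5.0, 3.0, 4.0, 1.0]
predicted_ratings = [4.5, 3.5, 2.0, 1.0]

mae = mean_absolute_error(actual_ratings, predicted_ratings)
print(mae)
```

On a 1 to 5 star scale, an MAE near 0.7 means predictions are off by less than one star on average, while an MAE near 1.55 means they are off by about a star and a half.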
  • 29. Summary: Performance
  • 30. Summary: Mean Absolute Error
  • 31. Summary. Introduction to Big Data. Distributed Deep Learning on Spark gives higher performance: it is more accurate (55% more) and much faster than a single server (faster than Spark ML (CV)). Leveraging Big Data Engineering, Analysis, and Science: great.
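The "55% more accurate" figure is consistent with the MAE values reported for Spark ML (1.55, 1.574) and DDL (0.693, 0.7036), assuming it means the relative reduction in MAE. That reading is an assumption, and pairing the best values with each other and the worst with each other is also only illustrative.

```python
def relative_mae_reduction(baseline_mae: float, new_mae: float) -> float:
    """Fraction by which new_mae is lower than baseline_mae."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return (baseline_mae - new_mae) / baseline_mae


# MAE values from the slides: Spark ML (ALS) versus DDL (NCF).
best_case = relative_mae_reduction(1.55, 0.693)
worst_case = relative_mae_reduction(1.574, 0.7036)

print(f"{best_case:.1%}, {worst_case:.1%}")
```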
  • 32. Questions?
  • 33. References
      1. “Rating Prediction using Deep Learning and Spark”, Monika Mishra, Mingoo Kang, Jongwook Woo, The 11th International Conference on Internet (ICONI 2019), Dec 15-18, 2019, Hanoi, Vietnam
      2. “BigDL: Bringing Ease of Use of Deep Learning for Apache Spark”, Jason Dai, Radhika Rangarajan, Databricks, Spark Summit 2017
      3. “Building Deep Learning Applications for Big Data: An Introduction to Analytics Zoo: Distributed TensorFlow, Keras and BigDL on Apache Spark”, Jennie Wang, Guoqiong Song, CVPR 2018, June 18-22, 2018, Salt Lake City, Utah
      4. “Building Deep Learning Applications for Big Data: An Introduction to Analytics Zoo: Distributed TensorFlow, Keras and BigDL on Apache Spark”, Jason Dai, AAAI 2019 Tutorial Forum, Thirty-Third Conference on Artificial Intelligence, January 27-28, 2019, Honolulu, Hawaii, USA
      5. “User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL”, Luyang Wang, Guoqiong Song, Jing (Nicole) Kong, Maneesha Bhalla, Strata Data Conference 2019, March 25-28, 2019, San Francisco, CA
      6. “Leveraging NLP and Deep Learning for Document Recommendation in the Cloud”, Guoqiong Song, Spark + AI Summit 2019