
Rating Prediction using Deep Learning and Spark


Distributed deep learning on Spark, using Analytics Zoo on AWS, to predict ratings from Amazon review data. Published as "Rating Prediction using Deep Learning and Spark" at the 11th International Conference on Internet (ICONI 2019), Hanoi, Vietnam, Dec 15-18, 2019.

Published in: Data & Analytics


  1. Rating Prediction using Deep Learning and Spark. Jongwook Woo, HiPIC, CalStateLA. The 11th International Conference on Internet (ICONI) 2019, Dec 16 2019. Jongwook Woo, PhD (jwoo5@calstatela.edu), Monika Mishra, Mingoo Kang+ (+Hanshin University, Korea). Big Data AI Center (BigDAI), California State University Los Angeles.
  2. Big Data Artificial Intelligence Center (BigDAI), Jongwook Woo, CalStateLA. Contents: Myself; Introduction to Big Data; Deep Learning in Big Data; Rating Prediction in Distributed Deep Learning
  3. Myself. Experience: Professor at California State University Los Angeles since 2002; PhD in Computer Science and Engineering at USC, 2001
  4. Myself: S/W Development Lead. http://www.mobygames.com/game/windows/matrix-online/credits
  5. Collaboration with HDP, CDH, Oracle, Amazon using Hadoop Big Data. https://www.cloudera.com/more/customers/csula.html
  6. Contents: Myself; Introduction to Big Data; Deep Learning in Big Data; Rating Prediction in Distributed Deep Learning
  7. New Technology: Big Data
     • What is Big Data: data or systems?
     • Large-scale data? Many people see only the data point of view (the 3 Vs, the 5 Vs).
     • Systems? Yes. Big Data [1] means new computer systems to store and compute large-scale data.
       - How to store and process large-scale data: not only computing power, but parallel distributed computing systems, i.e. a data-intensive supercomputer.
       - The data set does not need to be terabytes or petabytes, because the system is linearly scalable.
  8. Data Handling: Traditional Way
  9. Data Handling: Traditional Way Becomes Too Expensive
  10. Data Handling: Another Way, Not Expensive. From the 2017 Korean blockbuster movie "The Fortress" (남한산성)
  11. Data Handling: Another Way, Not Expensive. http://blog.naver.com/PostView.nhn?blogId=dosims&logNo=221127053677 1409 (9th year of King Taejong): Choe Hae-san (최해산, 崔海山); father: Choe Mu-seon (최무선, 崔茂宣). Source: "Joseon's Secret Weapon: the Chongtonggi Hwacha" (조선의 비밀 병기: 총통기 화차, 銃筒機火車), written by 도심
  12. Super Computer vs Big Data vs Cloud
      • Traditional Super Computer: a cluster for compute plus a separate cluster for storage (parallel file systems: Lustre, PVFS, GPFS)
      • Big Data (Hadoop, Spark, Distributed Deep Learning): one cluster for both compute and storage (distributed file systems: HDFS, GFS)
      • However, Cloud Computing adopts the separated architecture, with a high-speed network and object storage
  13. Big Data: Linearly Scalable
      • Some people question whether a system that handles only 1 to 3 GB of data counts as Big Data.
      • On a Big Data platform, add more servers as the data grows: once built, the platform is linearly scalable, ideally giving n times more computing power with n servers.
      • Diagram labels: "Data Size: < 3 GB", "Add n servers", "Data Size: 200 TB >"
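The "linearly scalable" claim on the slide above can be put as simple arithmetic. The sketch below assumes perfectly linear scaling, which the slide itself calls the ideal case; the function names and all figures are hypothetical and chosen only to show the calculation.

```python
import math


def ideal_runtime(base_minutes: float, base_servers: int, servers: int) -> float:
    """Runtime under ideal linear scaling: same work, spread over more servers."""
    if base_minutes < 0 or base_servers <= 0 or servers <= 0:
        raise ValueError("runtime must be non-negative and server counts positive")
    return base_minutes * base_servers / servers


def servers_needed(data_gb: float, gb_per_server: float) -> int:
    """Smallest number of servers whose combined storage holds the data."""
    if data_gb < 0 or gb_per_server <= 0:
        raise ValueError("data size must be non-negative and capacity positive")
    return max(1, math.ceil(data_gb / gb_per_server))


# Hypothetical figures, used only to illustrate the arithmetic.
doubled = ideal_runtime(30.0, base_servers=3, servers=6)  # twice the servers
small = servers_needed(3, gb_per_server=500)              # a 3 GB data set
large = servers_needed(200_000, gb_per_server=500)        # a 200 TB data set
print(doubled, small, large)
```

Real clusters fall short of this ideal because of coordination and network overhead, so the result is best read as an upper bound on the benefit of adding servers.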
  14. Contents: Myself; Introduction to Big Data; Deep Learning in Big Data; Rating Prediction in Distributed Deep Learning
  15. Data Scale Driving: Deep Learning Process. Deep Learning and Massive Data [3]. "Machine Learning Yearning", Andrew Ng, 2016
  16. Big Data and Deep Learning
      • Deep learning and massive data: the larger the data set, the higher the accuracy.
      • A single GPU server still needs the following services: data engineering, data analysis, data science. Do new systems need to be built for these services?
      • Deep learning is isolated from the Big Data systems used for data engineering, data analysis, and data science.
  17. Another Gap between Deep Learning and Big Data Communities [6]. Diagram: deep learning experts on one side, Big Data engineers, scientists, analysts, etc. on the other, with the gap between them
  18. Leveraging Big Data Cluster
      • An existing Big Data cluster holds a massive data set.
      • A single GPU server for deep learning, without using Big Data? Data migration is too slow, and a single server fails.
  19. Distributed Deep Learning in Big Data
      • The Big Data cluster is already built on site and is mature for data engineering, data analysis, and data science.
      • Can we use the existing Big Data cluster for deep learning? Can we integrate deep learning into this Big Data cluster?
  20. Leveraging Big Data Cluster
      • The existing Big Data cluster serves Big Data engineering, Big Data analysis, and Big Data science.
      • Distributed deep learning: integrate deep learning into the cluster. No data migration is needed, and it can leverage the cluster's parallel computing and existing large-scale data.
  21. Distributed Deep Learning in Big Data
      • Integrate deep learning into the Big Data cluster: leverage existing Hadoop/Spark clusters to run deep learning applications, using the existing SW tools and HW infrastructure of Hadoop/Spark.
      • Big Data Hadoop/Spark cluster
        - Where the data are stored; computing engines: Spark, MapReduce
        - Integrate deep learning: add deep learning functionality to the Big Data (Spark) cluster, which can keep using the existing Big Data services (ETL, data warehouse, data analysis, data prediction)
  22. Contents: Myself; Introduction to Big Data; Deep Learning in Big Data; Rating Prediction in Distributed Deep Learning
  23. AWS Review Dataset
      • Dataset: https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
      • Products reviewed between 2005 and 2015 are analyzed
      • Countries considered: US, UK, FR, DE
      • Total number of product reviews: 9.57 million
      • File size: 5.26 GB
      • Number of files: 7
      • File format: TSV (tab-separated values), CSV (comma-separated values)
      • Predictive analysis: prediction of rating, an important measure for purchasing and selling
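To make the dataset description concrete, here is a minimal sketch of turning review rows into (customer, product, rating) triples and keeping only 2005 to 2015. The column names are an assumption based on the public Amazon Customer Reviews TSV schema; the slides do not list them, and the three sample rows are invented. The talk itself processes the data on Spark, not with Python's csv module.

```python
import csv
import io
from typing import List, Tuple

# Assumed column layout of the public Amazon Customer Reviews TSV files.
HEADER = [
    "marketplace", "customer_id", "review_id", "product_id", "product_parent",
    "product_title", "product_category", "star_rating", "helpful_votes",
    "total_votes", "vine", "verified_purchase", "review_headline",
    "review_body", "review_date",
]

# Invented rows, for illustration only.
SAMPLE_ROWS = [
    ["US", "1001", "R1", "B0001", "11", "Item A", "Books", "5", "0", "0",
     "N", "Y", "Great", "Liked it", "2015-01-01"],
    ["UK", "1002", "R2", "B0002", "12", "Item B", "Music", "2", "1", "3",
     "N", "Y", "Poor", "Not for me", "2010-06-15"],
    ["DE", "1003", "R3", "B0001", "11", "Item A", "Books", "4", "0", "0",
     "N", "N", "Good", "Fine", "2003-03-09"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [HEADER] + SAMPLE_ROWS)


def load_ratings(tsv_text: str, first_year: int = 2005,
                 last_year: int = 2015) -> List[Tuple[str, str, int]]:
    """Return (customer_id, product_id, star_rating) for reviews in the year range."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    ratings = []
    for row in reader:
        year = int(row["review_date"][:4])  # dates assumed to be YYYY-MM-DD
        if first_year <= year <= last_year:
            ratings.append((row["customer_id"], row["product_id"],
                            int(row["star_rating"])))
    return ratings


ratings = load_ratings(SAMPLE_TSV)
print(ratings)
```

Triples of this shape are the usual input for collaborative-filtering models such as the ALS and NCF models compared later in the deck.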
  24. Experimental System Specification
      • Cluster: Amazon AWS EMR with Hadoop, Spark, Analytics Zoo
      • Instances: r3.2xlarge
      • Number of nodes: 3
      • Memory size: 183 GB (= 61 GB x 3)
      • CPU: 8 vCPU; CPU speed: 3.1 GHz
      • Storage: 960 GB (= 2 x 160 GB x 3)
  25. Big Data Storage Flow Diagram
  26. Big Data Prediction with DDL. DDL: Distributed Deep Learning. TensorFlow. Distributed training and inference in the Spark cluster
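The slide above names distributed training in a Spark cluster without showing the mechanics. One common scheme is synchronous data-parallel SGD: each worker computes a gradient on its own partition, the gradients are averaged, and every worker applies the same update. The sketch below illustrates that general idea on a toy one-parameter model; plain Python lists stand in for Spark partitions, and it is an assumption, not a statement from the slides, that the Analytics Zoo setup in the talk behaves this way in detail.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair for the toy model y = w * x


def partition(data: List[Sample], n_parts: int) -> List[List[Sample]]:
    """Split the data round-robin, standing in for partitions on cluster nodes."""
    return [data[i::n_parts] for i in range(n_parts)]


def local_gradient(w: float, part: List[Sample]) -> float:
    """Gradient of the mean of 0.5 * (w*x - y)^2 over one partition."""
    return sum((w * x - y) * x for x, y in part) / len(part)


def train(data: List[Sample], n_workers: int = 3,
          lr: float = 0.05, steps: int = 200) -> float:
    """Synchronous data-parallel SGD on a single weight."""
    parts = partition(data, n_workers)
    w = 0.0
    for _ in range(steps):
        # On a real cluster each gradient would be computed in parallel.
        grads = [local_gradient(w, p) for p in parts]
        w -= lr * sum(grads) / len(grads)  # average, then one shared update
    return w


# Toy data generated from y = 2x, so training should recover w close to 2.
data = [(float(x), 2.0 * x) for x in range(1, 7)]
w = train(data)
print(w)
```

Because every worker sees the same averaged gradient, the model stays identical across workers after each step, which is what lets training run where the data already lives instead of moving it to a single GPU server.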
  27. Spark ML and DDL [2-5]. Deep learning in the Spark cluster: Distributed Deep Learning (DDL). Diagram labels: DDL, DDL lib, Deep Learning in Spark
  28. Spark ML and DDL
      • Spark ML: ALS (Alternating Least Squares) algorithm
        - Feature and parameter engineering
        - Generalization: train-validation split, cross-validation
        - Performance: 5 to 36 minutes; MAE: 1.55, 1.574
      • DDL (Distributed Deep Learning): Neural Collaborative Filtering (NCF), a neural network recommendation system
        - Various batch sizes
        - Performance: 16 to 33 minutes; MAE: 0.693 to 0.7036
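Both models on the slide above are compared by MAE (mean absolute error): the average absolute difference between predicted and actual ratings, so lower is better. A minimal sketch of the metric follows; the ratings and predictions are made up for illustration and are not results from the talk.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Average of |actual - predicted| over all rated items."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up star ratings and model predictions.
actual = [5, 3, 4, 1]
predicted = [4.5, 3.5, 2.0, 1.0]

# Absolute errors are 0.5, 0.5, 2.0 and 0.0, so their mean is 0.75.
mae = mean_absolute_error(actual, predicted)
print(mae)
```

On a 1 to 5 star scale, an MAE near 0.7 means predictions are off by under one star on average, while an MAE near 1.55 means they are off by about a star and a half.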
  29. Summary: Performance
  30. Summary: Mean Absolute Error
  31. Summary
      • Introduction to Big Data
      • Distributed deep learning on Spark: higher performance
        - More accurate: 55% more
        - Much faster than a single server; faster than Spark ML (CV)
      • Leveraging Big Data engineering, analysis, and science: great
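The summary does not say how the 55% figure was computed. One reading that matches the MAE values reported on slide 28 is the relative reduction in MAE when moving from ALS (1.55, 1.574) to NCF (0.693, 0.7036). The sketch below checks that reading; treating the figure this way is an assumption.

```python
from itertools import product


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# MAE values as reported on slide 28.
als_mae = (1.55, 1.574)   # Spark ML, ALS
ncf_mae = (0.693, 0.7036)  # DDL, Neural Collaborative Filtering

# Compare every ALS figure with every NCF figure.
reductions = [relative_reduction(a, n) for a, n in product(als_mae, ncf_mae)]
for (a, n), r in zip(product(als_mae, ncf_mae), reductions):
    print(f"ALS {a} -> NCF {n}: {r:.1%} lower MAE")
```

Every pairing comes out at roughly 55%, so under this reading the claim is consistent with the reported MAE values.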
  32. Questions?
  33. References
      [1] "Rating Prediction using Deep Learning and Spark", Monika Mishra, Mingoo Kang, Jongwook Woo, The 11th International Conference on Internet (ICONI 2019), Dec 15-18, 2019, Hanoi, Vietnam
      [2] "BigDL: Bringing Ease of Use of Deep Learning for Apache Spark", Jason Dai, Radhika Rangarajan, Databricks, Spark Summit 2017
      [3] "Building Deep Learning Applications for Big Data: An Introduction to Analytics Zoo: Distributed TensorFlow, Keras and BigDL on Apache Spark", Jennie Wang, Guoqiong Song, CVPR 2018, June 18-22, 2018, Salt Lake City, Utah
      [4] "Building Deep Learning Applications for Big Data: An Introduction to Analytics Zoo: Distributed TensorFlow, Keras and BigDL on Apache Spark", Jason Dai, AAAI 2019 Tutorial Forum, Thirty-Third Conference on Artificial Intelligence, January 27-28, 2019, Honolulu, Hawaii, USA
      [5] "User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL", Luyang Wang, Guoqiong Song, Jing (Nicole) Kong, Maneesha Bhalla, Strata Data Conference 2019, March 25-28, 2019, San Francisco, CA
      [6] "Leveraging NLP and Deep Learning for Document Recommendation in the Cloud", Guoqiong Song, Spark + AI Summit 2019
