AI on Big Data

An introduction to Big Data and to deep learning with TensorFlow, and how Big Data platforms adopt deep learning.

Published in: Engineering
  1. Introduction to AI on Big Data
      • Jongwook Woo, PhD, jwoo5@calstatela.edu
      • High-Performance Information Computing Center (HiPIC), California State University Los Angeles
      • Dong-Eui University, College of Commerce and Economics, Department of Economics, Professor Im Dong-Soon (임동순)
      • May 29, 2018
  2. Contents
      • Myself
      • Introduction to Big Data
      • Artificial Intelligence
      • Artificial Intelligence and Big Data
      • Summary
  3. Myself: Experience
      • Since 2002: Professor at California State University Los Angeles
        - PhD in 2001: Computer Science and Engineering at USC
      • Since 1998: R&D consulting in Hollywood
        - Warner Bros (Matrix Online game), E!, citysearch.com, ARM, and others
        - Information search and integration with FAST, Lucene/Solr, Sphinx
        - Implemented eBusiness applications using J2EE and middleware
      • Since 2007: Exposed to Big Data at CitySearch.com
      • 2012 to present: Big Data academic partnerships for research and training
        - Amazon AWS, Microsoft Azure, IBM Bluemix
        - Databricks, Hadoop vendors
  4. Myself: S/W Development Lead
      • http://www.mobygames.com/game/windows/matrix-online/credits
  5. Myself: Experience (Cont'd)
      • Bringing Big Data R&D and training to Korea since 2009
      • Collaborating with the City of Los Angeles since 2016
        - Collect, search, and analyze city data
          * Spark, Hadoop, ElasticSearch, Solr, Java, Cloudera
      • Sept 2013: Samsung Advanced Technology Training Institute
      • Since 2008: Introducing Hadoop Big Data and education to universities and research centers
        - Yonsei, Gachon, DongEui
        - US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
        - Europe: Univ of Luxembourg
  6. Myself: Partners for Services
  7. Experience in Big Data
      • Collaboration
        - Council member of the IBM Spark Technology Center
        - City of Los Angeles for OpenHub and Open Data
        - Startup companies in Los Angeles
        - External collaborator and advisor in Big Data
          * IMSC of USC
          * Pennsylvania State University
          * The Big Link, Softzen, Wiken in Korea
      • Grants
        - Research and education grants from IBM Bluemix, Microsoft Windows Azure, and Amazon AWS
      • Partnership
        - Academic education partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  8. Myself: Public Partners
  9. Contents
      • Myself
      • Introduction to Big Data
      • Artificial Intelligence
      • Artificial Intelligence and Big Data
      • Summary
  10. Data Issues
      • Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes)
        - Driven by the web
        - Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games, …
      • Legacy approaches cannot handle it
        - Too big
        - Non-structured and semi-structured data
        - Too expensive
      • New systems are needed
        - Inexpensive
  11. Two Cores in Big Data
      • How to store Big Data
      • How to compute Big Data
      • Google
        - How to store Big Data: GFS, a distributed file system on inexpensive commodity computers
        - How to compute Big Data: MapReduce, parallel computing with inexpensive computers
        - Own supercomputers
        - Published papers in 2003 and 2004
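The MapReduce model named on slide 11 can be illustrated with a minimal single-machine sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration, and a real cluster would run the map and reduce calls on many machines in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different machine; here it is sequential.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data on Hadoop", "Big Data on Spark"])
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be spread over inexpensive machines without shared state, which is the property the slide emphasizes.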
  12. What is Hadoop?
      • Hadoop founder: Doug Cutting
      • Apache committer: Lucene, Nutch, …
  13. Supercomputer vs Hadoop
      • Figure: "Parallel vs. Distributed File Systems" by Michael Malak, updated by Jongwook Woo
      • Figure labels: cluster for storage, cluster for compute/storage, cluster for compute
  14. Hadoop Cluster: Logical Diagram
      • Figure: a web browser reaches the cluster monitor (CM/Ambari) over HTTP(S); the cluster consists of nodes that each run an agent, Hadoop, and HDFS; services shown include Hive, ZooKeeper, and Impala
  15. Hadoop Ecosystems
      • http://dawn.dbsdataprojects.com/tag/hadoop/
  16. Definition: Big Data
      • Inexpensive frameworks, built as distributed parallel systems, that can store large-scale data and process it in parallel [1, 2]
      • Hadoop
        - An inexpensive supercomputer
        - More accessible to the public than traditional supercomputers
          * You can store data and run your applications in university labs, small companies, and research centers
      • Others
        - NoSQL DB (Cassandra, MongoDB, Redis, HBase)
        - ElasticSearch
  17. NoSQL DB
      • Key-value: Memcached, MemcacheDB, Redis
      • Column oriented (column family store): BigTable, HBase, Cassandra (key-value column oriented), Amazon SimpleDB
      • Document oriented: MongoDB, Couchbase, CouchDB
      • Graph oriented: Neo4j, InfiniteGraph
  18. Alternative to Hadoop MapReduce
      • Limitations of MapReduce
        - Hard to program in Java
        - Batch processing: not interactive
        - Disk storage for intermediate data: performance issue
      • Spark, by UC Berkeley AMPLab
        - In-memory storage for intermediate data
        - 20 to 100 times faster than MapReduce, which relies on network and disk
        - Good for machine learning: iterative algorithms
  19. Spark and Hadoop
      • Spark
        - File system: Tachyon
        - Resource manager: Mesos
      • But Hadoop has been dominating the market
        - Spark is being integrated into Hadoop clusters
      • Cloud computing
        - Amazon AWS, Azure HDInsight, IBM Bluemix
          * Object Storage, S3
      • Hadoop vendors
        - HDP, CDH
      • Databricks: Spark on AWS and Azure
        - No Hadoop ecosystem
  20. Sentiment Map of AlphaGo
      • Figure legend: positive, negative
  21. Sentiment Map of Lee Se-Dol vs AlphaGo
      • YouTube video: "alphago sentiment" by Google
      • The sentiment of the world by geography and time: https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
  22. K-Election 2017 (April 29 to May 9)
  23. Mapping of crimes that occurred within 5 miles of CalStateLA, UCLA, and USC in 2015
  24. Review count of popular sub-categories of business
  25. Businesses popular within 5 miles of CalStateLA, USC, and UCLA
  26. Average Undergraduates Receiving PELL GRANT in Each College
      • East Georgia State College: $2,854; Avg. PELL grant: 97.285%
  27. Big Data Analysis Flow
      • Data collection
        - Batch API: Yelp, Google
        - Streaming: Twitter, Apache NiFi, Kafka, Storm
        - Open data: government
      • Data storage
        - HDFS, S3, Object Storage, NoSQL DB (Couchbase), …
      • Data filtering
        - Hive, Pig
      • Data analysis and science
        - Hive, Pig, Spark, BI tools (Datameer, Qlik, Tableau, …)
      • Data visualization
        - Qlik, Datameer, Excel PowerView
      • Stages: Big Data Engineering, Big Data Analysis, Big Data Science, Data Visualization
  28. Terms
      • What we know
        - Data engineering: collect, clean, and filter data
        - Data analysis: find insights from the data
        - Data science (predictive analysis): predict the trend or pattern from the existing data
      • Do we know?
        - Big Data analysis and science: using Big Data for data analysis and science
          * Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch, …
        - For massive data sets
          * How to store and compute?
  29. NoSQL DB
      • Key-value: Memcached, MemcacheDB, Redis
      • Column oriented (column family store): BigTable, HBase, Cassandra (key-value column oriented), Amazon SimpleDB
      • Document oriented: MongoDB, Couchbase, CouchDB
      • Graph oriented: Neo4j, InfiniteGraph
  30. Contents
      • Myself
      • Introduction to Big Data
      • Artificial Intelligence
      • Artificial Intelligence and Big Data
      • Summary
  31. AI and Deep Learning
      • Figure labels: Artificial Intelligence, Machine Learning, Neural Networks, Deep Learning
      • Deep learning
        - Sub-field of neural networks, machine learning, and artificial intelligence
        - Deep learning is neural networks with many layers
        - Inspired by, but not limited to, the architecture of the human brain
      • Source: © 2017 SAP SE or an SAP affiliate company. All rights reserved.
  32. Deep Learning and TensorFlow
      • Development led by Google
      • Open-source library for deep learning
        - Define model structures; library for efficient execution
      • Define once, run anywhere
        - Can run on CPUs, GPUs, and many devices
        - NVIDIA, Google GPU
      • Can be used in Python and many other languages
      • Built for large-scale machine learning development and operations
  33. Deep Learning [9]
      • Neural networks
      • Multi-layer perceptron
      • Convolutional neural networks
  34. Convolutional Neural Networks
      • Good at problems like image classification
  35. Recurrent Neural Networks (RNN)
      • Has 3 types of parameters
        - W: hidden weights (applied to the input)
        - U: hidden-to-hidden weights
        - V: hidden-to-label weights
      • Good for text processing such as sentiment analysis
      • My Projects > sapDeepLearningTensorflow > Week_03_Unit_05_S
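How the three RNN parameter matrices interact can be sketched in plain Python. The update rule below, h_t = tanh(W·x_t + U·h_{t-1}) with output y_t = V·h_t, is a common textbook form and an assumption here, as is reading W as the matrix applied to the input; the slide only names the three matrices. All weights and inputs are made-up toy values.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(W, U, V, x_t, h_prev):
    """One time step: new hidden state from the input and the previous hidden state."""
    pre_activation = [a + b for a, b in zip(matvec(W, x_t), matvec(U, h_prev))]
    h_t = [math.tanh(p) for p in pre_activation]
    y_t = matvec(V, h_t)  # label scores read out from the hidden state
    return h_t, y_t


def rnn_forward(W, U, V, inputs, hidden_size):
    """Run the same W, U, V over every element of the sequence."""
    h = [0.0] * hidden_size
    outputs = []
    for x_t in inputs:
        h, y = rnn_step(W, U, V, x_t, h)
        outputs.append(y)
    return h, outputs


# Toy values, for illustration only.
W = [[0.5, -0.5], [0.25, 0.75]]  # input -> hidden (2 x 2)
U = [[0.1, 0.0], [0.0, 0.1]]     # hidden -> hidden (2 x 2)
V = [[1.0, -1.0]]                # hidden -> label (1 x 2)
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

final_hidden, scores = rnn_forward(W, U, V, sequence, hidden_size=2)
```

The same three matrices are reused at every time step, which is why an RNN can process word sequences of any length, as in sentiment analysis.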
  36. Key takeaways from modeling Deep Neural Networks
      • Neural networks are resource intensive
        - Typically require large amounts of dedicated hardware (RAM, GPUs)
      • The parameter space is huge
        - Hundreds of thousands of parameters
        - Tuning is important
      • Architecture choice is important
        - See http://www.asimovinstitute.org/neural-network-zoo/
  37. Contents
      • Myself
      • Introduction to Big Data
      • Artificial Intelligence
      • Artificial Intelligence and Big Data
      • Summary
  38. Recap
      • Spark: an efficient framework for running computations on thousands of computers
      • TensorFlow: a high-performance numerical framework
      • Get the best of both
        - Simple API for distributed numerical computing
        - Can leverage the hardware of the cluster
  39. Deep Learning + Apache Spark
      • Investment in Big Data infrastructure
      • GPUs
        - Require specialized hardware
        - Niche use cases
      • Can enterprises reuse existing infrastructure for deep learning applications?
      • Which deep learning use cases can leverage Apache Spark?
  40. Spark using TensorFlow [8, 9]
      • Neural networks
        - Have seen spectacular progress during the last few years
        - Are the state of the art in image recognition and automated translation
      • TensorFlow
        - A new framework released by Google for numerical computations and neural networks
      • Spark and TensorFlow
        - Use Spark and a cluster of machines to improve deep learning pipelines with TensorFlow
        - How to use TensorFlow and Spark together to train and apply deep learning models
      • Hyperparameter tuning
        - Use Spark to find the best set of hyperparameters for neural network training, leading to a 10x reduction in training time and a 34% lower error rate
      • Deploying models at scale
        - Use Spark to apply a trained neural network model to a large amount of data
  41. Spark Cluster with TensorFlow
      • Accuracy with the default set of hyperparameters: 99.2%
      • Best result with hyperparameter tuning: 99.47% accuracy on the test set, which is a 34% reduction of the test error
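The 34% figure on slide 41 follows from comparing error rates rather than accuracies. A small check, using only the two accuracies quoted on the slide (the function name is an assumption for illustration):

```python
def relative_error_reduction(baseline_accuracy, tuned_accuracy):
    """Fraction by which the test error shrinks when accuracy improves."""
    baseline_error = 1.0 - baseline_accuracy  # 0.8% for the default hyperparameters
    tuned_error = 1.0 - tuned_accuracy        # 0.53% after tuning
    return (baseline_error - tuned_error) / baseline_error


reduction = relative_error_reduction(0.992, 0.9947)
```

The test error drops from 0.8% to 0.53%, a relative reduction of about 0.34, even though accuracy itself moves by only 0.27 percentage points.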
  42. Efforts on using Deep Learning Frameworks with Spark
      • Databricks: platform for running Spark with TensorFlow
      • BigDL: Intel's library for deep learning on existing data frameworks
      • TensorFlowOnSpark: Yahoo's distributed deep learning on Big Data
      • SparkNet: AMPLab's framework for training deep networks in Spark
  43. Efforts on using Deep Learning Frameworks with Spark
      • DeepLearning4J: uses data parallelism to train on separate neural networks
      • DeepDist: lightning-fast deep learning on Spark via parallel stochastic gradient updates
      • IBM DSX
  44. Databricks
      • Deploying trained models
        - To make predictions on data stored in Spark RDDs or DataFrames
        - Inception model: https://www.tensorflow.org/tutorials/image_recognition
        - Each prediction requires about 4.8 billion operations
        - Parallelizing with Spark helps scale operations
      • https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html
  45. Databricks
      • Distributed model training
        - Use deep learning libraries like TensorFlow to test different model hyperparameters on each worker
        - Task parallelism
      • https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html
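Task parallelism for hyperparameter search can be sketched on a single machine with the Python standard library. This is a stand-in, not Spark or TensorFlow code: `train_and_evaluate` is a hypothetical function with a made-up error surface in place of real model training, and the thread pool plays the role of the cluster's workers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_evaluate(params):
    """Hypothetical stand-in for training one model; returns (params, test_error)."""
    learning_rate, num_neurons = params
    # Toy error surface whose minimum is at learning_rate=0.1, num_neurons=64.
    error = (learning_rate - 0.1) ** 2 + ((num_neurons - 64) / 1000.0) ** 2
    return params, error


learning_rates = [0.001, 0.01, 0.1, 1.0]
neuron_counts = [16, 64, 256]
grid = list(product(learning_rates, neuron_counts))  # every combination to try

# Each combination is an independent task, so tasks can run on separate workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_and_evaluate, grid))

best_params, best_error = min(results, key=lambda result: result[1])
```

On a Spark cluster the same pattern would roughly become `sc.parallelize(grid).map(train_and_evaluate).collect()`, with each worker training a full model for its own hyperparameter combination.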
  46. IBM DSX
      • Data Science Experience (DSX) includes the TensorFlow library and GPU
      • Easy to develop and run Spark with TensorFlow
        - No need to configure the library
      • Databricks' examples run in DSX
        - While Databricks CE does not support GPU
      • Brunel for visualization (recently)
  47. Spark Cluster with TensorFlow (Cont'd)
      • With multiple nodes in the cluster, the computations scaled linearly
        - Graph: computation time (in seconds) with respect to the number of machines in the cluster
      • Using a 13-node cluster
        - Train 13 models in parallel
        - This translates into a 7x speedup compared to training the models one at a time on one machine
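One way to read the 13-node result is through parallel efficiency, the ratio of observed speedup to the number of machines. This is a standard metric that the slide does not compute; the function name is an assumption for illustration.

```python
def parallel_efficiency(speedup, num_machines):
    """Observed speedup divided by the ideal speedup (one per machine)."""
    return speedup / num_machines


# Figures quoted on the slide: 7x speedup on a 13-node cluster.
efficiency = parallel_efficiency(speedup=7.0, num_machines=13)
```

A 7x speedup on 13 machines corresponds to an efficiency of roughly 0.54, so about half of the added capacity shows up as reduced wall-clock time.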
  48. Spark Cluster with TensorFlow (Cont'd)
  49. Spark Cluster with TensorFlow (Cont'd)
      • Graph: test error by learning rate for different numbers of neurons
      • The learning rate is critical
        - If it is too low, the neural network does not learn anything (high test error)
        - If it is too high, the training process may oscillate randomly and even diverge in some configurations
      • The number of neurons
        - Is not as important for getting good performance
        - Networks with many neurons are much more sensitive to the learning rate
      • This is the Occam's Razor principle
        - Simpler models tend to be "good enough" for most purposes
        - If you want to go after the missing 1% of test error, you must be willing to invest a lot of time and resources in training to find the proper hyperparameters that will make the difference
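The learning-rate behavior described on slide 49 can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(w) = w², which is an assumption chosen for illustration and not the neural network from the talk; the specific learning rates are likewise made up.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(w) = w**2 from `start`; the optimum is w = 0."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w            # derivative of w**2
        w -= learning_rate * gradient
    return w


too_low = gradient_descent(0.001)      # barely moves away from the start
well_chosen = gradient_descent(0.1)    # converges close to 0
oscillating = gradient_descent(1.0)    # flips sign every step, never settles
diverging = gradient_descent(1.1)      # overshoots more on every step
```

Each step multiplies w by (1 - 2 × learning rate), so a tiny rate leaves w almost unchanged, a rate of 1.0 flips its sign forever, and anything larger makes its magnitude grow.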
  50. Distributed processing of images using TensorFlow
      • Apache Spark with a deep learning library
        - Takes an existing neural network (Inception-v3) and applies it to a corpus of images
        - Requires that TensorFlow be installed on the cluster
        - Runs in IBM DSX, not in Databricks CE
          * Built by Databricks but needs GPU
      • Spark integration workflow
        - Define TensorFlow operations as methods, to be used within Spark tasks
        - Broadcast the model for use within Spark tasks
        - Parallelize a list of image URLs
        - Using Spark, process the image URLs in parallel
          * Load the image
          * Run inference on the image using TensorFlow to predict the image contents
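The shape of the integration workflow can be sketched with the Python standard library standing in for Spark and TensorFlow. Everything here is hypothetical: `MODEL` is a tiny dictionary in place of a trained network, `load_image` and `run_inference` are placeholders for real image loading and TensorFlow inference, and the URLs are made up. The thread pool mimics Spark distributing tasks, and the module-level `MODEL` mimics a broadcast variable shared by all tasks.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a trained model shared with every task ("broadcast").
MODEL = {"diver": "scuba diver", "shark": "tiger shark"}


def load_image(url):
    """Placeholder loader: real code would download and decode the image."""
    return url.rsplit("/", 1)[-1]  # here the "image" is just the file name


def run_inference(model, image):
    """Placeholder inference: real code would run the network on pixel data."""
    for keyword, label in model.items():
        if keyword in image:
            return label
    return "unknown"


def classify(url):
    """One task: load a single image and predict its contents."""
    return url, run_inference(MODEL, load_image(url))


image_urls = [
    "http://example.com/images/diver_01.jpg",
    "http://example.com/images/shark_02.jpg",
    "http://example.com/images/reef_03.jpg",
]

# "Parallelize" the URL list and process each URL independently.
with ThreadPoolExecutor(max_workers=3) as pool:
    predictions = dict(pool.map(classify, image_urls))
```

Sharing the model once and mapping a self-contained function over the URLs keeps each task independent, which is what lets a cluster scale the work across machines.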
  51. Distributed processing of image classification using TensorFlow
      • Uses the "Simple image classification with Inception" example from TensorFlow, which applies the Inception model to predict the contents of a set of images
      • For example, given a photo of two scuba divers, the Inception model reports the contents of the image:
        - ('scuba diver', 0.88708681)
        - ('electric ray, crampfish, numbfish, torpedo', 0.012277877)
        - ('sea snake', 0.005639134)
        - ('tiger shark, Galeocerdo cuvieri', 0.0051873429)
        - ('reel', 0.0044495272)
  52. Distributed processing of image classification using TensorFlow (Cont'd)
      • Each of the result lines on the previous slide represents a "synset", a set of synonymous terms representing a concept
      • The weight given to each synset represents a confidence in how applicable the synset is to the image
        - In this case, "scuba diver" is pretty accurate
      • Making predictions with Inception-v3 is expensive
        - Each prediction requires about 4.8 billion operations (Szegedy et al., 2015)
      • Even with smaller datasets, it is worthwhile to parallelize this computation
        - Distribute these costly predictions using Spark
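Two ideas from these slides can be made concrete in a few lines: picking the highest-confidence synsets from a model's scores, and estimating how the per-prediction cost adds up. The scores are the ones quoted on slide 51, except for the last entry, which is a hypothetical extra class added so that the ranking has something to discard. The helper names are assumptions for illustration.

```python
import heapq

scores = {
    "scuba diver": 0.88708681,
    "electric ray, crampfish, numbfish, torpedo": 0.012277877,
    "sea snake": 0.005639134,
    "tiger shark, Galeocerdo cuvieri": 0.0051873429,
    "reel": 0.0044495272,
    "snorkel": 0.0031,  # hypothetical extra class, not from the slides
}


def top_k(synset_scores, k=5):
    """Return the k (synset, confidence) pairs with the highest confidence."""
    return heapq.nlargest(k, synset_scores.items(), key=lambda item: item[1])


OPS_PER_PREDICTION = 4.8e9  # figure quoted on the slide for Inception-v3


def total_operations(num_images):
    """Total operations needed to classify num_images images."""
    return num_images * OPS_PER_PREDICTION


best = top_k(scores, k=3)
ops_for_thousand_images = total_operations(1000)
```

Even a thousand images already require on the order of trillions of operations, which is why the slides argue for distributing predictions even on smaller datasets.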
  53. Contents
      • Myself
      • Introduction to Big Data
      • Artificial Intelligence
      • Artificial Intelligence and Big Data
      • Summary
  54. Summary
      • Introduction to Big Data
      • Introduction to AI
      • AI on Big Data
  55. Databricks Partners
  56. Training Hadoop and Spark
      • Cloudera visits to interview Jongwook Woo
  57. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
  58. Questions?
  59. References
      1. Jongwook Woo and Yuhang Xu, "Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing", The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas, July 18-21, 2011
      2. Jongwook Woo, "Market Basket Analysis Algorithms with MapReduce", DMKD-00150, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 3, Issue 6, pp. 445-452, Oct 28, 2013, ISSN 1942-4795
      3. Jongwook Woo, "Big Data Trend and Open Data", UKC 2016, Dallas, TX, Aug 12, 2016
      4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
      5. Manik Katyal, Parag Chhadva, Shubhra Wahi, and Jongwook Woo, "Big Data Analysis using Spark for Collision Rate Near CalStateLA", https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
      6. Spark Programming Guide, http://spark.apache.org/docs/latest/programming-guide.html
      7. GitHub URL: https://github.com/nmelche/IntroductionToBigDataScience
  60. References
      8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
      9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
      10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
      11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
      12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
      13. Tensor Flow Deep Learning, openSAP
  61. Deep Learning for the Intelligent Enterprise
      • Figure labels: Artificial Intelligence, Machine Learning, Neural Networks, Deep Learning
      • Deep learning
        - Sub-field of neural networks, machine learning, and artificial intelligence
        - Deep learning is neural networks with many layers
        - Inspired by, but not limited to, the architecture of the human brain
        - Deep learning is the reality behind artificial intelligence
