
Big Data Platform adopting Spark and Use Cases with Open Data


Spark and its use cases are introduced with Open Data. The deck illustrates how a Hadoop big data platform can be adopted for open data and its analysis.


  1. Jongwook Woo, HiPIC, CSULA. Big Data Platform adopting Spark and Use Cases with Open Data. Symposium on the High-Performance Big Data Analysis Platform 2016, Seoul, Korea, April 28, 2016. Jongwook Woo, PhD, jwoo5@calstatela.edu, High-Performance Information Computing Center (HiPIC), California State University Los Angeles
  2. High Performance Information Computing Center, Jongwook Woo, CSULA. Contents
     • Myself
     • Introduction to Big Data
     • Introduction to Spark
     • Spark and Hadoop
     • Open Data and Use Cases
     • Hadoop Spark Training
  3. Myself
     • Name: 우종욱, Jongwook Woo
     • Experience:
       - Since 2002: Professor at California State University Los Angeles (PhD in 2001, Computer Science and Engineering, USC)
       - Since 1998: R&D consulting in Hollywood for Warner Bros (Matrix online game), E!, citysearch.com, ARM, and others; information search and integration with FAST, Lucene/Solr, Sphinx; implementing eBusiness applications using J2EE and middleware
       - Since 2007: exposed to Big Data at CitySearch.com
       - 2012 to present: Big Data academic partnerships for research and training with Amazon AWS, Microsoft Azure, IBM Bluemix, Databricks, and Hadoop vendors
  4. Myself: Experience (cont'd)
     • Bringing Big Data R&D and training to Korea since 2009
       - Summer 2013, Igloo Security: collect, search, and analyze security log files of 30 GB to 100 GB per day (Hadoop, Solr, Java, Cloudera)
       - Sept 2013: Samsung Advanced Technology Training Institute
     • Since 2008: introducing Hadoop Big Data and education to universities and research centers
       - Korea: Yonsei, Gachon
       - US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
       - Europe: Univ of Luxembourg
  5. Experience in Big Data
     • Collaboration
       - Council Member of IBM Spark Technology Center
       - City of Los Angeles for OpenHub and Open Data
       - Startup companies in Los Angeles
       - External collaborator and advisor in Big Data: IMSC of USC, Pennsylvania State University
     • Grants: IBM Bluemix, Microsoft Windows Azure, and Amazon AWS research and education grants
     • Partnership: academic education partnerships with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  6. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  7. Data Issues
     • Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes)
       - Because of the web
       - Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
     • Cannot be handled with the legacy approach: too big; non-/semi-structured data; too expensive
     • Need new systems that are inexpensive
  8. Two Cores in Big Data
     • How to store Big Data; how to compute Big Data
     • Google
       - How to store Big Data: GFS, a distributed system on inexpensive commodity computers
       - How to compute Big Data: MapReduce, parallel computing with inexpensive computers
       - Built its own supercomputers
       - Published papers in 2003 and 2004
  9. What is Hadoop?
     • Hadoop founder: Doug Cutting
     • Apache committer: Lucene, Nutch, …
  10. Definition: Big Data
     • Inexpensive frameworks that can store large-scale data and process it faster in parallel
     • Hadoop
       - An inexpensive supercomputer
       - More publicly accessible than traditional supercomputers: you can store and process your applications in university labs, small companies, and research centers
  11. Hadoop Cluster: Logical Diagram [diagram: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari); each cluster node runs an agent alongside Hadoop and HDFS; services shown include Hive, ZooKeeper, and Impala]
  12. The emergence of new tools
  13. The emergence of new tools: the Battle of Nagashino
  14. The Battle of Nagashino
  15. The Battle of Nagashino: three-rank volley fire
  16. Definition: Big Data, once more
     • Is Big Data about predicting future value from data? No!
       - That is just one application of Big Data, and something we have always done with existing computers, data warehouses, and databases
     • Big Data is a new approach that uses a supercomputer called Hadoop
  17. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  18. Alternative to Hadoop MapReduce
     • Limitations of MapReduce
       - Hard to program in Java
       - Batch processing: not interactive
       - Disk storage for intermediate data: performance issue
     • Spark, by UC Berkeley AMPLab
       - In-memory storage for intermediate data
       - 20 to 100 times faster than network- and disk-based processing (MapReduce)
  19. Spark
     • In-memory data computing: faster than Hadoop MapReduce
     • Can integrate with Hadoop and its ecosystem: HDFS, Amazon S3, HBase, Hive, sequence files, Cassandra, ArcGIS, Couchbase…
     • New programming model with faster data sharing
     • Good for complex multi-stage applications: iterative graph algorithms, machine learning
     • Interactive queries
  20. Spark [stack diagram]
     • Spark Core: RDDs, transformations, and actions
     • Spark Streaming (real-time): DStreams, streams of RDDs
     • Spark SQL: SchemaRDDs, DataFrames
     • MLlib (machine learning): RDD-based matrices
     • GraphX (graph): RDD-based matrices
     • SparkR: RDD-based matrices
  21. Spark Drivers and Workers
     • Drivers: the client with a SparkContext; communicates with Spark workers
     • Workers: Spark executors
       - Run on cluster nodes: production
       - Run in local threads: development and test
  22. RDD
     • Resilient Distributed Dataset (RDD): distributed collections of objects that can be cached in memory
     • Immutable: RDD, DStream, SchemaRDD, PairRDD
     • Lineage: the history of the objects; lost data is re-computed automatically and efficiently
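The lineage idea on the RDD slide can be sketched in plain Python. This is a toy model under stated assumptions, not the Spark API: `source_partitions`, `lineage`, and `compute_partition` are hypothetical names, and a "lost partition" is simulated by deleting a cache entry.

```python
# Toy illustration of lineage-based recovery (hypothetical names, not Spark).
source_partitions = [[1, 2, 3], [4, 5, 6]]        # immutable input data
lineage = [lambda x: x * 10, lambda x: x + 1]     # recorded transformation history


def compute_partition(index):
    """Rebuild one partition by replaying the lineage on its source data."""
    values = source_partitions[index]
    for step in lineage:
        values = [step(v) for v in values]
    return values


# Cache every computed partition in memory.
cache = {i: compute_partition(i) for i in range(len(source_partitions))}

# Simulate losing partition 1, for example when a worker node fails.
del cache[1]

# Only the missing partition is recomputed; partition 0 stays cached.
if 1 not in cache:
    cache[1] = compute_partition(1)
```

Because the input is never modified and the history of transformations is kept, a lost partition can be rebuilt without replicating the computed data.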
  23. RDD Operations
     • Transformations: define new RDDs from the current one; lazy, not computed immediately; map(), filter(), join()
     • Actions: return values; count(), collect(), take(), save()
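The difference between lazy transformations and actions can be sketched with a small pure-Python model. `LazyDataset` is a hypothetical class written for this illustration only; it mimics the behavior described on the slide and is not Spark's implementation.

```python
# Toy model only: LazyDataset is a hypothetical class, not part of Spark.
class LazyDataset:
    def __init__(self, source, ops=()):
        self._source = source    # base data, never modified
        self._ops = tuple(ops)   # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: describes a new dataset, computes nothing.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: runs the recorded pipeline and returns the values.
        return list(self._run())

    def count(self):
        # Action: runs the pipeline and returns the number of elements.
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)   # record when the function is really invoked
    return x * x


odd_squares = LazyDataset(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
calls_before_action = list(calls)   # still empty: nothing has run yet
result = odd_squares.collect()      # the action triggers the computation
```

In this sketch `square` is not called until `collect()` runs, which is the property that lets a scheduler see the whole pipeline before executing it.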
  24. Programming in Spark
     • Scala
       - Functional programming: the function is the fundamental unit of programming; inputs and outputs are functions
       - No side effects, no state
     • Python: long established, with large libraries
     • Java
  25. Spark
     • Spark SQL: turning an RDD into a relation; querying using SQL
     • Spark Streaming
       - DStream: RDDs in streaming
       - Windows: select a DStream from streaming data
     • MLlib: sparse vector support, decision trees, linear/logistic regression, SVD and PCA
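The windowing idea from the Spark Streaming bullet can be sketched as follows. This is a pure-Python approximation, assuming a stream is modeled as a list of micro-batches; `windowed_counts`, `window_length`, and `slide_interval` are names chosen for this sketch and the sample batches are made up.

```python
from collections import deque


def windowed_counts(batches, window_length, slide_interval):
    """Return the number of records in each window of micro-batches."""
    window = deque(maxlen=window_length)   # keeps only the newest batches
    results = []
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        # Emit a result once the window is full, then every slide_interval batches.
        if i >= window_length and (i - window_length) % slide_interval == 0:
            results.append(sum(len(b) for b in window))
    return results


# Five made-up micro-batches of records.
batches = [["a"], ["b", "c"], [], ["d", "e", "f"], ["g"]]
counts_slide_1 = windowed_counts(batches, window_length=3, slide_interval=1)
counts_slide_2 = windowed_counts(batches, window_length=3, slide_interval=2)
```

Each window covers the last three batches; a slide interval of 1 emits a result after every batch, while a slide interval of 2 emits after every second batch.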
  26. Scheduling Process [diagram]
     • RDD objects, e.g. rdd1.join(rdd2).groupBy(…).filter(…): build the operator DAG
     • DAGScheduler: splits the graph into stages of tasks; submits each stage as ready; agnostic to operators
     • TaskScheduler: receives TaskSets; launches tasks via the cluster manager; retries failed or straggling tasks; doesn't know about stages; reports a failed stage back
     • Worker: executes tasks in threads; stores and serves blocks through the block manager
  27. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  28. Spark
     • Spark's own stack: file system: Tachyon; resource manager: Mesos
     • But Hadoop has been dominating the market
     • Integrating Spark into a Hadoop cluster
       - Cloud computing: Amazon AWS, Azure HDInsight, IBM Bluemix (object storage, S3)
       - Hadoop vendors: HDP, CDH
     • Databricks: Spark on AWS, no Hadoop ecosystem
  29. Spark Components [diagram]
     • Your program: sc = new SparkContext; f = sc.textFile(“…”); f.filter(…).count(); ...
     • Spark driver/client (app master): RDD graph, scheduler, block tracker, shuffle tracker
     • Cluster manager
     • Spark worker(s): block manager, task threads
     • Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
  30. Spark with Hadoop YARN
     • ResourceManager (RM), per cluster: creates the Spark AM and allocates containers for the Spark AM
     • NodeManager (NM), per node: Spark workers
     • ApplicationMaster (AM), per application: containers for Spark executors
     • [diagram: Spark client; master node with the Resource Manager; slave nodes with Node Managers hosting the Spark AM and Spark Executor containers]
  31. Databricks cluster at CalStateLA
  32. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Use Cases; Hadoop Spark Training
  33. Open Data
     • US government: federal, state, and city governments expose data to the public
     • US business: Twitter, Yelp, … expose data to the public through APIs, with some restrictions on downloads
     • City government
       - New York: Taxi, Uber, …
       - Los Angeles: Open Data, Open Hub with geo information
  34. Data from Industry: Twitter Data
     • System: Azure HDInsight Spark, 8 nodes; 40 cores, 2.4 GHz Intel Xeon; memory per node: 28 GB
     • Data source: keyword ‘alphago’ from Twitter via Apache NiFi
     • Data size: 63,193 tweets
     • Real-time data collection period: 03/12 to 03/17/2016; no data collected on 03/13
  35. Top 10 Countries that Tweeted “Alphago”
  36. Top 10 Countries: number of tweets per country
     • USA: > 11,000
     • Japan: > 9,000
     • Korea: > 1,900
     • Russia, UK: > 1,600
     • Thailand, France: > 1,000
     • Netherlands, Spain, Ukraine: > 600
  37. Top 10 Countries: Sentiment [chart: positive vs negative]
  38. Top 10 Countries: Most Tweeted Countries
     • All countries show more positive than negative tweets: Korea, Japan, USA
     • Country | Positive | Negative
       USA     | 5070     | 3567
       Japan   | 8118     | 217
       …
       Korea   | 1053     | 407
       …
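The positive share per country follows directly from the counts on the slide. A minimal sketch, using only the three rows that are legible in the table; the variable names are illustrative.

```python
# Positive and negative tweet counts for the three countries shown on the slide.
sentiment = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}

# Share of positive tweets among the tweets that received a sentiment label.
positive_share = {
    country: positive / (positive + negative)
    for country, (positive, negative) in sentiment.items()
}

for country, share in positive_share.items():
    print(f"{country}: {share:.1%} positive")
```

On these numbers Japan is the most positive (about 97%), Korea is at about 72%, and the USA at about 59%, which is consistent with the slide's statement that positive tweets outnumber negative ones in each country.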
  39. Daily Tweets, 03/12 to 03/17/2016 [chart: tweets per day, y-axis 0 to 50,000]. AlphaGo vs Lee Sedol annotations: Game 3: Mar 12; Game 4: Mar 13, Lee Se-Dol win; Game 5: Mar 15
  40. N-gram words
     • Three consecutive words right after the Go champion's name, “sedol” or “se-dol”
     • 3-gram           | Frequency
       Again-to-win     | 1,187
       Is-something-I’ll | 369
       Is-something-i   | 199
       In-go-tournament | 168
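Counting the three words that follow a keyword can be sketched in a few lines of Python. This is an illustrative approach, not necessarily how the original analysis was done; the function name `trigrams_after`, the tokenizer, and the sample tweets are all made up for the example.

```python
import re
from collections import Counter


def trigrams_after(texts, keywords):
    """Count the 3-word sequences that immediately follow any keyword."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep letters, digits, apostrophes, and hyphens.
        tokens = re.findall(r"[a-z0-9'\-]+", text.lower())
        for i, token in enumerate(tokens):
            following = tokens[i + 1:i + 4]
            if token in keywords and len(following) == 3:
                counts["-".join(following)] += 1
    return counts


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Lee Sedol again to win against AlphaGo",
    "Can se-dol again to win?",
    "Sedol in Go tournament today",
]
top = trigrams_after(sample_tweets, {"sedol", "se-dol"}).most_common(2)
```

The same counting step maps naturally onto a distributed job: tokenize each tweet, emit the 3-gram as a key, and sum the counts per key.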
  41. Sentiment Map of Alphago [map: positive vs negative]
  42. Sentiment Map of Lee Se-Dol vs Alphago
     • YouTube video: “alphago sentiment” by Google
     • The sentiment of the world in geo and time: https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
  43. Federal Government: Airline Data Set
     • Government open data: airline data set for 2012 to 2014, US Dept of Transportation
     • Cluster by Nillohit at HiPIC, CSULA: Microsoft Azure using Hive and Spark SQL
       - Number of data nodes: 4
       - CPU: 4 cores; memory: 7 GB
       - Windows Server 2012 R2 Datacenter
  44. Airline Data Set
  45. Airline Data Set
  46. Airline Data Set
  47. City Government: Crime Data Set
     • Open data in the City of Los Angeles: crime data set for 2014
     • Ram Dharan and Sridhar Reddy at HiPIC, CSULA: Microsoft Azure using Hive and Spark SQL
       - Number of data nodes: 4
       - CPU: 4 cores; memory: 14 GB
       - Windows Server 2012 R2 Datacenter
       - Extending to the last 10 years of the data set
  48. Crime Data Los Angeles 2014 [pie chart: total occurrences of each crime; categories: criminal, vandalism, others, burglary, assault, traffic, theft; slice labels: 2%, 8%, 9%, 12%, 17%, 19%, 33%]
  49. Total No. of Crimes in 2014 [bar chart: number of crimes per month]
     • Month | Crimes
       1     | 19169
       2     | 17384
       3     | 19730
       4     | 19413
       5     | 20645
       6     | 20494
       7     | 21480
       8     | 21280
       9     | 21287
       10    | 21669
       11    | 19844
       12    | 21355
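The monthly figures on this slide can be summarized with a short calculation. The sketch assumes the twelve values are listed in calendar order, January through December, which is how they appear in the chart labels.

```python
# Monthly crime counts for 2014, assumed to be in calendar order (1 = January).
monthly_crimes = {
    1: 19169, 2: 17384, 3: 19730, 4: 19413, 5: 20645, 6: 20494,
    7: 21480, 8: 21280, 9: 21287, 10: 21669, 11: 19844, 12: 21355,
}

total = sum(monthly_crimes.values())
average = total / len(monthly_crimes)
peak_month = max(monthly_crimes, key=monthly_crimes.get)   # month with most crimes
low_month = min(monthly_crimes, key=monthly_crimes.get)    # month with fewest crimes

print(f"total={total}, average={average:.1f}, peak={peak_month}, low={low_month}")
```

Under that assumption the year totals 243,750 reported crimes, with the peak in month 10 and the minimum in month 2.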
  50. Raw Data Projection on Map
  51. Mapping of Crimes that Occurred within 5 Miles of CSULA
  52. Mapping of Crimes that Occurred within 5 Miles of UCLA
  53. Mapping of Crimes that Occurred within 5 Miles of USC
  54. No. of crimes within 5 miles of CSULA, UCLA, and USC by crime type [bar chart: series csula, ucla, usc; y-axis 0 to 30,000]
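Selecting the records within 5 miles of a campus can be sketched with the haversine great-circle distance. This is one possible approach, not necessarily the one used for the maps above; the campus coordinates are approximate and the two crime records are invented for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def crimes_within(records, center, radius_miles=5.0):
    """Keep the records whose location lies within radius_miles of center."""
    lat0, lon0 = center
    return [
        r for r in records
        if haversine_miles(lat0, lon0, r["lat"], r["lon"]) <= radius_miles
    ]


campus = (34.0665, -118.1684)  # approximate CSULA location, assumed for illustration
records = [  # invented records
    {"type": "THEFT", "lat": 34.0765, "lon": -118.1684},    # about 0.7 miles north
    {"type": "ASSAULT", "lat": 34.2665, "lon": -118.1684},  # about 13.8 miles north
]
nearby = crimes_within(records, campus)
```

In a Hive or Spark SQL job the same predicate would typically be applied as a filter on the latitude and longitude columns before grouping by crime type.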
  55. Hydrogen Gas Power Plant Prediction Model: the Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.
  56. Hydrogen Gas Power Plant Prediction Model
     • The station produces hydrogen for hydrogen vehicles
     • The Cal State L.A. Hydrogen Research and Fueling Facility is the first station in the nation to sell hydrogen fuel to the public
     • Hyundai, Toyota
  57. Hydrogen Gas Power Plant Prediction Model: Workflow
  58. Hydrogen Gas Power Plant Prediction Model: model by Manvi Chandra
  59. Hydrogen Gas Power Plant Prediction Model: results and observations
  60. Hydrogen Gas Power Plant Prediction Model: results and observations
     • Can predict vehicle pressure: the pressure of hydrogen gas within the vehicle's hydrogen storage system
       - Using our model in Azure Visual Studio ML
       - Building Spark ML
     • Decision forest regression: constructs a multitude of decision trees at training time and outputs
       - the mode of the classes (classification), or
       - the mean prediction (regression) of the individual trees
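How a decision forest combines its trees can be sketched in a few lines. This is a toy illustration of the aggregation step only: the "trees" are hand-written threshold rules with made-up pressure values, not a trained model, and the function names are hypothetical.

```python
from statistics import mean, mode


def forest_regress(trees, x):
    """Regression: average the predictions of the individual trees."""
    return mean(tree(x) for tree in trees)


def forest_classify(trees, x):
    """Classification: return the most common class among the trees."""
    return mode(tree(x) for tree in trees)


# Three hand-written stand-ins for trained regression trees (made-up values).
regression_trees = [
    lambda x: 300.0 if x < 20 else 350.0,
    lambda x: 310.0 if x < 25 else 360.0,
    lambda x: 290.0 if x < 15 else 340.0,
]

# Three hand-written stand-ins for trained classification trees.
classification_trees = [
    lambda x: "low" if x < 20 else "high",
    lambda x: "low" if x < 25 else "high",
    lambda x: "low" if x < 15 else "high",
]

predicted_value = forest_regress(regression_trees, 18)       # mean of 300, 310, 340
predicted_class = forest_classify(classification_trees, 18)  # majority of low, low, high
```

Averaging or voting across many trees is what makes the forest less sensitive to the quirks of any single tree.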
  61. Collaboration with City of Los Angeles: Wellness and Safety Analysis
     • How to improve the wellness and safety of the city
       - Expand information sharing and performance metrics
       - Promote and improve City employee health and wellness
       - Develop and carry out the City’s safety training and injury prevention strategy
  62. Collaboration with City of Los Angeles (cont’d): Procurement Analysis
     • How to improve the procurement of the city
       - Pricing trends
       - Supplier diversity
       - Cost optimization
       - Invoicing/billing/payment trends
       - Material optimization
       - Resource/process efficiencies
  63. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  64. King Gwanghaegun and the Qing
  65. The Battle of Sarhu. Illustration of the Battle of Sarhu from the Manchu Veritable Records (만주실록): a battle scene between the Later Jin and the Ming army
  66. Gang Hong-rip and the Battle of Bucha (Fucha). From the Manchu Veritable Records (만주실록): Manchu cavalry attacking the vanguard of Ming general Liu Ting's army in the Joseon-Ming allied force
  67. Organization of the Joseon army. Illustration of the Joseon army from a Joseon source, the Chungnyeollok (충렬록, 1770-1790, 정사4간본): archers with bows and gunners with matchlock guns
  68. Gang Hong-rip and the Battle of Bucha (Fucha)
  69. New technology development and training: Hadoop and Spark, a new supercomputer for R&D and value creation
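  70. Why is Hadoop and Spark training needed?
     • Needed for creating new value and for R&D
     • Led by the USA, engineering, science, and industry recognize the importance of Hadoop and Spark Big Data training
       - Not only in data mining and analytics but in every field that has large-scale data
     • Even small and medium-sized companies can own a Hadoop cluster: an inexpensive supercomputer
     • However, nobody teaches Hadoop and Spark
     • Who should you get trained by?
  71. How to start Hadoop training?
     • Limits of self-study by engineers
       - Time: more than a year to become an expert
       - Don’t know the details
       - Miss many important topics
     • In 2014 we live in an era of experts and international competition, not in a 1980s university lecture room
     • Saving on training costs? It reduces company productivity
     • Think USA! Training, training, training…
  72. How to start Hadoop training? (cont’d)
     • The limits of teaching yourself in IT need to be recognized
     • Industry, such as Silicon Valley, leads IT technology
     • Saving on training costs means falling behind in the Big Data industry
     • Industry training programs
     • Korea's particular situation: government-led
       - Training provided through government, experts, and companies is urgently needed
       - Public institutions and schools: government-led
       - Industry: self-sustaining
     • Hands-on equipment and code examples are needed to train practitioners, not "theory guys"
     • Encourage high-priced, high-quality training rather than low-cost training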
  73. Spark Training
     • California State University Los Angeles (Prof Jongwook Woo): supported by Databricks and its cloud computing services
     • UC Berkeley edX (MOOC)
     • UC Berkeley AMPLab camp
     • Stanford
     • Cloudera, Hortonworks, DataStax training courses
     • IBM Big University
  74. Databricks Partners
  75. Training Hadoop and Spark: Cloudera visits to interview Jongwook Woo
  76. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
  77. Big Data Forum at UKC 2016. Forum Chair: jwoo5@calstatela.edu. Amazon AWS, Hortonworks, Couchbase, Qlik…
  78. Question?
  79. References
     • Hadoop, http://hadoop.apache.org
     • Apache Spark, Word Count Example (http://spark.apache.org)
     • Databricks (http://www.databricks.com)
     • “Market Basket Analysis using Spark”, Jongwook Woo, in Journal of Science and Technology, April 2015, Volume 5, No 4, pp207-209, ISSN 2225-7217, ARPN
     • https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles
  80. References
     • Introduction to Big Data with Apache Spark, Databricks
     • Stanford Spark Class (http://stanford.edu/~rezab)
     • Cornell University, CS5304
     • DS320: DataStax Enterprise Analytics with Spark
     • Cloudera, http://www.cloudera.com
     • Hortonworks, http://www.hortonworks.com
     • Spark 3 Use Cases, http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
  81. Dependency Types
     • “Narrow” deps (a stage pipeline to be run on the same node): map, filter; union; join with inputs co-partitioned
     • “Wide” (shuffle) deps (boundary of stages): groupByKey; join with inputs not co-partitioned
  82. Dependency Types
     • “Narrow” deps (a stage pipeline to be run on the same node): map, filter; union; join with inputs co-partitioned
     • “Wide” (shuffle) deps (boundary of stages): groupByKey; join with inputs not co-partitioned
  83. Scheduler Optimizations
     • Pipelines within a stage: stage 2 runs map and union
     • Stage 3: join algorithms based on partitioning (minimize shuffles)
     • [diagram: RDDs A to G across Stage 1, Stage 2, and Stage 3, connected by groupBy, map, union, and join; legend: previously computed partition, task]
  84. Scheduler Optimizations
     • Conceptually: Stage 1: 3 tasks; Stage 2: 4 tasks; Stage 3: 3 tasks
     • Total: 3 stages, 10 tasks
     • [diagram: RDDs A to G across Stage 1, Stage 2, and Stage 3, connected by groupBy, map, union, and join; legend: previously computed partition, task]
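The rule that wide (shuffle) dependencies mark stage boundaries, while narrow operations are pipelined inside one stage, can be sketched for a simple linear chain of operations. This is a deliberate simplification with hypothetical names: a real scheduler works on a DAG, and a join over co-partitioned inputs would count as narrow, which this sketch ignores.

```python
# Operations treated as wide (shuffle) dependencies in this simplified sketch.
WIDE_OPS = {"groupBy", "groupByKey", "join"}


def split_into_stages(ops):
    """Split a linear chain of operations into stages at shuffle boundaries."""
    stages, current = [], []
    for op in ops:
        # A wide operation needs shuffled input, so it starts a new stage.
        if op in WIDE_OPS and current:
            stages.append(current)
            current = []
        current.append(op)   # narrow operations are pipelined in the same stage
    if current:
        stages.append(current)
    return stages


pipeline = ["map", "filter", "groupByKey", "map", "join", "filter"]
stages = split_into_stages(pipeline)
```

Each resulting stage would then be run as one task per partition, which is how a three-stage job can end up with ten tasks, as in the slide's example.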
