
Introduction to Spark: Data Analysis and Use Cases in Big Data

Introduces Big Data and the fundamentals of Hadoop, and presents experimental results of data analysis on airline and traffic violation data sets.

Published in: Data & Analytics

  1. Introduction to Spark: Data Analysis and Use Cases in Big Data
     Fall 2015 Engineering Colloquium Series, University of Bridgeport, CT, October 29, 2015
     Jongwook Woo, PhD, jwoo5@calstatela.edu
     High-Performance Information Computing Center (HiPIC), California State University Los Angeles
     Cloudera Academic Partner and Grants Awardee of Amazon AWS
  2. Contents: Myself; Introduction to Big Data; Spark Cores; RDD; Spark SQL, Streaming, ML; Hive and its Architecture on Azure; Experimental Results; Conclusions
  3. Myself
     Name: 우종욱, Jongwook Woo
     Experience:
     • 2012 to present: Certified Cloudera Instructor (R&D, consulting, training)
     • 2012 to present: Big Data academic partnerships
       – Databricks, Cloudera, Hortonworks (partner for Hadoop training)
       – Amazon AWS, Microsoft Azure, IBM Bluemix
     • Since 2002: Professor at California State University Los Angeles
     • Since 1998: R&D consulting in Hollywood
       – Implemented eBusiness applications using J2EE and middleware
       – Information search and integration with FAST, Lucene/Solr, Sphinx
       – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
     • Since 2007: Exposed to Big Data
     • PhD in 2001: Computer Science and Engineering at USC
  4. Myself: Experience (cont'd)
     • Bringing Big Data training and R&D to Korea since 2009
     • 2014: Training on Hadoop and its ecosystem
     • Summer 2013: Igloo Security
       – Collect, search, and analyze security log files, 30 GB to 100 GB per day
       – Hadoop, Solr, Java, Cloudera
     • Sept 2013: Samsung Advanced Technology Training Institute
     • Since 2008: Introducing Hadoop, Big Data, and related education at universities and research centers
  5. Experience in Big Data
     • Grants: received education and research grants from IBM Bluemix, Microsoft Windows Azure, and Amazon AWS
     • Partnership: received academic education partnerships with Databricks, Cloudera, Hortonworks, and IBM
     • Certificates
       – Certified Cloudera Hadoop Instructor
       – Certified Cloudera Hadoop Developer / Administrator / HBase / Spark
  6. Contents: Myself; Introduction to Big Data; Spark Cores; RDD; Spark SQL, Streaming, ML; Hive and its Architecture on Azure; Experimental Results; Conclusions
  7. Data Issues
     • Large-scale data
       – Terabyte (10^12 bytes), petabyte (10^15 bytes)
       – Because of the web
       – Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
     • Cannot be handled with the legacy approach
       – Too big
       – Unstructured and semi-structured data
       – Too expensive
     • Need new systems
       – Inexpensive
  8. Two Cores in Big Data
     • How to store Big Data
     • How to compute Big Data
     • Google
       – How to store Big Data: GFS, on inexpensive commodity computers
       – How to compute Big Data: MapReduce, parallel computing with multiple inexpensive computers
       – Own supercomputers
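
The MapReduce idea named on this slide can be sketched in plain Python. This is only a single-machine illustration of the map, shuffle, and reduce steps, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input, standing in for lines read from a distributed file system.
sample = ["big data on commodity computers", "big data in parallel"]
counts = word_count(sample)
print(counts["big"], counts["data"], counts["parallel"])  # prints: 2 2 1
```

On a real cluster each map call and each reduce call can run on a different machine, which is what makes the model suit many inexpensive computers.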
  9. What is Hadoop?
     Hadoop founder: Doug Cutting, Chief Architect at Cloudera
  10. Definition: Big Data
      • Inexpensive frameworks that can store large-scale data and process it faster in parallel
      • Hadoop
        – Inexpensive supercomputer
        – You can build and run your applications
  11. Hadoop Ecosystems (run on a Hadoop cluster)
      • Data analysis: Hive, Pig, Impala, Tez
      • Data streaming: Storm, Flume, Kafka
      • Data migration with RDBs: Sqoop
      • In-memory processing: Spark
  12. Hive and Pig
      • Data analysis tools of the Hadoop ecosystem
      • Developed at Facebook (Hive) and Yahoo (Pig)
      • Turn Hadoop into a data warehouse
      • Convert queries and scripts to MapReduce jobs
        – Batch processing
        – Non-interactive in the beginning, even for a simple SQL/script statement
        – Impala on MPP for interactive querying
      • HiveQL: SQL syntax
      • Pig Latin: script language
  13. Hadoop CDH: Logical Diagram
      [Diagram: a web browser controls the Cloudera Manager (CM) server over HTTP(S); each cluster node runs an agent and CDH; services shown: HDFS, Hive, ZooKeeper, Impala]
  14. Contents: Myself; Introduction to Big Data; Spark Cores; RDD; Spark SQL, Streaming, ML; Architecture on Azure; Experimental Results; Conclusions
  15. Alternative to Hadoop MapReduce
      • Limitations of MapReduce
        – Hard to program in Java
        – Batch processing: not interactive
        – Disk storage for intermediate data: performance issue
      • Spark, by UC Berkeley AMPLab
        – In-memory storage for intermediate data
        – 10 to 100x faster than network and disk
  16. Spark
      • In-memory data computing
        – Faster than Hadoop MapReduce
      • Can integrate with Hadoop and its ecosystem
        – HDFS, HBase, Hive, sequence files
      • New programming model with faster data sharing
        – Good for complex multi-stage applications: iterative graph algorithms, machine learning
        – Interactive queries
  17. Spark
      [Diagram of the Spark stack]
      • Spark Cores: RDDs, transformations, and actions
      • Spark Streaming (real-time): DStreams, streams of RDDs
      • Spark SQL: SchemaRDDs, DataFrames
      • MLlib (machine learning): RDD-based matrices
      • GraphX (graph): RDD-based matrices
      • SparkR: RDD-based matrices
  18. Spark Drivers and Workers
      • Drivers
        – Client with a SparkContext
        – Create RDDs
      • Workers
        – Spark executors
        – Run on cluster nodes in production
        – Run in local threads in development
  19. Spark Components
      [Diagram]
      • Your program: sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count(); ...
      • Spark driver/client (app master): RDD graph, scheduler, block tracker, shuffle tracker
      • Cluster manager
      • Spark worker(s): task threads, block manager
      • Storage: HDFS, HBase, Amazon S3…
  20. Contents: Myself; Introduction to Big Data; Spark Cores; RDD; Spark SQL, Streaming, ML; Architecture on Azure; Experimental Results; Conclusions
  21. RDD
      • Resilient Distributed Dataset (RDD)
        – Distributed collections of objects that can be cached in memory
        – RDD, DStream, SchemaRDD, PairRDD
      • Immutable
      • Lineage
        – History of the objects
        – Automatically and efficiently recompute lost data
  22. RDD Operations
      • Transformations
        – Define new RDDs from the current one
        – Lazy: not computed immediately
        – map(), filter(), join()
      • Actions
        – Return values
        – count(), collect(), take(), saveAsTextFile()
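
A toy model can make the split between lazy transformations and actions concrete, along with the lineage idea from the RDD slide. The sketch below is plain Python, not the Spark API: `ToyRDD` and `traced_square` are hypothetical names, and everything runs on one machine. It only mimics the behaviour described above: transformations record a step and return a new object, and nothing is computed until an action runs.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, lazy, and aware of its lineage."""

    def __init__(self, source, lineage=("source",), compute=None):
        self._source = tuple(source)      # never modified after creation
        self.lineage = lineage            # history of transformations applied
        self._compute = compute or (lambda rows: rows)

    def _derive(self, name, step):
        # Build a new ToyRDD whose computation is "previous steps, then this one".
        previous = self._compute
        return ToyRDD(self._source, self.lineage + (name,),
                      lambda rows: step(previous(rows)))

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return self._derive("map", lambda rows: (fn(r) for r in rows))

    def filter(self, pred):
        return self._derive("filter", lambda rows: (r for r in rows if pred(r)))

    # Actions: replay the whole lineage and return values to the caller.
    def collect(self):
        return list(self._compute(iter(self._source)))

    def count(self):
        return sum(1 for _ in self._compute(iter(self._source)))


calls = []


def traced_square(x):
    """Square x and record the call, so we can see when work actually happens."""
    calls.append(x)
    return x * x


numbers = ToyRDD(range(1, 6))
evens_squared = numbers.map(traced_square).filter(lambda x: x % 2 == 0)
print(len(calls))               # 0: the transformations have not run yet
print(evens_squared.collect())  # [4, 16]: the action triggers the computation
print(evens_squared.lineage)    # ('source', 'map', 'filter')
```

Because each object keeps its source and the chain of steps, a result can always be rebuilt by replaying the lineage. That is the intuition behind recomputing lost data instead of replicating it.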
  23. Programming in Spark
      • Scala
        – Functional programming: functions are the fundamental building block, and inputs and outputs can be functions
        – No side effects, no state
      • Python
        – Legacy, large libraries
      • Java
  24. Contents: Myself; Introduction to Big Data; Spark Cores; RDD; Spark SQL, Streaming, ML; Architecture on Azure; Experimental Results; Conclusions
  25. Spark
      • Spark SQL
        – Turns an RDD into a relation
        – Querying using SQL
      • Spark Streaming
        – DStream: RDDs in streaming
        – Windows: select a DStream from streaming data
      • MLlib
        – Sparse vector support, decision trees, linear/logistic regression, SVD and PCA
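
The windowing idea for streams can be illustrated without Spark. The sketch below is plain Python and treats a stream as a sequence of micro-batches, then reports a count over a sliding window of the most recent batches. The function name `windowed_counts`, its parameters, and the sample batches are hypothetical. In Spark Streaming the window length and slide interval are given as durations, whereas here they are simply numbers of batches.

```python
from collections import deque


def windowed_counts(batches, window_length, slide_interval):
    """Yield the total number of events in each sliding window of micro-batches."""
    window = deque(maxlen=window_length)  # keeps only the most recent batches
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        # Emit once the window is full, then every `slide_interval` batches.
        if i >= window_length and (i - window_length) % slide_interval == 0:
            yield sum(len(b) for b in window)


# Hypothetical stream: five micro-batches of events.
batches = [["a", "b"], ["c"], [], ["d", "e", "f"], ["g"]]
print(list(windowed_counts(batches, window_length=3, slide_interval=1)))  # [3, 4, 4]
```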
  26. Contents: Myself; Introduction to Big Data; Spark Cores; RDD; Spark SQL, Streaming, ML; Architecture on Azure; Experimental Results; Conclusions
  27. Spark on Cloud Computing
      • Databricks
        – Latest Spark, latest Spark MLlib
      • Amazon AWS
        – Spark example at https://spark.apache.org/
      • IBM Bluemix
        – Supports Spark and the AMPLab
        – Launched Spark on Bluemix in July 2015
      • Microsoft Azure
  28. Microsoft Azure HDInsight
      • Deploys Hadoop clusters in the cloud: Hive, Spark
      • HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution
      • HDInsight cluster configuration
        – Number of data nodes: 4
        – CPU: 4 cores
        – Memory: 7 GB
        – Operating system: Windows Server 2012 R2 Datacenter
      • Hadoop clusters can be launched using
        – Linux operating system
        – Windows Server operating system
  29. System Architecture
      [Diagram; label: Spark Worker]
  30. (1) Airline Data Set Analysis
      • Data taken from the US Department of Transportation
      • Consists of the arrival and departure records of domestic airlines
      • Time period: January 2005 to December 2014 (10 years)
      • Total number of files: 120
      • File format: CSV (comma-separated values)
      • Total file size: 13.1 GB
      • Total number of records: 66 million
  31. Apache Hive and Spark
      • Hive
        – SQL-like language, developed at Facebook
        – HQL (Hive Query Language) differs from SQL
        – Runs MapReduce jobs under the hood
        – Batch processing
        – Queries have high latency
        – Read-based
        – Not appropriate for transaction processing
      • Spark
        – Spark SQL: uses the Hive metastore and HiveQL
  32. Experimental Results
      • Spark Hive Context: about 3:30 to 16 minutes per query on Azure
      • Total number of flights cancelled each month: 210.862 seconds, 120 rows fetched
      • Total number of flights diverted each month: 216.704 seconds, 120 rows fetched
      • Total number of flights cancelled every year: 302.465 seconds, 10 rows fetched
      • Total number of flights diverted every year: 461.433 seconds, 10 rows fetched
      • Effect of flight distance on flight diversions: 675.725 seconds, 1500 rows fetched
  33. Experimental Results (cont'd)
      • Effect of flight distance on flight cancellations: 576.925 seconds, 1500 rows fetched
      • Effect of flight distance on average departure delay: 992.911 seconds, 1500 rows fetched
      • Monthly average departure delay: 973.695 seconds, 13 rows fetched
      • Yearly average departure delay: 623.694 seconds, 11 rows fetched
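
The slides do not show the queries themselves, so the following is only a sketch of the kind of aggregation behind "total number of flights cancelled each month", written in plain Python rather than HiveQL. The column names `Year`, `Month`, and `Cancelled` are assumptions about the CSV layout, and the sample rows are made up for illustration; they are not taken from the airline data set.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; the column names are assumed, not confirmed.
SAMPLE_CSV = """Year,Month,Cancelled,Diverted
2005,1,1,0
2005,1,0,0
2005,1,1,0
2005,2,0,1
2005,2,1,0
"""


def cancelled_per_month(csv_file):
    """Sum the Cancelled flag per (year, month), like a GROUP BY Year, Month."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        key = (int(row["Year"]), int(row["Month"]))
        # int(float(...)) tolerates flags written as "1" or "1.00".
        totals[key] += int(float(row["Cancelled"]))
    return dict(totals)


result = cancelled_per_month(io.StringIO(SAMPLE_CSV))
print(sorted(result.items()))  # [((2005, 1), 2), ((2005, 2), 1)]
```

Grouping by year and month together is consistent with the 120 rows reported for the monthly queries, since ten years of data give 10 × 12 = 120 groups.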
  34. Cancelled and Diverted Flights by Month
      [Figure: number of cancelled/diverted flights vs. time; series: Cancelled, Diverted]
  35. Cancelled and Diverted Flights by Year
      [Figure: number of cancelled/diverted flights vs. year, 2005 to 2014; series: Cancelled, Diverted]
  36. Diverted Flights vs. Distance
      [Figure: number of diverted flights (count) vs. flight distance (in miles)]
  37. Cancelled Flights vs. Distance
      [Figure: number of cancelled flights (count) vs. flight distance (in miles)]
  38. Average Departure Delay vs. Flight Distance
      [Figure: average departure delay (in minutes) vs. flight distance (in miles)]
  39. Average Departure Delay by Month
      [Figure: average departure delay (in minutes) vs. month, Jan to Dec]
  40. Average Departure Delay by Year
      [Figure: average departure delay (in minutes) vs. year, 2005 to 2014]
  41. (2) Traffic Violation Data Set Analysis
      • Data taken from data.gov, home of the U.S. Government's open data
      • Traffic violation data for Montgomery County, Maryland
      • Time period: Jan 2012 to Aug 2015
      • Total file size: 300 MB
  42. Experimental Results
  43. Experimental Results
  44. Experimental Results
  45. Experimental Results
  46. Experimental Results: Traffic Violations and Deaths by Year
  47. Experimental Results
  48. Contents: Myself; Introduction to Big Data; Spark Cores; RDD; Spark SQL, Streaming, ML; Architecture on Azure; Experimental Results; Conclusions
  49. Conclusion
      • Big Data is Hadoop
        – Data analysis ecosystems: Hive, Pig
      • Spark is the way to go for Big Data
        – In HDFS, with Hadoop ecosystems such as Hive and Pig
        – Spark SQL, Spark MLlib
      • Any large-scale data set
        – Scientific, marketing, finance, economics
  50. Questions?
  51. References
      • Airline Data Set, United States Department of Transportation, http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
      • What is Hive?, http://www-01.ibm.com/software/data/infosphere/hadoop/hive/
      • Introduction to Windows Azure Blob Storage, https://www.simple-talk.com/cloud/cloud-data/an-introduction-to-windows-azure-blob-storage-/
      • Introduction to Hadoop in HDInsight: Big-data analysis and processing in the cloud, https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/
      • Explorer for Microsoft Azure Storage: Freeware Client, http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx
      • Upload data for Hadoop jobs in HDInsight, https://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/
      • "Market Basket Analysis Algorithms with MapReduce", Jongwook Woo, DMKD-00150, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
