Big Data Trend with Open Platform

Introduction to Big Data, Hadoop, Spark, Open Platform, Use Case, Future Trend

Published in: Data & Analytics

Big Data Trend with Open Platform

  1. Big Data Trend with Open Platform
       • Jongwook Woo, PhD, jwoo5@calstatela.edu
       • High-Performance Information Computing Center (HiPIC), California State University Los Angeles
       • SWRC 2017, San Diego, CA, Feb 25 2017
  2. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  3. Myself: Experience
       • Since 2002: Professor at California State Univ Los Angeles
           - PhD in 2001: Computer Science and Engineering at USC
       • Since 1998: R&D consulting in Hollywood
           - Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
           - Information search and integration with FAST, Lucene/Solr, Sphinx
           - Implemented eBusiness applications using J2EE and middleware
       • Since 2007: Exposed to Big Data at CitySearch.com
       • 2012 to present: Big Data academic partnerships
           - For Big Data research and training: Amazon AWS, Microsoft Azure, IBM Bluemix; Databricks, Hadoop vendors
  4. Myself: Experience (cont'd)
       • Bringing Big Data R&D and training to Korea since 2009
       • Collaborating with the City of Los Angeles in 2016
           - Collect, search, and analyze city data: Hadoop, Solr, Java, Cloudera
       • Sept 2013: Samsung Advanced Technology Training Institute
       • Since 2008: introducing Hadoop Big Data and education to universities and research centers
           - Yonsei, Gachon
           - US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
           - Europe: Univ of Luxembourg
  5. Myself: Experience in Big Data
       • Collaboration
           - Council Member of IBM Spark Technology Center
           - City of Los Angeles for OpenHub and Open Data
           - Startup companies in Los Angeles
           - External collaborator and advisor in Big Data: IMSC of USC, Pennsylvania State University
       • Grants
           - IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
       • Partnership
           - Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  6. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  7. Data Issues
       • Large-scale data
           - Terabytes (10^12 bytes), petabytes (10^15 bytes)
           - Because of the web
           - Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games, …
       • Cannot be handled with the legacy approach
           - Too big
           - Non-/semi-structured data
           - Too expensive
       • Need new systems
           - Inexpensive
  8. Two Cores in Big Data
       • How to store Big Data
       • How to compute Big Data
       • Google
           - How to store Big Data: GFS, distributed systems on inexpensive commodity computers
           - How to compute Big Data: MapReduce, parallel computing with inexpensive computers
           - Own supercomputers
           - Published papers in 2003, 2004
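The map/reduce compute model named on this slide can be illustrated with a single-process word count. This is only a sketch and not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample splits are assumptions made for illustration, and each "split" stands in for a file block that a distributed file system would place on a different machine.

```python
from collections import defaultdict


def map_phase(split):
    """Emit (word, 1) pairs for one input split, as a mapper would."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_pairs):
    """Group values by key, imitating the shuffle between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    """Run map over every split, then shuffle and reduce the combined output."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle(mapped))


# Hypothetical input: two splits, each a list of text lines.
splits = [["big data big cluster"], ["data on commodity computers"]]
counts = word_count(splits)
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold each split, which is why storage and compute are treated as the two cores of the design.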
  9. What is Hadoop?
       • Hadoop founder: Doug Cutting
       • Apache committer: Lucene, Nutch, …
  10. Supercomputer vs. Hadoop
       • Source: "Parallel vs. Distributed file systems" by Michael Malak
       • Diagram labels: Cluster for Compute; Cluster for Store; Cluster for Compute/Store
  11. Definition: Big Data
       • Inexpensive frameworks that can store large-scale data and process it faster in parallel
       • Hadoop
           - Inexpensive supercomputer
           - More accessible to the public than traditional supercomputers: you can store and process your applications in your university labs, small companies, research centers
  12. Hadoop Cluster: Logical Diagram
       • Web browser of the cluster monitor (CM/Ambari), connected over HTTP(S)
       • Cluster Monitor
       • Cluster nodes, each running an Agent with Hadoop and HDFS
       • Services shown: Hive, ZooKeeper, Impala
  13. Hadoop Ecosystems
       • http://dawn.dbsdataprojects.com/tag/hadoop/
  14. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  15. Alternative to Hadoop MapReduce
       • Limitations of MapReduce
           - Hard to program in Java
           - Only Map and Reduce: limited parallelization
           - Batch processing: not interactive
           - Disk storage for intermediate data: performance issue
       • Spark, by UC Berkeley AMPLab
           - In-memory storage for intermediate data
           - 20 to 100 times faster than MapReduce, which goes through network and disk
  16. Spark
       • In-memory data computing
           - Faster than Hadoop MapReduce
       • Can integrate with Hadoop and its ecosystems
           - HDFS, Amazon S3, HBase, Hive, sequence files, Cassandra, ArcGIS, Couchbase, …
       • New programming with faster data sharing
           - Good for complex multi-stage applications: iterative graph algorithms, machine learning
           - Interactive query
  17. Spark
       • Spark Core: RDDs, transformations, and actions
       • Spark Streaming (real-time): DStreams, streams of RDDs
       • Spark SQL: SchemaRDDs, DataFrames
       • ML / MLlib (machine learning): RDD-based matrices
       • GraphX (graph): RDD-based matrices
       • SparkR: RDD-based matrices
  18. Spark Drivers and Workers
       • Drivers
           - Client with SparkContext: communicates with Spark workers
       • Workers
           - Spark executor
           - Run on cluster nodes: production
           - Run in local threads: development and test
  19. RDD
       • Resilient Distributed Dataset (RDD)
           - Distributed collections of objects that can be cached in memory
       • Immutable
           - RDD, DStream, SchemaRDD, PairRDD
       • Lineage
           - History of the objects
           - Automatically and efficiently recomputes lost data
  20. RDD and DataFrame Operations
       • Transformations
           - Define new RDDs and DataFrames from the current ones
           - Lazy: not computed immediately
           - map(), filter(), join(), select(), groupBy()
       • Actions
           - Return values
           - count(), collect(), take(), save()
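The split between lazy transformations and actions, together with the immutability and lineage ideas from the RDD slide, can be mimicked in plain Python. The `LazyDataset` class below is a hypothetical toy, not the Spark API: it only shares method names such as `map()`, `filter()`, `count()`, `collect()`, and `take()` with Spark, and it runs in a single process.

```python
import itertools


class LazyDataset:
    """Toy stand-in for an RDD: immutable, lazy, and rebuilt from its lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # original input, never modified
        self._lineage = tuple(lineage)  # recorded (kind, function) steps

    # Transformations: return a NEW dataset and record the step; nothing runs yet.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        """Replay the recorded lineage over the source, one item at a time."""
        for item in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: these trigger the computation and return values.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def take(self, n):
        return list(itertools.islice(self._compute(), n))


calls = []


def square(x):
    calls.append(x)  # record when the function really runs
    return x * x


numbers = LazyDataset(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has run yet
result = evens_squared.collect()  # the action replays the lineage
print(calls_before_action, result)
```

Because every action replays the lineage from the source, a lost result can always be rebuilt, which is the property the Lineage bullet describes. Real Spark adds partitioning and in-memory caching on top of this idea.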
  21. Programming in Spark
       • Scala
           - Functional programming: the fundamental unit of programming is the function; input and output are functions
           - No side effects: no state
       • Python
           - Legacy, large libraries
       • Java
       • R
  22. (no text on this slide)
  23. Spark
       • Spark SQL
           - Querying using SQL, HiveQL
           - DataFrame
       • ML
           - Machine learning on DataFrame, pipelining
       • MLlib
           - On RDD
           - Sparse vector support, decision trees, linear/logistic regression, PCA, SVM
       • Spark Streaming
           - DStream: RDD in streaming
           - Windows: to select DStream from streaming data
  24. Scheduling Process
       • Example program: rdd1.join(rdd2).groupBy(…).filter(…)
       • RDD Objects / Optimizer: build the operator DAG
       • DAGScheduler: split the graph into stages of tasks; submit each stage as ready
       • TaskScheduler: launch tasks via the cluster manager; retry failed or straggling tasks
       • Worker (threads, block manager): execute tasks; store and serve blocks
       • Hand-offs between components: DAG, TaskSet, Task; failure signal: stage failed
       • Diagram annotations: agnostic to operators; doesn't know about stages
  25. During Scheduling Process
       • https://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797
  26. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  27. Spark
       • Spark
           - File systems: Tachyon
           - Resource manager: Mesos
       • But Hadoop has been dominating the market
       • Integrating Spark into a Hadoop cluster
           - Cloud computing: Amazon AWS, Azure HDInsight, IBM Bluemix (Object Storage, S3)
           - Hadoop vendors: HDP, CDH
       • Databricks: Spark on AWS
           - No Hadoop ecosystems
  28. Spark Components
       • Your program: sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count(); ...
       • Spark driver/client (app master): RDD graph, scheduler, block tracker, shuffle tracker
       • Cluster manager
       • Spark worker(s): block manager, task threads
       • Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
  29. Spark with Hadoop YARN
       • ResourceManager (RM), per cluster
           - Creates the Spark AM and allocates containers for the Spark AM
       • NodeManager (NM), per node
           - Spark workers
       • ApplicationMaster (AM), per application
           - Containers for Spark executors
       • Diagram: Spark client; master node with Resource Manager; slave nodes with Node Managers, containers (Spark executors), and the Spark AM
  30. Databricks cluster at CalStateLA
  31. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  32. Open Platform
       • Open Source
       • Open Conference
       • Open Data
           - Public data
  33. Open Source
       • Hadoop: http://hadoop.apache.org/
       • Spark: http://spark.apache.org/
       • NoSQL: http://hbase.apache.org/
       • Search engine: http://lucene.apache.org/solr/
  34. Open Conference
       • Hadoop Summit
           - Live streaming: http://siliconangle.tv/hadoop-summit-2016/
       • Spark Summit: https://spark-summit.org/east-2017/
           - Live streaming: http://go.spark-summit.org/east-2017/live-stream?_ga=1.62160364.1150099959.1484851457
  35. Open Data
       • USA government
           - Federal, state, city governments
           - Expose data to the public
       • USA business
           - Twitter, Yelp, …
           - Expose data to the public with APIs: some restrictions on downloads
       • City government
           - New York: taxi, Uber, …
           - Los Angeles: Open Data, Open Hub with geo info
  36. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  37. Databricks Partners
  38. Industrial Collaboration
       • Cloudera visits to interview Jongwook Woo
  39. Industrial Collaboration: IBM Bluemix at CalStateLA
  40. Big Data Analysis and Prediction Flow
       • Data collection
           - Batch API: Yelp, Google
           - Streaming: Twitter, Apache NiFi, Kafka, Storm
           - Open data: government
       • Data storage
           - HDFS, S3, Object Storage, NoSQL DB (Couchbase), …
       • Data filtering
           - Hive, Pig
       • Data analysis and science
           - Hive, Pig, Spark, BI tools (Tableau, Qlik, …)
       • Data visualization
           - Qlik, Datameer, Excel PowerView
  41. Databricks cluster at CalStateLA
  42. Local Business Data Analysis
       • Yashaswi Ananth, Ruchi Singh, Mahsa Tayer Farahani
  43. Local Business Data Analysis
       • Using local business data from Yelp and Google Local
       • Grad students at CalStateLA Symposium, Feb 24 2017
       • Yashaswi Ananth, Ruchi Singh, Mahsa Tayer Farahani
  44. Review Count for Business Types
       • Food
       • Services
       • Entertainment
       • Shopping
       • Medical
  45. Top Business in the Six Categories
  46. Review count of popular sub-categories of business
  47. Sentiment Analysis of Services category
  48. Top business
       • Top 5 most popular local businesses on Yelp between 2006 and 2016 in the selected cities
  49. Businesses popular within 5 miles of CalStateLA, USC, UCLA
  50. Historical Analysis of College Scorecard
       • CalStateLA Symposium, Feb 24 2017
       • Kunal Pritwani, Atinder Singh, Dharmesh Soni, Mounika Vallabhaneni
  51. Specification of Data Set
       • Data is collected from https://www.kaggle.com/kaggle/college-scorecard
       • We have historical data of over 100,000 colleges in the US spanning over 14 years
       • Data size: 1.33 GB
       • File format: CSV (comma-separated values)
  52. Mean Income
       • Medical College of Wisconsin: 250K
       • Upstate Medical University: 152.7K
       • CalTech: 103K
       • Washington and Lee University: 100K
  53. Comparing Average Net Price of Two States (Annual Tuition)
       • UCLA: $13,817
       • CalStateLA: $4,370
       • Fashion Inst of Tech: $11.5K
       • CUNY: $5K
  54. SAT Scores in Different Colleges
       • Chart legend: Math (blue), Verbal (orange), Mean Earning (purple)
       • CalTech: Math 800, Verbal 778.9, Mean Earning $98.7K
       • MIT: Math 800, Verbal 764.4, Mean Earning $124.4K
       • Harvard: Math 791, Verbal 795.6, Mean Earning $133K
       • Princeton: Math 793, Verbal 791, Mean Earning $115.6K
       • Yale: Math 788, Verbal 794.4, Mean Earning $97.8K
  55. Comparing Average Undergraduates Receiving PELL GRANT
       • Universal Career Community College: 100% PELL grant scholarship
  56. Average Undergraduates Receiving PELL GRANT in Each College
       • East Georgia State College: $2,854
       • Avg. PELL grant: 97.285%
  57. Alphago vs Lee using Twitter Data
       • Systems
           - Azure HDInsight Spark
           - 8 nodes: 40 cores, 2.4 GHz Intel Xeon
           - Memory per node: 28 GB
       • Data source
           - Keyword 'alphago' from Twitter via Apache NiFi
       • Data size
           - 63,193 tweets
       • Real-time data
           - Collection period: 03/12 to 03/17/2016
           - No data collected on 03/13
  58. Top 10 Countries that Tweet "Alphago"
  59. Top 10 Countries: Number of Tweets per Country
       • USA: > 11,000
       • Japan: > 9,000
       • Korea: > 1,900
       • Russia, UK: > 1,600
       • Thailand, France: > 1,000
       • Netherlands, Spain, Ukraine: > 600
  60. Top 10 Countries: Sentiment
       • Legend: Positive, Negative
  61. Top 10 Countries: Most Tweeted Countries
       • All countries show more positive tweets: Korea, Japan, USA
       • Tweet counts by sentiment:
           Country   Positive   Negative
           USA       5070       3567
           Japan     8118       217
           …
           Korea     1053       407
           …
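The claim that positive tweets outnumber negative ones can be checked by turning the counts on this slide into positive shares. The sketch below uses only the three rows that are legible on the slide (the other rows are elided there), and the helper name `positive_share` is an assumption made for illustration.

```python
# (positive, negative) tweet counts copied from the slide; other rows are elided there.
sentiment_counts = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    """Fraction of sentiment-labelled tweets that are positive."""
    return positive / (positive + negative)


shares = {
    country: positive_share(pos, neg)
    for country, (pos, neg) in sentiment_counts.items()
}

# Print the countries from the most to the least positive share.
for country, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {share:.1%} positive")
```

For these three rows the shares come out at roughly 97% for Japan, 72% for Korea, and 59% for the USA, so each is above one half.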
  62. Daily Tweets in 03/12 to 03/17/2016 (Alphago vs Lee Sedol)
       • Chart: tweets per day (y-axis 0 to 50,000) by date (3/12/2016 to 3/17/2016)
       • Annotations: Game 3: Mar 12; Game 4: Mar 13, Lee Se-Dol win; Game 5: Mar 15
  63. Ngram Words
       • 3 words in a row right after the Go champion's name, "sedol" and "se-dol"
       • 3-grams and their frequency:
           3-gram              Frequency
           Again-to-win        1,187
           Is-something-I'll   369
           Is-something-i      199
           In-go-tournament    168
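Counting the three words that follow a keyword can be sketched in a few lines. The tokenizer, the function names (`trigrams_after_keyword`, `count_trigrams`), and the sample tweets below are assumptions made for illustration; the slides do not say how the original analysis tokenized tweets, so real counts would depend on those choices.

```python
import re
from collections import Counter

KEYWORDS = {"sedol", "se-dol"}


def trigrams_after_keyword(text, keywords=KEYWORDS):
    """Yield the 3 words that directly follow each keyword occurrence."""
    # Lowercase and keep letters, digits, apostrophes, and hyphens inside tokens.
    tokens = re.findall(r"[a-z0-9'-]+", text.lower())
    for i, token in enumerate(tokens):
        following = tokens[i + 1 : i + 4]
        if token in keywords and len(following) == 3:
            yield "-".join(following)


def count_trigrams(tweets):
    """Count keyword-anchored 3-grams over a collection of tweets."""
    counter = Counter()
    for tweet in tweets:
        counter.update(trigrams_after_keyword(tweet))
    return counter


# Hypothetical tweets, made up for this example.
tweets = [
    "Lee Sedol again to win against AlphaGo?",
    "Can Se-dol again to win game five",
    "Sedol in Go tournament today",
]
top = count_trigrams(tweets).most_common(2)
print(top)
```

On a cluster, the per-tweet extraction would be the map step and the counting the reduce step.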
  64. Sentiment Map of Alphago
       • Legend: Positive, Negative
  65. Sentiment Map of Lee Se-Dol vs Alphago
       • YouTube video: "alphago sentiment" by Google
       • The sentiment of the world in geo and time: https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
  66. Airline Data Set
       • Government open data
           - Airline data set, 2012 to 2014
           - US Dept of Transportation
       • Cluster by Nillohit at HiPIC, CSULA
       • Microsoft Azure using Hive and Spark SQL
           - Number of data nodes: 4
           - CPU: 4 cores; memory: 7 GB
           - Windows Server 2012 R2 Datacenter
  67. Airline Data Set
  68. Airline Data Set
  69. Airline Data Set
  70. City Government: Crime Data Set
       • Open Data in City of Los Angeles
           - Crime data set in 2014
       • Ram Dharan and Sridhar Reddy at HiPIC, CSULA
       • Microsoft Azure using Hive and Spark SQL
           - Number of data nodes: 4
           - CPU: 4 cores; memory: 14 GB
           - Windows Server 2012 R2 Datacenter
           - Extending to the last 10 years of the data set
  71. Crime Data Los Angeles 2014
       • Pie chart: total occurrences of each crime
       • Categories: Criminal, Vandalism, Others, Burglary, Assault, Traffic, Theft
       • Shares: 2%, 8%, 9%, 12%, 17%, 19%, 33%
  72. Total Number of Crimes in 2014
       • Bar chart: number of crimes per month (y-axis 0 to 25,000; x-axis months 1 to 12)
       • Values: 19169, 17384, 19730, 19413, 20645, 20494, 21480, 21280, 21287, 21669, 19844, 21355
  73. Raw Data Projection on Map
  74. Mapping of Crimes Occurred within 5 miles from CalStateLA
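Selecting records within 5 miles of a campus comes down to a great-circle distance filter. The sketch below uses the haversine formula; the slides do not state which distance method the project used, so this is one possible approach. The campus coordinates are approximate and the sample records are made up for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(records, center, radius_miles=5.0):
    """Keep the records whose coordinates lie within radius_miles of center."""
    lat0, lon0 = center
    return [
        r for r in records
        if haversine_miles(lat0, lon0, r["lat"], r["lon"]) <= radius_miles
    ]


# Approximate campus coordinates, assumed for illustration.
CALSTATE_LA = (34.066, -118.168)

# Made-up records; a real run would read them from the city's open data set.
records = [
    {"id": "a", "lat": 34.070, "lon": -118.170},    # a few blocks from campus
    {"id": "b", "lat": 34.0522, "lon": -118.2437},  # downtown LA, inside 5 miles
    {"id": "c", "lat": 34.0689, "lon": -118.4452},  # west LA, well outside 5 miles
]
nearby = within_radius(records, CALSTATE_LA)
print([r["id"] for r in nearby])
```

The same predicate could be applied as a filter step in Hive or Spark SQL, the tools named for this project, once each record carries latitude and longitude columns.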
  75. Mapping of Crimes Occurred within 5 miles from UCLA
  76. Mapping of Crimes Occurred within 5 miles from USC
  77. Mapping of Crimes Occurred within 5 miles from CalStateLA, UCLA and USC in 2015
  78. Number of crimes within 5 miles from CSULA, UCLA and USC by crime type
       • Bar chart: y-axis 0 to 30,000; series: csula, ucla, usc
  79. Contents: Myself, Big Data, Spark, Spark and Hadoop, Open Platform, Use Cases, Future Trend
  80. Future Research Trend
       • Deep learning
           - TensorFlow and Spark: Yahoo, Intel, Google
           - Image recognition, prediction analysis
       • ChatBot
           - Amazon Alexa API
           - IBM Watson ChatBot API
           - Google Home API
       • More in-memory processing
           - Spark DataFrame, Dataset, ML
       • Cloud computing
           - IBM Bluemix, MS Azure, Google Cloud, Amazon AWS
  81. Questions?
