President Election of Korea in 2017

Data analysis and visualization of Twitter data for the 2017 Korean presidential election, using ElasticSearch and Hadoop Spark


  1. Data Collection and Visualization using Big Data: President Election 2017 in Korea
     Jongwook Woo, PhD, jwoo5@calstatela.edu
     High-Performance Information Computing Center (HiPIC), California State University Los Angeles
     Seoul Elasticsearch Community Meetup, Gangnam, Korea, Aug 10 2017
  2. Contents
     • Myself
     • Introduction to Big Data
     • Architecture
     • Demo
  3. Myself
     Experience:
     • Since 2002: Professor at California State University Los Angeles
       – PhD in 2001: Computer Science and Engineering at USC
     • Since Jan 2016: Co-Founder of The Big Link LLC and Wiken
     • Since 1998: R&D consulting in Hollywood
       – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
       – Information search and integration with FAST, Lucene/Solr, Sphinx
       – Implemented eBusiness applications using J2EE and middleware
     • Since 2007: Exposed to Big Data at CitySearch.com
     • 2012 - Present: Big Data Academic Partnerships
       – For Big Data research and training
         • Amazon AWS, Microsoft Azure, IBM Bluemix
         • Databricks, Hadoop vendors
  4. Myself: Experience (Cont’d)
     • Bringing Big Data R&D and training to Korea since 2009
     • Collaborating with LA city since 2016
       – Collect, search, and analyze city data
         • Spark, Hadoop, ElasticSearch, Solr, Java, Cloudera
     • Sept 2013: Samsung Advanced Technology Training Institute
     • Since 2008
       – Introducing Hadoop Big Data and education to universities and research centers
         • Yonsei, Gachon, DongEui
         • US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
         • Europe: Univ of Luxembourg
  5. Experience in Big Data
     • Collaboration
       – Council Member of IBM Spark Technology Center
       – City of Los Angeles for OpenHub and Open Data
       – Startup companies in Los Angeles
       – External Collaborator and Advisor in Big Data
         • IMSC of USC
         • Pennsylvania State University
         • The Big Link, Softzen, Wiken in Korea
     • Grants and Awards
       – Faculty Scholarship Winner of Teradata University Network 2017
       – IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
     • Partnership
       – Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  6. Contents
     • Myself
     • Introduction to Big Data
     • Architecture
     • Demo
  7. Two Cores in Big Data
     • How to store Big Data
     • How to compute Big Data
     Google
     • How to store Big Data
       – GFS
       – Distributed systems on inexpensive commodity computers
     • How to compute Big Data
       – MapReduce
       – Parallel computing with inexpensive computers
     • Its own supercomputers
     • Published papers in 2003, 2004
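The MapReduce model named on this slide can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in one Python process, whereas a real framework distributes the same three steps across many inexpensive machines.

```python
from collections import defaultdict


def map_phase(records):
    """Map step: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(word_count(["big data", "Big compute"]))
```

Because each map call and each reduce call depends only on its own input, the work can be split across machines, which is the property the slide refers to as parallel computing with inexpensive computers.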
  8. Definition: Big Data
     • Inexpensive frameworks that are distributed parallel systems and that can store large-scale data and process it in parallel [1, 2]
     • Hadoop and Spark
       – Inexpensive supercomputer
       – More accessible to the public than traditional supercomputers
         • You can store and process your applications
           – In your university labs, small companies, research centers
     • Others
       – Cloud Computing Big Data services
         • Amazon AWS, IBM Bluemix, Microsoft Azure
       – NoSQL DB (Cassandra, MongoDB, Redis, HBase)
       – ElasticSearch
  9. Spark
     • In-memory data computing
       – Faster than Hadoop MapReduce
     • Can integrate with Hadoop and its ecosystems
       – HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
     • New programming with faster data sharing
     • Good for
       – Iterative graph algorithms, Machine Learning
       – Interactive query
  10. ElasticSearch
      • Full-text search and visualization server
      • Getting more popular than Solr
      • ElasticSearch, Kibana, ES-Hadoop, Logstash,…
      • Based on the Apache Lucene library
      • Horizontally scalable
  11. ElasticSearch: Elastic Stack
      • 100% open source
      • No enterprise edition
      • All new versions with 5.0
  12. ElasticSearch: ES-Hadoop
      • Elasticsearch for Hadoop
        – Exchange data between Hadoop HDFS and ElasticSearch
  13. Contents
      • Myself
      • Introduction to Big Data
      • Architecture
      • Demo
  14. Big Data Analysis Flow
      • Data Collection
        – Batch API: Yelp, Google
        – Streaming: Twitter, Apache NiFi, Kafka, Storm
        – Open Data: Government
      • Data Storage
        – HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
      • Data Filtering
        – Hive, Pig
      • Data Analysis and Science
        – Hive, Pig, Spark, BI Tools (Datameer, Qlik, Tableau,…)
      • Data Visualization
        – Qlik, Datameer, Excel PowerView
  15. Data Engineering
      • Data Source
        – Twitter streaming API, using the keywords
          • "문재인", "moonriver365", "안철수", "cheolsoo0919", "유승민", "yooseongmin2017", "홍준표", "HongSkyangel808", "심상정", "sangjungsim"
        – Roughly April 28 2017 – May 11 2017
      • Data Collection
        – Apache NiFi for streaming data
          • Supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic
      • Data Storage
        – ElasticSearch
        – Hadoop HDFS at Azure
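The keyword list above pairs each candidate's Korean name with a Twitter screen name. As a rough sketch of how collected tweets could be attributed to candidates afterwards, the snippet below groups those keywords per candidate and checks a tweet's text against them. The grouping, the romanized candidate names, and the function name `candidates_mentioned` are assumptions made for illustration and are not part of the slides; plain case-insensitive substring matching is also simpler than the matching the streaming API itself performs.

```python
# Keywords copied from the slide, grouped per candidate.
# The romanized names used as dictionary keys are illustrative labels.
CANDIDATE_KEYWORDS = {
    "Moon Jae-in": ["문재인", "moonriver365"],
    "Ahn Cheol-soo": ["안철수", "cheolsoo0919"],
    "Yoo Seung-min": ["유승민", "yooseongmin2017"],
    "Hong Jun-pyo": ["홍준표", "HongSkyangel808"],
    "Sim Sang-jung": ["심상정", "sangjungsim"],
}


def candidates_mentioned(text):
    """Return the candidates whose keywords appear in the tweet text.

    Matching is a case-insensitive substring test; the result follows
    the order of CANDIDATE_KEYWORDS.
    """
    lowered = text.lower()
    return [
        name
        for name, keywords in CANDIDATE_KEYWORDS.items()
        if any(keyword.lower() in lowered for keyword in keywords)
    ]


if __name__ == "__main__":
    print(candidates_mentioned("RT @sydbris: 이 정도는 우리 문재인 후보님이"))
```

A per-candidate label like this could then be stored as an extra field, which would make per-candidate counts straightforward to chart in Kibana.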
  16. Data Engineering (Cont’d)
      • Data Analysis and Prediction: in the future
        – Spark ML, Spark SQL, Hadoop Hive
      • Data Visualization
        – Kibana in ElasticSearch
  17. Apache NiFi
      • NiFi 1.1.2 processors: GetTwitter, PutElasticsearch5, PutHDFS
  18. Hadoop Spark Cluster: HDInsight in Azure
      vCores   Memory (GB)   Local SSD (GB)
      4        28            200
  19. ElasticSearch in HDInsight
      • Did not launch the ElasticSearch service in Azure
      • Instead, installed ES5 on the Linux head node of the HDInsight cluster
        – ElasticSearch 5.3.1
        – Kibana 5.3.2
  20. Mapping to ES: Temporal-Spatial Analysis
      • For matching the Twitter date format to ES

          curl -XPUT localhost:9200/_template/elect17 -d '
          {
            "template" : "elect17*",
            "settings" : { "number_of_shards" : 1 },
            "mappings" : {
              "default" : {
                "properties" : {
                  "created_at" : {
                    "type" : "date",
                    "format" : "EEE MMM dd HH:mm:ss Z YYYY"
                  },
  21. Mapping to ES (Cont’d)

                  "coordinates" : {
                    "properties" : {
                      "coordinates" : { "type" : "geo_point" },
                      "type" : { "type" : "string" }
                    }
                  },
                  "user" : {
                    "properties" : {
                      "screen_name" : {
                        "type" : "string",
                        "index" : "not_analyzed"
                      },
  22. Mapping to ES (Cont’d)

                      "lang" : {
                        "type" : "string",
                        "index" : "not_analyzed"
                      }
                    }
                  }
                }
              }
            }
          }'
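The template maps `created_at` with the pattern `EEE MMM dd HH:mm:ss Z YYYY` and `coordinates.coordinates` as a `geo_point`. The following sketch checks both fields on the client side before indexing. It assumes that the Python `strptime` pattern `%a %b %d %H:%M:%S %z %Y` corresponds to that date pattern and that the tweet's `coordinates` field is a GeoJSON point in `[longitude, latitude]` order; the helper names `parse_created_at` and `to_geo_point` are hypothetical.

```python
from datetime import datetime, timezone

# Assumed strptime equivalent of the template's "EEE MMM dd HH:mm:ss Z YYYY".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a Twitter created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


def to_geo_point(coordinates):
    """Convert a GeoJSON Point ([lon, lat]) into a {"lat", "lon"} object.

    Returns None when the tweet carries no point coordinates.
    """
    if not coordinates or coordinates.get("type") != "Point":
        return None
    lon, lat = coordinates["coordinates"]
    return {"lat": lat, "lon": lon}


if __name__ == "__main__":
    print(parse_created_at("Sun Apr 23 22:59:59 +0000 2017").isoformat())
    print(to_geo_point({"type": "Point", "coordinates": [126.978, 37.5665]}))
```

The explicit `lat`/`lon` object avoids ambiguity about coordinate order, since the array form lists longitude first.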
  23. K-Election 2017 (April 29 – May 9)
  24. K-Election 2017 (April 29 – May 9)
  25. ES-Hadoop
      • Install ES-Hadoop

          # Download and unpack the ES-Hadoop 5.3.1 distribution
          $ wget -P /tmp http://download.elastic.co/hadoop/elasticsearch-hadoop-5.3.1.zip
          $ unzip /tmp/elasticsearch-hadoop-5.3.1.zip -d /tmp
          # Copy the jar to the local /tmp directory and to /tmp on HDFS
          $ cp /tmp/elasticsearch-hadoop-5.3.1/dist/elasticsearch-hadoop-5.3.1.jar /tmp/elasticsearch-hadoop-5.3.1.jar
          $ hdfs dfs -copyFromLocal /tmp/elasticsearch-hadoop-5.3.1/dist/elasticsearch-hadoop-5.3.1.jar /tmp
          # Make the Spark connector available to the Spark 2 client (run from the directory that contains the jar)
          $ sudo cp elasticsearch-spark-20_2.11-5.3.1.jar /usr/hdp/current/spark2-client/
  26. ES-Hadoop (Cont’d)
      • Add the ES-Hadoop library to Hive with one of the following:

          $ hive
          hive> add jar hdfs:///tmp/elasticsearch-hadoop-5.3.1.jar;
          hive> add jar /tmp/elasticsearch-hadoop-5.3.1.jar;
          hive> add jar file:///tmp/elasticsearch-hadoop-5.3.1.jar;
          hive> list jar;
          file:///tmp/elasticsearch-hadoop-5.3.1.jar
  27. ES-Hadoop (Cont’d)

          hive> select * from elect17_test LIMIT 10;
          OK
          856281525070909440 NULL NULL NULL NULL RT @sydbris: 이 정도는 우리 문재인 후보님이 절대 말씀하시지 않겠지. "넌 내가 유신 반대투쟁하고 민주화운동할 때 친구들이랑 고대 앞 하숙방에 모여서 xx 모의했냐?" Sun Apr 23 22:59:59 +0000 2017
          856281524995407872 NULL NULL NULL NULL RT @choomiae: 존경하는 시흥시민 여러분! …
  28. Contents
      • Myself
      • Introduction to Big Data
      • Architecture
      • Demo
  29. Demo
      • Azure Portal
      • Ubuntu VM
      • ElasticSearch
      • NiFi
      • Kibana: April 29 – May 10
      • Hive with ES-Hadoop
        – Test with the data on April 23 – April 24
  30. Spark Big Data Training and R&D
      HiPIC, California State University Los Angeles
      • Supported by
        – Databricks and its cloud computing services
        – Amazon AWS, IBM Bluemix, MS Azure
        – Hortonworks, Cloudera
        – Teradata
        – ElasticSearch
        – Qlik, Tableau
  31. Databricks Partners
  32. Training Hadoop and Spark
      • Cloudera visits to interview Jongwook Woo
  33. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
  34. Conclusion
      • K-Election 2017 in ES5 and HDInsight
      • ES5
        – Easy to collect and visualize
      • HDInsight
        – Data and predictive analysis possible
  35. Questions?
  36. References
      1. Jongwook Woo and Yuhang Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas, July 18-21, 2011
      2. Jongwook Woo, “Market Basket Analysis Algorithms with MapReduce”, DMKD-00150, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
      3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
  37. References (Cont’d)
      4. Jongwook Woo, Business Data Analysis LA at Databricks, HiPIC of CalStateLA, https://docs.databricks.com/spark/latest/training/cal-state-la-biz-data-la.html
      5. HiPIC of California State University Los Angeles, https://github.com/hipic/spark_mba
      6. Hadoop, http://hadoop.apache.org
      7. Databricks, http://www.databricks.com
      8. DS320: DataStax Enterprise Analytics with Spark
      9. Cloudera, http://www.cloudera.com
      10. Hortonworks, http://www.hortonworks.com