Jongwook Woo, PhD, jwoo5@calstatela.edu
HiPIC, CalStateLA
UKC 2016, Dallas, TX, Aug 12 2016

Big Data Trend and Open Data


Presented at UKC 2016, Dallas, TX, Aug 12 2016

Published in: Data & Analytics

  1. Big Data Trend and Open Data
     Jongwook Woo, PhD, jwoo5@calstatela.edu
     High-Performance Information Computing Center (HiPIC), California State University Los Angeles
     UKC 2016, Dallas, TX, Aug 12 2016
  2. Contents
     • Myself
     • Introduction to Big Data
     • Introduction to Spark
     • Spark and Hadoop
     • Open Data and Use Cases
     • Hadoop Spark Training
  3. Myself
     Experience:
     • Since 2002: Professor at California State University Los Angeles
       – PhD in 2001: Computer Science and Engineering at USC
     • Since 1998: R&D consulting in Hollywood
       – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
       – Information search and integration with FAST, Lucene/Solr, Sphinx
       – Implementing eBusiness applications using J2EE and middleware
     • Since 2007: Exposed to Big Data at CitySearch.com
     • 2012 - Present: Big Data academic partnerships
       – For Big Data research and training
         • Amazon AWS, Microsoft Azure, IBM Bluemix
         • Databricks, Hadoop vendors
  4. Myself: Experience (Cont’d)
     • Bringing Big Data R&D and training to Korea since 2009
     • Collaborating with the City of Los Angeles in 2016
       – Collect, search, and analyze city data
         • Hadoop, Solr, Java, Cloudera
     • Sept 2013: Samsung Advanced Technology Training Institute
     • Since 2008
       – Introducing Hadoop Big Data and education to universities and research centers
         • Yonsei, Gachon
         • US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
         • Europe: Univ of Luxembourg
  5. Experience in Big Data
     • Collaboration
       – Council member of the IBM Spark Technology Center
       – City of Los Angeles for OpenHub and Open Data
       – Startup companies in Los Angeles
       – External collaborator and advisor in Big Data
         • IMSC of USC
         • Pennsylvania State University
     • Grants
       – IBM Bluemix, Microsoft Windows Azure, and Amazon AWS research and education grants
     • Partnership
       – Academic education partnerships with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  6. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  7. Data Issues
     • Large-scale data
       – Terabytes (10^12 bytes), petabytes (10^15 bytes)
       – Because of the web
       – Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
     • Cannot be handled with the legacy approach
       – Too big
       – Non-/semi-structured data
       – Too expensive
     • Need new systems
       – Inexpensive
  8. Two Cores in Big Data
     • How to store Big Data
     • How to compute Big Data
     Google
     • How to store Big Data
       – GFS
       – Distributed systems on inexpensive commodity computers
     • How to compute Big Data
       – MapReduce
       – Parallel computing with inexpensive computers
     • Own supercomputers
     • Published papers in 2003, 2004
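The MapReduce idea on slide 8 can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's API; the function names and the sample lines are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands for one line of a large file.
counts = word_count(["big data", "open data", "Big Data trend"])
print(counts)
```

On a real cluster each map and reduce call would run on a different machine, and the shuffle would move data over the network; the structure of the program stays the same.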
  9. What is Hadoop?
     • Hadoop founder: Doug Cutting
     • Apache committer: Lucene, Nutch, …
  10. Definition: Big Data
      • Inexpensive frameworks that can store large-scale data and process it faster in parallel
      • Hadoop
        – Inexpensive supercomputer
        – More accessible to the public than traditional supercomputers
          • You can store your data and run your applications in your university labs, small companies, research centers
  11. Hadoop Cluster: Logical Diagram
      [Diagram labels: web browser of the cluster monitor (CM/Ambari), HTTP(S), cluster monitor, agents on each Hadoop node, HDFS, Hive, ZooKeeper, Impala]
  12. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  13. Alternative to Hadoop MapReduce
      • Limitations of MapReduce
        – Hard to program in Java
        – Batch processing: not interactive
        – Disk storage for intermediate data: performance issue
      • Spark by UC Berkeley AMPLab
        – In-memory storage for intermediate data
        – 20 ~ 100 times faster than network- and disk-based MapReduce
  14. Spark
      • In-memory data computing
        – Faster than Hadoop MapReduce
      • Can integrate with Hadoop and its ecosystems
        – HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
      • New programming with faster data sharing
        – Good for complex multi-stage applications: iterative graph algorithms, machine learning
        – Interactive query
  15. Spark
      [Diagram: the Spark stack]
      • Spark Core: RDDs, transformations, and actions
      • Spark Streaming (real-time): DStreams, streams of RDDs
      • Spark SQL: SchemaRDDs, DataFrames
      • MLlib, ML (machine learning): RDD-based matrices
      • GraphX (graph): RDD-based matrices
      • SparkR: RDD-based matrices
  16. RDD Operations
      • Transformations
        – Define new RDDs from the current one
        – Lazy: not computed immediately
        – map(), filter(), join()
      • Actions
        – Return values
        – count(), collect(), take(), save()
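The split between lazy transformations and actions on slide 16 can be illustrated without a cluster. The sketch below is a toy stand-in written for this example; `LazyDataset` is a hypothetical class, not Spark's RDD, and it only mimics the behavior that transformations record work while actions run it.

```python
import itertools


class LazyDataset:
    """Toy stand-in for an RDD: records a chain of steps and runs it only on an action."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable collection, e.g. a list
        self._steps = steps    # tuple of ("map" | "filter", function)

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(itertools.islice(self._run(), n))


calls = []


def square(x):
    calls.append(x)  # record when the function is actually executed
    return x * x


squares = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # nothing has run yet
result = squares.collect()        # the action runs the whole chain
print(calls_before_action, result)
```

Deferring the work this way is what lets a scheduler see the whole chain before running it, so consecutive steps can be pipelined.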
  17. Programming in Spark
      • Scala
        – Functional programming
          • The function is the fundamental unit of programming: inputs and outputs can be functions
          • No side effects: no state
      • Python
        – Legacy, large libraries
      • Java
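The two functional-programming points on slide 17, functions as inputs and outputs and no side effects, can be shown in a short Python sketch. The `compose` helper and the sample words are assumptions introduced for illustration.

```python
from functools import reduce


def compose(*fns):
    """Take functions as input and return a new function that applies them left to right."""
    return lambda value: reduce(lambda acc, fn: fn(acc), fns, value)


# A function built from other functions; it keeps no state between calls.
normalize = compose(str.strip, str.lower)

words = ("  Spark ", "HADOOP", " Hive")
cleaned = tuple(map(normalize, words))  # the input tuple is left untouched
print(cleaned)
```

Because `normalize` depends only on its argument, it can be applied to any partition of the data on any node and give the same answer, which is the property distributed engines rely on.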
  18. Spark
      • Spark SQL
        – DataFrame: turning an RDD into a relation
        – Querying using SQL
      • Spark Streaming
        – DStream: RDDs in streaming
        – Windows: to select DStreams from streaming data
      • MLlib, ML
        – Sparse vector support, decision trees, linear/logistic regression, PCA
        – Pipeline
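The window idea on slide 18 can be sketched in plain Python as a sliding window over micro-batches. This is not the Spark Streaming API: the function name, its parameters (`window_length` and `slide_interval`, both counted in batches), and the sample batches are assumptions for illustration.

```python
from collections import deque


def windowed_counts(batches, window_length, slide_interval):
    """Yield the record count over the last `window_length` batches,
    once every `slide_interval` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    for index, batch in enumerate(batches, start=1):
        window.append(len(batch))
        if index % slide_interval == 0:
            yield sum(window)


# Hypothetical micro-batches, e.g. tweets that arrived in each interval.
batches = [["a", "b"], ["c"], [], ["d", "e", "f"]]
every_batch = list(windowed_counts(batches, window_length=2, slide_interval=1))
every_other = list(windowed_counts(batches, window_length=2, slide_interval=2))
print(every_batch, every_other)
```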
  19. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  20. Spark
      • Spark
        – File systems: Tachyon
        – Resource manager: Mesos
      • But Hadoop has been dominating the market
      • Integrating Spark into a Hadoop cluster
        – Cloud computing: Amazon AWS, Azure HDInsight, IBM Bluemix
          • Object Storage, S3
        – Hadoop vendors: HDP, CDH
      • Databricks: Spark on AWS
        – No Hadoop ecosystems
  21. Spark with Hadoop YARN
      • ResourceManager (RM): per cluster
        – Creates the Spark AM and allocates containers for the Spark AM
      • NodeManager (NM): per node
        – Spark workers
      • ApplicationMaster (AM): per application
        – Containers for Spark executors
      [Diagram labels: Spark Client; Master Node with Resource Manager; Slave Nodes with Node Managers, Spark AM, and Container: Spark Executor]
  22. Big Data Analysis Flow
      • Data collection
        – Batch API: Yelp, Google
        – Streaming: Twitter, Apache NiFi, Kafka, Storm
        – Open data: government
      • Data storage
        – HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
      • Data filtering
        – Hive, Pig
      • Data analysis and science
        – Hive, Pig, Spark, BI tools (Datameer, Qlik, …)
      • Data visualization
        – Qlik, Datameer, Excel PowerView
  23. Databricks cluster at CalStateLA
  24. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Use Cases; Hadoop Spark Training
  25. Open Data
      • US government
        – Federal, state, and city governments
        – Expose data to the public
      • US businesses
        – Twitter, Yelp, …
        – Expose data to the public through APIs, with some download restrictions
      • City governments
        – New York: taxi, Uber, …
        – Los Angeles: Open Data, Open Hub with geo info
  26. Open Big Data Analysis at CalStateLA
      • Social media data analysis
        – Twitter sentiment analysis for AlphaGo
      • Open data from government
        – Airline data analysis
        – Crime data analysis
      • Web service APIs
        – Business data analysis from the Yelp and Google Places APIs
  27. Data from Industry: Twitter Data
      • Systems
        – Azure HDInsight Spark, 8 nodes
        – 40 cores: 2.4 GHz Intel Xeon
        – Memory per node: 28 GB
      • Data source
        – Keyword ‘alphago’ from Twitter via Apache NiFi
      • Data size
        – 63,193 tweets
      • Real-time data collection period
        – 03/12 – 03/17/2016
        – No data collected on 03/13
  28. Top 10 Countries That Tweeted “Alphago”
  29. Top 10 Countries
      • Number of tweets per country
        – USA: > 11,000
        – Japan: > 9,000
        – Korea: > 1,900
        – Russia, UK: > 1,600
        – Thailand, France: > 1,000
        – Netherlands, Spain, Ukraine: > 600
  30. Top 10 Countries Sentiment
      [Chart legend: Positive, Negative]
  31. Top 10 Countries
      • Most tweeted countries
        – All countries show more positive tweets
        – Korea, Japan, USA

      Country   Positive   Negative
      USA       5070       3567
      Japan     8118       217
      …
      Korea     1053       407
      …
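The claim on slide 31 that positive tweets outnumber negative ones can be checked with a small calculation. The sketch below uses only the three rows that survive in the transcript (the elided rows are not filled in); the `positive_share` helper is a name chosen for this example.

```python
# (positive, negative) tweet counts as given on slide 31.
tweets = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    """Fraction of sentiment-tagged tweets that are positive."""
    return positive / (positive + negative)


shares = {
    country: round(positive_share(pos, neg), 3)
    for country, (pos, neg) in tweets.items()
}

for country, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{country}: {share:.1%} positive")
```

Comparing shares rather than raw counts makes countries with very different tweet volumes comparable.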
  32. Daily Tweets, 03/12 – 03/17/2016
      [Chart: tweets per day (axis 0 to 50,000) for 3/12/2016 through 3/17/2016]
      Alphago vs Lee Sedol: Game 3 on Mar 12; Game 4 on Mar 13 (Lee Se-Dol win); Game 5 on Mar 15
  33. N-gram Words
      • Three words in a row right after the Go champion’s name, “sedol” or “se-dol”

      3-grams             Frequency
      Again-to-win        1,187
      Is-something-I’ll   369
      Is-something-i      199
      In-go-tournament    168
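The n-gram analysis on slide 33 can be sketched as follows: find each occurrence of the keyword and count the three words that follow it. The tokenizer, the function name, and the sample tweets are assumptions made for this example; the deck does not show how the original job was written.

```python
import re
from collections import Counter

KEYWORDS = {"sedol", "se-dol"}
TOKEN = re.compile(r"[a-z0-9'-]+")  # simplistic tokenizer, an assumption


def trigrams_after(text, keywords=KEYWORDS):
    """Yield the three tokens that directly follow each keyword occurrence."""
    tokens = TOKEN.findall(text.lower())
    for index, token in enumerate(tokens):
        following = tokens[index + 1:index + 4]
        if token in keywords and len(following) == 3:
            yield "-".join(following)


# Hypothetical tweets, invented for illustration only.
sample_tweets = [
    "Lee Sedol again to win game four",
    "Can Se-dol again to win?",
    "sedol is something I'll remember",
    "AlphaGo beat sedol today",
]

trigram_counts = Counter(
    gram for tweet in sample_tweets for gram in trigrams_after(tweet)
)
print(trigram_counts.most_common())
```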
  34. Sentiment Map of Alphago
      [Map legend: Positive, Negative]
  35. Sentiment Map of Lee Se-Dol vs Alphago
      • YouTube video: “alphago sentiment” by Google
      • The sentiment of the world in geo and time: https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
  36. Federal Government: Airline Data Set
      • Government open data
        – Airline data set for 2012 – 2014
        – US Department of Transportation
      • Cluster by Nillohit at HiPIC, CalStateLA
        – Microsoft Azure using Hive and Spark SQL
        – Number of data nodes: 4
        – CPU: 4 cores; memory: 7 GB
        – Windows Server 2012 R2 Datacenter
  37. Airline Data Set
  38. Airline Data Set
  39. Airline Data Set
  40. City Government: Crime Data Set
      • Open data from the City of Los Angeles
        – Crime data set for 2012-2015
        – File size: 151 MB
        – Total number of offenses: 8.94 million
      • Ram Dharan and Sridhar Reddy at HiPIC, CalStateLA
        – Microsoft Azure using Hive and Spark SQL
        – Number of data nodes: 4
        – CPU: 4 cores; memory: 14 GB
        – Windows Server 2012 R2 Datacenter
        – Extending to the last 10 years of the data set
  41. Projection of Raw Data
      [Chart: counts (axis 0 to 90,000) for year2012, year2013, year2014, year2015]
  42. Total No. of Crimes in 2012-15
      [Chart: counts (axis 0 to 25,000) for year2012, year2013, year2014, year2015]
  43. Map of Crimes That Occurred within 5 Miles of CalStateLA, UCLA, and USC in 2015
  44. No. of Crimes for Every 5 Miles from CalStateLA
      [Chart: crime counts (axis 0 to 90,000) by distance band in miles (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, >35); series csula_2012, csula_2013, csula_2014, csula_2015]
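A chart like the one on slide 44 needs two steps: the distance from each incident to the campus, and the 5-mile band that distance falls in. The sketch below uses the haversine formula for the first step; the deck does not say how distances were computed, so this is an assumed approach. The campus coordinates are approximate, the incident points are invented, and the bands are uniform 5-mile intervals.

```python
import math
from collections import Counter

EARTH_RADIUS_MILES = 3958.8
CALSTATELA = (34.0664, -118.1684)  # approximate campus location, an assumption


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def band_label(distance, width=5, last_edge=35):
    """Map a distance to a band label such as '0-5', '5-10', ..., '>35'."""
    if distance >= last_edge:
        return f">{last_edge}"
    lower = int(distance // width) * width
    return f"{lower}-{lower + width}"


# Hypothetical incident coordinates, invented for illustration only.
incidents = [(34.07, -118.17), (34.20, -118.40), (33.50, -117.50)]

bands = Counter(
    band_label(haversine_miles(*CALSTATELA, lat, lon)) for lat, lon in incidents
)
print(bands)
```

In Hive or Spark SQL the same logic would typically become a distance expression followed by a GROUP BY on the band label.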
  45. No. of Crimes for Every 5 Miles from UCLA
      [Chart: crime counts (axis 0 to 120,000) by distance band in miles (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40); series ucla_2012, ucla_2013, ucla_2014, ucla_2015]
  46. No. of Crimes for Every 5 Miles from USC
      [Chart: crime counts (axis 0 to 120,000) by distance band in miles (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40); series labeled ucla_2012, ucla_2013, ucla_2014, ucla_2015 in the original]
  47. Comparison of Crimes for Every 5 Miles from CalStateLA, UCLA, and USC in 2015
      [Chart: crime counts (axis 0 to 120,000) by distance band in miles (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, 40-50, >50); series csula_2015, ucla_2015, usc_2015]
  48. No. of Crimes per Area in LA
      [Chart: crime counts (axis 0 to 18,000) per area: 77thStreet, Mission, Newton, Rampart, Southwest, Topanga, VanNuys, Wilshire, Central, Devonshire, Foothill, Harbor, Hollenbeck, Hollywood, NHollywood, Pacific, WestValley, Northeast, Olympic, Southeast, WestLA; series in2012, in2013, in2014, in2015]
  49. Total No. of Crimes for Every 2 Hours in LA
      [Chart: crime counts (axis 0 to 18,000); the extracted axis labels list the same areas as slide 48; series in2012, in2013, in2014, in2015]
  50. No. of Crimes for Every 2 Hours within 5 Miles of CalStateLA, UCLA, and USC in 2015
      [Chart: crime counts (axis 0 to 12,000) per 2-hour interval from 00:00-02:00 through 22:00-24:00; series usc, ucla, csula]
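The 2-hour grouping on slide 50 can be sketched as a bucketing function over timestamps. The function name and the sample timestamps are assumptions for illustration; the label format follows the chart's axis (00:00-02:00 through 22:00-24:00).

```python
from collections import Counter
from datetime import datetime


def two_hour_bucket(timestamp):
    """Return the 2-hour interval label that contains the timestamp."""
    start = (timestamp.hour // 2) * 2
    return f"{start:02d}:00-{start + 2:02d}:00"


# Hypothetical incident times, invented for illustration only.
incident_times = [
    datetime(2015, 1, 1, 0, 15),
    datetime(2015, 1, 1, 1, 59),
    datetime(2015, 3, 9, 13, 30),
    datetime(2015, 7, 4, 23, 59),
]

per_bucket = Counter(two_hour_bucket(t) for t in incident_times)
print(sorted(per_bucket.items()))
```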
  51. Business Data Analysis
      • Data set details
        – Yelp review data: 1.9 GB
        – Business data: 500 MB
        – Web service APIs from Yelp and Google Places
      [Diagram labels: Yelp Challenge data set, Google Places, Yelp data, Join, Analysis]
  52. Top 10 Businesses within 5 Miles of CalStateLA (with 5- or 4-Star Ratings)
      [Chart: counts (axis 0 to 40) with values 34, 31, 29, 26, 19, 19, 15, 15, 15; categories in legend order: Hair Salons, Auto Repair, General Dentistry, Insurance, Churches, Skin Care, Chiropractors, Barbers, Elementary Schools]
      • Hair Salons and Insurance are popular qualified business categories
  53. Businesses Popular within 5 Miles of CalStateLA, USC, and UCLA
  54. Number of Food Businesses within 0-25 Miles of CalStateLA, USC, and UCLA
      • CalStateLA has more food businesses within 5 miles than UCLA and USC
      [Chart: counts (axis 0 to 600) by distance band in miles (0-5, 5-10, 10-15, 15-20, 20-25); series CSULA, USC, UCLA]
  55. Hydrogen Gas Power Plant Prediction Model
      • The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.
  56. Hydrogen Gas Power Plant Prediction Model
      • The station produces hydrogen for hydrogen vehicles
      • The Cal State L.A. Hydrogen Research and Fueling Facility is the first station in the nation to sell hydrogen fuel to the public
      • Hyundai, Toyota
  57. Hydrogen Gas Power Plant Prediction Model: Workflow
  58. Hydrogen Gas Power Plant Prediction Model: Model by Manvi Chandra
  59. Hydrogen Gas Power Plant Prediction Model: Results and Observations
  60. Hydrogen Gas Power Plant Prediction Model: Results and Observations
      • Can predict vehicle pressure
        – Pressure of hydrogen gas within the vehicle hydrogen storage system
        – Using our model in Azure Visual Studio ML
        – Building Spark ML decision forest regression
        – Constructs a multitude of decision trees at training time and outputs
          • the mode of the classes (classification), or
          • the mean prediction (regression) of the individual trees
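The aggregation step that slide 60 describes for a decision forest, mean for regression and mode for classification, can be sketched in plain Python. The "trees" below are hand-written threshold rules invented for illustration; they are not the trained model from the talk, and the thresholds and outputs carry no physical meaning.

```python
from statistics import mean, mode


def forest_regress(trees, x):
    """Regression: average the predictions of the individual trees."""
    return mean([tree(x) for tree in trees])


def forest_classify(trees, x):
    """Classification: return the most common class among the trees."""
    return mode([tree(x) for tree in trees])


# Hypothetical one-split trees (stumps) over a single numeric feature.
regression_trees = [
    lambda x: 10.0 if x < 5 else 20.0,
    lambda x: 12.0 if x < 3 else 18.0,
    lambda x: 11.0 if x < 6 else 25.0,
]
classification_trees = [
    lambda x: "low" if x < 5 else "high",
    lambda x: "low" if x < 3 else "high",
    lambda x: "low" if x < 6 else "high",
]

predicted_value = forest_regress(regression_trees, 4)
predicted_class = forest_classify(classification_trees, 4)
print(predicted_value, predicted_class)
```

Averaging over many trees trained on different samples is what reduces the variance of any single tree.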
  61. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  62. Spark Big Data Training and R&D
      • HiPIC, California State University Los Angeles
      • Supported by
        – Databricks and its cloud computing services
        – Amazon AWS, IBM Bluemix, MS Azure
        – Hortonworks, Cloudera
        – Datameer
  63. Databricks Partners
  64. Training Hadoop and Spark: Cloudera visits to interview Jongwook Woo
  65. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
  66. Questions?
  69. Scheduling Process
      rdd1.join(rdd2).groupBy(…).filter(…)
      • RDD Objects / Optimizer: build the operator DAG
      • DAGScheduler: split the graph into stages of tasks; submit each stage as it becomes ready (agnostic to operators)
      • TaskScheduler: launch tasks via the cluster manager; retry failed or straggling tasks (doesn’t know about stages)
      • Worker: execute tasks; store and serve blocks (block manager, task threads)
      [Diagram flow: RDD Objects → DAG → DAGScheduler → TaskSet → TaskScheduler → Task → Worker, with a "stage failed" path back to the DAGScheduler]
  70. Spark Components
      Your program:
        sc = new SparkContext
        f = sc.textFile("…")
        f.filter(…)
         .count()
        ...
      • Spark driver/client (app master): RDD graph, scheduler, block tracker, shuffle tracker
      • Cluster manager
      • Spark worker(s): block manager, task threads
      • Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
  71. Dependency Types
      • “Narrow” deps: a stage pipeline that runs on the same node
        – map, filter
        – union
        – join with inputs co-partitioned
      • “Wide” (shuffle) deps: boundary of stages
        – groupByKey
        – join with inputs not co-partitioned
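The rule on slide 71, that wide (shuffle) dependencies mark stage boundaries while narrow ones are pipelined, can be sketched for the simple case of a linear chain of operations. This is a simplified illustration, not Spark's DAGScheduler: real jobs form a graph rather than a list, and the sketch treats every `join` as wide, which holds only when the inputs are not co-partitioned.

```python
# Operations assumed to need a shuffle in this sketch.
WIDE_OPS = {"groupByKey", "join"}


def split_into_stages(operations, wide_ops=WIDE_OPS):
    """Cut a linear chain of operations into stages at every wide dependency."""
    stages, current = [], []
    for op in operations:
        # A wide operation reads shuffled data, so it starts a new stage.
        if op in wide_ops and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


pipeline = ["map", "filter", "groupByKey", "map", "join", "filter"]
stages = split_into_stages(pipeline)
for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {' -> '.join(stage)}")
```

Within each stage the narrow operations can run back to back on the same partition without writing intermediate results, which is the pipelining the slide refers to.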
  73. Scheduler Optimizations
      • Pipelining within a stage (stage 2: map, union)
      • Stage 3: join algorithms based on partitioning (minimize shuffles)
      [Diagram: RDDs A to G connected by groupBy, map, union, and join across Stage 1, Stage 2, and Stage 3; legend marks previously computed partitions and tasks]
  74. Scheduler Optimizations
      • Conceptually
        – Stage 1: 3 tasks
        – Stage 2: 4 tasks
        – Stage 3: 3 tasks
        – Total: 3 stages, 10 tasks
      [Diagram: RDDs A to G connected by groupBy, map, union, and join across Stage 1, Stage 2, and Stage 3; legend marks previously computed partitions and tasks]
