SlideShare una empresa de Scribd logo
1 de 16
Understanding Big Data
Applications and Architectures
1st JTC 1 SGBD Meeting
SDSC San Diego March 19 2014
Geoffrey Fox
Judy Qiu
Shantenu Jha (Rutgers)
gcf@indiana.edu
http://www.infomall.org
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
51 Detailed Use Cases: Contributed July-September 2013
Covers goals, data features such as 3 V’s, software, hardware
• http://bigdatawg.nist.gov/usecases.php
• https://bigdatacoursespring2014.appspot.com/course (Section 5)
• Government Operation(4): National Archives and Records Administration, Census Bureau
• Commercial(8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search,
Digital Materials, Cargo shipping (as in UPS)
• Defense(3): Sensors, Image surveillance, Situation Assessment
• Healthcare and Life Sciences(10): Medical records, Graph and Probabilistic analysis,
Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
• Deep Learning and Social Media(6): Driving Car, Geolocate images/cameras, Twitter, Crowd
Sourcing, Network Science, NIST benchmark datasets
• The Ecosystem for Research(4): Metadata, Collaboration, Language Translation, Light source
experiments
• Astronomy and Physics(5): Sky Surveys including comparison to simulation, Large Hadron
Collider at CERN, Belle Accelerator II in Japan
• Earth, Environmental and Polar Science(10): Radar Scattering in Atmosphere, Earthquake,
Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate
simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry
(microbes to watersheds), AmeriFlux and FLUXNET gas sensors
• Energy(1): Smart grid 2
26 Features for each use case
Would like to capture “essence of
these use cases”
“small” kernels, mini-apps
Or Classify applications into patterns
Do it from HPC background not database view
point
i.e. focus on cases with detailed analytics
What are “mini-Applications”
• Use for benchmarks of computers and software (is my
parallel compiler any good?)
• In parallel computing, this is well established
– Linpack for measuring performance to rank machines in Top500
(changing?)
– NAS Parallel Benchmarks (originally a pencil and paper
specification to allow optimal implementations; then MPI library)
– Other specialized Benchmark sets keep changing and used to
guide procurements
• Last 2 NSF hardware solicitations had NO preset benchmarks –
perhaps as no agreement on key applications for clouds and
data intensive applications
– Berkeley dwarfs capture different structures that any approach
to parallel computing must address
– Templates used to capture parallel computing patterns
• I’ll let experts comment on database benchmarks like TPC
HPC Benchmark Classics
• Linpack or HPL: Parallel LU factorization for solution of
linear equations
• NPB version 1: Mainly classic HPC solver kernels
– MG: Multigrid
– CG: Conjugate Gradient
– FT: Fast Fourier Transform
– IS: Integer sort
– EP: Embarrassingly Parallel
– BT: Block Tridiagonal
– SP: Scalar Pentadiagonal
– LU: Lower-Upper symmetric Gauss Seidel
7 Original Berkeley Dwarfs (Colella)
1. Structured Grids (including locally structured
grids, e.g. Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
8. Note “vaguer” than NPB
13 Berkeley Dwarfs
• Dense Linear Algebra
• Sparse Linear Algebra
• Spectral Methods
• N-Body Methods
• Structured Grids
• Unstructured Grids
• MapReduce
• Combinational Logic
• Graph Traversal
• Dynamic Programming
• Backtrack and Branch-and-Bound
• Graphical Models
• Finite State Machines
First 6 of these correspond to
Colella’s original.
Monte Carlo dropped
N-body methods are a subset of
Particle
Note a little inconsistent in that
MapReduce is a programming
model and spectral method is a
numerical method
Need multiple facets!
Distributed Computing MetaPatterns I
Jha, Cole, Katz, Parashar, Rana, Weissman
Distributed Computing MetaPatterns II
Jha, Cole, Katz, Parashar, Rana, Weissman
Distributed Computing MetaPatterns III
Jha, Cole, Katz, Parashar, Rana, Weissman
Problem Architecture Facet of Ogres (Meta or
MacroPattern)
i. Pleasingly Parallel – as in Blast, Protein docking, imagery
ii. Local Analytics or Machine Learning – ML or filtering
pleasingly parallel as in bio-imagery, radar images (really
just pleasingly parallel but sophisticated local analytics)
iii. Global Analytics or Machine Learning seen in LDA,
Clustering etc. with parallel ML over nodes of system
iv. SPMD (Single Program Multiple Data)
v. Bulk Synchronous Processing: well defined compute-
communication phases
vi. Fusion: Knowledge discovery often involves fusion of
multiple methods.
vii. Workflow (often used in fusion)
Core Analytics Facet of Ogres (microPattern)
i. Search/Query
ii. Local Machine Learning – pleasingly parallel
iii. Summarizing statistics
iv. Recommender Systems (Collaborative Filtering)
v. Outlier Detection (iORCA)
vi. Clustering (many methods),
vii. LDA (Latent Dirichlet Allocation) or variants like PLSI (Probabilistic
Latent Semantic Indexing),
viii. SVM and Linear Classifiers (Bayes, Random Forests),
ix. PageRank, (Find leading eigenvector of sparse matrix)
x. SVD (Singular Value Decomposition),
xi. Learning Neural Networks (Deep Learning),
xii. MDS (Multidimensional Scaling),
xiii. Graph Structure Algorithms (seen in search of RDF Triple stores),
xiv. Network Dynamics - Graph simulation Algorithms (epidemiology)
Matrix
Algebra
Global
Optimization
Analytics Features Facet of Ogres
• These core analytics/kernels can be classified by features
like
• (a) Flops per byte;
• (b) Communication Interconnect requirements;
• (c) Is application (graph) constant or dynamic
• (d) Is communication BSP or Asynchronous
• (e) Are algorithms Iterative or not?
• (f) Are data points in metric or non-metric spaces
Application Class Facet of Ogres
• (a) Search and query
• (b) Maximum Likelihood,
• (c) 2 minimizations,
• (d) Expectation Maximization (often Steepest descent)
• (e) Global Optimization (Variational Bayes)
• (f) Agents, as in epidemiology (swarm approaches)
• (g) GIS (Geographical Information Systems).
Data Source Facet of Ogres
• (i) SQL,
• (ii) NOSQL based,
• (iii) Other Enterprise data systems (10 examples from Bob Marcus)
• (iv) Set of Files (as managed in iRODS),
• (v) Internet of Things,
• (vi) Streaming and
• (vii) HPC simulations.
• Before data gets to compute system, there is often an initial data
gathering phase which is characterized by a block size and timing. Block
size varies from month (Remote Sensing, Seismic) to day (genomic) to
seconds or lower (Real time control, streaming)
• There are storage/compute system styles: Shared, Dedicated,
Permanent, Transient
• Other characteristics are need for permanent auxiliary/comparison
datasets and these could be interdisciplinary implying nontrivial data
movement/replication
Lessons / Insights
• Ogres classify Big Data applications by multiple
facets – each with several exemplars and features
– Guide to breadth and depth of Big Data
– Does your architecture/software support all the ogres?
• Add database exemplars
• In parallel computing, the simple analytic kernels
dominate mindshare even though agreed limited
• Section 5 of my class
https://bigdatacoursespring2014.appspot.com/preview
classifies 51 use cases with these facets

Más contenido relacionado

La actualidad más candente

HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...Geoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Geoffrey Fox
 
Big Data HPC Convergence
Big Data HPC ConvergenceBig Data HPC Convergence
Big Data HPC ConvergenceGeoffrey Fox
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming DataGeoffrey Fox
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Geoffrey Fox
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...inside-BigData.com
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsGeoffrey Fox
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representationsMarco Quartulli
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersSaliya Ekanayake
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop siliconsudipt
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big DataLewis Crawford
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirSpark Summit
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...MLconf
 
Scientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitScientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitGanesan Narayanasamy
 
Chapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 

La actualidad más candente (20)

HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
 
Big Data HPC Convergence
Big Data HPC ConvergenceBig Data HPC Convergence
Big Data HPC Convergence
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel 
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representations
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
Scientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitScientific Application Development and Early results on Summit
Scientific Application Development and Early results on Summit
 
Chapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.1 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 

Similar a Classification of Big Data Use Cases by different Facets

High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data Geoffrey Fox
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilitiesIan Foster
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-Systeminside-BigData.com
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
Scalability20140226
Scalability20140226Scalability20140226
Scalability20140226Nick Kypreos
 
On the-design-of-geographic-information-system-procedures
On the-design-of-geographic-information-system-proceduresOn the-design-of-geographic-information-system-procedures
On the-design-of-geographic-information-system-proceduresArmando Guevara
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC council
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_publicAttila Barta
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsIJMER
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processingjins0618
 
How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1wang yaohui
 
Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper PresentationShubham Singh
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 

Similar a Classification of Big Data Use Cases by different Facets (20)

High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilities
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Scalability20140226
Scalability20140226Scalability20140226
Scalability20140226
 
On the-design-of-geographic-information-system-procedures
On the-design-of-geographic-information-system-proceduresOn the-design-of-geographic-information-system-procedures
On the-design-of-geographic-information-system-procedures
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and Applications
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
 
How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1
 
Big Data and IOT
Big Data and IOTBig Data and IOT
Big Data and IOT
 
Term Paper Presentation
Term Paper PresentationTerm Paper Presentation
Term Paper Presentation
 
Ijariie1184
Ijariie1184Ijariie1184
Ijariie1184
 
Ijariie1184
Ijariie1184Ijariie1184
Ijariie1184
 
BigData
BigDataBigData
BigData
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 

Más de Geoffrey Fox

AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...Geoffrey Fox
 
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Geoffrey Fox
 
Data Science and Online Education
Data Science and Online EducationData Science and Online Education
Data Science and Online EducationGeoffrey Fox
 
Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...Geoffrey Fox
 
Data Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityData Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityGeoffrey Fox
 
Experience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC TechnologyExperience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC TechnologyGeoffrey Fox
 
Big Data and Clouds: Research and Education
Big Data and Clouds: Research and EducationBig Data and Clouds: Research and Education
Big Data and Clouds: Research and EducationGeoffrey Fox
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Geoffrey Fox
 
FutureGrid Computing Testbed as a Service
 FutureGrid Computing Testbed as a Service FutureGrid Computing Testbed as a Service
FutureGrid Computing Testbed as a ServiceGeoffrey Fox
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Geoffrey Fox
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGGeoffrey Fox
 
Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Geoffrey Fox
 
CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2Geoffrey Fox
 
CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1Geoffrey Fox
 

Más de Geoffrey Fox (15)

AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
 
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
 
Data Science and Online Education
Data Science and Online EducationData Science and Online Education
Data Science and Online Education
 
Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...
 
Data Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityData Science Curriculum at Indiana University
Data Science Curriculum at Indiana University
 
Experience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC TechnologyExperience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC Technology
 
Big Data and Clouds: Research and Education
Big Data and Clouds: Research and EducationBig Data and Clouds: Research and Education
Big Data and Clouds: Research and Education
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
Remarks on MOOC's
Remarks on MOOC'sRemarks on MOOC's
Remarks on MOOC's
 
FutureGrid Computing Testbed as a Service
 FutureGrid Computing Testbed as a Service FutureGrid Computing Testbed as a Service
FutureGrid Computing Testbed as a Service
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWG
 
Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore
 
CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2
 
CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1
 

Último

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Último (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Classification of Big Data Use Cases by different Facets

  • 1. Understanding Big Data Applications and Architectures 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Geoffrey Fox Judy Qiu Shantenu Jha (Rutgers) gcf@indiana.edu http://www.infomall.org School of Informatics and Computing Digital Science Center Indiana University Bloomington
  • 2. 51 Detailed Use Cases: Contributed July-September 2013 Covers goals, data features such as 3 V’s, software, hardware • http://bigdatawg.nist.gov/usecases.php • https://bigdatacoursespring2014.appspot.com/course (Section 5) • Government Operation(4): National Archives and Records Administration, Census Bureau • Commercial(8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS) • Defense(3): Sensors, Image surveillance, Situation Assessment • Healthcare and Life Sciences(10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity • Deep Learning and Social Media(6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets • The Ecosystem for Research(4): Metadata, Collaboration, Language Translation, Light source experiments • Astronomy and Physics(5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle Accelerator II in Japan • Earth, Environmental and Polar Science(10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors • Energy(1): Smart grid 2 26 Features for each use case
  • 3. Would like to capture “essence of these use cases” “small” kernels, mini-apps Or Classify applications into patterns Do it from HPC background not database view point i.e. focus on cases with detailed analytics
  • 4. What are “mini-Applications” • Use for benchmarks of computers and software (is my parallel compiler any good?) • In parallel computing, this is well established – Linpack for measuring performance to rank machines in Top500 (changing?) – NAS Parallel Benchmarks (originally a pencil and paper specification to allow optimal implementations; then MPI library) – Other specialized Benchmark sets keep changing and used to guide procurements • Last 2 NSF hardware solicitations had NO preset benchmarks – perhaps as no agreement on key applications for clouds and data intensive applications – Berkeley dwarfs capture different structures that any approach to parallel computing must address – Templates used to capture parallel computing patterns • I’ll let experts comment on database benchmarks like TPC
  • 5. HPC Benchmark Classics • Linpack or HPL: Parallel LU factorization for solution of linear equations • NPB version 1: Mainly classic HPC solver kernels – MG: Multigrid – CG: Conjugate Gradient – FT: Fast Fourier Transform – IS: Integer sort – EP: Embarrassingly Parallel – BT: Block Tridiagonal – SP: Scalar Pentadiagonal – LU: Lower-Upper symmetric Gauss Seidel
  • 6. 7 Original Berkeley Dwarfs (Colella) 1. Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement) 2. Unstructured Grids 3. Fast Fourier Transform 4. Dense Linear Algebra 5. Sparse Linear Algebra 6. Particles 7. Monte Carlo 8. Note “vaguer” than NPB
  • 7. 13 Berkeley Dwarfs • Dense Linear Algebra • Sparse Linear Algebra • Spectral Methods • N-Body Methods • Structured Grids • Unstructured Grids • MapReduce • Combinational Logic • Graph Traversal • Dynamic Programming • Backtrack and Branch-and-Bound • Graphical Models • Finite State Machines First 6 of these correspond to Colella’s original. Monte Carlo dropped N-body methods are a subset of Particle Note a little inconsistent in that MapReduce is a programming model and spectral method is a numerical method Need multiple facets!
  • 8. Distributed Computing MetaPatterns I Jha, Cole, Katz, Parashar, Rana, Weissman
  • 9. Distributed Computing MetaPatterns II Jha, Cole, Katz, Parashar, Rana, Weissman
  • 10. Distributed Computing MetaPatterns III Jha, Cole, Katz, Parashar, Rana, Weissman
  • 11. Problem Architecture Facet of Ogres (Meta or MacroPattern) i. Pleasingly Parallel – as in Blast, Protein docking, imagery ii. Local Analytics or Machine Learning – ML or filtering pleasingly parallel as in bio-imagery, radar images (really just pleasingly parallel but sophisticated local analytics) iii. Global Analytics or Machine Learning seen in LDA, Clustering etc. with parallel ML over nodes of system iv. SPMD (Single Program Multiple Data) v. Bulk Synchronous Processing: well defined compute- communication phases vi. Fusion: Knowledge discovery often involves fusion of multiple methods. vii. Workflow (often used in fusion)
  • 12. Core Analytics Facet of Ogres (microPattern) i. Search/Query ii. Local Machine Learning – pleasingly parallel iii. Summarizing statistics iv. Recommender Systems (Collaborative Filtering) v. Outlier Detection (iORCA) vi. Clustering (many methods), vii. LDA (Latent Dirichlet Allocation) or variants like PLSI (Probabilistic Latent Semantic Indexing), viii. SVM and Linear Classifiers (Bayes, Random Forests), ix. PageRank, (Find leading eigenvector of sparse matrix) x. SVD (Singular Value Decomposition), xi. Learning Neural Networks (Deep Learning), xii. MDS (Multidimensional Scaling), xiii. Graph Structure Algorithms (seen in search of RDF Triple stores), xiv. Network Dynamics - Graph simulation Algorithms (epidemiology) Matrix Algebra Global Optimization
  • 13. Analytics Features Facet of Ogres • These core analytics/kernels can be classified by features like • (a) Flops per byte; • (b) Communication Interconnect requirements; • (c) Is application (graph) constant or dynamic • (d) Is communication BSP or Asynchronous • (e) Are algorithms Iterative or not? • (f) Are data points in metric or non-metric spaces
  • 14. Application Class Facet of Ogres • (a) Search and query • (b) Maximum Likelihood, • (c) 2 minimizations, • (d) Expectation Maximization (often Steepest descent) • (e) Global Optimization (Variational Bayes) • (f) Agents, as in epidemiology (swarm approaches) • (g) GIS (Geographical Information Systems).
  • 15. Data Source Facet of Ogres • (i) SQL, • (ii) NOSQL based, • (iii) Other Enterprise data systems (10 examples from Bob Marcus) • (iv) Set of Files (as managed in iRODS), • (v) Internet of Things, • (vi) Streaming and • (vii) HPC simulations. • Before data gets to compute system, there is often an initial data gathering phase which is characterized by a block size and timing. Block size varies from month (Remote Sensing, Seismic) to day (genomic) to seconds or lower (Real time control, streaming) • There are storage/compute system styles: Shared, Dedicated, Permanent, Transient • Other characteristics are need for permanent auxiliary/comparison datasets and these could be interdisciplinary implying nontrivial data movement/replication
  • 16. Lessons / Insights • Ogres classify Big Data applications by multiple facets – each with several exemplars and features – Guide to breadth and depth of Big Data – Does your architecture/software support all the ogres? • Add database exemplars • In parallel computing, the simple analytic kernels dominate mindshare even though agreed limited • Section 5 of my class https://bigdatacoursespring2014.appspot.com/preview classifies 51 use cases with these facets

Notas del editor

  1. Big dwarfs are OgresImplement Ogres in ABDS+
  2. Big dwarfs are OgresImplement Ogres in ABDS+