SlideShare una empresa de Scribd logo
1 de 29
CS/NERSC Data Seminar, September 1, 2017
ArrayUDF: User-Defined Scientific Data
Analysis on Arrays
Bin Dong1, Kesheng Wu1, Surendra Byna1, Jialin Liu1
Weijie Zhao2, Florin Rusu1,2
1LBNL, Berkeley, CA
2UC Merced, Merced, CA
CS/NERSC Data Seminar, September 1, 2017
CS/NERSC Data Seminar, September 1, 2017
Clarion call for large scale data analysis in modern
scientific activities
Example: scientific projects for supernovae, dark matter/energy, etc.
Data source:
Rick White,
J. Hart,
R. Cutri,
Ian Foster,
C. J. Grillmair,
etc.
SIGMOD’16
SIGMOD’17
SIGMOD’14
CS/NERSC Data Seminar, September 1, 2017
Q1: How many data analysis operations are
being or will be developed ?
CS/NERSC Data Seminar, September 1, 2017
Avg Growth
Python modules 72/day
R packages 10/day
Java packages 108/day
Q1: How many data analysis operations are being
or will be developed ?
Implication from popular data analysis languages
*Data from http://www.modulecounts.com/ on Aug 30 2017
Large population
CS/NERSC Data Seminar, September 1, 2017
Q2: What are the functions of these data analysis
operations ?
Q1: How many data analysis operations are being
or will be developed ?
Large population
CS/NERSC Data Seminar, September 1, 2017
Q2: What are the functions of these data analysis
operations ?
Variety
Q1: How many data analysis operations are being
or will be developed ?
Large population
CS/NERSC Data Seminar, September 1, 2017
Two common methods to develop data analysis
operations with large population and variety
For each operation P Do
Develop P’s :
- Data management
- Expression execution
- Other components:
parallel,
communication
cache,
etc.
End For
Redundant
Diverse
Customized Solutions
✔
✗
✗Redundant
May lack expertise of the
underlying systems to
tune its performance
✗
UDF API
- Data management
- Generic exec. engine
- Other components:
parallel, comm.,
cache, etc.
Diverse
One single
& shared
✔
✔
Operation expression 1
User-defined Functions (UDF)
Professionally tuned
✔
CS/NERSC Data Seminar, September 1, 2017
UDF is at heart of modern big data system
Examples: MapReduce in Apache Hadoop and Spark
MAP()
MAP()
MAP()
reduce()
reduce()
UDF to generate
(key, value) pairs
UDF to merge
(key, values) pairs
shuffle/sort
Input
Data
Output
Data
CS/NERSC Data Seminar, September 1, 2017
MapReduce is not an optimal fit for scientific data
analysis
Reason 1: most scientific data are multi-dimensional arrays
Pictures Credit: Kyle Hemes,
Peter Nugent,
Suren Byna,
etc.
Converting array to (key, value) is expensive
because of explicitly handling coordinate
CS/NERSC Data Seminar, September 1, 2017
MapReduce is not an optimal fit for scientific data
analysis (continued)
Reason 1: most scientific data are multi-dimensional arrays
Reason 2: most scientific data analysis operations own
structure locality property
Structure locality:
The analysis operation on a single cell
accesses its neighborhood cells
Map deals with a single element at a time
Moving Average
2D Poisson Equation
Solver (Discrete)
Reduce requires to duplicate each cell
for all neighborhood cells (~x # of neighbors)
Reduce only happens after expensive shuffle
Converting array to (key, value) is expensive
CS/NERSC Data Seminar, September 1, 2017
ArrayUDF: user-defined scientific data analysis
on arrays
• Stencil-based user-defined function API
Structural locality aware array operations
• Native multidimensional array data model
In-situ data processing in scientific data formats, e.g., HDF5
• Optimal and automatic chunking and ghost zone
handling method
Fast large array processing in parallel & out-of-core manner
Processing
Element
Chunk Ghost zone
Storage
System
PE0
PE1
PE2
PE3
ArrayUDF
CS/NERSC Data Seminar, September 1, 2017
• Stencil (S) is a structure representation for a set of
neighborhood cells
Stencil-based UDF
s0,0 s0,1
s1,0
s1,-1
s0,-1
s-1,-1 s-1,0 s-1,1
s1,1
Materialized structure locality
Flexible UDF expression by manipulating
each neighborhood cell independently
2D Example:
- S has a center where computing happens
- The size of |S| is not fixed
- Notations for set member
stands for the cell at offset
from center point
d1,d2,···sd1,d2,···
i, j,···
CS/NERSC Data Seminar, September 1, 2017
Stencil-based UDF(continued)
• is an arbitrary user-defined function
-|S| = 1, user-defined function of a single cell
i.e., map in MapReduce
-|S| > 1, user-defined aggregation of a set of cells,
i.e., reduce in MapReduce
f Si, j,···( )® c'
i, j,···
A’A
f
CS/NERSC Data Seminar, September 1, 2017
Examples of using ArrayUDF
Tem_avg(Stencil t):
return (t(-30)+ … t(30))/60
Three steps by using ArrayUDF:
Example 1: moving average in time series data
Global temperature trend filtered by
moving average at 60 years’ interval
from 1908 to 2008
T.Apply(Tem_avg, T’)
Array T(“data location pointer”)
Step 1: Initialize data
Step 2: Define operation on Stencil
Step 3: Run & get result T’
CS/NERSC Data Seminar, September 1, 2017
Examples of using ArrayUDF (Continued)
Example 2: vorticity computation in fluid flow
VC_X(Stencil u):
return u(0,1)- u(0, -1)
VC_Y(Stencil v):
return v(1,0)- u(-1, 0)
V_X.Apply(VC_X, V_X’)
V_Y.Apply(VC_Y , V_Y’)
V_X’+V_Y’ as vorticity
Modeling renewable energy
Combustion engines
Pictures credit to: LANL, Frank Fritz Michael Milthaler, etc.
Array V_X(“data location pointer”)
Array V_y(“data location pointer”)
Step 1: Initialize data (2D example)
Step 2: Define operation on Stencil
Step 3: Run & get result
Three steps by using ArrayUDF :
CS/NERSC Data Seminar, September 1, 2017
Optimized performance of ArrayUDF
T_overall = T_I/O + T_computing + T_communication
minimized(T_I/O) Constant = 0 (avoided)
ArrayUDF
T_overall: overall time to run a data analysis operation
T_I/O: time of reading/writing data
T_computing: time of execute operation expression
T_communication: time of communications
CS/NERSC Data Seminar, September 1, 2017
ArrayUDF minimizes I/O cost via smart
chunking
• Factors considered in chunking : physical layout, logical shape, size, etc.
• Two chunking strategies:
- Layout unknown, i.e., average case of all possible layouts
- Select square shaped chunk to minimize ghost cells/chunk
- Layout known in advance, i.e., row-major, the most popular one
Select contiguous chunking to maximize I/O on contiguous cells,
including ghost cells
Cell Ghost Cell Chunk Row−major order
Chunk
Square Shape
Non−Square Shape
# of ghost cells = 12
# of ghost cells = 20
Ghost Cell
Chunk Cell
See theoretical
analysis in the
ArrayUDF paper
at HPDC’17
CS/NERSC Data Seminar, September 1, 2017
ArrayUDF dynamically builds ghost zone to
avoid communication
• What is ghost zone and why ?
- Ghost zone are extra cells surrounding a chunk
- Motivated by structure locality
• When to build ghost zone?
- Ghost zone is built when chunk is read from disk into
memory
• How to determine the size of ghost zone ?
User-defined
Trail-run: execute the UDF code on a special Stencil
instance to collect the offsets used by UDF
Size of ghost zone = maximum of collected offsets
CS/NERSC Data Seminar, September 1, 2017
Evaluations
• Hardware:
-Edison, a Cray XC30 supercomputer at NERSC
-5576 computing nodes, 24 cores/node, 64GB DDR3 Memory
• Software
- ArrayUDF - RasDaMan 9.5 (sequential version)
- Spark 1.5.0 - EXTASCID(hand-optimized version)
- SciDB 16.9 - Hand-optimized C/C++ code
• Workloads
- Two synthetic data sets (i.e., 2D and 3D) for micro benchmarks
 Window operators, chunking strategy, trail-run, etc.
- Four real scientific data sets (i.e., S3D, MSI , VPIC , CoRTAD)
 Overall performance tests /w generic UDF interface
CS/NERSC Data Seminar, September 1, 2017
Comparison with peer systems with
standard “window” operators
• “window” comes from SciDB and RasDaMan, where a
operator is applied to all window members uniformly
Average on
- window 2x2 for 2D
- window 2x2x2 for 3D
• ArrayUDF has close performance to hand-optimized code
• ArrayUDF is as much as 384X faster than peer systems
CS/NERSC Data Seminar, September 1, 2017
Comparison with Spark in real scientific
data analysis with generic UDF interface
Spark experiences
out-of-memory:
- large data size
- more local cells
S3D
Vorticity comp.
301GB
2 local cells/op.
MSI
Laplacian op.
21GB
4 local cells/op.
VPIC
Tri interpolation
36GB
8 local cells/op.
CoRTAD
Moving
average
225GB
4 local cells/op.
DataSize
# of local cells used by UDF
We observed
ArrayUDF is
2070X faster
CS/NERSC Data Seminar, September 1, 2017
ArrayUDF at NERSC
module load arrayudf/1.0
module show arrayudf/1.0
Examples:
https://bitbucket.org/arrayudf/
CS/NERSC Data Seminar, September 1, 2017
Conclusions
• ArrayUDF: User-defined scientific data analysis on arrays
-Stencil based UDF API for structural locality-aware operations
-Native array model & In-situ array processing in HDF5, etc.
-Auto & Optimal chunking and ghost zone methods for parallel or
out-of-core array processing
• ArrayUDF provides close performance to hand-optimized code
• ArrayUDF is as much as 2070X faster than Spark
• ArrayUDF is easy-to-use data analysis system
• ArrayUDF source code: https://bitbucket.org/arrayudf/
• Future work
-Python and other language interface (Done)
-more in-situ formats: NetCDF, PnetCDF, ADIOS, etc.
CS/NERSC Data Seminar, September 1, 2017
Acknowledgments
• Nicholas Chaimov from University of Oregon for suggestions to set
up Spark on Edison at NERSC
• Office of Advanced Scientific Computing Research, Office of
Science, U.S. Department of Energy, support for the SDS project
and a DOE Career award (Program manager: Dr. Lucy Nowell)
under contract number DE-AC02-05CH11231
• National Energy Research Scientific Computing Center
CS/NERSC Data Seminar, September 1, 2017
Thanks
Bin Dong
dbin@lbl.gov
http://crd.lbl.gov//dongbin
CS/NERSC Data Seminar, September 1, 2017
Backup Slides
CS/NERSC Data Seminar, September 1, 2017
Stencil-based computing model vs. others
Input Output UDF
SQL DBMS Tuple t Tuple t’ t’=f(t)
SciDB Cell c Cell c’ c’=f(c)
MapReduce KeyValue kv KeyValue kv’’ kv’=Map(kv)
kv’’=Reduce(kv’1, kv’2, …)
ArrayUDF Stencil s Cell c’ c’=f(s)
vs. MapReduce:
ArrayUDF generalizes map and reduce as a single operation
vs. SciDB:
SciDB has ‘window’, similar to Stencil. But, the ‘window’ usually
applies an operator uniformly on all cells involved. In ArrayUDF,
cell with stencil can be applied with different operator.
CS/NERSC Data Seminar, September 1, 2017
Chunking strategy evaluation
2D Dataset (100000, 100000)
Square chunk
(1K, 1K)
• Squared chunking (for average cases)
- minimize ghost cells # to reduce I/O cost
• Contiguous chunking (for row-major layout)
- maximize contiguous disk read
Ghost zone has
ignorable impact
CS/NERSC Data Seminar, September 1, 2017
Trail-run overhead
• Detect ghost zone size automatically
• Run the UDF on a single Stencil but the UDF might access
more neighborhood cells
Unit: microsecond
≈ 1 ms when 256 cells
are used in the UDF

Más contenido relacionado

La actualidad más candente

Elster falch-gpu-cse-sem-oct2013
Elster falch-gpu-cse-sem-oct2013Elster falch-gpu-cse-sem-oct2013
Elster falch-gpu-cse-sem-oct2013
Anne Elster
 
Autonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionAutonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisition
aimsnist
 
Slide 1
Slide 1Slide 1
Slide 1
butest
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
Anubhav Jain
 

La actualidad más candente (20)

Elster falch-gpu-cse-sem-oct2013
Elster falch-gpu-cse-sem-oct2013Elster falch-gpu-cse-sem-oct2013
Elster falch-gpu-cse-sem-oct2013
 
NERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie BardNERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie Bard
 
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis Presentation
 
Astronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache SparkAstronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache Spark
 
Autonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionAutonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisition
 
Earth Science Platform
Earth Science PlatformEarth Science Platform
Earth Science Platform
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech Projects
 
Processing Geospatial at Scale at LocationTech
Processing Geospatial at Scale at LocationTechProcessing Geospatial at Scale at LocationTech
Processing Geospatial at Scale at LocationTech
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model
 
“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...
“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...
“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...
 
Deploying a Task-based Runtime System on Raspberry Pi Clusters
Deploying a Task-based Runtime System on Raspberry Pi ClustersDeploying a Task-based Runtime System on Raspberry Pi Clusters
Deploying a Task-based Runtime System on Raspberry Pi Clusters
 
Deep galaxy classification of galaxies based on deep convolutional neural ne...
Deep galaxy  classification of galaxies based on deep convolutional neural ne...Deep galaxy  classification of galaxies based on deep convolutional neural ne...
Deep galaxy classification of galaxies based on deep convolutional neural ne...
 
Recent developments in HPX and Octo-Tiger
Recent developments in HPX and Octo-TigerRecent developments in HPX and Octo-Tiger
Recent developments in HPX and Octo-Tiger
 
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechGeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learning
 
Slide 1
Slide 1Slide 1
Slide 1
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Processing Geospatial Data At Scale @locationtech
Processing Geospatial Data At Scale @locationtechProcessing Geospatial Data At Scale @locationtech
Processing Geospatial Data At Scale @locationtech
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 

Similar a ArrayUDF: User-Defined Scientific Data Analysis on Arrays

Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...
Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...
Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...
Deltares
 
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictionsDeep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
Valery Tkachenko
 

Similar a ArrayUDF: User-Defined Scientific Data Analysis on Arrays (20)

The Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningThe Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource Provisioning
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
ADAPTER
ADAPTERADAPTER
ADAPTER
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilities
 
PhD defence
PhD defencePhD defence
PhD defence
 
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
 
Data Intensive Research with DISPEL
Data Intensive Research with DISPELData Intensive Research with DISPEL
Data Intensive Research with DISPEL
 
Visualising Research Graph using Neo4j and Gephi
Visualising Research Graph using Neo4j and GephiVisualising Research Graph using Neo4j and Gephi
Visualising Research Graph using Neo4j and Gephi
 
Distributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsDistributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasets
 
Nuclear emergency response and Big Data technologies
Nuclear emergency response and Big Data technologiesNuclear emergency response and Big Data technologies
Nuclear emergency response and Big Data technologies
 
HPC I/O for Computational Scientists
HPC I/O for Computational ScientistsHPC I/O for Computational Scientists
HPC I/O for Computational Scientists
 
L4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseL4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics Course
 
Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...
Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...
Dsd int 2014 - data science symposium - application 1 - point clouds, prof. p...
 
Integrated research data management in the Structural Sciences
Integrated research data management in the Structural SciencesIntegrated research data management in the Structural Sciences
Integrated research data management in the Structural Sciences
 
2019 IRIS-HEP AS workshop: Particles and decays
2019 IRIS-HEP AS workshop: Particles and decays2019 IRIS-HEP AS workshop: Particles and decays
2019 IRIS-HEP AS workshop: Particles and decays
 
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictionsDeep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
 
INC 2004: An Efficient Mechanism for Adaptive Resource Discovery in Grids
INC 2004: An Efficient Mechanism for Adaptive Resource Discovery in GridsINC 2004: An Efficient Mechanism for Adaptive Resource Discovery in Grids
INC 2004: An Efficient Mechanism for Adaptive Resource Discovery in Grids
 
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
 

Último

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 

Último (20)

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 

ArrayUDF: User-Defined Scientific Data Analysis on Arrays

  • 1. CS/NERSC Data Seminar, September 1, 2017 ArrayUDF: User-Defined Scientific Data Analysis on Arrays Bin Dong1, Kesheng Wu1, Surendra Byna1, Jialin Liu1 Weijie Zhao2, Florin Rusu1,2 1LBNL, Berkeley, CA 2UC Merced, Merced, CA CS/NERSC Data Seminar, September 1, 2017
  • 2. CS/NERSC Data Seminar, September 1, 2017 Clarion call for large scale data analysis in modern scientific activities Example: scientific projects for supernovae, dark matter/energy, etc. Data source: Rick White, J. Hart, R. Cutri, Ian Foster, C. J. Grillmair, etc. SIGMOD’16 SIGMOD’17 SIGMOD’14
  • 3. CS/NERSC Data Seminar, September 1, 2017 Q1: How many data analysis operations are being or will be developed ?
  • 4. CS/NERSC Data Seminar, September 1, 2017 Avg Growth Python modules 72/day R packages 10/day Java packages 108/day Q1: How many data analysis operations are being or will be developed ? Implication from popular data analysis languages *Data from http://www.modulecounts.com/ on Aug 30 2017 Large population
  • 5. CS/NERSC Data Seminar, September 1, 2017 Q2: What are the functions of these data analysis operations ? Q1: How many data analysis operations are being or will be developed ? Large population
  • 6. CS/NERSC Data Seminar, September 1, 2017 Q2: What are the functions of these data analysis operations ? Variety Q1: How many data analysis operations are being or will be developed ? Large population
  • 7. CS/NERSC Data Seminar, September 1, 2017 Two common methods to develop data analysis operations with large population and variety For each operation P Do Develop P’s : - Data management - Expression execution - Other components: parallel, communication cache, etc. End For Redundant Diverse Customized Solutions ✔ ✗ ✗Redundant May lack expertise of the underlying systems to tune its performance ✗ UDF API - Data management - Generic exec. engine - Other components: parallel, comm., cache, etc. Diverse One single & shared ✔ ✔ Operation expression 1 User-defined Functions (UDF) Professionally tuned ✔
  • 8. CS/NERSC Data Seminar, September 1, 2017 UDF is at heart of modern big data system Examples: MapReduce in Apache Hadoop and Spark MAP() MAP() MAP() reduce() reduce() UDF to generate (key, value) pairs UDF to merge (key, values) pairs shuffle/sort Input Data Output Data
  • 9. CS/NERSC Data Seminar, September 1, 2017 MapReduce is not an optimal fit for scientific data analysis Reason 1: most scientific data are multi-dimensional arrays Pictures Credit: Kyle Hemes, Peter Nugent, Suren Byna, etc. Converting array to (key, value) is expensive because of explicitly handling coordinate
  • 10. CS/NERSC Data Seminar, September 1, 2017 MapReduce is not an optimal fit for scientific data analysis (continued) Reason 1: most scientific data are multi-dimensional arrays Reason 2: most scientific data analysis operations own structure locality property Structure locality: The analysis operation on a single cell accesses its neighborhood cells Map deals with a single element at a time Moving Average 2D Poisson Equation Solver (Discrete) Reduce requires to duplicate each cell for all neighborhood cells (~x # of neighbors) Reduce only happens after expensive shuffle Converting array to (key, value) is expensive
  • 11. CS/NERSC Data Seminar, September 1, 2017 ArrayUDF: user-defined scientific data analysis on arrays • Stencil-based user-defined function API Structural locality aware array operations • Native multidimensional array data model In-situ data processing in scientific data formats, e.g., HDF5 • Optimal and automatic chunking and ghost zone handling method Fast large array processing in parallel & out-of-core manner Processing Element Chunk Ghost zone Storage System PE0 PE1 PE2 PE3 ArrayUDF
  • 12. CS/NERSC Data Seminar, September 1, 2017 • Stencil (S) is a structure representation for a set of neighborhood cells Stencil-based UDF s0,0 s0,1 s1,0 s1,-1 s0,-1 s-1,-1 s-1,0 s-1,1 s1,1 Materialized structure locality Flexible UDF expression by manipulating each neighborhood cell independently 2D Example: - S has a center where computing happens - The size of |S| is not fixed - Notations for set member stands for the cell at offset from center point d1,d2,···sd1,d2,··· i, j,···
  • 13. CS/NERSC Data Seminar, September 1, 2017 Stencil-based UDF(continued) • is an arbitrary user-defined function -|S| = 1, user-defined function of a single cell i.e., map in MapReduce -|S| > 1, user-defined aggregation of a set of cells, i.e., reduce in MapReduce f Si, j,···( )® c' i, j,··· A’A f
  • 14. CS/NERSC Data Seminar, September 1, 2017 Examples of using ArrayUDF Tem_avg(Stencil t): return (t(-30)+ … t(30))/60 Three steps by using ArrayUDF: Example 1: moving average in time series data Global temperature trend filtered by moving average at 60 years’ interval from 1908 to 2008 T.Apply(Tem_avg, T’) Array T(“data location pointer”) Step 1: Initialize data Step 2: Define operation on Stencil Step 3: Run & get result T’
  • 15. CS/NERSC Data Seminar, September 1, 2017 Examples of using ArrayUDF (Continued) Example 2: vorticity computation in fluid flow VC_X(Stencil u): return u(0,1)- u(0, -1) VC_Y(Stencil v): return v(1,0)- u(-1, 0) V_X.Apply(VC_X, V_X’) V_Y.Apply(VC_Y , V_Y’) V_X’+V_Y’ as vorticity Modeling renewable energy Combustion engines Pictures credit to: LANL, Frank Fritz Michael Milthaler, etc. Array V_X(“data location pointer”) Array V_y(“data location pointer”) Step 1: Initialize data (2D example) Step 2: Define operation on Stencil Step 3: Run & get result Three steps by using ArrayUDF :
  • 16. CS/NERSC Data Seminar, September 1, 2017 Optimized performance of ArrayUDF T_overall = T_I/O + T_computing + T_communication minimized(T_I/O) Constant = 0 (avoided) ArrayUDF T_overall: overall time to run a data analysis operation T_I/O: time of reading/writing data T_computing: time of execute operation expression T_communication: time of communications
  • 17. CS/NERSC Data Seminar, September 1, 2017 ArrayUDF minimizes I/O cost via smart chunking • Factors considered in chunking : physical layout, logical shape, size, etc. • Two chunking strategies: - Layout unknown, i.e., average case of all possible layouts - Select square shaped chunk to minimize ghost cells/chunk - Layout known in advance, i.e., row-major, the most popular one Select contiguous chunking to maximize I/O on contiguous cells, including ghost cells Cell Ghost Cell Chunk Row−major order Chunk Square Shape Non−Square Shape # of ghost cells = 12 # of ghost cells = 20 Ghost Cell Chunk Cell See theoretical analysis in the ArrayUDF paper at HPDC’17
  • 18. CS/NERSC Data Seminar, September 1, 2017 ArrayUDF dynamically builds ghost zone to avoid communication • What is ghost zone and why ? - Ghost zone are extra cells surrounding a chunk - Motivated by structure locality • When to build ghost zone? - Ghost zone is built when chunk is read from disk into memory • How to determine the size of ghost zone ? User-defined Trail-run: execute the UDF code on a special Stencil instance to collect the offsets used by UDF Size of ghost zone = maximum of collected offsets
  • 19. CS/NERSC Data Seminar, September 1, 2017 Evaluations • Hardware: -Edison, a Cray XC30 supercomputer at NERSC -5576 computing nodes, 24 cores/node, 64GB DDR3 Memory • Software - ArrayUDF - RasDaMan 9.5 (sequential version) - Spark 1.5.0 - EXTASCID(hand-optimized version) - SciDB 16.9 - Hand-optimized C/C++ code • Workloads - Two synthetic data sets (i.e., 2D and 3D) for micro benchmarks  Window operators, chunking strategy, trail-run, etc. - Four real scientific data sets (i.e., S3D, MSI , VPIC , CoRTAD)  Overall performance tests /w generic UDF interface
  • 20. CS/NERSC Data Seminar, September 1, 2017 Comparison with peer systems with standard “window” operators • “window” comes from SciDB and RasDaMan, where a operator is applied to all window members uniformly Average on - window 2x2 for 2D - window 2x2x2 for 3D • ArrayUDF has close performance to hand-optimized code • ArrayUDF is as much as 384X faster than peer systems
  • 21. CS/NERSC Data Seminar, September 1, 2017 Comparison with Spark in real scientific data analysis with generic UDF interface Spark experiences out-of-memory: - large data size - more local cells S3D Vorticity comp. 301GB 2 local cells/op. MSI Laplacian op. 21GB 4 local cells/op. VPIC Tri interpolation 36GB 8 local cells/op. CoRTAD Moving average 225GB 4 local cells/op. DataSize # of local cells used by UDF We observed ArrayUDF is 2070X faster
  • 22. CS/NERSC Data Seminar, September 1, 2017 ArrayUDF at NERSC module load arrayudf/1.0 module show arrayudf/1.0 Examples: https://bitbucket.org/arrayudf/
  • 23. CS/NERSC Data Seminar, September 1, 2017 Conclusions • ArrayUDF: User-defined scientific data analysis on arrays -Stencil based UDF API for structural locality-aware operations -Native array model & In-situ array processing in HDF5, etc. -Auto & Optimal chunking and ghost zone methods for parallel or out-of-core array processing • ArrayUDF provides close performance to hand-optimized code • ArrayUDF is as much as 2070X faster than Spark • ArrayUDF is easy-to-use data analysis system • ArrayUDF source code: https://bitbucket.org/arrayudf/ • Future work -Python and other language interface (Done) -more in-situ formats: NetCDF, PnetCDF, ADIOS, etc.
  • 24. CS/NERSC Data Seminar, September 1, 2017 Acknowledgments • Nicholas Chaimov from University of Oregon for suggestions to set up Spark on Edison at NERSC • Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, support for the SDS project and a DOE Career award (Program manager: Dr. Lucy Nowell) under contract number DE-AC02-05CH11231 • National Energy Research Scientific Computing Center
  • 25. CS/NERSC Data Seminar, September 1, 2017 Thanks Bin Dong dbin@lbl.gov http://crd.lbl.gov//dongbin
  • 26. CS/NERSC Data Seminar, September 1, 2017 Backup Slides
  • 27. CS/NERSC Data Seminar, September 1, 2017 Stencil-based computing model vs. others Input Output UDF SQL DBMS Tuple t Tuple t’ t’=f(t) SciDB Cell c Cell c’ c’=f(c) MapReduce KeyValue kv KeyValue kv’’ kv’=Map(kv) kv’’=Reduce(kv’1, kv’2, …) ArrayUDF Stencil s Cell c’ c’=f(s) vs. MapReduce: ArrayUDF generalizes map and reduce as a single operation vs. SciDB: SciDB has ‘window’, similar to Stencil. But, the ‘window’ usually applies an operator uniformly on all cells involved. In ArrayUDF, cell with stencil can be applied with different operator.
  • 28. CS/NERSC Data Seminar, September 1, 2017 Chunking strategy evaluation 2D Dataset (100000, 100000) Square chunk (1K, 1K) • Squared chunking (for average cases) - minimize ghost cells # to reduce I/O cost • Contiguous chunking (for row-major layout) - maximize contiguous disk read Ghost zone has ignorable impact
  • 29. CS/NERSC Data Seminar, September 1, 2017 Trail-run overhead • Detect ghost zone size automatically • Run the UDF on a single Stencil but the UDF might access more neighborhood cells Unit: microsecond ≈ 1 ms when 256 cells are used in the UDF

Notas del editor

  1. Let’s with start with this background slide, which says “”. This example comes from cosmology background, where different projects are dedicated to research about …. During these projects, tons of data will be analyzed. PTF: Palomar Transient Factory
  2. When I joined some of project here, I always ask myself … In order to accompany these different scientific projects and also large data set during these projects,
  3. To develop data analysis operations with large populations and variety, there are two common method,
  4. 1, http://fluxnet.fluxdata.org/2017/05/09/fluxnet2015-data-set-highlighted-eos/ Twitchell Island, in the Sacramento–San Joaquin River Delta in Sacramento County, Calif., is a wetland flux site (FLUXNET ID US-Twt) in the FLUXNET network. It provides data to FLUXNET2015, the latest iteration of the FLUXNET data sets, which include measurements of exchanges of carbon, water, and energy between land and atmosphere at sites all around the world. Credit: Kyle Hemes 2, http://c3.lbl.gov/nugent/talks/isi/isi_usc_2017.pptx 4, http://nilearn.github.io/building_blocks/manual_pipeline.html
  5. To summary, since MapReduce does directly not not support array and it also not good at supporting structure locality, MapReduce is not an optimal fit for scientific data analysis. In order to resolve these issues, we developed ArrayUDF.
  6. Velocity
  7. GAT