User-Defined Functions (UDF) allow application programmers to specify analysis operations on data, while leaving the data management and other non-trivial tasks to the system. This general approach is at the heart of the modern Big Data systems, such MapReduce/Spark and SciDB. However, a wide variety of common scientific data operations -- such as computing the moving average of a time series, the vorticity of a fluid flow, etc., -- are hard to express and slow to execute with these Big Data systems. In this talk, we will introduce a brand new Big Data system namely ArrayUDF (https://bitbucket.org/arrayudf/arrayudf) for scientific data sets, especially for multi-dimensional arrays. The ArrayUDF allows flexible expressiveness of UDF for scientific data analysis on the strength of their common character--structural locality. ArrayUDF executes the UDF directly on arrays stored in files, such as HDF5, without any data load overload. ArrayUDF's desi
gn and implementation considerations for parallel data processing on large-scale HPC will also be introduced. The performance tests on Edison at NERSC show that ArrayUDF is around 2000X faster than Spark on processing large scientific datasets.
Abortion pills in Jeddah | +966572737505 | Get Cytotec
ArrayUDF: User-Defined Scientific Data Analysis on Arrays
1. CS/NERSC Data Seminar, September 1, 2017
ArrayUDF: User-Defined Scientific Data
Analysis on Arrays
Bin Dong1, Kesheng Wu1, Surendra Byna1, Jialin Liu1
Weijie Zhao2, Florin Rusu1,2
1LBNL, Berkeley, CA
2UC Merced, Merced, CA
CS/NERSC Data Seminar, September 1, 2017
2. CS/NERSC Data Seminar, September 1, 2017
Clarion call for large scale data analysis in modern
scientific activities
Example: scientific projects for supernovae, dark matter/energy, etc.
Data source:
Rick White,
J. Hart,
R. Cutri,
Ian Foster,
C. J. Grillmair,
etc.
SIGMOD’16
SIGMOD’17
SIGMOD’14
3. CS/NERSC Data Seminar, September 1, 2017
Q1: How many data analysis operations are
being or will be developed ?
4. CS/NERSC Data Seminar, September 1, 2017
Avg Growth
Python modules 72/day
R packages 10/day
Java packages 108/day
Q1: How many data analysis operations are being
or will be developed ?
Implication from popular data analysis languages
*Data from http://www.modulecounts.com/ on Aug 30 2017
Large population
5. CS/NERSC Data Seminar, September 1, 2017
Q2: What are the functions of these data analysis
operations ?
Q1: How many data analysis operations are being
or will be developed ?
Large population
6. CS/NERSC Data Seminar, September 1, 2017
Q2: What are the functions of these data analysis
operations ?
Variety
Q1: How many data analysis operations are being
or will be developed ?
Large population
7. CS/NERSC Data Seminar, September 1, 2017
Two common methods to develop data analysis
operations with large population and variety
For each operation P Do
Develop P’s :
- Data management
- Expression execution
- Other components:
parallel,
communication
cache,
etc.
End For
Redundant
Diverse
Customized Solutions
✔
✗
✗Redundant
May lack expertise of the
underlying systems to
tune its performance
✗
UDF API
- Data management
- Generic exec. engine
- Other components:
parallel, comm.,
cache, etc.
Diverse
One single
& shared
✔
✔
Operation expression 1
User-defined Functions (UDF)
Professionally tuned
✔
8. CS/NERSC Data Seminar, September 1, 2017
UDF is at heart of modern big data system
Examples: MapReduce in Apache Hadoop and Spark
MAP()
MAP()
MAP()
reduce()
reduce()
UDF to generate
(key, value) pairs
UDF to merge
(key, values) pairs
shuffle/sort
Input
Data
Output
Data
9. CS/NERSC Data Seminar, September 1, 2017
MapReduce is not an optimal fit for scientific data
analysis
Reason 1: most scientific data are multi-dimensional arrays
Pictures Credit: Kyle Hemes,
Peter Nugent,
Suren Byna,
etc.
Converting array to (key, value) is expensive
because of explicitly handling coordinate
10. CS/NERSC Data Seminar, September 1, 2017
MapReduce is not an optimal fit for scientific data
analysis (continued)
Reason 1: most scientific data are multi-dimensional arrays
Reason 2: most scientific data analysis operations own
structure locality property
Structure locality:
The analysis operation on a single cell
accesses its neighborhood cells
Map deals with a single element at a time
Moving Average
2D Poisson Equation
Solver (Discrete)
Reduce requires to duplicate each cell
for all neighborhood cells (~x # of neighbors)
Reduce only happens after expensive shuffle
Converting array to (key, value) is expensive
11. CS/NERSC Data Seminar, September 1, 2017
ArrayUDF: user-defined scientific data analysis
on arrays
• Stencil-based user-defined function API
Structural locality aware array operations
• Native multidimensional array data model
In-situ data processing in scientific data formats, e.g., HDF5
• Optimal and automatic chunking and ghost zone
handling method
Fast large array processing in parallel & out-of-core manner
Processing
Element
Chunk Ghost zone
Storage
System
PE0
PE1
PE2
PE3
ArrayUDF
12. CS/NERSC Data Seminar, September 1, 2017
• Stencil (S) is a structure representation for a set of
neighborhood cells
Stencil-based UDF
s0,0 s0,1
s1,0
s1,-1
s0,-1
s-1,-1 s-1,0 s-1,1
s1,1
Materialized structure locality
Flexible UDF expression by manipulating
each neighborhood cell independently
2D Example:
- S has a center where computing happens
- The size of |S| is not fixed
- Notations for set member
stands for the cell at offset
from center point
d1,d2,···sd1,d2,···
i, j,···
13. CS/NERSC Data Seminar, September 1, 2017
Stencil-based UDF(continued)
• is an arbitrary user-defined function
-|S| = 1, user-defined function of a single cell
i.e., map in MapReduce
-|S| > 1, user-defined aggregation of a set of cells,
i.e., reduce in MapReduce
f Si, j,···( )® c'
i, j,···
A’A
f
14. CS/NERSC Data Seminar, September 1, 2017
Examples of using ArrayUDF
Tem_avg(Stencil t):
return (t(-30)+ … t(30))/60
Three steps by using ArrayUDF:
Example 1: moving average in time series data
Global temperature trend filtered by
moving average at 60 years’ interval
from 1908 to 2008
T.Apply(Tem_avg, T’)
Array T(“data location pointer”)
Step 1: Initialize data
Step 2: Define operation on Stencil
Step 3: Run & get result T’
15. CS/NERSC Data Seminar, September 1, 2017
Examples of using ArrayUDF (Continued)
Example 2: vorticity computation in fluid flow
VC_X(Stencil u):
return u(0,1)- u(0, -1)
VC_Y(Stencil v):
return v(1,0)- u(-1, 0)
V_X.Apply(VC_X, V_X’)
V_Y.Apply(VC_Y , V_Y’)
V_X’+V_Y’ as vorticity
Modeling renewable energy
Combustion engines
Pictures credit to: LANL, Frank Fritz Michael Milthaler, etc.
Array V_X(“data location pointer”)
Array V_y(“data location pointer”)
Step 1: Initialize data (2D example)
Step 2: Define operation on Stencil
Step 3: Run & get result
Three steps by using ArrayUDF :
16. CS/NERSC Data Seminar, September 1, 2017
Optimized performance of ArrayUDF
T_overall = T_I/O + T_computing + T_communication
minimized(T_I/O) Constant = 0 (avoided)
ArrayUDF
T_overall: overall time to run a data analysis operation
T_I/O: time of reading/writing data
T_computing: time of execute operation expression
T_communication: time of communications
17. CS/NERSC Data Seminar, September 1, 2017
ArrayUDF minimizes I/O cost via smart
chunking
• Factors considered in chunking : physical layout, logical shape, size, etc.
• Two chunking strategies:
- Layout unknown, i.e., average case of all possible layouts
- Select square shaped chunk to minimize ghost cells/chunk
- Layout known in advance, i.e., row-major, the most popular one
Select contiguous chunking to maximize I/O on contiguous cells,
including ghost cells
Cell Ghost Cell Chunk Row−major order
Chunk
Square Shape
Non−Square Shape
# of ghost cells = 12
# of ghost cells = 20
Ghost Cell
Chunk Cell
See theoretical
analysis in the
ArrayUDF paper
at HPDC’17
18. CS/NERSC Data Seminar, September 1, 2017
ArrayUDF dynamically builds ghost zone to
avoid communication
• What is ghost zone and why ?
- Ghost zone are extra cells surrounding a chunk
- Motivated by structure locality
• When to build ghost zone?
- Ghost zone is built when chunk is read from disk into
memory
• How to determine the size of ghost zone ?
User-defined
Trail-run: execute the UDF code on a special Stencil
instance to collect the offsets used by UDF
Size of ghost zone = maximum of collected offsets
19. CS/NERSC Data Seminar, September 1, 2017
Evaluations
• Hardware:
-Edison, a Cray XC30 supercomputer at NERSC
-5576 computing nodes, 24 cores/node, 64GB DDR3 Memory
• Software
- ArrayUDF - RasDaMan 9.5 (sequential version)
- Spark 1.5.0 - EXTASCID(hand-optimized version)
- SciDB 16.9 - Hand-optimized C/C++ code
• Workloads
- Two synthetic data sets (i.e., 2D and 3D) for micro benchmarks
Window operators, chunking strategy, trail-run, etc.
- Four real scientific data sets (i.e., S3D, MSI , VPIC , CoRTAD)
Overall performance tests /w generic UDF interface
20. CS/NERSC Data Seminar, September 1, 2017
Comparison with peer systems with
standard “window” operators
• “window” comes from SciDB and RasDaMan, where a
operator is applied to all window members uniformly
Average on
- window 2x2 for 2D
- window 2x2x2 for 3D
• ArrayUDF has close performance to hand-optimized code
• ArrayUDF is as much as 384X faster than peer systems
21. CS/NERSC Data Seminar, September 1, 2017
Comparison with Spark in real scientific
data analysis with generic UDF interface
Spark experiences
out-of-memory:
- large data size
- more local cells
S3D
Vorticity comp.
301GB
2 local cells/op.
MSI
Laplacian op.
21GB
4 local cells/op.
VPIC
Tri interpolation
36GB
8 local cells/op.
CoRTAD
Moving
average
225GB
4 local cells/op.
DataSize
# of local cells used by UDF
We observed
ArrayUDF is
2070X faster
22. CS/NERSC Data Seminar, September 1, 2017
ArrayUDF at NERSC
module load arrayudf/1.0
module show arrayudf/1.0
Examples:
https://bitbucket.org/arrayudf/
23. CS/NERSC Data Seminar, September 1, 2017
Conclusions
• ArrayUDF: User-defined scientific data analysis on arrays
-Stencil based UDF API for structural locality-aware operations
-Native array model & In-situ array processing in HDF5, etc.
-Auto & Optimal chunking and ghost zone methods for parallel or
out-of-core array processing
• ArrayUDF provides close performance to hand-optimized code
• ArrayUDF is as much as 2070X faster than Spark
• ArrayUDF is easy-to-use data analysis system
• ArrayUDF source code: https://bitbucket.org/arrayudf/
• Future work
-Python and other language interface (Done)
-more in-situ formats: NetCDF, PnetCDF, ADIOS, etc.
24. CS/NERSC Data Seminar, September 1, 2017
Acknowledgments
• Nicholas Chaimov from University of Oregon for suggestions to set
up Spark on Edison at NERSC
• Office of Advanced Scientific Computing Research, Office of
Science, U.S. Department of Energy, support for the SDS project
and a DOE Career award (Program manager: Dr. Lucy Nowell)
under contract number DE-AC02-05CH11231
• National Energy Research Scientific Computing Center
25. CS/NERSC Data Seminar, September 1, 2017
Thanks
Bin Dong
dbin@lbl.gov
http://crd.lbl.gov//dongbin
27. CS/NERSC Data Seminar, September 1, 2017
Stencil-based computing model vs. others
Input Output UDF
SQL DBMS Tuple t Tuple t’ t’=f(t)
SciDB Cell c Cell c’ c’=f(c)
MapReduce KeyValue kv KeyValue kv’’ kv’=Map(kv)
kv’’=Reduce(kv’1, kv’2, …)
ArrayUDF Stencil s Cell c’ c’=f(s)
vs. MapReduce:
ArrayUDF generalizes map and reduce as a single operation
vs. SciDB:
SciDB has ‘window’, similar to Stencil. But, the ‘window’ usually
applies an operator uniformly on all cells involved. In ArrayUDF,
cell with stencil can be applied with different operator.
28. CS/NERSC Data Seminar, September 1, 2017
Chunking strategy evaluation
2D Dataset (100000, 100000)
Square chunk
(1K, 1K)
• Squared chunking (for average cases)
- minimize ghost cells # to reduce I/O cost
• Contiguous chunking (for row-major layout)
- maximize contiguous disk read
Ghost zone has
ignorable impact
29. CS/NERSC Data Seminar, September 1, 2017
Trail-run overhead
• Detect ghost zone size automatically
• Run the UDF on a single Stencil but the UDF might access
more neighborhood cells
Unit: microsecond
≈ 1 ms when 256 cells
are used in the UDF
Notas del editor
Let’s with start with this background slide, which says “”.
This example comes from cosmology background, where different projects are dedicated to research about ….
During these projects, tons of data will be analyzed.
PTF: Palomar Transient Factory
When I joined some of project here, I always ask myself …
In order to accompany these different scientific projects and also large data set during these projects,
To develop data analysis operations with large populations and variety, there are two common method,
1, http://fluxnet.fluxdata.org/2017/05/09/fluxnet2015-data-set-highlighted-eos/
Twitchell Island, in the Sacramento–San Joaquin River Delta in Sacramento County, Calif., is a wetland flux site (FLUXNET ID US-Twt) in the FLUXNET network. It provides data to FLUXNET2015, the latest iteration of the FLUXNET data sets, which include measurements of exchanges of carbon, water, and energy between land and atmosphere at sites all around the world. Credit: Kyle Hemes
2, http://c3.lbl.gov/nugent/talks/isi/isi_usc_2017.pptx
4, http://nilearn.github.io/building_blocks/manual_pipeline.html
To summary, since MapReduce does directly not not support array and it also not good at supporting structure locality, MapReduce is not an optimal fit for scientific data analysis.
In order to resolve these issues, we developed ArrayUDF.