Big Data Processing in the Cloud: a Hydra/Sufia Experience
Zhiwu Xie, Ph.D., Associate Professor and Technology Development Librarian, Center for Digital Research and Scholarship University Libraries, Virginia Tech
2. Data are abundant; (re)usable data are scarce.
- Jane Silverthorne, Deputy Assistant Director, NSF
3. Outline
• Data, big data, and the library
• Why should the library process big data? A case
study: SEB
• How is this different? Some characterizations.
• How can cloud computing help?
4. Data, Big Data, and the Library
• Research data as a first-class citizen
• Data management mandates, OSTP memorandum, …
• Research libraries’ roles
• data consultancy, DMP Tools, institutional repositories,
DPN, SHARE, …
• From data to big data
5. The Big Data Long Tails
• Small number of large data sets as the head; large
number of smaller, messier data sets as the tail
• Small number of “grand-challenge” related data sets
as the head; large number of modestly funded
research projects that produce large data sets as the tail
6. The Libraries’ Role
• Large data sets and grand challenges -> National
infrastructure, e.g., XSEDE
• Smaller, messier -> Disciplinary repositories,
institutional repositories
• What about the large data sets that are not initially
recognized as grand-challenge related?
• Much easier to produce large data sets today
• Their value must not be underestimated
• Most institutional repositories limit the size of a deposit, e.g., to
10 GB
12. Data Sharing
• Encourage exploratory and multidisciplinary research
• Foster open and inclusive communities around
• modeling of dynamic systems
• structural health monitoring and damage detection
• occupancy studies
• sensor evaluation
• data fusion
• energy reduction
• evacuation management
• …
14. Compute Intensive
• About 6GB raw data per hour
• Must be continuously processed, ingested, and
further processed
• User-generated computations
• Must not interfere with data retrieval
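One way to meet the "must not interfere with data retrieval" requirement is to decouple continuous ingest from user-facing reads with a work queue drained by a background worker. The sketch below is illustrative only, not the SEB implementation; `process_chunk`, the chunk layout, and the 1 GB chunk size are assumptions.

```python
import queue
import threading

RAW_RATE_GB_PER_HOUR = 6  # raw-data rate cited on the slide


def process_chunk(chunk):
    """Placeholder for format conversion, filtering, metadata extraction."""
    return {"id": chunk["id"], "bytes": chunk["bytes"], "status": "processed"}


def ingest_worker(work_queue, results):
    """Drain the queue in the background, off the retrieval path."""
    while True:
        chunk = work_queue.get()
        if chunk is None:  # sentinel: shut down
            work_queue.task_done()
            break
        results.append(process_chunk(chunk))
        work_queue.task_done()


work_queue = queue.Queue()
results = []
worker = threading.Thread(target=ingest_worker, args=(work_queue, results))
worker.start()

# Simulate one hour of raw data arriving as 1 GB chunks.
for i in range(RAW_RATE_GB_PER_HOUR):
    work_queue.put({"id": i, "bytes": 1_000_000_000})
work_queue.put(None)
worker.join()

print(len(results))  # 6 chunks processed
```

In a real deployment the queue would be a durable broker and the worker pool would scale with load; the point here is only the separation of ingest from retrieval.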
15. Storage Intensive
• SEB will accumulate about 60TB of raw data per year
• To facilitate research on long-term effects, we must
keep raw data for an extended period of time, e.g.,
>= 5 years
• VT currently does not have an affordable storage
facility to hold this much data
• Within XSEDE, only TACC’s Ranch can allocate this
much long-term storage
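A back-of-the-envelope check connects the two figures on these slides: at the ~6 GB/hour rate quoted earlier, a year of continuous collection comes to roughly 53 TB, consistent with the "about 60 TB per year" estimate once overhead and growth are allowed for. The retention arithmetic is a sketch, assuming decimal terabytes.

```python
# Rough storage projection from the slide's figures (decimal TB assumed).
GB_PER_HOUR = 6
HOURS_PER_YEAR = 24 * 365
RETENTION_YEARS = 5  # ">= 5 years" retention target from the slide

tb_per_year = GB_PER_HOUR * HOURS_PER_YEAR / 1000  # ~52.6 TB/year
tb_retained = tb_per_year * RETENTION_YEARS        # ~263 TB over 5 years

print(round(tb_per_year, 1), round(tb_retained, 1))
```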
16. (Potentially) Bandwidth Intensive
• What if hundreds of researchers around the world
each tried to download hundreds of TB of our data?
17. On Demand
• Exploratory and multidisciplinary research cannot
predict data usage a priori
18. Scalability
• How to deal with these challenges in a scalable
manner?
19. Big Data + Cloud
• Affordable
• Elastic
• Scalable
20. A Data Reuse Scenario
• Validate cross-disciplinary research hypotheses, e.g.,
find a novel vibration pattern across all SEB data
• Lower the reuse barrier: must not require users to
invest heavily in infrastructure before the initial
analysis
• No need to move data around
• Can perform user-specified initial data filtering and analysis
• Initial analysis results may be used to enrich the metadata
and facilitate further discovery
• Cost sharing
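The "user-specified initial data filtering" idea in this scenario can be sketched as shipping a small predicate to the data rather than shipping terabytes to the user. Everything below is hypothetical: `run_user_filter`, the chunk layout, and the `peak_hz` threshold are invented for illustration and are not part of the SEB system.

```python
# Hypothetical server-side filtering: a researcher submits a small
# predicate; the repository applies it near the data and returns
# only the matching chunks (which could also enrich the metadata).


def run_user_filter(chunks, predicate):
    """Apply a user-supplied predicate close to the data; return matches."""
    return [c for c in chunks if predicate(c)]


# Toy stand-in for SEB accelerometer readings.
chunks = [
    {"sensor": "accel-01", "peak_hz": 2.1},
    {"sensor": "accel-02", "peak_hz": 17.4},
    {"sensor": "accel-03", "peak_hz": 2.3},
]

# An illustrative "novel vibration pattern" test supplied by the user.
novel = run_user_filter(chunks, lambda c: c["peak_hz"] > 10)
print([c["sensor"] for c in novel])  # ['accel-02']
```

Because only the filtered result leaves the repository, the bandwidth and cost-sharing concerns raised earlier shrink accordingly.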
Today I will present a data repository project that the Virginia Tech Libraries are currently involved in, one that may alleviate some of the reuse pain.
We are developing a technology prototype that targets a slightly different niche from the traditional comfort zone of library development.
Nonetheless, the work is closely related to what academic and research libraries have long been doing: archiving research outcomes and making them available.
The library community certainly has a lot to contribute, but to do it well we may need to pick up a few new tricks, for example, cloud computing.
The presentation will cover the following ground:
First I’ll give you a general idea of data and big data, and in what capacity the libraries have been involved so far.
I’ll then describe the new niche that our development is trying to explore. This is done by describing the Virginia Tech Signature Engineering Building project, or the SEB project.
However, this is not a project briefing. Instead, I believe the problem we are trying to tackle is a general one that can potentially become a growth point for libraries. I will therefore extract the key requirements of the SEB project and generalize them as the characteristics of this new niche. I will then describe how this niche differs from the traditional repository work we are already familiar with, and how cloud computing can help.
I will not cover many implementation details; this is more of a high-level, conceptual overview. My co-author Collin Brittle presented some of the implementation details at this year's Open Repositories conference in Helsinki, Finland. His presentation video is online in case you are interested.