Big Data Processing in the Cloud: a Hydra/Sufia Experience
Zhiwu Xie, Ph.D., Associate Professor and Technology Development Librarian, Center for Digital Research and Scholarship University Libraries, Virginia Tech
2. Data are abundant; (re)usable data are scarce.
- Jane Silverthorne, Deputy Assistant Director, NSF
3. Outline
• Data, big data, and the library
• Why should the library process big data? A case
study: SEB
• How is this different? Some characterizations.
• How can cloud computing help?
4. Data, Big Data, and the Library
• Research data as a first-class citizen
• Data management mandates, OSTP memorandum, …
• Research libraries’ roles
• data consultancy, DMP Tools, institutional repositories,
DPN, SHARE, …
• From data to big data
5. The Big Data Long Tails
• Small number of large data sets as the head; large
number of smaller, messier data sets as the tail
• Small number of “grand-challenge” related data sets
as the head; large number of modestly funded
research projects that produce large data sets as the tail
6. The Libraries’ Role
• Large data sets and grand challenges -> National
infrastructure, e.g., XSEDE
• Smaller, messier -> Disciplinary repositories,
institutional repositories
• What about the large data sets that are not initially
recognized as grand-challenge related?
• Much easier to produce large data sets today
• Their value must not be underestimated
• Most institutional repositories limit the size of a deposit, e.g., to
10 GB
12. Data Sharing
• Encourage exploratory and multidisciplinary research
• Foster open and inclusive communities around
• modeling of dynamic systems
• structural health monitoring and damage detection
• occupancy studies
• sensor evaluation
• data fusion
• energy reduction
• evacuation management
• …
14. Compute Intensive
• About 6GB raw data per hour
• Must be continuously processed, ingested, and
further processed
• User-generated computations
• Must not interfere with data retrieval
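One way to meet the "must not interfere with data retrieval" requirement is to decouple continuous ingest from user-facing reads with a work queue drained by a background worker. The sketch below is illustrative only, not the SEB implementation; `process_chunk`, the chunk layout, and the 1 GB chunk size are assumptions.

```python
import queue
import threading

RAW_RATE_GB_PER_HOUR = 6  # raw-data rate cited on the slide


def process_chunk(chunk):
    """Placeholder for format conversion, filtering, metadata extraction."""
    return {"id": chunk["id"], "bytes": chunk["bytes"], "status": "processed"}


def ingest_worker(work_queue, results):
    """Drain the queue in the background, off the retrieval path."""
    while True:
        chunk = work_queue.get()
        if chunk is None:  # sentinel: shut down
            work_queue.task_done()
            break
        results.append(process_chunk(chunk))
        work_queue.task_done()


work_queue = queue.Queue()
results = []
worker = threading.Thread(target=ingest_worker, args=(work_queue, results))
worker.start()

# Simulate one hour of raw data arriving as 1 GB chunks.
for i in range(RAW_RATE_GB_PER_HOUR):
    work_queue.put({"id": i, "bytes": 1_000_000_000})
work_queue.put(None)
worker.join()

print(len(results))  # 6 chunks processed
```

In a real deployment the queue would be a durable broker and the worker pool would scale with load; the point here is only the separation of ingest from retrieval.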
15. Storage Intensive
• SEB will accumulate about 60TB of raw data per year
• To facilitate research on long-term effects, we must
keep raw data for an extended period of time, e.g.,
>= 5 years
• VT currently does not have an affordable storage
facility to hold this much data
• Within XSEDE, only TACC’s Ranch can allocate this
much long-term storage
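A back-of-the-envelope check connects the two figures on these slides: at the ~6 GB/hour rate quoted earlier, a year of continuous collection comes to roughly 53 TB, consistent with the "about 60 TB per year" estimate once overhead and growth are allowed for. The retention arithmetic is a sketch, assuming decimal terabytes.

```python
# Rough storage projection from the slide's figures (decimal TB assumed).
GB_PER_HOUR = 6
HOURS_PER_YEAR = 24 * 365
RETENTION_YEARS = 5  # ">= 5 years" retention target from the slide

tb_per_year = GB_PER_HOUR * HOURS_PER_YEAR / 1000  # ~52.6 TB/year
tb_retained = tb_per_year * RETENTION_YEARS        # ~263 TB over 5 years

print(round(tb_per_year, 1), round(tb_retained, 1))
```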
16. (Potentially) Bandwidth Intensive
• What if hundreds of researchers around the world
each tried to download hundreds of TB of our data?
17. On Demand
• Exploratory and multidisciplinary research cannot
predict data usage a priori
18. Scalability
• How to deal with these challenges in a scalable
manner?
19. Big Data + Cloud
• Affordable
• Elastic
• Scalable
20. A Data Reuse Scenario
• Validate cross-disciplinary research hypotheses, e.g.,
find a novel vibration pattern across all SEB data
• Lower the reuse barrier: must not require users to
invest heavily in infrastructure before the initial
analysis
• No need to move data around
• Can perform user-specified initial data filtering and analysis
• Initial analysis results may be used to enrich the metadata
and facilitate further discovery
• Cost sharing
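The "user-specified initial data filtering" idea in this scenario can be sketched as shipping a small predicate to the data rather than shipping terabytes to the user. Everything below is hypothetical: `run_user_filter`, the chunk layout, and the `peak_hz` threshold are invented for illustration and are not part of the SEB system.

```python
# Hypothetical server-side filtering: a researcher submits a small
# predicate; the repository applies it near the data and returns
# only the matching chunks (which could also enrich the metadata).


def run_user_filter(chunks, predicate):
    """Apply a user-supplied predicate close to the data; return matches."""
    return [c for c in chunks if predicate(c)]


# Toy stand-in for SEB accelerometer readings.
chunks = [
    {"sensor": "accel-01", "peak_hz": 2.1},
    {"sensor": "accel-02", "peak_hz": 17.4},
    {"sensor": "accel-03", "peak_hz": 2.3},
]

# An illustrative "novel vibration pattern" test supplied by the user.
novel = run_user_filter(chunks, lambda c: c["peak_hz"] > 10)
print([c["sensor"] for c in novel])  # ['accel-02']
```

Because only the filtered result leaves the repository, the bandwidth and cost-sharing concerns raised earlier shrink accordingly.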
Today I will present a data repository project that the Virginia Tech Libraries are currently involved in, one that may alleviate some of the reuse pain.
We are developing a technology prototype that targets a slightly different niche from the traditional comfort zone of library development.
Nonetheless, the work is closely related to what academic and research libraries have long been doing: archiving research outcomes and making them available.
The library community certainly has a lot to contribute, but to do it well we may need to pick up a few new tricks, for example, cloud computing.
The presentation will cover the following ground:
First I’ll give you a general idea of data and big data, and in what capacity the libraries have been involved so far.
I’ll then describe the new niche that our development is trying to explore. This is done by describing the Virginia Tech Signature Engineering Building project, or the SEB project.
However, this is not a project briefing. Instead, I believe the problem we are trying to tackle is a general one that can potentially become a growth point for libraries. I will therefore extract the key requirements of the SEB project and generalize them as the characteristics of this new niche. I will then describe how this niche differs from the traditional repository work we are already familiar with, and how cloud computing can help.
I will not cover many implementation details; this is more of a high-level, conceptual overview. My co-author Collin Brittle presented some of the implementation details at this year's Open Repositories conference in Helsinki, Finland. His presentation video is online in case you are interested.