2024: Domino Containers - The Next Step. News from the Domino Container commu...
SEAD Prototype: Data Curation and Preservation for Sustainability Science
1. The SEAD Prototype: Data Curation and Preservation for Sustainability Science
Beth Plale, Robert H. McDonald, Kavitha Chandrasekar, Inna Kouper, Indiana University, {plale, rhmcdona, kavchand, inkouper}@indiana.edu
Margaret Hedstrom, James Myers, University of Michigan, {hedstrom, myersjd}@umich.edu
Praveen Kumar, Rob Kooper, Luigi Marini, University of Illinois at Urbana-Champaign, {kumar1, kooper, lmarini}@illinois.edu
SEAD Vision and Rationale
Serve interdisciplinary and data-driven
research in sustainability science
Enable access to publications, data and
people
Support new types of analyses with
heterogeneous data
Reduce overall cost of data curation
and preservation
Capture metadata to provide
immediate value for users, producers
and repositories
Increase capabilities for research data
re-use
Active Curation, Actionable
Data (ACR)
Community Exploration,
Research Analytics (VIVO)
SEAD Use Cases (focusing on curation)
Active
Content
Repository
APIs – Web Services
Role-based
Access Control
Data/Metadata
Management
People / Projects / Publications
Data Citations
Organizations
Visualized Networks and Community
Dynamics
People
Projects
Publications
Organizations
Data Citations
Visualizations
Branded Public Access
Active Project Spaces
Individual Data Pages
Data pages
Collection pages
Tag – Search – Map
Project Summary
Geo-Web App
Branded Repository
Android – Desktop
Apps
Ingestion of heterogeneous data types
(e.g., images, geo-spatial data, and sensor
data) and mapping of semantic
relationships among the research data
collections as well as semantic annotation
and tagging.
Support of data discovery through
interoperable standards and algorithms,
social networking and data publishing.
Enhancements of existing data through
automated scientific metadata extraction
and data visualization plugins.
Ingestion of new data sets directly via
workbench tools.
Curation of data via federated deposit into
institutional and disciplinary repositories.
Data Publication,
Preservation and Discovery (VA)
Policy-Driven Curation
Institutional / Cloud / Grid Storage
Faceted Search
Curator’s Workbench
Ingest Processing
Matchmaking
Faceted Search
Geo-spatial Search
SEAD Prototype
VIVO
APIs – Web Services
APIs – Joseki – Web Services
Extractors and
Indexing
User Management
RDF –Tupelo 2 – Medici – Lucene – Geoserver
SPRAQL /
HTTP
User / Entity Management –
Analytics
Jena – RDF
MySQL – Local File System
Virtual Archive
SPRAQL / HTTP
BAGIT
Metadata Extraction
Persistent IDs
Indexing
Archiving
Solr Query
(XML)
Geospatial
Query
BagIt
Matchmaker
Conversion
DataONE
Member Node
MySQL – Local File System – Solr – PostGIS
MySQL – Local File System
Acknowledgements
SEAD is funded by the National Science Foundation under Cooperative Agreement #OCI0940824.
SEAD gratefully acknowledges all of our partner participants who have been involved in developing our services
framework. This includes the research teams from the following organizations: School of Information, University
of Michigan; Department of Civil and Environmental Engineering, the National Center for Supercomputing
Applications (NCSA) and UIUC Libraries, University of Illinois at Urbana-Champaign; Data to Insight Center, IU
Libraries and School of Informatics and Computing, Indiana University; the Interuniversity Consortium for Political
and Social Research (ICPSR); the National Center for Earth-Surface Dynamics (NCED) and the Data Conservancy
Project, John Hopkins University.
Currently, SEAD has implemented core functionality for uploading, annotating, and viewing data, linking data to researcher profiles, and mechanisms to package this information
and transfer it to institutional repositories or archival cloud storage. The curation pipeline to institutional repositories supports both long-term preservation and search and
discovery workflows. The SEAD prototype is currently being tested by ingesting, annotating, and preserving datasets from the National Center for Earth Surface Dynamics (1.6
terabytes of data containing over 450,000 files) which involves transfer of data and metadata between SEAD ACR, VIVO and VA components.