CNI Fall 2011 Meeting Presentation Margaret Hedstrom & Robert McDonald (Dec. 2011)
1. SEAD
Sustainable Environment –
Actionable Data
CNI Fall Members Meeting
Margaret Hedstrom, SEAD PI/Project Director, Professor & Associate Dean, UM School of Information
Robert H. McDonald, SEAD Sr. Personnel, Assoc. Dean/Associate Director, Indiana University
Arlington, VA, 12/12/2011
2. NSF DataNet Program
• new types of organizations that integrate library & archival
sciences, cyberinfrastructure, computer & information sciences, &
domain science expertise
• provide reliable digital preservation, access, integration, and
analysis capabilities for science and/or engineering data over a
decades-long timeline;
• continuously anticipate and adapt to changes in technologies and in
user needs and expectations;
• engage in research to drive the leading edge forward
• serve as component elements of an interoperable data preservation
and access network
http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503141
3. Partners
• SEAD’s Unique Contributions
– Address domain-driven needs & requirements
– Serve scientists and researchers in the “long tail”
– Integrate existing technologies, tools & services (rather than build
new from scratch)
5. Data challenges
• Heterogeneity of
all kinds
• Multiple scales
• Multidisciplinary
• Many small
datasets
6. The long tail of scientific research
• Small and derived data sets
• Heterogeneous data
• Multiple sources of data
• Short-lived data with long-term
value
• Value of data grows when combined
& integrated
7. SEAD’s Goals
• Provide data services that address the needs of
researchers working toward sustainability
• Integrate these services into a generalizable “Active and
Social Curation” infrastructure suited to the social
structure and economics of long-tail research
communities
• Develop capabilities to package and migrate the most
valuable datasets to a federated repository
infrastructure for long-term preservation
• Education, outreach, & training to disseminate SEAD’s
contributions to other projects & communities
8. SEAD’s Strategy
• Leverage social media for discovery of
data, interest, and expertise
• Move data curation upstream in the data life
cycle
• Involve domain scientists in setting priorities
for evolution of data and services
• Take advantage of existing infrastructures
(Institutional Repositories, ICPSR) for long-
term preservation
9. Active and Social Curation
• Engage researchers during projects, not at the
end
• Automatically capture metadata as defined by
the data producers
• Provide facilities for
commentary, recommendations, and mark-up
of data
• Further reduce costs by re-engineering
curation processes to leverage this rich
metadata and volunteered effort
11. SEAD Status
Phase 1 (Months 1-18): Develop Prototype
Phase 2 (Years 3-5): Grow SEAD users, data, and functionality
SEAD start date: 10/1/2011
In other words, SEAD is not ready to accept your data!
12. SEAD Personnel
• Margaret Hedstrom, PI (Michigan)
• Praveen Kumar, co-PI (Illinois)
• Jim Myers, co-PI (RPI)
• Beth Plale, co-PI (Indiana)
• Ann Zimmerman, co-PI/Project Manager
(Michigan)
• George Alter (ICPSR)
• Bryan Beecher (ICPSR)
• Katy Börner (Indiana)
• Robert McDonald (Indiana)
• Jude Yew, Post-doc (Michigan)
• + many more to come
14. SEAD TEAM
University of Michigan: Margaret Hedstrom (UM PI), Ann
Zimmerman (Co-PI and Project Manager), George Alter, Bryan
Beecher, Charles Severance, Karen Woollams, Jude Yew.
Indiana University: Beth Plale (IU PI), Katy Börner, Robert H.
McDonald, Kavitha Chandrasekar, Robert Ping, Stacy
Kowalczyk, Robert Light.
University of Illinois: Praveen Kumar (UIUC PI), Rob Kooper, Luigi
Marini, Terry McLaren.
Rensselaer Polytechnic Institute: Jim Myers (RPI PI), Ram Prasanna
Govind Krishnan, Lindsay Todd, Adam Wilson.
15. SEAD Cyberinfrastructure
• An international resource
for sustainability science
• Novel technical and
business approaches to
supporting the long-tail
of research data
• Lifecycle support:
actionable data services
integrated with curation
and preservation
infrastructure
16. Key Challenges for SEAD
Cyberinfrastructure
• Managed Data storage and services are expensive!
• Begging for metadata doesn’t work!
• Curation and preservation are time consuming!
• The long-tail is not standardized!
• Data collections are always missing something
valuable!
• Data models evolve!
• Cyberinfrastructure is obsolete by the time you build
it!
• Building Community as you leverage
cyberinfrastructure
18. Linked Data and Repositories
• Tag and annotate data
• Overlay it with reference data
• Organize it in domain terminology
• Link it to
people, papers, projects, conversations…
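The tagging and linking steps above can be pictured as a small set of subject-predicate-object statements. The sketch below is only an illustration in plain Python: every identifier and predicate name in it (such as `dataset:streamflow-2011` or `taggedWith`) is hypothetical, and nothing here reflects SEAD's actual data model or vocabulary.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple.
# All identifiers and predicate names are hypothetical placeholders.
triples = [
    ("dataset:streamflow-2011", "taggedWith", "hydrology"),
    ("dataset:streamflow-2011", "annotatedBy", "person:a-researcher"),
    ("dataset:streamflow-2011", "describedIn", "paper:example-report"),
    ("dataset:streamflow-2011", "producedBy", "project:example-project"),
]


def links_from(subject, statements):
    """Group every object linked from `subject` by its predicate."""
    grouped = defaultdict(list)
    for s, p, o in statements:
        if s == subject:
            grouped[p].append(o)
    return dict(grouped)


if __name__ == "__main__":
    for predicate, objects in links_from("dataset:streamflow-2011", triples).items():
        print(predicate, "->", ", ".join(objects))
```

Keeping tags, annotations, and links in one uniform statement shape is what lets a dataset be connected to people, papers, and projects without a fixed schema.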
20. KEY SEAD Questions
• What could SEAD capture when?
• How can SEAD provide direct value
to data producers, users, and
curators?
• How can robust web-services and
social computing lower barriers and
reduce/realign costs?
21. SEAD: Active Content Repository
• With the ‘Big Picture’ graph in hand, curators
can:
▫ Focus on what to curate and when,
▫ Automate parts of the process
▫ Use existing/emerging technologies for packaging
and preserving datasets
▫ Better manage federated repositories
22. SEAD: Leveraging Existing Resources
• Cyberinfrastructure
▫ IU Data Capacitor/HPC Capabilities
▫ UIUC/NCSA HPC Capabilities
▫ Rensselaer CCNI Capabilities
• Repositories
▫ UM Deep Blue
▫ IU ScholarWorks
▫ ICPSR Repository
▫ UIUC IDEALS
23. SEAD LayerCake View
• Services over an active content layer that is backed
by/harvested into a federated archive infrastructure based
on institutional resources
[Diagram labels: Network of Data Producers; Web User Interface; Active Content Repository; Services Provided (Content Mining, Curation Decisions, Archival generation, Other data services); Virtual Archives; Institutional Repositories (IU, RPI, UIUC, UM, ICPSR, Data Conservancy); User Network]
24. CI Technical Approach
[Figure: CI technical approach architecture diagram. Recoverable labels:]
• Top-level regions: Active and Social Curation; Curation Boundary; OAIS Repository Federation
• Data Acquisition, Analysis and Simulation
• Metadata Management (DDI3, METS, PREMIS, MODS, DC, SensorML, OGC, …)
• Workflow/Rule Engine: Operates on Metadata, Content Objects and Trigger Events
• Automated Curation
• Scholarly Communication and E-Scholarship
• VIVO/Linked Data
• Active Content Repository
• Compound Objects - OAI-ORE
• Appraisal and Selection
• Ingest scripts: fixity, integrity, authentication, transformation
• Ingest, AIPs
• Digital Repository Federation (OAIS compliant)
• Preservation Actions; Migration; Emulation
• Dissemination Packages
• Wide-Area File System
• Search, Browse, and Access Mechanisms
• Annotation, Visualization
• Use, Reuse, Repurposing
• Contributor Tools; User Tools; Services Tools
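The CI Technical Approach diagram lists fixity among the checks performed by ingest scripts. A fixity check usually means recomputing a file's checksum and comparing it with the value recorded at deposit. The sketch below assumes SHA-256 and uses hypothetical function names; the slides do not say which algorithm or tooling SEAD uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large datasets never load fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, expected_hex: str) -> bool:
    """True when the file still matches the checksum recorded at ingest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Recording the digest at ingest and re-running `verify_fixity` on a schedule is one common way to detect silent corruption before a dataset is migrated or disseminated.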
25. Toward PetaScale Data
• Internet2 upgrade:
▫ Total bandwidth from 100 Gbps to 8.8 Tbps
▫ Moving a petabyte of data will go from 10 days to 25 hrs
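As a rough check on these figures, ideal transfer time is data size divided by sustained throughput. The sketch below assumes a decimal petabyte (10^15 bytes) and no protocol overhead. The function name and the 10 Gbps sample rate are illustrative assumptions; only 100 Gbps and 8.8 Tbps appear on the slide.

```python
def transfer_hours(size_bytes: float, rate_bits_per_s: float) -> float:
    """Ideal time in hours to move size_bytes at a sustained bit rate."""
    return size_bytes * 8 / rate_bits_per_s / 3600


PETABYTE = 10**15  # assumption: decimal petabyte, not 2**50 bytes

# 10 Gbps is an assumed comparison rate; the other two come from the slide.
RATES = {"10 Gbps": 10e9, "100 Gbps": 100e9, "8.8 Tbps": 8.8e12}

if __name__ == "__main__":
    for label, rate in RATES.items():
        hours = transfer_hours(PETABYTE, rate)
        print(f"{label}: {hours:.2f} hours ({hours / 24:.2f} days)")
```

Under these assumptions a petabyte takes a little over 9 days at 10 Gbps and about 22 hours at 100 Gbps. That is close to the slide's 10 days and 25 hrs, which suggests (though the slide does not say so) that the quoted times describe a single connection rather than the aggregate bandwidth.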
26. SEAD 18 Month Prototype Targets for
Cyberinfrastructure
• Active and Social Content Curation
▫ Pilot Active Content Repository, VIVO deployments
▫ Exemplar services for Data Ingest, Discovery, Re-
use, Curation
• CI for Long-term Access
▫ Data model, protocol design/development
▫ Pilot Federated Repository infrastructure
27. SEAD CI QuickView
• SEAD will quickly build a repository and data services infrastructure
for sustainability research that can be responsively adapted based on
community feedback – Community Agile Development
• SEAD will leverage existing tools and emerging practices to
dramatically enhance the interactions of researchers and data
librarians – Active Curation
• SEAD’s focus on the long-tail will force an emphasis on ease-of-use
and low costs that is critical for long-term sustainability – Leverage
Existing Institution Resources for Long-term Access
• SEAD will leverage experiences in the sustainability research
community to provide guidance for other long-tail communities
making the transition to an interdisciplinary, systems-oriented
approach to research – Sustainability and Resource Growth
Partnership and Collaboration
28. Acknowledgments
SEAD is funded by the National Science
Foundation under cooperative agreement
#OCI0940824
• For more on SEAD go to:
• http://sead-data.net
• Follow us on Twitter
@SEADdatanet
Editor's Notes
How many people in this audience have an institutional repository? Are you using it to publish data?