Create, curate, re-use: the expanding life course of digital research data
1. a centre of expertise in data curation and preservation
Chris Rusbridge
EDUCAUSE Australasia May 2007
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5
UK: Scotland License, excluding content that is the property of others. To view a copy of this
license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or (b) send a letter
to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
2. a centre of expertise in data curation and preservation
Contents
• Science and digital curation
• Why are data important?
• What kinds of data?
• What to do with your data: frontiers of practice
• Repository frontiers
• Changing practice
EDUCAUSE Australasia 2007
3. a centre of expertise in data curation and preservation
Digital Curation Centre Mission
“The over-riding purpose of the DCC is to
support and promote continuing improvement
in the quality of data curation, and of
associated digital preservation”
5. a centre of expertise in data curation and preservation
Summarising…
• Sustainability
• Creation or selection
• Growth, development
• Making available
• Access management
• Re-usability
• Linkage, context, metadata
• Authenticity, integrity, provenance
• Maintaining meaning over time
• Preserving, including past states
• De-selection…
• Extended time
• Budget and policy impacts
• People issues!
6. a centre of expertise in data curation and preservation
Science and curation
• Creating and managing data suitable for re-use
• Good curation supports good science (managing
your data properly)
• Poor curation allows sloppy science?
• Data curation should save money
• Murray-Rust/Frey on interesting but fruitless experiments!
• Some science impossible without curation…
• QCD strong coupling constant prediction (Bethke)
• Viscosity of earth mantle from Shang Dynasty eclipse
records (Pang et al)
• Science depending on past baselines (eg environmental,
social sciences)
7. a centre of expertise in data curation and preservation
Records of science
• Data increasingly important as evidence
• Key part of the scholarly record (public good)
• Unrepeatable observations & experiments
• Experimental verifiability (the basis of science)
• Would Chang retractions have been reduced if his first
data were available?
• Allows additional interpretations
• Legal and compliance
• See APSR/AERES report for good examples
8. a centre of expertise in data curation and preservation
What kinds of data?
• Observations
• eg UARS (Upper Atmosphere) Level 0: telemetry
• UARS Level 1: measured physical parameters (post
calibration?)
• Derived data
• UARS Level 2: calculated geophysical? profiles
• UARS Level 3: gridded, interpolated?
• Combined data
• Crafted data
• Eg annotated gene/protein databases
• Descriptive (meta)data
9. a centre of expertise in data curation and preservation
Retaining research data means…
• Data secure against loss (within group)
• Communal repository (secure bit dump)
• Re-usable, sharable information
• As above, plus active curation (eg bio-informatics)
• Long term preservation of information
• Be clear what you are trying to do!
10. a centre of expertise in data curation and preservation
… or the data trajectory is…
• Hard drive → lost (crash)
• Hard drive → DVD → Cardboard box → Loft → Skip/dumpster → lost
• Sometimes this is a very bad thing
• Sometimes these are the right options!
11. a centre of expertise in data curation and preservation
Long term bit storage…
• A solved problem? Just requires well-understood
good data management practices?
• Wrong! For very large datasets over very long
times, there are significant problems…
BAKER, M., SHAH, M., ROSENTHAL, D. S. H., ROUSSOPOULOS, M., MANIATIS, P., GIULI, T. J.
& BUNGALE, P. (2006) A Fresh Look at the Reliability of Long-term Digital Storage. EuroSys
'06. Leuven, Belgium, ACM.
12. a centre of expertise in data curation and preservation
How Well Must We Preserve?
Keep a petabyte for a century
– With 50% chance of remaining completely undamaged
Consider each bit decaying independently
– Analogy with radioactive decay
That's a bit half-life of 10**18 years
– One hundred million times the age of the universe
That's a very demanding requirement
– Hard to measure
– Even very unlikely faults will matter a lot
Slide from David Rosenthal, LOCKSS
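The half-life figure on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes a decimal petabyte (10**15 bytes of 8 bits each) and the slide's model of every bit decaying independently; the function and constant names are illustrative only.

```python
import math

PETABYTE_BITS = 8 * 10**15  # assumption: 1 PB = 10**15 bytes, 8 bits per byte
YEARS = 100                 # "keep a petabyte for a century"
TARGET_SURVIVAL = 0.5       # 50% chance that no bit at all is damaged


def required_bit_half_life(n_bits, years, p_all_survive):
    """Half-life each bit needs so that all n_bits survive with probability p.

    One bit survives `years` with probability 2**(-years / h).
    All n_bits survive with probability 2**(-n_bits * years / h).
    Setting that equal to p gives h = n_bits * years / log2(1 / p).
    """
    return n_bits * years / math.log2(1.0 / p_all_survive)


half_life = required_bit_half_life(PETABYTE_BITS, YEARS, TARGET_SURVIVAL)
print(f"{half_life:.1e} years")  # 8.0e+17 years
```

The result, 8.0e+17 years, rounds to the 10**18 years quoted on the slide.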
13. a centre of expertise in data curation and preservation
What to do about curation
• Build curation/reusability into your workflow
• Curation begins before creation
• What’s easy at first becomes (impossibly) hard
later
• Describe your data (metadata schemas,
“representation info”, etc)
• Keep experimental parameters (technical, who,
what, when, where)
• Keep ability to process
• Keep data!
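One way to keep the who, what, when and where alongside the data is a small machine-readable description saved next to each dataset. The sketch below is an assumption about what such a record might hold, not a metadata standard; every field name and value in it is hypothetical.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DatasetDescription:
    """Hypothetical minimal description; field names are illustrative only."""
    title: str
    who: str            # person or group that created the data
    what: str           # short description of the content
    when: str           # creation date, ISO 8601
    where: str          # place or facility
    data_format: str    # ideally a standard/agreed format
    parameters: dict = field(default_factory=dict)  # experimental settings
    software: str = ""  # what is needed to process the data later


def to_sidecar_json(desc):
    """Serialise the description so it can be stored beside the data file."""
    return json.dumps(asdict(desc), indent=2, sort_keys=True)


example = DatasetDescription(
    title="Example run 1",
    who="A. Researcher",
    what="Illustrative measurement series",
    when="2007-05-01",
    where="Example laboratory",
    data_format="CSV",
    parameters={"temperature_K": 293, "samples": 120},
    software="analysis-script v0.1 (hypothetical)",
)
sidecar = to_sidecar_json(example)
print(sidecar)
```

Writing such a record at creation time is cheap; reconstructing it years later from memory is usually not possible.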
14. a centre of expertise in data curation and preservation
What to do about curation - 2
• Use standard/agreed formats for data
• Make ownership & restrictions clear, &
explain how to cite your data
• Offer for deposit in institutional or discipline
repository
• Appraisal and selection essential
• Possible time-limited embargos
• “Publish” data in support of articles
15. a centre of expertise in data curation and preservation
Internet Archaeology: publication with data
16. a centre of expertise in data curation and preservation
Database as book…
• Buneman (early pilot) work on IUPHAR database
• MySQL to XML database
• Historic to logical schema
• XML via XSLT to LaTeX
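The database-to-book pipeline listed on this slide can be illustrated in miniature. This is only a sketch under stated assumptions, not the pilot's actual code: an in-memory SQLite database stands in for MySQL, plain Python traversal of the XML tree stands in for the XSLT step, and the table, column names and sample row are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn, table):
    """Dump every row of `table` (a trusted, hard-coded name) into an XML tree."""
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    root = ET.Element("database")
    for row in cur:
        record = ET.SubElement(root, table)
        for column, value in zip(columns, row):
            ET.SubElement(record, column).text = str(value)
    return root


LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}


def tex_escape(text):
    """Escape a few characters that LaTeX treats specially."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def xml_to_latex(root):
    """Turn each record into a section: first field as heading, rest as items."""
    lines = []
    for record in root:
        fields = list(record)
        lines.append(r"\section*{" + tex_escape(fields[0].text) + "}")
        lines.append(r"\begin{description}")
        for f in fields[1:]:
            lines.append(r"\item[" + tex_escape(f.tag) + "] " + tex_escape(f.text))
        lines.append(r"\end{description}")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (name TEXT, family TEXT, note TEXT)")
conn.execute("INSERT INTO entry VALUES ('Example A', 'Family 1', '50% & rising')")
root = rows_to_xml(conn, "entry")
latex = xml_to_latex(root)
print(latex)
```

Keeping XML as the intermediate form means the same export can feed both a printable edition and other consumers of the data.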
17. a centre of expertise in data curation and preservation
The StORe vision
• Seamless transport from research data to research publications and vice versa
• Bi-directional links proven in social science e-research but capable of export to other disciplines
[Diagram: Source, Middleware, Output]
• http://jiscstore.jot.com/WikiHome/
Slide from Graham Pryor
18. a centre of expertise in data curation and preservation
What are the reusability issues?
• Data not neutral to hypothesis
• Hard to know the risks & pitfalls of a particular
dataset
• Data not self-describing: hard to find
appropriate data (but see Murray-Rust on
Googling InChI etc)
• Hard to “understand” data once found
• Really need information, not data!
• Hard to use data once understood
19. a centre of expertise in data curation and preservation
Context
• Data meaningless without context
• Metadata of many kinds
• Representation information… from data to
information
• Linkage and connection between datasets
• Use your workflow!
• Provenance
• Authenticity/integrity
• Computational lineage
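For the integrity item, one common technique is to record a checksum for each file when it is deposited and to re-check it later. The sketch below assumes SHA-256 as the checksum and keeps file contents in memory for brevity; the file names and helper functions are hypothetical.

```python
import hashlib


def fingerprint(data):
    """SHA-256 digest of a file's bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def make_manifest(files):
    """Record one digest per file name at deposit time."""
    return {name: fingerprint(content) for name, content in files.items()}


def changed_files(files, manifest):
    """Return names that are missing or whose content no longer matches."""
    problems = []
    for name, expected in manifest.items():
        if name not in files or fingerprint(files[name]) != expected:
            problems.append(name)
    return sorted(problems)


original = {
    "run1.csv": b"t,value\n0,1.0\n",
    "readme.txt": b"calibrated 2007-05-01\n",
}
manifest = make_manifest(original)

damaged = dict(original)
damaged["run1.csv"] = b"t,value\n0,1.1\n"  # a single changed digit

print(changed_files(original, manifest))  # []
print(changed_files(damaged, manifest))   # ['run1.csv']
```

A checksum shows that the bits are unchanged; it says nothing about provenance or meaning, which still need the other kinds of context listed above.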
20. a centre of expertise in data curation and preservation
[Diagram labels: NASA; Csat 8-day composite and subscene; Csat 8-day composite subscene; PAR subscene; RPT; E0SST and Pbopt calc; H; Ctot calc; Zeu calc; PPeu calc; University research group1; University research group2; University research group3; local decision-making body]
Slide from Rajendra Bose
21. a centre of expertise in data curation and preservation
Access and re-use
• Ethics and rights control access
• We are weak at expressing these over the long term
• Collaboration tools
• Annotation, discussion, review (see DART…)
• Re-use leading to change and development
• “Publication”
• Not just in “print”
• Underlying data should be “published”, too
22. a centre of expertise in data curation and preservation
Database citation issues…
• Citation for human readers and machine use cases
• Granularity: database, record, item
• Citation of changing objects
• Version change (eg W3C practice: no version = latest, vs bibliographic:
no version = first)
• An efficient way to reference and access “archived” past states of
more rapidly changing datasets, eg Genomics… datasets that result
from the combined work of curators, or contain opinions or facts likely
to change (work in progress, Buneman et al)
• Standards conflicting and immature (NLM best?)
• Citation ESSENTIAL for motivating quality academic work on data
management and curation
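The version ambiguity above can be made concrete. The sketch below is a hypothetical model, not any existing citation standard: a citation names a database and a record, optionally a version, and the resolver must be told whether an unversioned citation means the latest state (the W3C-style reading) or the first (the bibliographic reading). All names and data are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    database: str
    record: str
    version: Optional[int] = None  # None means the citation gives no version


def resolve(citation, archive, unversioned="latest"):
    """Look up the cited state of a record in an archive of past states.

    `archive` maps (database, record) to a dict of {version: content}.
    `unversioned` chooses how to read a citation without a version.
    """
    if unversioned not in ("latest", "first"):
        raise ValueError("unversioned must be 'latest' or 'first'")
    states = archive[(citation.database, citation.record)]
    if citation.version is not None:
        return states[citation.version]
    chosen = max(states) if unversioned == "latest" else min(states)
    return states[chosen]


archive = {
    ("exampledb", "rec42"): {
        1: "first curated text",
        2: "revised text",
        3: "current text",
    }
}

cite = Citation("exampledb", "rec42")
print(resolve(cite, archive, "latest"))  # current text
print(resolve(cite, archive, "first"))   # first curated text
print(resolve(Citation("exampledb", "rec42", 2), archive))  # revised text
```

The same unversioned citation yields different content under the two conventions, which is why a citation to a changing dataset needs either an explicit version or an agreed default.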
23. a centre of expertise in data curation and preservation
Who does curation?
• Individuals
• Departments or groups
• Institutions, maybe through libraries
• Communities
• Disciplines
• Publishers
• National services
• Other 3rd parties…
24. a centre of expertise in data curation and preservation
Curation: Individual
• “Small science 2-3 times more data than Big
science”, but much more at risk
• PhD student? RA? PI? Administrator? IT support?
• Data potentially on local hard drives, or at best
shared network drives
• May be inadequately protected
• Liable for policy-led deletion on resignation
• Individual “knows” too much (tacit knowledge)
• Documentation/metadata unlikely to be adequate
• Future: gone!
26. a centre of expertise in data curation and preservation
Department: eCrystals
• Partnership with Institutional Repository
• Specialist department archive (& national service)
• Workflow recording of lab parameters (R4L)
• Public & private elements
• Trying to build eCrystals federation (eBank 3)
• Future: likely to continue
27. a centre of expertise in data curation and preservation
Data in institutional repositories
28. a centre of expertise in data curation and preservation
Institution: Cambridge Chemistry
• 175,000 small molecule structures in CML
• Alongside Archaeology, Manuscripts, Learning Materials, etc
• No library curation skills; dependent on research group enthusiast
• Collection isolated from other Chemistry
• (Only 5 UK institutional repositories claim to hold data)
• Future: assured…
29. a centre of expertise in data curation and preservation
Community: LOCKSS?
• Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
• Traditionally libraries collecting eJournals
• Model respects IPR
• No domain expertise; rely on origins
• Data limitations…
• Future: potentially very persistent (low cost, high reliability, attack resistance, distributed)
30. a centre of expertise in data curation and preservation
Discipline: Atmospheric Science
• Strong believer in need for domain scientists as curators
• Significant participant in “community proxy” agenda-setting activities
• Internationally fragmented resources
• Future: mostly dependent on grant funding (but strong commitment)
31. a centre of expertise in data curation and preservation
Discipline: Pharmacology
• International Scientific Union
• Attempting to build credit for data contributions
• Future: extremely limited funding
32. a centre of expertise in data curation and preservation
Bio-informatics: Nature article, 23 June 05
• Databases in Peril
• 51 out of 89 biological databases contacted reported they
were struggling financially
• 7 have closed
• Several being updated in owner’s spare time
• (Notes that not all deserve long term support)
• [Nucleic Acids Research reports 968 databases in 2007!]
• Major issue: money
33. a centre of expertise in data curation and preservation
Publisher: Crystallography
• Publisher and Scientific Union
• Created key domain crystallographic standard (CIF)
• Strong motivator for deposit of structure data
• Consistent quality checks
• DOIs used for structure data
• Future: publishing business model
Slide from IUCr
34. a centre of expertise in data curation and preservation
National bodies: British Library
• Serious and robust approach
• Legal deposit powers & responsibilities as driver
• Oriented primarily towards “cultural heritage” (broadly interpreted)
• Little data, no science domain experience
• Future: strong future commitment
35. a centre of expertise in data curation and preservation
National bodies: TNA/NDAD
• Specialist archive for government datasets
• Understand government regulations, dynamics & requirements
• Subject generalists; disconnected from associated science
• Technology specialists (understand databases)
• Future: likely to pass eventually to The National Archives
36. a centre of expertise in data curation and preservation
3rd parties: Portico
• Specific area: eJournals
• Depends on publisher agreements
• No data or domain science expertise
• Future: commitment from Mellon + publishers + subscriptions, good funding mix
37. a centre of expertise in data curation and preservation
3rd Parties: Iron Mountain?
• Records management IS a curation problem
• Organisations like this very likely to branch out
• No domain science expertise
• Future: business case, viability, stock market…
38. a centre of expertise in data curation and preservation
3rd parties: Web 2.0 style, Swivel.com??
39. a centre of expertise in data curation and preservation
Institutions & the network
• Institutions have fundamental sustainability
• Disciplines have domain knowledge advantage but sustainability is an issue
• Can we get the best of both?
• Needs serious work to examine!
[Table: rows Discipline 1, Discipline 2, Discipline 3, etc; columns Inst'n 1, Inst'n 2, Inst'n 3; each discipline row carries two X marks]
40. a centre of expertise in data curation and preservation
Who are the curation players?
41. a centre of expertise in data curation and preservation
Cultural change
• If we build it, will they come? NO!!
• Outreach important: communication with
scientists and researchers is hard graft
• Cultural change to new approach requires more:
• Incentives, rewards and mandates
• Successful exemplars (well publicised)
• Discipline-oriented approach (one size does not fit all)
42. a centre of expertise in data curation and preservation
Australian context?
• In the emerging context of the Research
Quality Framework, and the expected
National Collaborative Research
Infrastructure Strategy, curation can only
increase in importance!
43. a centre of expertise in data curation and preservation
Thank you
• (Citations are in the paper in the proceedings)