Opening Keynote: The Many and the One: BCE themes in 21st century data curation
Allen Renear, Professor and Interim Dean, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Two scientists can be using "the same data" even though the computer files involved appear to be quite different. This is familiar enough, and for the most part, in small communities with shared practices and familiar datasets, raises few problems. But these informal understandings do not scale to 21st century data curation. To get full value from cyberinfrastructure we must support huge quantities of heterogeneous data developed by diverse communities and used by diverse communities -- often with widely varying methods, tools, and purposes. To accomplish this our informal practices and understandings must be replaced, or at least supplemented, by a shared framework of standard terminology for describing complex cascades of representational levels and relationships. Fundamental problems in data curation -- and in particular problems involving provenance, identifiers, and data citation -- cannot be fully resolved without such a framework. Although the deepest problems here have ancient origins, useful practical measures are now within reach. Some recent work toward this end that is being carried out at the Center for Informatics Research in Science and Scholarship (CIRSS) at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign will be described.
1. The Many and the One
BCE problems in 21st c. data curation
Tracking it Back to the Source: Managing and Citing Research Data
NISO Forum, Denver, Sept 24, 2012
Allen H. Renear
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
Principal researchers of material presented:
David Dubin, Karen M. Wickett, Simone Sacchi, Richard Urban, Allen H. Renear
Center for Informatics Research in Science and Scholarship
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
NSF/OCI-ITR DataNet Award #0830976
IMLS/LB Award #RE-05-08-0062-08
2. Problems, Problems, Problems
Identity problems:
– Is this the data we think it is? Is it the same data as that data?
(involves issues of authenticity, integrity, encoding)
Meaning problems:
– What is this data supposed to be telling us?
(involves interpreting the semantics of the data)
Relationship problems:
– How is this data related to that data?
(involves issues of data provenance)
Integration problems:
– How can I combine this data with other data?
(involves harmonizing conflicts at multiple levels)
Interoperation problems:
– How can I get this data to work with my software?
(involves conversion to equivalent formats)
An issue underlying all these is representation…
How do digital files represent facts about the world?
4. Identity Problems
Compare:
Two scientists, Jill and John,
used the same statistician.
5. Identity Problems
Compare:
Two scientists, Jill and John,
used the same centrifuge.
6. Identity and Representation Levels
Consider two files with the
… same data,
but relational tables in one case
and RDF triples in another
… with the same data and the same RDF triples,
but an XML serialization in one case,
an N3 serialization in another
… with the same data, the same RDF triples, the same N3 serialization,
but UTF-8 character encoding in one case
and UTF-16 encoding in another
How many levels do we need? How do we define and manage them?
How can they be identified and re-identified?
Which identifier schemes for which level?
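A minimal sketch of the last case above, assuming Python and a made-up N3-style statement (the ex: names are hypothetical). The same characters encoded as UTF-8 and as UTF-16 give different bitstreams and different checksums, yet both decode to the same character sequence. Whether the two files count as "the same" depends on the level at which they are compared.

```python
import hashlib

# Hypothetical N3-style statement; the names are illustrative only.
statement = 'ex:sample_31 ex:U22376 "408" .'

utf8_bytes = statement.encode("utf-8")
utf16_bytes = statement.encode("utf-16")


def sha256_hex(data: bytes) -> str:
    """Return a checksum that identifies one specific bitstream."""
    return hashlib.sha256(data).hexdigest()


# Bit level: the two bitstreams differ, so their checksums differ.
same_bits = sha256_hex(utf8_bytes) == sha256_hex(utf16_bytes)

# Character level: both bitstreams decode to the same characters.
same_chars = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16")

print("same bitstream: ", same_bits)
print("same characters:", same_chars)
```

A checksum is therefore a reasonable identifier for a bitstream, but not, on its own, for the characters, the statements, or the data they carry.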
7. What is a dataset anyway?!
Maybe we should ask a scientist
They’ll have an answer, right?
9. Cries from the heart
“the terms ‘Data Product’, ‘Data Set,’ and ‘Version’
are overlaid with multiple meanings between
communities.”
(Barkstrom, 2009)
“There is ambiguity in what type of object a dataset is;
with different groups of users applying different
connotations
There needs to be an explicit statement of what
the intended preservation of a dataset will imply.”
(Pepler, 2008)
10. Forcing us to conclude…
No single object can possibly have all those attributes
Therefore it is impossible to give the common colloquial
notion of dataset a precise definition
It must instead be replaced
by a family of new more specific concepts
Sound familiar?
12. A FRBR inspired solution
FRBR eliminates the ordinary “book” from our world
The ordinary “book” can be simultaneously
about chordata,
in French,
typeset in neo-Bauhaus,
mustard-stained
but FRBR replaces the book with four objects
the work is about chordata,
the expression is in French,
the manifestation is typeset in neo-Bauhaus,
the item is mustard-stained
13. FRBR entities and attributes
Work: “an … intellectual or artistic creation”
Expression: “the … realization of a work … notation … etc.”
Manifestation: “the physical embodiment of an expression of a work”.
Item: “a single exemplar of a manifestation”
Attribute assignments characteristically disjoint
A work may have a subject.
It does not have a language, typeface, or condition.
An expression may have a language;
It does not have a subject.
(or a typeface or a condition).
A manifestation may have a typeface.
It does not have a subject or a language
(or a condition)
An item may have a condition.
It does not have a subject, language, or typeface.
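The disjoint attribute assignments above can be sketched as four separate record types. This is an illustrative simplification in Python, not the FRBR specification: the class and field names are assumptions, and each entity here points to exactly one entity above it, which is narrower than the relationships FRBR allows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    subject: str


@dataclass(frozen=True)
class Expression:
    work: Work
    language: str


@dataclass(frozen=True)
class Manifestation:
    expression: Expression
    typeface: str


@dataclass(frozen=True)
class Item:
    manifestation: Manifestation
    condition: str


work = Work(subject="chordata")
expression = Expression(work=work, language="French")
manifestation = Manifestation(expression=expression, typeface="neo-Bauhaus")
item = Item(manifestation=manifestation, condition="mustard-stained")

# What ordinary speech attributes to "the book" is reached by walking the chain.
print(item.condition)
print(item.manifestation.typeface)
print(item.manifestation.expression.language)
print(item.manifestation.expression.work.subject)

# Attribute assignments are disjoint: an item has no language of its own.
print(hasattr(item, "language"))
```

The design point is that no single object carries all four attributes; each attribute has exactly one home.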
15. Ambiguities
Is
<object name="sample_31">
  <feature name="U22376" value="408" />
  <feature name="X59417" value="1784" />
</object>
An expression?
Is “00001011” an expression?
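One way to see why the question is hard: the same eight symbols support several readings, and nothing in the symbols themselves says which one is intended. A small Python sketch follows; the three readings chosen are illustrative assumptions, not taken from the talk.

```python
bits = "00001011"

# Reading 1: an unsigned binary integer.
as_integer = int(bits, 2)

# Reading 2: one byte, decoded as an ASCII character (here a control character).
as_character = bytes([as_integer]).decode("ascii")

# Reading 3: simply a sequence of eight digit characters.
as_digits = list(bits)

print(as_integer)
print(repr(as_character))
print(len(as_digits))
```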
16. FRBR Refactored
[Diagram: a chain of four boxes linked by many-to-many (M:M) relationships: Story, Symbol Structure, Symbol Structure, Matter & Energy]
17. FRBR refactored and applied to datasets
[Diagram: a cascade of levels; all relationships are M:M]
C1: observations [Semantic level]
expressed by…
S1: RDF triples
encoded by…
S2: N3 statements
encoded by…
S3: Unicode characters
encoded by…
S4: UTF-8 bit streams
inscribed in…
M1: RAID array state [Instantiation level]
(S1 through S4 span the syntax and encoding levels.)
Based on the Systematic Assertion Model (SAM) for modeling datasets, developed by David Dubin et al.
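The cascade above can be made concrete with a small sketch. It assumes Python, a hypothetical observation, a simplified N3-style line, and an ad hoc JSON layout that is not a standard RDF serialization. It shows one direction of the many-to-many relationships: a single piece of content fans out into several symbol structures and then into several bitstreams.

```python
import json

# Content: one observation, held as a subject-predicate-object triple.
# All names and values are hypothetical.
triple = ("ex:sample_31", "ex:U22376", "408")


def as_n3_like(t):
    """Serialize the triple as a simplified N3-style statement."""
    s, p, o = t
    return f'{s} {p} "{o}" .'


def as_json(t):
    """Serialize the same triple in an ad hoc JSON layout."""
    s, p, o = t
    return json.dumps({"s": s, "p": p, "o": o}, sort_keys=True)


# Two character sequences expressing the same triple.
serializations = {"n3-like": as_n3_like(triple), "json": as_json(triple)}

# Each character sequence can be encoded as more than one bitstream.
bitstreams = {
    (syntax, encoding): text.encode(encoding)
    for syntax, text in serializations.items()
    for encoding in ("utf-8", "utf-16")
}

# One piece of content, several distinct bitstreams.
print(len(set(bitstreams.values())))
for key, data in sorted(bitstreams.items()):
    print(key, len(data), "bytes")
```

An identifier assigned to any one of these bitstreams does not by itself identify the triple, which is why the level an identifier targets has to be stated.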
18. Identifiers
What do we identify with identifiers?
An entity?
Content
Symbol structures
Patterned matter and energy
A nominalized relationship?
How do we confirm identification?
19. Identification
How do we identify an expression?
How do we identify an encoding?
How do we identify the data?
On the practical side we do this every day
On the theoretical side it is very difficult to usefully formalize.
20. Identity and change problems in Planets
From the Planets Conceptual Data Model, Sharpe et al. (2006)
21. Identity and change problems in Planets
• A file is a bitstream
• A file can be modified
• But a bitstream cannot be modified.
Credits to Dave Dubin, Simone Sacchi, Karen Wickett. Data Concepts Group, Data
Conservancy (NSF/OCI-ITR DataNet Award #0830976)
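One possible way out of the three statements above, sketched in Python: treat a file as something that has a bitstream at a given time, not as something that is a bitstream. The File class and the file name are hypothetical and are not part of the Planets model; Python's immutable bytes type stands in for a bitstream.

```python
class File:
    """A named, mutable container whose content is an immutable bitstream."""

    def __init__(self, name: str, content: bytes):
        self.name = name
        self.content = content  # bytes objects cannot be changed in place

    def append(self, more: bytes) -> None:
        # "Modifying" the file rebinds it to a different bitstream.
        self.content = self.content + more


original = b"\x00\x0b"
f = File("sample_31.dat", original)
f.append(b"\xff")

# The file now holds a different bitstream; the original one is unchanged.
print(f.content == original)
print(original)

try:
    original[0] = 1  # a bitstream itself cannot be modified
except TypeError as err:
    print("TypeError:", err)
```

Under this reading the file keeps its identity across the change while the bitstream does not, so the two cannot be the same entity.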
22. Center for Informatics Research
in Science and Scholarship (CIRSS)
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
Director: Carole L Palmer
Associate Director: Cathy Blake
c. 12 affiliated GSLIS faculty; 8 PhD students.
CIRSS research groups:
Data Practices: social science of information work
Socio-Technical Data Analytics: algorithms + people
*Data Concepts: modeling for integration/computation
Professional Education:
Data curation specialization within an ALA-accredited LIS program
Other options are being planned
23. CIRSS Data Concepts Group
Rationale
Integration and interoperability require robust formal conceptual
models for scientific data
Especially if semantic technologies are going to be exploited.
Our current models aren’t good enough
Mission
The Data Concepts group takes a logic-based approach to
solving conceptual modeling problems in scientific data curation
24. Questions?
This research is being carried out by the Data Concepts Group at the Center for
Informatics Research in Science and Scholarship (CIRSS) at the University of Illinois at
Urbana-Champaign,
Carole L. Palmer, Director.
Principal contributors include
David Dubin, Karen M. Wickett, Simone Sacchi, Richard Urban, Allen H Renear
NSF/OCI-ITR DataNet Award #0830976
IMLS/LB Award #RE-05-08-0062-08
Editor's notes
I’ll open with some cries from the heart; bear with me while this
You can find others
And more succinctly