Data Management: The Current Landscape
1. Data Management: The Current Landscape
Carly Strasser, California Digital Library, University of California Curation Center
2012 IASSIST Conference, June 2012
2.
3. Digital data
Images: From Flickr by DW0825; From Flickr by Flickmor; From Flickr by deltaMike; www.woodrow.org; C. Strasser; Courtesy of WHOI; From Flickr by US Army Environmental Command
5. Data; Models; Maximum Likelihood estimation; Matrix Models; Images; Tables; Paper
6. UGLY TRUTH
Most Earth | Environmental | Ecological scientists…
• are not taught data management
• don’t know what metadata are
• can’t name data centers or repositories
• don’t share data publicly or store it in an archive
• aren’t convinced they should share data
Image: 5shortessays.blogspot.com
7. Where data end up
Figure labels: www, Data, Metadata (recreated from Klump et al. 2006)
Images: From Flickr by diylibrarian; blog.order2disorder.com; From Flickr by csessums (×2)
8. Who cares?
Images: From Flickr by Redden-McAllister; From Flickr by AJC1; www.rba.gov.au
9. Where data end up
Figure labels: www, Data, www, Metadata (recreated from Klump et al. 2006)
Images: From Flickr by diylibrarian; From Flickr by torkildr
11. Trends in Data Archiving
Journal publishers
• Joint Data Archiving Agreement
• Data Papers
• Ecological Archives, Beyond the PDF, etc.
Funders
• Data management requirements
12. What is a data management plan?
A document that describes what you will do with your data during your research and after you complete your research
13. Why should a scientist prepare a DMP?
• Saves time
• Increases efficiency
• Easier to use data
• Others can understand & use data
• Credit for data products
• Funders require it
14. NSF DMP Requirements
From Grant Proposal Guidelines: DMP supplement may include:
1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
4. policies and provisions for re-use, re-distribution, and the production of derivatives
5. plans for archiving data, samples, and other research products, and for preservation of access to them
15. NSF’s Vision*
• DMPs and their evaluation will grow & change over time (similar to broader impacts)
• Peer review will determine next steps
• Community-driven guidelines
– Different disciplines have different definitions of acceptable data sharing
– Flexibility at the directorate and division levels
– Tailor implementation of DMP requirement
• Evaluation will vary with directorate, division, & program officer
*Unofficially
Help from Jennifer Schopf, NSF
17. DCXL (now called DataUp)
• Open source add-in & web application
• Facilitate data management, sharing, archiving for scientists
• Focus on atmospheric, ecological, hydrological, and oceanographic data
• Collecting requirements for add-in from scientists, data centers, libraries
Funders: Gordon and Betty Moore Foundation, Microsoft Research
18. www.dataone.org
• Data Education Tutorials
• Database of best practices & software tools
• Primer on data management
• Investigator Toolkit
now called DataUp
19. Data Management Best Practices
Carly Strasser, California Digital Library, University of California Curation Center
2012 IASSIST Conference, June 2012
20. Best Practices for Data Management
1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
23. 2. Data collection & organization
Create unique identifiers
• Decide on naming scheme early
• Create a key
• Different for each sample
Images: From Flickr by zebbie; From Flickr by sjbresnahan
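A minimal sketch of the idea above, assuming a hypothetical naming scheme of prefix, site code, and a three-digit counter (the slide does not prescribe any particular pattern). It builds a key that maps every sample identifier to its site and refuses to create the same identifier twice.

```python
def make_sample_ids(site_codes, samples_per_site, prefix="S"):
    """Build a key mapping each unique sample ID to its site code.

    The pattern prefix-site-counter is an assumed naming scheme,
    chosen only for illustration.
    """
    key = {}
    for site in site_codes:
        for n in range(1, samples_per_site + 1):
            sample_id = f"{prefix}-{site}-{n:03d}"
            if sample_id in key:
                # A repeated site code would otherwise silently reuse IDs.
                raise ValueError(f"duplicate identifier: {sample_id}")
            key[sample_id] = site
    return key


# Hypothetical site codes, two samples per site.
sample_key = make_sample_ids(["NAN", "VIC"], samples_per_site=2)
for sample_id, site in sample_key.items():
    print(sample_id, site)
```

Deciding on the pattern once, in one function, is what keeps identifiers consistent across a whole field season.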
24. 2. Data collection & organization
Standardize
• Consistent within columns
– only numbers, dates, or text
• Consistent names, codes, formats
Modified from K. Vanderbilt
Image: From Pink Floyd, The Wall (themurkyfringe.com)
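One way to check the "consistent within columns" rule is to classify every cell and flag columns that mix kinds. The sketch below is illustrative only: it assumes values arrive as strings, that dates are written as YYYY-MM-DD, and the column names and rows are made up.

```python
from datetime import datetime


def classify(value):
    """Label a cell as 'number', 'date' (assumed YYYY-MM-DD), or 'text'."""
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"


def mixed_columns(rows):
    """Return {column: sorted kinds} for columns holding more than one kind."""
    kinds = {}
    for row in rows:
        for name, value in row.items():
            kinds.setdefault(name, set()).add(classify(value))
    return {name: sorted(k) for name, k in kinds.items() if len(k) > 1}


# Hypothetical rows: the second one breaks the date and count columns.
rows = [
    {"site": "nanaimo", "date": "2010-06-01", "count": "12"},
    {"site": "nanaimo", "date": "June 2", "count": "n/a"},
]
problems = mixed_columns(rows)
print(problems)
```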
25. 2. Data collection & organization
Use descriptive file names
Image: PhDcomics.com
26. 2. Data collection & organization
Use descriptive file names*
• Unique
• Reflect contents
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: Eaffinis_nanaimo_2010_counts.xls (study organism, site name, year, what was measured)
*Not for everyone
From R Cook, ESA Best Practices Workshop 2010
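The "Better" pattern above (study organism, site name, year, what was measured) can be assembled by a small helper so every file follows the same order. This is a sketch; the function name and the choice to strip spaces and punctuation from each part are assumptions, not part of the original slide.

```python
import re


def descriptive_name(organism, site, year, measured, ext="csv"):
    """Join the four components with underscores, dropping spaces and punctuation."""
    parts = [organism, site, str(year), measured]
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", p) for p in parts]
    if not all(cleaned):
        raise ValueError("every name component must contain letters or digits")
    return "_".join(cleaned) + "." + ext


name = descriptive_name("Eaffinis", "nanaimo", 2010, "counts", ext="xls")
print(name)
```

Called with the components from the slide, the helper reproduces the "Better" file name shown there.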
27. 2. Data collection & organization
Preserve information
• Keep raw data raw
• Use scripts to process data & save them with data
Figure labels: Raw data as .csv; R script for processing & analysis
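The slide pairs a raw .csv file with an R script. The sketch below shows the same pattern in Python as a stand-in: the script only reads the raw file and writes its result to a separate derived file. File names, column names, and the cleaning rule (drop rows with an empty count) are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def process(raw_path, out_path):
    """Read the raw CSV, drop rows with an empty 'count', write a new derived file."""
    with open(raw_path, newline="") as src:
        rows = [r for r in csv.DictReader(src) if r["count"].strip()]
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["sample", "count"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Set up a throwaway raw file so the example is self-contained.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw_counts.csv"
raw.write_text("sample,count\nA,12\nB,\nC,7\n")
raw_before = raw.read_bytes()

kept = process(raw, workdir / "clean_counts.csv")
print(kept, "rows kept; raw file unchanged:", raw.read_bytes() == raw_before)
```

Because the raw file is never opened for writing, the processing can be rerun or revised at any time without losing the original observations.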
29. 3. Quality control and quality assurance
Before data collection
• Define & enforce standards
• Assign responsibility for data quality
Image: From Flickr by StacieBee
30. 3. Quality control and quality assurance
During data collection/entry
• Minimize manual entry
• Use double entry
• Use text-to-speech program to read data back
• Use a database
• Document changes
Image: From Flickr by schock
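Double entry means the same data sheet is typed in twice and the two versions are compared. A rough sketch of the comparison step follows; the column names and values are invented for illustration.

```python
def compare_entries(first, second):
    """Return (row, column, value1, value2) for every cell where two entries disagree."""
    if len(first) != len(second):
        raise ValueError("the two entries have different numbers of rows")
    mismatches = []
    for i, (a, b) in enumerate(zip(first, second), start=1):
        for col in a:
            if a[col] != b.get(col):
                mismatches.append((i, col, a[col], b.get(col)))
    return mismatches


# Hypothetical data sheet typed in by two people.
entry_1 = [{"sample": "A", "count": "12"}, {"sample": "B", "count": "70"}]
entry_2 = [{"sample": "A", "count": "12"}, {"sample": "B", "count": "7"}]

diffs = compare_entries(entry_1, entry_2)
for row, col, v1, v2 in diffs:
    print(f"row {row}, column {col}: {v1!r} vs {v2!r}")
```

Each reported mismatch is then resolved against the paper record rather than by guessing which entry is right.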
31. 3. Quality control and quality assurance
After data entry
• Check for missing, impossible, anomalous values
• Perform statistical summaries
• Look for outliers
• Normal probability plots
• Regression
• Scatter plots
• Maps
[Figure: example scatter plot]
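The checks listed above can be scripted so they run the same way every time. The sketch below assumes missing values are recorded as None, that plausible limits are known in advance, and that a simple z-score cutoff of 2 is an acceptable first screen for outliers; all three are illustrative choices, and the temperature values are made up.

```python
from statistics import mean, stdev


def qc_report(values, low, high, z_cut=2.0):
    """Report positions of missing, impossible, and outlying values."""
    missing = [i for i, v in enumerate(values) if v is None]
    present = [(i, v) for i, v in enumerate(values) if v is not None]
    impossible = [i for i, v in present if not (low <= v <= high)]
    plausible = [(i, v) for i, v in present if low <= v <= high]
    nums = [v for _, v in plausible]
    m, s = mean(nums), stdev(nums)
    # Flag plausible values more than z_cut standard deviations from the mean.
    outliers = [i for i, v in plausible if s > 0 and abs(v - m) / s > z_cut]
    return {"missing": missing, "impossible": impossible,
            "outliers": outliers, "mean": m, "sd": s}


# Hypothetical temperatures: one gap, one sentinel value, one suspicious spike.
temps = [12.1, 11.8, None, 12.4, -999.0, 12.0, 11.9, 12.2, 25.0, 12.3]
report = qc_report(temps, low=-2.0, high=40.0)
print(report["missing"], report["impossible"], report["outliers"])
```

A flagged value is a prompt to look at the record, not an instruction to delete it; plots such as the scatter plot on the slide remain the better tool for judging it.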
33. 4. Metadata
Metadata = Data reporting
• WHO created the data?
• WHAT is the content of the data set?
• WHEN was it created?
• WHERE was it collected?
• HOW was it developed?
• WHY was it developed?
Image: From Flickr by //ichael Patric|{
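The six questions above can double as a checklist for a machine-readable record. The sketch below is not a metadata standard; the field names simply mirror the questions, and every value in the record is a hypothetical placeholder.

```python
import json

# Field names mirror the six questions; this is not a formal standard.
REQUIRED = ["who", "what", "when", "where", "how", "why"]


def missing_fields(record):
    """Return required metadata fields that are absent or empty."""
    return [f for f in REQUIRED if not str(record.get(f, "")).strip()]


# Hypothetical record with the "why" question left unanswered.
record = {
    "who": "J. Doe (hypothetical collector)",
    "what": "Copepod counts per sample",
    "when": "2010-06-01/2010-08-31",
    "where": "Hypothetical estuary site",
    "how": "Plankton net tows; counts under a dissecting microscope",
    "why": "",
}

gaps = missing_fields(record)
metadata_json = json.dumps(record, indent=2, sort_keys=True)
print("unanswered:", gaps)
```

Saving the JSON text next to the data file keeps the answers with the data they describe.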
34. 4. Metadata
• Scientific context
– Scientific reason why the data were collected
– What data were collected
– What instruments (including model & serial number) were used
– Environmental conditions during collection
– Where collected & spatial resolution
– When collected & temporal resolution
– Standards or calibrations used
• Information about parameters
– How each was measured or produced
– Units of measure
– Format used in the data set
– Precision & accuracy if known
• Information about data
– Definitions of codes used
– Quality assurance & control measures
– Known problems that limit data use (e.g. uncertainty, sampling problems)
• Digital context
– Name of the data set
– The name(s) of the data file(s) in the data set
– Date the data set was last modified
– Example data file records for each data type file
– Pertinent companion files
– List of related or ancillary data sets
– Software (including version number) used to prepare/read the data set
– Data processing that was performed
• Personnel & stakeholders
– Who collected
– Who to contact with questions
– Funders
• How to cite the data set
35. 4. Metadata
What is metadata?
Select the appropriate metadata standard
• Provides structure to describe data
– Common terms | definitions | language | structure
• Lots of different standards
– EML, FGDC, ISO 19115, DarwinCore,…
• Tools for creating metadata files
– Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
37. 5. Workflows
Workflow: how you get from the raw data to the final products of your research
Simple workflows: flow charts
Temperature data, Salinity data → Data import into R → Data in R format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics; Graph production
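The flow chart above can also be written as a short script with one function per box. The slide's workflow runs in R; the sketch below uses Python as a stand-in, skips the graph step, and assumes plausible ranges for temperature and salinity that are not given in the original.

```python
from statistics import mean, stdev

# Assumed plausible ranges; the slide does not specify any limits.
LIMITS = {"temperature": (-2.0, 40.0), "salinity": (0.0, 42.0)}


def clean(values, low, high):
    """Quality control: drop missing readings and values outside [low, high]."""
    return [v for v in values if v is not None and low <= v <= high]


def summarize(values):
    """Analysis step: mean and sample standard deviation."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


def run_workflow(raw):
    """Raw data -> cleaned data -> summary statistics, one variable at a time."""
    cleaned = {name: clean(vals, *LIMITS[name]) for name, vals in raw.items()}
    return {name: summarize(vals) for name, vals in cleaned.items()}


# Hypothetical raw readings containing a gap and two impossible values.
raw = {
    "temperature": [12.0, 14.0, None, 99.0, 13.0],
    "salinity": [30.0, 32.0, -5.0, 31.0],
}
summary = run_workflow(raw)
for name, stats in summary.items():
    print(name, stats)
```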
38. 5. Workflows
Workflow: how you get from the raw data to the final products of your research
Simple workflows: commented scripts
• R, SAS, MATLAB
• Well-documented code is…
– Easier to review
– Easier to share
– Easier to repeat analysis
40. 5. Workflows
Workflows enable
• Reproducibility: can someone independently validate findings?
• Transparency: others can understand how you arrived at your results
• Executability: others can re-run or re-use your analysis
Image: From Flickr by merlinprincesse
42. 6. Data stewardship & reuse
The 20-Year Rule
The metadata accompanying a data set should be written for a user 20 years into the future (National Research Council 1991)
RULE: Document, Document, Document
Image: From Flickr by greensambaman
43. 6. Data stewardship & reuse
• Use stable formats: csv, txt, tiff
• Create back-up copies: original, near, far
• Periodically test back-ups
Modified from R. Cook
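One simple way to test a back-up is to compare checksums of the original and the copy. The sketch below uses SHA-256 from Python's standard library; the file names are hypothetical and the "near" copy lives in a temporary folder only so the example is self-contained.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_ok(original, copy):
    """A back-up passes if it exists and matches the original byte for byte."""
    return Path(copy).exists() and sha256(original) == sha256(copy)


workdir = Path(tempfile.mkdtemp())
original = workdir / "counts.csv"
original.write_text("sample,count\nA,12\n")
near = workdir / "near_counts.csv"
shutil.copy2(original, near)

intact = backup_ok(original, near)
near.write_text("sample,count\nA,21\n")  # simulate a corrupted copy
damaged = backup_ok(original, near)
print("intact copy passes:", intact, "| corrupted copy passes:", damaged)
```

Running a check like this on a schedule is one way to act on "periodically test back-ups".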
44. 6. Data stewardship & reuse
Store data in a repository
• Institutional archive
• Discipline/specialty archive
Image: From Flickr by torkildr
45. 6. Data stewardship & reuse
Data Citation
• Allows readers to find data products
• Get credit for data and publications
• Promotes reproducibility
• Better measure of research impact
Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
Learn more at www.datacite.org
Modified from R. Cook
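The example citation above follows an author, year, title, repository, DOI pattern. A small formatter can keep citations in that order; the function and its parameter names are hypothetical, and other repositories or journals may require a different layout.

```python
def format_citation(author, year, title, repository, doi):
    """Assemble a data citation in author-year-title-repository-DOI order."""
    title = title.rstrip(".")  # avoid a doubled period after the title
    return f"{author} {year}. {title}. {repository}. doi:{doi}"


citation = format_citation(
    author="Sidlauskas, B.",
    year=2007,
    title=("Data from: Testing for unequal rates of morphological "
           "diversification in the absence of a detailed phylogeny: "
           "a case study from characiform fishes."),
    repository="Dryad Digital Repository",
    doi="10.5061/dryad.20",
)
print(citation)
```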
46. Check out the blog: dcxl.cdlib.org, or my website: www.carlystrasser.net
Email me: carlystrasser@gmail.com
Tweet me: @carlystrasser | @dcxlCDL
DCXL on FB: DCXLatCDL