The document provides an introduction to data management for librarians, outlining key concepts such as the research data lifecycle, challenges in managing digital data over time, best practices for organizing, documenting, and storing data, and resources for data management support. Common problems include difficulty locating, accessing, and understanding data in the long run without proper planning and preservation strategies. The role of librarians is to educate researchers on best practices and provide support and training resources.
Data Management for Librarians: An Introduction
1. Data Management for Librarians: An Introduction
February 19th 2013
Gareth Knight
Manager
RDM Support Service
2. What is Data?
“Data are facts, observations or experiences on which an argument, theory or
test is based. Data may be numerical, descriptive or visual. Data may be raw or
analysed, experimental or observational.”
http://research.unimelb.edu.au/integrity/conduct/data/review
May originate from various sources:
Primary and/or secondary
May contain different content:
Quantitative and/or qualitative
May be expressed in different forms:
Datasets, still images, audio‐video, audio recordings, interactive resources
May be held in a number of variations:
Raw, cleaned, anonymised/pseudonymised, analysed
May be encoded in different formats:
MS Excel, TIFF, MPEG2, STATA, FoxPro
What type of data do you have at home?
3. Data in the Research Lifecycle
Stages of the research cycle (diagram): Brainstorm; Develop Proposal; Plan Project; Perform Research; Write-up Results; Finalise & submit
4. Data in the Research Lifecycle
Stages of the research cycle (diagram): Brainstorm; Develop Proposal; Plan Project; Perform Research; Write-up Results; Finalise & submit
Added at the Develop Proposal stage: Produce Data Management Plan
5. Data in the Research Lifecycle
Stages of the research cycle (diagram): Brainstorm; Develop Proposal; Plan Project; Perform Research; Write-up Results; Finalise & submit
Data activities within Perform Research (inner cycle): Create / Reuse; Analyse; Store; Describe; Share
6. Data in the Research Lifecycle
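Stages of the research cycle (diagram): Brainstorm; Develop Proposal; Plan Project; Perform Research; Write-up Results; Finalise & submit
Data activities within Perform Research (inner cycle): Create / Reuse; Analyse; Store; Describe; Share
Added at the Finalise & submit stage: Share; Archive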
7. What is Data Management?
1. Plan
• Determine requirements
• Identify risks & opportunities
• Decide approach
2. Implement
3. Monitor
• Evaluate approach
• Change approach/perform corrective action
4. Evaluate
• Is it fit for purpose?
• What additional action is needed?
‘Benign neglect’ and poorly made decisions in the short term will have long-term implications
8. Short-term decisions with long-term implications
• Software products
• File formats & standards
• Data organisation & labelling
• Quality controls
9. Why does data need to be managed?
• Ensure data can be located
• Enable analysis
• Ability to understand for current and future need
• Enable sharing & validation
“Interesting paper. Where’s the data?”
10. Why does data need to be managed?
• Ensure data can be located
• Enable analysis
• Comply with Funder & School requirements
• Ability to understand for current and future need
• Enable sharing & validation
“Interesting paper. Where’s the data?”
11. Researcher Challenges
Issues/challenges encountered when creating, managing, and sharing research data (web survey results)
Response type: multiple-choice checkbox + free text for other challenges
Other challenges
• Database creation & management
• Storage of physical questionnaires
• Lack of time
• Software instability (particularly NVivo)
• Ability to enter & access data at different locations
12. Training Needs
Interest in training on topics related to data management (web survey results)
Note: Graph omits percentages for other responses (none, slight, moderate, no opinion)
14. RDM Support Service
Role of Library staff
• Provide first point of contact
• Help researchers to express requirements & needs
• Direct to potential solution (staff, website)
• Contribute to training activities
• Incorporate data considerations into teaching
Location of Library staff
15. Data Access Over Time
digital vs. analogue
“traditionally, preserving things meant keeping them unchanged;
however … if we hold on to digital information without
modifications, accessing the information will become increasingly
more difficult, if not impossible.”
Su‐Shing Chen, 2001
data + computer + OS + application = information content
16. Change in Process over Time
Example environments: Intel PC (2000); Mac laptop (2006); x64 Ubuntu laptop (2010)
Layers in each environment: hardware; operating system; software application; information content
17. Change in Process over Time
Example environments: Intel PC (2000); Mac laptop (2006); x64 Ubuntu laptop (2010)
Layers in each environment: hardware; operating system; software application; information content
18. Task
• Select two of the following problems when managing digital data:
1. Difficulty locating data
2. Difficulty accessing media
3. Difficulty rendering data in an understandable form
4. Difficulty recreating data as originally intended
5. Difficulty understanding information content
6. Uncertain provenance
Consider the following questions:
a. In what circumstances will the chosen problem occur?
b. What consequences may follow if the problem occurs (e.g. financial implications)?
c. How could you ensure that the problem doesn’t occur?
d. What could you do to resolve the problem after it has occurred? (You can direct the researcher to someone for help.)
19. 1. Difficulty Locating Data
Problem
“I created some data 5 years ago. Where is it?”
“I’ve lost my original disk. Do I have the data elsewhere?”
Scenarios & Reasons
Loss of storage media
Lots of data stored in many locations
Vague filenames make it difficult to locate
(Potential) Solutions
Preventative:
• Copy data to several storage devices – increases the likelihood of finding it
Post event:
• Find better discovery software?
• Attempt to recreate content?
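As a rough illustration of the "discovery" idea above, the sketch below searches several storage locations for files whose names contain a remembered fragment. It is only a minimal example using the Python standard library; the function name `find_files` and the folder names in the usage note are hypothetical, not part of any tool mentioned in the slides.

```python
from pathlib import Path


def find_files(roots: list[Path], name_fragment: str) -> list[Path]:
    """Return files under any root whose name contains the fragment.

    The comparison ignores case, because old files are often named
    inconsistently (e.g. 'Interview_Notes.txt' vs 'interview notes.txt').
    """
    fragment = name_fragment.lower()
    matches = []
    for root in roots:
        # rglob("*") walks the folder and all of its sub-folders.
        for path in root.rglob("*"):
            if path.is_file() and fragment in path.name.lower():
                matches.append(path)
    return sorted(matches)
```

For example, `find_files([Path("D:/backup"), Path("E:/usb")], "interview")` would list every matching file on both (assumed) drives. A search like this only helps when filenames are meaningful, which is why the preventative steps matter more.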
20. 2. Difficulty Accessing Media
Problem
“How do I access this old media?”
“Why can’t I read this disk?”
Scenarios & Reasons
Media obsolescence
Physical deterioration & failure
(Potential) Solutions
Preventative:
• Copy data to several storage devices
• Transfer data to new storage media on obsolescence / every 3 years
• Deposit data into a data archive and/or copy to server
Post event:
• Data recovery software
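The preventative step "copy data to several storage devices" can be sketched as follows. This is a minimal example, assuming each storage device is mounted as an ordinary folder; the function names `sha256_of` and `copy_and_verify` are illustrative. It also compares checksums after copying, which is an extra check not stated on the slide, to confirm each copy is identical to the original.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, destinations: list[Path]) -> dict[Path, bool]:
    """Copy source into each destination folder and check each copy.

    Returns a mapping of copied file path -> True if its checksum
    matches the original, False otherwise.
    """
    expected = sha256_of(source)
    results = {}
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / source.name
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        results[target] = sha256_of(target) == expected
    return results
```

Re-running the checksum comparison at intervals (for instance when media are refreshed) is one way to notice deterioration before the data become unreadable.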
21. Potential Storage Locations
Local machine & Storage
Pros: Cheap, high capacity storage, fast access
Cons: Lack of support; potential for theft, loss, or damage
Academic Storage Systems (Recommended)
Pros: Automatic monitoring & backup, multiple redundancy, remote access, secure (if required)
Cons: Limited space allocation, not always accessible overseas
Third party service providers
Pros: Automated backup, accessible in different countries (usually)
Cons: Security concerns, ownership concerns, services can close account at any time
http://www.flickr.com/photos/m0n0/4479450696/
22. 3. Difficulty Rendering Data
Problem
“How can I view data?”
“Where do I find software to access my data?”
Scenarios & Reasons
Software obsolescence
New software uses a different decoding method
(Potential) Solutions
Preventative:
• Transform data to new formats (format conversion strategy)
• Maintain original machine and software to access content (computer museum)
Post event:
• Track down original software product
• Emulate original environment (emulation/virtualisation)
23. Choosing File Formats
Creation Preservation Dissemination
Content Type Preferred Format Acceptable Alternatives
Documents Rich Text Format Microsoft DocX
Open Document Format
Still Images TIFF PNG,
JPEG 2000 (uncompressed) RAW
Audio Wav format MP3
AIFF
FLAC
AudioVideo MPEG2,
MPEG4
When working with multiple copies, decide which is the master copy
24. 4. Difficulty Maintaining Authenticity
Problem
“Why does my data look different?”
Scenarios & Reasons
New version of the software application uses a different decoding method
Different software application in use
(Potential) Solutions
Preventative:
• Determine significant properties that should be maintained
• Maintain original machine and software to access content (computer museum)
Post event:
• Emulate original environment (emulation/virtualisation)
25. 5. Difficulty Understanding Content
Problem
“Where was this information created?”
“Why did the creator make this decision?”
“What does this value mean?”
“How does this data relate to other content?”
Scenarios & Reasons
Memory fails – cannot remember decisions made
Disorganised and poorly labelled data
Lack of documentation
(Potential) Solutions
• Organise data (chronology, experiment type, location, content type)
• Adopt labelling conventions
• Documentation
Does a Rosetta stone exist for your data?
26. Filename conventions
• Consider the elements that will help you to organise and locate content
– E.g. Participant ID, site of data collection, date of data collection
• Consider how data files and directories may be organised & sorted
– 001, 002, 003, 004, can be used for sequential files
– YYYY‐MM‐DD (2012‐12‐04) useful for organising by date (use year first)
• Identify different versions of content in filename (and in content)
– Creation date (YY‐MM‐DD)
– Version/draft number
• Consider how your filenames will look to others
– Avoid spaces ‐ ‘My file.pdf’ becomes ‘My%20file.pdf’ on the web
– Avoid capitalisation ‐ Alters file sorting & CAUSES HEADACHES!
Golden Rule: Be Consistent
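One way to "be consistent" is to generate filenames from their elements rather than typing them by hand. The sketch below is only an illustration of the conventions above; the function name `build_filename`, the `p` prefix, and the element order are assumptions, not a prescribed standard.

```python
import re
from datetime import date


def build_filename(participant_id: int, site: str, collected: date,
                   version: int, extension: str) -> str:
    """Build a name such as 'p007_london_2012-12-04_v02.csv'."""
    # Lower-case the site and replace anything that is not a letter or
    # digit with a hyphen, so the name contains no spaces or capitals.
    safe_site = re.sub(r"[^a-z0-9]+", "-", site.lower()).strip("-")
    # Zero-padded numbers (007, v02) and a year-first date keep files
    # in the expected order when sorted alphabetically.
    return (f"p{participant_id:03d}_{safe_site}_"
            f"{collected.isoformat()}_v{version:02d}.{extension.lower()}")
```

Calling `build_filename(7, "London", date(2012, 12, 4), 2, "CSV")` gives `p007_london_2012-12-04_v02.csv`.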
27. Data Documentation
What would someone want to know if they were looking at your data for the first time?
1. What is the context of creation?
• Why did you create it? For what purpose?
• What methodology did you use? What assumptions were made?
• Who is the target audience?
2. Collection and set of files:
• What information does each file contain?
• When was it created?
• By whom?
• What actions were performed?
• How does the data contained in the collection relate to each other?
3. Individual components
• What is the meaning of this word/column/row, etc.?
• How are these items measured?
• What are the boundaries of the measurement?
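The three levels of documentation above (context, files, individual components) could be captured in a plain-text README stored alongside the data. The sketch below is a minimal example under that assumption; the function name `render_readme`, the section headings, and the sample values in the usage note are all hypothetical.

```python
def render_readme(title: str, purpose: str, methodology: str,
                  files: dict[str, str], variables: dict[str, str]) -> str:
    """Return README text covering context, files, and variables."""
    # Level 1: context of creation.
    lines = [title, "", "Context",
             f"  Purpose: {purpose}",
             f"  Methodology: {methodology}",
             "", "Files"]
    # Level 2: what each file in the collection contains.
    lines += [f"  {name}: {description}" for name, description in files.items()]
    # Level 3: meaning, units, and bounds of individual columns.
    lines += ["", "Variables"]
    lines += [f"  {name}: {meaning}" for name, meaning in variables.items()]
    return "\n".join(lines) + "\n"
```

For instance, `render_readme("Clinic survey", "Pilot study", "Paper questionnaire", {"responses.csv": "One row per participant"}, {"age": "Age in years, 18 to 99"})` returns text that can be saved as `README.txt` in the same folder as the data.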
28. 6. Uncertain Provenance
Problem
1. “When was the data created and/or modified?”
2. “Who created/modified the data?”
3. “Why was it created and/or modified?”
Scenarios & Reasons
• Lack/Loss of trust in information content
• Reluctance to use information content
(Potential) Solutions
Preventative:
• Limit update to authorised users only
• Store change history
• Keep each version
Post event:
• Locate data creator & editor?
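"Store change history" can be as simple as an append-only log that answers the three questions above: when, who, and why. The sketch below assumes a JSON-lines log file kept next to the data; the function names and field names are illustrative, and it records changes only (it does not keep a copy of each version).

```python
import getpass
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def record_change(data_file: Path, log_file: Path, reason: str,
                  user: Optional[str] = None) -> dict:
    """Append one provenance entry (who, when, why, checksum) to the log."""
    entry = {
        "file": data_file.name,
        # The checksum identifies exactly which version was logged.
        "sha256": hashlib.sha256(data_file.read_bytes()).hexdigest(),
        "modified_by": user or getpass.getuser(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }
    # Mode "a" appends, so earlier entries are never overwritten.
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_history(log_file: Path) -> list[dict]:
    """Return all logged entries, oldest first."""
    with log_file.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A version control system or a repository with built-in audit trails would serve the same purpose; the point of the sketch is only that each change leaves a dated, attributed record.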
29. Things to Recommend
Advise researchers to:
1. Choose an appropriate storage location and create backups
2. Organise data in a consistent and logical manner
3. Document the data and information content (as well as structure)
4. Consider how to ensure that information can be accessed in the long term
5. Consider the potential for data sharing and ensure it is performed with due regard to ethics
30. A Few Good References
• Digital Curation Centre
http://www.dcc.ac.uk/resources
• MANTRA – Data Management training for PhD students
http://datalib.edina.ac.uk/mantra/
• UK Data Archive – Managing and Sharing Data
http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
• Cambridge University – RDM Guidance
http://www.lib.cam.ac.uk/dataman/index.html
• Australian National Data Service
http://ands.org.au/resource/data-management-planning.html
• LSHTM Research Data Management Support Service
http://blogs.lshtm.ac.uk/rdmss/