1. Introduction to Research
Data Management
Stuart Macdonald
EDINA & Data Library
stuart.macdonald@ed.ac.uk
RDM Training, School of Geosciences, 7 November 2012
2. • Background
• Data Library Services & Projects
• Research Data MANTRA
• What is RDM
– Research Data Defined
– Data Management Planning
– Organising Data
– File Formats & Transformations
– Documentation & Metadata
– Storage & Security
– Data protection & Rights
– Preservation & Sharing
3. Background
EDINA and University Data Library (EDL) together
are a division within Information Services of the
University of Edinburgh.
EDINA is a JISC-funded National Data Centre
providing national online resources for education
and research - url: http://edina.ac.uk
The Data Library assists Edinburgh University
users in the discovery, access, use and
management of research datasets - url:
http://www.ed.ac.uk/is/data-library
4. Data Library Services and Projects
• Data Library & Consultancy
• Edinburgh DataShare
• JISC-funded projects
– DISC-UK DataShare (2007-2009)
– Data Audit Framework
Implementation (2008)
– Research Data MANTRA (2010-
2011)
5. Data Library & Consultancy
• finding…
• accessing …
• using …
• teaching …
• managing
Building relationships with researchers via
PG teaching activities, research support projects,
IS Skills workshops, Research Data Management
training and through traditional reference
interviews.
6. Edinburgh DataShare:
url: http://datashare.is.ed.ac.uk/
An online institutional repository of multi-disciplinary
research datasets produced by University researchers,
hosted by the Data Library.
Researchers producing research data associated with a
publication, or which has re-use potential, can upload their
dataset for sharing and safekeeping. A persistent identifier
and suggested citation will be provided.
DataShare is a customised DSpace instance with a selection
of standards-compliant metadata fields to aid discovery
through Google and other search engines, and harvesting via OAI-PMH.
7. Edinburgh Data Audit Framework
(DAF) Implementation
(May – Dec 2008)
A JISC-funded pilot project produced 6 case studies from
research units across the University, identifying research
data assets and assessing their management using the DAF
methodology developed by the Digital Curation Centre.
4 main outcomes:
• Develop online RDM guidance
• Develop university research data management policy
• Develop services & support for RDM (in partnership with IS)
• Develop RDM training
9. University Research Data
Management Policy
In spring 2010, a review commenced at the University to
address the issue of managing the rapidly expanding volume
and complexity of data produced by Edinburgh researchers.
The Review was overseen by the IT & Library Committee and
had twin tracks to look at Data Storage, and Data
Management, Curation and Preservation.
The Review looked at current practice in the University, in
peer universities & internationally.
Championed by Vice-Principal & Chief Information Officer
Prof. Jeff Haywood, the policy for management of research
data was approved by the University Court on 16 May 2011.
It was one of the first RDM policies in a UK tertiary education
institution.
10. IS RDM Roadmap
Drivers: University research data management policy
and EPSRC request that all institutions in receipt of their
funding should develop a roadmap for research data
management (to be implemented by May 1st 2015).
Information Services (IS) has committed to an RDM
Roadmap over an 18 month period (July 2012-Jan. 2014)
across four strategic areas.
The Roadmap will help to engage academic units and
PIs in research data management and provide services
to implement the University’s RDM Policy.
The Roadmap is a cross-divisional goal of IS supported
by: DCC, EDINA & Data Library, User Services, Library
& Collections, IT Infrastructure.
13. Research Data MANTRA
Partnership between:
Edinburgh University Data Library
Institute for Academic
Development
Funded by JISC Managing Research
Data Programme (Sept. 2010 – Aug.
2011)
14. Why Manage
Research Data?
Data Deluge – exponential growth in the
volume of digital research artifacts created
within academia.
Data management is one of the essential
areas of responsible conduct of research.
15. Project Overview
Grounded in three disciplinary contexts: social science,
clinical psychology and geoscience.
Aim was to develop online interactive open learning
resources for PhD students and early career
researchers that will:
• Raise awareness of the key issues related to
research data management & contribute to
culture change.
• Provide guidelines for good practice.
Selling RDM as a Transferable Skill.
(voluntary participation)
16. Online Learning Module
Eight units with activities, scenarios and videos:
• Research data explained
• Data management plans
• Organising data
• File formats and transformation
• Documentation and metadata
• Storage and security
• Data protection, rights and access
• Preservation, sharing and licensing
Four data handling practicals: SPSS, NVivo, R, ArcGIS
Video stories from researchers in a variety of settings
Xerte Online Toolkits – University of Nottingham
17. MANTRA & Research Data Lifecycle
url: http://datalib.edina.ac.uk/mantra/index.html
18. Online Learning Module
• Delivered online – self-paced, available ‘anytime,
anyplace’
• Emphasis on practical experience and active
engagement via online activities
• One hour per unit
• Read and work through scenarios & activities
(incl. videos etc)
• CC licence to allow manipulation of content for
re-use with attribution
• Portable content in open standard formats (e.g.
SCORM)
19. MANTRA Dissemination
• Learning materials deposited with an open
licence in JorumOpen & Xpert.
• Learning materials to be embedded in three
participating postgraduate programmes and
made available through IAD programme for use
by all postgraduate students and early career
researchers.
• Website: http://datalib.edina.ac.uk/MANTRA
• Download/re-brand/re-purpose materials
from JorumOpen in standards-compliant
formats.
• Software modules – data handling practicals
(MS Word)
21. What is Research Data Management?
• An umbrella term to describe all aspects of
planning, organising, documenting, storing
and sharing research data.
• It also takes into account issues such as
documentation, data protection and
confidentiality.
• It provides a framework that supports
researchers and their data throughout the
course of their research and beyond.
22. Research Data Defined
US Office of Management and Budget in its grants management
circular A-110 defines research data as “the recorded factual
material commonly accepted in the scientific community as
necessary to validate research findings.”
The KRDS2 study (Beagrie et al, 2009) defines research data as
‘collections of structured digital data from any disciplines or
sources which can be used by academic researchers to
undertake their research or provides an evidential record of
their research.’
RIN Classification*:
• Observational – real-time, unique, usually irreplaceable
• Experimental – from lab equipment, expensive, often
reproducible
• Simulation – generated from models – model & metadata more
important than output data
• Derived or compiled – reproducible but expensive
• Reference - a (static or organic) collection of smaller (peer-
reviewed) datasets, most probably published and curated
* Research Information Network. “Stewardship of digital research data - principles and guidelines”, 30 March 2007. Viewed 30 October 2012
23. Research Data Defined
• Research data, unlike other information
types, is collected, observed, or created, for
purposes of analysis to produce original
research results.
• Research data can be generated for different
purposes and through different processes in
a multitude of digital formats.
24. Research data may include the
following:
• Documents (text, MS Word), spreadsheets
• Lab books, field notes, diaries
• Questionnaires, transcripts, codebooks
• Audiotapes, videotapes, photographs, images
• Slides, artefacts, specimens, samples
• Collection of digital objects acquired & generated during the research
process
• Database contents (video, audio, text, images)
• Models, algorithms, scripts
• Contents of an application (input, output, logfiles for analysis software,
schemas)
• Methodologies, workflows
• SOPs, protocols
25. By managing your data you will:
• ensure scientific integrity of research and aid replication
• ensure research data and records are accurate, complete, authentic
and reliable
• increase your research efficiency
• save time, effort and resources in the long run
• enhance data security and minimise the risk of data loss
• prevent duplication of effort by enabling others to use your data
• meet funding council grant requirements
Note:
It may also be important to manage research records (both digital &
hardcopy) during and beyond the life of the project e.g.
correspondence (emails); project files; grant applications; technical
reports; research reports; consent forms; ethics applications.
27. What Do Funders Want?
• timely release of data
- once patents are filed or on (acceptance for)
publication.
• open data sharing
- minimal or no restrictions if possible.
• preservation of data
- typically 5-10+ years if of long-term value.
See the RCUK Common Principles on data policy:
www.rcuk.ac.uk/research/Pages/DataPolicy.aspx
28. Data Management & Sharing Plans
Five common questions asked by funders are:
• What data will be created? (format, types, volumes etc)
• What standards and methodologies will you use?
• How will you manage ethics and Intellectual Property?
• What are the plans for data sharing and access?
• What is the strategy for long-term preservation?
DCC’s DMP Online tool: https://dmponline.dcc.ac.uk
How to write a DMP guide:
www.dcc.ac.uk/resources/how-guides/develop-data-plan
29. Data Management Plan. What is it?
A DMP is a document which describes:
What research data will be created.
What policies (funding, institutional, legal) apply to the data.
What data management practices (backups, storage, access
control, archiving) will be used.
What facilities and equipment will be required (hard-disk
space, backup server, repository).
Who will own the copyright and have access to the data.
Who will be responsible for each aspect of the plan.
How its reuse will be enabled and long-term preservation
ensured after the original research is completed.
The data management plan must be continuously maintained
and kept up-to-date throughout the course of research.
30. Why do we need one?
It improves your research both now and later...
•Data is often valuable for a long time!
•Results of your research may outlast your degree.
•Will you use your data throughout your career?
•Loss of physical/digital data and records.
•Loss of usefulness through records loss, media and
software obsolescence.
•Forgetting stuff!
Good practice → Better research
31. Why do we need one?
•Ensure research integrity (and repeatability) through keeping
better records.
•People can trace your outcomes from data collection,
through research methodology, through to results.
•Maximises usefulness of data to fellow researchers.
•Highlights how data was collected, quality controls, how
people can and should use it (access and licensing), how you
then attribute people/projects.
•Facilitates data use within collaboration.
•Can help lead to subsequent research papers.
32. Getting started with a DMP
Gain an understanding of terminology & issues.
Gain understanding of your project/community
– Supervisor and colleagues
– People in your School, i.e. IT Officers, Graduate Research
Coordinator...
Talk to your supervisor about data authorship, IP, licensing,
policies.
Use a research data planning checklist.
Keep it practical and simple; don't spend too much time on it. Where
you don't know something, leave a gap, investigate, and fill it in later.
Remember it is never finished! Review it regularly through the
course of your research.
33. Organising your data
•Research data files and folders need to be labelled and
organised in a systematic way so that they are both
identifiable and accessible for current and future users.
•Naming datasets according to agreed conventions should
make file naming easier for colleagues because they will not
have to ‘re-think’ the process each time.
•One benefit of consistent research data file labelling is that
files are not accidentally overwritten or deleted.
•It is important to consistently identify and distinguish
versions of data files. This ensures that a clear audit trail
exists for tracking the development of a data file and
identifying earlier versions when needed.
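A minimal sketch of how an agreed naming convention and version numbering might be enforced. The pattern (project, description, ISO date, two-digit version) and the function names are assumptions made for illustration, not a University standard.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<description>_<YYYY-MM-DD>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<description>[a-z0-9-]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)


def build_name(project, description, day, version, ext):
    """Return a file name that follows the convention, or raise ValueError."""
    name = f"{project}_{description}_{day.isoformat()}_v{version:02d}.{ext}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"name does not follow the convention: {name}")
    return name


def next_version(existing_names, project, description):
    """Return the next unused version number for one dataset."""
    versions = [
        int(match.group("version"))
        for match in map(NAME_PATTERN.match, existing_names)
        if match
        and match.group("project") == project
        and match.group("description") == description
    ]
    return max(versions, default=0) + 1
```

Saving each revision under the next version number, rather than reusing a name, is what keeps earlier versions from being overwritten.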
34. File Formats & Transformation
• A file format encodes information in a computer file, enabling
another program to access data within it
• HTML and PDF are two examples of commonly used file formats
and may be identified by their suffixes .html and .pdf.
• Files are based on either text or binary encoding. The former is
both machine- and human-readable and the latter only readable
by means of appropriate software.
• Thus text files are less likely to become obsolete. Examples of file
name extensions for these files are .txt, .csv and .por.
• If you convert or migrate your data files from one format to
another, be aware of the potential risk of the loss or corruption of
your data and take appropriate steps to avoid/minimise it.
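One way to reduce the risk of loss during conversion is to re-read the converted file and compare it with the original. The sketch below assumes a CSV-to-JSON conversion purely as an example; the function names are hypothetical and all values are kept as text.

```python
import csv
import json


def read_csv_rows(path):
    """Read a CSV file into a list of dicts (all values kept as text)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


def convert_csv_to_json(csv_path, json_path):
    """Convert CSV to JSON, then re-read the output to confirm nothing was lost."""
    original = read_csv_rows(csv_path)
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(original, handle, ensure_ascii=False, indent=2)
    with open(json_path, encoding="utf-8") as handle:
        converted = json.load(handle)
    if converted != original:
        raise ValueError("converted data differs from the original")
    return len(converted)  # number of rows carried across
```

The original file should be kept until the check has passed.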
35. File Formats & Transformation
•When compressing your data files for storage,
transportation or transmission, you encode the information
using fewer bits than the original representation. Commonly
used compression tools are Zip and gzip; Tar bundles files into
a single archive and is often combined with gzip.
•You may use the process of data normalisation. This means
converting data from one format (e.g. proprietary) into another
for use or preservation (e.g. ASCII).
•You may also need to compute new values from old in your
data, a process which is called data transformation.
•This may be necessary prior to analysing your data. Three
techniques for doing this are aggregation, anonymisation and
perturbation.
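The three techniques named above can be sketched as follows. This is an illustration only: the function names and record layout are assumptions, and dropping direct identifiers is the simplest form of anonymisation, which on its own may not prevent re-identification when data are linked with other sources.

```python
import random
from collections import defaultdict
from statistics import mean


def aggregate_mean(records, group_key, value_key):
    """Aggregation: replace individual values with one mean per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {group: mean(values) for group, values in groups.items()}


def drop_identifiers(records, identifier_keys):
    """Anonymisation (simplest form): remove directly identifying fields."""
    return [
        {key: value for key, value in record.items() if key not in identifier_keys}
        for record in records
    ]


def perturb(values, max_noise, seed=None):
    """Perturbation: add small random noise so exact values are masked."""
    rng = random.Random(seed)
    return [value + rng.uniform(-max_noise, max_noise) for value in values]
```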
36. Documenting Data
There are many reasons why you need to document
your data:
•To help you remember the details later
•To help others understand your research
•Verify your findings
•Review your submitted publication
•Replicate your results
•Archive your data for access and re-use
Some examples of data documentation are:
•Laboratory notebooks
•Field notes
•Questionnaires
37. Documenting Data
Laboratory or field notebooks, for example, play an
important role in supporting claims relating to
intellectual property developed by University
researchers, and even defending claims against
scientific fraud.
Research data need to be documented at various
levels:
•Project level
•File or database level
•Variable or item level
The term metadata (‘data about data’) is often used.
The importance of metadata lies in the potential for
machine-to-machine interoperability to assist location
and access to data through search interfaces.
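As an illustration of documentation at the three levels, the sketch below holds a machine-readable record and checks a data file's columns against it. The field names and example values are hypothetical and do not follow any particular metadata standard.

```python
import json

# Hypothetical documentation record covering the three levels named above.
record = {
    "project": {
        "title": "Example soil survey",
        "creator": "A. Researcher",
        "methodology": "Field sampling, described in protocol.txt",
    },
    "file": {
        "name": "soil_samples.csv",
        "format": "text/csv",
        "created": "2012-11-07",
    },
    "variables": [
        {"name": "site_id", "label": "Sampling site identifier", "units": None},
        {"name": "ph", "label": "Soil pH", "units": "pH"},
    ],
}


def undocumented_variables(metadata, column_names):
    """Return the columns in a data file that have no variable-level entry."""
    documented = {variable["name"] for variable in metadata["variables"]}
    return [name for name in column_names if name not in documented]


def to_json(metadata):
    """Serialise the record so that other software can read it."""
    return json.dumps(metadata, indent=2)
```

Storing such a record next to the data file lets both people and software find out what each variable means.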
38. Secure data storage:
For the purposes of integrity, efficiency and ease of replication it is
important that research data is stored securely & backed up regularly via:
• Networked drives
• Fileservers managed by department / school / IS.
• Stored in single, secure, accessible place – regular back-ups.
• Personal computers / laptops
• Convenient, temporary storage - should not be used for storing
master copies.
• Local drives may fail & laptops may get lost/stolen.
39. • External storage devices
• Hard drives, USB sticks, CDs, DVDs – low cost & portable BUT
not recommended for long term storage.
• Longevity not guaranteed – degradation over time.
• Easily damaged or misplaced.
• Not big enough for all research data – need for use of multiple
discs/drives.
• May pose a security threat.
If USB sticks, DVDs, CDs are used for working data or extra back-up
then:
• Choose high quality products from reputable manufacturers.
• Conduct regular checks to ensure media is not failing.
• Periodically refresh data (i.e. copy to a new disc or drive).
• Ensure confidential data is password protected / encrypted
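The regular checks on media can be partly automated by recording a checksum for every file and comparing later. A minimal sketch using SHA-256 from Python's standard library; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    root = Path(folder)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_problems(folder, manifest):
    """Return files that are missing or whose content has changed."""
    current = build_manifest(folder)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

Build the manifest when the data are first copied to the disc or drive, store it somewhere else, and run `find_problems` at each check.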
40. • Remote or online back-up services - services that provide
an online system for storing and backing-up computer files e.g.
Dropbox, Mozy, Humyo, A-Drive
• Allow users to store and sync data files online and between
computers.
• Employ cloud computing storage facilities (e.g. Amazon S3).
• Business model – first few GBs free, pay for more space.
41. Backing-up
Considerations for back-up policy:
• Will all data be backed up (full back-up), or only changed data
(incremental back-up)?
• How often will full and incremental back-ups be made?
• How much hard-drive space, or how many DVDs, will be required to
maintain this schedule?
• If working with sensitive data, how will it be secured (and destroyed)?
• What back-up services are available that meet these needs?
• Who will be responsible for ensuring back-ups are available?
Recommendation:
Keep at least 3 copies of your data (e.g. original, external/local, and
external/remote) and put in place a regular back-up procedure.
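The difference between a full and an incremental back-up can be sketched as below: the first run copies everything, and later runs copy only files that are new or changed. This is a simplified illustration, with an assumed change test based on file size and modification time. It overwrites the previous copy of a changed file rather than keeping a history.

```python
import shutil
from pathlib import Path


def incremental_backup(source, destination):
    """Copy only files that are new or changed since the last run."""
    source, destination = Path(source), Path(destination)
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        target = destination / path.relative_to(source)
        if target.exists():
            src, dst = path.stat(), target.stat()
            unchanged = (
                src.st_size == dst.st_size
                and int(src.st_mtime) <= int(dst.st_mtime)
            )
            if unchanged:
                continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also copies the modification time
        copied.append(str(path.relative_to(source)))
    return copied
```

Running the same function against a local drive and a remote location would give the second and third copies in the recommendation above.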
42. Data Security
The means of ensuring that data is kept safe from corruption and that
access to it is suitably controlled. It is important to consider data security
to prevent:
• Accidental or malicious damage / modification to data.
• Theft of valuable or irreplaceable data.
• Breach of confidentiality agreements and privacy laws.
• Release of data before it has been checked for accuracy and
authenticity.
43. Data Protection
• The 1998 Data Protection Act regulates how personal data may be
held and processed, and is aimed at organisations but also applies
to individuals.
• The Act recognises that personal data on its own or linked with
other data, can reveal the identity of an actual living person.
• You must comply with the Act from the moment you obtain
personal data until the time when the data have been returned,
destroyed, or perhaps transformed into a public use dataset for
purposes of sharing.
• Research exemption exists if you are able to process anonymised
data instead of personal data for your research by destroying the
“key” between the identifiers and the personally identifying
information.
• The Records Management Office has full guidance on its website.
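The "key" between identifiers and research data can be pictured as a separate lookup table. The sketch below is an assumption about how such a split might be made, with hypothetical field and function names. Whether the resulting dataset counts as anonymised under the Act depends on more than this step.

```python
import secrets


def split_identifiers(records, identifier_keys):
    """Separate identifying fields from research data, linked by a random code."""
    key_table = {}
    research_data = []
    for record in records:
        code = secrets.token_hex(8)  # random code, not derived from the person
        key_table[code] = {key: record[key] for key in identifier_keys}
        row = {"code": code}
        row.update(
            (key, value) for key, value in record.items()
            if key not in identifier_keys
        )
        research_data.append(row)
    return key_table, research_data
```

The key table would be stored separately and securely. Once it is destroyed, the codes in the research data can no longer be traced back through it.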
44. Rights and access
• Intellectual property rights (IPR) can be defined as rights
acquired over any work created or invented with the intellectual
effort of an individual.
• Facts are not copyrightable but the structure of a database could
be.
• As a researcher, you should clarify ownership of and rights
relating to research data before a project starts. This includes the
right of access and the right to make copies.
• Data licences determine the terms and conditions of use by
another, and may accompany a purchase or subscription.
• Open data licences attempt to “set data free” by minimising and
standardising the terms and conditions of re-use. Conditions may
include attribution, non-commercial use, no derivative works, or
‘share alike’.
45. Benefits of Sharing Data
• Scientific integrity – publishing & citing data in published
research papers can allow others to replicate, validate, or
correct results, thus improving the scientific record.
• Publicly funded research - there is a growing movement for
making publicly funded research available to the public.
• Funding mandates - UK research councils are increasingly
mandating data sharing so as to avoid duplication of effort and
save costs.
• University of Edinburgh’s mission - "the creation, dissemination
and curation of knowledge" implies transparency about the
research that is conducted in its name.
• Preserve research data for researchers’ own future use.
46. THANK YOU!
Data Library services:
http://www.ed.ac.uk/is/data-library
EDINA:
http://edina.ac.uk/
Research data management guidance pages:
http://www.ed.ac.uk/is/research-data-management
Edinburgh University data policy:
http://www.ed.ac.uk/is/research-data-policy
Edinburgh Data Audit Framework (DAF) Implementation:
http://ie-repository.jisc.ac.uk/283/
Research data MANTRA course:
http://datalib.edina.ac.uk/mantra
47. Scenarios for Discussion
At completion of a research project the data and
records are boxed and stored in a departmental
storeroom. A participant in a research project lodges a
claim for compensation, alleging that he was not
adequately informed about the effects of the study and
does not recall giving consent. He finds that the
storeroom has since been converted into a coffee shop.
Where are the records?
48. Scenarios for Discussion
Sometime after completion of a research project the
researcher wishes to revisit her findings, applying a new
statistical approach. She manages to read the floppy discs
that the data were stored on, eventually gets the old
software format imported into her current statistical
package, only to find she cannot remember what many of the
variable labels, each 8 digits in length, actually mean. Has
she documented her data?
You publish a paper based on your thesis and are surprised
to find it has become a hot topic in your field. Suddenly
people are writing to you asking for the underlying data. How
much effort is required to give them a well-cleaned dataset
and adequate documentation for re-use?
Editor's Notes
25 years ago disk storage was expensive. Researchers interested in working with data came together to petition the PLU and the University’s Library, wanting a university-wide provision for files that were too large to be stored on individual computing accounts. Early holdings were research data from the universities of Edinburgh, Glasgow and Strathclyde.
Primarily social sciences but not exclusively so: large-scale government surveys (micro data), macro-economic time series data (country-level data), election studies, geospatial data, financial datasets, population census data.
Free on internet / subscription / through national data centres/archives / resource discovery portals.
Registration / authorisation and authentication / special conditions / budget to pay for data.
SPSS, STATS, SAS, R, ArcGIS – interpret documentation/codebooks, merge and match users' data with other data (via look-up tables), subset data.
Data Catalogue.
Training for postgraduates and early career researchers.
These were the School of Divinity; the School of History, Classics and Archaeology; the School of Biomedical Sciences; the School of Molecular and Clinical Medicine; and the School of Physics and Astronomy. Also, the School of Geosciences.
Digital Curation Centre, Data Library, Information Services Infrastructure, Research Computing, Library & Collections.
Concern is both for the shorter term – ensuring competitive advantage through secure and easy-to-use access, and for the longer term – ensuring enduring access and usability to the research community into the future and compliance with legislation.
2 working groups: RDS working group and RDM working group.
Funded by JISC as part of its UK programme, Managing Research Data, to develop online learning materials to help researchers manage their digital assets.
IAD – set up to deliver training and development for postgraduate students and staff – via online course, Virtual Learning Environments, transferable skills training.
A set of Multi- or Cross-Disciplinary online learning resources FRUIT principles – Fun Relevant Useful Interesting Timely
Shareable Content Object Reference Model – XML-based
JorumOpen - national OER repository
What about preserving?
Observational – sensor data, survey or sample data, neuroimages – e.g. ocean temperature, voters' attitudes before an election, photographs of a supernova
Experimental – e.g. gene sequences, chromatograms, toroid magnetic field data, HPLC, gel electrophoresis, chemical reaction rates
Simulation – e.g. climate models, economic models, algorithms
Derived – e.g. text and data mining, compiled database, 3D models, maps
Reference - e.g. gene sequence databanks, chemical structures, spatial data portals
BioData Blog: “Documenting data may seem like a tedious, wasteful step, but each researcher must think of its long-term benefits” - methodologies, workflows, procedures, recording conditions etc