The document provides an overview of the CISER Data Archive at Cornell University and introduces key concepts of research data management (RDM).
The CISER Data Archive is a collection of over 27,000 numeric datasets to support quantitative research in various social science fields. It provides consulting services to help users find, access, and use data. It also maintains the Cornell research data repository.
The document defines research data and outlines the research data lifecycle. It discusses best practices for organizing, documenting, storing, and securing research data. Key aspects of RDM include developing data management plans, using appropriate file formats, and ensuring long-term preservation and sharing of research data.
1. CISER Data Archive &
Introduction to RDM
Stuart Macdonald
CISER Data Services Librarian
srm262@cornell.edu
Research Design CRP-7201, Stone Laboratory, Cornell Univ. 19 March 2014
2. • CISER Data Archive
• What is Research Data Management (RDM)
• Research Data Defined
• Data Management Planning
• Organising Data
• File Formats & Transformations
• Documentation & Metadata
• Storage & Security
• Data protection & Rights
• Preservation & Sharing
• Research Data MANTRA
3. CISER Data Archive: Collection and Services
Established over 30 years ago
Collection of numeric datasets to support quantitative
research
c. 27,000 online files in addition to thousands of studies on CD/DVD
Emphasis on demography (state/federal censuses),
economics, health, labor, election studies, attitudinal and
behavioral studies, family life etc.
4. • Consulting services to match user needs with appropriate data
and statistical analysis software
•finding, accessing and using data
• Current Cornell researchers can download archive files from
online catalog (search & browse) in formats conversant with
statistical software
• Data files are identified by a ‘traffic light’ icon that indicates
usage level:
• Green – downloadable by anyone
• Yellow – downloadable from links in the catalog with CUWebAuth
authentication (for use within the CISER research computing
environment - CISERRSCH) – Cornell researchers can apply for a
computing account
• Red – data to be used under restriction (via CRADC or conditions
imposed by data provider)
6. CISER Data Archive maintains links to a range of social science
data resources including:
•Data Distributors and Producers: U.S. Government e.g. Dept. Agriculture,
Dept. Commerce, Dept. Energy, Dept. Justice, Dept. Labor, Federal Agencies
•Data Distributors and Producers: Other U.S. Sources
•Data Distributors and Producers: International e.g. Eurostat, FAOSTAT, ILO,
OECD, UN Statistics Division, World Bank
•Data Libraries and Archives e.g. Harvard-MIT Data Center, UKDA, DANS, CESSDA
•Social Science Research Institutes e.g. Odum Institute, Survey Research
Institute
•Online Reference Tools e.g. Boundary files, geocoding tools, SIC codes, data
citation tools
•State and Local Government data and statistical sources e.g. NY State
Depts. Education, Health, Labor, State Data Center
See URL: http://ciser.cornell.edu/ASPs/datasource.asp
7. • Provides Cornell social science researchers with a
repository for sharing and providing long-term preservation
of their numeric/statistical research data
• Participates in Cornell’s Research Data Management
Service Group
• Assists Cornell social science researchers with Research
Data Management (RDM) plans
• Provides Cornell social science researchers with support
and expertise in obtaining and using restricted data
8. Other social science research data resources:
• Inter-University Consortium for Political and Social Research
(ICPSR)
• National Archive of Criminal Justice Data
• Minority Data Resource Center
• National Archive of Computerized Data on Aging
• Roper Center for Public Opinion Archives
• International Data Archives
• CESSDA, UKDA, Eurostat
• CESSDA catalog (DDI) provides a multi-lingual interface to datasets from
member social science data archives across Europe
• Non-Governmental Organizations
• National / Governmental Statistical Agencies
9. URLs:
• CISER Data Archive Catalog:
http://ciser.cornell.edu/ASPs/search.asp
• ICPSR:
www.icpsr.umich.edu/
• Roper Center for Public Opinion Research:
http://www.ropercenter.uconn.edu/
• CESSDA:
http://www.cessda.org/
• Eurostat:
http://www.epp.eurostat.ec.europa.eu/
10. Location & hours:
CISER Data Archive is located at 391 Pine Tree Road,
Ithaca
CISER is open 8.30am – 4.30pm (Mon-Fri). Walk-in
assistance is not always available, so appointments are
recommended
Contacts:
Tel.: (607) 255 4801
Email: ciser@cornell.edu
12. Why Manage Research Data?
Current research data management initiatives are based
on three trends:
The data deluge – exponential growth in volume of digital
research artifacts created within academia (often
created by publicly funded research)
Data management is required by multiple disciplines
Increasing perception of the value of data (data as
commodity)
13. What is Research Data Management?
• RDM is an umbrella term to describe all aspects
of planning, organising, documenting, storing and
sharing research data.
• It also takes into account issues such as
documentation, data protection and
confidentiality.
• It provides a framework that supports researchers
and their data throughout the course of their
research and beyond.
• It is one of the essential areas of responsible
conduct of research
15. Research Data Defined
US Office of Management and Budget in its grants management circular A-110
defines research data as “the recorded factual material commonly accepted in
the scientific community as necessary to validate research findings.”
The KRDS2 study (Beagrie et al, 2009) defines research data as ‘collections of
structured digital data from any disciplines or sources which can be used by
academic researchers to undertake their research or provides an evidential
record of their research.’
RIN Classification*
• Observational – real-time, unique, usually irreplaceable
• Experimental – from lab equipment, expensive, often reproducible
• Simulation – generated from models – model & metadata are as important as
output data
• Derived – resulting from processing or combining “raw” data; reproducible
but expensive
• Reference - a (static or organic) collection of smaller (peer-reviewed)
datasets, probably published and curated
* Stewardship of digital research data: a framework of principles and guidelines, Research Information Network, 2008. URL: http://tinyurl.com/l56gftx
16. Research Data Defined
• Research data, unlike other information types, is
collected, observed, or created, for purposes of
analysis to produce original research results.
• Research data can be generated for different
purposes and through different processes in a
multitude of digital formats.
17. Research data comes in many varied formats:
Text - Flat text files, Word, Portable Document Format (PDF), Rich
Text Format (RTF), Extensible Markup Language (XML).
Numerical - SPSS, Stata, Excel.
Multimedia - jpeg, tiff, dicom, mpeg, quicktime.
Models - 3D, statistical.
Software - Java, C.
Discipline specific - Flexible Image Transport System (FITS) in
astronomy, Crystallographic Information File (CIF) in chemistry.
Instrument specific - Olympus Confocal Microscope Data Format, Carl
Zeiss Digital Microscopic Image Format (ZVI).
18. Research data may include the
following:
• Documents (text, MS Word), spreadsheets
• Lab books, field notes, diaries
• Questionnaires, transcripts, codebooks
• Audiotapes, videotapes, photographs, images
• Slides, artefacts, specimens, samples
• Collection of digital objects acquired & generated during the research
process
• Database contents (video, audio, text, images)
• Models, algorithms, scripts
• Contents of an application (input, output, logfiles for analysis software,
schemas)
• Methodologies, workflows
• SOPs, protocols
19. By managing your data you will:
• ensure scientific integrity of research and aid replication
• ensure research data and records are accurate, complete, authentic
and reliable
• increase your research efficiency
• save time, effort and resources in the long run
• enhance data security and minimise the risk of data loss
• prevent duplication of effort by enabling others to use your data
• meet funding grant requirements
Note:
It may also be important to manage research records (both digital &
hardcopy) during and beyond the life of the project such as:
correspondence (emails)
grant applications
technical reports
research reports
consent forms
ethics applications
20. What Do Funders Want?
• timely release of data
- once patents are filed or on (acceptance for)
publication.
• data shared openly
- minimal or no restrictions if possible.
• preservation of data
- typically 5-10+ years if of long-term value.
• data management plans
See:
NIH Data Sharing Policy: https://grants.nih.gov/grants/policy/data_sharing/
NSF Data Sharing Policy: http://www.nsf.gov/bfa/dias/policy/dmp.jsp
21. Data Management Plan. What is it?
Funding bodies require researchers to supply detailed, cost-
effective plans for managing research data. These are called Data
Management Plans
A DMP is a document which describes:
What research data will be created.
What policies (funding, institutional, legal) apply to the data.
What data management practices (backups, storage, access
control, archiving) will be used.
What facilities and equipment are required (hard-disk space,
backup server, repository).
Who will own the copyright and have access to the data.
How long-term preservation will be ensured after the original
research is completed.
The data management plan must be continuously maintained and
kept up-to-date throughout the course of research.
22. Why do we need one?
It improves your research both now and later...
•Data is often valuable for a long time!
•Results of your research may outlast your project.
•Will you use your data throughout your career?
•Prevents loss of digital data and records.
•Prevents loss of usefulness through media and software
obsolescence.
•Prevents forgetting stuff!
Good practice → Better research
23. Why do we need one?
•Ensure research integrity (and repeatability) through
keeping better records.
•People can trace your outcomes from data collection,
through research methodology, through to results.
•Maximises usefulness of data to fellow researchers.
•Highlights how data was collected, quality controls,
how people can and should use it (access and
licensing).
•Facilitates data use within collaboration.
•Can help lead to subsequent research papers.
24. Getting started with a DMP
Gain an understanding of terminology & issues.
Gain understanding of your project/community
– Supervisor and colleagues
– People in your School, i.e. IT Officers, Research
Coordinator/Administrator
Talk to your supervisor about data authorship, IP, licensing,
policies.
Keep it practical and simple; don't spend too much time on it. Leave gaps
for what you don't know, investigate, and fill them in later.
Remember it is never finished! Review it regularly through the
course of your research.
CDL’s DMP Tool: https://dmp.cdlib.org/
Cornell University RDM Services Group - Writing a DMP:
https://confluence.cornell.edu/display/rdmsgweb/data-management-planning-overview
26. Benefits of organising your data
Research data files and folders need to be labelled and
organised in a systematic way so that:
•Data files are not accidentally overwritten or deleted
•Data files are distinguishable from each other within their
containing folder
•Data file naming prevents confusion when multiple people are
working on shared files
•Data files are easier to locate and browse
•Data files can be retrieved by both creator and by other users
•Data files can be sorted in logical sequence
•Different versions of data files can be identified
•If data files are moved to other storage platforms their names
will retain useful context
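The points above can be illustrated with a minimal sketch of a file naming convention. The pattern used here (project, description, ISO date, zero-padded version) and the function name build_filename are assumptions chosen for illustration, not a convention prescribed by CISER or MANTRA.

```python
import re
from datetime import date


def build_filename(project: str, description: str, day: date, version: int, ext: str) -> str:
    """Build a name such as 'crp7201_survey-responses_2014-03-19_v02.csv'."""

    def slug(text: str) -> str:
        # Lower-case the text and replace runs of other characters with a hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # ISO dates (YYYY-MM-DD) and zero-padded versions sort correctly as plain text.
    return f"{slug(project)}_{slug(description)}_{day.isoformat()}_v{version:02d}.{ext.lstrip('.')}"


# Hypothetical files from one project, deliberately listed out of order.
names = [
    build_filename("CRP7201", "Survey responses", date(2014, 3, 19), 2, "csv"),
    build_filename("CRP7201", "Survey responses", date(2014, 3, 19), 1, "csv"),
    build_filename("CRP7201", "Survey responses", date(2014, 2, 3), 1, "csv"),
]

for name in sorted(names):
    print(name)
```

Because the date and version are part of the name, an alphabetical sort lists the files in chronological and version order, and the name still carries its context if the file is moved to another storage platform.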
27. File Formats & Transformation
• Files are based on either text or binary encoding. The
former is both machine- and human-readable and the latter
only readable by means of appropriate software.
• Thus text files are less likely to become obsolete. Examples
of file name extensions for these files are .txt, .csv
and .por.
• Be aware of the file formats your data exists in
– Does this format require a specific type of software?
– Can others access the data in this format?
– Can alternative formats be used?
• Using widely available or open formats maximises the
chances of your data being stable and usable
28. File Formats & Transformation
•When compressing your data files for storage or
transportation you encode the information using fewer bits than
the original representation. Commonly used compression
programs are Zip and Tar.
•You may use the process of data normalisation. This means to
convert data from one format (e.g. proprietary) into another for
use or preservation (e.g. ASCII).
•If you convert or migrate your data files from one format to
another, be aware of potential risk of data loss or corruption
and take appropriate steps to avoid/minimise it.
•Watch out for backwards compatibility if software is upgraded
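One way to check for loss during conversion is to convert the data, read it back, and compare it with the original. The sketch below does this for a conversion to CSV text using Python's standard csv module. The sample records and the function names are hypothetical, and a real migration from a proprietary format would need a format-specific reader.

```python
import csv
import io


def to_csv_text(rows: list[dict]) -> str:
    """Write rows (dicts sharing the same keys) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv_text(text: str) -> list[dict]:
    """Read CSV text back into a list of dicts (all values become strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def lost_in_conversion(rows: list[dict]) -> list[tuple]:
    """Return (row index, column, original, recovered) for every value that changed."""
    recovered = from_csv_text(to_csv_text(rows))
    problems = []
    for index, (before, after) in enumerate(zip(rows, recovered)):
        for column, value in before.items():
            if after[column] != value:
                problems.append((index, column, value, after[column]))
    return problems


# Hypothetical records: a zero-padded identifier, a number and a missing value.
records = [
    {"respondent": "001", "age": 34, "income": 52000.5},
    {"respondent": "002", "age": None, "income": 48000.0},
]

for problem in lost_in_conversion(records):
    print(problem)
```

In this example the text identifiers survive unchanged, but numeric types and the missing value do not: CSV stores everything as text, so this information has to be recorded in the documentation instead.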
30. Documenting Data
There are many reasons why you need to document your
data:
•To help you remember the details later
•To help others understand your research
•Verify your findings
•Review your submitted publication
•Replicate your results
•Archive your data for access and re-use
Some examples of data documentation are:
•Laboratory notebooks
•Field notes
•Questionnaires
31. Documenting Data
Research data need to be documented at various levels:
•Project level
•File or database level
•Variable or item level
The term metadata (‘data about data’) is often used.
The importance of metadata lies in the potential for
machine-to-machine interoperability to assist location and
access to data through search interfaces.
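The three documentation levels can be sketched as a single machine-readable record. The structure below is a hypothetical, simplified illustration and does not follow a metadata standard such as DDI; all titles, file names and variables are invented.

```python
import json

# Hypothetical metadata record covering project, file and variable levels.
record = {
    "project": {
        "title": "Example household survey",
        "creator": "A. Researcher",
        "collected": "2014-01/2014-03",
    },
    "files": [
        {
            "name": "survey_2014-03-19_v01.csv",
            "format": "CSV (text)",
            "variables": [
                {"name": "respondent", "label": "Respondent identifier", "type": "string"},
                {"name": "age", "label": "Age in years", "type": "integer", "missing": -9},
            ],
        }
    ],
}


def undocumented(record: dict, file_name: str, columns: list[str]) -> list[str]:
    """Return the columns of a data file that have no variable-level entry."""
    for entry in record["files"]:
        if entry["name"] == file_name:
            documented = {variable["name"] for variable in entry["variables"]}
            return [column for column in columns if column not in documented]
    return list(columns)  # the file itself is not documented at all


print(json.dumps(record, indent=2))
print(undocumented(record, "survey_2014-03-19_v01.csv", ["respondent", "age", "income"]))
```

Because the record is structured, software can read it as well as people: here a simple check reports that the column "income" has no variable-level documentation.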
32. Secure data storage:
For the purposes of integrity and efficiency it is important that research
data is stored securely & backed up regularly via:
• Networked drives
• Fileservers managed by department / school / IT Dept.
• Stored in single, secure, accessible place – regular back-ups.
• Personal computers / laptops
• Convenient, temporary storage - should not be used for storing
master copies.
• Local drives may fail & laptops may get lost/stolen.
33. • External storage devices
• Hard drives, USB sticks, CDs, DVDs – low cost & portable BUT not
recommended for long term storage.
• Longevity not guaranteed – degradation over time.
• Easily damaged or misplaced.
• Not big enough for all research data – might need to use multiple
discs/drives.
• May pose a security threat.
If USB sticks, DVDs, CDs are used for working data or extra back-up
then:
• Choose high quality products from reputable manufacturers.
• Conduct regular checks to ensure media is not failing.
• Periodically refresh data (i.e. copy to a new disc or drive).
• Ensure confidential data is password protected / encrypted
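The regular checks mentioned above can be partly automated by recording a checksum for each file and re-computing it later. This is a minimal sketch using SHA-256 from Python's standard library. The function names are hypothetical, and the temporary folder stands in for a real disc or drive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_manifest(folder: Path) -> dict[str, str]:
    """Map each file under folder (by relative path) to its checksum."""
    return {
        str(path.relative_to(folder)): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def failed_files(folder: Path, manifest: dict[str, str]) -> list[str]:
    """Return files that are missing or whose contents no longer match the manifest."""
    failures = []
    for name, expected in manifest.items():
        path = folder / name
        if not path.is_file() or sha256_of(path) != expected:
            failures.append(name)
    return failures


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "data.csv").write_text("id,age\n1,34\n")
    manifest = make_manifest(folder)
    print(failed_files(folder, manifest))  # nothing has changed yet
    (folder / "data.csv").write_text("id,age\n1,43\n")  # simulate silent corruption
    print(failed_files(folder, manifest))
```

The manifest should be kept separately from the media it describes, so that a failing disc cannot damage both the data and the record used to check it.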
34. • Remote or online back-up services – services that
provide an online system for storing and backing-up computer
files e.g. Dropbox, Mozy, Humyo, A-Drive
• Allow users to store and sync data files online and between
computers.
• Employ cloud computing storage facilities (e.g. Amazon S3).
• Business model – first few GBs free, pay for more space.
35. Backing-up
Considerations for back-up policy:
• Whether all data (full back-up), or only changed data will be
backed-up (incremental back-up)?
• How often full and incremental back-ups will be made?
• How much hard-drive space or DVDs will be required to maintain
this schedule?
• If working with sensitive data, how will it be secured (and
destroyed)?
• What back-up services are available that meet these needs?
• Who will be responsible for ensuring back-ups are available?
Recommendation:
Keep at least 3 copies of your data (e.g. original, external/local,
and external/remote) and put a regular back-up procedure in place.
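The difference between a full and an incremental back-up can be sketched as follows. This is an illustrative example only, not a replacement for a managed back-up service. The function name back_up is hypothetical, and the temporary folders stand in for the original, external/local and external/remote locations.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def back_up(source: Path, target: Path, incremental: bool = True) -> list[str]:
    """Copy files from source to target and return the relative paths copied.

    A full back-up copies every file. An incremental back-up copies only
    files that are new or whose contents differ from the existing copy.
    """
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source)
        destination = target / relative
        # Compare file contents byte by byte rather than trusting timestamps.
        unchanged = destination.is_file() and filecmp.cmp(path, destination, shallow=False)
        if incremental and unchanged:
            continue
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, destination)
        copied.append(str(relative))
    return copied


with tempfile.TemporaryDirectory() as original, \
        tempfile.TemporaryDirectory() as local, \
        tempfile.TemporaryDirectory() as remote:
    source = Path(original)
    (source / "survey.csv").write_text("id,age\n1,34\n")
    (source / "notes.txt").write_text("collected March 2014\n")

    # Initial full back-up to two further locations gives three copies in total.
    for location in (Path(local), Path(remote)):
        print(back_up(source, location, incremental=False))

    # After one file changes, an incremental back-up copies only that file.
    (source / "survey.csv").write_text("id,age\n1,34\n2,51\n")
    print(back_up(source, Path(local)))
```

Comparing contents is slower than comparing modification times but avoids missing a change when timestamps are unreliable. Which trade-off is appropriate depends on the volume of data and the back-up schedule.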
36. Data Security
The means of ensuring that data is kept safe from corruption and
that access to it is suitably controlled. It is important to consider
data security to prevent:
• Accidental or malicious damage / modification to data.
• Theft of valuable or irreplaceable data.
• Breach of confidentiality agreements and privacy laws.
• Release of data before it has been checked for accuracy and
authenticity.
38. Data Protection (also called data privacy)
• In the US, there is no single, comprehensive federal (national) law
regulating the collection and use of personal data. Instead, the US has
a patchwork system of federal and state laws, and regulations that
overlap, dovetail and may contradict one another.
• The combination of an increase in cross-border data flow, together
with the increased enactment of data protection statutes heightens the
risk of privacy violations and creates a significant challenge for a data
owner/distributor.
Data protection is the relationship between:
•collection and dissemination of data
•technology
•the public expectation of privacy and the legal and political issues
surrounding them
39. Rights and access
• Intellectual property rights (IPR) can be defined as rights acquired
over any work created or invented with the intellectual effort of an
individual.
• Facts are not copyrightable but the structure of a database could be.
• As a researcher, you should clarify ownership of and rights relating to
research data before a project starts. This includes the right of access
and the right to make copies.
• Data licences determine the terms and conditions of use by another,
and may accompany a purchase or subscription.
• Open data licences attempt to “set data free” by minimising and
standardising the terms and conditions of re-use. Conditions may
include attribution, non-commercial use, no derivative works, or ‘share
alike’.
40. Open Data Commons (ODC) have prepared a set
of licences each with an accompanying statement
which can be placed with your data on a webpage
that points to your data.
Open Data Commons: http://opendatacommons.org/
41. Benefits of Sharing Data
• Scientific integrity – publishing & citing data in published
research papers can allow others to replicate, validate, or
correct results, thus improving the scientific record.
• Publicly funded research - there is a growing movement for
making publicly funded research available to the public.
• Funding mandates - US Funding Agencies are increasingly
mandating data sharing so as to avoid duplication of effort and
save costs.
• Preserve research data for researchers’ own future use.
43. Research Data MANTRA
Partnership between:
EDINA & Data Library, University of Edinburgh
Institute for Academic Development
Funded by JISC Managing Research Data Programme (Sept.
2010 – Aug. 2011)
Aim was to develop online interactive open learning resources
for PhD students and early career researchers that will:
Raise awareness of the key issues related to research data
management & contribute to culture change.
Provide guidelines for good practice.
44. Online Learning Module
Eight units with activities, scenarios and videos:
• Research data explained
• Data management plans
• Organising data
• File formats and transformation
• Documentation and metadata
• Storage and security
• Data protection, rights and access
• Preservation, sharing and licensing
Four data handling practicals: SPSS, NVivo, R, ArcGIS
Video stories from researchers in a variety of settings
45. Online Learning Module
• Delivered online – self-paced, available ‘anytime, anyplace’
• Emphasis on practical experience and active engagement via
online activities
• One hour per unit
• Read and work through scenarios & activities (incl. videos etc)
• CC licence to allow manipulation of content for re-use with
attribution
• Portable content in open standard formats (e.g. SCORM)
• Research data MANTRA course:
http://datalib.edina.ac.uk/mantra
Data, documentation and associated files (e.g. SAS, SPSS, Stata) are housed on the CISER file server. Files are downloaded from the catalog in ZIP compressed format.
Cross-National Time Series data
As CISER is an ICPSR member, researchers can gain access to data held in those CESSDA Archives that are themselves ICPSR members
CESSDA member organisations adhere to a Trans-border Data Access Agreement
European community household panel survey, European Union labour force survey, Community Innovation survey, European health Interview Survey, Structure of Earnings Survey, European Union Statistics on Income and Living Conditions
What about preserving?
Observational – sensor data, survey or sample data, neuroimages – e.g. ocean temperature, voters' attitudes before an election, photographs of a supernova
Experimental – e.g. gene sequences, chromatograms, toroid magnetic field data, HPLC, gel electrophoresis, chemical reaction rates
Simulation – e.g. climate models, economic models, algorithms
Derived – e.g. text and data mining, compiled database, 3D models, maps
Reference - e.g. gene sequence databanks, chemical structures, spatial data portals
Funded by JISC as part of its UK programme, Managing Research Data, to develop online learning materials to help researchers manage their digital assets.
IAD – set up to deliver training and development for postgraduate students and staff – via online course, Virtual Learning Environments, transferable skills training
Shareable Content Object Reference Model – XML-based