Spring 2014 Data Management Lab: Session 1 Slides (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
1. Research Data Management
Spring 2014: Session 1
Practical strategies for better results
University Library
Center for Digital Scholarship
2. Acknowledgements
Department of Biostatistics – Data Management,
Indiana University School of Medicine
Colleagues at Johns Hopkins University, Purdue
University, Oregon State University, University of
Oregon, New York University, and others who shared
their expertise.
4. Overview
• Four sessions, 2 hours each
• Some lecture, more discussion and activities
• Major products
– Practical, detailed data management plan [DRAFT]
– Map of data outcomes
– Storage & backup plan
– Documentation checklist
– Data quality standards
– Screening & cleaning checklist
5. Products & Resources
• Box folders
– Session 1, 2, 3, 4: Materials for each session
– Resources: Miscellaneous resources that span
sessions or are useful later
– Upload HERE: Folder for uploading products
• Will be used to assess my teaching – content & delivery
• Will NOT be used to assess you
• Please delete your name from the files before you
upload them
6. 1. Research data management plans & planning
2. Documentation & metadata
3. Data quality
4. Ethical & legal issues in data sharing & reuse
7. Session 1
1. Research data management plans & planning
a) Planning for good data management from the
start
b) Defining expected outcomes for your data
c) Getting a storage and backup plan
8. Activities & Discussions
• Introductions (<1 minute each)
–Name
–Department or Program
–What do you want to get out of these
workshops?
10. LEARNING OUTCOMES
• Describe key challenges associated with managing digital research data
• Identify the potential consequences of irresponsible or inattentive data management
11. Data Deluge
Data is collected from sensors, sensor networks, remote sensing, observations, and more. This calls for increased attention to data management and stewardship.
Photos courtesy of www.carboafrica.net, http://modis.gsfc.nasa.gov/, and http://www.futurlec.com
CC images by tajai and CIMMYT on Flickr
Image collected by Viv Hutchinson
12. The World of Data Around Us
[Chart: petabytes worldwide by year, 2005 to 2010 (vertical axis 0 to 1,000,000), comparing Information with Available Storage. The gap between the two is labeled "Transient information or unfilled demand for storage".]
Source: John Gantz, IDC Corporation: The Expanding Digital Universe
13. Why Data Management
• Natural disaster
• Facilities infrastructure failure
• Storage failure
• Server hardware/software failure
• Application software failure
• External dependencies (e.g., PKI failure)
• Format obsolescence
• Legal encumbrance
• Human error
• Malicious attack by human or automated agents
• Loss of staffing competencies
• Loss of institutional commitment
• Loss of financial stability
• Changes in user expectations and requirements
CC images by Sharyn Morrow and momboleum on Flickr
14. Best Practices
Poor data practice results in loss of information (data entropy)
[Figure: information content versus time. Labeled points after the time of publication: specific details, general details, accident, retirement or career change, death.] (Michener et al. 1997)
Source: Best Practices for Preparing Ecological Data Sets, ESA, August 2010
16. “MEDICARE PAYMENT ERRORS NEAR $20B” (CNN) December 2004
Miscoding and Billing Errors from Doctors and Hospitals totaled $20,000,000,000 in FY
2003 (9.3% error rate). The error rate measured claims that were paid despite being
medically unnecessary, inadequately documented or improperly coded. In some
instances, Medicare asked health care providers for medical records to back up their
claims and got no response. The survey did not document instances of alleged fraud.
This error rate actually was an improvement over the previous fiscal year (9.8% error rate).
“AUDIT: JUSTICE STATS ON ANTI-TERROR CASES FLAWED” (AP) February 2007
The Justice Department Inspector General found only two sets of data out of 26
concerning terrorism attacks were accurate. The Justice Department uses these
statistics to argue for its budget. The Inspector General said the data “appear to be
the result of decentralized and haphazard methods of collections … and do not appear
to be intentional.”
“OOPS! TECH ERROR WIPES OUT ALASKA INFO” (AP) March 2007
A technician managed to delete the data and backup for the $38 billion Alaska oil
revenue fund – money received by residents of the State. Correcting the errors cost the
State an additional $220,700 (which, of course, was taken off the receipts to Alaska
residents).
Slide courtesy of BLM
18. Benefits of GOOD Data Management
• Efficiency
• Safety
• Quality
• Reputation
• Compliance
19. Minute paper
Why should we care about how
research data is managed?
[Subtext: Why should researchers spend time
managing their data better?]
Don’t forget to upload your paper to Box.
20. References
1. DataONE Education Module: Data Management. DataONE. Retrieved December 2013. From http://www.dataone.org/sites/all/documents/L01_DataManagement.pptx
2. Cook, B. (2013). NACP All Investigator Meeting: Data Management Practices for Early Career Scientists. Presented February 3, 2013. From http://daac.ornl.gov/NACP_AIM_2013/NACP_AIM_Agenda.html
3. Vines et al. (2014). The availability of research data declines rapidly with article age. Current Biology. http://dx.doi.org/10.1016/j.cub.2013.11.014
22. LEARNING OUTCOMES
• Understand the life cycle approach to managing research data.
• Summarize the basic components of US federal funding agency requirements for data management and sharing.
• Outline planned project and data documentation in a data management plan.
• Define expected outcomes for data.
23. The Life Cycle Approach
• Helps define and explain complex processes
(graphically). (Carlson, 2013)
• Helps identify important components, roles,
responsibilities, milestones, etc. (Carlson, 2013)
• Demonstrates connections and relationships
between parts and the whole. (Carlson, 2013)
• Emphasizes the role of data management as an
active process embedded throughout the
research and knowledge creation life cycles.
28. OSTP Memo - February 2013
• Data
– Maximize access by the general public and without charge…protecting
confidentiality and personal privacy
– …recognizing proprietary interests, business confidential information,
and intellectual property rights
– …preserving the balance between the relative value of long-term
preservation and access and administrative burden
– …ensure all researchers develop data management plans
– Ensure appropriate evaluation of the merits of submitted DMPs
– Promote the deposit of data in publicly accessible databases
– …support training, education, and workforce development related to
scientific data management, analysis, storage, preservation, and
stewardship
29. Policy Drivers
• Funding agencies
– Increased impact of funding dollars
– Reduce redundant data collection
– Further scientific research
• Research Communities
– Enhance use and value of existing data
– Address big challenges
31. DMPs – What do they do?
• Outline what you will do with your data
during and after your research
• Submitted to funders – formal document
• Functional DMP – working document
– Start developing during design
– Use to guide project start-up
– Review and update throughout the project
32. DMPs – Why?
• Doing it right saves you time and makes your
research more efficient
– Document crucial information for your thesis or
dissertation
• Makes it easier to preserve and share your data
• Increases visibility of research
Data management is an
investment in your research to
make it easier and more efficient.
33. A dose of DMP realism
My data management plan – a satire
34. DMP
Introduction to the DMP
• Workshop - emphasis on planning
• BUT it is a working document
Sections to draft
• Data description
• Existing data (if applicable)
• Format
35. Mapping Data Outcomes
• Clearly describe what you want your research
project to accomplish
• Define what the data need to be in order for
you to answer your research questions
• Review example
37. References
1. Carlson, J. (2013). ICPSR Curating and Managing Data for Reuse: Life Cycle Models and Principles.
2. DataONE Education Module: Data Management Planning. DataONE. From http://www.dataone.org/sites/all/documents/L03_DataManagementPlanning.pptx
3. Humphrey, C. (2008). e-Science and the Life Cycle of Research. From http://datalib.library.ualberta.ca/~humphrey/lifecycle-science060308.doc
4. Whitmire, A. (2013). Research Life Cycle. From http://guides.library.oregonstate.edu/content.php?pid=502068&sid=4136875
39. LEARNING OUTCOMES
• Identify your legal obligations for sharing and long-term preservation.
• Identify your ethical obligations for ensuring data confidentiality, privacy, and security.
• Describe intellectual property issues for data that result in a patentable or commercial product.
40. Ethical vs. Legal
• Ethical (Professional Society, Licensure, Community of Practice)
– Sharing (consent, IRB approval, de-identification, etc.)
– Redistribution & Re-use
– Citation
• Legal (Federal, State, Local, Funding Agency, Institution)
– Intellectual Property (e.g., who owns it?)
– Copyright
– Patents
– Trade secrets
– Licensing
– Monetary exchange
– Open source vs. proprietary software
– Data retention
41. Privacy
• Privacy: having control over the extent, timing, and
circumstances of sharing oneself (physically, behaviorally, or
intellectually) with others.
• Federal laws: FERPA, HIPAA
• Most research involves asking subjects to provide or release
information voluntarily following an informed consent
process.
• Privacy issues arise in regard to information obtained for
research purposes without the consent of the subjects.
42. Confidentiality
• Confidentiality: treatment of information that an individual has
disclosed in a relationship of trust and with the expectation that it
will not be divulged to others in ways that are inconsistent with the
understanding of the original disclosure without permission.
• Questions to consider:
– Are identifiers really needed or could data be collected anonymously?
– If identifiers are needed, can coded IDs be created to use for data collection,
merging, and analysis, with identifiers kept entirely separate and secure?
– How will the data be protected from inadvertent disclosure or unauthorized
access during collection, storage, and analysis?
– Should data be manipulated in specific ways to reduce specificity, for example by
collapsing categories that contain small numbers of individuals or by reducing age
or geographic specificity?
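The questions on this slide about coded IDs and reduced specificity can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the field names (name, age, score), the "P-" code prefix, the 10-year age bands, and the sample records are hypothetical, and real projects should follow their IRB-approved protocol.

```python
import secrets


def assign_coded_ids(identifiers):
    """Map each distinct identifier to a random study code.

    The returned key table is the only link between people and codes, so it
    should be stored separately from the analysis data and kept secure.
    """
    key_table = {}
    used_codes = set()
    for identifier in identifiers:
        if identifier in key_table:
            continue
        code = "P-" + secrets.token_hex(4)
        while code in used_codes:  # extremely unlikely, but keep codes unique
            code = "P-" + secrets.token_hex(4)
        used_codes.add(code)
        key_table[identifier] = code
    return key_table


def age_band(age, width=10):
    """Collapse an exact age into a band such as '30-39' to reduce specificity."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(records, key_table):
    """Return analysis records that carry coded IDs and banded ages only."""
    return [
        {
            "study_id": key_table[record["name"]],
            "age_band": age_band(record["age"]),
            "score": record["score"],
        }
        for record in records
    ]


# Hypothetical sample data, for illustration only.
records = [
    {"name": "Alice Example", "age": 34, "score": 7},
    {"name": "Bob Example", "age": 61, "score": 4},
]
key_table = assign_coded_ids(record["name"] for record in records)
analysis_data = deidentify(records, key_table)
```

Only analysis_data would be used for merging and analysis; key_table would be kept in a separate, secured location so the codes cannot be linked back to individuals by anyone without access to it.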
43. Intellectual Property Rights
• Patent
• Copyright
• Trademark
• Design
• Circuit Layout Right
• Plant Breeder’s Right
• Trade Secret
45. References
1. Australian Research Council. (nd). National Principles of
Intellectual Property Management for Publicly Funded
Research. From http://www.arc.gov.au/pdf/01_01.pdf
48. Storage & Back-up Plan
• Storage
– Keep primary copies in a secure, accessible location
• Backup
– Additional copies to prevent data loss
– Rule of 3
– Diversify hardware, software, and physical location
• Other considerations
– Security, encryption, compression
49. Storage @ IU
• Box @ IU
– http://kb.iu.edu/data/bdsv.html
• Research File System
– http://kb.iu.edu/data/aroz.html
• Scholarly Data Archive
– http://kb.iu.edu/data/aiyi.html
• REDCap
– http://www.indianactsi.org/rct
• Slashtmp (sharing)
– http://kb.iu.edu/data/angt.html
50. Backup Plan
• Rule of 3
– Local copy (ex: desktop or laptop)
– Semi-local copy (ex: IU cloud storage)
– Remote copy (ex: IU cloud storage)
• Backup frequency
– How much data can you risk losing?
• Backup procedure
– Manual or automatic?
– Full or incremental?
– Verification/testing?
– Documentation
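A minimal Python sketch of the Rule of 3 with a verification step, using only the standard library. It is an illustration under stated assumptions, not an IU procedure: the file name survey.csv and the folder names are hypothetical, temporary folders stand in for the real semi-local and remote locations, and the copy is a full manual backup rather than an incremental one.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dirs):
    """Copy `source` into each backup directory and verify every copy.

    Returns the list of verified copies; raises if any copy does not match
    the checksum of the original.
    """
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)  # copies contents and file metadata
        if sha256_of(target) != expected:
            raise IOError(f"Backup verification failed for {target}")
        copies.append(target)
    return copies


# Demonstration: temporary folders stand in for the three locations.
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    local = root / "local" / "survey.csv"
    local.parent.mkdir()
    local.write_text("id,score\n1,7\n2,4\n")

    copies = back_up(local, [root / "semi_local", root / "remote"])
    total_copies = 1 + len(copies)  # the local copy plus two backups
    all_match = all(sha256_of(copy) == sha256_of(local) for copy in copies)
```

Comparing checksums is one simple way to cover the "Verification/testing?" question: a backup that has never been checked against the original may not be usable when it is needed.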
51. Security & Encryption
• Use IU systems
– Strong authentication protocols
• Encryption
– Useful for portable devices (e.g., laptops, external hard
drives, flash drives, smartphones, etc.)
– Use for highly sensitive data
– IU recommendations
• http://kb.iu.edu/data/ayzi.html
• http://kb.iu.edu/data/bcnh.html
52. Master Files
• Provide snapshots of key phases in the data life
cycle
– Raw
– Cleaned
– Phases of processing
• In combination with detailed documentation, these
files make write-up easier and support
reproducibility and reuse
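One possible way to keep master files is sketched below in Python: each snapshot is copied into a folder named for its phase, its checksum is recorded in a small manifest, and the copy is made read-only. The folder layout, the MANIFEST.json name, the survey.csv file, and the cleaning step are all hypothetical choices for illustration.

```python
import hashlib
import json
import shutil
import stat
import tempfile
from pathlib import Path


def snapshot_master_file(working_file, masters_dir, phase):
    """Copy a working file into masters_dir/<phase>/, record its checksum
    in a manifest, and make the copy read-only."""
    working_file = Path(working_file)
    phase_dir = Path(masters_dir) / phase
    phase_dir.mkdir(parents=True, exist_ok=True)

    master = phase_dir / working_file.name
    shutil.copy2(working_file, master)
    checksum = hashlib.sha256(master.read_bytes()).hexdigest()

    manifest = phase_dir / "MANIFEST.json"
    manifest.write_text(
        json.dumps({"file": master.name, "phase": phase, "sha256": checksum}, indent=2)
    )
    master.chmod(stat.S_IREAD)  # remove write permission from the master copy
    return master, checksum


# Demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    working = root / "survey.csv"
    working.write_text("id,score\n1,7\n2,\n")
    raw_master, raw_sum = snapshot_master_file(working, root / "masters", "raw")

    # Hypothetical cleaning step: drop the row with a missing score.
    working.write_text("id,score\n1,7\n")
    cleaned_master, cleaned_sum = snapshot_master_file(working, root / "masters", "cleaned")

    raw_is_unchanged = hashlib.sha256(raw_master.read_bytes()).hexdigest() == raw_sum
    phases = sorted(path.name for path in (root / "masters").iterdir())

    # Restore write permission so the temporary folder can be removed cleanly.
    for master in (raw_master, cleaned_master):
        master.chmod(stat.S_IREAD | stat.S_IWRITE)
```

The recorded checksum lets anyone confirm later that the raw master file still matches what was originally captured, which supports the reproducibility goal named on the slide.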
53. EF-5 Horror Stories
• World’s Biggest Data Breaches:
http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
• Excel error responsible for misinterpretation of data and resulting policy decisions:
http://arstechnica.com/tech-policy/2013/04/microsoft-excel-the-ruiner-of-global-economies/
• Sandy’s floodwaters damage 1500 volumes of digital art:
http://www.theverge.com/2013/1/15/3876790/eyebeam-hurricane-sandy-digital-archive-rescue
54. EF-3 Horror Stories
• UNC researcher demoted over data breach:
– http://www.insidehighered.com/news/2011/01/27/unc_case_highlights_debate_about_data_security_and_accountability_for_hacks
– http://www.databreaches.net/cancer-researcher-fights-unc-demotion-over-data-breach/
• UK Tamiflu clinical trial data:
http://blogs.plos.org/speakingofmedicine/2014/01/03/follow-the-money-or-why-it-took-an-accounts-committee-to-decide-why-access-to-clinical-trial-data-matters/
• Data loss at Emory Healthcare exposes over 315,000 patients:
http://www.bizjournals.com/atlanta/news/2012/04/18/data-loss-at-emory-healthcare-exposes.html?s=print