Research Data Management
Spring 2014: Session 3
Practical strategies for better results
University Library
Center for Digital Scholarship
QUALITY ASSURANCE & CONTROL
MODULE 3
LEARNING
OUTCOMES
• Develop procedures
for quality
assurance and
quality control
activities.
Data Integrity
1. Data have integrity if they have been
maintained without unauthorized alteration
or destruction
2. Data have integrity if they have a complete or
whole structure.
(http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Data_integrity.html)
Data Quality
• Fitness for use (depends on context of your questions)
• Data quality is the most important aspect of data
management
• Ensured by
– Sufficient resources and expertise
– Paying close attention to the design of data collection
instruments
– Creating appropriate entry, validation, and reporting processes
– Ongoing QC processes
– Understanding the data collected
Chapman, 2005
Dept of Biostatistics – Data Management, IUSM
Data Quality Standards
• Check data for its logical consistency.
• Check data for reasonableness.
• Ensure adherence to sound estimation methodologies.
• Ensure adherence to monetary submission standards for
stolen and recovered property.
• Ensure that other statistical edit functions are processed
within established parameters.
FBI: http://www.fbi.gov/about-us/cjis/ucr/data_quality_guidelines
Dept of Biostatistics – Data Management, IUSM
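The first two standards above, logical consistency and reasonableness, can be sketched as simple record checks. This is a minimal illustration, not part of the original slides: the field names and the plausibility limits in `LIMITS` are hypothetical and would normally come from the study protocol.

```python
from datetime import date

# Hypothetical plausibility limits; a real study would define its own.
LIMITS = {"age_years": (0, 120), "height_cm": (40, 250)}

def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Reasonableness: each value must fall inside its plausible range.
    for field, (low, high) in LIMITS.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside {low}-{high}")
    # Logical consistency: a visit cannot precede the date of birth.
    if record["visit_date"] < record["birth_date"]:
        problems.append("visit_date precedes birth_date")
    return problems

good = {"age_years": 34, "height_cm": 172,
        "birth_date": date(1980, 1, 5), "visit_date": date(2014, 3, 1)}
bad = {"age_years": 340, "height_cm": None,
       "birth_date": date(1980, 1, 5), "visit_date": date(1979, 3, 1)}

print(check_record(good))
print(check_record(bad))
```

Returning a list of problems, rather than stopping at the first one, lets a reviewer see everything wrong with a record in one pass.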
Data Entry and Manipulation
• Strategies for preventing errors from entering a dataset
• Activities to ensure quality of data before collection
• Activities that involve monitoring and maintaining the
quality of data during the study
Data Entry and Manipulation
• Define & enforce standards
◦ Formats
◦ Codes
◦ Measurement units
◦ Metadata
• Assign responsibility for data quality
◦ Be sure assigned person is educated in QA/QC
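One way to define and enforce standards for formats, codes, and measurement units is a small machine-readable codebook. The sketch below is an assumption for illustration only: the variables, codes, and the `decode` helper are hypothetical and do not come from the course materials.

```python
# Hypothetical codebook: each variable declares its type plus either
# the set of allowed codes or its measurement unit.
CODEBOOK = {
    "sex": {"type": int, "codes": {1: "female", 2: "male", -9: "not reported"}},
    "height": {"type": float, "unit": "cm"},
}

def decode(variable, value):
    """Check a raw value against the codebook and return a labelled form."""
    spec = CODEBOOK[variable]
    if not isinstance(value, spec["type"]):
        raise TypeError(f"{variable}: expected {spec['type'].__name__}")
    if "codes" in spec:
        if value not in spec["codes"]:
            raise ValueError(f"{variable}: {value} is not a defined code")
        return spec["codes"][value]
    return f"{value} {spec['unit']}"

print(decode("sex", 2))
print(decode("height", 172.5))
```

Keeping the codes and units in one structure means the same definition can drive both entry validation and the documentation handed to analysts.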
Quality Assurance vs. Quality Control
• QA: set of processes, procedures, and activities that
are initiated prior to data collection to ensure the
expected level of quality will be reached and data
integrity will be maintained.
• QC: a system for verifying and maintaining a desired
level of quality in a product or service.
http://c2.com/cgi/wiki?QualityAssuranceIsNotQualityControl
Quality Assurance in Practice
• CRF (data collection instrument) review & validation
• System/process testing & validation
• Team training, education, and communication
• Standard Operating Procedures, Standard Operating
Guidelines
• Site audits
Dept of Biostatistics – Data Management, IUSM
Quality Control in Practice
• Set of processes, procedures, and activities
associated with monitoring, detection, and action
during and after data collection.
• Examples:
– Errors in individual data fields
– Systematic errors
– Violation of protocol
– Staff performance issues
– Fraud or scientific misconduct
Dept of Biostatistics – Data Management, IUSM
Activity
Define data quality standards for the following
variables:
• Age
• Height
• BMI
• Life satisfaction scale
• Number of close friends
Don’t forget to upload this to Box.
Suggested file name “Data Quality Standards”
References
1. Department of Biostatistics – Data Management Team, Indiana
University School of Medicine (2013). Data Management including
REDCap. (provided via email)
2. Chapman, A. D. 2005. Principles of Data Quality, version 1.0. Report for
the Global Biodiversity Information Facility, Copenhagen. ISBN 87-92020-03-8.
http://www.gbif.org/resources/2829
3. DataONE Education Module: Data Quality Control and Assurance.
DataONE. From
http://www.dataone.org/sites/all/documents/L05_DataQualityControlAssurance.pptx
DATA COLLECTION
MODULE 3
LEARNING
OUTCOMES
• Describe key
considerations for
selecting data
collection tools.
Choose your tools wisely
Allie Brosh, 2010
Activity
Draft data collection instrument
See document “DataMgmtLab-Spr14-CollectionCodingEntry_EX”
Don’t forget to upload this to Box.
Suggested file name “Data Collection Tool”
References
1. Brosh, A. 2010. Boyfriend doesn’t have ebola. Probably.
http://hyperboleandahalf.blogspot.com/2010/02/boyfriend-doesnt-have-ebola-probably.html
DATA CODING & ENTRY
MODULE 3
LEARNING
OUTCOMES
• Use best practices
for coding.
• Use best practices
for data entry.
Goals of Data Entry
• Publishable results!
– Valid data that are organized to support smooth
analysis
• Easy to import into analytical program
• Minimize manipulations and errors
• Has a logical [data] structure
Activity
Draft data coding scheme for data
entry
• Review data entry best practices
document in Box
Don’t forget to upload this to Box.
Suggested file name “Coding Scheme”
References
1. DataONE Education Module: Data Entry and Manipulation. DataONE.
From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx
2. Tilmes, C. (2011). Data Management 101 for the Earth Scientist
presented at the AGU Workshop. From
http://wiki.esipfed.org/index.php/2011AGUworkshop
3. Scott, T. (2012). Guidelines to Data Collection and Data Entry, Vanderbilt
CRC Research Skills Workshop Series. From
http://www.mc.vanderbilt.edu/gcrc/workshop_files/2012-09-07.pdf
DATA SCREENING & CLEANING
MODULE 3
LEARNING
OUTCOMES
• Develop a screening
and cleaning
protocol and/or
checklist.
Data Entry and Manipulation
Data Contamination
• Process or phenomenon, other than the one of interest,
that affects the variable value
• Erroneous values
CC image by MichaelCoghlan on Flickr
Data Entry and Manipulation
• Errors of Commission
o Incorrect or inaccurate data entered
o Examples: malfunctioning instrument, mistyped data
• Errors of Omission
o Data or metadata not recorded
o Examples: inadequate documentation, human error, anomalies in the
field
CC image by NickJWebb on Flickr
Data Entry and Manipulation
• Double entry
◦ Data keyed in by two independent people
◦ Check for agreement with computer verification
• Record a reading of the data and transcribe from the
recording
• Use text-to-speech program to read data back
CC image by weskriesel on Flickr
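The "check for agreement with computer verification" step of double entry can be sketched as a field-by-field comparison of the two keyed copies. This is a minimal illustration with hypothetical record IDs and fields; it assumes each copy is a dictionary keyed by record ID.

```python
def compare_entries(first, second):
    """Return (record_id, field, value_1, value_2) for every disagreement."""
    mismatches = []
    # Walk the union of IDs so a record missing from one copy is also reported.
    for record_id in sorted(set(first) | set(second)):
        a = first.get(record_id, {})
        b = second.get(record_id, {})
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                mismatches.append((record_id, field, a.get(field), b.get(field)))
    return mismatches

# Two independent keyings of the same paper forms (made-up values).
entry_1 = {"P01": {"age": 34, "height_cm": 172}, "P02": {"age": 51, "height_cm": 160}}
entry_2 = {"P01": {"age": 34, "height_cm": 127}, "P02": {"age": 51, "height_cm": 160}}

for mismatch in compare_entries(entry_1, entry_2):
    print(mismatch)
```

Each mismatch only shows that the two copies disagree; someone still has to go back to the source form to decide which value, if either, is correct.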
Data Entry and Manipulation
• Design data storage well
◦ Minimize the number of times items must be entered repeatedly
◦ Use consistent terminology
◦ Atomize data: one cell per piece of information
• Document changes to data
◦ Avoids duplicate error checking
◦ Allows undo if necessary
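Documenting changes can be as simple as appending one log entry per correction. The sketch below is hypothetical (function names, fields, and the in-memory log are assumptions for illustration); a real project might write the log to a file kept alongside the data.

```python
from datetime import datetime, timezone

def apply_correction(rows, log, record_id, field, new_value, reason):
    """Change one value and append a log entry describing the change."""
    old_value = rows[record_id][field]
    rows[record_id][field] = new_value
    log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "record_id": record_id,
        "field": field,
        "old": old_value,
        "new": new_value,
        "reason": reason,
    })

def undo_last(rows, log):
    """Revert the most recent logged change and return its log entry."""
    entry = log.pop()
    rows[entry["record_id"]][entry["field"]] = entry["old"]
    return entry

rows = {"P01": {"height_cm": 127}}
change_log = []
apply_correction(rows, change_log, "P01", "height_cm", 172,
                 "digits transposed at entry")
print(rows, len(change_log))
```

Because every entry stores the old value, the log supports both points on the slide: a reviewer can see which errors were already checked, and any change can be undone.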
Data Entry and Manipulation
• Make sure data line up in proper columns
• Check for missing, impossible, or anomalous values
• Perform statistical summaries
CC image by chesapeakeclimate on Flickr
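A quick statistical summary often exposes missing or impossible values before any analysis starts. The following sketch uses only the Python standard library; the variable, the sample values, and the 0-120 range are made up for illustration.

```python
import statistics

def summarize(values, low, high):
    """Summarize one variable, counting missing and out-of-range values."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "out_of_range": [v for v in present if not low <= v <= high],
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }

# Hypothetical ages; None marks a value that was never recorded.
ages = [34, 51, None, 29, 290, 42]
summary = summarize(ages, low=0, high=120)
for name, value in summary.items():
    print(name, value)
```

Here the maximum and the gap between mean and median both point at the same suspicious entry, which is the kind of signal a summary is meant to surface.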
Data Entry and Manipulation
• Look for outliers
◦ Outliers are extreme values for a variable given the statistical model
being used
◦ The goal is not to eliminate outliers but to identify potential data
contamination
Data Entry and Manipulation
• Methods to look for outliers
◦ Graphical
• Normal probability plots
• Regression
• Scatter plots
◦ Maps
◦ Subtract values from mean
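The last method, subtracting values from the mean, can be sketched by scaling each deviation by the sample standard deviation and flagging large results for review. The threshold of 2.0 and the sample readings are arbitrary assumptions; flagged values are candidates for checking, not for automatic removal.

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return (index, value) pairs whose distance from the mean exceeds
    `threshold` sample standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    if spread == 0:
        return []  # all values identical, nothing to flag
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mean) / spread > threshold]

# Made-up readings with one suspicious value at the end.
readings = [12, 14, 13, 15, 14, 13, 12, 15, 14, 58]
print(flag_outliers(readings))
```

Note that a single extreme value inflates both the mean and the standard deviation, so in small samples this simple rule can miss outliers; the graphical methods listed above are a useful cross-check.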
Data Entry and Manipulation
• Data contamination occurs when a factor not examined by the
study alters data values
• Data error types: commission or omission
• Quality assurance and quality control are strategies for
◦ preventing errors from entering a dataset
◦ ensuring data quality for entered data
◦ monitoring and maintaining data quality throughout the project
• Identify and enforce quality assurance and quality control
measures throughout the Data Life Cycle
Discussion
Using the Data Review Checklist,
evaluate the HBSC codebook
“DataMgmtLab-Spr14_DataReviewChecklist_EX”
What screening & cleaning procedures
were used?
Data Entry and Manipulation
1. D. Edwards, in Ecological Data: Design, Management and Processing,
WK Michener and JW Brunt, Eds. (Blackwell, New York, 2000), pp. 70-
91. Available at www.ecoinformatics.org/pubs
2. R. B. Cook, R. J. Olson, P. Kanciruk, L. A. Hook, Best practices for
preparing ecological data sets to share and archive. Bull. Ecol. Soc.
Amer. 82, 138-141 (2001).
3. A. D. Chapman, “Principles of Data Quality. Report for the Global
Biodiversity Information Facility” (Global Biodiversity Information
Facility, Copenhagen, 2004). Available at
http://www.gbif.org/communications/resources/print-and-online-resources/download-publications/bookelets/
References
1. Cook, 2013, NACP Best Data Management Practices Workshop. From
http://daac.ornl.gov/NACP_AIM_2013/04_data_management_cook_2013.02.03.ppt
2. Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A survey of data
provenance in e-Science. SIGMOD Record, 34(3), 31-36. From
http://www.sigmod.org/publications/sigmod-record/0509/p31-special-sw-section-5.pdf
3. Ram, S. (2012). Emerging Role of Social Media in Data Sharing and
Management. From
http://www.slideshare.net/INSITEUA/provenance-management-to-enable-data-sharing
AUTOMATION
MODULE 3
LEARNING
OUTCOMES
• Explain why
automation
provides better
provenance than
manual processes.
• Identify effective
tools for automating
data processing and
analysis.
Choose your tools wisely
• Documents
• Excel
• Access
• SPSS, Minitab
• Mathematica, MATLAB, Scilab
• SAS, Stata
• R
• MapReduce
• NVivo, Atlas.ti, Dedoose, HyperRESEARCH, etc.
http://www.dataone.org/all-software-tools
Data Formats; Version 1.0
Overview
• Spreadsheets are amazingly flexible, and are commonly
used for data collection, analysis and management
• Spreadsheets are seldom self-documenting, and seldom
well-documented
• Subtle (and not so subtle) errors are easily introduced
during entry, manipulation and analysis
• Spreadsheet conventions – often ad hoc and evolutionary –
may change or be applied inconsistently
• Spreadsheet file formats are often proprietary and thus generally
unsuitable for long-term archiving
Data Entry and Manipulation
Spreadsheets
• Great for charts, graphs, and calculations
• Flexible about cell content type: cells in the same column
can contain numbers or text
• Lack record integrity: a column can be sorted independently
of all others
• Easy to use, but harder to maintain as the complexity and
size of the data grow
Databases
• Easy to query to select portions of data
• Data fields are typed: for example, only integers are
allowed in integer fields
• Columns cannot be sorted independently of each other
• Steeper learning curve than a spreadsheet
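The points about typed fields and easy querying can be illustrated with SQLite through Python's built-in `sqlite3` module. This is a sketch under assumptions: the table and column names are hypothetical, and because SQLite's own typing is more permissive than that of many database systems, the integer rule is enforced here with an explicit `CHECK` constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE participant (
        id  INTEGER PRIMARY KEY,
        age INTEGER NOT NULL
            CHECK (typeof(age) = 'integer' AND age BETWEEN 0 AND 120)
    )
    """
)

def try_insert(row):
    """Insert one (id, age) row; return False if the database rejects it."""
    try:
        conn.execute("INSERT INTO participant (id, age) VALUES (?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

print(try_insert((1, 34)))        # a valid integer age
print(try_insert((2, "thirty")))  # text in an integer field
print(try_insert((3, 340)))       # integer, but not a plausible age

# Querying a portion of the data is a single statement.
print(conn.execute("SELECT id, age FROM participant WHERE age < 40").fetchall())
```

The rejection happens at entry time, inside the storage layer, which is the practical difference from a spreadsheet cell that accepts whatever is typed into it.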
NACP Best Data Management Practices, February 3, 2013
5. Preserve information (cont)
• Use a scripting language to process data
– R Statistical package (free, powerful)
– SAS
– MATLAB
• Processing scripts are records of processing
– Scripts can be revised, rerun
• Graphical User Interface-based analyses may
seem easy, but don’t leave a record
Provenance, Audit Trails, etc.
• “…information that helps determine the
derivation history of a data product, starting from
its original sources.” (Simmhan et al, 2005)
– Ancestral data products from which the data evolved
– Process of transformation of these ancestral data
products
• Uses: data quality, audit trail, replication recipe,
attribution, informational
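A scripted pipeline can record its own derivation history as it runs. The sketch below is one hypothetical way to do that, not a method described in the slides: each step logs a SHA-256 fingerprint of its input and output, so the chain from the original data to the final product can be checked later. The step names and sample data are made up.

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Fingerprint a data product so later copies can be verified."""
    return hashlib.sha256(data).hexdigest()

def run_step(name, func, data: bytes, trail: list) -> bytes:
    """Apply one transformation and append a provenance entry for it."""
    result = func(data)
    trail.append({
        "step": name,
        "input_sha256": sha256_of(data),
        "output_sha256": sha256_of(result),
    })
    return result

trail = []
raw = b"id,height\nP01, 172 \nP02,160\n"
clean = run_step("strip_spaces", lambda d: d.replace(b" ", b""), raw, trail)
clean = run_step("add_unit_to_header",
                 lambda d: d.replace(b"height", b"height_cm"), clean, trail)

for entry in trail:
    print(entry["step"], entry["input_sha256"][:12], "->", entry["output_sha256"][:12])
```

Each entry's input fingerprint matches the previous entry's output, which gives the audit trail and the replication recipe listed among the uses above. A point-and-click session leaves no comparable record.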
More Considerations
• Field names & descriptions
• Structured entry
• Validation
• Record integrity
• Missing data
• Data/field types
• File types: common, open documented standard
• Output required for analysis and visualization
Demonstration & Discussion
Run [analysis] in Excel and Stata.
Compare output.
• What features does Stata have that Excel
does not?
• How do these features support
provenance and data integrity?
References
1. DataONE Education Module: Data Entry and Manipulation. DataONE.
From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx

Data Management Lab: Session 3 Slides

  • 1. Research Data Management Spring 2014: Session 3 Practical strategies for better results University Library Center for Digital Scholarship
  • 2. QUALITY ASSURANCE & CONTROL MODULE 3
  • 3. LEARNING OUTCOMES • Develop procedures for quality assurance and quality control activities.
  • 4. Data Integrity 1. Data have integrity if they have been maintained without unauthorized alteration or destruction 2. Data integrity is data that has a complete or whole structure. (http://www.princeton.edu/~achaney/tmve/ wiki100k/docs/Data_integrity.html)
  • 5. Data Quality • Fitness for use (depends on context of your questions) • Data quality is the most important aspect of data management • Ensured by – Sufficient resources and expertise – Paying close attention to the design of data collection instruments – Creating appropriate entry, validation, and reporting processes – Ongoing QC processes – Understanding the data collected Chapman, 2005 Dept of Biostatistics – Data Management, IUSM
  • 6. Data Quality Standards • Check data for its logical consistency. • Check data for reasonableness. • Ensure adherence to sound estimation methodologies. • Ensure adherence to monetary submission standards for stolen and recovered property. • Ensure that other statistical edit functions are processed within established parameters. FBI: http://www.fbi.gov/about-us/cjis/ucr/data_quality_guidelines Dept of Biostatistics – Data Management, IUSM
  • 7. Data Entry and Manipulation • Strategies for preventing errors from entering a dataset • Activities to ensure quality of data before collection • Activities that involve monitoring and maintaining the quality of data during the study
  • 8. Data Entry and Manipulation • Define & enforce standards ◦ Formats ◦ Codes ◦ Measurement units ◦ Metadata • Assign responsibility for data quality ◦ Be sure assigned person is educated in QA/QC
  • 9. Quality Assurance v. Control • QA: set of processes, procedures, and activities that are initiated prior to data collection to ensure the expected level of quality will be reached and data integrity will be maintained. • QC: a system for verifying and maintaining a desired level of quality in a product or service. http://c2.com/cgi/wiki?QualityAssuranceIsNotQualityC ontrol
  • 10. Quality Assurance in Practice • CRF (data collection instrument) review & validation • System/process testing & validation • Training, education, communication of a team • Standard Operating Procedures, Standard Operating Guidelines • Site audits Dept of Biostatistics – Data Management, IUSM
  • 11. Quality Control in Practice • Set of processes, procedures, and activities associated with monitoring, detection, and action during and after data collection. • Examples: – Errors in individual data fields – Systematic errors – Violation of protocol – Staff performance issues – Fraud or scientific misconduct Dept of Biostatistics – Data Management, IUSM
  • 12. Activity Define data quality standards for the following variables: • Age • Height • BMI • Life satisfaction scale • Number of close friends Don’t forget to upload this to Box. Suggested file name “Data Quality Standards”
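One way to make data quality standards like these enforceable is to write them down as machine-checkable rules. The sketch below is only an illustration: the field names, types, and ranges are assumptions chosen for the example, not values given in the slides, and a real study would set them from its own protocol.

```python
# Hypothetical data quality standards for the activity's variables.
# Every field name and range here is an illustrative assumption.
STANDARDS = {
    "age": {"type": int, "min": 0, "max": 120},
    "height_cm": {"type": (int, float), "min": 50, "max": 250},
    "bmi": {"type": (int, float), "min": 10, "max": 70},
    "life_satisfaction": {"type": int, "min": 0, "max": 10},
    "close_friends": {"type": int, "min": 0, "max": 50},
}


def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, rule in STANDARDS.items():
        value = record.get(field)
        if value is None:
            # Missing values are reported, not silently skipped.
            problems.append(f"{field}: missing")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: wrong type {type(value).__name__}")
            continue
        if not rule["min"] <= value <= rule["max"]:
            problems.append(
                f"{field}: {value} outside {rule['min']}-{rule['max']}"
            )
    return problems
```

Calling `check_record` on each row as it is entered would flag problems at the point of entry, before they reach the analysis dataset.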
  • 13. References 1. Department of Biostatistics – Data Management Team, Indiana University School of Medicine (2013). Data Management including REDCap. (provided via email) 2. Chapman, A. D. 2005. Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. ISBN 87-92020-03-8. http://www.gbif.org/resources/2829 3. DataONE Education Module: Data Quality Control and Assurance. DataONE. From http://www.dataone.org/sites/all/documents/L05_DataQualityControlAssurance.pptx
  • 15. LEARNING OUTCOMES • Describe key considerations for selecting data collection tools.
  • 17. Choose your tools wisely Allie Brosh, 2010
  • 18. Activity Draft data collection instrument See document “DataMgmtLab-Spr14-CollectionCodingEntry_EX” Don’t forget to upload this to Box. Suggested file name “Data Collection Tool”
  • 19. References 1. Brosh, A. 2010. Boyfriend doesn’t have ebola. Probably. http://hyperboleandahalf.blogspot.com/2010/02/boyfriend-doesnt-have-ebola-probably.html
  • 20. DATA CODING & ENTRY MODULE 3
  • 21. LEARNING OUTCOMES • Use best practices for coding. • Use best practices for data entry.
  • 23. Goals of Data Entry • Publishable results! – Valid data that are organized to support smooth analysis • Easy to import into analytical program • Minimize manipulations and errors • Has a logical [data] structure
  • 25. Activity Draft data coding scheme for data entry • Review data entry best practices document in Box Don’t forget to upload this to Box. Suggested file name “Coding Scheme”
  • 26. References 1. DataONE Education Module: Data Entry and Manipulation. DataONE. From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx 2. Tilmes, C. (2011). Data Management 101 for the Earth Scientist presented at the AGU Workshop. From http://wiki.esipfed.org/index.php/2011AGUworkshop 3. Scott, T. (2012). Guidelines to Data Collection and Data Entry, Vanderbilt CRC Research Skills Workshop Series. From http://www.mc.vanderbilt.edu/gcrc/workshop_files/2012-09-07.pdf
  • 27. DATA SCREENING & CLEANING MODULE 3
  • 28. LEARNING OUTCOMES • Develop a screening and cleaning protocol and/or checklist.
  • 29. Data Entry and Manipulation Data Contamination • Process or phenomenon, other than the one of interest, that affects the variable value • Erroneous values CC image by Michael Coghlan on Flickr
  • 30. Data Entry and Manipulation • Errors of Commission ◦ Incorrect or inaccurate data entered ◦ Examples: malfunctioning instrument, mistyped data • Errors of Omission ◦ Data or metadata not recorded ◦ Examples: inadequate documentation, human error, anomalies in the field CC image by Nick J Webb on Flickr
  • 31. Data Entry and Manipulation • Double entry ◦ Data keyed in by two independent people ◦ Check for agreement with computer verification • Record a reading of the data and transcribe from the recording • Use text-to-speech program to read data back CC image by weskriesel on Flickr
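The "computer verification" step of double entry can be sketched as a field-by-field comparison of the two keyed datasets. This is a minimal illustration, assuming each dataset is a list of dictionaries that share an identifier field; the function name `compare_entries` and the `id` key are hypothetical.

```python
def compare_entries(first, second, key="id"):
    """Compare two independently keyed datasets and list disagreements.

    Each dataset is a list of dicts; `key` names the field that
    identifies a record (an assumption made for this sketch).
    """
    by_key_a = {row[key]: row for row in first}
    by_key_b = {row[key]: row for row in second}
    mismatches = []
    for record_id in sorted(by_key_a.keys() | by_key_b.keys()):
        a = by_key_a.get(record_id)
        b = by_key_b.get(record_id)
        if a is None or b is None:
            # One person keyed a record that the other did not.
            mismatches.append((record_id, None, "record missing in one entry"))
            continue
        for field in sorted(a.keys() | b.keys()):
            if a.get(field) != b.get(field):
                mismatches.append((record_id, field, (a.get(field), b.get(field))))
    return mismatches
```

Each reported mismatch would then be resolved against the original paper form or instrument reading, not by picking one of the two entries arbitrarily.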
  • 32. Data Entry and Manipulation • Design data storage well ◦ Minimize the number of times an item must be entered repeatedly ◦ Use consistent terminology ◦ Atomize data: one cell per piece of information • Document changes to data ◦ Avoids duplicate error checking ◦ Allows undo if necessary
  • 33. Data Entry and Manipulation • Make sure data line up in proper columns • No missing, impossible, or anomalous values • Perform statistical summaries CC image by chesapeakeclimate on Flickr
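The screening checks above (missing values, impossible values, statistical summaries) can be combined into one pass over a column. The following is a sketch under stated assumptions: missing values are represented as `None`, and the plausible range (`low`, `high`) is supplied by the analyst; neither convention comes from the slides.

```python
import statistics


def screen_column(values, low, high):
    """Summarize one numeric column and flag missing or out-of-range rows.

    `values` uses None for missing entries; `low` and `high` are the
    analyst's assumed plausible bounds for the variable.
    """
    missing = [i for i, v in enumerate(values) if v is None]
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column has no recorded values")
    impossible = [
        i for i, v in enumerate(values)
        if v is not None and not low <= v <= high
    ]
    return {
        "n": len(present),
        "missing_rows": missing,
        "impossible_rows": impossible,
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }
```

Comparing the reported minimum, maximum, and median against what is expected for the variable is a quick way to spot columns that were shifted or mis-keyed.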
  • 34. Data Entry and Manipulation • Look for outliers ◦ Outliers are extreme values for a variable given the statistical model being used ◦ The goal is not to eliminate outliers but to identify potential data contamination
  • 35. Data Entry and Manipulation • Methods to look for outliers ◦ Graphical • Normal probability plots • Regression • Scatter plots ◦ Maps ◦ Subtract values from mean
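The "subtract values from mean" method can be sketched by expressing each deviation in units of the sample standard deviation and flagging large ones for review. The threshold of 2.0 is an arbitrary assumption for illustration, and this simple approach presumes roughly normal data; flagged values are candidates to investigate, not to delete.

```python
import statistics


def flag_outliers(values, threshold=2.0):
    """Return (row index, value) pairs lying far from the mean.

    Distance is measured in sample standard deviations; `threshold`
    is an illustrative assumption, not a recommended cutoff.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        # All values identical: nothing can be an outlier.
        return []
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mean) / sd > threshold
    ]
```

Because a single extreme value inflates both the mean and the standard deviation, graphical checks such as scatter plots remain a useful complement to this kind of numeric screen.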
  • 36. Data Entry and Manipulation • Data contamination occurs when a factor not examined by the study alters data values • Data error types: commission or omission • Quality assurance and quality control are strategies for ◦ preventing errors from entering a dataset ◦ ensuring data quality for entered data ◦ monitoring and maintaining data quality throughout the project • Identify and enforce quality assurance and quality control measures throughout the Data Life Cycle
  • 37. Discussion Using the Data Review Checklist, evaluate the HBSC codebook “DataMgmtLab-Spr14_DataReviewChecklist_EX” What screening & cleaning procedures were used?
  • 38. Data Entry and Manipulation 1. D. Edwards, in Ecological Data: Design, Management and Processing, WK Michener and JW Brunt, Eds. (Blackwell, New York, 2000), pp. 70-91. Available at www.ecoinformatics.org/pubs 2. R. B. Cook, R. J. Olson, P. Kanciruk, L. A. Hook, Best practices for preparing ecological data sets to share and archive. Bull. Ecol. Soc. Amer. 82, 138-141 (2001). 3. A. D. Chapman, “Principles of Data Quality: Report for the Global Biodiversity Information Facility” (Global Biodiversity Information Facility, Copenhagen, 2004). Available at http://www.gbif.org/communications/resources/print-and-online-resources/download-publications/bookelets/
  • 39. References 1. Cook, 2013, NACP Best Data Management Practices Workshop. From http://daac.ornl.gov/NACP_AIM_2013/04_data_management_cook_2013.02.03.ppt 2. Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A survey of data provenance in e-Science. SIGMOD Record, 34(3), 31-36. From http://www.sigmod.org/publications/sigmod-record/0509/p31-special-sw-section-5.pdf 3. Ram, S. (2012). Emerging Role of Social Media in Data Sharing and Management. From http://www.slideshare.net/INSITEUA/provenance-management-to-enable-data-sharing
  • 41. LEARNING OUTCOMES • Explain why automation provides better provenance than manual processes. • Identify effective tools for automating data processing and analysis.
  • 42. Choose your tools wisely • Documents • Excel • Access • SPSS, Minitab • Mathematica, MATLAB, Scilab • SAS, Stata • R • MapReduce • NVivo, Atlas.ti, Dedoose, HyperRESEARCH, etc. http://www.dataone.org/all-software-tools
  • 43. Data Formats; Version 1.0 Overview • Spreadsheets are amazingly flexible, and are commonly used for data collection, analysis and management • Spreadsheets are seldom self-documenting, and seldom well-documented • Subtle (and not so subtle) errors are easily introduced during entry, manipulation and analysis • Spreadsheet conventions – often ad hoc and evolutionary – may change or be applied inconsistently • Spreadsheet file formats are proprietary and thus generally unacceptable for long-term archival purposes
  • 44. Data Entry and Manipulation • Spreadsheets ◦ Great for charts, graphs, calculations ◦ Flexible about cell content type: cells in same column can contain numbers or text ◦ Lack record integrity (can sort a column independently of all others) ◦ Easy to use, but harder to maintain as complexity and size of data grows • Databases ◦ Easy to query to select portions of data ◦ Data fields are typed (for example, only integers are allowed in integer fields) ◦ Columns cannot be sorted independently of each other ◦ Steeper learning curve than a spreadsheet
  • 45. NACP Best Data Management Practices, February 3, 2013 5. Preserve information (cont) • Use a scripted language to process data – R Statistical package (free, powerful) – SAS – MATLAB • Processing scripts are records of processing – Scripts can be revised, rerun • Graphical User Interface-based analyses may seem easy, but don’t leave a record
  • 46. Provenance, Audit Trails, etc. • “…information that helps determine the derivation history of a data product, starting from its original sources.” (Simmhan et al, 2005) – Ancestral data products from which the data evolved – Process of transformation of these ancestral data products • Uses: data quality, audit trail, replication recipe, attribution, informational
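To illustrate how a script can double as a provenance record, the sketch below applies named processing steps in order and logs a fingerprint of the input plus the row counts before and after each step. Everything here is a hypothetical example: the function names, the log layout, and the two sample steps are assumptions, not part of any cited tool.

```python
import hashlib
import json
from datetime import datetime, timezone


def run_with_provenance(raw_rows, steps):
    """Apply (name, function) steps in order and record what was done."""
    fingerprint = hashlib.sha256(
        json.dumps(raw_rows, sort_keys=True).encode("utf-8")
    ).hexdigest()
    log = {
        "started": datetime.now(timezone.utc).isoformat(),
        "input_sha256": fingerprint,  # identifies the ancestral data
        "steps": [],                  # the transformation history
    }
    rows = raw_rows
    for name, func in steps:
        rows_in = len(rows)
        rows = func(rows)
        log["steps"].append(
            {"step": name, "rows_in": rows_in, "rows_out": len(rows)}
        )
    return rows, log


def drop_missing_age(rows):
    """Hypothetical step: exclude records with no recorded age."""
    return [r for r in rows if r.get("age") is not None]


def add_age_group(rows):
    """Hypothetical step: derive a grouping variable, leaving inputs untouched."""
    return [
        dict(r, age_group="under 15" if r["age"] < 15 else "15 and over")
        for r in rows
    ]
```

Because each step returns new rows instead of editing the raw data in place, the original records stay available, and saving the log next to the output gives a replication recipe that can be rerun or revised.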
  • 47. More Considerations • Field names & descriptions • Structured entry • Validation • Record integrity • Missing data • Data/field types • File types: common, open documented standard • Output required for analysis and visualization
  • 48. Demonstration & Discussion Run [analysis] in Excel and Stata. Compare output. • What features does Stata have that Excel does not? • How do these features support provenance and data integrity?
  • 49. References 1. DataONE Education Module: Data Entry and Manipulation. DataONE. From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx