Data Governance in Two Different Data Archives: When
is a Federal Data Repository Useful?
Greg Farber
Director, Office of Technology Development and Coordination
National Institute of Mental Health
National Institutes of Health
March 2018
1) Most research subjects want their data to be used to understand
disease broadly. They are not too concerned about how researchers
use their data.
2) The diseases we are trying to understand today are complex, meaning
that the same symptoms can have many different underlying biological
causes. Except in the cases where a deeply penetrant point mutation
uncovers a single biological pathway to a disease, understanding the
“subgroups” for complex diseases requires data from large populations
who have similar symptoms.
3) Differences in data sharing laws in different countries make it difficult
or impossible to move data across international borders. Federating
data archives that are storing data in a similar way provides an
inelegant but workable solution to this problem.
4) Despite the urgent need to aggregate data to understand complex
diseases, individual consents and local laws must be respected.
Guiding Ideas
Policy Considerations can be Manipulated to Become
an Excuse Not to Share Data
• Contrast two data archives that have built the infrastructure necessary
to aggregate data on complex diseases.
• NIMH Data Archive (NDA) Overview
▪ Federal Data Repository where the data are owned by the US National
Institutes of Health
▪ Infrastructure
▪ Policy Issues
• Human Connectome Program (HCP)
▪ Large NIH funded project
▪ Access to most data was by self certification
▪ Initial Data Distribution was through Washington University
Roadmap
• Stores data from experiments involving human subjects that are
deposited by research laboratories.
▪ Federal data repository
▪ Originally contained data from human subjects related to mental illness
(and control subjects), but that has expanded in a number of ways over the
past 12 months. Most subjects have consented to broad data sharing.
▪ Data are available to the research community through a not too difficult
application process.
▪ Both submission and access to subject level data require approval of an
institutional official.
▪ Summary data are available to everyone with a browser
(https://data-archive.nimh.nih.gov/)
• Begun in late 2006; the first data were received in 2008
• The data types include demographic data, clinical assessments,
imaging data, and –omic data. There are no formal limits to the types
of data that can be stored in NDA.
NIMH Data Archive
• The NDA currently makes data available to the research community
from 200,000 subjects. Additional data are held by the NDA but are
not yet ready for sharing because the grant is still active and/or no
papers have yet been published.
• Many subjects have longitudinal data.
• ~1.1 PB of imaging and –omic data is securely stored in the Amazon
cloud.
• Currently, the NDA does not contain any personally identifiable
information, but we expect to begin holding such data in the near
future (data from mobile devices).
▪ This change will likely require that NDA verify that the use of the data has
been approved by an Institutional Review Board.
NIMH Data Archive – Current Size and Scope
• It is best to think of NDA as a large (~182,000 data elements by
~200,000 people), sparse, two dimensional matrix.
NDA Structure – Rows and Columns are the Building
Blocks
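One way to picture the sparse matrix described above is a dictionary-of-keys layout, where only the measurements that were actually collected take up space. The sketch below is illustrative only: the class name, the element names, and the identifiers are hypothetical and are not taken from NDA.

```python
from collections import defaultdict


class SparseSubjectMatrix:
    """Dictionary-of-keys sparse matrix: rows are subjects, columns are data elements."""

    def __init__(self):
        # Outer key: subject identifier. Inner key: data element name.
        self._rows = defaultdict(dict)

    def set(self, guid, element, value):
        self._rows[guid][element] = value

    def get(self, guid, element, default=None):
        # Use .get on the outer mapping so lookups never create empty rows.
        return self._rows.get(guid, {}).get(element, default)

    def subjects_with(self, element):
        """Return the sorted identifiers of subjects that have a value for `element`."""
        return sorted(g for g, row in self._rows.items() if element in row)

    def density(self, n_elements):
        """Fraction of cells filled, given the total number of columns."""
        filled = sum(len(row) for row in self._rows.values())
        total = len(self._rows) * n_elements
        return filled / total if total else 0.0


# Hypothetical subjects and elements, for illustration only.
matrix = SparseSubjectMatrix()
matrix.set("GUID_A", "age_months", 412)
matrix.set("GUID_A", "iq_total", 104)
matrix.set("GUID_B", "age_months", 388)

# A dense table at the scale quoted on the slide would need this many cells.
dense_cells = 182_000 * 200_000
```

Because most subjects have values for only a small fraction of the data elements, storing only the filled cells keeps the structure manageable at this scale.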
• The NDA data dictionary is one of the key building blocks for this
repository. It provides a flexible and extensible framework for data
definition by the research community.
• 2,000+ instruments, freely available to anyone
▪ 180,000+ unique data elements and growing
▪ Data dictionaries describing
• Clinical
• Genomics/Proteomics
• MRI Modalities
• Other complex data (EEG, eye tracking)
• Accommodates any data type and data structure
• Describes the data collected by the research community
Data Dictionary – The First Building Block
• Curated by NDA (this takes a lot of time)
• Data held in different archives needs to use common data
dictionaries to allow deep federation.
• The associated validation tool allows investigators to
quickly perform quality control tests of their data without
submitting data anywhere.
• Data in archives that don’t have a similar QC step are
likely to have issues.
• Both to enhance the quality of the science and to ensure
that the time and effort that research subjects are
spending in our research protocols are not wasted, the
validation tool should be run frequently (daily, weekly).
This is common practice in many other domains.
Data Dictionary – The First Building Block
• The NDA GUID software allows any researcher
to generate a unique identifier using some
information from a birth certificate.
• If the same information is entered in different
laboratories, the same GUID will be generated.
• This strategy allows NDA to aggregate data on
the same subject collected in multiple
laboratories without holding any of the personally
identifiable information about that subject.
• The GUID is now being discussed in a number of
additional research communities. We think we
have a reasonable plan to prevent a GUID from
becoming something like a social security
number (which would be identifying in itself)
• External studies indicate that the GUID
implementation is pretty robust both to false
positives and false negatives in large
populations.
Global Unique Identifier – the Other Building Block
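The property described above, that the same inputs entered in different laboratories yield the same identifier without the archive holding the inputs, can be illustrated with a keyed one-way hash. This is only a sketch of the idea and is not the NDA GUID algorithm, which the slides do not describe. The field choice, the `DEMO_` prefix, the shared key, and the function names are all assumptions.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Remove accents, case, and whitespace so trivial typing differences do not matter."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(stripped.upper().split())


def demo_identifier(first_name, last_name, birth_date, birth_city, site_key: bytes) -> str:
    """Derive a deterministic pseudonymous identifier from birth-certificate style fields.

    Hypothetical scheme: an HMAC-SHA256 over the normalized fields. Only the
    resulting identifier would leave the laboratory, not the inputs.
    """
    fields = (first_name, last_name, birth_date, birth_city)
    payload = "|".join(normalize(f) for f in fields)
    digest = hmac.new(site_key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return "DEMO_" + digest[:16].upper()


# Two laboratories enter the same person with different spacing, case, and accents.
KEY = b"hypothetical-shared-key"
lab_a = demo_identifier("José", "Rivera", "1990-04-12", "San Juan", KEY)
lab_b = demo_identifier(" jose ", "RIVERA", "1990-04-12", "san  juan", KEY)
other = demo_identifier("Josefa", "Rivera", "1990-04-12", "San Juan", KEY)
```

Here `lab_a` and `lab_b` are equal while `other` differs, which is the behavior needed to aggregate one subject's data across laboratories. A production system would also need to handle data-entry errors and guard against guessing attacks on the inputs, which this sketch does not attempt.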
Federation – The GUID Does Work
At this point, data has been received from the laboratory
that measured the data. Each subject has a GUID or a
pseudo-GUID. A data dictionary has been defined,
and the submitted data have been validated against that
definition.
How does an outside user find data they are interested
in?
An Example of Data Associated with a Particular Laboratory
Now assigning DOIs to
each study, and we can
track how often a DOI
link is clicked (the start
of a data citation)
Results in 750 subjects
being discovered
• Assertion: Any consent language that restricts the use of the data
for particular purposes (for autism research…) results in profound
negative consequences.
• For example, if a researcher is trying to aggregate data between
subjects with schizophrenia and autism to understand common
symptoms that are observed in the two diagnostic groups, a consent that
limited a dataset for use only to understand one of those diagnostic
conditions would probably mean the data is not accessible for a
comparison study.
• Restricted data are also probably off limits for those who are trying to
use data mining techniques to develop or substantiate a hypothesis.
• There are some cases where restrictive consents might be appropriate,
but this should be the rare exception.
Policy – Consents
• NIMH expects that
research we pay for
involving human subjects
will result in that data
being made available in
NDA.
• Journals can also have a
positive role to play in
requiring that data be
placed in a repository
prior to publication.
• Asking for data volunteers
probably isn’t good
enough right now.
Policy – Data Deposition
• Summary data are available to anyone via the web site, but
accessing subject level data requires a data access form.
• Similarly, a data submission agreement is required that
certifies that the data were consented for sharing.
• Both forms require the signature from the PI and an
institutional official. This means that the research
institution is formally responsible for ensuring that the
data are “treated with respect”.
• Although neither form is complicated, they do raise barriers
to accessing the data.
Policy – NDA Data Access and Data Submission
• The NIMH Data Archive does hold some data that were
collected outside the US.
• For those datasets, the institutional official has decided that
depositing data is allowed both by the terms of the informed
consent and by the laws in that country.
• When there are restrictions to allowing data to be moved, it
is still possible to make it easy for the research community
to find data by federating data archives.
Policy – Data from Institutions Outside the US
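The federation idea above can be sketched as a query that travels to each archive and brings back only summary counts, so that subject-level records never cross a border. The classes, archive names, and data element below are hypothetical. The sketch assumes every archive describes its data with the same data dictionary.

```python
from typing import Callable, Dict, List


class Archive:
    """A data archive that answers count queries without releasing its records."""

    def __init__(self, name: str, records: List[dict]):
        self.name = name
        self._records = records  # subject-level data stays inside the archive

    def count_matching(self, element: str, predicate: Callable[[object], bool]) -> int:
        """Count local records that have `element` and satisfy `predicate`."""
        return sum(
            1 for record in self._records
            if element in record and predicate(record[element])
        )


def federated_count(archives: List[Archive], element: str,
                    predicate: Callable[[object], bool]) -> Dict[str, int]:
    """Run the same query at every archive and collect only the per-archive counts."""
    return {a.name: a.count_matching(element, predicate) for a in archives}


# Hypothetical archives in two countries sharing one data dictionary.
archive_us = Archive("archive_us", [{"age_months": 120}, {"age_months": 300}])
archive_eu = Archive("archive_eu", [{"age_months": 110}, {"diagnosis": "asd"}])

counts = federated_count([archive_us, archive_eu], "age_months", lambda v: v < 216)
```

A researcher who sees a useful count at a remote archive can then apply to that archive directly under its own consent terms and local law.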
• For NDA, submitting data is separate from sharing that data
with the research community.
• Data are shared when the grant is complete or when a
paper is published.
• Other sharing timelines are possible.
• No matter when the data are shared, data need to be
submitted on a regular basis. This ensures that the data
from a grant award has been submitted before funding is
exhausted. More importantly, periodic data submission
ensures that the data have undergone basic QC checks as
they are collected.
Policy – Timeline for Data Sharing
• Responding to a number of instances of high visibility/impact
experiments that were not thoughtfully designed, NIH (and
NIMH) have instituted a number of programs to enhance
rigor and reproducibility in research supported by NIH.
• These discussions with the community started in June 2012.
The new guidelines to increase rigor and reproducibility are
outlined in NOT-OD-15-103 and at a web site
(https://www.nih.gov/research-training/rigor-reproducibility).
• Data archives play an important role in improving the rigor
and reproducibility of NIMH funded research.
Rigor and Reproducibility – Data Archives Help
• Data dictionaries are a key part of the NDA infrastructure. Each item in a data
dictionary has an allowable range of values. The NDA has a validation tool that
allows users to check a data set to see if it conforms with the allowable ranges and
formats in a data dictionary.
• Because of our mandated data deposition schedule, the validation tool allows labs
to find errors every 6 months when data are deposited (or more often if they
choose).
Rigor and Reproducibility - 2
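The kind of check described above, comparing each value in a data set with the allowable range and format in a data dictionary, might look like the following sketch. It is not the NDA validation tool. The element names, types, and ranges are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ElementDef:
    """One data dictionary entry: expected type, allowable range, and whether it is required."""
    kind: type
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    required: bool = False


def validate(record: dict, dictionary: Dict[str, ElementDef]) -> List[str]:
    """Return human-readable errors for one record (an empty list means it conforms)."""
    errors = []
    # Flag values for elements the dictionary does not define.
    for name in record:
        if name not in dictionary:
            errors.append(f"{name}: not defined in the data dictionary")
    # Check every defined element for presence, type, and range.
    for name, definition in dictionary.items():
        value = record.get(name)
        if value is None:
            if definition.required:
                errors.append(f"{name}: required value is missing")
            continue
        if not isinstance(value, definition.kind):
            errors.append(f"{name}: expected {definition.kind.__name__}")
            continue
        if definition.minimum is not None and value < definition.minimum:
            errors.append(f"{name}: {value} is below minimum {definition.minimum}")
        if definition.maximum is not None and value > definition.maximum:
            errors.append(f"{name}: {value} is above maximum {definition.maximum}")
    return errors


# Hypothetical dictionary and records, for illustration only.
DICTIONARY = {
    "subject_id": ElementDef(str, required=True),
    "age_months": ElementDef(int, minimum=0, maximum=1440, required=True),
}

good = validate({"subject_id": "DEMO_1", "age_months": 130}, DICTIONARY)
bad = validate({"subject_id": "DEMO_2", "age_months": 5000, "mood": 3}, DICTIONARY)
```

Running a check like this locally, before anything is submitted, is what lets a laboratory catch errors while the data are still being collected.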
• The NDA makes it easy to identify the data associated with a publication, and we
assign a DOI to that dataset to make it trivial for the research community to find the
data.
• Identifying the data from a publication allows researchers to look at all of the data
collected under an award and compare that to the data used in the publication.
Rigor and Reproducibility - 3
• There are “professional” research participants who seem to make a
living volunteering for clinical studies. Websites exist that make it
relatively easy for such participants to find out the right answers to
screening questions to be admitted to a study. Clearly, this can be
dangerous to the volunteer and can also put the rigor and reproducibility
of the study at risk.
• Recruitment in certain diagnostic categories may take place in a small
number of clinical centers. This means that papers from many different
research groups may be sampling from a smaller population than the
“independent” papers might suggest.
• The NDA GUID helps the research community understand the size of
these problems and deal with these issues.
• There are also commercial services that aid in the screening for
someone who is participating in multiple clinical trials.
GUIDs and Rigor/Reproducibility
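A shared identifier makes the overlap problem described above measurable: comparing the sets of GUIDs reported by different studies shows how many distinct people sit behind the combined enrollments. The sketch below uses hypothetical study names and identifiers.

```python
from itertools import combinations
from typing import Dict, Set


def overlap_report(studies: Dict[str, Set[str]]) -> dict:
    """Compare total enrollments with distinct subjects and list pairwise overlaps."""
    enrollments = sum(len(guids) for guids in studies.values())
    unique_subjects = set().union(*studies.values())
    shared = {}
    # Examine every pair of studies once, in a stable order.
    for first, second in combinations(sorted(studies), 2):
        common = studies[first] & studies[second]
        if common:
            shared[(first, second)] = sorted(common)
    return {
        "enrollments": enrollments,
        "unique_subjects": len(unique_subjects),
        "shared": shared,
    }


# Hypothetical GUID sets reported by three laboratories.
report = overlap_report({
    "lab_1": {"G1", "G2", "G3"},
    "lab_2": {"G3", "G4"},
    "lab_3": {"G5"},
})
```

In this example the three laboratories report six enrollments but only five distinct subjects, because one subject appears in both `lab_1` and `lab_2`.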
NIH/NIMH Data Archives Staff
1) NIMH Data Archive – very heterogeneous data collected in multiple
laboratories. NDA attempts to aggregate this data using a global
unique identifier system as well as data dictionaries to describe the
myriad experiments.
2) Human Connectome Program – heterogeneous data (clinical
assessments, imaging, MEG, genomics) collected using a common
protocol. The first phase of this project involved data collection from
typical research subjects at a single site. The project has recently
expanded to include data collected across the lifespan for control
subjects as well as from subjects with a diagnosis. Those datasets are
collected at multiple laboratories, but still use similar data collection
protocols.
Two Different Sorts of Data Archives
• The NIH Human Connectome Project (HCP) is supported by the NIH
Neuroscience Blueprint ICs
• The HCP is an ambitious effort to map the neural pathways that underlie
human brain function. The overarching purpose of the Project is to
acquire and share data about the structural and functional connectivity of
the human brain. It has greatly advanced the capabilities for imaging and
analyzing brain connections, resulting in improved sensitivity, resolution,
and utility, thereby accelerating progress in the emerging field of human
connectomics.
• Phase 1 of the HCP resulted in two awards
■ David Van Essen and Kamil Ugurbil, Wash U and U Minnesota
■ Bruce Rosen, MGH
Human Connectome Project
1) Deliver advanced MRI scanners and
techniques with high spatial and
temporal resolution for functional
and diffusion MRI.
■ Both the MGH and the Wash U MRIs
worked as designed and were able
to collect data quickly. Siemens
learned a great deal from
collaborating on both instruments,
and their new family of 3T MRIs (the
Prisma) has operating characteristics
similar to the Wash U scanner.
■ The supplements to port the pulse
sequences to other laboratories and
to other manufacturers have also been
successful.
Phase 1 Connectome Accomplishments
2) Deliver high quality data to the research community
■ Wash U has released data from 1200 subjects. This includes
behavioral assessments, structural MRIs, rs fMRI, task fMRI, and
diffusion experiments. MEG data has also just been released. This is
the first time that a large imaging award adopted “genome speed” data
release.
■ Data from MGH are being made available on their web site as well as
at the Wash U web site.
■ The data are being widely used by the research community. More
than 100 papers cited the Wash U grant at the point where data
collection was only half complete.
■ High visibility papers have appeared. Researchers from outside the
WU-Minn collaboration have authored some of those papers.
Phase 1 Connectome Accomplishments
• Based on the results from the original connectome project (which
collected data on 22-34 year old healthy subjects), NIH decided to fund
awards for a lifespan connectome. Three awards have been made that
will cover the age range from birth to the oldest old (90+).
• In addition, NIH has funded 14 awards to measure connectomics on
groups that have some sort of diagnosis (Alzheimer's, low vision,
dementia, epilepsy, mood and anxiety disorders, psychosis, …).
• Over 8,000 subjects are participating and nearly 12,000 scans are
expected in the data infrastructure by 2021. Phenotypic and clinical
assessments as well as other non-MRI data are being collected and
made available.
• In addition, the Adolescent Brain Cognitive Development (ABCD) study
has chosen to use the connectome data collection protocol. That study
intends to enroll 10,000 children aged 9-10 and follow them into early
adulthood. This dataset requires a data access agreement.
Connectome Today
• A Connectome Coordination Facility has been created to hold all of the
data (https://www.humanconnectome.org/).
• The original HCP consents allowed almost unlimited access to the data
(clinical and phenotypic data as well as the MRI and MEG data).
• An individual who wanted data simply enters a working web site into the
registration system and certifies that they will not attempt to re-identify
any of the research participants.
• Many of the original participants are part of the Missouri twin study. This
caused some of the measured data (family structure, substance use) to
be declared sensitive. The sensitive data had a more restrictive data
access protocol.
HCP Data
ConnectomeDB
Moving to NIMH Data Archive (NDA)
ConnectomeDB
ConnectomeDB – Widespread Data Usage
• Clearly resulted in a lot of data use – transfers, papers, …
• Open access probably helped the community to adopt HCP data
collection as the current standard
• Even in this open access data set, there is still some information that is
sensitive and requires approval. When the data were at Wash U, the
only penalty for misusing the data was loss of further access to the data.
• No penalties were ever imposed for mistreating data.
• ABCD early data availability (needs DAC) seems to be around the same
level as HCP – does this mean that researchers will do what it takes to
get good data?
• Probably the key question to think about when deciding between open
access and a more restrictive model is what penalties need to be
imposed if the data are not treated in accord with the data access
agreement.
HCP – Open Access
The “WU-Minn” HCP consortium of the initial HCP Dataset
• Understanding complex diseases needs lots of different data
from a variety of sources.
• Informed consents, national laws concerning data sharing,
and investigator preferences can all restrict the aggregation
of data.
• All of those issues can be solved, with some effort.
• If you have an option, deciding whether to share data under a
very open access model or in a federal database should be
made based on what needs to happen if the data are not
treated appropriately.
• Even though it is easier to get data from an open repository,
early results from the ABCD project suggest that users will
take the steps needed to get access to high quality data.
Summary

 
PhRMA Some Early Thoughts
PhRMA Some Early ThoughtsPhRMA Some Early Thoughts
PhRMA Some Early Thoughts
 
Federal funder mandates
Federal funder mandatesFederal funder mandates
Federal funder mandates
 

Data Governance in two different data archives: When is a federal data repository useful

  • 1. Data Governance in Two Different Data Archives: When is a Federal Data Repository Useful? Greg Farber Director, Office of Technology Development and Coordination National Institute of Mental Health National Institutes of Health March 2018
  • 2. 1) Most research subjects want their data to be used to understand disease broadly. They are not too concerned about how researchers use their data. 2) The diseases we are trying to understand today are complex, meaning that the same symptoms can have many different underlying biological causes. Except in cases where a deeply penetrant point mutation uncovers a single biological pathway to a disease, understanding the “subgroups” for complex diseases requires data from large populations who have similar symptoms. 3) Differences in data sharing laws in different countries make it difficult or impossible to move data across international borders. Federating data archives that store data in a similar way provides an inelegant but workable solution to this problem. 4) Despite the urgent need to aggregate data to understand complex diseases, individual consents and local laws must be respected. Guiding Ideas 2
  • 3. Policy Considerations can be Manipulated to Become an Excuse Not to Share Data 3
  • 4. • Contrast two data archives that have built the infrastructure necessary to aggregate data on complex diseases. • NIMH Data Archive (NDA) Overview ▪ Federal Data Repository where the data are owned by the US National Institutes of Health ▪ Infrastructure ▪ Policy Issues • Human Connectome Program (HCP) ▪ Large NIH funded project ▪ Access to most data was by self certification ▪ Initial Data Distribution was through Washington University Roadmap
  • 5. • Stores data from experiments involving human subjects that are deposited by research laboratories. ▪ Federal data repository ▪ Originally contained data from human subjects related to mental illness (and control subjects), but that has expanded in a number of ways over the past 12 months. Most subjects have consented to broad data sharing. ▪ Data are available to the research community through a not too difficult application process. ▪ Both submission of and access to subject level data require approval of an institutional official. ▪ Summary data are available to everyone with a browser (https://data-archive.nimh.nih.gov/) • Begun in late 2006; the first data were received in 2008 • The data types include demographic data, clinical assessments, imaging data, and –omic data. There are no formal limits to the types of data that can be stored in NDA. NIMH Data Archive
  • 6.
  • 7. • The NDA currently makes data available to the research community from 200,000 subjects. Additional data are held by the NDA but are not yet ready for sharing because the grant is still active and/or has not published papers. • Many subjects have longitudinal data. • ~1.1 PB of imaging and –omic data is securely stored in the Amazon cloud. • Currently, the NDA does not contain any personally identifiable information, but we expect to begin holding such data in the near future (data from mobile devices). ▪ This change will likely require that NDA verify that the use of the data has been approved by an Institutional Review Board. NIMH Data Archive – Current Size and Scope
  • 8. • It is best to think of NDA as a large (~182,000 data elements by ~200,000 people), sparse, two dimensional matrix. NDA Structure – Rows and Columns are the Building Blocks 8
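The "large, sparse, two dimensional matrix" picture can be sketched in a few lines of Python. This is only an illustration of the idea, not how NDA stores data: the class name, the subject labels, and the element names (`age_months`, `iq_total`) are all hypothetical. Rows are subjects, columns are data elements, and only the cells that were actually measured are kept.

```python
from collections import defaultdict


class SparseSubjectMatrix:
    """Rows are subjects, columns are data elements; only observed cells are stored."""

    def __init__(self):
        # subject_id -> {element_name: value}; missing cells take no space.
        self._rows = defaultdict(dict)

    def set(self, subject_id, element, value):
        self._rows[subject_id][element] = value

    def get(self, subject_id, element, default=None):
        # .get() avoids creating an empty row for an unknown subject.
        return self._rows.get(subject_id, {}).get(element, default)

    def subjects_with(self, element):
        """Return the subjects that have a value for the given data element."""
        return sorted(s for s, cells in self._rows.items() if element in cells)

    def density(self, n_subjects, n_elements):
        """Fraction of the full n_subjects x n_elements grid that is filled."""
        filled = sum(len(cells) for cells in self._rows.values())
        return filled / (n_subjects * n_elements)


# Hypothetical subjects and data elements, for illustration only.
matrix = SparseSubjectMatrix()
matrix.set("SUBJ_A", "age_months", 132)
matrix.set("SUBJ_A", "iq_total", 104)
matrix.set("SUBJ_B", "age_months", 98)
```

At the scale quoted on the slide (roughly 182,000 elements by 200,000 people), a dense grid would have tens of billions of cells, most of them empty, which is why a sparse layout is the natural mental model.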
  • 9. • The NDA data dictionary is one of the key building blocks for this repository. It provides a flexible and extensible framework for data definition by the research community. • 2,000+ instruments, freely available to anyone ▪ 180,000+ unique data elements and growing ▪ Data dictionaries describing • Clinical • Genomics/Proteomics • MRI Modalities • Other complex data (EEG, eye tracking) • Accommodates any data type and data structure • Describes the data collected by the research community Data Dictionary – The First Building Block
  • 10. • Curated by NDA (this takes a lot of time) • Data held in different archives need to use common data dictionaries to allow deep federation. • The associated validation tool allows investigators to quickly perform quality control tests of their data without submitting data anywhere. • Data in archives that don’t have a similar QC step are likely to have issues. • Both to enhance the quality of the science and to ensure that the time and effort that research subjects are spending in our research protocols are not wasted, the validation tool should be run frequently (daily, weekly). This is common practice in many other domains. Data Dictionary – The First Building Block
  • 11.
  • 12.
  • 13.
  • 14. • The NDA GUID software allows any researcher to generate a unique identifier using some information from a birth certificate. • If the same information is entered in different laboratories, the same GUID will be generated. • This strategy allows NDA to aggregate data on the same subject collected in multiple laboratories without holding any of the personally identifiable information about that subject. • The GUID is now being discussed in a number of additional research communities. We think we have a reasonable plan to prevent a GUID from becoming something like a social security number (which would be identifying in itself). • External studies indicate that the GUID implementation is fairly robust to both false positives and false negatives in large populations. Global Unique Identifier – the Other Building Block
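The key property described here (the same inputs entered in different laboratories yield the same identifier, and the identifier itself carries no readable personal information) can be illustrated with a one-way hash. The sketch below is not the NDA GUID algorithm: the choice of fields, the normalization rules, the `salt` value, and the `DEMO_` prefix are all assumptions made for illustration.

```python
import hashlib
import unicodedata


def normalize(value: str) -> str:
    """Strip accents, case, and whitespace so trivial entry differences vanish."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return "".join(text.upper().split())


def pseudo_guid(first_name, last_name, birth_date, birth_city, sex,
                salt="demo-salt"):
    """Derive a deterministic identifier from (hypothetical) birth-record fields.

    The hash is one-way: the identifier can be recomputed from the inputs,
    but the inputs cannot be read back out of the identifier.
    """
    fields = [first_name, last_name, birth_date, birth_city, sex]
    payload = "|".join(normalize(f) for f in fields)
    digest = hashlib.sha256((salt + "|" + payload).encode("utf-8")).hexdigest()
    return "DEMO_" + digest[:12].upper()


# Two laboratories entering the same person with different formatting.
lab_one = pseudo_guid("José", "Garcia", "1990-04-02", "St. Louis", "F")
lab_two = pseudo_guid(" jose ", "GARCIA", "1990-04-02", "st. louis", "f")
other_person = pseudo_guid("José", "Garcia", "1991-04-02", "St. Louis", "F")
```

A design like this only works if every site applies identical normalization before hashing; a robust system would also need to tolerate typos and guard against guessing attacks on the inputs, which this sketch does not attempt.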
  • 15. Federation – The GUID Does Work
  • 16. At this point, data have been received from the laboratory that measured the data. Each subject has a GUID or a pseudo-GUID. A data dictionary has been defined, and the submitted data have been validated against that definition. How does an outside user find data they are interested in?
  • 17.
  • 18. An Example of Data Associated with a Particular Laboratory
  • 19.
  • 20. Now assigning DOIs to each study, and we can track how often a DOI link is clicked (the start of a data citation)
  • 21.
  • 22. Results in 750 subjects being discovered
  • 23. • Assertion: Any consent language that restricts the use of the data for particular purposes (for autism research…) results in profound negative consequences. • For example, if a researcher is trying to aggregate data between subjects with schizophrenia and autism to understand common symptoms that are observed in the two diagnostic groups, a consent that limited a dataset for use only to understand one of those diagnostic conditions would probably mean the data is not accessible for a comparison study. • Restricted data are also probably off limits for those who are trying to use data mining techniques to develop or substantiate a hypothesis. • There are some cases where restrictive consents might be appropriate, but this should be the rare exception. Policy – Consents 23
  • 24. • NIMH expects that research we pay for involving human subjects will result in that data being made available in NDA. • Journals can also have a positive role to play in requiring that data be placed in a repository prior to publication. • Asking for data volunteers probably isn’t good enough right now. Policy – Data Deposition 24
  • 25. • Summary data are available to anyone via the web site, but accessing subject level data requires a data access form. • Similarly, a data submission agreement is required that certifies that the data were consented for sharing. • Both forms require the signature from the PI and an institutional official. This means that the research institution is formally responsible for ensuring that the data are “treated with respect”. • Although neither form is complicated, they do raise barriers to accessing the data. Policy – NDA Data Access and Data Submission 25
  • 26. • The NIMH Data Archive does hold some data that were collected outside the US. • For those datasets, the institutional official has decided that depositing data is allowed both by the terms of the informed consent and by the laws in that country. • When there are restrictions to allowing data to be moved, it is still possible to make it easy for the research community to find data by federating data archives. Policy – Data from Institutions Outside the US 26
  • 27. • For NDA, submitting data is separate from sharing that data with the research community. • Data are shared when the grant is complete or when a paper is published. • Other sharing timelines are possible. • No matter when the data are shared, data need to be submitted on a regular basis. This ensures that the data from a grant award has been submitted before funding is exhausted. More importantly, periodic data submission ensures that the data have undergone basic QC checks as they are collected. Policy – Timeline for Data Sharing 27
  • 28. • Responding to a number of instances of high visibility/impact experiments that were not thoughtfully designed, NIH (and NIMH) have instituted a number of programs to enhance rigor and reproducibility in research supported by NIH. • These discussions with the community started in June 2012. The new guidelines to increase rigor and reproducibility are outlined in NOT-OD-15-103 and at a web site (https://www.nih.gov/research-training/rigor-reproducibility). • Data archives play an important role in improving the rigor and reproducibility of NIMH funded research. Rigor and Reproducibility – Data Archives Help 28
  • 29. • Data dictionaries are a key part of the NDA infrastructure. Each item in a data dictionary has an allowable range of values. The NDA has a validation tool that allows users to check a data set to see if it conforms with the allowable ranges and formats in a data dictionary. • Because of our mandated data deposition schedule, the validation tool allows labs to find errors every 6 months when data are deposited (or more often if they choose). Rigor and Reproducibility - 2 29
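The kind of check described on this slide, comparing each submitted value against the allowable type and range in a data dictionary, can be sketched as follows. This is a minimal illustration and not the NDA validation tool: the `ElementDef` structure, the element names (`subject_key`, `interview_age`, `score`), and the ranges are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementDef:
    """One data dictionary entry: a name, an expected type, and optional bounds."""
    name: str
    kind: type
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    required: bool = False


def validate_record(record, dictionary):
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors = []
    for element in dictionary:
        value = record.get(element.name)
        if value is None:
            if element.required:
                errors.append(f"{element.name}: missing required value")
            continue
        if not isinstance(value, element.kind):
            errors.append(f"{element.name}: expected {element.kind.__name__}")
            continue
        if element.minimum is not None and value < element.minimum:
            errors.append(f"{element.name}: {value} is below {element.minimum}")
        if element.maximum is not None and value > element.maximum:
            errors.append(f"{element.name}: {value} is above {element.maximum}")
    # Flag columns that the dictionary does not define (often a typo).
    known = {element.name for element in dictionary}
    for key in record:
        if key not in known:
            errors.append(f"{key}: not defined in the data dictionary")
    return errors


# Hypothetical dictionary with three elements.
DICTIONARY = [
    ElementDef("subject_key", str, required=True),
    ElementDef("interview_age", int, 0, 1440, required=True),
    ElementDef("score", float, 0.0, 100.0),
]

good = validate_record(
    {"subject_key": "DEMO_1", "interview_age": 130, "score": 55.5}, DICTIONARY
)
bad = validate_record({"interview_age": 2000, "scor": 1.0}, DICTIONARY)
```

Because a check like this runs locally and needs only the dictionary, a lab could run it as often as it likes, long before any scheduled deposit.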
  • 30. • The NDA makes it easy to identify the data associated with a publication, and we assign a DOI to that dataset to make it trivial for the research community to find the data. • Identifying the data from a publication allows researchers to look at all of the data collected under an award and compare that to the data used in the publication. Rigor and Reproducibility - 3 30
  • 31. • There are “professional” research participants who seem to make a living volunteering for clinical studies. Websites exist that make it relatively easy for such participants to find out the right answers to screening questions to be admitted to a study. Clearly, this can be dangerous to the volunteer and can also put the rigor and reproducibility of the study at risk. • Recruitment in certain diagnostic categories may take place in a small number of clinical centers. This means that papers from many different research groups may be sampling from a smaller population than the “independent” papers might suggest. • The NDA GUID helps the research community understand the size of these problems and deal with these issues. • There are also commercial services that aid in the screening for someone who is participating in multiple clinical trials. GUIDs and Rigor/Reproducibility 31
  • 33. 1) NIMH Data Archive – very heterogeneous data collected in multiple laboratories. NDA attempts to aggregate this data using a global unique identifier system as well as data dictionaries to describe the myriad experiments. 2) Human Connectome Program – heterogeneous data (clinical assessments, imaging, MEG, genomics) collected using a common protocol. The first phase of this project involved data collection from typical research subjects at a single site. The project has recently expanded to include data collected across the lifespan for control subjects as well as from subjects with a diagnosis. Those datasets are collected at multiple laboratories, but still use similar data collection protocols. Two Different Sorts of Data Archives
  • 34. • The NIH Human Connectome Project (HCP) is supported by the NIH Neuroscience Blueprint ICs • The HCP is an ambitious effort to map the neural pathways that underlie human brain function. The overarching purpose of the Project is to acquire and share data about the structural and functional connectivity of the human brain. It has greatly advanced the capabilities for imaging and analyzing brain connections, resulting in improved sensitivity, resolution, and utility, thereby accelerating progress in the emerging field of human connectomics. • Phase 1 of the HCP resulted in two awards ■ David Van Essen and Kamil Ugurbil, Wash U and U Minnesota ■ Bruce Rosen, MGH Human Connectome Project 34
  • 35. 1) Deliver advanced MRI scanners and techniques with high spatial and temporal resolution for functional and diffusion MRI. ■ Both the MGH and the Wash U MRIs worked as designed and were able to collect data quickly. Siemens learned a great deal from collaborating on both instruments, and their new family of 3T MRIs (the Prisma) has operating characteristics similar to the Wash U scanner. ■ The supplements to port the pulse sequences to other laboratories and to other manufacturers have also been successful. Phase 1 Connectome Accomplishments 35
  • 36. 2) Deliver high quality data to the research community ■ Wash U has released data from 1200 subjects. This includes behavioral assessments, structural MRIs, rs fMRI, task fMRI, and diffusion experiments. MEG data has also just been released. This is the first time that a large imaging award adopted “genome speed” data release. ■ Data from MGH are being made available on their web site as well as at the Wash U web site. ■ The data are being widely used by the research community. More than 100 papers cited the Wash U grant at the point where data collection was only half complete. ■ High visibility papers have appeared. Researchers from outside the WU-Minn collaboration have authored some of those papers. Phase 1 Connectome Accomplishments 36
  • 37. • Based on the results from the original connectome project (which collected data on 22-34 year old healthy subjects), NIH decided to fund awards for a lifespan connectome. Three awards have been made that will cover the age range from birth to the oldest old (90+). • In addition, NIH has funded 14 awards to measure connectomics on groups that have some sort of diagnosis (Alzheimer's, low vision, dementia, epilepsy, mood and anxiety disorders, psychosis, …). • Over 8,000 subjects are participating and nearly 12,000 scans are expected in the data infrastructure by 2021. Phenotypic and clinical assessments as well as other non-MRI data are being collected and made available. • In addition, the Adolescent Brain Cognitive Development (ABCD) study has chosen to use the connectome data collection protocol. That study intends to enroll 10,000 children aged 9-10 and follow them into early adulthood. This dataset requires a data access agreement. Connectome Today 37
  • 38. • A Connectome Coordination Facility has been created to hold all of the data (https://www.humanconnectome.org/). • The original HCP consents allowed almost unlimited access to the data (clinical and phenotypic data as well as the MRI and MEG data). • An individual who wanted data simply enters a working web site into the registration system and certifies that they will not attempt to re-identify any of the research participants. • Many of the original participants are part of the Missouri twin study. This caused some of the measured data (family structure, substance use) to be declared sensitive. The sensitive data had a more restrictive data access protocol. HCP Data 38
  • 39. ConnectomeDB Moving to NIMH Data Archive (NDA)
  • 42. • Clearly resulted in a lot of data use – transfers, papers, … • Open access probably helped the community to adopt HCP data collection as the current standard • Even in this open access data set, there is still some information that is sensitive and requires approval. When the data were at Wash U, the only penalty for misusing the data was loss of further access to the data. • No penalties were ever imposed for mistreating data. • ABCD early data availability (needs DAC) seems to be around the same level as HCP – does this mean that researchers will do what it takes to get good data? • Probably the key question to think about when deciding between open access and a more restrictive model is what penalties need to be imposed if the data are not treated in accord with the data access agreement. HCP – Open Access 42
  • 43. The “WU-Minn” HCP consortium of the initial HCP Dataset
  • 44. • Understanding complex diseases needs lots of different data from a variety of sources. • Informed consents, national laws concerning data sharing, and investigator preferences can all restrict the aggregation of data. • All of those issues can be solved, with some effort. • If you have an option, the decision whether to share data under a very open access model or in a federal database should be based on what needs to happen if the data are not treated appropriately. • Even though it is easier to get data from an open repository, early results from the ABCD project suggest that users will take the steps needed to get access to high quality data. Summary