The PDB: An Exemplar for Data Science To Date, But What About the Future?
1. The PDB: An Exemplar for Data Science To Date, But What About the Future?
Philip E. Bourne Ph.D.
Associate Director for Data Science
National Institutes of Health
2. Background
[Timeline: 6/12, 2/14, 3/14]
• Findings:
• Sharing data & software through catalogs
• Support methods and applications development
• Need more training
• Hire CSIO
• Continued support throughout the lifecycle
http://acd.od.nih.gov/diwg.htm
3. Motivation for This Talk
Source: Michael Bell, http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830
5. The way we fund and operate biomedical databases will not scale.
How do we keep the best features of today's resources but also respond to shrinking budgets and changes in the way we do science?
Let's address this question using the PDB as an example.
6. Disclaimer: This is NOT a talk about the PDB per se. It is a talk about data resources in general, using the PDB as an example because we are all familiar with it and most stakeholders consider it an exemplar.
7. Good News: We Trust the PDB
Trust in the data is perhaps the PDB’s biggest achievement
8. Good News: Trust
Trust is like compound interest
Comes from listening
Comes from engaging the community in every aspect of the process
Comes from data consistency and level of annotation
Comes from responsiveness
Comes from the quality of the delivery service
9. Good News/Bad News Re Data Quality
Good News:
– If done right in the beginning, 25% of the PDB’s budget could have been saved
– Ontologies can work
– Automation has reduced cost even as the amount of data has increased
– Reproducibility is improved
Bad News:
– Complex ontologies slow adoption
– All data are created equal
– Annotation is limited
10. Good News/Bad News Re Community
Good News:
– The community is engaged
– The community has driven data sharing
Bad News:
– The community does not reduce costs through active participation
– There is insufficient reward for being part of the community, e.g. as an annotator
11. How we do science is changing. Do data resources, including the PDB, best serve the needs of the user at this point?
12. How is Science Changing?
More interdisciplinary
More translational
More access to diverse data types
More computational
More collaborative
13. Good News/Bad News for the PDB in this Changing Landscape
Bad News:
– Interface complex and uni-data oriented
– Data accessible; methods accessible (sort of); but not together
– Significant redundancy in services offered
Good News:
– Annotation!
– Demand is increasing
– Integrated with other data types
– RESTful services
14. General Problem Statement:
How to ensure a high-quality annotated data source that provides the optimal environment for accessibility and analysis by a broad community of diverse users?
15. Okay, so what can the funders do to address a situation in which the PDB is currently a best-case scenario?
16. 1. Encourage more understanding of how existing data are used
[Figure: Structure Summary page activity for H1N1 influenza-related structures, Jan. 2008 to Jul. 2010. Credit: Andreas Prlic]
1RUZ: 1918 H1 Hemagglutinin
3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir
* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
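One way to build this kind of understanding is to count page views per entry per month, as the Structure Summary page activity for 1RUZ and 3B7E does. The following is a minimal sketch only: the log layout (a date paired with a PDB ID), the function name monthly_views, and the sample records are all assumptions made for illustration, not real PDB traffic or a real PDB service.

```python
from collections import Counter
from datetime import date


def monthly_views(access_log, pdb_id):
    """Count page views per (year, month) for one PDB entry.

    access_log is an iterable of (date, pdb_id) pairs; this layout is
    a hypothetical simplification of a web access log.
    """
    counts = Counter()
    for day, entry in access_log:
        # PDB IDs are compared case-insensitively.
        if entry.upper() == pdb_id.upper():
            counts[(day.year, day.month)] += 1
    return dict(sorted(counts.items()))


# Made-up log records for illustration only; not real usage data.
sample_log = [
    (date(2009, 4, 27), "1RUZ"),
    (date(2009, 4, 28), "1ruz"),
    (date(2009, 5, 2), "3B7E"),
    (date(2009, 5, 3), "1RUZ"),
]

print(monthly_views(sample_log, "1RUZ"))
```

Plotting these monthly counts against external events (for example, an outbreak timeline) is what turns raw access logs into evidence of how the data are used.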
17. We Need to Learn from Industries Whose Livelihood Addresses the Question of Use
18. 2. Address the Issue that Scholarship is Broken
I have a paper with 17,500 citations that no one has ever read
I have papers in PLOS ONE that have more citations than ones in PNAS
I have data sets I am proud of but few places to put them
I edited a journal but it did not count for much
20. 4. Enable Reproducibility
Much of the research life cycle is now digital: encourage the reliability, accessibility, findability, and usability of data, methods, narrative, publications, etc.
How?
Data sharing plans
Standards frameworks
Data and software catalogs
PubMedCentral
? The Commons – PMC for the complete lifecycle
? Machine readable data sharing plans
? Small funding to communities
? Support for training and best practices in eScholarship
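To make "machine readable data sharing plans" concrete, here is a minimal sketch of a plan expressed as structured data that software can serialize and check. Every field name, the DataSharingPlan class, and the missing_fields helper are hypothetical; no funder schema is implied, and the sample values are invented.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DataSharingPlan:
    # All field names are hypothetical; they do not follow any official schema.
    project: str
    data_types: list
    repository: str
    release_months_after_collection: int
    license: str
    standards: list = field(default_factory=list)


def missing_fields(plan):
    """Return the names of fields left empty, so a plan can be checked automatically."""
    return [name for name, value in asdict(plan).items() if value in ("", [], None)]


# Invented example plan for illustration only.
plan = DataSharingPlan(
    project="Example structural study",
    data_types=["atomic coordinates", "structure factors"],
    repository="PDB",
    release_months_after_collection=12,
    license="",
)

print(json.dumps(asdict(plan), indent=2))
print(missing_fields(plan))
```

A plan in this form can be indexed, compared across awards, and checked for completeness without a person reading free text.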
21. 5. Establish The Commons
Public/private partnership
Work with ICs, NCBI and CIT to identify and run pilots – cloud, HPC centers
Port dbGaP to the cloud
? Experiment with new funding strategies
Evaluate
22. Sustainability and Sharing: The Commons
[Diagram: The Commons at the center, with the following labels]
Data: The Long Tail; Core Facilities/HS Centers; Clinical/Patient
The Why: Data Sharing Plans
The How: Data Discovery Index; Sustainable Storage; Quality; Scientific Discovery; Usability; Security/Privacy
The End Game: Knowledge
Other labels: Government; NIH Awardees; Private Sector; Rest of Academia; Metrics/Standards; Software Standards Index; BD2K Centers; Cloud, Research Objects,
Commons == Extramural NCBI == Research Object Sandbox == Collaborative Environment
23. What Does the Commons Enable?
Dropbox like storage
The opportunity to apply quality metrics
Bring compute to the data
A place to collaborate
A place to discover
http://100plus.com/wp-content/uploads/Data-Commons-3-1024x825.png
24. The PDB in the Commons
Components:
– Annotated collection of data files
– APIs to access these data files
– Example methods using these APIs
Potential outcomes
– Nothing happens?
– A new breed of developer starts to use PDB data in new ways?
– The casual user has a broader set of services than previously?
– Quality declines?
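As an illustration of the third component, "example methods using these APIs", the sketch below reads ATOM records from PDB-format text and computes a centroid. It assumes the data file has already been retrieved (no network call and no real PDB API is used). The function names are hypothetical, the coordinates are made up, and format_atom is a simplified writer used only to generate the demo input.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Write one fixed-column ATOM record (simplified; demo input only)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(pdb_text):
    """Read ATOM records using the fixed column positions of the PDB format."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atoms.append({
                "name": line[12:16].strip(),      # columns 13-16
                "res_name": line[17:20].strip(),  # columns 18-20
                "chain": line[21],                # column 22
                "res_seq": int(line[22:26]),      # columns 23-26
                "xyz": (float(line[30:38]),       # columns 31-38
                        float(line[38:46]),       # columns 39-46
                        float(line[46:54])),      # columns 47-54
            })
    return atoms


def centroid(atoms):
    """Return the mean x, y, z of the given atoms."""
    n = len(atoms)
    return tuple(sum(a["xyz"][i] for a in atoms) / n for i in range(3))


# Made-up coordinates for illustration; not taken from any PDB entry.
demo_text = "\n".join([
    format_atom(1, "N", "GLY", "A", 1, 0.0, 0.0, 0.0),
    format_atom(2, "CA", "GLY", "A", 1, 1.5, 0.0, 0.0),
    format_atom(3, "C", "GLY", "A", 1, 1.5, 3.0, 0.0),
])

atoms = parse_atoms(demo_text)
print([a["name"] for a in atoms], centroid(atoms))
```

Small, readable methods like this are the kind of starting point that could let a new developer use the data files without first learning the whole resource.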
25. Some Acknowledgements
Eric Green & Mark Guyer (NHGRI)
Jennie Larkin (NHLBI)
Leigh Finnegan (NHGRI)
Vivien Bonazzi (NHGRI)
Michelle Dunn (NCI)
Mike Huerta (NLM)
David Lipman (NLM)
Jim Ostell (NLM)
Andrea Norris (CIT)
Peter Lyster (NIGMS)
All of the more than 100 folks on the BD2K team