What Can Happen when Genome Sciences Meets Data Sciences?
1. What Can Happen when Genome
Sciences Meets Data Sciences?
Philip E. Bourne PhD, FACMI
Stephenson Chair of Data Science
Director, Data Science Institute
Professor of Biomedical Engineering
peb6a@virginia.edu
https://www.slideshare.net/pebourne
02/14/18 UVA Genome Sciences 1
2. I am more interested in having a
discussion than giving a lecture …
This is not about my research
specifically but what is happening
more broadly
3. Agenda
• Some context
– My definition of data science
– What drives my thinking
– What is the NIH thinking?
• Relevant examples
• The DSI and what is happening at UVA
• Together, where do we go from here?
4. What Do I Mean by Big Data/Data
Science?
• Use of the ever increasing amount of open,
complex, diverse digital data
• Finding ways to ask and then answer relevant
questions by combining such diverse data sets
• Arriving at statistically significant conclusions
not otherwise obtainable
• Sharing such findings in a useful way
• Translating such findings into actions that
improve the human condition
6. A Few Random Data {Science} Facts
• There are ~2.7 Zettabytes (2.7 × 10^6 PB) of digital
data currently
– = US population tweeting 3x/min for 26,976 years
• Big data currently estimated as a $50bn business
– could save $3.1tn
• 40% growth in data/yr; 5% growth in IT
expenditure
• US has 140,000-190,000 unfilled deep data analytics
jobs
• DSI has 600 applicants this year for 50 spots;
MSDS/MBA highly sought
7. A Few Random Data {Science} Facts
• There are ~2.7 Zettabytes (2.7 × 10^6 PB) of digital
data currently
– = US population tweeting 3x/min for 26,976 years
• Big data currently estimated as a $50bn business
– could save $3.1tn – private sector research
• 40% growth in data/yr; 5% growth in IT
expenditure - undervalued
• US has 140,000-190,000 unfilled deep data analytics
jobs – competition for skilled researchers high
• DSI has 600 applicants this year for 50 spots;
MSDS/MBA highly sought – large human capital
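A rough check of the tweeting comparison above, as a minimal sketch. The US population figure (320 million) is an assumption that does not appear on the slide; the remaining numbers are taken from the slide, and the function names are illustrative only.

```python
# Back-of-envelope check of the "US population tweeting" comparison.
# ASSUMPTION: US population of 320 million (not stated on the slide).
US_POPULATION = 320_000_000
TWEETS_PER_MINUTE = 3        # from the slide
YEARS = 26_976               # from the slide
TOTAL_BYTES = 2.7e21         # 2.7 ZB from the slide (1 ZB = 10**21 bytes)

MINUTES_PER_YEAR = 365 * 24 * 60


def zettabytes_to_petabytes(zettabytes):
    """Convert ZB to PB (1 ZB = 10**6 PB)."""
    return zettabytes * 10**6


def total_tweets():
    """Number of tweets sent over the whole period."""
    return US_POPULATION * TWEETS_PER_MINUTE * MINUTES_PER_YEAR * YEARS


def implied_bytes_per_tweet():
    """Tweet size needed for the comparison to add up to 2.7 ZB."""
    return TOTAL_BYTES / total_tweets()


print(f"2.7 ZB = {zettabytes_to_petabytes(2.7):.2e} PB")
print(f"Implied size per tweet: {implied_bytes_per_tweet():.0f} bytes")
```

Under the assumed population, the comparison implies roughly 200 bytes per tweet; a different population estimate shifts that figure proportionally.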
8. How Much Biomedical Data?
• Big Data
– Total data from NIH-funded research in 2016
estimated at 650 PB*
– 20 PB of that is in NCBI/NLM (3%) and it is
expected to grow by 10 PB in 2016
• Dark Data
– Only 12% of data described in published papers is
in recognized archives – 88% is dark data^
• Cost
– 2007-2014: NIH spent ~$1.2Bn extramurally on
maintaining data archives
* In 2012 Library of Congress was 3 PB
^ http://www.ncbi.nlm.nih.gov/pubmed/26207759
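The proportions on this slide can be checked with a few lines of arithmetic. The sketch below also shows one way the deck's closing note of $1.25bn per year to capture all data could follow from these numbers. It assumes that the 2007-2014 spend covers eight years inclusive and that cost scales linearly with the fraction of data archived. Both are assumptions made for illustration, not statements from the slides.

```python
# Figures taken from the slide.
NIH_TOTAL_PB = 650           # estimated data from NIH-funded research, 2016
NCBI_PB = 20                 # portion held in NCBI/NLM
ARCHIVED_FRACTION = 0.12     # share of published data in recognized archives
ARCHIVE_SPEND_USD = 1.2e9    # extramural archive spend, 2007-2014

# ASSUMPTION: 2007-2014 counted inclusively, i.e. eight years.
SPEND_YEARS = 2014 - 2007 + 1


def ncbi_share():
    """Fraction of the estimated total that sits in NCBI/NLM."""
    return NCBI_PB / NIH_TOTAL_PB


def dark_fraction():
    """Share of published data that is not in a recognized archive."""
    return 1 - ARCHIVED_FRACTION


def annual_archive_spend():
    """Average yearly spend on maintaining archives."""
    return ARCHIVE_SPEND_USD / SPEND_YEARS


def cost_to_archive_everything():
    """ASSUMPTION: cost scales linearly with the archived fraction."""
    return annual_archive_spend() / ARCHIVED_FRACTION


print(f"NCBI/NLM share: {ncbi_share():.1%}")
print(f"Dark data: {dark_fraction():.0%}")
print(f"Average annual archive spend: ${annual_archive_spend() / 1e6:.0f}M")
print(f"Linear estimate for all data: ${cost_to_archive_everything() / 1e9:.2f}bn/yr")
```

The linear scaling is the weakest step: archiving costs need not grow in proportion to the amount of data captured, so the final figure is best read as an order-of-magnitude estimate.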
9. Consider Some Current High Profile
NIH Examples Where Data Science is
Being Applied
• Moonshot - Bringing together 5 petabytes of homogenized data within the
Genomic Data Commons (GDC) to explore genotype-phenotype
relationships
• MODs – Multiple high value high cost genomic resources
• Human Microbiome Project – microbe characterization and analysis
• TOPMed – Genomic, proteomic, metabolomic, image and EHR data
• All-of-Us Precision Medicine - Building a platform to support data on >1M
individuals with extensive and constantly updated health profiles
• ECHO – Effects of Environmental Exposures on Child Health and
Development - Integration of child health and environmental data
• BRAIN - Temporal and spatial analysis of neural circuits
10. How is Data Science Being Applied?
• Moonshot – new ways to analyze genotype-phenotype associations
• MODs – new curation and integration tools
• Human Microbiome Project – new cloud based tools
• TOPMed – large scale storage and analysis; data harmonization
• All-of-Us Precision Medicine – security; analysis of sensor data; EHR
integration
• ECHO – metadata descriptions of health and environmental data;
application of geospatial methods
• BRAIN – methods for network analysis, visualization
All:
Analytics, the Commons, FAIR, sustainability, workforce
Wilkinson et al The FAIR Guiding Principles for
scientific data management and stewardship. Sci
Data. 2016 Mar 15;3:160018
https://datascience.nih.gov/TheCommons
11. Some underlying concerns at NIH…
Reproducibility…
Conformance to data sharing policies
& governance more generally
12. Why a More Open Process?
Use case:
Diffuse Intrinsic Pontine Gliomas (DIPG)
• Occur in 1:100,000
individuals
• Peak incidence 6-8 years
of age
• Median survival 9-12
months
• Surgery is not an option
• Chemotherapy ineffective
and radiotherapy only
transitory
From Adam Resnick
13. Timeline of genomic studies in DIPG
• Landmark studies identify
histone mutations as
recurrent driver mutations in
DIPG ~2012
• Almost 3 years later, in
largely the same datasets,
but partially expanded, the
same two groups and 2
others identify ACVR1
mutations as a secondary,
co-occurring mutation
From Adam Resnick
14. What do we need to do differently to
reveal ACVR1?
• ACVR1 is a targetable kinase
• Inhibition of ACVR1 inhibited tumor
progression in vitro
• ~300 DIPG patients a year
• ~60 are predicted to have ACVR1 mutations
• If only large scale data sets had been
integrated with TCGA and/or rare
disease data in 2012, ACVR1 mutations
would have been identified
• 60 patients/year × 3 years = 180
children’s lives (who likely succumbed to
the disease during that time) could have
been impacted if only data were FAIR
From Adam Resnick
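The estimate on this slide is simple arithmetic, sketched below using only the slide's own figures. The function names are illustrative, and the 20% share is derived from the slide's numbers rather than stated on it.

```python
# Figures taken from the slide.
DIPG_PATIENTS_PER_YEAR = 300
ACVR1_PATIENTS_PER_YEAR = 60
DELAY_YEARS = 3              # gap between the histone and ACVR1 findings


def acvr1_fraction():
    """Share of DIPG patients predicted to carry ACVR1 mutations."""
    return ACVR1_PATIENTS_PER_YEAR / DIPG_PATIENTS_PER_YEAR


def patients_affected_by_delay():
    """Patients with ACVR1 mutations diagnosed during the delay."""
    return ACVR1_PATIENTS_PER_YEAR * DELAY_YEARS


print(f"ACVR1 share of DIPG patients: {acvr1_fraction():.0%}")
print(f"Patients over the delay: {patients_affected_by_delay()}")
```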
15. Both funders and some institutions
see the need to move from pipes to
platforms to accelerate research…
https://blog.lexicata.com/wp-content/uploads/2015/03/platform-model-750x410.png
16. If platforms are the answer we could
ask the question…
Will biomedical research become more
like Airbnb?
Vivien Bonazzi
Should biomedical research be Like Airbnb?
doi: 10.1371/journal.pbio.2001818
17. I am not crazy, hear me out
• Airbnb is a platform that supports a trusted relationship
between consumer (renter) and supplier (host)
• The platform focuses on maximizing the exchange of services
between supplier and consumer and maximizing the amount
of trust associated with a given stakeholder
• It seems to be working:
– 60 million users searching 2 million listings in 192 countries
– Average of 500,000 stays per night.
– Valuation of US $25bn
Should biomedical research be Like Airbnb?
doi: 10.1371/journal.pbio.2001818
18. Platforms will ultimately digitally
integrate the scholarly workflow for
human and machine analysis
Should biomedical research be Like Airbnb?
doi: 10.1371/journal.pbio.2001818
19. Why a comparison to Airbnb is not fair
• Airbnb was born digital
• The exchange of services on Airbnb is
simple compared to what is required of a
platform to support biomedical research
Nevertheless there is much to be
learnt
20. Impediments to a biomedical platform
• Current work practices by all stakeholders
• Entrenched business models
• Size of the undertaking aka resources
needed
• Trust
• Incentives to use the platform
http://www.forbes.com/sites/johnhall/2013/04/29/10-barriers-to-employee-innovation/#8bdbaa811133
21. In summary there is not currently a
widely adopted single platform for
the exchange of services in
biomedical research. Either there is a
platform per service or no platform
at all….
Funders and the institutions they
fund need to work more closely to
implement platforms
23. How is the DSI responding to these
various needs?
24. Working across the grounds
to break down traditional silos
25. Hallmarks
• Currently sustainable
• Planning for where the academical village meets Google – an
ecosystem in which students, faculty, staff, visitors, private sector
reps, entrepreneurs live and work
• Open UVA and open data
• Not owning anything; only working through collaboration e.g.
– Dual degrees
– Research projects across disciplines
• MS DS focusing on practical training
• Dual degrees
• Soon PhD and undergraduate major
• Wikimedian in residence (March, 2018)
26. Emergent DSI Organization
• Data Integration & Engineering
• Machine Learning & Analytics
• Visualization
• Data Acquisition & Dissemination
• Ethics, Law, Policy, Social Implications
27. Emergent DSI Organization
• Data Integration & Engineering
• Machine Learning & Analytics
• Visualization
• Data Acquisition & Dissemination
• Ethics, Law, Policy, Social Implications
Biomedical Data Sciences
28. Data Acquisition & Dissemination
Supplier / Consumer pairs:
• Paper Author / Paper Reader
• Data Provider / Data Consumer
• Employer / Employee
• Reagent Provider / Reagent Consumer
• Software Provider / Software Consumer
• Grant Writer / Grant Reviewer
• Educator / Student
Platforms: MS Project, Google Drive, Coursera, ResearchGate, Academia.edu, Open Science Framework, Synapse, F1000, Rio
Pilot Open Data Lab underway
29. Data Integration and
Engineering
• Ontologies
• Object identifiers
• Indexing schemes
• Common data models
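How the four ingredients above can fit together is sketched below. This is a toy illustration, not a description of any DSI or NIH system: the record layout, the identifiers ("example:0001") and the term labels are all hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical common data model: every data set has the same shape."""
    identifier: str          # object identifier (placeholder values only)
    title: str
    ontology_terms: frozenset = field(default_factory=frozenset)


def build_term_index(records):
    """Indexing scheme: map each ontology term to the data sets tagged with it."""
    index = defaultdict(set)
    for record in records:
        for term in record.ontology_terms:
            index[term].add(record.identifier)
    return index


def find_shared(index, term_a, term_b):
    """Return identifiers of data sets annotated with both terms."""
    return sorted(index.get(term_a, set()) & index.get(term_b, set()))


# Hypothetical records; terms are plain labels standing in for ontology IDs.
records = [
    DatasetRecord("example:0001", "Tumor sequencing cohort",
                  frozenset({"DIPG", "ACVR1"})),
    DatasetRecord("example:0002", "Rare disease registry",
                  frozenset({"ACVR1"})),
    DatasetRecord("example:0003", "Imaging study", frozenset({"DIPG"})),
]

index = build_term_index(records)
print(find_shared(index, "DIPG", "ACVR1"))
```

Because every record shares one model and one vocabulary, a question that spans data sets becomes a set intersection rather than a custom integration effort for each source.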
30. Machine Learning &
Analytics
• Neural nets
• Deep learning
• NLP
• Gene expression &
neurological disease (Kipnis)
• Predicting opioid overdose
(VA Health)
• Predicting escalating care
and mortality risk of
cirrhosis patients (UVA HS)
• Human microbiome &
mental health in maternal
health (Psychology &
Nursing)
33. Points of Interaction
• Dual degrees with an MSDS
• Specific projects for:
– Presidential fellows (due March 19, 2018)
– Capstones (due June 29, 2018)
• Thoughts on biomedical data science cluster hires
• Data Science Internship program with NIH, Inova, GMU, VT,
GWU, UMD…
• Join the DSI faculty
• Join the mailing list
– Lunch and learn
– Distinguished lectures
– Special events
34. References
• Dunn and Bourne Building the Biomedical Data Science
Workforce PLoS Biol. 2017 Jul 17;15(7):e2003082.
• Bonazzi and Bourne Should Biomedical Research be like
Airbnb? PLoS Biol. 2017 Apr 7;15(4):e2001818.
• McKiernan et al How Open Science Helps Researchers
Succeed Elife. 2016 Jul 7;5. pii: e16800
• Wilkinson et al The FAIR Guiding Principles for scientific
data management and stewardship. Sci Data. 2016
Mar 15;3:160018.
• https://datascience.nih.gov/TheCommons
35. Acknowledgements
The BD2K Team at NIH
My New Colleagues at UVA
The 150 folks who have passed through my laboratory
https://docs.google.com/spreadsheets/d/1QZ48UaKcwDl_iFCvBmJsT03FK-bMchdfuIHe9Oxc-rw/edit#gid=0
Scott and Beth Stephenson
Anonymous donors for the DSI endowment
$1.25bn per year to capture all data.
After a significant effort at reduction, intramural data is still spread across more than 60 data centers; imagine the extramural situation.