How structural biology can influence data science, and vice versa. Based on a forthcoming paper in Current Opinion in Structural Biology (https://arxiv.org/abs/1807.09247) and presented as part of the University of Virginia Data Science Institute Lunch and Learn Series, August 31, 2018.
1. Data Science Meets Structural
Biology
Philip E. Bourne, Cam Mura & Eli Draizen
(Open Team Science)
https://www.slideshare.net/pebourne
08/31/18 DSI Lunch & Learn 1
https://arxiv.org/abs/1807.09247
2. We are more interested in having a
discussion than giving a lecture …
3. Let’s start with a couple of definitions…
4. What Do We Mean by Data Science?
• Use of the ever-increasing amount of open,
complex, diverse digital data
• Finding ways to ask and then answer relevant
questions by combining such diverse data sets
• Arriving at statistically significant conclusions
not otherwise obtainable
• Sharing such findings in a useful way
• Translating such findings into actions that
improve the human condition
5. What Do We Mean by Structural
Biology?
6. Structure… What’s it good for??
Classic structural biology example
A point mutation (E6→V) in the Hb β globin chain results in sickle
cell anemia
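The E6→V notation can be made concrete with a minimal sketch. It assumes the usual numbering of the mature β-globin chain (initiator methionine removed), and the helper name `apply_point_mutation` is hypothetical, introduced only for illustration.

```python
def apply_point_mutation(sequence: str, mutation: str) -> str:
    """Apply a substitution written as <wild-type><position><mutant>, e.g. 'E6V'."""
    wild_type, position, mutant = mutation[0], int(mutation[1:-1]), mutation[-1]
    index = position - 1  # residue numbering is 1-based
    if sequence[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


# First eight residues of the mature human beta-globin chain
# (assumption: initiator methionine removed, so glutamate is residue 6).
HBB_N_TERMINUS = "VHLTPEEK"

sickle = apply_point_mutation(HBB_N_TERMINUS, "E6V")
print(sickle)  # VHLTPVEK
```

A single-letter change in the sequence is all that distinguishes the two chains; the structural consequences are what the slide is about.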
7. Structural biology success stories
Atomic-resolution studies of cellular-scale systems have become increasingly possible: immense explanatory power!
[Figure panels labeled 1960-70s, 1986, early 1990s, mid-1990s, ~2002; one structure labeled microtubule]
8. Why Do We Care About this
Intersection?
9. Stepping back…
Data are transforming how we think about
everything, including biomedical research…
Most folks just do not realize it yet…
Your reading of this slide relies on structural
biology (a photoreceptor called rhodopsin!)
10. How is the DSI Responding to this Change?
• Societal good
• Interdisciplinary
• Practical experience
• Ethical conduct
• Openness and transparency
Surge in publications involving machine
learning in the biosciences ('J-curve')
11. Example of Why We Need More Openness:
Diffuse Intrinsic Pontine Gliomas (DIPG)
• Occur in 1 in 100,000
individuals
• Peak incidence 6-8 years
of age
• Median survival 9-12
months
• Surgery is not an option
• Chemotherapy ineffective
and radiotherapy only
transitory
From Adam Resnick
12. Timeline of genomic studies in DIPG
• Landmark studies identify
histone mutations as
recurrent driver mutations in
DIPG ~2012
• Almost 3 years later, in
largely the same datasets,
but partially expanded, the
same two groups and 2
others identify ACVR1
mutations as a secondary, co-
occurring mutation
From Adam Resnick
13. What do we need to do differently to
reveal ACVR1?
• ACVR1 is a targetable kinase
• Inhibition of ACVR1 inhibited tumor
progression in vitro
• ~300 DIPG patients a year
• ~60 are predicted to have ACVR1 mutations
• If only large-scale data sets had been
integrated with TCGA and/or rare
disease data in 2012, ACVR1 mutations
would have been identified
• 60 patients/year × 3 years = 180
children’s lives (who likely succumbed to
the disease during that time) could have
been impacted if only data were FAIR
From Adam Resnick
14. Working across the Grounds
to break down traditional silos
15. Hallmarks Reflecting Those Principles
• Sustainable
• Designing for where the academical village meets Google – an
ecosystem in which students, faculty, staff, visitors, private sector
reps, and entrepreneurs live and work
• Open UVA and open data – Wikimedian in Residence
• Collaboration
– Dual degrees
– Research projects across disciplines
– Sister institutions
• MS DS focusing on practical training
• PhD program
• Undergraduate major
• Undergraduate certificate
Under development
16. DSI Organization
Structural Biology is one of Many Cross Cutting Initiatives
• Data Acquisition
• Data Integration & Engineering
• Machine Learning & Analytics
• Visualization & Dissemination
• Ethics, Law, Policy, Social Implications
Structural Biology
17. DSI Organization
Structural Biology is one of Many Cross Cutting Initiatives
Structural Biology mapped onto the five pillars of Data Science
Structural Biology
18. Let’s Briefly Focus on those Five Points
of Intersection in the Context of
Structural Biology …
19. Data Acquisition
The data production issue (the V’s of Big Data): Experimentally
• An estimated (2017) ≈2.5 quintillion (2.5×10^18) bytes of data are generated daily, with 90%
of all the world’s data having been created in the past two years.
• Plaintext PDB files are typically a few hundred KB each (…but that’s just the start!)
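To give a feel for what a plaintext PDB file holds, here is a minimal sketch that reads the fixed-width ATOM/HETATM coordinate records. The `Atom` type and `parse_atom_line` helper are hypothetical names, and the sample coordinates are made up for illustration rather than taken from a deposited entry.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM/HETATM record using 0-based Python slices."""
    return Atom(
        serial=int(line[6:11]),       # columns 7-11
        name=line[12:16].strip(),     # columns 13-16
        res_name=line[17:20].strip(), # columns 18-20
        chain_id=line[21],            # column 22
        res_seq=int(line[22:26]),     # columns 23-26
        x=float(line[30:38]),         # columns 31-38
        y=float(line[38:46]),         # columns 39-46
        z=float(line[46:54]),         # columns 47-54
    )


# Made-up records for illustration only.
SAMPLE = (
    "ATOM      1  N   VAL A   1      -8.901   4.127  -0.555  1.00  0.00           N\n"
    "ATOM      2  CA  VAL A   1      -8.608   3.135  -1.618  1.00  0.00           C\n"
    "HETATM    3  O   HOH A 101       1.000   2.000   3.000  1.00  0.00           O\n"
)

atoms: List[Atom] = [
    parse_atom_line(line)
    for line in SAMPLE.splitlines()
    if line[0:6].strip() in {"ATOM", "HETATM"}
]
print(len(atoms), atoms[0].name, atoms[2].res_seq)  # 3 N 101
```

Each atom costs roughly 80 bytes of text, which is why even modest structures reach hundreds of kilobytes.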
20. Data Acquisition
The data production issue (the V’s of Big Data): Computationally
• Here are some 2D RMSD matrices from a µs-scale biomolecular simulation.
• Half a mole (6.02×10^23) of calculations!
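As a rough sketch of what a 2D RMSD matrix is, the snippet below compares every frame of a toy trajectory against every other frame. It assumes the frames are already superimposed and share atom order (no alignment step is performed), and the three-frame, two-atom trajectory is invented for illustration.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[Tuple[float, float, float]]


def rmsd(a: Frame, b: Frame) -> float:
    """Root-mean-square deviation between two frames with matching atom order."""
    if len(a) != len(b):
        raise ValueError("frames must contain the same number of atoms")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(a, b)
    )
    return math.sqrt(total / len(a))


def rmsd_matrix(frames: Sequence[Frame]) -> List[List[float]]:
    """Symmetric frame-by-frame RMSD matrix; the diagonal stays at zero."""
    n = len(frames)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(frames[i], frames[j])
    return matrix


# Toy trajectory (illustrative values): 3 frames, 2 atoms each.
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)],
]
matrix = rmsd_matrix(trajectory)
```

For T frames this requires T(T-1)/2 pairwise comparisons, each touching every atom, so the cost grows quadratically with trajectory length.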
21. Data Acquisition
The data reduction issue (the V’s of Big Data): Computationally
• The produce/spawn/consume idiom (MapReduce)
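The slide does not say how the idiom was implemented, so the following is only a generic map/reduce sketch: a list of per-frame values is split into chunks, each chunk is mapped to a small partial result, and the partials are reduced to one summary. The function names and the per-frame values are hypothetical.

```python
from functools import reduce
from typing import Iterator, List, Tuple

Partial = Tuple[float, int]  # (sum of values, number of values)


def chunked(values: List[float], size: int) -> Iterator[List[float]]:
    """Produce fixed-size chunks of work."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def map_chunk(chunk: List[float]) -> Partial:
    """Map step: shrink one chunk to a small partial result."""
    return (sum(chunk), len(chunk))


def combine(left: Partial, right: Partial) -> Partial:
    """Reduce step: merge two partial results."""
    return (left[0] + right[0], left[1] + right[1])


# Hypothetical per-frame quantity from a simulation (illustrative values).
per_frame_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

partials = map(map_chunk, chunked(per_frame_values, size=3))
total, count = reduce(combine, partials, (0.0, 0))
mean_value = total / count
print(mean_value)  # 4.0
```

The built-in `map` runs serially here; because each chunk is independent, the same map step could be handed to a worker pool such as `concurrent.futures.ProcessPoolExecutor.map` without changing the reduce step.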
22. Data Integration and
Engineering
• Data are structured
– Ontologies
– Object identifiers
– Indexing schemes
– Common data models
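One way to picture these four ingredients together is the sketch below: a small record type stands in for a common data model, each record carries an object identifier and ontology terms, and an inverted index supports lookup by term. All identifiers and terms are hypothetical placeholders, not real database accessions or ontology entries.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class StructureRecord:
    """A minimal common data model for one structure entry."""
    identifier: str                       # object identifier (placeholder)
    title: str
    ontology_terms: Tuple[str, ...] = ()  # controlled-vocabulary terms (placeholders)


def build_term_index(records: Iterable[StructureRecord]) -> Dict[str, List[str]]:
    """Indexing scheme: map each ontology term to the identifiers annotated with it."""
    index: Dict[str, List[str]] = {}
    for record in records:
        for term in record.ontology_terms:
            index.setdefault(term, []).append(record.identifier)
    return index


# Placeholder entries for illustration only.
records = [
    StructureRecord("ENTRY-1", "hemoglobin", ("EX:oxygen-transport", "EX:heme-binding")),
    StructureRecord("ENTRY-2", "rhodopsin", ("EX:photoreception",)),
    StructureRecord("ENTRY-3", "myoglobin", ("EX:oxygen-transport",)),
]

index = build_term_index(records)
print(index["EX:oxygen-transport"])  # ['ENTRY-1', 'ENTRY-3']
```

Because every record follows the same model and uses shared terms, data from different sources can be joined on identifiers or terms rather than on free text.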