4th International Conference on Big Data and Information Analytics, Theories, Algorithms and Applications in Data Science, December 17-19, 2018, Houston Texas. https://sph.uth.edu/divisions/biostatistics/bigdia/
Big Data and Analytics Across the Interdisciplinary Divide
1. Big Data & Analytics Across the
Interdisciplinary Divide
Philip E. Bourne PhD, FACMI
Stephenson Chair of Data Science
Director, Data Science Institute
Professor of Biomedical Engineering
peb6a@virginia.edu
https://www.slideshare.net/pebourne
12/17/18 BigDIA 1
@pebourne
2. Perspective
• I was not trained as a data scientist or computer scientist - I
started as a physical chemist
• At this point I can’t give you a deep technical perspective
• My examples are taken from biomedicine, but are broadly
applicable
• Deeply engaged in preparing one academic institution for a very
different data-driven interdisciplinary future
3. My motivation
The biggest gains for our society are going to come
through interdisciplinary research where data and
analytics catalyze the collaboration
5. A wake up call of sorts
https://www.sciencemag.org/news/2018/12/google-s-deepmind-aces-protein-folding
https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/
6. Data as driver
https://www.ebi.ac.uk/uniprot/TrEMBLstats
Contents of the Protein Data Bank
7. This is a somewhat predictable outcome..
The real excitement comes from the unexpected …
Witness the tale of the trauma surgeon …
But there is more…
8. Air pollution-ecosystem feedback: unmanned
aerial vehicles and ecosystem models to
quantify ozone-forest interactions
• Spatial heterogeneity
• Novel sampling
• Sensor data
Departments:
Environmental Sciences
Electrical Engineering
9. A working definition of what we are doing …
It is the unexpected re-use of information which is
the value added by the web
Tim Berners-Lee
https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#116a5a2d55cf
10. A working definition of what we are doing …
It is the unexpected re-use of information which is
the value added by the web and subsequent
analysis of that information for societal benefit
Tim Berners-Lee / Phil Bourne
https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#116a5a2d55cf
11. Of course this was all predicted by smart people ..
13. I would suggest that this audience has a
responsibility to promote the fourth paradigm
which is not a well recognized phenomenon across
disciplines …
Here is one example of how to do so
17. How should we think about organizing ourselves in
an interdisciplinary way to maximize the
opportunities offered by the fourth paradigm?
18. The Pillars of Data Science
Application Domains
19. Let's briefly focus on those five pillars
in the context of one area of
biomedical informatics – structural
bioinformatics
What kinds of interchange should be
taking place between this field and
data science?
Mura et al. 2018 Curr Opin Struct Biol. 52:95-102
20. Data Acquisition
• Persistence of raw data not clear
• Some level of consistency across instrument manufacturers
• Lessons in community/society drive
Mura et al. 2018 Curr Opin Struct Biol. 52:95-102
21. Data Integration and Engineering
• URIs – no, steeped in tradition
• Ontologies – somewhat
• Linked data - somewhat
Years of experience to convey
26. Guiding Principles
• Be constantly strategic and nimble - think supply chain
• Be sustainable - do not over reach
• Be interdisciplinary
• Be an organization without walls
• Be diverse, accessible and open
• Be team not individually driven
• Strive for quality not quantity in education & research
• Be innovative and translational through new forms of engagement with
the private sector, government, NGOs, local, state, national and
international partners
28. Be Interdisciplinary – Be Without Walls
• Satellites – discipline driven - located in another School
focusing on the mission of that School where data and
analytics play a role, e.g.,
– SOM – data governance and clinical translation
– Education – working on educational analytics
• Centers – Focus area driven e.g.
– Ethics and justice
– Neurodegenerative disorders – Alzheimer's, autism, TBI
– Sports analytics
30. Be Diverse, Accessible and Open – Why?
• Data science exists largely because of open data
• Open knowledge encourages disciplinary and interdisciplinary
collaboration
• Yet much of the scholarship we produce is not accessible at all and
certainly not accessible to socioeconomically disadvantaged groups
• Gouging by commercial knowledge providers is making the
knowledge produced by others less accessible to us
• Research is suffering from a reproducibility crisis addressable
through greater access to all aspects of the research lifecycle
31. Be Diverse, Accessible and Open – Why?
Consider Biomedicine
• Big Data
– Total data from NIH-funded research back in 2016 estimated at 650
PB*
– 20 PB of that is in NCBI/NLM (3%) and it is expected to grow by 10
PB in 2016
• Dark Data
– Only 12% of data described in published papers is in recognized
archives – 88% is dark data^
• Cost
– 2007-2014: NIH spent ~$1.2Bn extramurally on maintaining data
archives
* In 2012 the Library of Congress was 3 PB
^ http://www.ncbi.nlm.nih.gov/pubmed/26207759
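The proportions quoted on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the variable names are invented here, and the inputs are the slide's own estimates rather than measured values.

```python
# Estimates quoted on the slide (not measurements).
total_nih_data_pb = 650    # total data from NIH-funded research, in PB
ncbi_nlm_data_pb = 20      # portion held at NCBI/NLM, in PB
archived_fraction = 0.12   # share of published data in recognized archives

# Share of the total that sits in NCBI/NLM (the slide rounds this to 3%).
ncbi_share = ncbi_nlm_data_pb / total_nih_data_pb

# Everything not in a recognized archive is counted as dark data.
dark_fraction = 1 - archived_fraction

print(f"NCBI/NLM share: {ncbi_share:.1%}")
print(f"Dark data share: {dark_fraction:.0%}")
```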
32. A call for making these data open
• Mandates
– NIH, NSF, Data Management Plans
• Business models can be
protected yet everyone benefits
• It saves lives ….
33. Why a more open process?
Use case:
Diffuse Intrinsic Pontine Gliomas (DIPG)
• Occur 1:100,000
individuals
• Peak incidence 6-8 years
of age
• Median survival 9-12
months
• Surgery is not an option
• Chemotherapy ineffective
and radiotherapy only
transitory
From Adam Resnick
34. Timeline of genomic studies in DIPG
• Landmark studies identify
histone mutations as
recurrent driver mutations in
DIPG ~2012
• Almost 3 years later, in
largely the same datasets,
but partially expanded, the
same two groups and 2
others identify ACVR1
mutations as a secondary,
co-occurring mutation
From Adam Resnick
35. What do we need to do differently
to reveal ACVR1?
• ACVR1 is a targetable kinase
• Inhibition of ACVR1 inhibited tumor
progression in vitro
• ~300 DIPG patients a year
• ~60 are predicted to have ACVR1 mutations
• If large-scale data sets had been
integrated with TCGA and/or rare
disease data in 2012, ACVR1 mutations
would have been identified
• 60 patients/year × 3 years = 180
children’s lives (who likely succumbed
to the disease during that time) could
have been impacted if only data were
FAIR
From Adam Resnick
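The estimate on this slide is simple arithmetic, spelled out below as a minimal sketch. The variable names are hypothetical; the inputs are the approximate figures from the slide, and the 20% fraction is merely implied by them (60 of 300).

```python
# Approximate figures quoted on the slide.
dipg_patients_per_year = 300    # ~300 DIPG patients a year
acvr1_patients_per_year = 60    # ~60 predicted to carry ACVR1 mutations
years_of_delay = 3              # gap between the two sets of studies

# Implied share of DIPG patients with an ACVR1 mutation.
acvr1_fraction = acvr1_patients_per_year / dipg_patients_per_year

# Children who could have been impacted had the mutation been found in 2012.
children_affected = acvr1_patients_per_year * years_of_delay

print(f"Implied ACVR1 fraction: {acvr1_fraction:.0%}")
print(f"Children potentially impacted: {children_affected}")
```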
36. Research Data Infrastructure …
Both funders and some institutions
see the need to move from pipes to
platforms to accelerate research…
https://blog.lexicata.com/wp-content/uploads/2015/03/platform-model-750x410.png
37. If platforms are the answer we could
ask the question…
Will {biomedical} research become
more like Airbnb?
Vivien Bonazzi
Should biomedical research be Like Airbnb?
doi: 10.1371/journal.pbio.2001818
38. I am not crazy, hear me out
• Airbnb is a platform that supports a trusted relationship between consumer
(renter) and supplier (host)
• The platform focuses on maximizing the exchange of services between supplier and
consumer and maximizing the amount of trust associated with a given stakeholder
• It seems to be working:
– 60 million users searching 2 million listings in 192 countries
– Average of 500,000 stays per night.
– Valuation of US $25bn
Should biomedical research be Like Airbnb?
doi: 10.1371/journal.pbio.2001818
39. Platforms will ultimately digitally
integrate the scholarly workflow for
human and machine analysis
Should biomedical research be Like Airbnb?
doi: 10.1371/journal.pbio.2001818
40. Supplier / Consumer / Platform
Supplier / Consumer:
• Paper Author / Paper Reader
• Data Provider / Data Consumer
• Employer / Employee
• Reagent Provider / Reagent Consumer
• Software Provider / Software Consumer
• Grant Writer / Grant Reviewer
• Educator / Student
Platform:
• MS Project
• Google Drive
• Coursera
• ResearchGate
• Academia.edu
• Open Science Framework
• Synapse
• F1000
• Rio
Pilot Open Data Lab (ODL) underway
41. The NIH, through the Big Data to Knowledge
(BD2K) initiative, is experimenting with a platform,
keeping in mind the need to overcome these
impediments
Enter The Commons
https://en.wikipedia.org/wiki/Ealing_Common#/media/File:Ealing_Common_-_geograph.org.uk_-_17075.jpg
42. Supplier / Consumer / Platform
Supplier / Consumer:
• Paper Author / Paper Reader
• Data Provider / Data Consumer
• Employer / Employee
• Reagent Provider / Reagent Consumer
• Software Provider / Software Consumer
• Grant Writer / Grant Reviewer
• Educator / Student
Platform:
• MS Project
• Google Drive
• Coursera
• ResearchGate
• Academia.edu
• Open Science Framework
• Synapse
• F1000
• Rio
Commons –
Initial focus is on integrating two
layers of the scholarly workflow
43. Commons topology
• Compute Platform: Cloud or HPC
• Services: APIs, Containers, Indexing
• Software: Services & Tools, scientific analysis tools/workflows
• Data: “Reference” Data Sets, User defined data
• Digital Object Compliance
• App store/User Interface
• Service models: PaaS, SaaS, IaaS
https://datascience.nih.gov/commons
44. Commons Compliance
• Treat products of research – data,
methods, papers etc. as digital objects
• These digital objects exist in a shared
virtual space
• Digital object compliance through FAIR
principles:
– Findable
– Accessible (and usable)
– Interoperable
– Reusable
https://commonfund.nih.gov/bd2k/commons
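To make the idea of digital object compliance concrete, here is a minimal sketch of how such a check might look. Everything in it is an assumption made for illustration: the `DigitalObject` fields, the `OPEN_FORMATS` list and the yes/no tests are hypothetical and are not the Commons' actual compliance criteria.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DigitalObject:
    """Hypothetical record for a research product (data, method, paper)."""
    identifier: Optional[str] = None       # e.g. a DOI or other persistent ID
    metadata: Dict[str, str] = field(default_factory=dict)
    access_url: Optional[str] = None       # where the object can be retrieved
    data_format: Optional[str] = None      # e.g. "csv"
    license: Optional[str] = None          # terms of reuse


# Assumed set of open, widely readable formats; purely illustrative.
OPEN_FORMATS = {"csv", "json", "xml"}


def fair_report(obj: DigitalObject) -> Dict[str, bool]:
    """Return a coarse yes/no answer for each FAIR principle."""
    return {
        "findable": bool(obj.identifier) and bool(obj.metadata),
        "accessible": bool(obj.access_url),
        "interoperable": obj.data_format in OPEN_FORMATS,
        "reusable": bool(obj.license),
    }


def is_compliant(obj: DigitalObject) -> bool:
    """An object passes only if every principle is satisfied."""
    return all(fair_report(obj).values())


example = DigitalObject(
    identifier="doi:10.0000/example",      # placeholder, not a real DOI
    metadata={"title": "Example data set"},
    access_url="https://example.org/data.csv",
    data_format="csv",
    license="CC-BY-4.0",
)
print(fair_report(example))
```

Real FAIR assessment is far richer than four boolean tests; the point is only that compliance can be expressed as properties of the object itself, so that both humans and machines can evaluate it.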
45. Why a comparison to Airbnb is not fair
• Airbnb was born digital
• The exchange of services on Airbnb is
simple compared to what is required of a
platform to support biomedical research
Nevertheless there is much to be
learnt
46. Impediments to platforms
• Current work practices by all stakeholders
• Entrenched business models
• Size of the undertaking aka resources
needed
• Trust
• Incentives to use the platform
http://www.forbes.com/sites/johnhall/2013/04/29/10-barriers-to-employee-innovation/#8bdbaa811133
47. Even if they are successful, platforms are likely to be
domain specific and only address the
infrastructure..
What else is needed?
48. We need to promote openness
• Encourage persistent identifiers e.g., ORCID
• Encourage preprints
• Encourage Open Access (OA)
• Recognize openness in hiring and P&T
• Teach open scholarship
• Promote institutional openness – repositories, Wikimedian in
residence
• Support institutional open data governance
• Support global community efforts….
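Persistent identifiers such as ORCID are designed to be machine-checkable. As a small illustration, the sketch below validates the check character of an ORCID iD, which is computed with the ISO 7064 MOD 11-2 algorithm. The function names are invented for this example, and the sample iD is the fictitious test record ORCID uses in its own documentation.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, the digits and the final check character of an ORCID iD."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    if not all(ch in "0123456789" for ch in compact[:15]):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # ORCID's documented sample iD
```

A check like this only catches typing errors; it does not confirm that the iD is registered to anyone.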
49. Wikidata – fast growing
• Get on board with developments in schema.org, knowledge
graphs, etc… as part of the rule rather than the exception
• Provide metadata and opinion for data we produce or use
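As one way of providing metadata for data we produce, a data set can be described with the schema.org `Dataset` type and serialized as JSON-LD. The sketch below is a minimal example under stated assumptions: every value (name, identifier, creator, keywords) is a placeholder invented here, and only a handful of the available schema.org properties are shown.

```python
import json

# Placeholder description of a data set using schema.org vocabulary.
dataset_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example structural bioinformatics data set",
    "description": "Placeholder description of what the data contain.",
    "identifier": "https://doi.org/10.0000/example",   # not a real DOI
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Person", "name": "Example Researcher"},
    "keywords": ["example", "open data"],
}

# Serialize to JSON-LD text that can be embedded in a web page or repository.
json_ld = json.dumps(dataset_metadata, indent=2)
print(json_ld)
```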
50. Let me summarize:
How do we address the interdisciplinary divide?
• Promote the fourth paradigm
• Work within your institutions to promote data science as an
interdisciplinary field
• Establish an open and integrated environment for data and
analytics
• Be patient and do not oversell …
52. Acknowledgements
The BD2K Team at NIH
The 150 folks who have passed through my laboratory
https://docs.google.com/spreadsheets/d/1QZ48UaKcwDl_iFCvBmJsT03FK-bMchdfuIHe9Oxc-rw/edit#gid=0
Model integration in systems pharmacology. Diverse models need to be integrated
across multiple methodologies, multiple heterogeneous data sets, organismal hierarchy, and
species (transportability).
Distribution of kinases and the number of covalent small-molecule kinase inhibitors (CSKIs) for every targeted kinase across the human kinome
$1.25bn per year to capture all data.
After a significant effort at reduction, intramurally data is spread across > 60 data centers; imagine the extramural situation.
A detailed description of the Commons Framework can be found at: https://datascience.nih.gov/commons