1. Where is Open Going?
Philip E. Bourne
pbourne@ucsd.edu
http://www.slideshare.net/pebourne/
3/01/14
2014 SPARC Annual Meeting
2. Where is Open Going?
The answer depends on who you
ask
Here is my biased viewpoint
3. My Background/Bias
• Mostly Biomedical
• RCSB PDB/IEDB Database Developer – Views on
community, quality, sustainability …
• PLOS Journal Co-founder – Open Science Advocate
• Associate Vice Chancellor for Innovation – Business
models, interaction with the private
sector, sustainability
• Professor – Mentoring, reward system, value (or not)
of research
• NIH Strategist/Transformer - ??
4. Perhaps the first question to ask is:
What is the endpoint?
5. Where Is Open Going?
6. What Does The Democratization of
Science Imply?
• The obvious – participation by all
• Not so obvious
– More scrutiny
– New types of rewards
– More equal value placed on all participants
– The removal of artificial boundaries that corral
knowledge (through power and resources) within
silos that do not make sense as complexity
increases
7. Consider some personal examples that
illustrate these implications
8. More Scrutiny – Highlights
Lack of Reproducibility
• I can’t immediately reproduce the research
in my own laboratory:
• It took an estimated 280 hours for an average user
to approximately reproduce the paper
• Workflows are maturing and becoming helpful
• Data and software versions and accessibility
prevent exact reproducibility
Daniel Garijo et al. 2013 Quantifying Reproducibility in Computational Biology:
The Case of the Tuberculosis Drugome PLOS ONE 8(11) e80278.
9. Why New Types of
Rewards?
• I have a paper with 16,000 citations that no
one has ever read
• I have papers in PLOS ONE that have more
citations than ones in PNAS
• I have data sets I am proud of but few places to
put them
• I edited a journal but it did not count for much
10. Equal Value Placed
on Participants
• The UC System has Research Scientists (RS) &
Project Scientists (PS) as well as tenured
faculty – RS/PS have no senate rights yet:
– RS/PS frequently teach
– RS/PS frequently have more grant money
– RS/PS typically perform more service
– RS/PS are most of the data scientists you know
12. Institutional Boundaries
• Academia – Departments of
physics, math, biology, chemistry etc. persist
but scholars rarely confine themselves to
these disciplines
• NIH – 27 institutes and centers, many
dedicated to specific diseases & conditions –
yet a specific gene may transcend ICs
13. I have argued that the democratization
of science is compelling
I have not argued for the value of open
access to this picture because you
know that already
14. I Would Also Argue That This Process is
About to Accelerate
• Others provide a more
compelling argument:
– Google car
– 3D printers
– Waze
– Robotics
15. From the Second Machine Age
From: The Second Machine Age: Work, Progress, and Prosperity in a
Time of Brilliant Technologies by Erik Brynjolfsson & Andrew McAfee
16. So what will this look like for an
institution?
Institutions will become digital enterprises
17. Components of The Academic Digital
Enterprise
• Consists of digital assets
– E.g. datasets, papers, software, lab notes
• Each asset is uniquely identified and has
provenance, including access control
– E.g. publishing simply involves changing the access
control
• Digital assets are interoperable across the
enterprise
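One way to picture the asset model on this slide is the minimal Python sketch below. It is an illustration only, not a description of any existing institutional system: the class name DigitalAsset, the three access levels, and the publish method are all hypothetical names chosen for the example. It shows an asset that carries a unique identifier and a provenance log, and for which "publishing" is nothing more than a recorded change of access control.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
import uuid


class Access(Enum):
    """Hypothetical access levels; a real enterprise would define its own."""
    PRIVATE = "private"
    INSTITUTION = "institution"
    PUBLIC = "public"


@dataclass
class DigitalAsset:
    """A dataset, paper, piece of software, or lab note (illustrative only)."""
    kind: str
    title: str
    owner: str
    access: Access = Access.PRIVATE
    # Each asset gets its own unique identifier.
    asset_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Provenance is kept as a list of (timestamp, actor, action) entries.
    provenance: list = field(default_factory=list)

    def __post_init__(self):
        self._record(self.owner, "created")

    def _record(self, actor, action):
        stamp = datetime.now(timezone.utc).isoformat()
        self.provenance.append((stamp, actor, action))

    def set_access(self, actor, access):
        # Access changes are themselves part of the provenance trail.
        self._record(actor, f"access {self.access.value} -> {access.value}")
        self.access = access

    def publish(self, actor):
        # "Publishing simply involves changing the access control."
        self.set_access(actor, Access.PUBLIC)


notes = DigitalAsset(kind="lab notes", title="Rotation notebook", owner="jane")
notes.publish("jane")
```

After the call to publish, notes.access is Access.PUBLIC and the provenance log holds two entries: the creation and the access change.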
18. Life in the Academic Digital Enterprise
• Jane scores extremely well in parts of her graduate on-line neurology class. Neurology
professors, whose research profiles are on-line and well described, are automatically notified of
Jane’s potential based on a computer analysis of her scores against the background interests of the
neuroscience professors. Consequently, professor Smith interviews Jane and offers her a research
rotation. During the rotation she enters details of her experiments related to understanding a
widespread neurodegenerative disease in an on-line laboratory notebook kept in a shared on-line
research space – an institutional resource where stakeholders provide metadata, including access
rights and provenance beyond that available in a commercial offering. According to Jane’s
preferences, the underlying computer system may automatically bring to Jane’s attention Jack, a
graduate student in the chemistry department whose notebook reveals he is working on using
bacteria for purposes of toxic waste cleanup. Why the connection? They reference the same gene a
number of times in their notes, which is of interest to two very different disciplines – neurology and
environmental sciences. In the analog academic health center they would never have discovered
each other, but thanks to the Digital Enterprise, pooled knowledge can lead to a distinct advantage.
The collaboration results in the discovery of a homologous human gene product as a putative target
in treating the neurodegenerative disorder. A new chemical entity is developed and patented.
Accordingly, by automatically matching details of the innovation with biotech companies worldwide
that might have potential interest, a licensee is found. The licensee hires Jack to continue working
on the project. Jane joins Joe’s laboratory, and he hires another student using the revenue from the
license. The research continues and leads to a federal grant award. The students are
employed, further research is supported and in time societal benefit arises from the technology.
From What Big Data Means to Me JAMIA 2014 21:194
19. Let us now turn to the biomedical
sciences and look at what might
happen if the NIH were to become a
digital enterprise
20. As of Today
• Assumed the role of Associate Director for
Data Science (ADDS):
– NIH Data Science Point Person
– Reports to NIH Director
– Lead the BD2K initiative
– Trans-NIH responsibilities for data
Eric Green, Acting
[Modified slide from Eric Green]
21. The focus is on data, but I do not think
that can be separated from the
research life cycle as you will see…
22. I Want To Engage With This
Community To:
• Help me understand the most pressing
problems
• Begin a dialog
• Inform you of what I am currently thinking
• Inform you of relevant NIH initiatives that are
underway or planned
• Have you change my thinking appropriately
23. The NIH process thus far …
An external advisory group provided a
valuable blueprint for what should be
done
acd.od.nih.gov/diwg.htm
24. Blueprint Recommendations
• Promote central and federated catalogs
– Establish minimal metadata framework
– Tools to facilitate data sharing
– Elaborate on existing data sharing policies
• Support methods and applications
– Fund all phases of software development
– Leverage lessons from National Centers
• Training
– More funding
– Enhance review of training apps
– Quantitative component to all awards
• On campus IT strategic plan
– Catalog of existing tools
– Informatics laboratory
– Ditto big data
• Sustainable funding commitment
acd.od.nih.gov/diwg.htm
25. Let me outline in general terms where
I see my effort being spent going
forward
http://pebourne.wordpress.com/2013/12/
26. ADDS Initial Thrusts
• How data are currently being used
• Lightweight metadata standards
• Data & software registries
• Expanded policies on data sharing, open source
software
• Training programs & reward systems
• Institutional incentives
• Private sector incentives
• Data centers serving community needs
28. We need to start by asking, how are
we using the data now?
Only then can we make rational
decisions about data – large or small
29. How Data Are Used
[Figure: Structure Summary page activity for H1N1 Influenza related structures, Jan. 2008 to Jul. 2010]
3B7E: Neuraminidase of A/Brevig Mission/1/1918
H1N1 strain in complex with zanamivir
1RUZ: 1918 H1 Hemagglutinin
* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
[Andreas Prlic]
30. We Need to Learn from Industries
Whose Livelihood Addresses the
Question of Use
31. ADDS Initial Thrusts – More Detail
• Now:
– Data centers (under review)
– Data science training grants (call out)
– Pilot data catalog consortium (call out)
– Genomic Data Sharing Policy (being finalized)
– Piloting “NIH-drive”
• What Is Planned:
– Extended public-private programs specifically for data science
activities
– Interagency activities
– International exchange programs
– Cold Spring Harbor-like training facilities – bicoastal?
– Programs for better data descriptions
– Reward institutions/communities
– Policies to get clinical trial data into the public domain
33. Pilot NIH-Drive
• Investigator A from the NCI makes frequent
reference to the overexpression of genes x and y.
• Investigator B from the NHLBI makes frequent
reference to the underexpression of genes x and
y.
• Automatic notification of a potential common
interest before publication or database deposition
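The matching step in this scenario can be sketched in a few lines of Python. This is a hedged illustration, not how any actual "NIH-drive" pilot works: the function names, the idea of representing notes as a flat list of gene mentions, and the threshold of three mentions for "frequent" are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_genes(mentions, min_mentions=3):
    """Return the genes mentioned at least min_mentions times.

    mentions is a list of gene symbols, one entry per mention in an
    investigator's notes (a hypothetical, simplified representation).
    """
    counts = Counter(mentions)
    return {gene for gene, n in counts.items() if n >= min_mentions}


def common_interests(mentions_by_investigator, min_mentions=3):
    """List investigator pairs who frequently reference the same genes."""
    frequent = {
        who: frequent_genes(mentions, min_mentions)
        for who, mentions in mentions_by_investigator.items()
    }
    alerts = []
    for a, b in combinations(sorted(frequent), 2):
        shared = frequent[a] & frequent[b]
        if shared:
            alerts.append((a, b, sorted(shared)))
    return alerts


notes = {
    "A (NCI)": ["x", "y", "x", "y", "x", "y", "z"],
    "B (NHLBI)": ["y", "x", "y", "x", "x", "y"],
    "C": ["z", "z", "z", "x"],
}
for a, b, genes in common_interests(notes):
    print(f"Notify {a} and {b}: shared interest in genes {', '.join(genes)}")
```

With this toy input only investigators A and B are flagged, for genes x and y. A real system would also need to respect each investigator's access preferences before notifying anyone, which this sketch leaves out.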
34. Let me come back to the big picture…
35. First consider what we do (or wish we
could do) every day:
We take actions on digital data
increasingly across boundaries
36. Actions on Biomedical Data Imply:
• Ensuring data quality and hence trust
• Making data sustainable
• Making data open and accessible
• Making data findable
• Providing suitable metadata and annotation
• Making data queryable
• Making data analyzable
• Presenting data so as to maximize its value
• Rewarding good data practices
38. Boundaries on Biomedical Data
Imply:
• Working across biological scales
• Working across biomedical disciplines
• Working across basic and clinical research and
practice
• Working across institutional boundaries
• Working across public and private sectors
• Working across national and international
borders
• Working across funding agencies
40. These issues have been around a long
time
The good news is that “Big Data” has
brought more attention to the problem
41. What Are Big Data?
• Large datasets from high throughput
experiments
• Large numbers of small datasets
• Data which are “ill-formed”
• The why (causality) is replaced by the what
• A signal that a fundamental change is taking
place – a tipping point?
42. The NIH is Starting to Think About the
Digital Enterprise, Witness…
bd2k.nih.gov
43. What Will Define the NIH Digital
Enterprise?
• NCBI/NLM
• Trans-NIH collaboration – a culture change
• Long-term NIH strategic planning
• The BD2K Initiative
• A “hub” of data science activities
• International cooperation
• Interagency cooperation
• Data sharing policies
• External forces…
44. This is great, but what will it look like
to the end user and to those
interested in scholarly
communication?
45. One Possible End Point
[Figure annotations]
0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB
[User steps]
1. User clicks on thumbnail
2. Metadata and a webservices call provide a renderable image that can be annotated
3. Selecting a feature provides a database/literature mashup
4. That leads to new papers
PLoS Comp. Biol. 2005 1(3) e34
46. To get to that end point we have to
consider the complete research
lifecycle
47. The Research Life Cycle will
Persist
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
48. Tools and Resources Will Continue
To Be Developed
Authoring Tools
Lab Notebooks
Data Capture
Analysis Tools
Software
Scholarly Communication
Visualization
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
49. Those Elements of the Research Life
Cycle will Become More Interconnected
Around a Common Framework
Authoring Tools
Lab Notebooks
Data Capture
Software
Analysis Tools
Scholarly Communication
Visualization
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
50. New/Extended Support Structures Will
Emerge
Authoring Tools
Data Capture
Lab Notebooks
Analysis Tools
Scholarly Communication
Software
Visualization
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Commercial & Public Tools
Discipline-Based Metadata Standards
Community Portals
Git-like Resources By Discipline
Data Journals
New Reward Systems
Training
Institutional Repositories
Commercial Repositories
51. We Have a Ways to Go
Authoring Tools
Data Capture
Lab Notebooks
Software
Analysis Tools
Scholarly Communication
Visualization
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Commercial & Public Tools
Discipline-Based Metadata Standards
Community Portals
Git-like Resources By Discipline
Data Journals
New Reward Systems
Training
Institutional Repositories
Commercial Repositories
52. Where is Open Going?
• Slowly towards the democratization of science
• Which changes how institutions think and
operate – they become digital enterprises
• This in turn impacts the scholarly research
lifecycle and hence scholarly communication
• I will be working to help the NIH be a leading
institution in this change