1. A curse of interdisciplinarity
‘A challenge in the other discipline always
seems ‘easy’ because we are not hindered by
knowledge’.
Barend Mons
(DTL-DISC/ELIXIR)
NBIC, LUMC.
3. ELIXIR
Safeguarding the results of life science
research in Europe
European Life Sciences Infrastructure for Biological
Information
www.elixir-europe.org
4. DISC: the connected data departments of DTL research Hotels
[Diagram labels: DISC*, DTL, technology facilities, technology research, education & training]
*) DISC = DTL Data Integration & Stewardship Centre
5. What is bioinformatics?
• The science of storing,
retrieving and analysing
large amounts of biological
information
• An interdisciplinary science
involving biologists,
biochemists, computer
scientists and
mathematicians
• At the heart of modern
biology
6. Bioinformatics underpins life-science research
1) Genomes contain genes
2) Genes are transcribed
3) Transcripts translate to protein sequences
4) Proteins form three-dimensional structures
5) Proteins interact with each other and with small molecules to form pathways
6) Pathways combine to build systems
7. Life Science data: Multi-omics, multi-technology, multi-organism, multi-dimensional
8. From molecules to medicine
[Figure: three columns headed Molecular components, Integration and Translation. Labels include: genomes, nucleotides, transcripts, proteins, domains, structures, small molecules, complexes, pathways, cells, tissues and organs, human individuals, human populations, biobanks, therapies, disease prevention, early diagnosis]
9. What is ELIXIR?
• An ESFRI research infrastructure of global significance
• Unites Europe’s leading life science organisations in
managing and safeguarding the vast amounts of
data being generated every day by publicly funded
research.
• A large-scale initiative that will provide the facilities
necessary for Europe’s life-science researchers to
make the most of our rapidly growing store of
information about living systems, which is the
foundation on which our understanding of life is built.
10. Why ELIXIR?
• Creating a robust infrastructure for biological
information is a bigger task than EMBL-EBI – or
any individual organisation or nation – can take
on alone.
• Biology has by far the largest research
community:
• ~3 million life science researchers in Europe
• >6 million web hits a day at EMBL-EBI alone
• We need to involve other European partners
11. The challenge
• Computer speed
and storage capacity
are doubling every 18
months and this
rate is steady
• DNA sequence data
has been doubling every 6-8
months over the
last 3 years and
looks set to continue for
this decade
Guy Cochrane, ENA, EMBL-EBI
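The two doubling rates on this slide can be compared with a short calculation. The sketch below is only an illustration: it assumes idealised exponential growth and a ten-year horizon (the horizon is an assumption, not a figure from the slide), and the function name `growth_factor` is hypothetical.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years`, given a doubling time in months."""
    return 2 ** (years * 12 / doubling_months)


# Assumed horizon for illustration only (not stated on the slide).
YEARS = 10

compute = growth_factor(YEARS, 18)   # compute speed and storage capacity
seq_slow = growth_factor(YEARS, 8)   # sequence data, slower end of the 6-8 month range
seq_fast = growth_factor(YEARS, 6)   # sequence data, faster end of the range

print(f"Compute/storage: x{compute:,.0f}")
print(f"Sequence data:   x{seq_slow:,.0f} to x{seq_fast:,.0f}")
print(f"Gap (slow end):  x{seq_slow / compute:,.0f}")
```

Under these assumptions, compute grows roughly a hundredfold over ten years while sequence data grows by a factor in the tens of thousands or more, which is the gap the slide points at.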
12. Europe has already paid for the science
Annual cost of generating new protein
structure data in labs around the world
Annual cost of maintaining the data
in a central database
13. ELIXIR’s mission
To build a sustainable
European infrastructure for
biological information,
supporting life science
research and its
translation to:
• medicine
• environment
• bioindustries
• society
15. Benefits
ELIXIR will contribute to European innovation by:
• Optimising access and exploitation of life-science data
• Ensuring longevity of the data, thereby protecting
investments already made in research
• Enhancing the quality of European research by supporting
national efforts to increase the competence and number
of bioinformatics users through training
• Strengthening the global position and influence of Europe
in life-science research in both academia and industry
16. The scientific reason for ELIXIR
• Data is an essential commodity
for life-science research.
• Ten years ago, finding the
connection between a gene
and a characteristic (e.g.
drought tolerance, risk of heart
disease) could take years; now
it takes minutes.
• Data analysis is now the bottleneck in life-science research
• ELIXIR is our only realistic hope of easing that bottleneck
Image courtesy of Genome Research Ltd.
17.
18. One societal reason for ELIXIR
• The era of personal genome
sequencing is upon us.
• Sequence data will not cross
national boundaries.
• Every national health
system will need expertise
to interpret it and treat
patients accordingly.
• Individuals need to be sure
that their personal
biological data are in safe
hands.
19. The financial reason for ELIXIR
• Europe has already spent
the money to generate the
data.
• It will waste all this
investment in research if the
future of the data is not
secured.
• Industry, from SMEs to big
multinationals, needs
access to public data to
analyse its proprietary data.
20. Maintaining open access
• Open access to life science is essential for
advances in many areas of research
• Open access to bioinformatics resources provides
a valuable path to discovery, one that in many
other areas of research is limited by commercial
confidentiality
Mark Forster, Syngenta, member of the EMBL-EBI
Industry Programme
• Charging for that data, or seeking to restrict
access through exercising Intellectual Property
(IP) rights, would impede progress
• ELIXIR will guarantee that open access to
biological data is maintained. Speaking with a
single voice will strengthen Europe’s influence in
such global discussions.
22. Part two >>>> eScience in LS
• The way we discover knowledge has changed
fundamentally over just a decade.
BIGNORANCE
10/09/12
23. The general challenge: Data has far outgrown institutional handling capacity
The Data Deluge is everywhere
The Issue: But Life Sciences is particularly
challenged and complex.
More and more we write ‘about datasets’
that are too large to publish in narrative
….The amount of digital data is
exploding, with a staggering 1.8
zettabytes in 2011
24. Nanopublications & Cardinal Assertions
Nanopublication
A Nanopublication is the smallest unit of
publishable information containing:
1. Assertion
A statement of concepts in terms of one or
more ‘subject -> predicate -> object’ (triple)
relationships.
2. Provenance
a) Attribution – Who made this assertion,
when and where?
b) Supporting information – Any other
information which is relevant to the assertion
(e.g. this assertion is only valid in humans
under 18).
Cardinal Assertion
A Cardinal Assertion aggregates all ‘n’
Nanopublications making the same
assertion. It therefore has 1 assertion and
‘n’ provenances, eliminating redundancy.
[Diagram: 1 identical assertion, ‘n’ different provenances]
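The relationship between Nanopublications and Cardinal Assertions can be sketched as a small data structure. This is a minimal illustration in plain Python, not the nanopublication schema or any library's API: all class, field and function names are hypothetical, each assertion is simplified to a single triple (the slide allows one or more), and the example triples and attributions are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

# Simplifying assumption: one (subject, predicate, object) triple per assertion.
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Provenance:
    who: str               # attribution: who made the assertion
    when: str              # attribution: when it was made
    where: str             # attribution: where it was made
    supporting: str = ""   # any other information relevant to the assertion


@dataclass(frozen=True)
class Nanopublication:
    assertion: Triple
    provenance: Provenance


@dataclass
class CardinalAssertion:
    assertion: Triple
    provenances: List[Provenance] = field(default_factory=list)


def aggregate(nanopubs: Iterable[Nanopublication]) -> List[CardinalAssertion]:
    """Group nanopublications that make the same assertion: 1 assertion, n provenances."""
    grouped: Dict[Triple, CardinalAssertion] = {}
    for nanopub in nanopubs:
        cardinal = grouped.setdefault(
            nanopub.assertion, CardinalAssertion(nanopub.assertion)
        )
        cardinal.provenances.append(nanopub.provenance)
    return list(grouped.values())


# Made-up example data, for illustration only.
nanopubs = [
    Nanopublication(("gene:A", "is_associated_with", "disease:X"),
                    Provenance("lab-1", "2011", "source-1")),
    Nanopublication(("gene:A", "is_associated_with", "disease:X"),
                    Provenance("lab-2", "2012", "source-2", "valid in humans under 18")),
    Nanopublication(("gene:B", "interacts_with", "gene:A"),
                    Provenance("lab-1", "2012", "source-3")),
]

cardinal_assertions = aggregate(nanopubs)
for ca in cardinal_assertions:
    print(ca.assertion, "supported by", len(ca.provenances), "provenance(s)")
```

Keying the grouping on the assertion itself is what removes the redundancy: three nanopublications collapse into two Cardinal Assertions, and no provenance is lost.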
29. [Figure legend: known reference pairs versus non-co-occurrence pairs. Panel annotations: More mutual information; No increase in concept overlap; Including manual curation; More concepts in common; Removal of low info paths]
30.
31. eScience…. in silico reasoning and in cerebro validation
Expert Skype calls
Reading up
32. Organisation of the ecosystem
[Diagram: four columns headed Global Authority, Nanopublishers, App & Service Providers, and Users. Labels include: Endorse; Assist & Certify; CA Space (OCS & ICS); ONS/INSs; Original Data Owners; Application development; Reasoning services; Best Practices; technical and process consultancy; project delivery capacity; Knowledge Management Providers; Knowledge Discovery; Academic & Commercial Users]
34. IN ANY CASE: regardless of how
‘sensitive’ your data is, it is malpractice
to:
- Generate data without a solid stewardship plan
- Build impenetrable SILOS
- Fail to record provenance
- Store data in non-interoperable formats
- Think that data=information
EVEN if your only goal is the Nobel Prize
(or for the Dutch: a Spinoza Prize)
35. Acceptance of Semantic Web Approach
Over the last decade, academic
research organisations developed
new methodologies and tools to
address the Big Data problem.
Global agreement by leading
scientists on unique
Nanopublication solution.
Hundreds of millions already invested
in the underlying technology
Applicable as a technology across
(STM) domains and industries.
Pharmaceutical companies are
early adopters (Innovative
Medicines Initiative).
36. Acknowledging…
The ‘Dutch Team’
• Herman van Haagen, MSc. (LUMC)
• Dr. Peter Bram ‘t Hoen (LUMC)
• Dr. Marco Roos (LUMC)
• Dr. Erik Schultes (LUMC)
• Prof. Johan den Dunnen (LUMC)
• Prof. Gertjan van Ommen (LUMC)
• Dr. Erik van Mulligen (EMC)
• Dr. Jan Kors (EMC)
• Dr. Martijn Schuemie (EMC)
• Prof. Johan van der Lei (EMC)
• Dr. Rob Hooft (NBIC)
• Dr. Christine Chichester (NBIC)
• Dr. Leon Mei (NBIC)
• Kees Burger (NBIC)
• Bharat Singh (NBIC/EMC)
• Dr. Marc van Driel (NBIC)
• Dr. Ruben Kok (NBIC)
• Prof. Marcel Reinders (NBIC)
• Prof. Jaap Heringa (NBIC)
• Prof. Gert Vriend (NBIC)
• Dr. Morris Swertz (BBMRI, CWA)
• Dr. Andra Waagmeester (NBIC)
• Dr. Kristina Hettne (LUMC)
• Dr. Rene van Schaik (eScience Centre)
• Drs. Albert Mons (PHORTOS consultants)
• Mr. Drs. Arie Baak (PHORTOS consultants)
CWA - Open PHACTS
• Prof. Amos Bairoch (SIB, Switzerland, CWA)
• Prof. Carole Goble (Manchester, CWA, OPS)
• Prof. Katy Börner (Indiana University, CWA)
• Prof. Mark Musen (NCBO, Stanford, CWA, OPS)
• Dr. Pascale Gaudet (UniProt, ISB, CWA)
• Dr. Mike Conlon (VIVO, UF, CWA)
• Prof. Maryann Martone (Force 11, USC, CWA)
• Dr. Nigam Shah (NCBO, Stanford, CWA, OPS)
• Dr. Mark Wilkinson (Canada, CWA)
• Abel Packer (Brazil, Scielo, CWA, OPS)
• Jan Velterop (ACKnowledge, CWA, OPS)
• Albert Mons (CWA, NBIC)
• Prof. Frank van Harmelen (FUA/LARKC, CWA, OPS)
• Dr. Chris Evelo (Maastricht, CWA, OPS)
• Dr. Antony Williams (RSC/ChemSpider, CWA, OPS)
• Dr. Richard Kidd (RSC, OPS)
• Dr. Paul Groth (FUA, CWA, OPS)
• Dr. Michel Dumontier (Canada, CWA, OPS)
• Dr. Andrew Gibson (UA, CWA, OPS)
• Dr. Bryn Williams-Jones (Pfizer, OPS)
• Dr. Ian Dix (Astra Zeneca, OPS)
• Dr. Niklas Blomberg (Astra Zeneca, OPS)
• Dr. Mike Barnes (GSK, OPS)
• Prof. Jan-Eric Litton (CWA, BBMRI)
Editor's notes
Messages: The data in the life sciences are not only immense, but also highly complex.
• First, data are captured from the different levels of organisation in living organisms: DNA, RNA, protein, metabolites, cells, tissues, organs and whole organisms. Beyond that, ecological, social-behavioural and epidemiological data also play a key role.
• These data are captured with a variety of instruments and techniques and come in many different formats (not necessarily compatible).
• Such data are generated in studies on many different (model) organisms, from viruses and bacteria to humans. Many data need interpretation across species.
• Many data have to be captured in time or space series and are therefore also multidimensional.
• DISC will not only provide the necessary tools and compute infrastructure but, critically, also the experts to integrate and connect the data towards biological interpretation. In some cases this will involve only two pieces of the puzzle, but in many cases more.
• The final goal is biological understanding and societal application, not just major publications, in the Green, the Red and the White sectors of biology.
Messages:
• The Big Data problem is now pervading mainstream non-science literature as well, and the deluge is everywhere. However, the complexity and multidisciplinary nature of LS data make them a particular challenge.
• No single institution, or even Big Pharma or DSM/UNILEVER, can have all the technology and expertise in-house (see IMI, ESFRI).
• Even if economically and technically feasible, repeating the deep analysis and preprocessing of massive (frequently publicly available) datasets behind the firewalls of institutions or companies is now considered a waste of precious resources, as much of it is precompetitive. The real added value is in the biological interpretation of the data and its application in red, green and white innovations.
• Modern science is really about ‘projecting’ one’s own limited data on a massive body of ‘known’ and prior biological knowledge, way beyond ‘reading’.
• DISC will support all super-institutional needs for data integration, stewardship and interpretation at the request of the users.
• DISC will be closely associated with the top research institutions participating in it, and distributed over multiple concentrations of expertise and infrastructure, to ensure a continued ‘cutting edge’ offering in all four infrastructural aspects (computing, tooling, expertise and training).
• Several key technologies can be applied beyond the Life Sciences.
• If The Netherlands misses out on massive data expertise, other centres will develop and crucial expertise will ‘leave’ our country. At present, NL has a leading role and can benefit (example: BGI China).
The ecosystem approach with interoperable data makes knowledge management and knowledge discovery possible across ALL data.
By definition, a private party cannot take this on, because it is not a trusted party (community building, certification, ONS management, etc.). Hence the 4 columns and all the work that has already been done, including 'adaptation' by a great many relevant Associations and Academic Institutions (CWA, W3C, ..................). This calls for a PPP approach in which Elsevier plays its own role, strategically positioned in the value chain of the ecosystem. The trusted-party activities, the infrastructure and the community are handled by others.