Big Data

A curse of interdisciplinarity
‘A challenge in the other discipline always
seems ‘easy’ because we are not hindered by
knowledge’.

Barend Mons
(DTL-DISC/ELIXIR)
NBIC, LUMC.


ELIXIR
Safeguarding the results of life science
research in Europe
European Life Sciences Infrastructure for Biological
Information
www.elixir-europe.org

DISC: the connected data departments of DTL research Hotels

[Diagram labels: DISC*, DTL, technology facilities, technology research, education & training]

*) DISC = DTL Data Integration & Stewardship Centre

What is bioinformatics?
• The science of storing,
retrieving and analysing
large amounts of biological
information
• An interdisciplinary science
involving biologists,
biochemists, computer
scientists and
mathematicians
• At the heart of modern
biology


Bioinformatics underpins life-science research

1 Genomes contain genes
2 Genes are transcribed
3 Transcripts translate to protein sequences
4 Proteins form three-dimensional structures
5 Proteins interact with each other and with small molecules to form pathways
6 Pathways combine to build systems
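
The first steps in this chain (genes are transcribed, transcripts translate to protein sequences) can be illustrated with a minimal Python sketch. It is not taken from the slides: the function names and the example sequence are hypothetical, the standard genetic code is assumed, and transcription is simplified to replacing T with U on the coding strand.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Simplified transcription: coding-strand DNA to mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy gene: start codon, one alanine codon, stop codon.
mrna = transcribe("ATGGCCTAA")
protein = translate(mrna)
print(mrna, protein)
```

Real pipelines deal with reading frames, introns and alternative genetic codes, none of which this sketch attempts.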


Life Science data: multi-omics, multi-technology, multi-organism, multi-dimensional

From molecules to medicine
[Figure: three stages. Molecular components: genomes, nucleotides, transcripts, proteins, domains, structures, small molecules. Integration: complexes, pathways, cells, tissues and organs, human individuals, human populations. Translation: biobanks, therapies, disease prevention, early diagnosis.]


What is ELIXIR?
• An ESFRI research infrastructure of global significance
• Unites Europe’s leading life science organisations in
managing and safeguarding the vast amounts of
data being generated every day by publicly funded
research.
• A large-scale initiative that will provide the facilities
necessary for Europe’s life-science researchers to
make the most of our rapidly growing store of
information about living systems, which is the
foundation on which our understanding of life is built.


Why ELIXIR?
• Creating a robust infrastructure for biological
information is a bigger task than EMBL-EBI – or
any individual organisation or nation – can take
on alone.
• Biology has by far the largest research
community:
• ~3 million life science researchers in Europe
• >6 million web hits a day at EMBL-EBI alone
• We need to involve other European partners


The challenge
• Computer speed and storage capacity are doubling every 18 months, and this rate is steady
• DNA sequence data has been doubling every 6-8 months over the last 3 years and looks set to continue for this decade
Guy Cochrane, ENA, EMBL-EBI
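
To make the gap between these two doubling rates concrete, here is a small Python sketch of the arithmetic. The function name and the 36-month window are illustrative choices; only the doubling times come from the slide.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """How many times a quantity multiplies over `months`
    if it doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


window = 36  # months (3 years), an illustrative window

compute = growth_factor(window, 18)   # computer speed and storage
dna_slow = growth_factor(window, 8)   # sequence data, slower bound
dna_fast = growth_factor(window, 6)   # sequence data, faster bound

print(f"Compute/storage grows {compute:.0f}x in {window} months")
print(f"DNA sequence data grows {dna_slow:.1f}x to {dna_fast:.0f}x in {window} months")
```

Under these assumptions, capacity grows 4x in three years while sequence data grows between roughly 23x and 64x, so the data outpaces the hardware by about an order of magnitude.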


Europe has already paid for the
science

Annual cost of generating new protein
structure data in labs around the world

Annual cost of maintaining the data
in a central database


ELIXIR’s mission
To build a sustainable
European infrastructure for
biological information,
supporting life science
research and its
translation to:
• medicine
• environment
• bioindustries
• society


A distributed pan-European
infrastructure


Benefits
ELIXIR will contribute to European innovation by:
• Optimising access and exploitation of life-science data
• Ensuring longevity of the data, thereby protecting
investments already made in research
• Enhancing the quality of European research by supporting
national efforts to increase the competence and number
of bioinformatics users through training
• Strengthening the global position and influence of Europe
in life-science research in both academia and industry


The scientific reason for ELIXIR
• Data is an essential commodity
for life-science research.
• Ten years ago, finding the
connection between a gene
and a characteristic (e.g.
drought tolerance, risk of heart
disease) could take years; now
it takes minutes.
Image courtesy of Genome Research Ltd.

• Data analysis is now the bottleneck in life-science research
• ELIXIR is our only realistic hope of easing that bottleneck


One societal reason for ELIXIR
• The era of personal genome
sequencing is upon us.
• Sequence data will not cross
national boundaries.
• Every national health
system will need expertise
to interpret it and treat
patients accordingly.
• Individuals need to be sure
that their personal
biological data are in safe
hands.


The financial reason for ELIXIR
• Europe has already spent
the money to generate the
data.
• It will waste all this
investment in research if the
future of the data is not
secured.
• Industry, from SMEs to big
multinationals, needs
access to public data to
analyse its proprietary data.


Maintaining open access
• Open access to life science is essential for
advances in many areas of research
• Open access to bioinformatics resources provides
a valuable path to discovery, one that in many
other areas of research is limited by commercial
confidentiality
• Charging for that data, or seeking to restrict
access through exercising Intellectual Property
(IP) rights, would impede progress
• ELIXIR will guarantee that open access to
biological data is maintained. Speaking with a
single voice will strengthen Europe’s influence in
such global discussions.
Mark Forster, Syngenta, member of the EMBL-EBI Industry Programme


13 ELIXIR Countries


Part two >>>> eScience in LS
• The way we discover knowledge has changed
fundamentally over just a decade.

BIGNORANCE

10/09/12

The general challenge: Data has far outgrown institutional handling capacity
The Data Deluge
The issue is everywhere, but Life Sciences is particularly
challenged and complex.

More and more, we write ‘about datasets’ that are too large to publish in narrative.
…The amount of digital data is exploding, with a staggering 1.8 zettabytes in 2011

Nanopublications & Cardinal Assertions
Nanopublication
A Nanopublication is the smallest unit of
publishable information containing:
1. Assertion
A statement of concepts in terms of one or
more ‘subject -> predicate -> object’ (triple)
relationships.
2. Provenance
a) Attribution – Who made this assertion,
when and where?
b) Supporting information – Any other
information which is relevant to the assertion
(e.g. this assertion is only valid in humans
under 18).

Cardinal Assertion
A Cardinal Assertion aggregates all ‘n’
Nanopublications making the same
assertion. It therefore has 1 assertion and
‘n’ provenances, eliminating redundancy.
[Figure: 1 identical assertion, ‘n’ different provenances]
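
The relationship between Nanopublications and Cardinal Assertions can be sketched as a small data structure. This is an illustrative Python sketch, not the authors' implementation or any published Nanopublication schema: the class names, field names and example triples are all hypothetical, and each assertion is simplified to a single triple even though the definition allows one or more.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# An assertion, simplified here to one (subject, predicate, object) triple.
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Provenance:
    author: str           # attribution: who made the assertion
    date: str             # attribution: when
    source: str           # attribution: where
    supporting: str = ""  # any other relevant information


@dataclass(frozen=True)
class Nanopublication:
    assertion: Triple
    provenance: Provenance


def to_cardinal_assertions(
    nanopubs: List[Nanopublication],
) -> Dict[Triple, List[Provenance]]:
    """Group nanopublications by identical assertion:
    1 assertion mapped to its 'n' provenances."""
    cardinal: Dict[Triple, List[Provenance]] = defaultdict(list)
    for nanopub in nanopubs:
        cardinal[nanopub.assertion].append(nanopub.provenance)
    return dict(cardinal)


# Hypothetical example data: two sources make the same assertion.
same = ("gene_A", "is_associated_with", "disease_B")
nanopubs = [
    Nanopublication(same, Provenance("lab_1", "2011-05-01", "paper_X")),
    Nanopublication(same, Provenance("lab_2", "2012-02-14", "database_Y",
                                     "only valid in humans under 18")),
    Nanopublication(("protein_C", "interacts_with", "protein_D"),
                    Provenance("lab_3", "2012-06-30", "paper_Z")),
]

cardinal = to_cardinal_assertions(nanopubs)
for assertion, provenances in cardinal.items():
    print(assertion, "->", len(provenances), "provenance(s)")
```

Grouping on the assertion is what removes the redundancy: three nanopublications collapse into two cardinal assertions, while every provenance record is retained.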

Managing volume & complexity
Combining Cardinal Assertions with Concept Profiles reduces the amount of data by ≈99.999996%
• Individual Nanopublications: > 10^14
• Individual Cardinal Assertions: > 10^11
• Concept Profiles: ≈ 4x10^6

The LS concept web: 2x2x10^6 concepts (profiles)
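
The reduction percentage on this slide can be checked against its own counts. The short Python sketch below treats the slide's bounds (more than 10^14 nanopublications, more than 10^11 cardinal assertions, about 4x10^6 concept profiles) as exact values, which is an assumption made only for the arithmetic; the function name is illustrative.

```python
def reduction_percent(before: float, after: float) -> float:
    """Percentage by which the number of items shrinks."""
    return (1 - after / before) * 100


nanopublications = 1e14     # slide gives "> 10^14"
cardinal_assertions = 1e11  # slide gives "> 10^11"
concept_profiles = 4e6      # slide gives "≈ 4x10^6"

step_1 = reduction_percent(nanopublications, cardinal_assertions)
overall = reduction_percent(nanopublications, concept_profiles)

print(f"Nanopublications -> Cardinal Assertions: {step_1:.1f}% fewer items")
print(f"Nanopublications -> Concept Profiles:    {overall:.6f}% fewer items")
```

With these values the overall reduction comes out at 99.999996%, matching the figure quoted on the slide.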

A dynamic Concept Web versus a static Ontology


[Figure legend: known reference pairs; non-co-occurrence pairs. Annotations: more mutual information; no increase in concept overlap; including manual curation; more concepts in common; removal of low info paths]

eScience…. in silico reasoning and in cerebro validation

Expert Skype calls

Reading up

Organisation of the ecosystem
[Diagram: actors are the Global Authority, Nanopublishers, App & Service Providers, Knowledge Management Providers, Original Data Owners, and Academic & Commercial Users. Other labels: Endorse; Assist & Certify; Best Practices; ONS/INSs; CA Space (OCS & ICS); Application development; Reasoning services; technical and process consultancy; project delivery capacity; Knowledge Discovery]

IN ANY CASE: regardless of how
‘sensitive’ your data is, it is malpractice
to:
- Generate data without a solid stewardship plan
- Build impenetrable SILOS
- Fail to record provenance
- Store data in a non-interoperable format
- Think that data=information

-EVEN if your only goal is the Nobel Prize
(or, for the Dutch, a Spinoza Prize)


Acceptance of Semantic Web Approach

Over the last decade, academic
research organisations developed
new methodologies and tools to
address the Big Data problem.
Global agreement by leading
scientists on a unique
Nanopublication solution.
Hundreds of millions already invested
in the basic technology.
Applicable as a technology across
(STM) domains and industries.
Pharmaceutical companies are
early adopters (Innovative
Medicines Initiative).

The ‘Dutch Team’
• Herman van Haagen, MSc. (LUMC)
• Dr. Peter Bram ‘t Hoen (LUMC)
• Dr. Marco Roos (LUMC)
• Dr. Erik Schultes (LUMC)
• Prof. Johan den Dunnen (LUMC)
• Prof. Gertjan van Ommen (LUMC)
• Dr. Erik van Mulligen (EMC)
• Dr. Jan Kors (EMC)
• Dr. Martijn Schuemie (EMC)
• Prof. Johan van der Lei (EMC)
• Dr. Rob Hooft (NBIC)
• Dr. Christine Chichester (NBIC)
• Dr. Leon Mei (NBIC)
• Kees Burger (NBIC)
• Bharat Singh (NBIC/EMC)
• Dr. Marc van Driel (NBIC)
• Dr. Ruben Kok (NBIC)
• Prof. Marcel Reinders (NBIC)
• Prof. Jaap Heringa (NBIC)
• Prof. Gert Vriend (NBIC)
• Dr. Morris Schwertz (BBMRI, CWA)
• Dr. Andra Waagmeester (NBIC)
• Dr. Kristina Hettne (LUMC)
• Dr. Rene van Schaik (eScience Centre)
• Drs. Albert Mons (PHORTOS consultants)
• Mr. Drs. Arie Baak (PHORTOS consultants)

Acknowledging… CWA - Open PHACTS
• Prof. Amos Bairoch (SIB, Switzerland, CWA)
• Prof. Carole Goble (Manchester, CWA, OPS)
• Prof. Katy Borner (Indiana University, CWA)
• Prof. Mark Musen (NCBO, Stanford, CWA, OPS)
• Dr. Pascale Gaudet (UniProt, ISB, CWA)
• Dr. Mike Colon (VIVO, UF, CWA)
• Prof. Maryann Martone (Force 11, USC, CWA)
• Dr. Nigam Shah (NCBO, Stanford, CWA, OPS)
• Dr. Mark Wilkinson (Canada, CWA)
• Abel Packer (Brazil, Scielo, CWA, OPS)
• Jan Velterop (ACKnowledge, CWA, OPS)
• Albert Mons (CWA, NBIC)
• Prof. Frank van Harmelen (FUA/LARKC, CWA, OPS)
• Dr. Chris Evelo (Maastricht, CWA, OPS)
• Dr. Antony Williams (RSC/ChemSpider, CWA, OPS)
• Dr. Richard Kidd (RSC, OPS)
• Dr. Paul Groth (FUA, CWA, OPS)
• Dr. Michel Dumontier (Canada, CWA, OPS)
• Dr. Andrew Gibson (UA, CWA, OPS)
• Dr. Bryn Williams-Jones (Pfizer, OPS)
• Dr. Ian Dix (Astra Zeneca, OPS)
• Dr. Niklas Blomberg (Astra Zeneca, OPS)
• Dr. Mike Barnes (GSK, OPS)
• Prof. Jan-erik Litton (CWA, BBMRI)
