A proteomics data “gold mine” at your disposal: Now that the data is there, what can we do with it?

A proteomics data “gold mine” at your
disposal: Now that the data is there, what
can we do with it?
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
E-mail: juan@ebi.ac.uk

Juan A. Vizcaíno
juan@ebi.ac.uk
2nd International de.NBI Symposium
Bielefeld, 23 October 2017
Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• Open analysis pipelines for proteomics data
• Challenges for public proteomics data reuse

Juan A. Vizcaíno
juan@ebi.ac.uk
New ELIXIR Proteomics “Community”
Vizcaíno et al., F1000Research, 2017

Juan A. Vizcaíno
juan@ebi.ac.uk
• PRIDE stores mass spectrometry (MS)-based
proteomics data:
• Peptide and protein expression data
(identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored
• Leading ProteomeXchange.
• From July 2017, an ELIXIR core resource
PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016

Juan A. Vizcaíno
juan@ebi.ac.uk
Stats (1): Data submissions to PRIDE Archive continue to increase
Top months: 224 and 234 datasets submitted on July & August,
respectively (~2,000 in 2016).
> 400 TBs of data

Juan A. Vizcaíno
juan@ebi.ac.uk
ProteomeXchange: A Global, distributed proteomics database
PASSEL
(SRM data)
PRIDE
(MS/MS data)
MassIVE
(MS/MS data)
Raw
ID/Q
Meta
jPOST
(MS/MS data)
Mandatory raw data deposition
since July 2015
• Goal: Development of a framework to allow standard data submission and
dissemination pipelines between the main existing proteomics repositories.
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017

Juan A. Vizcaíno
juan@ebi.ac.uk
Origin:
1229 USA
902 Germany
618 China
583 United Kingdom
319 France
250 Netherlands
213 Canada
208 Switzerland
200 Australia
179 Spain
172 Austria
168 Denmark
138 Sweden
133 India
115 Japan
115 Belgium
98 Norway
75 Italy
69 Taiwan
57 Brazil
51 Israel
51 Singapore
43 Finland
44 Ireland…
ProteomeXchange: 7,475 datasets up until September 1st 2017
Type:
4805 PRIDE partial
1552 PRIDE complete
649 MassIVE
117 PeptideAtlas/PASSEL
complete
109 jPOST
243 reprocessed datasets
Publicly Accessible:
4051 datasets, 54% of all
89% PRIDE
6% MassIVE
3% PASSEL
2% jPOST
Top Species studied by at least
50 datasets:
2,787 Homo sapiens
958 Mus musculus
236 Saccharomyces cerevisiae
229 Arabidopsis thaliana
190 Rattus norvegicus
157 Escherichia coli
68 Bos taurus
62 Drosophila melanogaster
~ 1,100 species in total

Juan A. Vizcaíno
juan@ebi.ac.uk
Stats (2): Data growth in EMBL-EBI resources
Genomics
Transcriptomics
Metabolomics
Proteomics

Juan A. Vizcaíno
juan@ebi.ac.uk
Data downloads are increasing
Data download volume for
PRIDE in 2016: 241 TB
0
50
100
150
200
250
300
2013 2014 2015 2016
Downloads in TBs

Juan A. Vizcaíno
juan@ebi.ac.uk
Public data re-analysis -> Data repurposing
• Individual authors can re-analyze MS proteomics
raw data with new hypotheses in mind (not taken
into account by the original authors).
• Proteogenomics studies.
• Discovery of new PTMs/ variants.
• Meta-analysis studies.

Juan A. Vizcaíno
juan@ebi.ac.uk
Across-omics -> Proteogenomics approaches
• Proteomics data is combined with genomics and/or transcriptomics
information, typically by using sequence databases generated from DNA
sequencing efforts, RNA-Seq experiments, Ribo-Seq approaches, and long-
non-coding RNAs.
Nesvizhskii, Nat Methods, 2014

Juan A. Vizcaíno
juan@ebi.ac.uk
MS proteomics: Proteogenomics
in vivo in silico
DNA, RNASeq,
RiboSeq
Proteogenomics

Juan A. Vizcaíno
juan@ebi.ac.uk
Examples of repurposing datasets: proteogenomics
Data in public resources can be used for genome annotation purposes ->
Discovery of short ORFs, translated lncRNAs, etc

Juan A. Vizcaíno
juan@ebi.ac.uk
Examples of repurposing datasets: proteogenomics
Also some studies have been performed in model organisms: mouse, rat,
Drosophila, and other microorganisms (Mycobacterium tuberculosis,
Helicobacter pylori)

Juan A. Vizcaíno
juan@ebi.ac.uk
MS proteomics: ProteoGenomics
in vivo in silico
Personal genomes
Personal
proteomes
Personalised medicine

Juan A. Vizcaíno
juan@ebi.ac.uk
Public datasets from different omics: OmicsDI
http://www.omicsdi.org/
• Aims to integrate of ‘omics’ datasets (proteomics,
transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
Perez-Riverol et al., Nat Biotechnol, 2017

Juan A. Vizcaíno
juan@ebi.ac.uk
OmicsDI: Portal for omics datasets

Juan A. Vizcaíno
juan@ebi.ac.uk
The “dark” proteome
Sequence-based
search engines

Juan A. Vizcaíno
juan@ebi.ac.uk
The “dark” proteome
• Only ~25-30% of spectra in a typical proteomics experiments
are identified.
• What does that fraction of unidentified spectra correspond to?
• For sure, there will be artefacts (e.g. chimeric spectra).
• Undetected protein variants:
• What it is not included in the searched database cannot be found.
• Peptide containing unexpected Post-Translational Modifications
(PTMs).
• Big potential to find novel biological relevant “proteoforms”.

Juan A. Vizcaíno
juan@ebi.ac.uk
Concept of “proteoform”
Could any of these “undetected” proteoforms have an important biological
function?
Smith et al., Nat Methods, 2013

Juan A. Vizcaíno
juan@ebi.ac.uk
Introduction to Spectrum Clustering
spectra-cluster algorithm
Unidentified
spectrum
Spectrum identified
as peptide A
Spectrum identified
as peptide B
Consensus
spectra
(= data reduction)
Input Mass
Spectra

Juan A. Vizcaíno
juan@ebi.ac.uk
PRIDE Cluster: Second Implementation
• Clustered all public, identified
spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two
calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
• Griss et al., Nat. Methods, 2013
• Clustered all public spectra in
PRIDE by summer 2015.
• Apache Hadoop.
• Starting with 256 M spectra.
• 190 M unidentified spectra (they
were filtered to 111 M for spectra
that are likely to represent a
peptide).
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on 30 node
Hadoop cluster, 340 CPU cores
• Griss et al., Nat. Methods, 2016

Juan A. Vizcaíno
juan@ebi.ac.uk
Consistently unidentified clusters
Not identified
Not identified
Not identified
Not identified
Consensus spectrum
Not identified
Not identified
Originally submitted spectra
Spectrum
clustering
Method to target
recurrent unidentified
spectra
??

Juan A. Vizcaíno
juan@ebi.ac.uk
Consistently unidentified clusters (Recurring Unidentified Spectra)
• 19 M clusters contain only unidentified spectra.
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides -> Potentially relevant
for biology.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.

Juan A. Vizcaíno
juan@ebi.ac.uk
Consistently unidentified clusters-> Open modification searches

Juan A. Vizcaíno
juan@ebi.ac.uk
Meta-analysis: Protein-protein interactions/associations
Drew et al., Mol Systems Biol, 2017 Gupta et al., NAR, 2017, in press

Juan A. Vizcaíno
juan@ebi.ac.uk
Recent examples of meta-analysis studies (2)
Lund-Johanssen et al., Nat Methods, 2016

Juan A. Vizcaíno
juan@ebi.ac.uk
Further reading…
Martens & Vizcaíno, Trends Bioch Sci, 2017 Vaudel et al., Proteomics, 2016

Juan A. Vizcaíno
juan@ebi.ac.uk
Infrastructure as a Service:
• Compute power not necessarily local but remote
• Still from compute centres, but on a larger scale
The ‘service’ is:
• Customer gets infrastructure, but it’s virtualized
• This Abstraction yields better
utilisation and scalability (but...)
• Developer/Customer has to interface with these abstraction layers
What’s a cloud environment?

Juan A. Vizcaíno
juan@ebi.ac.uk
Purpose:
• Develop robust, fully reproducible and automated proteomics
analysis pipelines (DDA ‘shot-gun’ proteomics workflows).
• Deployed in a cloud environment and reusable by the scientific
community.
Started in February (1 year), ELIXIR Implementation study:
• EMBL-EBI (Proteomics & Cloud infrastructure Teams)
• ELIXIR-DE (Kohlbacher EKUT , Eisenacher RUB)
ELIXIR Implementation Study

Juan A. Vizcaíno
juan@ebi.ac.uk
Develop exemplary proteomics data analysis workflows and deploy
them in the EMBL-EBI "Embassy Cloud”:
(1) Standard identification workflow
(2) Identification workflow for PTMs
(3) Quantification (label-free/label-based approaches)
(4) Quality Control (to aid data set interpretation/reanalysis evaluation)
(5) Versions of quantification approaches (including PTMs)
è Connected to public proteomics data from
ELIXIR Implementation Study

Juan A. Vizcaíno
juan@ebi.ac.uk
We are using the framework
Features:
• Tool modularisation
• Solutions for data handover between tools with standardised
(PSI) formats
• Adapters for integrating third-party software (search engines,
LuciPHOr, FIDO, percolator, etc.)
• Integration into various workflow systems as a basis
Analysis software used

Juan A. Vizcaíno
juan@ebi.ac.uk
Cloud deployment developments at EBI
Cloud workflow in metabolomics:
Enabling flexible workflow implementation
è Kubernetes Galaxy runner development
Kubernetes Framework (k8s):
• Controls the virtual cloud infrastructure
• Manages request and provision of resources (VMs)

Juan A. Vizcaíno
juan@ebi.ac.uk
Summary figure

Juan A. Vizcaíno
juan@ebi.ac.uk
Open analysis pipelines -> In the near future…
• Recent 3-year BBSRC grant awarded to do the same for DIA
approaches (to start on December 2017).
• In collaboration with the Stoller Center (Manchester) (co-PIs Graham,
Hubbard & Townsend)
• Recent 4-year Wellcome Trust grant awarded to do (among other
things) pipelines for proteogenomics approaches (to start mid 2018).
• In collaboration with J. Choudhary (Institute of Cancer Research, London)

Juan A. Vizcaíno
juan@ebi.ac.uk
Purpose
Data dissemination intro popular
bioinformatics resources

Juan A. Vizcaíno
juan@ebi.ac.uk
Main challenges (1)
• Lack of information/annotation about the experimental
design :
• Need for manual curation unavoidable at present
• But wouldn’t it be needed anyway?
• Crowd-sourcing infrastructure?
• Computational infrastructure challenges:
• Datasets get larger and larger…
• Analysing many datasets together is challenging (in
terms of the IT infrastructure)
• Windows OS dependency (things are improving in this
context though)

Juan A. Vizcaíno
juan@ebi.ac.uk
Main challenges (2)
• Computational analysis challenges:
• Need to improve current computational analysis
methods for big datasets (e.g. FDR calculation)
• Size of DBs is very large (e.g. in proteogenomics)
• Open modification searches
• “Controlled access” for clinical proteomics data:
• Not enforced at present, but it is common practise not
to share these datasets just in case…

Juan A. Vizcaíno
juan@ebi.ac.uk
Main challenges (3): Across-omics analysis
• It can be very challenging to find “matching” datasets (e.g.
genomics, proteomics, metabolomics).
• Better connection between resources from different omics
approaches is needed.

Juan A. Vizcaíno
juan@ebi.ac.uk
Summary
• Public proteomics datasets are on the rise! (~250 new
datasets/month in PRIDE).
• A lot of possibilities open for reuse of this data:
• New purposes: proteogenomics approaches.
• Discovery of new proteoforms (e.g. new PTMs, new variants).
• Starting to work in open and reproducible analysis pipelines.
• ELIXIR Implementation Study (DDA data, collaboration EMBL-EBI &
ELIXIR-DE)
• In the near future: DIA and proteogenomics approaches.

Juan A. Vizcaíno
juan@ebi.ac.uk
Aknowledgements: People
Attila Csordas
Tobias Ternent
Mathias Walzer
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members, especially
Rui Wang, Florian Reisinger, Noemi
del Toro, Jose A. Dianes & Henning
Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!@pride_ebi
@proteomexchange
Pablo Moreno
EBI Cloud infrastructure team

Juan A. Vizcaíno
juan@ebi.ac.uk

A proteomics data “gold mine” at your disposal: Now that the data is there, what can we do with it?

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a A proteomics data “gold mine” at your disposal: Now that the data is there, what can we do with it?

Similar a A proteomics data “gold mine” at your disposal: Now that the data is there, what can we do with it? (20)

Más de Juan Antonio Vizcaino

Más de Juan Antonio Vizcaino (12)

Último

Último (20)

A proteomics data “gold mine” at your disposal: Now that the data is there, what can we do with it?