5. Open Data is Essential for Genomics
@bffo
E-mail: francis@genomequebec.com
6. Open Data is Essential for Genomics
Times I’ve been in Italy
• Trieste 1996: Last Yeast Genome Meeting
• Naples 2005: NETTAB “Workflows management:
new abilities for the biological information overflow”
• Rome 2017: Elixir
• Palermo 2017: NETTAB
7. Open Data is Essential for Genomics
Outline
• What I do
• Open Data in genomics
• Final thoughts
8. Open Data is Essential for Genomics
But first, a little about me …
… an unfinished story!
9. Open Data is Essential for Genomics
https://goo.gl/anu933
10. Open Data is Essential for Genomics
http://goo.gl/dJIur
11. Open Data is Essential for Genomics
http://goo.gl/LwVOZ
12. Open Data is Essential for Genomics
http://goo.gl/QI6aL
13. Open Data is Essential for Genomics
http://goo.gl/mYHFO
14. Open Data is Essential for Genomics
http://goo.gl/Jc5TK
15. Open Data is Essential for Genomics
https://goo.gl/3PFr7L
1993-1997
18. Open Data is Essential for Genomics
from the National Centre for Biotechnology Information
PANIC
37. Open Data is Essential for Genomics
So what unifies all
of what I’ve done?
Helping scientists do science.
38. Open Data is Essential for Genomics
Open Data
https://goo.gl/Z63Wxp
39. Open Data is Essential for Genomics
Genomics
https://goo.gl/MX84KA
40. Open Data is Essential for Genomics
What am I calling “Genomics”?
All “omics”
– DNA and RNA, +Epigenomics
– Proteomics, +Protein Interactions, +Pathways
– Metabolomics
– Bioinformatics/Computational Biology
– All of the related data and metadata
• Phenotype
• Clinical
• Images
– New technologies …
41. Open Data is Essential for Genomics
Biological scope?
• Anything with DNA or RNA or protein
43. Open Data is Essential for Genomics
One example of a
challenge for all of us:
The integration of genomic data
with deep learning and artificial
intelligence
44. Open Data is Essential for Genomics
AI, Big Data, Deep Computing
• Artificial Intelligence / Deep Learning and
the Big Data Hype?
https://goo.gl/WHg36Q
45. Open Data is Essential for Genomics
What do we need for that?
https://goo.gl/JWpXj2
47. Open Data is Essential for Genomics
What else?
• Data has to be FAIR
– TO BE FINDABLE
– TO BE ACCESSIBLE
– TO BE INTEROPERABLE
– TO BE RE-USABLE
• https://www.force11.org/group/fairgroup/fairprinciples
48. Open Data is Essential for Genomics
Big data examples
• Genomic sequences
• Imaging
• Population scale collected wearable data
49. Open Data is Essential for Genomics
Data Center for all in Québec?
• Health Care in Canada is governed
province by province.
• Génome Québec is working with various
ministries to set something that could be
useful/centralized and make genomic data
usable for research (controlled access).
• Needs to include clinical data
50. Open Data is Essential for Genomics
“Building a data centre is
like making pancakes, you
always need to throw
away the 1st one”
Robert Grossman
Frederick H. Rawson Professor and
the Director of the Center for Data
Intensive Science (CDIS) at the
University of Chicago
http://rgrossman.com/
51. Open Data is Essential for Genomics
Sharing all data types,
including clinical data?
https://goo.gl/ofEPeX
52. Open Data is Essential for Genomics
Authors present at the
“Toronto meeting”
https://goo.gl/ofEPeX
53. Open Data is Essential for Genomics
Open data critical to
progress in Science
54. Open Data is Essential for Genomics
One example: GenBank
GenBank sequence
database is an open
access, annotated
collection of all publicly
available nucleotide
sequences and their
protein translations.
55. Open Data is Essential for Genomics
Open data critical to progress in Science
• Without GenBank and other public
sequence databases
– There would be no BLAST
– There would be no diagnostic DNA testing
– There would be no understanding of the
human genome (there probably would not
have been a human genome to work on in the
first place).
56. Open Data is Essential for Genomics
Adapted from Niko Beerenwinkel, Chris D. Greenman, Jens Lagergren,
“Computational Cancer Biology: An Evolutionary Perspective”
Published: February 4, 2016. https://doi.org/10.1371/journal.pcbi.1004717
ICGC PCAWG Docker Testing
57. Open Data is Essential for Genomics
Cancer is a Disease
of the Genome
Challenge in Treating Cancer:
Every tumour is different
Every cancer patient is different
Adapted from Tom Hudson
https://www.cancer.gov/research/areas/genomics
58. Open Data is Essential for Genomics
Analysis Data Types
• Simple Somatic Mutations (SSM or SNV)
• Copy Number Alterations (CNA or CNV)
• Structural Variants (SV)
• Germline variants (SNPs)
• Gene Expression (micro-arrays and RNASeq)
• miRNA Expression (RNASeq)
• Epigenomics (Arrays and Methylation)
• Splicing Variation (RNASeq)
• Protein Expression (Arrays)
59. Open Data is Essential for Genomics
International Cancer Genome Consortium
• Collect ~500 tumour/normal pairs from each of 50 different major
cancer types; 25,000 T/N pairs!
• Comprehensive genome analysis of each T/N pair:
– Genome
– Transcriptome
– Methylome
– Clinical data
• Make the data available to the research community & public.
Identify genome changes:
…GATTATTCCAGGTAT…
…GATTATTGCAGGTAT…
…GATTATTGCAGGTAT…
Adapted from Tom Hudson
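As a rough illustration of what "identify genome changes" means at the smallest scale, the sketch below compares the two distinct sequence fragments shown on the slide base by base. The function name `find_mismatches` is hypothetical, and the slide does not say which fragment is the normal sample and which is the tumour, so the code only reports where they differ. Real variant calling works on aligned sequencing reads and is far more involved.

```python
def find_mismatches(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, base in seq_a, base in seq_b) for each difference."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this simple comparison needs sequences of equal length")
    return [
        (i, base_a, base_b)
        for i, (base_a, base_b) in enumerate(zip(seq_a, seq_b))
        if base_a != base_b
    ]


# The two distinct fragments from the slide, with the ellipses dropped.
fragment_a = "GATTATTCCAGGTAT"
fragment_b = "GATTATTGCAGGTAT"

mismatches = find_mismatches(fragment_a, fragment_b)
print(mismatches)  # [(7, 'C', 'G')]
```

The single difference (C versus G at 0-based position 7) is the kind of change that a simple somatic mutation call would record.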
61. Open Data is Essential for Genomics
International Cancer Genome Consortium: http://icgc.org
62. Open Data is Essential for Genomics
ICGC needs to deal with different
kinds of users!
• Biologists/Clinicians:
– Web interface to processed data, providing:
• Affected gene lists with consequences
• Impact on pathways
• Power users:
– Application Programming Interface (API) to get
to data
– Availability and Integration with cloud
resources
63. Open Data is Essential for Genomics
ICGC Data Coordinating Centre:
dcc.icgc.org
64. Open Data is Essential for Genomics
https://dcc.icgc.org/
65. Open Data is Essential for Genomics
https://dcc.icgc.org/icgc-in-the-cloud
66. Open Data is Essential for Genomics
http://www.cancercollaboratory.org/
67. Open Data is Essential for Genomics
Some challenges:
• So, we have lots of data. Was
it all generated the same way?
68. Open Data is Essential for Genomics
Every country/group has basically
been submitting:
– Simple Somatic Mutations (SSM or SNV)
– Copy Number Alterations (CNA or CNV)
– Structural Variants (SV)
– Germline variants (SNPs)
– Gene Expression (micro-arrays and RNASeq)
– miRNA Expression (RNASeq)
– Epigenomics (Arrays and Methylation)
– Splicing Variation (RNASeq)
– Protein Expression (Arrays)
69. Open Data is Essential for Genomics
Are they all using the same
pipelines?
• No
71. Open Data is Essential for Genomics
Steering Committee of PCAWG
• Peter Campbell, Sanger Inst.
• Gady Getz, Broad
• Jan Korbel, EMBL
• Lincoln Stein, OICR
• Josh Stuart, UCSC
72. Open Data is Essential for Genomics
PanCancer Analysis of Whole
Genomes (PCAWG)
• > 2,800 T/N pairs with clinical data and whole
genome analysis, from 20 tumour types.
• Aligned with one standard pipeline.
• Genomic Variants determined with 3 pipelines
• 17 working groups
• > 50 papers are being
written now.
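The slide says variants were determined with 3 pipelines but not how the results were combined. One common way to merge call sets is majority voting, sketched below under that assumption. The function `consensus_calls`, the `min_support` threshold and the toy variants are all hypothetical and are not taken from PCAWG.

```python
from collections import Counter

# A variant is represented here as (chromosome, position, ref base, alt base).
Variant = tuple[str, int, str, str]


def consensus_calls(call_sets: list[set[Variant]], min_support: int = 2) -> set[Variant]:
    """Keep variants reported by at least `min_support` of the pipelines."""
    counts = Counter(variant for calls in call_sets for variant in calls)
    return {variant for variant, n in counts.items() if n >= min_support}


# Toy call sets standing in for the output of three variant-calling pipelines.
pipeline_a = {("chr1", 100, "C", "G"), ("chr2", 500, "A", "T")}
pipeline_b = {("chr1", 100, "C", "G"), ("chr3", 42, "G", "A")}
pipeline_c = {("chr1", 100, "C", "G"), ("chr2", 500, "A", "T")}

consensus = consensus_calls([pipeline_a, pipeline_b, pipeline_c])
print(sorted(consensus))
```

With `min_support=2`, the variant seen by only one pipeline (`chr3:42`) is dropped, while the two seen by at least two pipelines are kept. Raising the threshold trades sensitivity for precision.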
73. Open Data is Essential for Genomics
https://www.biorxiv.org/search/pcawg
74. Open Data is Essential for Genomics
Deliverables for PCAWG include:
• 1st PANCANCER analysis on > 2,800
cancer tumours from a WGS perspective
• RNA, SSM, CNV, Methylation analysis &
germline
• Published (executable) pipelines
– Docker / Dockstore
– Multiple cloud access to data
– Multiple portal access to data
75. Open Data is Essential for Genomics
https://dcc.icgc.org/pcawg
76. Open Data is Essential for Genomics
Working Groups (1/2)
1. Novel somatic mutation calling methods
2. Analysis of mutations in regulatory regions
3. Integration of transcriptome and genome
4. Integration of epigenome and genome
5. Consequences of somatic mutations on pathway
and network activity
6. Patterns of structural variations, signatures,
genomic correlations, retrotransposons, mobile
elements
7. Mutation signatures and processes
8. Germline cancer genome
77. Open Data is Essential for Genomics
Working Groups (2/2)
9. Inferring driver mutations and identifying cancer
genes and pathways
10. Translating cancer genomes to the clinic
11. Evolution and heterogeneity
12. Exploratory: portals, visualization and software
infrastructure
13. Molecular subtypes and classification
14. Analysis of mutations in non-coding RNA
15. Exploratory: mitochondrial
16. Exploratory: pathogens
17. Technical working group
78. Open Data is Essential for Genomics
https://goo.gl/AMxwSU
82. Open Data is Essential for Genomics
http://dockstore.org
83. Open Data is Essential for Genomics
Docker Testing Group
• Group that ensures all container
workflows work as expected.
https://goo.gl/AMxwSU
84. Open Data is Essential for Genomics
Access to Data?
• Human Data
• Patients consented to have their DNA
looked at so people could understand
cancer
• Need to have a system to maximize
people’s gift to science.
86. Open Data is Essential for Genomics
Identify
yourself
Fill out a detailed form which
includes:
• Contact and Project
Information
• Information Technology
details and procedures
for keeping data secure
• Data Access Agreement
All of these documents are put into a PDF file
that you print and get your institution to sign
off on your behalf.
89. Open Data is Essential for Genomics
https://icgc.org/daco/approved-projects
314 groups
90. Open Data is Essential for Genomics
[Diagram of data access routes. Labels: DACO, ICGC, dbGaP, GDC, EGA, TCGA, ERA, BAM, Open, EGA id & password, WGS, Germline]
91. Open Data is Essential for Genomics
Challenge:
• Open Data and controlled access data
• Not enough eyeballs on the data
• Eyeballs on the data needed to make
discoveries.
https://goo.gl/ogbWXG
92. Open Data is Essential for Genomics
Culture of Sharing Openly
• Public Funding agencies
• Consortiums
• Mentors
• Peers
• New generation (vs my old generation)
• Has to become the norm
93. Open Data is Essential for Genomics
Final thoughts …
• Access to data is essential for science
• Getting data that is FAIR is hard work
• It is essential to share the work you do if
you want to be recognized, get tenure, get
a job or a promotion.
• Human data is more complicated, but
don’t let that get in the way!
• There is a lot of material out there, learn
from it (& cite your sources)!
95. Open Data is Essential for Genomics
Last message to students and
young PDFs and investigators:
Be open so people
can see how great
you are!
97. Open Data is Essential for Genomics
DCC Software
Developer
Vincent Ferretti
Dusan Andric
Phuong-My Do
Francois Gerthoffert
Terry Lin
Michael Moncada
Vitalii Slobodianyk
Bob Tiernay
Douglas Wong
Linda Xiang
Junjun Zhang
Acknowledgments
ICGC/OICR
Project leaders:
Tom Hudson
John McPherson
Lincoln Stein
Jared Simpson
Paul Boutros
Vincent Ferretti
Francis Ouellette
Jennifer Jennings
Ouellette Lab
Alysha Moncrieffe
Ann Meyer
Zhibin Lu
Web Dev
Joseph Yamada
Kaman Wu
Kim Cullion
Koji Miyauchi
Miyuki Fukuma
ICGC DCC Biocuration
Hardeep Nahal
Marc Perry
http://oicr.on.ca http://icgc.org
… and all the patients and their
families that are putting
their hopes into our work!
Research
IT/Systems
David Sutton
Bob Gibson
David Magda
Rob Naccarato
Brian Ott
Gino Yearwood
EGA
Jordi Rambla De Argila
Arcadi Navarro
Audald Iloret
Mauricio Moldes
98. Scientific Affairs Team (Équipe des affaires scientifiques)
March 27, 2017
B.F. Francis Ouellette
Annina Spilker
Joël Savard
Diana Iglesias
Diane Bouchard
Cristina Ciurli
Micheline Ayoub
Hélène Fournier