CDAC 2018 Merico making sense of cancer somatic snv

Making sense of cancer
somatic SNVs and indels:
from variant effects to pathways
Thu 24 May
Daniele Merico, PhD
Director of Molecular Genetics, Deep Genomics Inc.
Visiting Scientist, The Hospital for Sick Children
(Toronto, Canada)

Outline
1. Functional interpretation of somatic variants: overview [5 min]
2. From variants to genes [10 min]
   2.1. Variant gene product effect
   2.2. Missense impact prediction and beyond
   2.3. Genes with significant somatic burden
3. From genes to functions, pathways & networks [40 min]
   3.1. Gene-set analysis [30 min]
      3.1.1. Overview
      3.1.2. Gene-set types, Gene Ontology & pathway resources
      3.1.3. Gene-set results visualization: Cytoscape Enrichment Map
      3.1.4. Types of gene-set analysis tests
      3.1.5. Competitive tests: GSEA for gene expression data
      3.1.6. Self-contained tests: gene-set somatic burden
      3.1.7. General tips
   3.2. Network analysis [10 min]
      3.2.1. Network visualization and gene network types
      3.2.2. GeneMANIA
      3.2.3. Reactome FI
4. Q&A [5 min]

1. Functional Interpretation of
Somatic Variants: Overview

Criteria to Interpret Somatic Variants
• What’s the effect on the gene product?
• Stop-gain, frameshift, splice site alteration, missense, splicing consensus sequence, synonymous, 5’UTR, 3’UTR,
intronic, upstream, downstream, ncRNA exon, ncRNA intron
• Truncating loss-of-function, missense (loss-of-function or gain-of-function?)
• Is a missense variant recurrent, or overlapping a known mutation hotspot?
• Is a missense variant predicted damaging by impact predictors?
• Is the gene an established oncogene or tumour suppressor?
• Is the gene significantly mutated → could act as a novel oncogene or tumour suppressor?
• Otherwise, is the gene under negative selection for truncating loss-of-function or missense variants?
• Does the gene belong to a pathway or subnetwork with other cancer driver genes or enriched in
somatic mutations → could act as a novel oncogene or tumour suppressor?

[Figure: decision tree for cancer somatic mutation data.
Established cancer genes: tumour suppressor (truncating LOF; missense LOF?) or oncogene (missense GOF?).
Novel cancer genes: significant burden (or genetic constraint) of truncating LOF or missense variants; gene-set and network analysis.
Missense evidence: established hotspot, recurrent, impact prediction.]

2.1. Variant gene product effect

SNV and Indel Variant Annotations
• Variant database mapping
• Germline allele frequencies, dbSNP
• COSMIC (somatic variant database)
• Gene mapping
• Gene product effect type
• Stop-gain, frameshift, splice site alteration, missense, splicing consensus sequence, synonymous,
5’UTR, 3’UTR, intronic, upstream, downstream, ncRNA exon, ncRNA intron
• Stop-gain, frameshift, splice site alteration → expected to cause complete loss-of-function (LOF)
• Missense, other → can act as gain-of-function
• Missense impact prediction
• SIFT, PolyPhen2, MutationAssessor, …
• Other impact predictions
• Splicing (e.g. MaxEntScan, dbscSNV, SPIDEX, …)
• Genomic conservation (e.g. phyloP, PhastCons, …)
• Omnibus meta-predictors (CADD, Eigen, …)

2.2. Missense impact prediction
and beyond

SIFT
• Broadly used, relatively old (2001)
• Based solely on protein sequence (amino acid) conservation
1. Start from query protein sequence
2. Identify similar protein sequences (PSI-BLAST)
3. Multiple alignment of protein sequences (orthologs and paralogs)
4. Amino acid x residue probability matrix (PSSM)
5. For every residue, amino acid probability reweighted by amino acid diversity at the position (sum of
frequency rank * frequency)
→ Score: probability of observing amino acid normalized by residue conservation
cut-off: 0.05 (based on case studies)
Predicting deleterious amino acid substitutions.
Ng PC, Henikoff S. Genome Res. 2001 May;11(5):863-74.

PolyPhen2
• Integrates multiple features
• 8 sequence-based, 3 structure-based (nucleotide and amino acid level)
(e.g. side chain volume change, overlap with PFAM domain, multiple alignment metrics)
• Supervised machine learning method (Naïve Bayes) → Requires training set
• Set 1: HumDiv
• Positive: damaging alleles for known Mendelian disorders (Uniprot)
• Negative: nondamaging differences between human proteins and related mammalian homologs
• Performance in 5-fold cross-validation: (TP ~ 80%, FP ~ 10%), (TP ~ 90%, FP ~ 20%)
• Set 2: HumVar
• Positive: all human disease-causing mutations (Uniprot)
• Negative: non-synonymous SNPs without disease association
→ Richer model than SIFT
→ More biased towards training set(s) than SIFT
A method and server for predicting damaging missense mutations.
Adzhubei IA, Schmidt S, Peshkin L, […], Bork P, Kondrashov AS, Sunyaev SR. Nat Methods. 2010 Apr;7(4):248-9.

CADD
• Intended as a measure of “deleteriousness” for coding and non-coding sequence,
not biased to known disease variation
• However, not particularly effective for non-coding regulatory sequence (see lecture)
• Supervised machine learning model (Linear SVM)
• Negative training set: nearly fixed human alleles, variant if compared to inferred human-
chimp ancestral genome
• Positive training set: simulated variants based on mutation model aware of sequence context
and primate substitution rates
• Predictive features (63): VEP (Variant Effect Predictor) output, UCSC tracks, ENCODE tracks →
includes missense predictions and nucleotide-level conservation
• Performance assessment: using pathogenic variants from ClinVar, performs a bit better than
PhyloP for all sites and than PolyPhen/SIFT for missense coding variants
A general framework for estimating the relative pathogenicity of human genetic variants.
Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. Nat Genet. 2014 Mar;46(3):310-5.

Example of Mutation Hotspots
L858R
G12D, V, C, A, S, R
G13D
EGFR
KRAS

2.3. Genes with significant
somatic burden

MutSigCV
• Goal: identify significantly mutated genes
→ Important to model the mutational background
• Tumour-specific global mutation rate
• Trinucleotide context and substitution
• Expression level (impacting transcription-coupled repair)
• Replication timing (later-replicating regions have higher mutation rates)
• Residual local genomic region mutation rate
Lawrence MS, ..., Getz G. Mutational heterogeneity in cancer and the
search for new cancer-associated genes. Nature 2013. PMID: 23770567

3. From genes to functions,
pathways & networks

Activity Maps
[Figure: activity profiles / somatic mutations are combined with prior knowledge about genes, represented as gene sets, networks, or pathways (examples shown: Spindle, Apoptosis, Ca++ Channels, MAPK, with member genes Gene.A to Gene.N), using scoring models and search algorithms (informatics).]

3.1.1. Gene-set analysis overview

Gene-set Analysis Overview
[Figure: the experiment yields experimentally “positive” genes (e.g. UP-regulated) and experimentally “detectable” genes (aka background set); these and the gene-set databases are the inputs of the enrichment test, whose output is an enrichment table of gene-sets with p-values (e.g. Spindle 0.00001, Apoptosis 0.00025).]

Gene-sets for Gene-set Analysis
[Figure: from cell biology to gene-sets. Cellular structures and processes (Nuclear Pore, Cell Cycle, Ribosome, P53 signaling) are stored in gene-set databases as lists of member genes (Gene.AAA, Gene.ABA, Gene.ABC; Gene.CC1 to Gene.CC5; Gene.RP1 to Gene.RP4; Gene.CC1, Gene.CK1, Gene.PPP).]

Gene-set Analysis: Overview
[Figure: gene-sets (e.g. FADD, TRADD, CYTC1, BAX, BAXL, CASP9, CASP10, …; SPP1, SPP2, CCCP, MTC1, …) and experimental data (e.g. gene expression table) are combined into an enrichment table (Spindle 0.00001, Apoptosis 0.00025).]

Gene-set Enrichment Test
• The output of an enrichment test is a P-value
• The P-value assesses the probability that, by random sampling of the “detectable” genes,
the overlap is at least as large as observed
• Most used statistical model: Fisher’s Exact Test
• Fisher’s Exact Test does not require actually performing the random sampling; it is based on
a theoretical null-hypothesis distribution (hypergeometric distribution)
[Figure: random samples of array genes]

Fisher’s Exact Test (FET)
Fisher’s Exact Test: 2 x 2 contingency table

                Exp_positive=yes   Exp_positive=no
Gene-Set=yes           a                  b
Gene-Set=no            c                  d

Probability of one table occurring by random sampling (hypergeometric distribution formula):
P(table) = C(a+b, a) x C(c+d, c) / C(a+b+c+d, a+c)
where C(n, k) is the number of ways to choose k items out of n
Test p-value: sum of random sampling probabilities for tables
as extreme or more extreme than the real table
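
As a minimal sketch of the test above, the one-sided (over-representation) p-value can be computed directly from the hypergeometric distribution with the Python standard library. The function name `fisher_exact_enrichment` and the example counts are illustrative assumptions, not taken from the slides; the cell labels a, b, c, d follow the 2 x 2 contingency table.

```python
from math import comb


def fisher_exact_enrichment(a, b, c, d):
    """One-sided Fisher's Exact Test p-value for over-representation.

    a: in gene-set and experimentally positive
    b: in gene-set, not positive
    c: not in gene-set, positive
    d: not in gene-set, not positive
    """
    n = a + b + c + d        # all detectable (background) genes
    set_size = a + b         # gene-set genes within the background
    n_positive = a + c       # experimentally positive genes
    total = comb(n, n_positive)
    max_overlap = min(set_size, n_positive)
    p_value = 0.0
    # Sum the probabilities of all tables with an overlap >= the observed one.
    for k in range(a, max_overlap + 1):
        p_value += comb(set_size, k) * comb(n - set_size, n_positive - k) / total
    return p_value


# Hypothetical example: 1,000 detectable genes, 100 of them positive,
# a gene-set of 40 detectable genes, 12 of which are positive.
example_p = fisher_exact_enrichment(a=12, b=28, c=88, d=872)
```

Only tables with a larger or equal overlap are summed here, which matches the "at least as large as observed" definition; a two-sided test would need a different definition of "more extreme".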

The Background is Important!
• Inappropriate modeling of the background will lead to
biased results
– What genes are detectable by the experiment? E.g.: in a kinase
phosphorylation assay, only kinases can be detected
– Fisher’s Exact Test, GSEA and other tests assume all genes have
the same “prior” probability of being experimentally positive →
they can be used only in the absence of systematic selection biases
(example of bias: if you select genes with at least one mutation,
then longer genes are systematically more likely to be selected)
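
The gene-length bias can be illustrated with a small calculation. This is a sketch under stated assumptions that are not in the slides: mutations are modelled as a Poisson process with a uniform per-base rate, and the function name and the numbers are hypothetical.

```python
import math


def prob_at_least_one_mutation(gene_length_bp, rate_per_bp, n_tumours):
    """Probability that a gene carries >= 1 mutation across a cohort,
    assuming a uniform per-base, per-tumour mutation rate (Poisson model)."""
    expected_mutations = gene_length_bp * rate_per_bp * n_tumours
    return 1.0 - math.exp(-expected_mutations)


# Hypothetical rate of 1e-6 mutations per base per tumour, 100 tumours.
short_gene = prob_at_least_one_mutation(1_000, 1e-6, 100)
long_gene = prob_at_least_one_mutation(10_000, 1e-6, 100)
```

Under this model the longer gene is selected more often purely because of its length, so a test that assumes equal prior probability for all genes would be biased.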

Gene-set Enrichment Analysis:
Multiple Test Correction by BH-FDR
• FDR (false discovery rate) is the expected proportion of tests passing the
significance threshold due to random sampling
• Benjamini-Hochberg (BH) FDR:
for a given FDR q-value threshold alpha (e.g. 25%),
for m total tests (e.g. 1,000 gene-sets),
find the largest rank k (tests sorted by ascending P-value), so that:
P-value (k) <= k / m * alpha
so alpha >= P-value (k) * m / k
(e.g. 0.0125 * 1,000 / 50 <= 0.25)
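
The BH procedure can be sketched in a few lines. The function name `bh_qvalues` and the toy P-values are illustrative assumptions; the q-value of a test is the smallest value of P-value (k) * m / k over its own rank and all less significant ranks, which makes the q-values monotonic.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P-value
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the least to the most significant test, keeping the running
    # minimum of P-value * m / rank so that q-values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Toy example with four tests.
toy_q = bh_qvalues([0.01, 0.04, 0.03, 0.005])
```

A gene-set is then called significant when its q-value is below the chosen FDR threshold alpha.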

Gene-set Enrichment Analysis:
Multiple Test Correction by BH-FDR

Rank  Category                     P-value * m / k          FDR q-value
1     Transcriptional regulation   0.001 x 53/1 = 0.053     0.041
2     Transcription factor         0.002 x 53/2 = 0.053     0.041
3     Initiation of transcription  0.003 x 53/3 = 0.053     0.041
4     Nuclear localization         0.0031 x 53/4 = 0.041    0.041
5     Chromatin modification       0.005 x 53/5 = 0.053     0.053
…     …                            …                        …
52    Cytoplasmic localization     0.985 x 53/52 = 1.004    0.99
53    Translation                  0.99 x 53/53 = 0.99      0.99

In other words: (1) walk the list of tests from the most significant, (2) estimate how many
tests would pass at each P-value if they were random draws, (3) compute the fraction of
false positives, and transform it to a monotonic q-value with 0 <= q-value <= 1
P-value threshold for FDR < 0.05 (P-value column; red: non-significant, green: significant at FDR < 0.05):
0.001, 0.002, 0.003, 0.0031, 0.005, …, 0.97, 0.99

3.1.2. Gene-set types, Gene Ontology &
pathway resources

Gene-set Types
• Functions (e.g. Gene Ontology)
• Pathways (e.g. KEGG, Reactome)
• Genotype-phenotype/disease association (e.g. HPO)
• Protein Families / Domains (e.g. PFAM)
• Genomic position (e.g. cytobands)
• Gene expression signatures (e.g. MSigDB Cancer Hallmarks)
• Up/down after treatment or in relation to disease
• Targets of regulators
• Transcription factor targets
• miRNA targets
• Network-derived modules, e.g. protein-protein interactions
• Drug targets

Gene Ontology (GO) / 1
• Effort to standardize functional description of eukaryotic gene products
• Launched in 1998
• Many organism species supported
• Normal function (e.g. cell cycle), not disorder / disease (e.g. metastasis formation)
• Ontology defined by core team of curators who receive input from domain experts
• Corpus of gene annotations based on expert curation of the literature (> 140,000 published
papers in 2018), review of high-throughput data, or annotations in existing databases;
performed by curators at specific organism genome databases (human: UniProtKB)

Gene Ontology (GO) / 2
• Ontology, intended as controlled structured vocabulary
• Terms = functional concepts (e.g. cell cycle, proteasome)
• Three main ontologies: molecular function (i.e. biochemical activity), cellular component,
biological process (pathways and other processes)
• Relations between terms: is-a, part-of / has-part, regulates, occurs-in
à DAG (directed acyclic graph), supports logical inference
• Most of the relations are within each main ontology, ongoing effort to link processes and
molecular functions to components using occurs-in

Biological Process: DNA repair
Cellular Component: Replication fork
Molecular Function: Single-strand break-containing DNA binding

[Figure: PARENT and CHILD gene-sets with example member genes ABB1, ACAP3, TRAC1, LUC2, POF5, ZUMM, C5A75, DUCZ]

Pathways
• Depict mechanistic details of metabolic, signaling and other biological processes
• Can be computationally exported as complex graph, but often just analyzed as gene-sets
• Advantages:
• Curated, accurate, cause and effect captured
• Human-interpretable visualizations
• Disadvantages:
• More sparse coverage of genome than functional sets
• More complex models are required to score pathways
• Static model of dynamic systems
• Main resources: KEGG, Reactome

Reactome
Cell cycle,
G1/S transition

Resources to Download Gene-sets
BaderLab (University of Toronto)
http://baderlab.org/GeneSets
• Gene Ontology; Reactome, Panther, NetPath, NCI, MSigDB C2 (Biocarta, ...), HumanCyc pathways; MSigDB cancer
hallmarks; MSigDB C3 (miRNA and TF targets)
• updated on a monthly basis
MSigDB (Broad Institute)
https://software.broadinstitute.org/gsea/msigdb/
• Gene Ontology; KEGG, Reactome, Biocarta, other pathways; cancer hallmarks; expression signatures; miRNA and TF
targets; interaction modules; Cytobands (positional)
• last update Oct 2017, several gene-set collections are derived from old research works (2004-2005)
Bioconductor org.Hs.eg.db
http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html
• Gene Ontology; KEGG pathways; PFAM (protein domains); Cytobands (positional)
• updated every 4 months
Notes:
• KEGG stopped being freely available in 2011, so freely-available resources have largely outdated gene-sets
• Carefully check how GO annotations are exported (e.g. all evidence codes, or excluding IEA)

3.1.3. Gene-set results visualization:
Cytoscape Enrichment Map

GO.id GO.name p.value cover cover.rat Deg.mdn Deg.iqr
GO:0042330 taxis 2.18E-06 23 0.056930693 54.94499375 9.139238998
GO:0006935 chemotaxis 2.18E-06 23 0.060209424 54.94499375 9.139238998
GO:0002460 adaptive immune response based on somatic recombination 7.10E-05 25 0.111111111 57.32306955 16.97054864
GO:0002250 adaptive immune response 7.10E-05 25 0.111111111 57.32306955 16.97054864
GO:0002443 leukocyte mediated immunity 0.000419328 23 0.097046414 58.27890582 15.58333739
GO:0019724 B cell mediated immunity 0.000683758 20 0.114285714 57.84161096 15.03496347
GO:0030099 myeloid cell differentiation 0.000691589 24 0.089219331 62.22171598 10.35284833
GO:0002252 immune effector process 0.000775626 31 0.090116279 58.27890582 23.86214773
GO:0050764 regulation of phagocytosis 0.000792138 8 0.2 53.54786293 5.742849971
GO:0050766 positive regulation of phagocytosis 0.000792138 8 0.216216216 53.54786293 5.742849971
GO:0002449 lymphocyte mediated immunity 0.00087216 22 0.101851852 57.84161096 16.13171132
GO:0019838 growth factor binding 0.000913285 15 0.068181818 83.0405088 10.58734852
GO:0051258 protein polymerization 0.00108876 17 0.080952381 57.97543252 17.31639968
GO:0005789 endoplasmic reticulum membrane 0.001178198 18 0.036072144 64.02284752 12.05209158
GO:0016064 immunoglobulin mediated immune response 0.001444464 19 0.113095238 58.27890582 15.58333739
GO:0007507 heart development 0.001991562 26 0.052313883 84.02538284 18.60761304
GO:0009617 response to bacterium 0.002552999 10 0.027173913 52.75249873 23.23104637
GO:0030100 regulation of endocytosis 0.002658555 11 0.099099099 56.38041132 16.02486889
GO:0002526 acute inflammatory response 0.002660742 24 0.103004292 57.80098769 24.94311116
GO:0045807 positive regulation of endocytosis 0.002903401 9 0.147540984 54.94499375 6.769909171
GO:0002274 myeloid leukocyte activation 0.002969661 7 0.077777778 54.94499375 16.07042339
GO:0008652 amino acid biosynthetic process 0.003502921 7 0.017241379 45.19797271 31.18248579
GO:0050727 regulation of inflammatory response 0.004999055 7 0.084337349 54.94499375 7.737346076
GO:0002253 activation of immune response 0.00500146 23 0.116161616 60.29679989 18.41103376
GO:0002684 positive regulation of immune system process 0.006581245 27 0.111570248 60.29679989 22.05051447
GO:0050778 positive regulation of immune response 0.006581245 27 0.113924051 60.29679989 22.05051447
GO:0019882 antigen processing and presentation 0.007244488 7 0.029661017 54.94499375 16.58797889
GO:0002682 regulation of immune system process 0.007252134 29 0.099656357 61.05645008 22.65935206
GO:0050776 regulation of immune response 0.007252134 29 0.102112676 61.05645008 22.65935206
GO:0043086 negative regulation of enzyme activity 0.008017022 9 0.040723982 53.28031076 17.48904224
GO:0006909 phagocytosis 0.008106069 10 0.080645161 55.66270253 12.47536747
GO:0002573 myeloid leukocyte differentiation 0.008174948 10 0.092592593 62.86577216 9.401887596
GO:0006959 humoral immune response 0.008396095 16 0.044568245 55.05654091 18.94209565
GO:0046649 lymphocyte activation 0.009044401 29 0.059917355 61.92213317 21.03553355
GO:0030595 leukocyte chemotaxis 0.009707319 7 0.101449275 56.33116709 6.945510559
GO:0006469 negative regulation of protein kinase activity 0.010782155 7 0.046357616 52.22863516 12.58524145
GO:0051348 negative regulation of transferase activity 0.010782155 7 0.04516129 52.22863516 12.58524145
GO:0007179 transforming growth factor beta receptor signaling pathw 0.012630825 13 0.071038251 83.49440788 12.63256309
GO:0005520 insulin-like growth factor binding 0.012950071 9 0.097826087 81.41963394 7.528247832
GO:0042110 T cell activation 0.013410548 20 0.064516129 59.77891783 26.06174863
GO:0002455 humoral immune response mediated by circulating immunogl 0.016780163 10 0.125 54.70766244 14.2572143
GO:0005830 cytosolic ribosome (sensu Eukaryota) 0.016907351 8 0.01843318 61.68933284 7.814673781
GO:0006487 protein amino acid N-linked glycosylation 0.01791078 7 0.044585987 56.50635337 6.780726553
GO:0051240 positive regulation of multicellular organismal process 0.017931228 31 0.096573209 62.2953212 23.86214773
GO:0042379 chemokine receptor binding 0.018849666 12 0.095238095 55.13915015 19.08254406
GO:0008009 chemokine activity 0.018849666 12 0.096774194 55.13915015 19.08254406
GO:0016055 Wnt receptor signaling pathway 0.020088086 18 0.04400978 85.47935979 20.92435897
Need a visualization solution!

Visualization: Cytoscape Enrichment Map
• Visualization framework for gene-set
analysis results
• Cytoscape network: nodes correspond to
gene-sets, edges correspond to gene-set
overlaps (i.e. share a fraction of their genes)
• Intuitive clustering of gene-sets that
converge on the same functional themes
• Determined by automatic network layout
algorithm, based on edge weights
• Overlaps < threshold are pruned, otherwise
network layout would work poorly
• Important: don’t confuse with gene
networks
• Nodes do not represent genes, they represent
gene-sets/pathways
• Edges do not represent physical interactions, they
represent overlaps between gene-sets
[Figure: gene-sets A and B connected by an edge; edges represent gene-set overlap]
Merico D, Isserlin R, Stueker O, Emili A, Bader GD.
Enrichment map: a network-based method for gene-set
enrichment visualization and interpretation.
PLoS One 2010. PMID: 21085593
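
The edge construction can be sketched as follows. The function names, the threshold of 0.5, and the toy gene-sets are assumptions for illustration; the Jaccard and overlap coefficients are two common ways to quantify gene-set overlap.

```python
from itertools import combinations


def jaccard(set1, set2):
    """Shared genes divided by the union of the two gene-sets."""
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


def overlap_coefficient(set1, set2):
    """Shared genes divided by the size of the smaller gene-set."""
    smaller = min(len(set1), len(set2))
    return len(set1 & set2) / smaller if smaller else 0.0


def enrichment_map_edges(gene_sets, threshold=0.5, similarity=overlap_coefficient):
    """Return (gene-set 1, gene-set 2, weight) for overlaps >= threshold."""
    edges = []
    for name1, name2 in combinations(sorted(gene_sets), 2):
        weight = similarity(gene_sets[name1], gene_sets[name2])
        if weight >= threshold:  # overlaps below the threshold are pruned
            edges.append((name1, name2, weight))
    return edges


# Toy gene-sets: nodes are gene-sets, not genes.
toy_sets = {
    "gs1": {"A", "B", "C", "D"},
    "gs2": {"C", "D", "E"},
    "gs3": {"X", "Y"},
}
toy_edges = enrichment_map_edges(toy_sets)
```

The resulting weighted edge list is what a network layout algorithm would consume; gene-sets without any edge above the threshold remain as isolated nodes.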

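Visualization: Cytoscape Enrichment Map
[Figure: five gene-sets (gs1 to gs5) with member genes (ABB1, ACAP3, TRAC1, LUC2, POF5, ZUMM, C5A75, DUCZ, TP53, NTRK1, MAPK3, ANAAT, PIK1, PRKCA, PIRL2, TAZ, CAZ1) and the corresponding enrichment map, where gs1 to gs5 are nodes connected by their gene overlaps]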

Example: Differential expression after estrogen treatment of breast cancer cells, GSEA competitive gene-set analysis

• Using the native Gene Ontology
relations results in a more
disconnected graph

3.1.4. Types of gene-set analysis tests

Competitive vs Self-contained
• Competitive (aka enrichment, aka over-representation) → gene-set genes “compete”
with all other genes (for enrichment)
  cases vs controls → gene test → gene score (e.g. differential expression) → gene-sets
  enriched in score (e.g. gene-sets enriched in up-regulated genes)
• Self-contained → gene-set scored
independently of other genes
  cases vs controls → gene-set test → gene-set score (e.g. significant mutation burden
  difference) → supporting genes
Nam D, Kim SY. Gene-set approach for expression
pattern analysis. Brief Bioinform 2008. PMID:
18202032

Competitive Test Types
• Threshold-dependent: enrichment test on the UP / DOWN gene lists, e.g. FET, g:Profiler *
  • More suitable for significantly mutated genes
• Threshold-independent: test on the full ranked list (from UP to DOWN), e.g. GSEA
  • More suitable for differential gene expression
* g:Profiler also contains a “hybrid” approach that selects
the optimal cutoff for gene-set analysis

3.1.5. Competitive tests:
GSEA for gene expression data

Gene Expression Analysis Workflow
1. Define the experimental design
2. Collect the biological samples
3. Generate the expression data
4. Identify the differential genes
5. Identify the functional groups

GSEA: Gene-Set Enrichment Analysis
• Popular threshold-free gene-set test
• Identifies gene-sets enriched in top- or bottom-ranking genes
• Typically used as a competitive test (see permutation settings), which takes
a ranked gene list as input
• Statistical test: empirical test based on permutations; includes permutation-based
FDR
• The NES (normalized enrichment score) is a particularly valuable measure of
enrichment effect size for visualization

GSEA: Gene-Set Enrichment Analysis
• ES (enrichment score) calculation: high ES <--> high local enrichment
• Distribution of ES from N permutations (e.g. 2000)
• Randomized ES ≥ real ES in 4 / 2000 permutations ==> empirical p-value = 0.002
[Figure: histogram of permuted ES values (x-axis: ES score; y-axis: number of instances) with the real ES score value marked]
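
The running-sum score and the empirical p-value can be sketched as below. This is a simplified illustration, not the GSEA software: it uses gene-set permutations only, handles positive ES (enrichment at the top of the list) only, and computes neither NES nor FDR. Function names, the weight of 1.0, and the toy data are assumptions; the sketch also assumes at least one hit with a non-zero score and at least one miss.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Weighted running-sum ES: step up at gene-set genes, down elsewhere."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_norm = sum(abs(s) ** weight for s, h in zip(scores, hits) if h)
    running, best = 0.0, 0.0
    for score, is_hit in zip(scores, hits):
        if is_hit:
            running += abs(score) ** weight / hit_norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):  # keep the maximum deviation from zero
            best = running
    return best


def empirical_pvalue(ranked_genes, scores, gene_set, n_perm=2000, seed=0):
    """Fraction of random gene-sets of equal size with ES >= the real ES."""
    rng = random.Random(seed)
    real_es = enrichment_score(ranked_genes, scores, gene_set)
    size = len(gene_set & set(ranked_genes))
    n_extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if enrichment_score(ranked_genes, scores, random_set) >= real_es:
            n_extreme += 1
    return real_es, n_extreme / n_perm


# Toy ranked list: 20 genes with scores from 10 down to -9.
toy_genes = [f"g{i}" for i in range(20)]
toy_scores = [10.0 - i for i in range(20)]
toy_es, toy_p = empirical_pvalue(toy_genes, toy_scores, {"g0", "g1", "g2", "g3"})
```

With 4 of 2000 random gene-sets reaching the real ES, this calculation would give the p-value of 0.002 shown on the slide.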

GSEA Permutation Settings
• The permutation setting completely changes the nature of the GSEA test
• Gene-set permutations (aka pre-ranked)
• Takes a ranked gene list as input and permutes the genes in the gene-sets
• → competitive
• Recommended in presence of differential gene expression data for small or medium-scale
experiments (2-4 biological replicates per condition) with modest expression heterogeneity
• Phenotype permutation
• Permute the phenotype labels (e.g. treated, untreated), then repeat gene scoring; gene
scoring is performed within GSEA
• → competitive / self-contained hybrid
• Recommended for larger scale gene expression data (> 10 biological replicates per condition)
with high expression heterogeneity
• As an alternative, consider a pure self-contained test, or a self-contained test with a different
competitive correction

3.1.6. Self-contained tests:
gene-set somatic burden

OICR PanCuRx: Dataset Summary
• 200 primary tumours and 41 metastases (pancreatic cancer)
• Whole genome sequencing → detection of SNVs, indels, SVs, copy number gains and losses
• Mutation load outlier removal criterion: median + 2 IQR
→ Samples retained: 190/200 primaries and 41/41 metastases
[Figure: distributions of log10(SNV count), log10(indel count) and log10(SV count) in metastases (Met) vs primaries (Pri)]
Unpublished data
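
The outlier criterion (median + 2 IQR) can be sketched as below. Whether the criterion was applied to raw or log10 counts, and to which variant types, is not stated in the slides; the function names and the toy counts are assumptions.

```python
from statistics import median, quantiles


def mutation_load_cutoff(counts, n_iqr=2.0):
    """Upper cutoff: median + n_iqr * interquartile range."""
    q1, _, q3 = quantiles(counts, n=4)
    return median(counts) + n_iqr * (q3 - q1)


def remove_mutation_load_outliers(sample_counts, n_iqr=2.0):
    """Keep samples whose mutation count does not exceed the cutoff."""
    cutoff = mutation_load_cutoff(list(sample_counts.values()), n_iqr)
    return {s: c for s, c in sample_counts.items() if c <= cutoff}


# Toy cohort with one hypermutated sample.
toy_counts = {"s1": 10, "s2": 12, "s3": 11, "s4": 13,
              "s5": 9, "s6": 12, "s7": 10, "s8": 500}
toy_retained = remove_mutation_load_outliers(toy_counts)
```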

OICR PanCuRx: Gene-set Analysis Strategy
1. Perform gene-set burden test, primaries vs metastases
• Logistic regression (metastases vs. primary), separating each variant type:
M0 = y ~ ns_tot + ms_tot + ss_tot + sv_tot + cL_tot + cG_tot
M1 = y ~ ns_tot + ms_tot + ss_tot + sv_tot + cL_tot + cG_tot +
ns_gs + ms_gs + ss_gs + sv_gs + cL_gs + cG_gs
• Multiple test correction by BH-FDR (significant when BH-FDR < 27.5%)
2. For significant gene-sets, categorize driver variant type(s) and extract genes
more often mutated in metastases for such variant types (“leading edge” genes)
3. Cluster pathways based on leading gene overlaps, visualize using Cytoscape
enrichment map plugin
4. Overlay key genes (even more stringent filter: mutation rate met/pri > 4.5)
5. Formulate hypotheses → correlation with other tumour properties
• RNA-seq based proliferation index (CCP) and missense mutations in cell cycle genes
Unpublished results;
Gallinger, PanCuRx TRI, Toronto
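
The nested-model comparison can be sketched for a single variant type. This is a simplified illustration, not the PanCuRx analysis: it keeps only one total-count covariate and one gene-set covariate (M0 = y ~ tot, M1 = y ~ tot + gs), and it compares the models with a likelihood ratio test, which is an assumption since the slides do not state how M0 and M1 were compared. All names and the toy data are hypothetical.

```python
import math


def solve(matrix, rhs):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def logistic_loglik(X, y, n_iter=100):
    """Maximised log-likelihood of a logistic regression (Newton-Raphson)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * xi[j] * xi[k]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-9:
            break
    loglik = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        loglik += yi * eta - math.log(1.0 + math.exp(eta))
    return loglik


def burden_lrt(y, tot, gs):
    """Likelihood ratio test of M1 (tot + gs) against M0 (tot), 1 df."""
    x0 = [[1.0, t] for t in tot]
    x1 = [[1.0, t, g] for t, g in zip(tot, gs)]
    stat = max(2.0 * (logistic_loglik(x1, y) - logistic_loglik(x0, y)), 0.0)
    return stat, math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 df


# Toy data: y = 1 for metastases, 0 for primaries.
toy_y = [0] * 8 + [1] * 6
toy_tot = [10, 12, 11, 13, 9, 12, 10, 11, 11, 12, 10, 13, 9, 12]
toy_gs = [0, 1, 0, 0, 1, 0, 2, 0, 2, 1, 3, 0, 2, 2]
toy_stat, toy_pvalue = burden_lrt(toy_y, toy_tot, toy_gs)
```

With all six variant types in the model, the same comparison would use six degrees of freedom and a general chi-square tail function, and the sketch has no safeguard against perfectly separated data.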

[Figure: enrichment map of the significant gene-sets; nodes:]
REACT:TELOMERE MAINTENANCE
REACT:ION CHANNEL TRANSPORT
KEGG:BASE EXCISION REPAIR
REACT:RESOLUTION OF ABASIC SITES (AP SITES)
KEGG:MINERAL ABSORPTION
REACT:CHROMOSOME MAINTENANCE
REACT:BASE EXCISION REPAIR
REACT:TRANSMEMBRANE TRANSPORT OF SMALL MOLECULES
REACT:NUCLEOSOME ASSEMBLY
REACT:HDACS DEACETYLATE HISTONES
REACT:DEPOSITION OF NEW CENPA-CONTAINING NUCLEOSOMES AT THE CENTROMERE
REACT:DNA REPLICATION PRE-INITIATION
REACT:FORMATION OF THE BETA-CATENIN:TCF TRANSACTIVATING COMPLEX
REACT:G2/M CHECKPOINTS
KEGG:ECM-RECEPTOR INTERACTION
REACT:CELL CYCLE, MITOTIC
REACT:M/G1 TRANSITION
REACT:G1/S TRANSITION
REACT:MITOTIC METAPHASE AND ANAPHASE
REACT:TRANSCRIPTION-COUPLED NUCLEOTIDE EXCISION REPAIR (TC-NER)
REACT:GAP-FILLING DNA REPAIR SYNTHESIS AND LIGATION IN TC-NER
KEGG:SEROTONERGIC SYNAPSE
KEGG:GNRH SIGNALING PATHWAY
KEGG:CIRCADIAN ENTRAINMENT
Driver variants (legend):
• Missense (gain and loss of function?)
• Nonsense + missense (loss of function?)
• Nonsense
• Nonsense + copy number loss
• Copy number gain
• Missense + SV (loss and gain of function?)
• Other combination
For all clusters, only variants driving corresponding gene-sets
and with counts met >= pri are reported; considering the number
of met and pri, this corresponds to an enrichment ratio > 4.5
Gallinger, PanCuRx TRI, Toronto

Cell cycle (cell cycle progression and checkpoints), DNA replication (polymerase, replication initiation,
replication fork complexes), chromosome maintenance and segregation (centromere components,
centrosome components, spindle checkpoint) – missense, sometimes also sv [labelled]
CDT1 (4,0): prevents initiation of replication when DNA replication is ongoing
POLA1 (1,0) : DNA polymerases [POLD1, POLD3 and other DNA polymerases listed only in repair cluster]
MCM8 (2,0), MCM3 (1,0), MCM10 (1,1), MCM7 (1,1): replication fork complex – [MCM10 in CCP]
CENPA (1,0), CENPL (1,0), CENPJ (1,1): centromere (chromosome segregation) – [CENPM, CENPF in CCP]
NCAPD3 (1,0), NIPBL (1,1): chromosome condensation and/or segregation
CEP57 (2,0), CEP152 (2,1), CNTRL (1,1): microtubule centrosome (chromosome segregation) – [CEP55 in CCP]
ERCC6L (2,1): spindle checkpoint; CKAP5 (2,2): spindle formation; CASC5/KNL1 (sv 1,0): kinetochore
E2F1 (1,0), E2F4 (1,0), TFDP1 (1,0; sv 1,1): TFs regulating cell cycle progression
ANAPC11 (1,0), ANAPC2 (1,0): anaphase promoting complex (cell cycle progression); FBXO5 (1,0; sv 1,0):
anaphase promoting complex inhibitor
ATM (sv 1,1), TP53BP1 (1,1): TP53 pathway and DNA damage response; HMG20B (1,0): DNA damage response
[histone and histone (de)acetylation listed for the separate subcluster]
Other: AHCTF1 (2,2; sv 2,1), B9D2 (1,0), BARD1 (1,0), GORASP1 (1,0), LEMD2 (1,0), NEDD1 (1,0), NUP205 (1,0),
NUP88 (1,1), NUP133 (1,1), PPP1R12A (1,0), PSMA3 (1,1), PSMD1 (1,1), SDCCAG8 (1,1), SGOL2 (1,1), TUBGCP5
(1,0), UBB (1,0), YWHAH (1,0), XPO1 (1,0), WRAP53 (sv 1,0), ZW10 (1,0)
DNA base excision repair – missense, sv
PARP1 (sv 1,0), PARP2 (ms 1,0), PARP4 (ms 1,0), POLD3 (sv 1,0), MPG (ms 1,0), RPA1 (sv 2,0),
RPA2 (1,0), TDG (ms 1,0)
Transcription-coupled nucleotide excision repair – only missense
COPS2 (ms 1,0), EP300 (ms 2,0), ERCC3 (ms 2,0), POLK (ms 1,0), UBB (ms 1,0)
Both – missense, sv
LIG3 (ms 1,0), POLD1 (ms 1,1; sv 1,0), XRCC1 (ms 1,0; sv 1,0)
Beta catenin pathway – only missense
CTNNB1 (2,2): beta catenin
TCF7L2 (2,0): TF that partners with CTNNB1 and
activates target genes
Extracellular matrix–receptor interactions
– only missense
LAMB4 (1,1), LAMC1 (1,1), LAMC2 (1,0)
COL4A2 (1,1), COL6A3 (2,2), COL9A2 (1,1), COL6A5
(4,2), HSPG2 (2,0)
COMP (1,0), TNR (1,1)
ITGA1 (2,0), ITGB4 (2,0), ITGB3 (2,0), ITGA2B (1,0),
ITGA11 (1,1), ITGAV (1,1)
CD47 (1,0), CD36 (1,0)
Histones and histone (de)acetylation
– only missense
HIST1H2BB (2,1), HIST1H2BD (1,0), HIST1H2BL (1,0),
HIST1H2BO (sv 1,0): transcriptional activation,
response to DNA damage and other processes
H2AFB1 (sv 2,1)
CHD4 (1,0): nucleosome remodeling and histone
deacetylase complex
EP300 (2,0): histone acetyltransferase recognizing
enhancers, involved in cell cycle, DNA damage
response, …
KAT5 (1,0): histone acetyltransferase
ARID4B (1,1): histone deacetylase
WHSC1 (1,1; sv 1,0): histone methyltransferase
NCOR1 (1,1), TBL1XR1 (1,1): nuclear receptor
corepressor (N-CoR) and histone deacetylase 3
(HDAC 3) complexes
Misc. signalling
– only cnGain
ITPR2 (2,0)
ALOX12 (1,0)
GNAS (1,0)
MAP3K3 (1,0)
PRKCG (1,0)
Misc. signalling
– only nonsense
ADCY2 (1,0)
ADCY10 (1,1)
GUCY1A3 (1,0)
RYR3 (2,1)

All samples
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.42557 0.08620 4.937 2.39e-05 ***
# gsCC_ms_bin_stdz 0.14522 0.08739 1.662 0.1063
# gCDKN2ALOF_bin_stdz 0.16066 0.09449 1.700 0.0988 .
# vc_ms_tot_stdz 0.13934 0.08962 1.555 0.1298
Samples with <= 60 missense
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.31231 0.10020 3.117 0.00455 **
# gsCC_ms_bin_stdz 0.14051 0.09719 1.446 0.16068
# gCDKN2ALOF_bin_stdz 0.25673 0.11289 2.274 0.03181 *
# vc_ms_tot_stdz -0.09684 0.11187 -0.866 0.39489
Cell cycle missense x CDKN2A LOF (ns, sv, cL)
[Figure: ccpRNAindex (axis range -3 to 2) by group, met/pri x CDKN2A y/n x cell cycle ms y/n: Met_CDKN2Ay_CCMSy, Met_CDKN2Ay_CCMSn, Met_CDKN2An_CCMSy, Met_CDKN2An_CCMSn, Pri_CDKN2Ay_CCMSy, Pri_CDKN2Ay_CCMSn, Pri_CDKN2An_CCMSy, Pri_CDKN2An_CCMSn]
Gallinger, PanCuRx TRI, Toronto

General Applications of Self-Contained Tests
• Compare different tumour subtypes
• Compare tumours by survival or other properties (e.g. clinical grade, response to
therapy)
• Important to address systematic differences between tumours from different
groups (mutation load, mutation signatures, etc.)
• Relatively minor differences can be corrected for, whereas large differences will likely prevent
the analysis from working properly
• Correcting for total number of variants is typically recommended, and it can be considered a
“competitive” correction of the self-contained test (i.e. the gene-set is more predictive of the
difference between sample groups than all genes)

General Tips for Gene-set Analysis / 1
• Carefully design your experiment
• Flaws in experimental design, like presence of hidden confounders or insufficient number of
replicates, will result in confounded or negative gene-set results
• For gene expression experiments, perform exploratory analysis (PCA, MDS, hierarchical
clustering) to check relations among samples and validate the experimental design
• Choose gene-set types and filter gene-sets by size
• Start from most informative gene-sets: Gene Ontology, KEGG and Reactome pathways,
MSigDB cancer hallmarks
• Remove small gene-sets to improve power after multiple test correction (e.g. < 15 genes for
competitive tests applied to differential gene expression)
• For Gene Ontology, remove large gene-sets (e.g. > 500 genes) as they tend to be
uninformative
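
The size filter can be sketched as below. Counting only genes that are in the experiment's background before applying the size limits is an assumption, as are the function name and the default limits (taken from the examples of 15 and 500 genes above).

```python
def filter_gene_sets_by_size(gene_sets, background, min_size=15, max_size=500):
    """Keep gene-sets whose detectable genes number between min_size and max_size."""
    kept = {}
    for name, genes in gene_sets.items():
        detectable = set(genes) & background  # restrict to the background set
        if min_size <= len(detectable) <= max_size:
            kept[name] = detectable
    return kept


# Toy example with small limits so the effect is visible.
toy_background = {"A", "B", "C", "D", "E", "F"}
toy_gene_sets = {
    "too_small": {"A"},
    "kept": {"A", "B", "C", "Z"},
    "too_large": {"A", "B", "C", "D", "E"},
}
toy_filtered = filter_gene_sets_by_size(toy_gene_sets, toy_background,
                                        min_size=2, max_size=4)
```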

• Choose a competitive or self-contained test
Competitive:
• requires meaningful gene selection or ranking → typically suitable for differential gene
expression or genes with significant mutation burden
• if analyzing other -omics, model carefully the background distribution, do not simply assume
Fisher’s Exact Test or GSEA will be suitable (e.g. use GREAT for ChIP-seq, etc.)
Self-contained:
• typically suitable for sparser mutations, when differences are significant at gene-set level only
• ensure that different sample groups are comparable, correct for confounders
• Proper visualization is important to interpret results and to identify issues
• Use a visualization solution like Enrichment Map
• Visualize the full gene-set results, do not cherry-pick based on prior expectation
• Unexpected results can suggest issues (e.g. contamination, statistical bias)

• Do not forget to carefully evaluate genes with limited or no gene-set annotations and
network interactions…!
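
For the competitive case applied to a selected gene list, the one-sided Fisher's Exact Test mentioned above is equivalent to a hypergeometric upper-tail probability. The sketch below shows that calculation with the standard library only; the function names are hypothetical, and the background is assumed to be the set of genes that could have been selected in the experiment (not the whole genome by default).

```python
from math import comb

def hypergeom_upper_tail(n_background, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without
    replacement from n_background genes, n_in_set of which are in the gene-set."""
    max_k = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, max_k + 1)
    )
    return tail / total

def over_representation(selected_genes, gene_set, background_genes):
    """Return (overlap size, one-sided p-value) for a selected gene list.

    Both the selected list and the gene-set are first restricted to the
    background, so unmeasured genes cannot inflate or deflate the result.
    """
    background = set(background_genes)
    selected = set(selected_genes) & background
    members = set(gene_set) & background
    overlap = selected & members
    p = hypergeom_upper_tail(len(background), len(members), len(selected), len(overlap))
    return len(overlap), p
```

P-values from many gene-sets still need multiple test correction, and this background model is only appropriate when every background gene had a comparable chance of being selected.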

[Figure: pathway enrichment analysis workflow. Extract gene list from an 'omics experiment (ranked list or gene list) → perform pathway enrichment analysis (table of pathways with P-value and Q-value) → visualize and interpret with Cytoscape EnrichmentMap (clusterMaker, WordCloud, AutoAnnotate). Example cluster labels: "gtpase signal transduction", "regulation interferon gamma"]
Outputs
• Published on bioRxiv Jan 2017,
provisionally accepted by
Nature Protocols
• General concepts and
resources
• Step-by-step instructions for
gene-set analysis of gene
expression data

3.2.1. Network visualization and gene
network types

Network Representation and Visualization
Merico D, Gfeller D, Bader GD. How to visually interpret biological data
using networks. Nature Biotechnology 2009. PMID: 19816451

Network Visualization: Automatic Layout
Before layout After layout
• Yeast proteins annotated to GO cellular component “chromosome”
• Colored based on sub-component (nucleosome, kinetochore, replication fork)
• The layout (force directed) meaningfully arranges nodes (genes/proteins) and edges (interactions)
Merico D, Gfeller D, Bader GD. How to visually interpret biological data
using networks. Nature Biotechnology 2009. PMID: 19816451

Network Visualization: Cytoscape
• Rich GUI to map visual markup to data
• Imports tabular data (computational biologist friendly)
• Default functions for visualization, search, layout
• Lots of “apps” implementing specific algorithms and functionalities (e.g. Enrichment Map)

Gene Network Types
• Protein-protein (physical) interactions
• Biochemical reaction adjacency (mainly shared output/input in metabolic pathways)
• Regulator-target interactions (e.g. TF/miRNA-target)
• Co-expression
• Genetic interactions (e.g. synthetic lethality in double KO)
• Semantic similarity (e.g. similarity of Gene Ontology annotations)
• Publication co-citation
• Aggregate functional similarity (based on multi-omics)
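
As an illustration of how one of these network types can be derived from data, the sketch below builds co-expression edges by thresholding the Pearson correlation between gene expression profiles. The threshold of 0.8, the function names, and the input layout (gene symbol mapped to a list of expression values across the same samples) are assumptions made for the example, not a recommended standard.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def coexpression_edges(expression, min_abs_r=0.8):
    """Return (gene1, gene2, r) for every gene pair with |r| >= min_abs_r."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= min_abs_r:
            edges.append((g1, g2, r))
    return edges
```

The resulting edge list can be imported into Cytoscape as a table, with the correlation used as an edge weight.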

Networks vs Pathways
Pathways
• Hand-curated → more accurate
• Represent biochemical
reactions, or molecular events,
or regulatory relations among
proteins, protein complexes,
metabolites and other bio-
entities
Networks
• Derived from experimental high-throughput
methods or text mining → noisier
• Represent simple relations
among genes (e.g. binds, is
similar to, is co-expressed with,
regulates)
• Cover a larger number of genes

Gene Network Resources
iRefWeb/iRefIndex wodaklab.org/iRefWeb
• Resource integrating different databases
• Mainly protein interactions
• Useful to explore specific interactions, or bulk download
GeneMANIA www.genemania.org
• Multiple networks available (including iRefIndex protein interactions)
• Useful to construct, visualize, and evaluate networks from “seed” genes (network propagation
algorithm)
STRING string-db.org
• Integrated network, based on algorithm for function prediction
• Protein interactions, pathway interactions, co-expression, etc.

Network Analysis Overview
Most common analysis types:
• Subnetwork construction from seed genes → GeneMANIA
• Network clustering / module finding → ClusterMaker2 (MCODE, MCL, …)
• Enriched sub-network identification → Reactome FI, HyperModules, HotNet
Other types of analysis:
• Network inference from expression data → ARACNE
• Pathway/network activity inference → SPIA, PARADIGM
• Overall analysis of network topology
• Motif identification, motif content analysis

Gene-set vs Network Analysis
• Gene-set pros
• Better coverage of genes and known biological processes / components
• Simple algorithmics, a few well-established analysis options
• Gene-set cons
• Simple and flat structure, do not represent mechanistic details
• Pre-constructed based on “general biology”
• Network pros
• More structured, more insight on mechanistic details
• Can reveal new gene-gene associations
• Network cons
• More limited coverage of genes and known biological processes / components
• More complex algorithmics, more analysis options

GeneMANIA
INPUT = Query gene list (e.g. DLG1, SHANK)
Component 1: Weighted network combination
• Gene Ontology prediction
• Input gene connectivity
Component 2: Label propagation algorithm
OUTPUT = Query genes + interaction neighbour network
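
To give an intuition for label propagation from query genes, here is a deliberately simplified sketch in the style of a random walk with restart on one weighted network. It is not GeneMANIA's actual algorithm or code: the update rule, the `alpha` restart parameter, the fixed iteration count, and the function name are all assumptions chosen for illustration, and edge weights are assumed to be positive.

```python
def propagate_labels(edges, seeds, alpha=0.5, n_iter=50):
    """Spread scores from seed genes over a weighted, undirected network.

    edges: iterable of (gene_a, gene_b, weight) with weight > 0
    seeds: query genes, which start (and are re-injected) with score 1.0
    alpha: fraction of each score that comes from neighbours per iteration
    Returns a dict gene -> score; high-scoring non-seed genes are the
    candidate 'interaction neighbours'.
    """
    neighbours = {}
    for a, b, w in edges:
        neighbours.setdefault(a, {})[b] = w
        neighbours.setdefault(b, {})[a] = w
    # Weighted degree, used so each gene splits its score among its neighbours
    strength = {g: sum(nb.values()) for g, nb in neighbours.items()}
    initial = {g: (1.0 if g in seeds else 0.0) for g in neighbours}
    scores = dict(initial)
    for _ in range(n_iter):
        updated = {}
        for gene, nb in neighbours.items():
            received = sum(w * scores[n] / strength[n] for n, w in nb.items())
            updated[gene] = (1 - alpha) * initial[gene] + alpha * received
        scores = updated
    return scores
```

In this toy formulation a gene connected to several query genes scores higher than a gene reached only through a long chain, which is the behaviour the "query genes + interaction neighbour network" output relies on.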

Reactome FIViz
Components:
• Functional Interaction (FI) Network
• Uses experimental protein interactions in human, protein interactions in model organisms,
and gene expression to predict “functional interactions”
• Positive set: pathway-based interactions from Reactome
• Subnetwork construction algorithm
• Classical: only direct connections, or additionally linkers
• HotNet: heat kernel
• Clustering Algorithm
• Edge-betweenness used to find “local interaction communities” in the sub-network
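
The "classical" subnetwork construction (direct connections only, or additionally linkers) can be sketched as below. This is an illustration and not Reactome FIViz code: here a linker is assumed to be any non-query gene that interacts with at least two query genes, which may differ from how the app selects linkers, and the function name is hypothetical.

```python
def build_subnetwork(fi_edges, query_genes, use_linkers=False):
    """Return the set of undirected edges (as frozensets) connecting query genes.

    fi_edges: iterable of (gene_a, gene_b) functional interactions
    use_linkers: if True, also add non-query genes that touch >= 2 query
                 genes, together with their edges to those query genes
    """
    query = set(query_genes)
    edge_list = [(a, b) for a, b in fi_edges if a != b]  # ignore self-loops
    # Direct connections: both ends are query genes
    subnetwork = {frozenset((a, b)) for a, b in edge_list if a in query and b in query}
    if use_linkers:
        partners = {}  # non-query gene -> query genes it interacts with
        for a, b in edge_list:
            for gene, other in ((a, b), (b, a)):
                if gene not in query and other in query:
                    partners.setdefault(gene, set()).add(other)
        for linker, hits in partners.items():
            if len(hits) >= 2:
                subnetwork.update(frozenset((linker, q)) for q in hits)
    return subnetwork
```

The resulting edge set would then be passed to a clustering step (edge-betweenness in the slide above) to find modules.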

Glioblastoma Subnetwork
Input data:
a. DNA copy number detection for 206 glioblastomas
b. detection of somatic mutations in 601 selected genes for 91 matched tumor-normal pairs
Subnetwork modules (figure labels):
• Cell cycle checkpoints, DNA damage response
• Adhesion molecules
• NOTCH pathway
• Growth factor signaling
Wu G, Feng X, Stein L. A human functional protein interaction network and its application to cancer
data analysis. Genome Biol 2010. PMID: 20482850

GeneMANIA or Reactome FIViz?
• GeneMANIA: start from experimental genes, construct a larger network of
related genes (without further using the same experimental data); typically works
well when the initial genes form one cluster; when genes are too diverse, it tends to
connect them using less specific hubs
• Reactome FIViz: start from experimental genes, inter-connect them using
functional interactions and potentially including some linker genes, cluster them
into modules

Nature Methods 2015
For More Reading…

Thanks for your attention!
Baked by Ruth Isserlin
