Variation graphs and population assisted genome inference copy

Human Genome Variation Graphs
Benedict Paten - UC Santa Cruz Genomics Institute

benedict@soe.ucsc.edu

https://cgl.genomics.ucsc.edu/

Twitter: @BenedictPaten

Triumph of the reference human genome
• The publication of the human reference genome unleashed
the field of large-scale human genomics
• It offers a coordinate system to:
• describe gene sequences
• display annotations
• interpret molecular assays
• However, the reference genome represents only a single
instance among billions of unique human genomes...

[Figure: UCSC Genome Browser view (GRCh37/hg19, chr17:37,783,223-37,826,720, multi-region exon view) showing GTEx RNA-seq read coverage from brain caudate, brain cortex, skeletal muscle, sun-exposed skin, and thyroid samples over PPP1R1B, STARD3, TCAP, and PNMT. TCAP (titin cap protein) is highly expressed in muscle tissue, while PPP1R1B shows expression in brain basal ganglia but not in muscle or brain cortex.]
Figure: UCSC Genome Browser view of GTEx data, from Vivian et al., Nature Biotechnology 35, 314–316 (2017), doi:10.1038/nbt.3772

• However, the primary ref genome represents only a single
instance among billions of unique germline human genomes...

The problem with the reference
• Avg. 4-5 million point variants / individual
• 80 million point variants w/ >= 0.1% freq.
• Avg. > 10 megabases (Mb) in copy-number variants (CNVs) / individual
• 350-400 Mb in CNVs w/ >= 0.1% freq.
• Avg. > 6 Mb in large indels / individual
• > 100 Mb in large indels w/ >= 0.1% freq.

Sources:
Sharp, Cheng & Eichler, “Structural Variation of the Human Genome”, Annual Reviews (2006)
Kidd et al., “Characterization of Missing Human Genome Sequences and Copy-number Polymorphic Insertions”, Nat Methods 7(5): 365–371 (2010)

• These differences create a failure of
representation, for example:
• Some functional (transcribed) genes
are either present in disabled form or
absent from the current reference (e.g.
some HLA genes)
• Reference Allele Bias: Mapping
algorithms are intrinsically biased
towards ignoring evidence of variants
• The current reference is largely derived
from one individual, making it less
suitable for the study of genomes that
derive from other subpopulations
• In summary: the current reference genome
has become an impediment to personal
genomics

Source: Pei et al., “The GENCODE pseudogene resource”, Genome Biology 2012, 13:R51

Source: Degner et al., “Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data”, Bioinformatics 25(24), 2009 (at heterozygous SNPs, reads carrying the reference allele mapped at significantly higher rates than reads carrying the alternative allele)

Source: Osoegawa et al., “A Bacterial Artificial Chromosome Library for Sequencing the Complete Human Genome”, Genome Research, Cold Spring Harbor Laboratory Press (the RPCI-11 BAC library; the DNA was obtained from a single anonymous volunteer)
• The current primary reference is largely
derived from one individual, making it
less suitable for the study of genomes
that derive from other subpopulations
• In summary: the current primary reference
genome is an imperfect lens for personal
genomics

Alternate haplotypes
GRCh38 is a graph!

Human Genome Variation Graph Project
• Goals:
• Develop next generation human genetic reference that
includes known variation from all human ethnic
populations
• Provide tools to map, call, phase and represent genomes
Figure courtesy Kiran Garimella & Gil McVean

Existing Variation is Fragmented
Variants associated with phenotype
Genome- and locus-specific variation databases
Sequencing projects
Human reference genome

A Rosetta Stone for
human genomics

Merge diverse genomes into one graph
The major histocompatibility complex (figure: Kiran Garimella & Gil McVean)

Zooming in, you see local structure

At base level we assign unique position identifiers

Variation Graphs – The Essentials
GTCCCAA
ACGTGG
ACTACCA
TTACTAC
Set of sequences (nodes)
Joins (edges) connect sides of sequences.

Variation Graphs – The Essentials
GTCCCAAACGTGG TTACTAC
Joins can connect either side of a sequence (bidirected edges)
Walks encode DNA strings, with side of entry determining strand
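
The node, join, and walk definitions above can be sketched in a few lines of Python. This is a minimal illustration, not vg's data model: the class name `VariationGraph`, the side labels `"L"` and `"R"`, the node IDs, and the particular joins are all assumptions made for the example. Only the four node sequences come from the slide.

```python
from dataclasses import dataclass, field

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class VariationGraph:
    """Hypothetical minimal bidirected sequence graph (illustration only)."""
    nodes: dict = field(default_factory=dict)  # node_id -> sequence
    joins: set = field(default_factory=set)    # unordered pairs of sides

    def add_node(self, node_id, seq):
        self.nodes[node_id] = seq

    def add_join(self, side_a, side_b):
        # A side is (node_id, "L") or (node_id, "R").
        self.joins.add(frozenset((side_a, side_b)))

    def walk_to_string(self, walk):
        """Spell the DNA string of a walk given as [(node_id, entry_side), ...]."""
        out = []
        prev_exit = None
        for node_id, entry in walk:
            step = (node_id, entry)
            if prev_exit is not None and frozenset((prev_exit, step)) not in self.joins:
                raise ValueError(f"no join from {prev_exit} to {step}")
            seq = self.nodes[node_id]
            # Entering on the left reads the forward strand;
            # entering on the right reads the reverse complement.
            out.append(seq if entry == "L" else reverse_complement(seq))
            prev_exit = (node_id, "R" if entry == "L" else "L")
        return "".join(out)


g = VariationGraph()
for node_id, seq in [(1, "GTCCCAA"), (2, "ACGTGG"), (3, "ACTACCA"), (4, "TTACTAC")]:
    g.add_node(node_id, seq)
g.add_join((1, "R"), (2, "L"))
g.add_join((2, "R"), (4, "L"))
g.add_join((2, "R"), (3, "R"))  # joins two right sides, so node 3 is read in reverse

forward = g.walk_to_string([(1, "L"), (2, "L"), (4, "L")])
inverted = g.walk_to_string([(1, "L"), (2, "L"), (3, "R")])
print(forward)
print(inverted)
```

Storing each join as an unordered pair of sides is what makes the graph bidirected: the same join can be crossed in either direction, and the strand of each node is decided only by the side on which the walk enters it.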

Essential operations on variation graphs
• To switch to
variation graphs a
complete
ecosystem must be
redeveloped
• “rebooting
genomics” - Erik
Garrison
“Adapted from Computational Pan-Genomics: Status, Promises and Challenges.”
Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016)
https://github.com/vgteam/vg

Now lots of good genome graph development …

Genome Graph Vignettes
• Read mapping
• Haplotypes vs. graphs
• Visualization
• Alleles and sites
• Variant calling

Variation graph mapping GRCh38 alts in B-3106 from
human MHC

Simulation Study - Human
[Figure: ROC curves of TPR vs. FPR (FPR on a log scale from 1e-06 to 1e-02), with points labeled 0 to 60 and sized by number of reads, for the aligners bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe, and vg.ref.se. Two panels: one with TPR from 0.95 to 1.00 and point sizes of 250,000 to 1,000,000 reads, and one with TPR from 0.94 to 0.98 and point sizes of 2,500,000 to 7,500,000 reads.]
• 10 M reads from a genome with 1% error
• Subset of reads with >= 1 match to non-primary ref
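
As a rough sketch of how TPR/FPR curves of this kind can be computed from simulated reads, the snippet below accumulates correct and incorrect alignments from the highest mapping quality downward. It assumes that the point labels in the plot are mapping-quality values and that both rates are normalized by the total number of reads. The function name `roc_by_mapq` and the toy input are hypothetical and are not taken from the study.

```python
from collections import Counter


def roc_by_mapq(alignments):
    """Return [(mapq_threshold, tpr, fpr), ...] from the highest threshold down.

    alignments -- iterable of (mapq, is_correct) pairs, one per simulated read,
                  where is_correct says whether the read mapped to its true origin.
    """
    correct, wrong = Counter(), Counter()
    total = 0
    for mapq, is_correct in alignments:
        total += 1
        (correct if is_correct else wrong)[mapq] += 1

    points = []
    cum_correct = cum_wrong = 0
    for q in sorted(set(correct) | set(wrong), reverse=True):
        # Every read with MAPQ >= q counts as "called" at threshold q.
        cum_correct += correct[q]
        cum_wrong += wrong[q]
        points.append((q, cum_correct / total, cum_wrong / total))
    return points


# Toy data: 8 confident correct reads, 1 low-confidence correct read, 1 wrong read.
toy = [(60, True)] * 8 + [(10, True)] + [(0, False)]
roc_points = roc_by_mapq(toy)
for q, tpr, fpr in roc_points:
    print(q, tpr, fpr)
```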

Human - Indel Mapping Bias Alleviated
[Figure (b): allele fraction vs. variant size, for the aligners bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe, and vg.ref.se; point size gives number of reads (2,500,000 to 7,500,000)]

Mapping improvements differ by population
[Figure: % difference in perfect mappings, primary reference vs. 1KG graph, by 1000 Genomes super population; region: MHC]

Embedding Haplotypes
• Genome graphs do not encode linkage
• To restrict linkage, natural solution is to duplicate paths:
[Figure: a graph with nodes 1: 82 bp, 2: A, 3: G, 4: 38 bp, 5: C, 6: T, 7: 24 bp, shown alongside a version in which node 4 is duplicated as 4': 38 bp]
• But duplication creates mapping ambiguity

[Figure: the same graph (nodes 1: 82 bp, 2: A, 3: G, 4: 38 bp, 5: C, 6: T, 7: 24 bp) drawn together with duplicated nodes 1': 82 bp, 4': 38 bp, and 7': 24 bp]
• Instead maintain projection from haplotypes to graph:
• The question then becomes how to encode this projection?

• The Graph Positional Burrows Wheeler Transform
(gPBWT)
From Novak et al., “A Graph Extension of the Positional Burrows-Wheeler Transform and its Applications”, WABI 2016 (bioRxiv preprint doi:10.1101/051409)
[Figure: Fig. 1 from the paper. An illustration of the B0[] array for a single side numbered 0. Threads visiting this side may enter their next nodes on sides 1, 2, or 3. The B0[] array records, for each visit of a thread to side 0, the side on which it enters its next node. This determines through which of the available edges it should leave the current node. Because threads tend to be similar to each other, they are likely to run in “ribbons” of multiple threads ...]
• Reversible, compressible, enables efficient indexed queries
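
To make the idea of haplotype threads concrete, here is a deliberately naive Python sketch. It is not the gPBWT: it stores threads as plain lists and scans them linearly, whereas the gPBWT answers the same kind of query from a compressed index. The node numbering follows the seven-node example from the Embedding Haplotypes slide. The three threads, the function names, and the use of node IDs in place of node sides are assumptions made for illustration.

```python
from collections import defaultdict


def contains_subthread(thread, query):
    """True if `query` occurs as a contiguous run inside `thread`."""
    n, m = len(thread), len(query)
    return any(thread[i:i + m] == query for i in range(n - m + 1))


def count_threads_containing(threads, query):
    """Count the threads that contain `query` as a subthread (linear scan)."""
    return sum(contains_subthread(t, query) for t in threads)


def next_node_arrays(threads):
    """For each node, list the node each visiting thread goes to next.

    Loosely analogous to the B_s[] arrays, except that visits are kept in
    input order here rather than in the sorted order the gPBWT relies on.
    """
    arrays = defaultdict(list)
    for thread in threads:
        for current, following in zip(thread, thread[1:]):
            arrays[current].append(following)
    return dict(arrays)


# Hypothetical haplotypes over the graph 1 -> {2: A | 3: G} -> 4 -> {5: C | 6: T} -> 7.
threads = [
    [1, 2, 4, 5, 7],  # A ... C
    [1, 2, 4, 5, 7],  # A ... C
    [1, 3, 4, 6, 7],  # G ... T
]

supported = count_threads_containing(threads, [2, 4, 5])    # A followed by C
unsupported = count_threads_containing(threads, [2, 4, 6])  # A followed by T
arrays = next_node_arrays(threads)
print(supported, unsupported, arrays[4])
```

The walk 2, 4, 6 is valid in the graph but is carried by none of the threads, which is the linkage information the graph alone does not encode.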

gPBWT Performance
• Experiment:
• chr22
• 50,818,468 bp
• 5004 Haplotypes
• Result:
• 356 MB gPBWT + vg graph
• 0.011 bits per base - 200x compression
• ~336 GB for whole genome w/ 80 million point variants @ 100,000 diploid genomes
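
The bits-per-base figure can be checked with a few lines of arithmetic. The sketch below assumes that 1 MB means 10^6 bytes and that the uncompressed baseline is 2 bits per base; neither assumption is stated on the slide, and the function name is hypothetical.

```python
def bits_per_base(index_bytes, bases_per_haplotype, haplotypes):
    """Index size in bits divided by the total number of haplotype bases."""
    return index_bytes * 8 / (bases_per_haplotype * haplotypes)


# chr22 experiment from the slide: 356 MB for 5004 haplotypes of 50,818,468 bp.
chr22_bits_per_base = bits_per_base(356 * 10**6, 50_818_468, 5004)

# Ratio against a plain 2-bit-per-base encoding (assumed baseline).
ratio_vs_2bit = 2 / chr22_bits_per_base

print(round(chr22_bits_per_base, 4), round(ratio_vs_2bit))
```

Under these assumptions the result is about 0.011 bits per base, and the ratio against a 2-bit encoding is of the same order as the 200x quoted on the slide.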

• Tube Maps
Wolfgang Beyer

Prototype: Wolfgang Beyer https://vgteam.github.io/sequenceTubeMap/

Haplotype Probabilities
• Li & Stephens: Efficiently compute P(h|H), where h is
haplotype and H is population
[Figure: Li & Stephens model: sequences h are generated by walks x across the space of all haplotypes H]
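
For the linear (non-graph) case, P(h|H) can be computed with a forward algorithm over the hidden "which haplotype is being copied" state. The sketch below is a textbook-style simplification, not the implementation behind the talk. It assumes biallelic sites, a uniform switch probability `rho` per site, and a miscopy probability `mu`, and every name in it is hypothetical.

```python
from itertools import product


def li_stephens_likelihood(h, panel, rho=0.01, mu=0.001):
    """Forward algorithm for P(h | H) under a simple copying model.

    h     -- query haplotype, a string of '0'/'1' alleles, one per site
    panel -- list of reference haplotypes (H), each the same length as h
    rho   -- per-site probability of switching to a uniformly chosen haplotype
    mu    -- per-site probability that the copied allele is miscopied
    """
    k = len(panel)
    if k == 0 or any(len(p) != len(h) for p in panel):
        raise ValueError("panel must be non-empty and match the length of h")
    if not h:
        return 1.0

    def emit(j, i):
        # Probability of observing h[i] while copying panel haplotype j.
        return 1.0 - mu if panel[j][i] == h[i] else mu

    forward = [emit(j, 0) / k for j in range(k)]
    for i in range(1, len(h)):
        total = sum(forward)
        forward = [
            emit(j, i) * ((1.0 - rho) * forward[j] + rho * total / k)
            for j in range(k)
        ]
    return sum(forward)


panel = ["0000", "1111"]
p_no_switch = li_stephens_likelihood("0000", panel)
p_one_switch = li_stephens_likelihood("0011", panel)
p_mosaic = li_stephens_likelihood("0101", panel)

# Sanity check: the probabilities of all 16 possible haplotypes sum to 1.
p_total = sum(
    li_stephens_likelihood("".join(bits), panel) for bits in product("01", repeat=4)
)
print(p_no_switch, p_one_switch, p_mosaic, p_total)
```

Each site costs O(K) work for a panel of K haplotypes because the switch term shares a single sum across all states, which is what makes the computation efficient.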

• Graph Li & Stephens: Efficiently compute P(x|H), where x
is haplotype walk in a genome graph
[Figure: graph model: sequences h are generated by walks x through G which follow segments of haplotypes in H; x is consistent with h; g1, g2, g3 ∈ H]

• Applied to vg mapped reads:
Non-recombinants, 90%; single recombinants, 9%; double recombinants, 1%

What’s a site and an allele in a variation graph?
• Use subgraph decomposition to find single source/sink
subgraphs, set of paths are the alleles
[Figure: example bubble and superbubble subgraphs built from single-base nodes]
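
Once a single source/sink subgraph has been found, listing its alleles amounts to enumerating the paths between the two boundary nodes. The sketch below does this with a depth-first search on a small acyclic example. The adjacency-dict representation, the node IDs, and the function name are assumptions for illustration, and it does not show how such subgraphs are detected in the first place.

```python
def enumerate_alleles(successors, sequences, source, sink):
    """Return the sorted sequences spelled by all source-to-sink paths.

    Only interior nodes contribute to an allele; the source and sink are the
    shared flanks. Assumes the subgraph between source and sink is acyclic.
    """
    alleles = []

    def dfs(node, interior):
        if node == sink:
            alleles.append("".join(sequences[n] for n in interior))
            return
        for nxt in successors.get(node, ()):
            dfs(nxt, interior if nxt == sink else interior + [nxt])

    dfs(source, [])
    return sorted(alleles)


# A simple bubble: A -> (T | C) -> A.
sequences = {1: "A", 2: "T", 3: "C", 4: "A"}
bubble = {1: [2, 3], 2: [4], 3: [4]}
snp_alleles = enumerate_alleles(bubble, sequences, 1, 4)

# Adding a direct join from source to sink introduces an empty (deletion) allele.
bubble_with_deletion = {1: [2, 3, 4], 2: [4], 3: [4]}
indel_alleles = enumerate_alleles(bubble_with_deletion, sequences, 1, 4)
print(snp_alleles, indel_alleles)
```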

A haplotype phasing pipeline
[Diagram: Read mapping → Variant calling → Haplotype phasing, with known population information as input]
Population Assisted Variant Calling
[Diagram: haplotype likelihood and read likelihood combine into a genome posterior probability]
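
One way to read the diagram is as Bayes' rule: the posterior for a candidate genome is proportional to the haplotype likelihood (the population prior) times the read likelihood. The sketch below shows that combination on made-up numbers. The candidate names, the probabilities, and the function name are all hypothetical, and the actual calling model may differ.

```python
def genome_posterior(candidates):
    """Normalize haplotype_likelihood * read_likelihood over candidate genomes.

    candidates -- dict mapping a candidate name to a pair
                  (haplotype_likelihood, read_likelihood)
    """
    unnormalized = {
        name: hap_lik * read_lik for name, (hap_lik, read_lik) in candidates.items()
    }
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("all candidates have zero probability")
    return {name: value / evidence for name, value in unnormalized.items()}


# Made-up example: the population prior favors ref/ref, the reads favor ref/alt.
posterior = genome_posterior({
    "ref/ref": (0.90, 0.01),
    "ref/alt": (0.09, 0.50),
    "alt/alt": (0.01, 0.20),
})
best_call = max(posterior, key=posterior.get)
print(best_call, posterior)
```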

Genome Variation Graphs Summary
• A shared reference graph will provide a single canonical naming scheme for human variants: either it is already a (named) path in the graph, or it is a new canonically named augmentation
• A better prior: Clear benefits for simplifying and improving read mapping and variant calling - could ultimately lower cost of genome inference
• Additional haplotype data can be embedded (gPBWT)
• The natural reference is a population cohort - we should build a public cohort for hundreds of thousands of individuals - let’s change the culture of de-identified sharing
• True population assisted genome inference is coming
• Still many open problems: repeatome, annotations, RNA

Thanks!
UCSC
Adam Novak
Glenn Hickey
Sean Blum
Yohei Rosen
Jordan Eizenga
Wolfgang Beyer
Karen Hayden
David Haussler
Team VG:

Erik Garrison
Eric Dawson
Mike Lin
Jouni Siren
(and many more)

GA4GH ref-var group:

Andres Kahles
Ben Murray
Goran Rakocevic
Alex Dilthey
Sarah Guthrie
Jerome Kelleher
Heng Li
Stephen Keenan
Richard Durbin
Gil McVean
Opportunities: https://cgl.genomics.ucsc.edu/ benedict@soe.ucsc.edu
