Presentation by Karen Miga at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on centromere assemblies.
4. PROGRESS UPDATE: CENTROMERE SEQUENCE ASSEMBLIES
1. GRCh38 Reference Models for Human Centromere Arrays
2. Efforts to Generate True, Linear Assemblies of Centromeric Regions: Chromosome Y
3. Future Perspective
5. CHALLENGE OF ASSEMBLING LONG TRACTS OF (NEAR IDENTICAL) TANDEM REPEATS
p-arm ... ... q-arm
Multi-megabase sized arrays of satellite DNA
...ATCCGATTACG ATCCGATTACGATCCGATTACG... ...ATCCGATTACG ATCCGATTACGATCCGATTACG...
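The difficulty with near-identical tandem repeats can be illustrated with a minimal sketch. It assumes a hypothetical array built from 50 exact copies of the ATCCGATTACG unit shown on the slide and a 25 bp read taken from inside it. The copy number, read length, and function name are illustrative assumptions, not values from the talk.

```python
def find_placements(reference: str, read: str) -> list[int]:
    """Return every 0-based offset at which `read` matches `reference` exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Step by one base so overlapping matches are also reported.
        start = reference.find(read, start + 1)
    return hits


UNIT = "ATCCGATTACG"          # repeat unit shown on the slide
array = UNIT * 50             # hypothetical 50-copy array (assumption)
read = array[3:3 + 25]        # hypothetical 25 bp read drawn from the array

placements = find_placements(array, read)
print(f"{len(placements)} equally good placements, first few: {placements[:3]}")
```

Every placement is an exact match, so sequence alone cannot say which copy the read came from. Only reads long enough to span the array, or to reach a distinguishing variant, resolve the ambiguity.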
6. HUMAN CENTROMERES: ALPHA SATELLITE
p-arm ... ... q-arm
ALPHA SATELLITE: ~171 bp Tandem Repeat
1 2 3 4
Wide Range of Percent ID: ~60-100%
7. HIGHER ORDER REPEATS
“Higher Order Repeat”: Multi-monomeric Repeat Unit
p-arm ... ... q-arm
1 2 3 4 1 2 3 4 1 2 3 4
Narrow Range of Percent ID: 94% - 100%
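The contrast between the two percent-identity ranges (monomer versus monomer, and higher order repeat copy versus copy) can be sketched with a simple identity function. The sequences below are invented toy strings, far shorter than real ~171 bp monomers, and the sketch assumes the sequences are already the same length. Real alpha satellite comparisons need a proper alignment.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of positions with the same base; assumes equal-length, gap-free input."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Toy monomers (hypothetical): neighbouring monomers within one repeat unit diverge.
monomer_1 = "ACGTACGTAC"
monomer_2 = "ACGTTCGTAA"

# Toy higher order repeat copies (hypothetical): whole multi-monomer units are nearly identical.
hor_copy_1 = monomer_1 + monomer_2
hor_copy_2 = monomer_1 + "ACGTTCGTAT"

print(percent_identity(monomer_1, monomer_2))    # monomer vs monomer
print(percent_identity(hor_copy_1, hor_copy_2))  # HOR copy vs HOR copy
```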
9. GENOME MODEL OF SEQUENCE ORGANIZATION IN CENTROMERE-ASSIGNED GAPS
[Figure labels: p-arm, q-arm, -A-, -T-]
10. GENOME MODEL OF SEQUENCE ORGANIZATION IN CENTROMERE-ASSIGNED GAPS
[Figure labels: p-arm, q-arm, -A-, -T-]
LINE, SINE, OTHER, NON-ALPHA SATELLITE
11. GENOME MODEL OF SEQUENCE ORGANIZATION IN CENTROMERE-ASSIGNED GAPS
[Figure labels: p-arm, q-arm, -A-, -T-]
LINE, SINE, OTHER, NON-ALPHA SATELLITE
Unmapped (Yet Assembled) Scaffolds
13. 1. GRCh38 Alpha Satellite Reference Models
Characterize HORs in Human Genome [1]
A B C D E F
14. 1. GRCh38 Alpha Satellite Reference Models
Characterize HORs in Human Genome [1]
A B C D E F
>200 ENCODE datasets
α-Centauri (centromeric automated repeat identification)
y Step Example For Single P-read, I
[Figure: a single read drawn 5’ to 3’ with monomers A, B, C, D, E, F; labels “10x” and “10”]
[1] http://github.com/volkansevim/alpha-CENTAURI
16. Chromosome specific assignment
Experimental Evidence: FISH Hybridization/Mapping and Screening Somatic Cell Hybrid Panel
A B C D E F
D7Z1, 6-mer, Waye et al. (1987), 98%, GenBank: M16101
Flow Sorted Chromosome Alignment/Enrichment: Sequence enrichment analysis of isolated human chromosomes
Long Range Paired Read Support: “Anchor” mapped to the assembled p-arm and/or q-arm
23. CENTROMERE SEQUENCE ASSEMBLY
1. GRCh38 Alpha Satellite Reference Models
2. Linear Assembly of a Human Centromere
Miga, K.H., et al. Genome Research 24.4 (2014): 697-707.
24. LINEAR ASSEMBLY OF A HUMAN CENTROMERE ON THE Y CHROMOSOME
Small, haploid satellite array with well-characterized 5.8 kb repeat
p-arm q-arm
30. NANOPORE SEQUENCING: LONGBOARD (1D)
UCSC LONGBOARD 1D PROTOCOL
In total, we have generated 3500+ reads greater than 150 kb
31. MULTIPLE ALIGNMENT STRATEGY TO IMPROVE QUALITY BY CONSENSUS
UCSC LONGBOARD 1D PROTOCOL
High Qualit[...] Consensus Req[...] Modest Cove[...]
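The idea of improving quality by consensus can be sketched as a per-column majority vote over reads that have already been placed in a multiple alignment. This is a generic illustration, not the pipeline used in the talk. The reads are invented, "-" is assumed to mark an alignment gap, and the function name is hypothetical.

```python
from collections import Counter


def column_consensus(aligned_reads: list[str]) -> str:
    """Majority-vote consensus over equal-length, pre-aligned reads ('-' marks a gap)."""
    if not aligned_reads:
        raise ValueError("need at least one aligned read")
    if len({len(r) for r in aligned_reads}) != 1:
        raise ValueError("aligned reads must all have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":            # a majority gap means no base is emitted
            consensus.append(base)
    return "".join(consensus)


# Hypothetical error-containing reads covering the same 11 bp stretch.
reads = [
    "ATCCGATTACG",
    "ATCGGATTACG",   # substitution
    "ATCCGAT-ACG",   # deletion
    "ATCCGATTACC",   # substitution
    "ATCCGATTACG",
]
print(column_consensus(reads))
```

Because each error falls in a different column, the majority base is correct at every position even though three of the five reads contain an error. This shows why even modest coverage can yield a consensus more accurate than any single read.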
32. RP11 718M18, 221.4 kb
Vector, Insert
634 Predicted Nucleotide Variants
2 Tandem Structural Rearrangements
38 CENY RPTS (>99% Identity to published consensus)
Homopolymers [A]n
Homopolymers [T]n
33. VALIDATE HIGH-CONFIDENCE SINGLE COPY VARIANTS WITH ILLUMINA
Identify informative, single copy sites in the array useful for overlap BAC-based assembly
Y SINGLE COPY VARIANTS USING ILLUMINA DATA
RP11 718M18, 221.4 kb
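One way to picture an informative single copy site is as a k-mer that occurs exactly once in the array. The sketch below counts k-mers in a hypothetical five-unit array in which one copy carries a single substitution. The array, the variant, k = 8, and the function name are all assumptions for illustration, and exact counting within one sequence stands in for the Illumina-based validation named on the slide.

```python
from collections import Counter


def unique_kmer_starts(sequence: str, k: int) -> list[int]:
    """Return 0-based start positions of k-mers that occur exactly once in `sequence`."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return [i for i, kmer in enumerate(kmers) if counts[kmer] == 1]


UNIT = "ATCCGATTACG"       # repeat unit borrowed from the earlier slide
VARIANT = "ATCCGTTTACG"    # hypothetical copy with one A>T substitution
array = UNIT * 2 + VARIANT + UNIT * 2

starts = unique_kmer_starts(array, k=8)
print(starts)
```

Only the k-mers overlapping the substitution are unique. These are the sites that could anchor an overlap between two clones, while every other k-mer is shared by several copies and carries no positional information.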
36. Future Perspective
1. Linear assemblies of human centromeric regions improve in step with sequencing technology (i.e. read length and quality)
2. One genome is not enough: Highly variable
3. Linear CEN assemblies present a mapping challenge to most genomic applications
37. True Linear Maps of Human CEN Regions
Y CEN
True Linear
Arrangement
Informatics/Analysis
Data Structure
38. Key Advantages of Satellite DNA Graphs
1. Eliminates sequence redundancy
39. Key Advantages of Satellite DNA Graphs
1. Eliminates sequence redundancy
Improves Unambiguous Short Read Mapping
[Figure: REPEAT REPEAT REPEAT with “?” versus a single 5’ REPEAT 3’]
Centromere Graphs (Benedict Paten, Adam Novak)
Demonstrate unambiguous mapping of the majority (>98%) of 1000 Genomes alpha satellite reads
40. Key Advantages of Satellite DNA Graphs
1. Eliminates sequence redundancy
2. Information describing long-range haplotypes is
retained as defined “paths” in the graph
41. Key Advantages of Satellite DNA Graphs
1. Eliminates sequence redundancy
2. Information describing long-range haplotypes is
retained as defined “paths” in the graph
3. Graph data structure and sequence analysis tools
will be consistent with the rest of the human genome
The major histocompatibility complex (Kiran Garimella & Gil McVean)
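The first two advantages can be sketched with a toy data structure in which each distinct repeat unit is stored once as a node and a haplotype is recorded as a path of node ids. This is a deliberately simplified illustration. Real genome graphs work at finer resolution with explicit edges, and the units, variant, and function names here are hypothetical.

```python
def build_repeat_graph(units: list[str]) -> tuple[dict[int, str], list[int]]:
    """Collapse identical repeat units into single nodes; return (nodes, path)."""
    node_ids: dict[str, int] = {}
    nodes: dict[int, str] = {}
    path: list[int] = []
    for unit in units:
        if unit not in node_ids:
            node_ids[unit] = len(node_ids)
            nodes[node_ids[unit]] = unit
        path.append(node_ids[unit])
    return nodes, path


def spell_path(nodes: dict[int, str], path: list[int]) -> str:
    """Recover the linear haplotype sequence by walking the path through the nodes."""
    return "".join(nodes[node_id] for node_id in path)


UNIT = "ATCCGATTACG"       # repeat unit borrowed from the earlier slide
VARIANT = "ATCCGTTTACG"    # hypothetical variant copy
haplotype_units = [UNIT, UNIT, VARIANT, UNIT, UNIT]

nodes, path = build_repeat_graph(haplotype_units)
stored_bases = sum(len(seq) for seq in nodes.values())
print(path, stored_bases, len(spell_path(nodes, path)))
```

The graph stores 22 bases instead of 55, yet walking the path reproduces the full linear arrangement. Redundancy is removed without losing the long-range order of the copies.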
42. Creating (and mapping to) a Universal Reference Genome
Benedict Paten, Adam Novak, David Haussler, UC Santa Cruz
Mark Akeson
Miten Jain
Hugh Olsen
Benedict Paten
Dave Deamer
Robin AbuShumays
Andrew Smith
Ian Fiddes
Art Rand
Logan Mulroney
Jordan Eizenga
Rojin Safavi
Rachel Lawton
Andrew Bailey
Ariah Mackie
David Haussler
Benedict Paten
Jim Kent
Sofie Salama
UCSC Nanopore Analysis Group
Miten Jain Hugh Olsen Mark Akeson
Dan Turner
David Stoddart
Oxford Nanopore Technologies
Huntington F. Willard
David Page
Product: Version
Device: MinION MK1
Flow cell: FLO-MIN106
Kits: Rapid Sequencing Kit
Data analysis: Albacore 1.0.1, Metrichor 1D
Acknowledgements