Short lecture covering my recent work on real-time phylogenomics, its implications for bioinformatics research, and future directions in genomic/phylogenetic modelling that explicitly account for phylogeny, synteny and identity through coloured graphs.
University of Reading, 2nd August 2017
Inference and informatics in a 'sequenced' world
1. Informatics and inference in a sequenced world
Dr. Joe Parker
Early Career Research Fellow (Phylogenomics)
Royal Botanic Gardens, Kew
@lonelyjoeparker:
3. Incredible times for bioscience
Images – Wikimedia commons CC BY-SA
(clockwise from top left: Jeroen Rouwkema, @aGastya, author’s own, @RE73)
4. Step back: molecular evolution
“Horizontal gene transfer occurs x times more frequently in these lineages, because of
this biology”
“Convergent evolution is rare in most genes, in most organisms, but y times
greater in these gene families …because of this biology”
“New chromosomes are created & destroyed at rates z, q in this reproductive
strategy …because of this biology”
6. Snowdonia, HelloWorld & ‘tent-seq’
Arabidopsis thaliana & A. lyrata
Congeneric species;
Reference genomes available
Field-sequenced (MinION) &
Lab-sequenced (Illumina™)
Orthogonal BLAST:
4 sample*sequencer combinations
Compare TRUE & FALSE rates for
varying ID statistic cutoffs
7. Tasty pics
Conditions
100% humidity; 6-13ºC
Essential kit
800 W generator
3x laptops
Centrifuge
Waterbath
Polystyrene boxes (lots)
Kettle(…!)
Yield
>400Mbp data in three days;
A. thaliana ~2.01x coverage
8. Field- vs. lab-sequenced sample ID
Match individual reads to each
reference with BLAST
Compare match lengths in
TRUE and FALSE cases
‘Length bias’ ID stat:
length(TRUE) - length(FALSE)
Compare TRUE & FALSE rates as
length bias cutoff varies
MiSeq (lab)
MinION (field)
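The length-bias statistic above can be sketched in a few lines of Python. This is a minimal sketch: the per-read best-hit match lengths would really be parsed from BLAST tabular output, so the synthetic numbers below are placeholders for illustration only.

```python
# Sketch of the 'length bias' read-ID statistic: for each read, compare
# the best BLAST match length against the correct (TRUE) and incorrect
# (FALSE) reference, then sweep a cutoff on the difference.
import random

random.seed(42)

# Hypothetical best-hit match lengths (bp) for 1,000 reads against each reference
length_true = [random.gauss(400, 120) for _ in range(1000)]
length_false = [random.gauss(250, 120) for _ in range(1000)]

def length_bias(lt, lf):
    """Length-bias ID statistic for one read: length(TRUE) - length(FALSE)."""
    return lt - lf

biases = [length_bias(lt, lf) for lt, lf in zip(length_true, length_false)]

# Sweep the cutoff: a read is assigned to the TRUE reference when bias > cutoff
for cutoff in (0, 50, 100, 200):
    assigned_true = sum(1 for b in biases if b > cutoff)
    print(f"cutoff={cutoff:>4} bp: {assigned_true / len(biases):.1%} of reads pass")
```

Raising the cutoff trades sensitivity for precision, which is exactly the TRUE/FALSE-rate curve described above.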
9. Bitty data (1): partial queries
Subsample MinION output
Repeat ID pipeline, record
mean ID stat (length bias)
Replicates: N = 30
Simulate from 10⁰ – 10⁴ reads
(≈instant → hours)
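The subsampling experiment might look like this in outline. It is a sketch under stated assumptions: the pool of per-read ID statistics and its size are invented placeholders, standing in for the real BLAST-derived values.

```python
# Sketch of the partial-query experiment: draw subsamples of
# 10^0 .. 10^4 reads (N = 30 replicates each) and record the
# mean ID statistic for each subsample size.
import random
from statistics import mean

random.seed(1)
# Placeholder per-read ID statistics standing in for real pipeline output
all_read_stats = [random.gauss(150, 80) for _ in range(20000)]

N_REPLICATES = 30
for exponent in range(5):            # 10^0 .. 10^4 reads
    n_reads = 10 ** exponent
    replicate_means = [
        mean(random.sample(all_read_stats, n_reads))
        for _ in range(N_REPLICATES)
    ]
    print(f"n={n_reads:>6}: mean ID stat = {mean(replicate_means):7.2f}")
```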
10. Bitty data (2): partial references
Take reference genome at
high contiguity
Fragment randomly to
target (low) contiguity
Repeat read identification
using fragmented DB
Simulate N50 ≈1,000bp
to N50 ≈ 10Mbp
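The reference-fragmentation step could be simulated as below. This is an illustrative sketch, not the published pipeline: the `fragment` helper, the sequence length and the target N50 are assumptions chosen to keep the example small.

```python
# Sketch: randomly fragment a contiguous reference until a target
# (low) N50 is reached, mimicking a partial/fragmented BLAST database.
import random

random.seed(7)

def n50(lengths):
    """Length of the contig at which half the total assembly length is reached."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

def fragment(seq_len, target_n50):
    """Randomly split a sequence of seq_len bp until N50 <= target_n50."""
    pieces = [seq_len]
    while n50(pieces) > target_n50:
        pieces.sort()
        longest = pieces.pop()                 # break the longest piece...
        cut = random.randint(1, longest - 1)   # ...at a random internal point
        pieces += [cut, longest - cut]
    return pieces

pieces = fragment(seq_len=1_000_000, target_n50=10_000)  # 1 Mbp -> N50 <= 10 kbp
print(len(pieces), "fragments, N50 =", n50(pieces))
```

Repeating the read-identification step against such fragmented databases shows how far reference contiguity can degrade before ID accuracy suffers.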
11. Keeping it simple: Kew Science Festival
Six species: whole genome-skim samples with MinION in preparation
Build BLAST DBs from skimmed data
Select ‘unknown’ (blinded) sample, extract DNA and resequence in real-time
Compare to partial DBs in six-way BLAST competition
Live ID?
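The six-way BLAST competition amounts to a running vote over streamed reads. A minimal sketch, assuming each read's best bit-score per skim database is already available; the species names, scores and stream are placeholders.

```python
# Sketch of live sample ID: as reads stream off the sequencer, each
# read's best-scoring skim database wins a vote, and the running tally
# identifies the blinded sample in real time.
from collections import Counter

def best_hit(read_scores):
    """Return the species whose skim DB gave the top bit-score for a read."""
    return max(read_scores, key=read_scores.get)

tally = Counter()
# Placeholder stream: each dict maps species -> best BLAST bit-score for one read
stream = [
    {"sp_A": 50, "sp_B": 90, "sp_C": 40, "sp_D": 30, "sp_E": 20, "sp_F": 10},
    {"sp_A": 45, "sp_B": 88, "sp_C": 60, "sp_D": 35, "sp_E": 25, "sp_F": 15},
    {"sp_A": 70, "sp_B": 95, "sp_C": 50, "sp_D": 40, "sp_E": 30, "sp_F": 20},
]
for read_scores in stream:
    tally[best_hit(read_scores)] += 1
    leader, votes = tally.most_common(1)[0]
    print(f"after {sum(tally.values())} reads: leading ID = {leader} ({votes} votes)")
```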
12. de novo genome assembly

Data                                MiSeq only   MiSeq + MinION
Assembler                           ABySS        hybridSPAdes
Illumina reads, 300bp paired-end    8,033,488    8,033,488
Illumina data (yield)               2,418 Mbp    2,418 Mbp
MinION reads (R7.3 + R9 kits,
  N50 ~ 4,410bp)                    -            96,845
MinION data (yield)                 -            240 Mbp
Approx. coverage                    19.49x       19.49x + 2.01x

Assembly key statistics:
# contigs                           24,999       10,644
Longest contig                      90 Kbp       414 Kbp
N50 contiguity                      7,853 bp     48,730 bp
Fraction of reference genome (%)    82           88
Errors, per 100 kbp: # N’s          1.7          5.4
  # mismatches                      518          588
  # indels                          120          130
Largest alignment                   76,935 bp    264,039 bp

CEGMA gene completeness estimate:
# genes                             219 of 248   245 of 248
% genes                             88%          99%
18. Three-colour graphs: phylogeny, synteny & identity
Key:
Extant node
Inferred node
Synteny edge (physical connection)
Phylogeny edge (evolutionary connection)
Identity edge (organismal connection)
[Figure: example three-colour graph over nodes a, b, c, d, e, x, y, z]
21. Thanks, funders, contacts and questions
Oxford Nanopore Technologies Ltd.:
Dan Turner, Richard Ronan, Gerrard Coyne
RBG Kew:
Alexander S.T. Papadopulos (@metallophyte)
Andrew Helmstetter (@ajhelmstetter)
Dion Devey, Robyn Cowan, Tim Wilkinson, Stephen Dodsworth, Pepijn Kooij, Felix Forest, Bill Baker, Jan T.
Kim, Jenny Williams, Abigail Barker, Mark Lee, Jim Clarkson, Mike Chester, Ester Gaya, Lisa Pokorny, Laszlo
Csiba, Paul Wilkin, Richard Buggs, Mike Fay, Mark Chase, Ilia Leitch
QMUL
Laura Kelly, Kalina Davies, Steve Rossiter
Oxford
Aris Katzourakis, Oli Pybus, Jayna Raghwani
Others
Forest Research: Daegan Inward, Katy Reed
Dstl: Claire Lonstale, James Taylor
Birmingham: Nick Loman, Josh Quick
U. Utah: Bryn Dentinger
Imperial: James Rosindell
This research was conducted in the Sackler Phylogenomics Laboratory and was supported by the Calleva Foundation Phylogenomic Research Programme and the Sackler Trust.
@lonelyjoeparker:
joe.parker@kew.org
Editor’s notes
Welcome, thanks, menu
Formal introduction and thanks;
Lay out the menu / journey
I’ll mainly be talking about work in the last 2.5 yrs since taking up my ECRF at Kew
Wide range of taxa, techniques and questions.
Enough to set my scene without taking ages, confusing/losing audience, or giving the impression I’m just a tools-bot.
Start of… Incredible times
Traditional to start bioinformatics talks with a slide about Moore’s law, sequencing costs, and the data deluge
Actually this is a fantastic age to be living in, ever bigger analyses – and I’ll talk a lot about “real-time” phylogenomics
But why? What are we attempting to discover?
We need enough data to turn observations into empirical comparisons, and comparisons into models and laws
We know a lot about evolutionary mechanisms
And a lot about (a handful of genomes)
What we know tells us “it’s complicated”
Most genes don’t have simple orthologues, horizontal transfer abounds, etc.
But we don’t, really, have an empirical understanding of how they fit together, e.g.:
- ”horizontal gene transfer occurs x times more frequently in these lineages, because of this biology”
- adaptive molecular convergence is rare in most genes, in most organisms, but y times greater in these gene families because of this biology
- new chromosomes are created (by duplication, endogenisation, polyploidy) and destroyed (by diploidization) at z rates in this reproductive strategy because of biology
Portable sequencing: also long reads and real-time
Direct, explicit, orthogonal test – and can it work?
Picture of experimental design
Outline of the study
In terms of bioinformatics questions
Funding: a first pot and timeline…
Data in terrible conditions but anyone can do it
Social media reach The Atlantic, Economist
We compare match lengths, and MinION allows long matches
EXPLAIN AXES: precision improves rapidly
EXPLAIN AXES: a partial REFERENCE would work, too
MORE FUNDING. SO simple a kid could do it? Yes
The challenge I set myself: OK, it’s a simple experiment. Can I build a test simple enough that a child can understand it?
SOCIAL MEDIA
Funding: NANOPORE
Data from one time and place can and should be useful elsewhere
Flash a bit of proper genomics
Single reads match whole genes – meat & drink
EXPLAIN AXES postdoc-years PAPER ACCEPTED
FUNDED tailor made for health research/application
need to mention it somewhere because of:
strategic links
Building the ‘momentum narrative’
Other related stuff; VIPs etc
Plant health and emerging threats
A connected world means new diseases can spread globally, fast.
Lay out the problem, e.g. opportunities – look! Health! Ascertainment bias! Field-portable! etc
Funding: yet another pot, this one also bigger.
Software etc to improve UI (ahem)
HPCs to apps: Exponential data, linear understanding.
Pause – to recap
This is important because it’s where we tie it together and show my contribution:
Portable, mass sequencing is really here
Massive potential for de novo genomics; phylogenomics
But while we’re accumulating information at an exponential rate, we’re integrating it linearly, in essence
… where are we going?
Nature is cruel: more data only muddies the water
Bifurcating phylogenies are decreasingly useful and complicated to get
‘Comparative’ genomics actually uses relatively few datapoints (e.g. Encode…)
In part because most phylogenetic methods require variations on homology assumption
Here’s a common framework for all these studies
How to infer – sounds like a nightmare
Many of the edges in this network are really there already
Shifting paradigms, making linking easier
Explicitly model phylogeny, synteny and identity
Edge support reflects evidence; deviations from neutrality reflect hypotheses/models/phenomena
Any nodes connecting to an identity edge are considered completely connected
Maximum # edges ~n (2n-1)/2
Digraphs ~n!!
Possible ancestors from one locus on n taxa are essentially an inverse function of when they coalesce (can have m generations of n ancestors until an event where n(m) < n(t))
EXAMPLES
Gene duplication e.g. paralogue in animal
Tetraploid formed then secondary diploidization, e.g. plant
Inversion in a genome
Unlinked loci (e.g. bacterial plasmids) and HGT.
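The examples above (paralogues, polyploidy, inversions, plasmids and HGT) all become edge patterns in one structure. A minimal sketch of how a three-colour graph might be stored in memory; the class, method names and node labels are hypothetical, loosely following the a–z labels on the slide.

```python
# Sketch of a three-colour graph: nodes are extant or inferred, and each
# undirected edge carries one of three colours (phylogeny, synteny,
# identity), so one data structure holds all three relation types.
PHYLOGENY, SYNTENY, IDENTITY = "phylogeny", "synteny", "identity"

class ColouredGraph:
    def __init__(self):
        self.edges = {PHYLOGENY: set(), SYNTENY: set(), IDENTITY: set()}

    def add_edge(self, u, v, colour):
        # frozenset makes the edge undirected and hashable
        self.edges[colour].add(frozenset((u, v)))

    def neighbours(self, node, colour):
        """All nodes joined to `node` by an edge of the given colour."""
        return {n for e in self.edges[colour] if node in e
                  for n in e if n != node}

g = ColouredGraph()
g.add_edge("a", "x", PHYLOGENY)   # extant 'a' descends from inferred 'x'
g.add_edge("b", "x", PHYLOGENY)
g.add_edge("a", "b", SYNTENY)     # loci physically linked on one chromosome
g.add_edge("a", "b", IDENTITY)    # loci sampled from the same organism
print(g.neighbours("a", PHYLOGENY))  # {'x'}
```

Edge support could then carry evidence weights, with deviations from neutral expectation flagging hypotheses, as the notes above suggest.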
Formally linking datasets and models is inferring the network of life
Shifts the job for bioinformatics from something it’s good at – sophisticated, incremental analysis –
To something computers in general are great at: linking elements
In this case informatics doesn’t enable research, it is the process of inference
It’s relatively easy to write a new standalone app to do x, or analyse some big dataset
Reproducibility and scaling-up science mean we must work harder on the links
Informatics as inference.
The lonely astronomers.