Short lecture covering my recent work on real-time phylogenomics, its implications for bioinformatics research, and future directions in genomic/phylogenetic modelling that explicitly account for phylogeny, synteny and identity through coloured graphs.
University of Reading, 2nd August 2017
Inference and informatics in a 'sequenced' world
1. Informatics and inference in a sequenced world
Dr. Joe Parker
Early Career Research Fellow (Phylogenomics)
Royal Botanic Gardens, Kew
@lonelyjoeparker:
3. Incredible times for bioscience
Images – Wikimedia commons CC BY-SA
(clockwise from top left: Jeroen Rouwkema, @aGastya, author’s own, @RE73)
4. Step back: molecular evolution
“Horizontal gene transfer occurs x times more frequently in these lineages, because of
this biology”
“Convergent evolution is rare in most genes, in most organisms, but y times
greater in these gene families …because of this biology”
“New chromosomes are created & destroyed at rates z, q in this reproductive
strategy …because of this biology”
6. Snowdonia, HelloWorld & ‘tent-seq’
Arabidopsis thaliana & A. lyrata
Congeneric species;
Reference genomes available
Field-sequenced (MinION) &
Lab-sequenced (Illumina™)
Orthogonal BLAST:
4 sample*sequencer combinations
Compare TRUE & FALSE rates for
varying ID statistic cutoffs
7. Tasty pics
Conditions
100% humidity; 6-13ºC
Essential kit
800 W generator
3x laptops
Centrifuge
Waterbath
Polystyrene boxes (lots)
Kettle(…!)
Yield
>400Mbp data in three days;
A. thaliana ~2.01x coverage
8. Field- vs. lab-sequenced sample ID
Match individual reads to each
reference with BLAST
Compare match lengths in
TRUE and FALSE cases
‘Length bias’ ID stat:
length(TRUE) - length(FALSE)
Compare TRUE & FALSE rates as
length bias cutoff varies
MiSeq (lab)
MinION (field)
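The length-bias statistic above can be sketched in a few lines of Python. This is a minimal sketch: the per-read best-hit match lengths would really be parsed from BLAST tabular output, so the synthetic numbers below are placeholders for illustration only.

```python
# Sketch of the 'length bias' read-ID statistic: for each read, compare
# the best BLAST match length against the correct (TRUE) and incorrect
# (FALSE) reference, then sweep a cutoff on the difference.
import random

random.seed(42)

# Hypothetical best-hit match lengths (bp) for 1,000 reads against each reference
length_true = [random.gauss(400, 120) for _ in range(1000)]
length_false = [random.gauss(250, 120) for _ in range(1000)]

def length_bias(lt, lf):
    """Length-bias ID statistic for one read: length(TRUE) - length(FALSE)."""
    return lt - lf

biases = [length_bias(lt, lf) for lt, lf in zip(length_true, length_false)]

# Sweep the cutoff: a read is assigned to the TRUE reference when bias > cutoff
for cutoff in (0, 50, 100, 200):
    assigned_true = sum(1 for b in biases if b > cutoff)
    print(f"cutoff={cutoff:>4} bp: {assigned_true / len(biases):.1%} of reads pass")
```

Raising the cutoff trades sensitivity for precision, which is exactly the TRUE/FALSE-rate curve described above.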
9. Bitty data (1): partial queries
Subsample MinION output
Repeat ID pipeline, record
mean ID stat (length bias)
Replicates: N = 30
Simulate from 10⁰ – 10⁴ reads
(≈instant → hours)
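The subsampling experiment might look like this in outline. It is a sketch under stated assumptions: the pool of per-read ID statistics and its size are invented placeholders, standing in for the real BLAST-derived values.

```python
# Sketch of the partial-query experiment: draw subsamples of
# 10^0 .. 10^4 reads (N = 30 replicates each) and record the
# mean ID statistic for each subsample size.
import random
from statistics import mean

random.seed(1)
# Placeholder per-read ID statistics standing in for real pipeline output
all_read_stats = [random.gauss(150, 80) for _ in range(20000)]

N_REPLICATES = 30
for exponent in range(5):            # 10^0 .. 10^4 reads
    n_reads = 10 ** exponent
    replicate_means = [
        mean(random.sample(all_read_stats, n_reads))
        for _ in range(N_REPLICATES)
    ]
    print(f"n={n_reads:>6}: mean ID stat = {mean(replicate_means):7.2f}")
```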
10. Bitty data (2): partial references
Take reference genome at
high contiguity
Fragment randomly to
target (low) contiguity
Repeat read identification
using fragmented DB
Simulate N50 ≈1,000bp
to N50 ≈ 10Mbp
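The reference-fragmentation step could be simulated as below. This is an illustrative sketch, not the published pipeline: the `fragment` helper, the sequence length and the target N50 are assumptions chosen to keep the example small.

```python
# Sketch: randomly fragment a contiguous reference until a target
# (low) N50 is reached, mimicking a partial/fragmented BLAST database.
import random

random.seed(7)

def n50(lengths):
    """Length of the contig at which half the total assembly length is reached."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

def fragment(seq_len, target_n50):
    """Randomly split a sequence of seq_len bp until N50 <= target_n50."""
    pieces = [seq_len]
    while n50(pieces) > target_n50:
        pieces.sort()
        longest = pieces.pop()                 # break the longest piece...
        cut = random.randint(1, longest - 1)   # ...at a random internal point
        pieces += [cut, longest - cut]
    return pieces

pieces = fragment(seq_len=1_000_000, target_n50=10_000)  # 1 Mbp -> N50 <= 10 kbp
print(len(pieces), "fragments, N50 =", n50(pieces))
```

Repeating the read-identification step against such fragmented databases shows how far reference contiguity can degrade before ID accuracy suffers.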
11. Keeping it simple: Kew Science Festival
Six species: whole genome-skim samples with MinION in preparation
Build BLAST DBs from skimmed data
Select ‘unknown’ (blinded) sample, extract DNA and resequence in real-time
Compare to partial DBs in six-way BLAST competition
Live ID?
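The six-way BLAST competition amounts to a running vote over streamed reads. A minimal sketch, assuming each read's best bit-score per skim database is already available; the species names, scores and stream are placeholders.

```python
# Sketch of live sample ID: as reads stream off the sequencer, each
# read's best-scoring skim database wins a vote, and the running tally
# identifies the blinded sample in real time.
from collections import Counter

def best_hit(read_scores):
    """Return the species whose skim DB gave the top bit-score for a read."""
    return max(read_scores, key=read_scores.get)

tally = Counter()
# Placeholder stream: each dict maps species -> best BLAST bit-score for one read
stream = [
    {"sp_A": 50, "sp_B": 90, "sp_C": 40, "sp_D": 30, "sp_E": 20, "sp_F": 10},
    {"sp_A": 45, "sp_B": 88, "sp_C": 60, "sp_D": 35, "sp_E": 25, "sp_F": 15},
    {"sp_A": 70, "sp_B": 95, "sp_C": 50, "sp_D": 40, "sp_E": 30, "sp_F": 20},
]
for read_scores in stream:
    tally[best_hit(read_scores)] += 1
    leader, votes = tally.most_common(1)[0]
    print(f"after {sum(tally.values())} reads: leading ID = {leader} ({votes} votes)")
```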
12. de novo genome assembly

Data                                MiSeq only   MiSeq + MinION
Assembler                           ABySS        hybridSPAdes
Illumina reads, 300bp paired-end    8,033,488    8,033,488
Illumina data (yield)               2,418 Mbp    2,418 Mbp
MinION reads (R7.3 + R9 kits,
  N50 ~ 4,410bp)                    -            96,845
MinION data (yield)                 -            240 Mbp
Approx. coverage                    19.49x       19.49x + 2.01x

Assembly key statistics:
# contigs                           24,999       10,644
Longest contig                      90 Kbp       414 Kbp
N50 contiguity                      7,853 bp     48,730 bp
Fraction of reference genome (%)    82           88
Errors, per 100 kbp: # N’s          1.7          5.4
  # mismatches                      518          588
  # indels                          120          130
Largest alignment                   76,935 bp    264,039 bp

CEGMA gene completeness estimate:
# genes                             219 of 248   245 of 248
% genes                             88%          99%
18. Three-colour graphs: phylogeny, synteny & identity
Key:
Extant node
Inferred node
Synteny edge (physical connection)
Phylogeny edge (evolutionary connection)
Identity edge (organismal connection)
[Figure: example three-colour graph over nodes a, b, c, d, e, x, y, z]
21. Thanks, funders, contacts and questions
Oxford Nanopore Technologies Ltd.:
Dan Turner, Richard Ronan, Gerrard Coyne
RBG Kew:
Alexander S.T. Papadopulos (@metallophyte)
Andrew Helmstetter (@ajhelmstetter)
Dion Devey, Robyn Cowan, Tim Wilkinson, Stephen Dodsworth, Pepijn Kooij, Felix Forest, Bill Baker, Jan T.
Kim, Jenny Williams, Abigail Barker, Mark Lee, Jim Clarkson, Mike Chester, Ester Gaya, Lisa Pokorny, Laszlo
Csiba, Paul Wilkin, Richard Buggs, Mike Fay, Mark Chase, Ilia Leitch
QMUL
Laura Kelly, Kalina Davies, Steve Rossiter
Oxford
Aris Katzourakis, Oli Pybus, Jayna Raghwani
Others
Forest Research: Daegan Inward, Katy Reed
Dstl: Claire Lonstale, James Taylor
Birmingham: Nick Loman, Josh Quick
U. Utah: Bryn Dentinger
Imperial: James Rosindell
This research was conducted in the Sackler Phylogenomics Laboratory and was supported by the Calleva Foundation Phylogenomic Research Programme and the Sackler Trust.
@lonelyjoeparker:
joe.parker@kew.org
Editor’s notes
Welcome, thanks, menu
Formal introduction and thanks;
Lay out the menu / journey
I’ll mainly be talking about work in the last 2.5 yrs since taking up my ECRF at Kew
Wide range of taxa, techniques and questions.
Enough to set my scene without taking ages, confusing/losing audience, or giving the impression I’m just a tools-bot.
Start of… Incredible times
Traditional to start bioinformatics talks with a slide about Moore’s law, sequencing costs, and the data deluge
Actually this is a fantastic age to be living in, ever bigger analyses – and I’ll talk a lot about “real-time” phylogenomics
But why? What are we attempting to discover?
We need enough data to turn observations into empirical comparisons, and comparisons into models and laws
We know a lot about evolutionary mechanisms
And a lot about (a handful of genomes)
What we know tells us “it’s complicated”
Most genes don’t have simple orthologues, horizontal transfer abounds, etc.
But we don’t, really, have an empirical understanding of how they fit together, e.g.:
- ”horizontal gene transfer occurs x times more frequently in these lineages, because of this biology”
- adaptive molecular convergence is rare in most genes, in most organisms, but y times greater in these gene families because of this biology
- new chromosomes are created (by duplication, endogenisation, polyploidy) and destroyed (by diploidization) at z rates in this reproductive strategy because of biology
Portable sequencing: also long reads and real-time
Direct, explicit, orthogonal test – and can it work?
Picture of experimental design
Outline of the study
In terms of bioinformatics questions
Funding: a first pot and timeline…
Data in terrible conditions but anyone can do it
Social media reach The Atlantic, Economist
We compare match lengths, and MinION allows long matches
EXPLAIN AXES: precision improves rapidly
EXPLAIN AXES: a partial REFERENCE would work, too
MORE FUNDING. SO simple a kid could do it? Yes
The challenge I set myself: OK, it’s a simple experiment. Can I build a test simple enough that a child can understand it?
SOCIAL MEDIA
Funding: NANOPORE
Data from one time and place can and should be useful elsewhere
Flash a bit of proper genomics
Single reads match whole genes – meat & drink
EXPLAIN AXES postdoc-years PAPER ACCEPTED
FUNDED tailor made for health research/application
need to mention it somewhere because of:
strategic links
Building the ‘momentum narrative’
Other related stuff; VIPs etc
Plant health and emerging threats
A connected world means new diseases can spread globally, fast.
Lay out the problem, e.g. opportunities – look! Health! Ascertainment bias! Field-portable! etc
Funding: yet another pot, this one also bigger.
Software etc to improve UI (ahem)
HPCs to apps: Exponential data, linear understanding.
Pause – to recap
This is important because it’s where we tie it together and show my contribution:
Portable, mass sequencing is really here
Massive potential for de novo genomics; phylogenomics
But while we’re accumulating information at an exponential rate, we’re integrating it linearly, in essence
… where are we going?
Nature is cruel: more data only muddies the water
Bifurcating phylogenies are decreasingly useful and complicated to get
‘Comparative’ genomics actually uses relatively few datapoints (e.g. Encode…)
In part because most phylogenetic methods require variations on homology assumption
Here’s a common framework for all these studies
How to infer – sounds like a nightmare
Many of the edges in this network are really there already
Shifting paradigms, making linking easier
Explicitly model phylogeny, synteny and identity
Edge support reflects evidence; deviations from neutrality reflect hypotheses/models/phenomena
Any nodes connecting to an identity edge are considered completely connected
Maximum # edges ~n (2n-1)/2
Digraphs ~n!!
Possible ancestors from one locus on n taxa are essentially an inverse function of when they coalesce (can have m generations of n ancestors until an event where n(m) < n(t))
EXAMPLES
Gene duplication e.g. paralogue in animal
Tetraploid formed then secondary diploidization, e.g. plant
Inversion in a genome
Unlinked loci (e.g. bacterial plasmids) and HGT.
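The examples above (paralogues, polyploidy, inversions, plasmids and HGT) all become edge patterns in one structure. A minimal sketch of how a three-colour graph might be stored in memory; the class, method names and node labels are hypothetical, loosely following the a–z labels on the slide.

```python
# Sketch of a three-colour graph: nodes are extant or inferred, and each
# undirected edge carries one of three colours (phylogeny, synteny,
# identity), so one data structure holds all three relation types.
PHYLOGENY, SYNTENY, IDENTITY = "phylogeny", "synteny", "identity"

class ColouredGraph:
    def __init__(self):
        self.edges = {PHYLOGENY: set(), SYNTENY: set(), IDENTITY: set()}

    def add_edge(self, u, v, colour):
        # frozenset makes the edge undirected and hashable
        self.edges[colour].add(frozenset((u, v)))

    def neighbours(self, node, colour):
        """All nodes joined to `node` by an edge of the given colour."""
        return {n for e in self.edges[colour] if node in e
                  for n in e if n != node}

g = ColouredGraph()
g.add_edge("a", "x", PHYLOGENY)   # extant 'a' descends from inferred 'x'
g.add_edge("b", "x", PHYLOGENY)
g.add_edge("a", "b", SYNTENY)     # loci physically linked on one chromosome
g.add_edge("a", "b", IDENTITY)    # loci sampled from the same organism
print(g.neighbours("a", PHYLOGENY))  # {'x'}
```

Edge support could then carry evidence weights, with deviations from neutral expectation flagging hypotheses, as the notes above suggest.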
Formally linking datasets and models is inferring the network of life
Shifts the job for bioinformatics from something it’s good at – sophisticated, incremental analysis –
To something computers in general are great at: linking elements
In this case informatics doesn’t enable research, it is the process of inference
It’s relatively easy to write a new standalone app to do x, or analyse some big dataset
Reproducibility and scaling-up science mean we must work harder on the links
Informatics as inference.
The lonely astronomers.