Talk by Will Trimble of Argonne National Laboratory on April 29, 2014, at UIC's department of Ecology & Evolution on visualizing and interpreting the redundancy spectrum of long kmers in high-throughput sequence data.
Sequencing run grief counseling: counting kmers at MG-RAST
1. Sequencing run grief counseling: counting kmers at MG-RAST
Will Trimble, metagenomic annotation group, Argonne National Laboratory
April 29, 2014, UIC
3. Apology: I speak biology with an accent
• I spent six years in dark rooms with lasers.
• Now I use computers to analyze high-throughput sequence data.
• I introduce myself as an applied mathematician.
• Finding scoring functions to use ambiguous data to answer life’s persistent questions.
• Shoveling data from the data-producing machine into the data-consuming furnace.
5. Outline
• Sequences are different (math)
• Sequencing is like photography (pictures)
• Sequencing is beautiful: thumbnailpolish (micrographs)
• How diverse are my shotgun sequences? nonpareil-k (graphs), kmerspectrumanalyzer (graphs)
9. Sequences are different
• Sequencing produces sequences. Sequences are qualitatively different from all other data types.
Low-throughput categorical data: categories are sound.
Instrument readings, spectra, micrographs: not categorical.
High-throughput sequence data: categories uncertain.
@HWI-ST1035:125:D1K4CACXX:8:1101:1168
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII
@HWI-ST1035:125:D1K4CACXX:8:1101:1190
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI
@HWI-ST1035:125:D1K4CACXX:8:1101:1339
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ
Scales shown on the slide: 10^0-10^2, 10^2-10^7, 10^12-10^80
11. Forward-backward problem
Experiment design
Sequencing run
Sequence data
Assembly, Annotation (SEED, M5NR)
489  Sensory box/GGDEF family
470  hyphothetical protein
241  Co-Zn-Cd resistance CzcA
202  Transposase
200  homocysteine methyltransferase (EC 2.1.1.13)
175  cyclase/phosphodiesterase
164  Long-chain-fatty-acid--CoA ligase (EC 6.2.1.3)
156  Methyl-accepting chemotaxis protein
149  ABC transporter, ATP-binding protein
147  Pb, Cd, Zn, and Hg transporting ATPase (EC 3.6.3.3)
133  Ferrous iron transport protein B
Scales shown on the slide: 10^12, 10^3-10^5, 10^0-10^1
So we reduce sequence data to categorical data.
12. Sequences are different
• Sequencing produces sequences. Sequences are qualitatively different from all other data types.
• Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.
14. Searching
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG
GTCATCGATAGCAGGATAATAATACAGTA
Who wrote this line?
“be regarded as unproved until it has been
checked against more exact results”
Same answer for both puzzles: you go to this website…
15. Searching
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG
GTCATCGATAGCAGGATAATAATACAGTA
Who wrote this line?
“be regarded as unproved until it has been
checked against more exact results”
How long do reads need to be to recognize them?
How long do phrases need to be to recognize them?
16. How long do reads need to be?
Information (Shannon, 1949, BSTJ) is a quantitative summary of the uncertainty of a probability distribution (a model of the data).
Profound applicability in machine learning and probabilistic modeling.
$$H = \sum_i p_i \log_2\left(\frac{1}{p_i}\right)$$
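The entropy formula on this slide can be tried out directly. The following is a minimal sketch, not code from the talk; the function names `shannon_entropy` and `entropy_of_symbols` are made up for illustration, and estimating probabilities from raw symbol frequencies is an assumption that ignores any context between symbols.

```python
import math
from collections import Counter

def shannon_entropy(probabilities):
    """Return H = sum_i p_i * log2(1 / p_i) in bits.

    Terms with p_i == 0 are skipped because they contribute nothing.
    """
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

def entropy_of_symbols(symbols):
    """Estimate bits per symbol from observed symbol frequencies (hypothetical helper)."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return shannon_entropy(c / total for c in counts.values())

# Four equally likely bases carry 2 bits per base.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
# A skewed base composition carries less than 2 bits per base.
print(entropy_of_symbols("ACGTACGTAAAA"))
```

A uniform distribution over the four bases gives the 2 bits per base used later in the talk; any skew in composition lowers the figure.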
17. How long do phrases need to be?
Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line.
for n in 1..10
    type the first n words into google books, quoted.
    break if google identifies your book.
18. How long do phrases need to be?
• Information content of English words: H_word ca. 12 bits per word.
• Size of google books? Big libraries have a few 10^7 books, each one has 10^5 indexed words, so a database size of 10^12 words. log2(database size) = log2(10^12) = 39.9, or about 40 bits.
• So we expect on average 40 / 12 = 3.3, rounded up to 4, words to be enough to find a phrase in google’s index. Try it.
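The arithmetic on this slide can be checked in a few lines. This is only a sketch of the back-of-envelope estimate; `units_needed` is a hypothetical helper, and the 12 bits per word and 10^12-word index are the slide's rough figures, not measurements.

```python
import math

def units_needed(database_size, bits_per_unit):
    """Bits needed to address one item in the database, and how many
    units (words, bases, ...) of the given information content supply them."""
    bits_required = math.log2(database_size)
    return bits_required, math.ceil(bits_required / bits_per_unit)

# Slide's figures: about 10^12 indexed words, about 12 bits per English word.
bits, words = units_needed(1e12, 12)
print(f"{bits:.1f} bits -> {words} words")  # prints "39.9 bits -> 4 words"
```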
20. How long do phrases need to be?
Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line.
for n in 1..10
    type the first n words into google books, quoted.
    break if google identifies your book.
Usually nails your source in four words.
22. How long do reads need to be?
• Maximum information content of ℓ base pairs: H_read = 2ℓ bits per length-ℓ sequence.
• Most long kmers are distinct: for a genome of size G (ca. 10^10 bp), log2(G) = log2(10^10) = 33.2, or about 34 bits.
• So we expect that when 2ℓ > 34 bits, we should be able to place any sequence.
• That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
Short sequences end up being very distinctive, even fingerprint-like.
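The same estimate can be phrased as "how long must a kmer be before there are more possible kmers than positions in the genome?" The sketch below assumes an idealized genome with no repeats and uniform base composition, which real genomes are not; `min_kmer_length` is a hypothetical name.

```python
def min_kmer_length(genome_size):
    """Smallest k for which the number of possible kmers (4**k)
    exceeds the number of positions in a genome of the given size."""
    k = 1
    while 4 ** k <= genome_size:
        k += 1
    return k

# The slide's example: a genome of about 10^10 bp.
print(min_kmer_length(10 ** 10))  # prints 17
```

In practice repeats and sequencing errors push the useful length well above this lower bound.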
24. The data deluge
• There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of 10s of Gbytes of sequence data at once.
• The data has outgrown some favorite algorithms from the 1990s (BLAST).
26. Rarefaction of a photograph
A camera records the number of photons that land on each of millions of pixels.
A sequencer records the number of sequences that land in each possible sequence.
I actually think of a sequencer like a multichannel genetic spectrometer.
30. The genetic spectrometer
With my 10^12-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees.
Species diversity
Gene diversity
Sequence diversity
ATCGCGAAAAGTCCC 2
AAAAAAAAAAAAAAA 459
AAAAAAAAAAAAAAC 71
AAAATAAAAAAAATA 1
AAAAAAAAAAAAAAG 36
ACATGAAAAACAACT 1
AAAAAAAAAAAAAAT 23
AAAAAAAAAAAAACA 95
GTAGGAAAAGCCCAC 1
AAAAAAAAAAAAACC 7
AAAAAAAAAAAAACG 8
AAAAAAAAAAAAACT 9
AAAAAAAAAAAAAGA 36
AACAAGAAAAACAAA 1
AAAAAAAAAAAAAGC 10
AAATAAAAAAAATAG 1
AACAGAAAAAACACG 1
AAAAAAAAAAAAAGG 2
AAAAAAAAAAAAAGT 6
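A table of kmers and counts like the one above can be produced, for small inputs, with a dictionary of counts. This is a toy sketch, not the counter used in the talk: `count_kmers` is a hypothetical function, the reads are invented, reverse complements are not merged, and an in-memory `Counter` would not scale to billions of distinct kmers.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Invented example reads, for illustration only.
reads = ["AAAAAAAAAAAAAAAAC", "AAAAAAAAAAAAAACGT"]
for kmer, count in count_kmers(reads, 15).most_common(3):
    print(kmer, count)
```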
31. Rarefaction of a photograph
Sampling only a few sequences is like exposing the camera for too short a time. Not enough photons to make out the picture.
48. Image / sequencing analogy
Analogy to sequencing:
• Most of field is black
• Bright objects have halos
• Contains camera artifacts
• We can’t know what we didn’t see without longer exposures.
49. Opportunity cost of deep sequencing
This took two weeks to acquire on a one-of-a-kind telescope.
Consider the opportunity cost of studying a single sample for two weeks.
STScI did only four long exposures like this in 23 years.
50. Image / sequencing analogy
Analogy to sequencing:
• Most of field is black
• Bright objects have halos
• Contains camera artifacts
• We can’t know what we didn’t see without longer exposures.
Sampling effort interacts with sequence diversity to produce a “horizon”.
Inferences are supported on the bright parts first, on the dim parts only at higher depth.
Not all the sequences, abundant or rare, are real.
Dim targets come at great cost in sample number.
52. How much novelty is in my dataset?
How many sequences do you need to see before you start seeing the same ones over and over again?
Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog.
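One way to make this question concrete is to walk through the observations in order, keep a catalog of what has been seen, and look for the first stretch in which most observations are repeats. The sketch below is an illustration only, not the Nonpareil or Nonpareil-k method; both function names and the fixed-window rule are assumptions.

```python
def novelty_flags(observations):
    """For each observation, in order, record whether it was already in the catalog."""
    catalog = set()
    seen_before = []
    for item in observations:
        seen_before.append(item in catalog)
        catalog.add(item)
    return seen_before

def first_mostly_redundant_window(observations, window):
    """Start index of the first non-overlapping window in which more than half
    of the observations had been seen before; None if that never happens."""
    flags = novelty_flags(observations)
    for start in range(0, len(flags) - window + 1, window):
        if sum(flags[start:start + window]) * 2 > window:
            return start
    return None

# Toy stream: four distinct items, then repeats of the same four.
print(first_mostly_redundant_window(list("abcdabcdabcd"), 4))  # prints 4
```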
53. How much novelty is in my dataset?
Nonpareil: Luis Rodriguez-Rojas and Kostas Konstantinidis developed a subset-against-all alignment approach to address the question “how quickly do we encounter novelty in shotgun datasets?”
Nonpareil-k: I found a way to answer almost the same question 300x faster.
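54. How much novelty is in my dataset? Nonpareil-k
$$\mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 1)}{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 0)}$$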
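The formula on this slide can be evaluated numerically. The sketch below rests on several assumptions that the slide does not spell out: that r_i is a kmer abundance, n_i the number of distinct kmers with that abundance, ε a subsampling fraction, and that Poisscdf(λ, k) means P(X ≤ k) for a Poisson variable with mean λ. Under that reading each term is the chance of seeing a kmer at least twice given that it is seen at least once. This is not the Nonpareil-k source code.

```python
import math

def poisson_cdf(lam, k):
    """P(X <= k) for X ~ Poisson(lam); assumed meaning of Poisscdf(lam, k)."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))

def nonunique_fraction(eps, r, n):
    """Abundance-weighted fraction of observations expected to be non-unique
    when the data are subsampled to a fraction eps (assumed interpretation)."""
    total = sum(ni * ri for ni, ri in zip(n, r))
    result = 0.0
    for ni, ri in zip(n, r):
        lam = eps * ri
        seen_at_least_once = 1.0 - poisson_cdf(lam, 0)
        seen_at_least_twice = 1.0 - poisson_cdf(lam, 1)
        if seen_at_least_once > 0:
            result += (ni * ri / total) * (seen_at_least_twice / seen_at_least_once)
    return result

# Invented spectrum: 100 kmers seen once, 10 kmers seen ten times.
for eps in (0.01, 0.1, 1.0, 10.0):
    print(eps, nonunique_fraction(eps, r=[1, 10], n=[100, 10]))
```

For very small ε the subtraction from 1 loses precision; a production version would want a more careful tail computation.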
56. Nonpareil: model of sequence coverage
Georgia Tech
Nonpareil-k: kmer rarefaction
Argonne + Georgia Tech
summary of sequence diversity
57. Nonpareil-k: stratify datasets by coverage distribution
Most of dataset likely contained in assembly.
Assembly is likely to miss or attenuate the large unique fraction of dataset.
65. Redundancy is good
• OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
• OMG! I see this sequence 10 million times.
• OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
• Error correction / clustering / assembly works on subsets of the data with high sequence depth.
Abundance-based inferences are better in the high-abundance part of the data.
66. But I want to sequence everything!
Ok, we can count kmers in everything too.
kmerspectrumanalyzer summarizes distribution, estimates genome size, coverage depth, … but what it’s really good at
67. Kmers show problems in datasets
• Amok PCR: seemingly random sequences
• Amok MDA: 10 Gbases of sequence, one gene
• PCR duplicates: entire sequencing run was 50x exact- and near-exact duplicate reads
• Unusually high error rate: indicated by low fraction of “solid” kmers (for isolate genomes)
• Contaminated samples: 95% E. coli, 5% E. faecalis
• Many datasets have as much as 5-45% of the sequence yield in adapters.
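Several of the diagnostics above come straight from the kmer abundance spectrum. The sketch below shows two such summaries on a toy table of counts. It is not kmerspectrumanalyzer's code: the function names are hypothetical, and treating "solid" as "seen at least `min_count` times", counted over distinct kmers, is an assumption.

```python
from collections import Counter

def abundance_spectrum(kmer_counts):
    """Map each abundance to the number of distinct kmers with that abundance."""
    return Counter(kmer_counts.values())

def solid_kmer_fraction(kmer_counts, min_count=2):
    """Fraction of distinct kmers observed at least min_count times
    (assumed working definition of 'solid')."""
    if not kmer_counts:
        return 0.0
    solid = sum(1 for c in kmer_counts.values() if c >= min_count)
    return solid / len(kmer_counts)

# Invented counts, for illustration only.
toy_counts = {"AAAC": 1, "AACG": 3, "ACGT": 5, "CGTT": 1}
print(sorted(abundance_spectrum(toy_counts).items()))
print(solid_kmer_fraction(toy_counts))
```

A low solid fraction in an isolate genome, where most true kmers should be seen many times, is the kind of signal the slide associates with a high error rate.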
68. Generalities from the kmer counting mines
• FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find)
• Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance (but I’m not about to write a paper fitting kmers to the Yule, Mandelbrot, Levy, or Pareto distributions)
70. HMP / quantile norm / euclidean / colored by alpha
MG-RAST API, R package matR
Hey kid, you want some unlabeled data?
Kevin Keegan, Argonne National Laboratory
I’m not sure how to do science with an unlabeled pile of datasets.
71. Figure 2a
Hey kid, you want some pretty ordinations?
Kevin Keegan, Argonne National Laboratory
72. Observation: Most scientists seem to be self-taught in computing.
Observation: Most scientists waste a lot of time using computers inefficiently.
Rachel and I volunteer with
73. We teach scientists how to get more done
Woods Hole, Tufts, U. Chicago, U. Chicago, UIC
75. Metagenomic annotation group
Folker Meyer, Elizabeth Glass, Narayan Desai, Kevin Keegan, Adina Howe, Wolfgang Gerlach, Wei Tang, Travis Harrison, Jared Bishof, Dan Braithwaite, Hunter Matthews, Sarah Owens
Formerly of Yale: Howard Ochman, David Williams
Georgia Tech: Kostas Konstantinidis, Luis Rodriguez-Rojas