Sequencing run grief counseling: counting kmers at MG-RAST

Will Trimble
Metagenomic annotation group
Argonne National Laboratory
April 29, 2014
UIC

Apology: I speak biology with an accent
•  I spent six years in dark rooms with lasers.
•  Now I use computers to analyze high-throughput sequence data.
•  I introduce myself as an applied mathematician.
•  Finding scoring functions to use ambiguous data to answer life’s persistent questions.
•  Shoveling data from the data-producing machine into the data-consuming furnace.

Outline
•  Sequences are different (math)
•  Sequencing is like photography (pictures)
•  Sequencing is beautiful: thumbnailpolish (micrographs)
•  How diverse are my shotgun sequences? nonpareil-k (graphs), kmerspectrumanalyzer (graphs)

Sequences are different
•  Sequencing produces sequences. Sequences are qualitatively different from all other data types.

Low-throughput categorical data: categories are sound.
Instrument readings, spectra, micrographs: not categorical.
High-throughput sequence data: categories uncertain.

@HWI-ST1035:125:D1K4CACXX:8:1101:1168
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII
@HWI-ST1035:125:D1K4CACXX:8:1101:1190
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI
@HWI-ST1035:125:D1K4CACXX:8:1101:1339
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ

Scale labels on the slide: 10^0 to 10^2, 10^2 to 10^7, 10^12 to 10^80

Forward-backward problem
Experiment design, Sequencing run, Sequence data, Assembly, Annotation (SEED, M5NR)

489  Sensory box/GGDEF family
470  hyphothetical protein
241  Co-Zn-Cd resistance CzcA
202  Transposase
200  homocysteine methyltransferase (EC 2.1.1.13)
175  cyclase/phosphodiesterase
164  Long-chain-fatty-acid--CoA ligase (EC 6.2.1.3)
156  Methyl-accepting chemotaxis protein
149  ABC transporter, ATP-binding protein
147  Pb, Cd, Zn, and Hg transporting ATPase (EC 3.6.3.3)
133  Ferrous iron transport protein B

Scale labels on the slide: 10^12, 10^3 to 10^5, 10^0 to 10^1

So we reduce sequence data to categorical data.

Sequences are different
•  Sequencing produces sequences. Sequences are qualitatively different from all other data types.
•  Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.

What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG
GTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”

Searching
Same answer for both puzzles: you go to this website…

How long do reads need to be to recognize them?
How long do phrases need to be to recognize them?

How long do reads need to be?
Information (Shannon, 1948, BSTJ) is a quantitative summary of the uncertainty of a probability distribution, that is, of a model of the data.
Profound applicability in machine learning and probabilistic modeling.

$$H = \sum_i p_i \log_2 \left( \frac{1}{p_i} \right)$$
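The entropy formula on this slide can be tried directly. The following is a minimal sketch in Python; the function names `shannon_entropy` and `empirical_entropy` are illustrative and do not come from the talk.

```python
import math
from collections import Counter

def shannon_entropy(probabilities):
    """Return H = sum_i p_i * log2(1 / p_i) in bits.

    Outcomes with zero probability contribute nothing and are skipped.
    """
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

def empirical_entropy(symbols):
    """Estimate entropy from the observed frequencies of the symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return shannon_entropy(c / total for c in counts.values())

# Four equally likely bases carry 2 bits each.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
# A homopolymer carries no information per base.
print(empirical_entropy("AAAAAAAA"))
```

Uniform base frequencies give the 2 bits per base used in the read-length estimate later in the talk; any skew in composition gives less.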

How long do phrases need to be?
Exercise: Pick a book from your bookshelf. Pick an arbitrary page and an arbitrary line.

for n in 1..10
    type the first n words into google books, quoted.
    break if google identifies your book.

•  Information content of English words: $H_{\mathrm{word}} \approx 12$ bits per word.
•  Size of Google Books? Big libraries have a few $10^7$ books, and each one has $10^5$ indexed words, so a database size of $10^{12}$ words.
   $\log_2(\text{database size}) = \log_2(10^{12}) = 39.9 \approx 40$ bits
•  So we expect on average $40 / 12 = 3.3$, rounded up to 4 words, to be enough to find a phrase in Google’s index.
Try it. Usually nails your source in four words.
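The arithmetic behind the four-word estimate can be sketched in a few lines of Python. The library and book sizes are the order-of-magnitude figures from the slide; the helper name `units_needed` is made up for this illustration.

```python
import math

def units_needed(database_size, bits_per_unit):
    """Return (bits needed to address the database, whole units needed).

    A unit is a word or a base; the count is rounded up because a
    fraction of a word cannot be typed into a search box.
    """
    bits_to_address = math.log2(database_size)
    return bits_to_address, math.ceil(bits_to_address / bits_per_unit)

books = 10**7            # "a few 10^7 books" in a big library (slide estimate)
words_per_book = 10**5   # indexed words per book (slide estimate)
bits, words = units_needed(books * words_per_book, bits_per_unit=12)
print(f"{bits:.1f} bits to address the index, about {words} words")
```

The estimate treats words as independent draws worth 12 bits each, so common phrases will need more words than this and unusual ones fewer.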

How long do reads need to be?
•  Maximum information content of $\ell$ base pairs: $H_{\mathrm{read}} = 2\ell$ bits per length-$\ell$ sequence.
•  Most long kmers are distinct: for a genome of size $G$ (ca. $10^{10}$ bp), $\log_2(G) = \log_2(10^{10}) = 33.2$, so about 34 bits.
•  So we expect that when $2\ell > 34$ bits, we should be able to place any sequence.
•  That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
Short sequences end up being very distinctive, even fingerprint-like.
Check: Human reference genome
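A rough way to check this reasoning is to compute the minimum read length and then measure how many kmers are actually unique as k grows. The sketch below uses a random toy sequence in place of a real genome, which is an assumption: real genomes contain repeats, so they need longer kmers than this bound suggests. The names `min_read_length` and `distinct_fraction` are illustrative.

```python
import math
import random
from collections import Counter

def min_read_length(genome_size, bits_per_base=2.0):
    """Smallest length whose maximum information reaches log2(genome_size)."""
    return math.ceil(math.log2(genome_size) / bits_per_base)

def distinct_fraction(sequence, k):
    """Fraction of k-mer positions whose k-mer occurs exactly once."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)

print(min_read_length(10**10))  # 17, matching the slide

# Toy stand-in for a genome: 20,000 random bases with a fixed seed.
rng = random.Random(0)
toy_genome = "".join(rng.choice("ACGT") for _ in range(20000))
for k in (4, 8, 17):
    print(k, round(distinct_fraction(toy_genome, k), 3))
```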

The data deluge
•  There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of tens of Gbytes of sequence data at once.
•  The data has outgrown some favorite algorithms from the 1990s (BLAST).
http://www.mcs.anl.gov/~trimble/flowcell/
thumbnailpolish

Rarefaction of a photograph
A camera records the number of photons that land on each of millions of pixels.
A sequencer records the number of sequences that land in each possible sequence.
I actually think of a sequencer like a multichannel genetic spectrometer.

The genetic spectrometer
With my 10^12-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees.
•  Species diversity
•  Gene diversity
•  Sequence diversity

ATCGCGAAAAGTCCC 2
AAAAAAAAAAAAAAA 459
AAAAAAAAAAAAAAC 71
AAAATAAAAAAAATA 1
AAAAAAAAAAAAAAG 36
ACATGAAAAACAACT 1
AAAAAAAAAAAAAAT 23
AAAAAAAAAAAAACA 95
GTAGGAAAAGCCCAC 1
AAAAAAAAAAAAACC 7
AAAAAAAAAAAAACG 8
AAAAAAAAAAAAACT 9
AAAAAAAAAAAAAGA 36
AACAAGAAAAACAAA 1
AAAAAAAAAAAAAGC 10
AAATAAAAAAAATAG 1
AACAGAAAAAACACG 1
AAAAAAAAAAAAAGG 2
AAAAAAAAAAAAAGT 6
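A table of 15mers and counts like the one above can be produced with a simple sliding window. This is a minimal in-memory sketch, not the counting tool used for the talk; `count_kmers` is a hypothetical helper, and unlike many dedicated counters it does not merge a kmer with its reverse complement.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring of every read.

    K-mers containing anything other than A, C, G, T (for example N)
    are skipped.
    """
    valid = set("ACGT")
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= valid:
                counts[kmer] += 1
    return counts

# The three example reads shown earlier in the talk.
reads = [
    "CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT",
    "CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT",
    "CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT",
]
for kmer, count in count_kmers(reads, 15).most_common(5):
    print(kmer, count)
```

A Python dictionary holds millions of kmers comfortably but not billions, which is the memory problem raised later in the talk.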

Rarefaction of a photograph
Sampling only a few sequences is like exposing the camera for too short a time. Not enough photons to make out the picture.
[Image sequence, increasing exposure]
•  Some parts seem to be dark.
•  This looks like a portrait.
•  Start to see the mood.
•  A tiny bit of graininess left.
•  “Shot noise” in electrical engineering.
•  A studio portrait of Jane Goodall.

A scientific image
[Image sequence, increasing exposure]
•  This is a famous scientific image. Anybody recognize it?
•  Does this help?
•  There are small patches of brightness.
•  Were you expecting x-ray diffraction?
•  At longer exposures, more objects, smaller and dimmer, appear.
•  This is a part of the Hubble Deep Field image.

Image / sequencing analogy
Analogy to sequencing:
•  Most of the field is black.
•  Bright objects have halos.
•  Contains camera artifacts.
•  We can’t know what we didn’t see without longer exposures.

Opportunity cost of deep sequencing
This took two weeks to acquire on a one-of-a-kind telescope.
Consider the opportunity cost of studying a single sample for two weeks.
STScI did only four long exposures like this in 23 years.

Image / sequencing analogy
•  Sampling effort interacts with sequence diversity to produce a “horizon”.
•  Inferences are supported on the bright parts first, on the dim parts only at higher depth.
•  Not all the sequences, abundant or rare, are real.
•  Dim targets come at great cost in sample number.

How much novelty is in my dataset?
How many sequences do you need to see before you start seeing the same ones over and over again?
Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog.

How much novelty is in my dataset?
Nonpareil: Luis Rodriguez-Rojas and Kostas Konstantinidis developed a subset-against-all alignment approach to address the question “how quickly do we encounter novelty in shotgun datasets?”
Nonpareil-k: I found a way to answer almost the same question 300x faster.
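$$\mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 1)}{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 0)}$$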
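The formula can be evaluated from a kmer spectrum in a few lines. The sketch below rests on assumptions the slide does not spell out: that r_i is a kmer abundance, n_i is the number of distinct kmers with that abundance, epsilon is the fraction of the data subsampled, and Poisscdf(mean, x) is the Poisson probability of at most x observations. Under that reading the last factor is P(seen at least twice) / P(seen at least once). The function names are illustrative and this is not the Nonpareil-k source code.

```python
import math

def poisson_cdf(mean, x):
    """P(X <= x) for a Poisson random variable with the given mean."""
    return sum(math.exp(-mean) * mean**j / math.factorial(j)
               for j in range(x + 1))

def nonunique_fraction(epsilon, abundances, multiplicities):
    """Abundance-weighted fraction of observed k-mers seen more than once
    after subsampling a fraction epsilon of the data (assumed reading)."""
    total = sum(n * r for r, n in zip(abundances, multiplicities))
    result = 0.0
    for r, n in zip(abundances, multiplicities):
        mean = epsilon * r
        seen_at_least_once = 1.0 - poisson_cdf(mean, 0)
        seen_at_least_twice = 1.0 - poisson_cdf(mean, 1)
        if seen_at_least_once > 0:  # epsilon = 0 contributes nothing
            result += (n * r / total) * seen_at_least_twice / seen_at_least_once
    return result

# Hypothetical spectrum: 1000 k-mers seen once, 200 seen 10 times, 10 seen 100 times.
abundances = [1, 10, 100]
multiplicities = [1000, 200, 10]
for epsilon in (0.01, 0.1, 1.0):
    print(epsilon, round(nonunique_fraction(epsilon, abundances, multiplicities), 3))
```

Sweeping epsilon from 0 to 1 traces a rarefaction-style curve from the spectrum alone, with no alignment step.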
Nonpareil-k
[Figure panels] Nonpareil: model of sequence coverage (Georgia Tech). Nonpareil-k: kmer rarefaction (Argonne + Georgia Tech). Summary of sequence diversity.

Nonpareil-k: stratify datasets by coverage distribution
•  Most of dataset likely contained in assembly.
•  Assembly is likely to miss or attenuate the large unique fraction of dataset.

Looking for abundance patterns
Let’s look at the greyscale histogram.
Histogram regions: shadows, background, jacket, face and hands.
We can even tease out a few patterns in the histogram.

Kmers can tell you genome size and coverage depth
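For a single genome, a common back-of-envelope version of this idea is: find the abundance peak (the coverage depth), then divide the number of kmer observations by it. The sketch below is that simplified calculation, not the fitting procedure used by kmerspectrumanalyzer; the function name, the `min_abundance` cutoff for discarding likely error kmers, and the toy spectrum are all assumptions.

```python
def estimate_genome_size(spectrum, min_abundance=3):
    """Estimate (coverage depth, genome size) from a k-mer spectrum.

    spectrum maps abundance -> number of distinct k-mers seen that many
    times. K-mers below min_abundance are treated as sequencing errors.
    """
    solid = {a: n for a, n in spectrum.items() if a >= min_abundance}
    if not solid:
        raise ValueError("no k-mers at or above min_abundance")
    # Coverage depth: the abundance shared by the most distinct k-mers.
    peak = max(solid, key=solid.get)
    # Each genomic position is observed about `peak` times.
    total_observations = sum(a * n for a, n in solid.items())
    return peak, total_observations / peak

# Toy spectrum: an error tail at abundance 1-2 and a peak near 20x.
toy_spectrum = {1: 50000, 2: 3000, 18: 200, 19: 400, 20: 500, 21: 400, 22: 200}
depth, size = estimate_genome_size(toy_spectrum)
print(depth, size)
```

This only works when the spectrum has a clear peak; the later slide on generalities notes that few metagenomic datasets do.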

Redundancy is good
•  OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
•  OMG! I see this sequence 10 million times.
•  OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
•  Error correction / clustering / assembly works on subsets of the data with high sequence depth.
Abundance-based inferences are better in the high-abundance part of the data.

But I want to sequence everything!
Ok, we can count kmers in everything too.
kmerspectrumanalyzer summarizes the distribution, estimates genome size, coverage depth, … but what it’s really good at
Kmers show problems in datasets
•  Amok PCR: seemingly random sequences
•  Amok MDA: 10 Gbases of sequence, one gene
•  PCR duplicates: entire sequencing run was 50x exact- and near-exact duplicate reads
•  Unusually high error rate: indicated by low fraction of “solid” kmers (for isolate genomes)
•  Contaminated samples: 95% E. coli, 5% E. faecalis
•  Many datasets have as much as 5-45% of the sequence yield in adapters.
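Two of the checks in this list are easy to sketch from counts alone. The definitions below are assumptions made for illustration: "solid" is taken to mean a kmer seen at least `min_abundance` times, and only exact duplicate reads are detected (near-exact duplicates would need a clustering step). Neither function comes from kmerspectrumanalyzer.

```python
from collections import Counter

def solid_fraction(kmer_counts, min_abundance=2):
    """Fraction of k-mer observations that belong to k-mers seen at
    least min_abundance times. A low value suggests a high error rate."""
    total = sum(kmer_counts.values())
    solid = sum(c for c in kmer_counts.values() if c >= min_abundance)
    return solid / total if total else 0.0

def duplicate_read_fraction(reads):
    """Fraction of reads that are exact copies of another read."""
    reads = list(reads)
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)

# Hypothetical counts: one well-covered k-mer and one singleton.
print(solid_fraction(Counter({"AAA": 9, "AAC": 1})))
# Hypothetical reads: three copies of one read plus one distinct read.
print(duplicate_read_fraction(["ACGT", "ACGT", "ACGT", "TTTT"]))
```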

Generalities from the kmer counting mines
•  FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find)
•  Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance (but I’m not about to write a paper fitting kmers to the Yule, Mandelbrot, Levy, or Pareto distributions)

Hey kid, you want some unlabeled data?
[Figure: HMP / quantile norm / euclidean / colored by alpha; MG-RAST API; R-package matR]
Kevin Keegan, Argonne National Laboratory
I’m not sure how to do science with an unlabeled pile of datasets.

Hey kid, you want some pretty ordinations?
[Figure 2a]
Kevin Keegan, Argonne National Laboratory

Observation: Most scientists seem to be self-taught in computing.
Observation: Most scientists waste a lot of time using computers inefficiently.
Rachel and I volunteer with [logo]
We teach scientists how to get more done
Woods Hole, Tufts, U. Chicago, U. Chicago, UIC

Metagenomic annotation group
Folker Meyer
Elizabeth Glass
Narayan Desai
Kevin Keegan
Adina Howe
Wolfgang Gerlach
Wei Tang
Travis Harrison
Jared Bishof
Dan Braithwaite
Hunter Matthews
Sarah Owens

Formerly of Yale:
Howard Ochman
David Williams

Georgia Tech:
Kostas Konstantinidis
Luis Rodriguez-Rojas
