Sequencing run grief counseling: counting kmers at MG-RAST
Will Trimble
metagenomic annotation group
Argonne National Laboratory
April 29, 2014, UIC
  
Apology: I speak biology with an accent
•  I spent six years in dark rooms with lasers.
•  Now I use computers to analyze high-throughput sequence data.
•  I introduce myself as an applied mathematician.
•  Finding scoring functions to use ambiguous data to answer life’s persistent questions.
•  Shoveling data from the data-producing machine into the data-consuming furnace.
  
	
  
Outline
•  Sequences are different (math)
•  Sequencing is like photography (pictures)
•  Sequencing is beautiful: thumbnailpolish (micrographs)
•  How diverse are my shotgun sequences? nonpareil-k (graphs), kmerspectrumanalyzer (graphs)
  
Sequences are different
•  Sequencing produces sequences. Sequences are qualitatively different from all other data types.

Low-throughput categorical data: categories are sound.
Instrument readings, spectra, micrographs: not categorical.
High-throughput sequence data: categories uncertain.
$10^{0}$ to $10^{2}$    $10^{2}$ to $10^{7}$    $10^{12}$ to $10^{80}$

    @HWI-ST1035:125:D1K4CACXX:8:1101:1168
    CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT
    +
    @@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII
    @HWI-ST1035:125:D1K4CACXX:8:1101:1190
    CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT
    +
    CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI
    @HWI-ST1035:125:D1K4CACXX:8:1101:1339
    CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT
    +
    BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ
  
Forward-backward problem
Experiment design → Sequencing run → Sequence data → Assembly, Annotation (SEED, M5NR)
$10^{12}$    $10^{3}$ to $10^{5}$    $10^{0}$ to $10^{1}$

    489  Sensory box/GGDEF family
    470  hypothetical protein
    241  Co-Zn-Cd resistance CzcA
    202  Transposase
    200  homocysteine methyltransferase (EC 2.1.1.13)
    175  cyclase/phosphodiesterase
    164  Long-chain-fatty-acid--CoA ligase (EC 6.2.1.3)
    156  Methyl-accepting chemotaxis protein
    149  ABC transporter, ATP-binding protein
    147  Pb, Cd, Zn, and Hg transporting ATPase (EC 3.6.3.3)
    133  Ferrous iron transport protein B

So we reduce sequence data to categorical data.
  
Sequences are different
•  Sequencing produces sequences. Sequences are qualitatively different from all other data types.
•  Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.
  
Searching
What is this sequence?

    >mystery_sequence
    CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG
    GTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”

Same answer for both puzzles: you go to this website…
  
How long do reads need to be to recognize them?
How long do phrases need to be to recognize them?
  
How long do reads need to be?
Information (Shannon, 1949, BSTJ):

$$H = \sum_i p_i \log_2\left(\frac{1}{p_i}\right)$$

is a quantitative summary of the uncertainty of a probability distribution (a model of the data).

Profound applicability in machine learning and probabilistic modeling.

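
A minimal sketch of the entropy formula on this slide, in Python. The function names `shannon_entropy` and `empirical_entropy` and the toy inputs are illustrative assumptions, not part of the talk; the only thing taken from the slide is the sum of p_i log2(1/p_i).

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Return H = sum_i p_i * log2(1 / p_i) in bits.

    Terms with p_i == 0 are skipped, since they contribute nothing.
    """
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


def empirical_entropy(symbols):
    """Entropy of the observed symbol frequencies (hypothetical helper)."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return shannon_entropy(c / total for c in counts.values())


# Four equally likely bases carry 2 bits per base.
uniform_bases = shannon_entropy([0.25, 0.25, 0.25, 0.25])
# A skewed composition (seven A, one C) carries much less per base.
skewed_bases = empirical_entropy("AAAAAAAC")
print(uniform_bases, skewed_bases)
```

The uniform case is where the 2 bits per base used later in the talk comes from; any skew in base composition lowers the figure.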
How long do phrases need to be?
Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line.

    for n in 1..10
        type the first n words into Google Books, quoted.
        break if Google identifies your book.

•  Information content of English words: $H_{\text{word}}$ is ca. 12 bits per word.
•  Size of Google Books? Big libraries have a few $10^{7}$ books, and each one has $10^{5}$ indexed words, so a database size of $10^{12}$ words.

$$\log_2(\text{database size}) = \log_2(10^{12}) = \log_2(2^{39.9}) \approx 40 \text{ bits}$$

•  So we expect on average 40 / 12 ≈ 3.3, rounded up to 4, words to be enough to find a phrase in Google’s index. Try it.
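
The arithmetic above can be checked in a few lines of Python. This is only a sketch: `units_needed` is a hypothetical helper name, and the 12 bits per word and $10^{12}$-word index size are the slide's own round numbers, not measurements.

```python
import math


def units_needed(database_size, bits_per_unit):
    """Return (bits needed to address one item, units needed to supply them).

    database_size: number of indexed positions (words or bases).
    bits_per_unit: assumed information content of one word or base.
    """
    bits_to_address = math.log2(database_size)
    return bits_to_address, math.ceil(bits_to_address / bits_per_unit)


# Slide's numbers: 10^12 indexed words, about 12 bits per English word.
bits, words = units_needed(10**12, 12)
print(f"{bits:.1f} bits to address the index, so about {words} words")
```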
  
Usually nails your source in four words.
  
How long do reads need to be?
•  Maximum information content of $\ell$ base pairs: $H_{\text{read}} = 2\ell$ bits per length-$\ell$ sequence.
•  Most long kmers are distinct. For a genome of size $G$ (ca. $10^{10}$ bp):

$$\log_2(G) = \log_2(10^{10}) = \log_2(2^{33.2}) = 33.2, \text{ rounded up to } 34 \text{ bits}$$

•  So we expect that when $2\ell \geq 34$ bits, we should be able to place any sequence.
•  That means we need at least $\ell = 17$ base pairs (seems small) to deliver mail anywhere in the genome.

Short sequences end up being very distinctive, even fingerprint-like.
Check: Human reference genome
  
The data deluge
•  There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of 10s of Gbytes of sequence data at once.
•  The data has outgrown some favorite algorithms from the 1990s (BLAST).

http://www.mcs.anl.gov/~trimble/flowcell/
thumbnailpolish

Rarefaction of a photograph
A camera records the number of photons that land on each of millions of pixels.
A sequencer records the number of sequences that land in each possible sequence.
I actually think of a sequencer like a multichannel genetic spectrometer.
  
The genetic spectrometer
With my $10^{12}$-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees.
•  Species diversity
•  Gene diversity
•  Sequence diversity

    ATCGCGAAAAGTCCC    2
    AAAAAAAAAAAAAAA  459
    AAAAAAAAAAAAAAC   71
    AAAATAAAAAAAATA    1
    AAAAAAAAAAAAAAG   36
    ACATGAAAAACAACT    1
    AAAAAAAAAAAAAAT   23
    AAAAAAAAAAAAACA   95
    GTAGGAAAAGCCCAC    1
    AAAAAAAAAAAAACC    7
    AAAAAAAAAAAAACG    8
    AAAAAAAAAAAAACT    9
    AAAAAAAAAAAAAGA   36
    AACAAGAAAAACAAA    1
    AAAAAAAAAAAAAGC   10
    AAATAAAAAAAATAG    1
    AACAGAAAAAACACG    1
    AAAAAAAAAAAAAGG    2
    AAAAAAAAAAAAAGT    6

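
A table like the one above can be produced by sliding a window of length k along every read and tallying what is seen. The sketch below is a plain in-memory counter for illustration only. `count_kmers` and the toy reads are assumptions, it does not merge reverse complements, and it is not the counting tool used for the talk (a Python dictionary would not survive $10^{12}$ channels).

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring of every read.

    k-mers containing anything other than A, C, G, T (for example N)
    are skipped. Reverse complements are NOT merged in this sketch.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


# Toy reads: two poly-A reads with different tails, one unrelated read.
reads = ["A" * 16 + "C", "A" * 15 + "GT", "ATCGCGAAAAGTCCC"]
counts = count_kmers(reads, k=15)
for kmer, n in counts.most_common(3):
    print(kmer, n)
```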
Rarefaction of a photograph
Sampling only a few sequences is like exposing the camera for too short a time. Not enough photons to make out the picture.

[Image series: the same photograph at increasing exposure]
Some parts seem to be dark.
This looks like a portrait.
Start to see the mood.
A tiny bit of graininess left.
“Shot noise” in electrical engineering.
A studio portrait of Jane Goodall.
  
A scientific image
[Image series: the same image at increasing exposure]
This is a famous scientific image. Anybody recognize it?
Does this help?
There are small patches of brightness.
Were you expecting x-ray diffraction?
At longer exposures, more objects, smaller and dimmer, appear.
This is a part of the Hubble Deep Field image.
  
Opportunity cost of deep sequencing
This took two weeks to acquire on a one-of-a-kind telescope.
Consider the opportunity cost of studying a single sample for two weeks.
STScI did only four long exposures like this in 23 years.
  
Image / sequencing analogy
Analogy to sequencing:
•  Most of field is black
•  Bright objects have halos
•  Contains camera artifacts
•  We can’t know what we didn’t see without longer exposures.

Sampling effort interacts with sequence diversity to produce a “horizon”.
Inferences are supported on the bright parts first, on the dim parts only at higher depth.
Not all the sequences, abundant or rare, are real.
Dim targets come at great cost in sample number.
  
How much novelty is in my dataset?
How many sequences do you need to see before you start seeing the same ones over and over again?

Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog.
  
Luis Rodriguez-Rojas and Kostas Konstantinidis developed a subset-against-all alignment approach to address the question “how quickly do we encounter novelty in shotgun datasets?”: Nonpareil.

I found a way to answer almost the same question 300x faster: Nonpareil-k.

$$\mathrm{Nonuniqefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 1)}{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 0)}$$

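
A sketch of how this expression could be evaluated from a kmer abundance spectrum. It rests on assumptions the slide does not spell out: that Poisscdf(λ, k) means P(X ≤ k) for a Poisson variable with mean λ, that ε is the fraction of the dataset sampled, and that the spectrum is given as pairs of abundance r_i and number of distinct kmers n_i. The function names and the toy spectrum are hypothetical, and this is not the Nonpareil-k source code.

```python
import math


def poisson_cdf(lam, k):
    """P(X <= k) for X ~ Poisson(lam), by direct summation (small k only)."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def nonunique_fraction(eps, spectrum):
    """Evaluate the slide's sum for sampling fraction eps.

    spectrum: dict mapping abundance r_i -> number n_i of distinct k-mers
    observed r_i times in the full dataset.
    """
    total = sum(n * r for r, n in spectrum.items())
    result = 0.0
    for r, n in spectrum.items():
        lam = eps * r
        seen_at_least_once = 1.0 - poisson_cdf(lam, 0)
        seen_at_least_twice = 1.0 - poisson_cdf(lam, 1)
        if seen_at_least_once > 0.0:  # guards the eps == 0 corner case
            result += (n * r / total) * (seen_at_least_twice / seen_at_least_once)
    return result


# Toy spectrum: 1000 singleton k-mers plus 100 k-mers seen 50 times each.
toy_spectrum = {1: 1000, 50: 100}
curve = [nonunique_fraction(eps, toy_spectrum) for eps in (0.01, 0.1, 1.0)]
print(curve)
```

Under this reading, each term is the chance that a kmer already seen once in the subsample is seen again, weighted by that kmer's share of the data, so the curve rises with sampling effort.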
Nonpareil: model of sequence coverage (Georgia Tech)
Nonpareil-k: kmer rarefaction (Argonne + Georgia Tech); summary of sequence diversity

Nonpareil-k: stratify datasets by coverage distribution
Most of dataset likely contained in assembly.
Assembly is likely to miss or attenuate the large unique fraction of dataset.
  
	
  
Looking for abundance patterns
Let’s look at the greyscale histogram.
[Histogram regions labeled: Shadows, Background, Jacket, Face and hands]
We can even tease out a few patterns in the histogram.
  
Kmers can tell you genome size and coverage depth

Redundancy is good
•  OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
•  OMG! I see this sequence 10 million times.
•  OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
•  Error correction / clustering / assembly works on subsets of the data with high sequence depth.

Abundance-based inferences are better in the high-abundance part of the data.
  
But I want to sequence everything!
Ok, we can count kmers in everything too.
kmerspectrumanalyzer summarizes distribution, estimates genome size, coverage depth, … but what it’s really good at
  
Kmers show problems in datasets
•  Amok PCR: seemingly random sequences
•  Amok MDA: 10 Gbases of sequence, one gene
•  PCR duplicates: entire sequencing run was 50x exact- and near-exact duplicate reads
•  Unusually high error rate: indicated by low fraction of “solid” kmers (for isolate genomes)
•  Contaminated samples: 95% E. coli, 5% E. faecalis
•  Many datasets have as much as 5-45% of the sequence yield in adapters.
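
As an illustration of the “solid” kmer diagnostic in the list above, here is a minimal sketch. It assumes that a kmer counts as solid when it occurs at least `min_count` times; both the threshold of 2 and the function name are hypothetical choices. It reports the fraction of kmer occurrences that are solid.

```python
from collections import Counter


def solid_kmer_fraction(reads, k, min_count=2):
    """Fraction of k-mer occurrences whose k-mer is seen >= min_count times."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total = sum(counts.values())
    solid = sum(c for c in counts.values() if c >= min_count)
    return solid / total if total else 0.0


# Toy isolate: ten identical 30-base reads, then the same with one error.
genome = "ACGGACCAGCAGGCAACGAGCCAGGACAGC"  # contains no T
clean_reads = [genome] * 10
bad_read = genome[:15] + "T" + genome[16:]  # one substitution error
noisy_reads = [genome] * 9 + [bad_read]

clean = solid_kmer_fraction(clean_reads, k=8)
noisy = solid_kmer_fraction(noisy_reads, k=8)
print(clean, noisy)
```

A single substitution spoils up to k overlapping kmers, which is why the solid fraction falls quickly as the error rate rises.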
  	
  	
  
Generalities from the kmer counting mines
•  FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find).
•  Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance (but I’m not about to write a paper fitting kmers to the Yule, Mandelbrot, Levy, or Pareto distributions).
  
Hey kid, you want some unlabeled data?
HMP / quantile norm / euclidean / colored by alpha
MG-RAST API
R-package matR
Kevin Keegan, Argonne National Laboratory
I’m not sure how to do science with an unlabeled pile of datasets.
  
Hey kid, you want some pretty ordinations?
Figure 2a
Kevin Keegan, Argonne National Laboratory
  
Observation: Most scientists seem to be self-taught in computing.
Observation: Most scientists waste a lot of time using computers inefficiently.
Rachel and I volunteer with [logo]. We teach scientists how to get more done.
Woods Hole, Tufts, U. Chicago, U. Chicago, UIC
  
Metagenomic annotation group
Folker Meyer
Elizabeth Glass
Narayan Desai
Kevin Keegan
Adina Howe
Wolfgang Gerlach
Wei Tang
Travis Harrison
Jared Bishof
Dan Braithwaite
Hunter Matthews
Sarah Owens

Formerly of Yale:
Howard Ochman
David Williams

Georgia Tech:
Kostas Konstantinidis
Luis Rodriguez-Rojas
  
	
  

Creating effective slides without having to become a graphic designer
 
What might a spoken corpus tell us about language
What might a spoken corpus tell us about languageWhat might a spoken corpus tell us about language
What might a spoken corpus tell us about language
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015
 
Generating Sequences with Deep LSTMs & RNNS in julia
Generating Sequences with Deep LSTMs & RNNS in juliaGenerating Sequences with Deep LSTMs & RNNS in julia
Generating Sequences with Deep LSTMs & RNNS in julia
 
Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Data
Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale DataPredicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Data
Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Data
 
Data Mining Lecture_2.pptx
Data Mining Lecture_2.pptxData Mining Lecture_2.pptx
Data Mining Lecture_2.pptx
 
Small Data for Big Problems: Practical Transfer Learning for NLP
Small Data for Big Problems: Practical Transfer Learning for NLPSmall Data for Big Problems: Practical Transfer Learning for NLP
Small Data for Big Problems: Practical Transfer Learning for NLP
 
Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinity
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
 
IPK - Reproducible research - To infinity
IPK - Reproducible research - To infinityIPK - Reproducible research - To infinity
IPK - Reproducible research - To infinity
 

Último

Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxMohamedFarag457087
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxSuji236384
 
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsOrtegaSyrineMay
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 

Último (20)

Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 

Sequencing run grief counseling: counting kmers at MG-RAST

  • 1. Sequencing run grief counseling: counting kmers at MG-RAST Will Trimble metagenomic annotation group Argonne National Laboratory April 29, 2014 UIC
  • 2. Apology: I speak biology with an accent • I spent six years in dark rooms with lasers • Now I use computers to analyze high-throughput sequence data. • I introduce myself as an applied mathematician. • Finding scoring functions to use ambiguous data to answer life’s persistent questions.
  • 3. Apology: I speak biology with an accent • I spent six years in dark rooms with lasers • Now I use computers to analyze high-throughput sequence data. • I introduce myself as an applied mathematician. • Finding scoring functions to use ambiguous data to answer life’s persistent questions. • Shoveling data from the data-producing machine into the data-consuming furnace.
  • 4. Outline • Sequences are different • Sequencing is like photography • Sequencing is beautiful thumbnailpolish • How diverse are my shotgun sequences? nonpareil-k kmerspectrumanalyzer
  • 5. Outline • Sequences are different (math) • Sequencing is like photography (pictures) • Sequencing is beautiful thumbnailpolish (micrographs) • How diverse are my shotgun sequences? nonpareil-k (graphs) kmerspectrumanalyzer (graphs)
  • 6. Sequences are different • Sequencing produces sequences. Sequences are qualitatively different from all other data types. Low-throughput categorical data: categories are sound.
  • 7. Sequences are different • Sequencing produces sequences. Sequences are qualitatively different from all other data types. Instrument readings, spectra, micrographs: not categorical. Low-throughput categorical data: categories are sound.
  • 8. Sequences are different • Sequencing produces sequences. Sequences are qualitatively different from all other data types. @HWI-ST1035:125:D1K4CACXX:8:1101:1168 CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT + @@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII @HWI-ST1035:125:D1K4CACXX:8:1101:1190 CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT + CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI @HWI-ST1035:125:D1K4CACXX:8:1101:1339 CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT + BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ Instrument readings, spectra, micrographs: not categorical. Low-throughput categorical data: categories are sound. High-throughput sequence data: categories uncertain.
  • 9. Sequences are different • Sequencing produces sequences. Sequences are qualitatively different from all other data types. @HWI-ST1035:125:D1K4CACXX:8:1101:1168 CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT + @@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII @HWI-ST1035:125:D1K4CACXX:8:1101:1190 CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT + CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI @HWI-ST1035:125:D1K4CACXX:8:1101:1339 CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT + BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ Instrument readings, spectra, micrographs: not categorical. Low-throughput categorical data: categories are sound. High-throughput sequence data: categories uncertain. 10^0-10^2 10^2-10^7 10^12-10^80
  • 10. Experiment design Sequencing run Sequence data Assembly, Annotation SEED M5NR 489 Sensory box/GGDEF family; 470 hyphothetical protein; 241 Co-Zn-Cd resistance CzcA; 202 Transposase; 200 homocysteine methyltransferase (EC 2.1.1.13); 175 cyclase/phosphodiesterase; 164 Long-chain-fatty-acid--CoA ligase (EC 6.2.1.3); 156 Methyl-accepting chemotaxis protein; 149 ABC transporter, ATP-binding protein; 147 Pb, Cd, Zn, and Hg transporting ATPase (EC 3.6.3.3); 133 Ferrous iron transport protein B So we reduce sequence data to categorical data.
  • 11. Forward-backward problem Experiment design Sequencing run Sequence data Assembly, Annotation SEED M5NR 489 Sensory box/GGDEF family; 470 hyphothetical protein; 241 Co-Zn-Cd resistance CzcA; 202 Transposase; 200 homocysteine methyltransferase (EC 2.1.1.13); 175 cyclase/phosphodiesterase; 164 Long-chain-fatty-acid--CoA ligase (EC 6.2.1.3); 156 Methyl-accepting chemotaxis protein; 149 ABC transporter, ATP-binding protein; 147 Pb, Cd, Zn, and Hg transporting ATPase (EC 3.6.3.3); 133 Ferrous iron transport protein B 10^12 10^3-10^5 10^0-10^1 So we reduce sequence data to categorical data.
  • 12. Sequences are different • Sequencing produces sequences. Sequences are qualitatively different from all other data types. • Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.
  • 13. Searching What is this sequence? >mystery_sequence CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG GTCATCGATAGCAGGATAATAATACAGTA Who wrote this line? “be regarded as unproved until it has been checked against more exact results”
  • 14. Searching What is this sequence? >mystery_sequence CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG GTCATCGATAGCAGGATAATAATACAGTA Who wrote this line? “be regarded as unproved until it has been checked against more exact results” Same answer for both puzzles: you go to this website…
  • 15. Searching What is this sequence? >mystery_sequence CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG GTCATCGATAGCAGGATAATAATACAGTA Who wrote this line? “be regarded as unproved until it has been checked against more exact results” How long do reads need to be to recognize them? How long do phrases need to be to recognize them?
  • 16. How long do reads need to be? Information (Shannon, 1948, BSTJ): $H = \sum_i p_i \log_2\left(\frac{1}{p_i}\right)$ is a quantitative summary of the uncertainty of a probability distribution, that is, of a model of the data. Profound applicability in machine learning and probabilistic modeling.
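The entropy formula on this slide can be tried out directly. The following is a minimal Python sketch, not code from the talk; the function names `shannon_entropy` and `empirical_entropy` are made up for illustration.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Return H = sum_i p_i * log2(1 / p_i) in bits.

    Terms with p_i == 0 contribute nothing and are skipped.
    """
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


def empirical_entropy(symbols):
    """Estimate the entropy of a sequence from its observed symbol frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return shannon_entropy(c / total for c in counts.values())


if __name__ == "__main__":
    # Four equally likely bases carry 2 bits per base.
    print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
    # A skewed composition carries less than 2 bits per base.
    print(empirical_entropy("AAAAAAAACCGT"))
```

A uniform distribution over the four bases gives the 2 bits per base used in the read-length estimate later in the talk; any skew in base composition lowers that number.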
  • 17. How long do phrases need to be? Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line. for n in 1..10: type the first n words into google books, quoted. break if google identifies your book.
  • 18. How long do phrases need to be? • Information content of English words: H_word ca. 12 bits per word. • Size of google books? Big libraries have a few 10^7 books, each one has 10^5 indexed words … so a database size of 10^12 words. Database size = 10^12 = 2^39.9, so log2(database size) ≈ 40 bits • So we expect on average 40 / 12 = 3.3, rounded up to 4 words, to be enough to find a phrase in google’s index. Try it.
  • 19. How long do phrases need to be? Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line. for n in 1..10: type the first n words into google books, quoted. break if google identifies your book.
  • 20. How long do phrases need to be? Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line. for n in 1..10: type the first n words into google books, quoted. break if google identifies your book. Usually nails your source in four words.
  • 21. How long do reads need to be? • Maximum information content of ℓ base pairs: H_read ≤ 2ℓ bits per length-ℓ sequence • Most long kmers are distinct: genome of size G (ca. 10^10 bp), G = 10^10 = 2^33.2, so log2(G) ≈ 34 bits • So we expect that when 2ℓ > 34 bits, we should be able to place any sequence. • That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
  • 22. How long do reads need to be? • Maximum information content of ℓ base pairs: H_read ≤ 2ℓ bits per length-ℓ sequence • Most long kmers are distinct: genome of size G (ca. 10^10 bp), G = 10^10 = 2^33.2, so log2(G) ≈ 34 bits • So we expect that when 2ℓ > 34 bits, we should be able to place any sequence. • That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome. Short sequences end up being very distinctive, even fingerprint-like.
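The phrase-length and read-length estimates are the same back-of-the-envelope calculation: divide the bits needed to address the database by the bits carried per symbol. A small Python sketch of that arithmetic follows; the helper name `min_symbols_to_locate` is hypothetical, and the inputs are the round numbers quoted on the slides.

```python
import math


def min_symbols_to_locate(database_size, bits_per_symbol):
    """Smallest whole number of symbols whose information content
    reaches log2(database_size) bits."""
    bits_needed = math.log2(database_size)
    return math.ceil(bits_needed / bits_per_symbol)


if __name__ == "__main__":
    # Phrases: about 10^12 indexed words, about 12 bits per English word.
    print(min_symbols_to_locate(1e12, 12))  # words
    # Reads: genome of about 10^10 bp, at most 2 bits per base.
    print(min_symbols_to_locate(1e10, 2))   # base pairs
```

With these inputs the function returns 4 words and 17 base pairs, matching the slides. Both figures are lower bounds under idealized assumptions (independent, maximally informative symbols); repeats and sequencing errors push the practical read length higher.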
  • 24. The data deluge • There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of 10s of Gbytes of sequence data at once. • The data has outgrown some favorite algorithms from the 1990s (BLAST)
  • 26. Rarefaction of a photograph A camera records the number of photons that land on each of millions of pixels. A sequencer records the number of sequences that land in each possible sequence. I actually think of a sequencer like a multichannel genetic spectrometer.
  • 27. Rarefaction of a photograph A camera records the number of photons that land on each of millions of pixels. A sequencer records the number of sequences that land in each possible sequence. I actually think of a sequencer like a multichannel genetic spectrometer.
  • 28. The genetic spectrometer With my 10^12-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees. Species diversity ATCGCGAAAAGTCCC 2 AAAAAAAAAAAAAAA 459 AAAAAAAAAAAAAAC 71 AAAATAAAAAAAATA 1 AAAAAAAAAAAAAAG 36 ACATGAAAAACAACT 1 AAAAAAAAAAAAAAT 23 AAAAAAAAAAAAACA 95 GTAGGAAAAGCCCAC 1 AAAAAAAAAAAAACC 7 AAAAAAAAAAAAACG 8 AAAAAAAAAAAAACT 9 AAAAAAAAAAAAAGA 36 AACAAGAAAAACAAA 1 AAAAAAAAAAAAAGC 10 AAATAAAAAAAATAG 1 AACAGAAAAAACACG 1 AAAAAAAAAAAAAGG 2 AAAAAAAAAAAAAGT 6
  • 29. The genetic spectrometer With my 10^12-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees. Species diversity Gene diversity ATCGCGAAAAGTCCC 2 AAAAAAAAAAAAAAA 459 AAAAAAAAAAAAAAC 71 AAAATAAAAAAAATA 1 AAAAAAAAAAAAAAG 36 ACATGAAAAACAACT 1 AAAAAAAAAAAAAAT 23 AAAAAAAAAAAAACA 95 GTAGGAAAAGCCCAC 1 AAAAAAAAAAAAACC 7 AAAAAAAAAAAAACG 8 AAAAAAAAAAAAACT 9 AAAAAAAAAAAAAGA 36 AACAAGAAAAACAAA 1 AAAAAAAAAAAAAGC 10 AAATAAAAAAAATAG 1 AACAGAAAAAACACG 1 AAAAAAAAAAAAAGG 2 AAAAAAAAAAAAAGT 6
  • 30. The genetic spectrometer With my 10^12-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees. Species diversity Gene diversity Sequence diversity ATCGCGAAAAGTCCC 2 AAAAAAAAAAAAAAA 459 AAAAAAAAAAAAAAC 71 AAAATAAAAAAAATA 1 AAAAAAAAAAAAAAG 36 ACATGAAAAACAACT 1 AAAAAAAAAAAAAAT 23 AAAAAAAAAAAAACA 95 GTAGGAAAAGCCCAC 1 AAAAAAAAAAAAACC 7 AAAAAAAAAAAAACG 8 AAAAAAAAAAAAACT 9 AAAAAAAAAAAAAGA 36 AACAAGAAAAACAAA 1 AAAAAAAAAAAAAGC 10 AAATAAAAAAAATAG 1 AACAGAAAAAACACG 1 AAAAAAAAAAAAAGG 2 AAAAAAAAAAAAAGT 6
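The kmer/count table shown on these slides is the kind of output a kmer counter produces. As an illustration only (this is not the tool used in the talk, and the function names are hypothetical), a naive in-memory counter can be sketched in Python. It does not merge reverse complements and will not scale to billions of distinct kmers.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_kmers(reads, k):
    """Count every length-k substring of every read.

    Kmers containing anything other than A, C, G, T (for example N) are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map abundance -> number of distinct kmers seen with that abundance."""
    return Counter(counts.values())


if __name__ == "__main__":
    reads = ["CAAACAGTTCCATCACATGG", "AAACAGTTCCATCACATGGC"]
    counts = count_kmers(reads, k=15)
    for kmer, n in sorted(counts.items()):
        print(kmer, n)
    print(dict(kmer_spectrum(counts)))
```

The spectrum (how many kmers occur once, twice, and so on) is the "greyscale histogram" of the dataset that the later slides reason about.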
  • 31. Rarefaction of a photograph Sampling only a few sequences is like exposing the camera for too short a time. Not enough photons to make out the picture.
  • 32. Rarefaction of a photograph some parts seem to be dark.
  • 33. Rarefaction of a photograph
  • 34. Rarefaction of a photograph This looks like a portrait
  • 35. Rarefaction of a photograph
  • 36. Rarefaction of a photograph Start to see the mood
  • 37. Rarefaction of a photograph
  • 38. Rarefaction of a photograph A tiny bit of graininess left
  • 39. Rarefaction of a photograph “shot noise” in electrical engineering
  • 40. Rarefaction of a photograph A studio portrait of Jane Goodall
  • 41. A scientific image This is a famous scientific image. Anybody recognize it?
  • 42. A scientific image Does this help?
  • 43. A scientific image There are small patches of brightness
  • 44. A scientific image Were you expecting x-ray diffraction?
  • 45. A scientific image At longer exposures
  • 46. A scientific image more objects, smaller and dimmer, appear.
  • 47. A scientific image This is a part of the Hubble Deep Field image
  • 48. Image / sequencing analogy Analogy to sequencing: • Most of field is black • Bright objects have halos • Contains camera artifacts • We can’t know what we didn’t see without longer exposures.
  • 49. Opportunity cost of deep sequencing This took two weeks to acquire on a one-of-a-kind telescope. Consider the opportunity cost of studying a single sample for two weeks. STScI did only four long exposures like this in 23 years.
  • 50. Image / sequencing analogy Analogy to sequencing: • Most of field is black • Bright objects have halos • Contains camera artifacts • We can’t know what we didn’t see without longer exposures. Sampling effort interacts with sequence diversity to produce a “horizon”. Inferences are supported on the bright parts first, on the dim parts only at higher depth. Not all the sequences, abundant or rare, are real. Dim targets come at great cost in sample number.
  • 51. How much novelty is in my dataset? How many sequences do you need to see before you start seeing the same ones over and over again?
  • 52. How much novelty is in my dataset? How many sequences do you need to see before you start seeing the same ones over and over again? Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog.
  • 53. How much novelty is in my dataset? Luis Rodriguez-Rojas and Kostas Konstantinidis developed a subset-against-all alignment approach to address the question “how quickly do we encounter novelty in shotgun datasets?” (Nonpareil) I found a way to answer almost the same question 300x faster. (Nonpareil-k)
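  • 54. How much novelty is in my dataset? Nonpareil-k $\mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 1)}{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 0)}$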
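The slide's formula can be explored numerically. The sketch below is an interpretation, not the Nonpareil-k implementation. It assumes that r_i are kmer abundances, that n_i is the number of distinct kmers at abundance r_i, that ε is the subsampling fraction, and that the two Poisson terms form a ratio: the probability of seeing a kmer at least twice divided by the probability of seeing it at least once. All function names are hypothetical.

```python
import math


def poisson_cdf(lam, k):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k + 1))


def nonunique_fraction(epsilon, abundances, counts):
    """Abundance-weighted chance that a kmer seen at least once in a
    subsample of relative size epsilon is seen at least twice.

    abundances: kmer abundances r_i in the full dataset (assumed meaning)
    counts:     number of distinct kmers n_i at each abundance (assumed meaning)
    """
    total = sum(n * r for r, n in zip(abundances, counts))
    result = 0.0
    for r, n in zip(abundances, counts):
        lam = epsilon * r
        seen_at_least_once = 1.0 - poisson_cdf(lam, 0)
        seen_at_least_twice = 1.0 - poisson_cdf(lam, 1)
        if seen_at_least_once > 0:
            weight = n * r / total
            result += weight * seen_at_least_twice / seen_at_least_once
    return result


if __name__ == "__main__":
    abundances = [1, 5, 50]
    counts = [1000, 200, 10]
    for epsilon in (0.01, 0.1, 1.0, 10.0):
        print(epsilon, round(nonunique_fraction(epsilon, abundances, counts), 4))
```

Under this reading the curve rises from near 0 at very shallow subsampling toward 1 as the sampling effort grows, which is the shape expected of a rarefaction-style redundancy curve.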
  • 55. Nonpareil: model of sequence coverage Georgia Tech
  • 56. Nonpareil: model of sequence coverage Georgia Tech Nonpareil-k: kmer rarefaction Argonne + Georgia Tech summary of sequence diversity
  • 57. Nonpareil-k: stratify datasets by coverage distribution most of dataset likely contained in assembly assembly is likely to miss or attenuate the large unique fraction of dataset.
  • 59. Looking for abundance patterns Let’s look at the greyscale histogram
  • 61. Looking for abundance patterns Shadows Background Jacket Face and hands We can even tease out a few patterns in the histogram
  • 62. Kmers can tell you genome size and coverage depth
  • 63. Kmers can tell you genome size and coverage depth
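One common way to read genome size and coverage depth off a kmer spectrum is to take the peak abundance as the coverage and divide the total number of kmer observations by it. The Python sketch below illustrates that idea only. It is a crude peak-based estimate for a single isolate genome, and it is not a description of what kmerspectrumanalyzer does internally; the function name and the `min_abundance` cutoff are assumptions.

```python
def estimate_genome_size(spectrum, min_abundance=2):
    """Estimate (genome_size, coverage) from a kmer spectrum.

    spectrum: dict mapping abundance -> number of distinct kmers at that abundance
    min_abundance: kmers rarer than this are treated as likely errors and ignored
    """
    solid = {r: n for r, n in spectrum.items() if r >= min_abundance}
    if not solid:
        raise ValueError("no kmers at or above min_abundance")
    # Coverage depth: the abundance shared by the largest number of distinct kmers.
    coverage = max(solid, key=lambda r: solid[r])
    # Total kmer observations divided by depth approximates the genome size.
    total_observations = sum(r * n for r, n in solid.items())
    return total_observations / coverage, coverage


if __name__ == "__main__":
    # Toy spectrum: many singleton (error) kmers plus a peak around 10x.
    toy = {1: 5000, 9: 100, 10: 800, 11: 100}
    size, depth = estimate_genome_size(toy)
    print(size, depth)
```

For metagenomes, where the deck notes that well-separated abundance peaks are rare, a single-peak estimate like this is not meaningful.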
  • 64. Redundancy is good • OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life. • OMG! I see this sequence 10 million times. • OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory. • Error correction / clustering / assembly works on subsets of the data with high sequence depth.
  • 65. Redundancy is good • OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life. • OMG! I see this sequence 10 million times. • OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory. • Error correction / clustering / assembly works on subsets of the data with high sequence depth. Abundance-based inferences are better in the high-abundance part of the data.
  • 66. But I want to sequence everything! Ok, we can count kmers in everything too. kmerspectrumanalyzer summarizes distribution, estimates genome size, coverage depth, … but what it’s really good at
  • 67. Kmers show problems in datasets • Amok PCR: seemingly random sequences • Amok MDA: 10 Gbases of sequence, one gene • PCR duplicates: entire sequencing run was 50x exact- and near-exact duplicate reads • Unusually high error rate: indicated by low fraction of “solid” kmers (for isolate genomes) • Contaminated samples: 95% E. coli, 5% E. faecalis • Many datasets have as much as 5-45% of the sequence yield in adapters.
  • 68. Generalities from the kmer counting mines • FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find) • Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance (but I’m not about to write a paper fitting kmers to the Yule, Mandelbrot, Levy, or Pareto distributions)
  • 69. HMP / quantile norm / euclidean / colored by alpha MG-RAST API R-package matR Hey kid, you want some unlabeled data? Kevin Keegan, Argonne National Laboratory
  • 70. HMP / quantile norm / euclidean / colored by alpha MG-RAST API R-package matR Hey kid, you want some unlabeled data? Kevin Keegan, Argonne National Laboratory I’m not sure how to do science with an unlabeled pile of datasets.
  • 71. Figure 2a Hey kid, you want some pretty ordinations? Kevin Keegan, Argonne National Laboratory
  • 72. Observation: Most scientists seem to be self-taught in computing. Observation: Most scientists waste a lot of time using computers inefficiently. Rachel and I volunteer with
  • 73. We teach scientists how to get more done Woods Hole Tufts U. Chicago U. Chicago UIC
  • 74.
  • 75. Metagenomic annotation group: Folker Meyer, Elizabeth Glass, Narayan Desai, Kevin Keegan, Adina Howe, Wolfgang Gerlach, Wei Tang, Travis Harrison, Jared Bishof, Dan Braithwaite, Hunter Matthews, Sarah Owens. Formerly of Yale: Howard Ochman, David Williams. Georgia Tech: Kostas Konstantinidis, Luis Rodriguez-Rojas.