2. Acknowledgements
Lab members involved:
Adina Howe (w/ Tiedje)
Jason Pell
Arend Hintze
Rosangela Canino-Koning
Qingpeng Zhang
Elijah Lowe
Likit Preeyanon
Jiarong Guo
Tim Brom
Kanchan Pavangadkar
Eric McDonald
Collaborators:
Jim Tiedje, MSU
Billie Swalla, UW
Janet Jansson, LBNL
Susannah Tringe, JGI
Funding:
USDA NIFA; NSF IOS; BEACON.
4. “Be the change you want to see”
We are aggressively open…
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
Twitter: @ctitusbrown
Grants on Lab Web site:
http://ged.msu.edu/interests.html
(What’s a good license??)
Preprints: on arXiv, q-bio:
‘kmer-percolation arxiv’
‘diginorm arxiv’
5. The data catastrophe!
Data set sizes growing faster than compute capacity
(esp RAM).
Many biological algorithms don’t scale all that well,
anyway.
Algorithmically, we want:
Single-pass.
Compression approaches (lossy or otherwise).
Low-memory data structures
I, personally, think the last thing in the world we need
is another standalone package: pre-filtering
approaches.
“Run our nifty approaches first, then feed into the
6. Digital normalization
Suppose you have a
dilution factor of A (10) to
B (1). To get 10x of B you
need to get 100x of A!
Overkill!!
This 100x will consume
disk space and, because
of errors, memory.
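The arithmetic on this slide can be written out as a short sketch. The function name and its parameters are illustrative, not from the talk, and it assumes reads are sampled in proportion to abundance.

```python
def required_coverage(abundances, rare, target):
    """Coverage every component reaches once `rare` hits `target`.

    Assumes coverage is proportional to relative abundance
    (an assumption made for this illustration).
    """
    scale = target / abundances[rare]
    return {name: abundance * scale for name, abundance in abundances.items()}


# A is 10x more abundant than B; we want 10x coverage of B.
coverage = required_coverage({"A": 10, "B": 1}, rare="B", target=10)
print(coverage)  # {'A': 100.0, 'B': 10.0}
```

Only the 10x of B is actually needed, so 90 of every 100 units of A coverage is surplus.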
8. Digital normalization algorithm
for read in dataset:
    if median_kmer_count(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read
Note, single pass; fixed memory.
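The loop above can be fleshed out into a small runnable sketch. The helper names, the tiny k-mer size, and the cutoff value are assumptions chosen for the demo. It also uses an exact Python dict for counting, so unlike the real thing it is not fixed-memory; a fixed-size probabilistic counter would be needed for that property.

```python
from statistics import median

K = 4        # k-mer size; tiny for the demo (an illustrative value)
CUTOFF = 2   # target median k-mer count; also illustrative


def kmers(read, k=K):
    """Return all overlapping k-mers in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=K, cutoff=CUTOFF):
    """Single pass: keep a read only if its median k-mer count is below cutoff."""
    counts = {}  # exact counts for clarity; NOT fixed-memory
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median([counts.get(km, 0) for km in read_kmers]) < cutoff:
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
            kept.append(read)
        # otherwise the read is discarded and its k-mers are never counted
    return kept


# Five copies of a high-coverage read plus one low-coverage read.
reads = ["ACGTTGCAAG"] * 5 + ["TTGACCAGTA"]
kept = normalize(reads)
print(kept)  # ['ACGTTGCAAG', 'ACGTTGCAAG', 'TTGACCAGTA']
```

The abundant read is capped at two copies while the rare read survives untouched. Because discarded reads never update the counts, most of the k-mers in redundant reads are never stored.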
9. Digital normalization is efficient & effective
Algorithmic nerdvana!
• Single pass algorithm;
• Fixed memory;
• Cheaper than assembly;
• Reduces assembly time;
• Scales assembly memory.
Brown et al., in review, PLoS One
13. Other key points
Virtually identical contig assembly; scaffolding works
but is not yet cookie-cutter.
Digital normalization changes the way de Bruijn graph
assembly scales from the size of your data set to
the size of the source sample.
Always lower memory than assembly: we never
collect most erroneous k-mers.
Digital normalization can be done once, and then
assembly parameter exploration can be done.
14. Quotable quotes.
Comment: “This looks like a great solution for
people who can’t afford real computers”.
OK, but:
“Buying ever bigger computers is a great
solution for people who don’t want to think
hard.”
To be less snide: both kinds of scaling are needed,
of course.
15. Why use diginorm?
Use the cloud to assemble any microbial
genomes incl. single-cell, many eukaryotic
genomes, most mRNAseq, and many
metagenomes.
Seems to provide leverage on addressing many
biological or sample prep problems (single-cell &
genome amplification MDA; metagenome;
heterozygosity).
And, well, the general idea of locus specific
graph analysis solves lots of things…
16. Some interim concluding
thoughts
Digital normalization-like approaches provide a
path to solving the majority of assembly scaling
problems, and will enable assembly on current
cloud computing hardware.
This is not true for highly diverse metagenome
environments…
For soil, we estimate that we need 50 Tbp / gram
soil. Sigh.
Biologists and bioinformaticians hate:
Throwing away data
Caveats in bioinformatics papers (which reviewers
like, note)
17. Streaming error correction.
We can do error trimming of genomic, MDA, transcriptomic,
metagenomic data in < 2 passes, fixed memory.
We have just submitted a proposal to adapt Euler or
Quake-like error correction (e.g. spectral alignment
problem) to this framework.
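As a rough illustration of k-mer based error trimming, the sketch below truncates each read just before its first low-abundance k-mer. This is one common reading of "error trimming" and an assumption here, not a description of the lab's implementation. For clarity it makes two full passes with an exact counter, so it does not reproduce the < 2 pass, fixed-memory behaviour; the names and thresholds are illustrative.

```python
from collections import Counter


def kmers(read, k):
    """Return all overlapping k-mers in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def trim_read(read, counts, k, min_count):
    """Truncate `read` just before its first k-mer seen fewer than min_count times."""
    for i, km in enumerate(kmers(read, k)):
        if counts[km] < min_count:
            # k-mers 0..i-1 are trusted; together they cover bases 0..i+k-2.
            return read[:i + k - 1] if i > 0 else ""
    return read


K = 4          # illustrative k-mer size
MIN_COUNT = 2  # illustrative abundance threshold

# Three clean copies and one copy with a substitution at index 8 (A -> T).
reads = ["ACGTTGCAAG"] * 3 + ["ACGTTGCATG"]

# Pass 1: count every k-mer. Pass 2: trim each read against those counts.
counts = Counter(km for read in reads for km in kmers(read, K))
trimmed = [trim_read(read, counts, K, MIN_COUNT) for read in reads]
print(trimmed)  # ['ACGTTGCAAG', 'ACGTTGCAAG', 'ACGTTGCAAG', 'ACGTTGCA']
```

The erroneous read loses its last two bases, which removes the substitution and the unique k-mers it created, while the clean reads pass through unchanged.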
18. Side note: error correction is the
biggest “data” problem left in
sequencing.
Both for mapping & assembly.
19. Replication fu
In December 2011, I met Wes McKinney on a
train and he convinced me that I should look at
IPython Notebook.
This is an interactive Web notebook for data
analysis…
Hey, neat! We can use this for replication!
All of our figures can be regenerated from scratch,
on an EC2 instance, using a Makefile (data
pipeline) and IPython Notebook (figure generation).
Everything is version controlled.
Honestly not much work, and will be less the next
time.
21. So… how’d that go?
People who already cared thought it was nifty.
http://ivory.idyll.org/blog/replication-i.html
Almost nobody else cares ;(
Presub enquiry to editor: “Be sure that your paper can
be reproduced.” Uh, please read my letter to the end?
“Could you improve your Makefile? I want to
reimplement diginorm in another language and reuse
your pipeline, but your Makefile is a mess.”
Incredibly useful, nonetheless. Already part of
undergraduate and graduate training in my lab;
helping us and others with next papers; etc. etc. etc.
Life is way too short to waste on unnecessarily
replicating your own workflows, much less other
people’s.
23. Advertisement!
Qingpeng Zhang (QP) will talk about our very
useful ‘khmer’ software for efficiently counting k-mers.
Want a simple Python lib for reading & indexing
FASTA/FASTQ? Check out screed.
“Better science through superior software.”