This document discusses new directions for the khmer bioinformatics platform, including developing semi-streaming algorithms for sequence analysis using k-mers. Digital normalization is presented as an initial approach that compresses sequencing data, though it discards information. Later work introduced a two-pass semi-streaming framework using saturation detection to enable error correction and variant calling using minimal memory. Current work includes developing a pair-HMM-based graph aligner and applying it to tasks like variant calling. The khmer platform provides implementations of these streaming algorithms to enable analysis of large genomic and metagenomic datasets.
Building a platform for bioinformatics: exciting new directions for khmer
1. Building a platform for bioinformatics: some exciting new directions for khmer.
C. Titus Brown
ctbrown@ucdavis.edu
March 12, 2015
2. Hello!
Associate Professor (#tenure!);
School of Veterinary Medicine
University of California, Davis.
More information at:
• ged.msu.edu/ (URL needs to be updated :)
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown
3. Warnings
This talk contains information that may constitute
“forward-looking statements.” Generally, the words
“believe,” “expect,” “intend,” “estimate,”
“anticipate,” “project,” “will” and similar expressions
identify forward-looking statements, which generally
are not historical in nature.
I have been advised to put this disclaimer in as well:
Dr. Brown is not currently under treatment for any
disorders related to megalomania.
5. De Bruijn graphs – assemble on overlaps
J.R. Miller et al. / Genomics (2010)
7. K-mers give you an implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
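The shared-k-mer idea above can be sketched in a few lines of Python (an illustrative toy, not khmer code): every k-mer shared between two reads pins a relative offset between them, which is exactly the implicit alignment. The choice K = 21 and the helper names are invented for illustration.

```python
K = 21  # arbitrary k-mer size for this sketch

def kmers(seq, k=K):
    """Yield (position, k-mer) pairs for a sequence."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]

def implicit_offsets(a, b, k=K):
    """Relative offsets (position in a minus position in b) implied by
    every k-mer the two reads share."""
    index = {}
    for i, km in kmers(a, k):
        index.setdefault(km, []).append(i)
    offsets = set()
    for j, km in kmers(b, k):
        for i in index.get(km, []):
            offsets.add(i - j)
    return offsets

# the two reads from the slide above
a = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
b = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
```

All ten shared 21-mers of these two reads agree on a single offset, so the reads are implicitly aligned without any explicit alignment step.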
8. The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in k novel k-mers!
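The claim can be checked directly with a toy sketch (the read and the error position are invented; real analyses use larger k):

```python
K = 5  # toy k-mer size; with real reads k is typically 20-32

def kmer_set(seq, k=K):
    """All k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

true_read = "CCGATTGCACTGGACCGATGCACG"
# introduce a single substitution error in the middle of the read
err_read = true_read[:12] + "A" + true_read[13:]

# every k-mer overlapping the error position is new:
# k of them for a mid-read error
novel = kmer_set(err_read) - kmer_set(true_read)
```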
11. One big challenge: scalability!
De Bruijn graph size scales with # errors.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
13. Goals
• Initial goal: can we assemble large data sets??
• Longer-term goal: can we find efficient (De Bruijn?)
graph-based approaches to sequence analysis?
14. First attempt: compressible De Bruijn graphs
(Figure panels: 1%, 5%, 10%, 15%)
Pell et al., 2012
Can use Bloom filters to store De Bruijn graph structures. => Overall structure remains as you squish graphs down.
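A minimal sketch of the Bloom-filter idea (illustrative only; khmer's actual data structure and hash functions differ): k-mers are stored as bits in a probabilistic set, and graph edges are queried implicitly by asking which of the four possible successors are present.

```python
import hashlib

class KmerBloom:
    """Toy Bloom filter holding the nodes of a De Bruijn graph."""

    def __init__(self, size=10007, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, kmer):
        for i in range(self.num_hashes):
            digest = hashlib.md5((str(i) + kmer).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits[p] = 1

    def __contains__(self, kmer):
        # may report false positives, but never false negatives
        return all(self.bits[p] for p in self._positions(kmer))

def neighbors(bloom, kmer):
    """Successors of a k-mer, queried implicitly from the Bloom filter."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]

bloom = KmerBloom()
seq = "ACGTACGG"
for i in range(len(seq) - 3):  # insert all 4-mers of seq
    bloom.add(seq[i:i + 4])
```

False positives show up as spurious edges, which is why "squishing" the table (raising the false positive rate) degrades the graph gradually rather than all at once.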
16. Technical challenges met (and defeated)
• Exhaustive in-memory traversal of graphs containing
5-15 billion nodes.
• Sequencing technology introduces false
connections in graph.
• Implementation lets us scale ~20x over other
approaches, but this is not enough.
• Although, see Minia assembler (Chikhi et al.)
Pell et al., 2012
27. Contig assembly now scales with underlying genome size
• Transcriptomes, microbial genomes (including MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
• Memory efficiency is improved by use of a CountMin Sketch.
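The CountMin Sketch idea behind the memory savings can be sketched as follows (a toy for illustration; khmer's implementation differs in hashing and table layout):

```python
import hashlib

class CountMin:
    """Toy Count-Min sketch: fixed memory, counts can only be overestimated."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _pos(self, row, kmer):
        digest = hashlib.sha1((str(row) + kmer).encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, kmer):
        for row in range(self.depth):
            self.tables[row][self._pos(row, kmer)] += 1

    def count(self, kmer):
        # the minimum over rows is an upper bound on the true count
        return min(self.tables[row][self._pos(row, kmer)]
                   for row in range(self.depth))

cm = CountMin()
for _ in range(5):
    cm.add("ACGTACGTACGTACGTACGTA")
cm.add("TTTTTTTTTTTTTTTTTTTTT")
```

Memory is fixed up front (width × depth counters) regardless of how many distinct k-mers stream past; hash collisions only ever inflate counts, never lose them.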
Brown et al., 2012, arXiv.
29. Diginorm is only a good start:
• Diginorm alters the coverage of the data set.
• Diginorm also discards lots of data!
• Various other infelicities…
o Repeats go away!
o The coverage estimation approach is fairly poor.
30. Diginorm is a good start:
• Diginorm works on genomes, metagenomes, and transcriptomes;
• Diginorm is streaming and uses sublinear space.
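In outline, digital normalization is the following streaming loop (a toy sketch: khmer keeps the counts in a CountMin sketch, which is where the sublinear space comes from; a plain dict is used here for clarity, and K and CUTOFF are arbitrary):

```python
from collections import defaultdict
from statistics import median

K = 4       # toy k-mer size
CUTOFF = 3  # coverage cutoff C

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def diginorm(reads, cutoff=CUTOFF):
    """Keep a read only if its estimated coverage so far is below cutoff."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        kms = kmers(read)
        # median k-mer count estimates this read's coverage so far
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            for km in kms:
                counts[km] += 1
    return kept
```

Redundant high-coverage reads are dropped on the fly, so memory tracks the k-mers actually kept rather than the full data set.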
31. Third attempt: a semi-streaming framework for sequence analysis
https://github.com/ged-lab/2014-streaming/
34. e.g. E. coli analysis => ~1.2 pass, sublinear memory
Zhang et al., submitted.
35. => Efficient k-mer error trimming.
Zhang et al., submitted.
(This all works on metagenomes & transcriptomes, too.)
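The trimming step itself can be sketched like this (illustrative only, not khmer's trimming code; the cutoff, k, and abundances are invented):

```python
K = 4  # toy k-mer size

def trim_read(read, counts, cutoff=2, k=K):
    """Return the longest prefix of `read` whose k-mers all have
    abundance >= cutoff in `counts` (a k-mer -> count mapping)."""
    for i in range(len(read) - k + 1):
        if counts.get(read[i:i + k], 0) < cutoff:
            # keep only the bases before the low-abundance k-mer's last base
            return read[:i + k - 1]
    return read

# k-mer abundances gathered from the whole data set (invented numbers)
counts = {"ACGT": 5, "CGTA": 5, "GTAC": 1}
```

A low-abundance k-mer is taken as evidence of a sequencing error, so the read is truncated just before the base that created it.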
36. Moving some sequence analysis to streaming.
~1.2 pass, sublinear memory
Zhang et al., submitted.
(a) two-pass; reduced memory:
First pass: digital normalization - reduced set of k-mers.
Second pass: spectral analysis of data with reduced k-mer set.
(b) few-pass; reduced memory:
First pass: collection of low-abundance reads + analysis of saturated reads.
Second pass: analysis of collected low-abundance reads.
(c) online; streaming:
Single pass: collection of low-abundance reads + analysis of saturated reads.
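The few-pass pattern ((b) above) can be sketched as a first pass that analyzes saturated reads immediately and defers only low-abundance reads for the second pass (a toy sketch; the names and thresholds are invented, and khmer uses a CountMin sketch rather than a dict):

```python
from collections import defaultdict
from statistics import median

K, SATURATED = 4, 3  # toy k-mer size and saturation cutoff

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def first_pass(reads, analyze):
    """Analyze saturated reads now; defer low-abundance reads to pass two."""
    counts = defaultdict(int)
    deferred = []
    for read in reads:
        kms = kmers(read)
        for km in kms:
            counts[km] += 1
        if median(counts[km] for km in kms) >= SATURATED:
            analyze(read)          # enough coverage already seen
        else:
            deferred.append(read)  # revisit in the second pass
    return deferred

analyzed = []
deferred = first_pass(["ACGTACGT"] * 5, analyzed.append)
```

Only the early, low-coverage reads need a second look, which is the sense in which the slides report ~1.2 passes over real data.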
37. Sublinear time/space read error analysis
Zhang et al., submitted.
Read error profile from mouse mRNAseq (c.f. Grabherr et al., 2011).
39. So, that’s pretty cool, right?
• We provide simple time- and memory-efficient
approaches for k-mer spectral analysis of large data
sets.
• These semi-streaming approaches provide a general
framework for applying k-mer spectral approaches to all
(deep) sequencing data, including genomes,
metagenomes, and RNAseq.
• The khmer software provides a functional and
reasonably efficient reference implementation, freely
available under the BSD license and actively developed
at github.com/ged-lab/.
42. But that’s not all!
Buy now, and you can also get sequence-to-graph
alignment for the low, low price of free!*
graph = khmer.new_counting_hash(…)
aligner = khmer.ReadAligner(graph, trusted=5)
score, graph_align, read_align, is_truncated = aligner.align(seq)
* Terms and conditions may apply. Not all source code fully works :)
45. This is a general API…
Many potential uses:
• Error correction;
• Variant calling;
• Counting (to replace mapping) & allelic counts;
• Align to multiple references;
• Tackle strain variation and polyploidy;
• Building consensus graphs from shallow population
sequencing;
• Consensus graph building from multiple read types;
• Protein-guided graph search (BlastGraph & Xander)
48. Graphalign is still alpha.
• We don’t understand parameters well.
• Unoptimized.
• Not yet competitive with existing approaches.
• Broadly applicable!
• Hope to engage w/broader community, soon.
49. Concluding thoughts #1
• None of our theory is particularly limited to De Bruijn
graphs, although our implementation is deeply tied
to them at the moment.
• We view these ideas (streaming; graphs) as a
potentially substantial improvement over current
mainstream approaches.
• We are not alone – there is a larger community
exploring these approaches! (GA4GH, esp.)
50. Concluding thoughts #2
• Our implementations are usable but not yet terribly
optimized.
• We are moving khmer towards a platform for providing
reference implementations of these approaches, as well
as for research and development.
• We are interested in providing components with decent
performance & statistical guarantees, for fun and profit.
• Python and C++ FTW!
A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases, the number of edges due to the underlying genome plateaus once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, the number of error edges scales linearly with the number of sequence reads. Once enough sequence reads are present to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
High coverage is essential.
The goal is to do first-stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.