Building a platform for
bioinformatics: some exciting
new directions for khmer.
C. Titus Brown
ctbrown@ucdavis.edu
March 12, 2015
Hello!
Associate Professor (#tenure!);
School of Veterinary Medicine
University of California, Davis.
More information at:
• ged.msu.edu/ ( URL needs to be updated :)
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown
Warnings
This talk contains information that may constitute
“forward-looking statements.” Generally, the words
“believe,” “expect,” “intend,” “estimate,”
“anticipate,” “project,” “will” and similar expressions
identify forward-looking statements, which generally
are not historical in nature.
I have been advised to put this disclaimer in as well:
Dr. Brown is not currently under treatment for any
disorders related to megalomania.
Introducing k-mers
CCGATTGCACTGGACCGA (<- read)
CCGATTGCAC
CGATTGCACT
GATTGCACTG
ATTGCACTGG
TTGCACTGGA
TGCACTGGAC
GCACTGGACC
CACTGGACCG
ACTGGACCGA
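A minimal Python sketch of the decomposition above, assuming k = 10 as on the slide. The helper name `kmers` is hypothetical and is not part of khmer. A read of length L yields L - k + 1 overlapping k-mers.

```python
def kmers(sequence, k):
    """Return every length-k substring of sequence, in left-to-right order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


read = "CCGATTGCACTGGACCGA"
for offset, kmer in enumerate(kmers(read, 10)):
    # Indent each k-mer by its offset so it lines up under the read.
    print(" " * offset + kmer)
```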
De Bruijn graphs –
assemble on overlaps
J.R. Miller et al. / Genomics (2010)
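As a rough illustration of "assemble on overlaps": the sketch below links k-mers that overlap by k - 1 bases and walks the graph while the path is unambiguous. It is a toy under stated assumptions (error-free reads, no reverse complements, hypothetical function names), not how khmer or any production assembler is implemented.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Index every k-mer by its (k-1)-base prefix; edges are implied by overlap."""
    by_prefix = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            by_prefix[kmer[:-1]].add(kmer)
    return by_prefix


def extend(start, by_prefix):
    """Walk right from start while exactly one unvisited successor exists."""
    contig, current, seen = start, start, {start}
    while True:
        successors = by_prefix.get(current[1:], set()) - seen
        if len(successors) != 1:
            return contig  # dead end or branch: stop extending
        current = successors.pop()
        seen.add(current)
        contig += current[-1]


# Two overlapping fragments of the read from the k-mer slide.
reads = ["CCGATTGCACTGG", "TTGCACTGGACCGA"]
graph = build_graph(reads, 10)
print(extend("CCGATTGCAC", graph))
```

Neither fragment contains the whole sequence, but their shared k-mers are enough to walk from one into the other.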
K-mers give you an
implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in k novel k-mers!
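The claim can be checked directly on the second read above, which differs from its error-free version by a single G -> C substitution. A small sketch, assuming k = 10 (the helper name `kmer_set` is hypothetical):

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


K = 10
true_read = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
error_read = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"  # one G -> C substitution

# K-mers that exist only because of the error.
novel = kmer_set(error_read, K) - kmer_set(true_read, K)
print(len(novel), "novel k-mers from one error")
```

An error within k - 1 bases of either end of a read overlaps fewer k-mers, so k is an upper bound per error.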
The opportunity:
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
The graph contains information about errors
(can be used for error trimming in reads).
The graph also contains information about
variants (can be used for variant calling).
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,
please email: journals.permissions@oup.com
One big challenge: scalability!
De Bruijn graph size scales with # errors.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
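A toy simulation of this scaling argument. Everything here is an illustrative assumption rather than data from the talk: a random 2 kb "genome", uniformly sampled 100 bp reads, a 1% substitution rate, and k = 20. Without errors the number of distinct k-mers is capped by the genome. With errors it keeps growing as more reads are added.

```python
import random


def distinct_kmers(genome, coverage, error_rate, k=20, read_len=100, seed=1):
    """Count distinct k-mers in simulated reads drawn uniformly from genome."""
    rng = random.Random(seed)
    seen = set()
    n_reads = coverage * len(genome) // read_len
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute a different base to mimic a sequencing error.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        read = "".join(bases)
        seen.update(read[i:i + k] for i in range(read_len - k + 1))
    return len(seen)


genome = "".join(random.Random(0).choice("ACGT") for _ in range(2000))
for coverage in (10, 20, 40):
    clean = distinct_kmers(genome, coverage, 0.0)
    noisy = distinct_kmers(genome, coverage, 0.01)
    print(f"{coverage}x: {clean} k-mers without errors, {noisy} with 1% errors")
```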
Goals
• Initial goal: can we assemble large data sets??
• Longer-term goal: can we find efficient (De Bruijn?)
graph-based approaches to sequence analysis?
First attempt: compressible
De Bruijn graphs
[Figure panels: 1%, 5%, 10%, 15%]
Pell et al., 2012
Can use Bloom filters to store De Bruijn graph structures.
=> Overall structure remains as you squish graphs down.
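A minimal sketch of the idea, not khmer's actual data structure: k-mers go into a Bloom filter, and graph edges are never stored. A k-mer's neighbors are found by asking the filter about each of the four possible one-base extensions. The class, the SHA-256-based hashing, and the sizes below are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; membership tests may give false positives only."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def neighbors(kmer, bloom):
    """Successor k-mers reported present by the filter (edges are implicit)."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


K = 10
bloom = BloomFilter(n_bits=4096, n_hashes=3)
read = "CCGATTGCACTGGACCGA"
for i in range(len(read) - K + 1):
    bloom.add(read[i:i + K])

print(neighbors("CCGATTGCAC", bloom))
```

Shrinking `n_bits` raises the false positive rate, which shows up as spurious neighbors. The true k-mers are never lost, since a Bloom filter has no false negatives.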
Technical challenges met (and defeated)
• Exhaustive in-memory traversal of graphs containing
5-15 billion nodes.
• Sequencing technology introduces false
connections in graph.
• Implementation lets us scale ~20x over other
approaches, but this is not enough.
• Although, see Minia assembler (Chikhi et al.)
Pell et al., 2012
Second attempt: diginorm
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
Actual coverage varies widely from the average.
Low coverage introduces unavoidable breaks.
But! Shotgun sequencing is very redundant!
Lots of the high coverage simply isn’t needed.
(unnecessary data)
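The arithmetic behind these numbers, as a hedged sketch. The ~3 Gbp human genome size is the figure implied by "30-300 Gbp for human". The Poisson model of per-base depth is a textbook idealization of uniform random sampling and is an assumption here, not something stated on the slide. Real coverage is more uneven than this.

```python
import math


def bases_needed(genome_size, coverage):
    """Total sequenced bases required for a given average coverage."""
    return genome_size * coverage


def prob_depth_below(coverage, depth):
    """Poisson estimate of the chance that a base is covered fewer than depth times."""
    return sum(math.exp(-coverage) * coverage ** i / math.factorial(i)
               for i in range(depth))


HUMAN_GENOME = 3_000_000_000  # ~3 Gbp (assumed round figure)

for coverage in (10, 100):
    gbp = bases_needed(HUMAN_GENOME, coverage) / 1e9
    print(f"{coverage}x coverage -> {gbp:.0f} Gbp of sequence")

# Even at 10x average, some positions are expected to fall below 5x.
print(f"P(depth < 5 at 10x average) = {prob_depth_below(10, 5):.3f}")
```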
Digital normalization
Contig assembly now scales with underlying genome size
• Transcriptomes, microbial genomes incl MDA,
and most metagenomes can be assembled in
under 50 GB of RAM, with identical or improved
results.
• Memory efficiency is improved by use of a CountMin
Sketch.
Brown et al., 2012, arXiv.
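For readers unfamiliar with the CountMin Sketch, here is a small generic version. It is a sketch of the data structure only, not khmer's implementation. The class name, the SHA-256-based hashing, and the table sizes are assumptions. Memory is fixed up front, and counts can be overestimated but never underestimated.

```python
import hashlib


class CountMinSketch:
    """Approximate counter using depth rows of width counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        for row in range(len(self.tables)):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._columns(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._columns(item))


sketch = CountMinSketch(width=1024, depth=4)
for kmer in ["CCGATTGCAC", "CCGATTGCAC", "CCGATTGCAC", "CGATTGCACT"]:
    sketch.add(kmer)
print(sketch.count("CCGATTGCAC"), sketch.count("CGATTGCACT"))
```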
Diginorm is simple:
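The core loop, sketched in Python under stated assumptions: an exact `Counter` stands in for the memory-efficient counting structure, the function name `normalize` is hypothetical, and the default k and cutoff are illustrative. A read is kept only if the median abundance of its k-mers, among reads kept so far, is still below the cutoff.

```python
from collections import Counter
from statistics import median


def normalize(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage is still below cutoff."""
    counts = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read is shorter than k
        # Median k-mer abundance serves as the read's coverage estimate.
        if median(counts[kmer] for kmer in kmers) < cutoff:
            counts.update(kmers)  # only kept reads contribute to the counts
            yield read


# 50 identical reads collapse to the first 5 when the cutoff is 5.
redundant = ["CCGATTGCACTGGACCGATGCACGGTACCG"] * 50
kept = list(normalize(redundant, k=10, cutoff=5))
print(len(kept), "of", len(redundant), "reads kept")
```

Because each read is examined once and then either kept or dropped, the loop is single-pass and never needs the whole data set in memory.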
Diginorm is only a good start:
• Diginorm alters the coverage of the data set.
• Diginorm also discards lots of data!
• Various other infelicities…
o Repeats go away!
o Coverage estimation approach ~poor.
Diginorm is a good start:
• Diginorm works on genomes,
metagenomes, and transcriptomes;
• Diginorm is streaming and uses
sublinear space.
Third attempt: a semi-streaming
framework for sequence analysis
https://github.com/ged-lab/2014-streaming/
Diginorm can detect
graph saturation
Zhang et al., submitted.
This generically permits semi-
streaming approaches.
Zhang et al., submitted.
e.g. E. coli analysis => ~1.2 pass,
sublinear memory
Zhang et al., submitted.
=> Efficient k-mer error
trimming.
Zhang et al., submitted.
(This all works on metagenomes & transcriptomes, too.)
Moving some sequence
analysis to streaming.
~1.2 pass, sublinear memory
Zhang et al., submitted.
(a) Two-pass; reduced memory.
First pass: digital normalization - reduced set of k-mers.
Second pass: spectral analysis of data with reduced k-mer set.
(b) Few-pass; reduced memory.
First pass: collection of low-abundance reads + analysis of saturated reads.
Second pass: analysis of collected low-abundance reads.
(c) Online; streaming.
First pass: collection of low-abundance reads + analysis of saturated reads.
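A rough sketch of the few-pass scheme applied to error trimming. This is an illustration of the idea only, with hypothetical function names, an exact `Counter` in place of a sketch data structure, and made-up thresholds. It is not the khmer implementation. Reads from regions that are not yet saturated are counted and set aside. Reads from saturated regions are trimmed immediately at the first low-abundance k-mer. Only the set-aside reads are visited a second time.

```python
from collections import Counter
from statistics import median


def kmers_of(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def trim(read, counts, k, min_abund):
    """Cut the read just before the first base that creates a rare k-mer."""
    for i, kmer in enumerate(kmers_of(read, k)):
        if counts[kmer] < min_abund:
            return read[:i + k - 1]
    return read


def semi_streaming_trim(reads, k=20, saturation=20, min_abund=2):
    counts, set_aside, output = Counter(), [], []
    # First pass: count and defer low-coverage reads; trim saturated reads now.
    for read in reads:
        kmers = kmers_of(read, k)
        if not kmers:
            continue  # read is shorter than k
        if median(counts[kmer] for kmer in kmers) < saturation:
            counts.update(kmers)
            set_aside.append(read)
        else:
            output.append(trim(read, counts, k, min_abund))
    # Second, partial pass: revisit only the deferred reads.
    for read in set_aside:
        if median(counts[kmer] for kmer in kmers_of(read, k)) >= saturation:
            output.append(trim(read, counts, k, min_abund))
        else:
            output.append(read)  # still low coverage: leave untouched
    return output
```

For a deeply sequenced sample most reads arrive after saturation, so the second pass touches only a small fraction of the data, which is where a figure like "~1.2 pass" comes from.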
Sublinear time/space read
error analysis --
Zhang et al., submitted.
Read error profile from mouse mRNAseq (c.f. Grabherr et al., 2011).
Another simple algorithm.
Zhang et al., submitted.
So, that’s pretty cool,
right?
• We provide simple time- and memory-efficient
approaches for k-mer spectral analysis of large data
sets.
• These semi-streaming approaches provide a general
framework for applying k-mer spectral approaches to all
(deep) sequencing data, including genomes,
metagenomes, and RNAseq.
• The khmer software provides a functional and
reasonably efficient reference implementation, freely
available under the BSD license and actively developed
at github.com/ged-lab/.
Stream all the things! (1/2)
Stream all the things! (2/2)
But that’s not all!
Buy now, and you can also get sequence-to-graph
alignment for the low, low price of free!*
graph = khmer.new_counting_hash(…)
aligner = khmer.ReadAligner(graph, trusted=5)
score, graph_align, read_align, is_truncated = aligner.align(seq)
* Terms and conditions may apply. Not all source code fully works :)
Pair-HMM-based graph
alignment
Jordan Fish and Michael Crusoe
(Full model)
Jordan Fish and Michael Crusoe
This is a general API…
Many potential uses:
• Error correction;
• Variant calling;
• Counting (to replace mapping) & allelic counts;
• Align to multiple references;
• Tackle strain variation and polyploidy;
• Building consensus graphs from shallow population
sequencing;
• Consensus graph building from multiple read types;
• Protein-guided graph search (BlastGraph & Xander)
Whole-genome variant calling
Graphalign is still alpha.
• We don’t understand parameters well.
• Unoptimized.
• Not yet competitive with existing approaches.
• Broadly applicable!
• Hope to engage w/broader community, soon.
Concluding thoughts #1
• None of our theory is particularly limited to De Bruijn
graphs, although our implementation is deeply tied
to them at the moment.
• We view these ideas (streaming; graphs) as a
potentially substantial improvement over current
mainstream approaches.
• We are not alone – there is a larger community
exploring these approaches! (GA4GH, esp.)
Concluding thoughts #2
• Our implementations are usable but not yet terribly
optimized.
• We are moving khmer towards a platform for providing
reference implementations of these approaches, as well
as for research and development.
• We are interested in providing components with decent
performance & statistical guarantees, for fun and profit.
• Python and C++ FTW!
Thanks!
Please contact me at ctbrown@ucdavis.edu!

Building a platform for bioinformatics: exciting new directions for khmer

  • 1. Building a platform for bioinformatics: some exciting new directions for khmer. C. Titus Brown ctbrown@ucdavis.edu March 12, 2015
  • 2. Hello! Associate Professor (#tenure!); School of Veterinary Medicine University of California, Davis. More information at: • ged.msu.edu/ ( URL needs to be updated :) • github.com/ged-lab/ • ivory.idyll.org/blog/ • @ctitusbrown
  • 3. Warnings This talk contains information that may constitute “forward-looking statements.” Generally, the words “believe,” “expect,” “intend,” “estimate,” “anticipate,” “project,” “will” and similar expressions identify forward-looking statements, which generally are not historical in nature. I have been advised to put this disclaimer in as well: Dr. Brown is not currently under treatment for any disorders related to megalomania.
  • 4. Introducing k-mers CCGATTGCACTGGACCGA (<- read) CCGATTGCAC CGATTGCACT GATTGCACTG ATTGCACTGG TTGCACTGGA TGCACTGGAC GCACTGGACC ACTGGACCGA
  • 5. De Bruijn graphs – assemble on overlaps J.R. Miller et al. / Genomics (2010)
  • 6. K-mers give you an implicit alignment CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTGGACCGATGCACGGTACCG
  • 7. K-mers give you an implicit alignment CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTGGACCGATGCACGGTACCG CATGGACCGATTGCACTGGACCGATGCACGGACCG (with no accounting for mismatches or indels)
  • 8. The problem with k-mers CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTCGACCGATGCACGGTACCG Each sequencing error results in k novel k-mers!
  • 9. The opportunity: CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTCGACCGATGCACGGTACCG The graph contains information about errors (can be used for error trimming in reads). The graph also contains information about variants (can be used for variant calling).
  • 10. One big challenge: scalability! De Bruijn graph size scales with # errors. Conway T C, Bromage A J, Bioinformatics 2011;27:479-486. © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
  • 11. One big challenge: scalability! De Bruijn graph size scales with # errors. Memory usage ~ “real” variation + number of errors Number of errors ~ size of data set
  • 12. One big challenge: scalability! De Bruijn graph size scales with # errors. (Conway T C, Bromage A J, Bioinformatics 2011;27:479-486)
  • 13. Goals • Initial goal: can we assemble large data sets?? • Longer-term goal: can we find efficient (De Bruijn?) graph-based approaches to sequence analysis?
  • 14. First attempt: compressible De Bruijn graphs. Can use Bloom filters to store De Bruijn graph structures. => Overall structure remains as you squish graphs down. (Figure panels: 1%, 5%, 10%, 15%.) Pell et al., 2012
  • 15. Technical challenges met (and defeated) • Exhaustive in-memory traversal of graphs containing 5-15 billion nodes. • Sequencing technology introduces false connections in graph. • Implementation lets us scale ~20x over other approaches. Pell et al., 2012
  • 16. Technical challenges met (and defeated) • Exhaustive in-memory traversal of graphs containing 5-15 billion nodes. • Sequencing technology introduces false connections in graph. • Implementation lets us scale ~20x over other approaches, but this is not enough. • Although, see Minia assembler (Chikhi et al.) Pell et al., 2012
  • 17. Second attempt: diginorm (Conway T C, Bromage A J, Bioinformatics 2011;27:479-486)
  • 18. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (30-300 Gbp for human)
  • 19. Actual coverage varies widely from the average. Low coverage introduces unavoidable breaks.
  • 20. But! Shotgun sequencing is very redundant! Lots of the high coverage simply isn’t needed. (unnecessary data)
  • 27. Contig assembly now scales with underlying genome size • Transcriptomes, microbial genomes incl MDA, and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results. • Memory efficiency is improved by use of a CountMin Sketch. Brown et al., 2012, arXiv.
  • 29. Diginorm is only a good start: • Diginorm alters the coverage of the data set. • Diginorm also discards lots of data! • Various other infelicities… o Repeats go away! o Coverage estimation approach ~poor.
  • 30. Diginorm is a good start: • Diginorm works on genomes, metagenomes, and transcriptomes; • Diginorm is streaming and uses sublinear space.
  • 31. Third attempt: a semi-streaming framework for sequence analysis https://github.com/ged-lab/2014-streaming/
  • 32. Diginorm can detect graph saturation Zhang et al., submitted.
  • 33. This generically permits semi-streaming approaches. Zhang et al., submitted.
  • 34. e.g. E. coli analysis => ~1.2 pass, sublinear memory Zhang et al., submitted.
  • 35. => Efficient k-mer error trimming. Zhang et al., submitted. (This all works on metagenomes & transcriptomes, too.)
  • 36. Moving some sequence analysis to streaming. ~1.2 pass, sublinear memory. (a) Two-pass; reduced memory. First pass: digital normalization - reduced set of k-mers. Second pass: spectral analysis of data with reduced k-mer set. (b) Few-pass; reduced memory. First pass: collection of low-abundance reads + analysis of saturated reads. Second pass: analysis of collected low-abundance reads. (c) Online; streaming. First pass: collection of low-abundance reads + analysis of saturated reads. Zhang et al., submitted.
  • 37. Sublinear time/space read error analysis -- Zhang et al., submitted. Read error profile from mouse mRNAseq (c.f. Grabherr et al., 2011).
  • 38. Another simple algorithm. Zhang et al., submitted.
  • 39. So, that’s pretty cool, right? • We provide simple time- and memory-efficient approaches for k-mer spectral analysis of large data sets. • These semi-streaming approaches provide a general framework for applying k-mer spectral approaches to all (deep) sequencing data, including genomes, metagenomes, and RNAseq. • The khmer software provides a functional and reasonably efficient reference implementation, freely available under the BSD license and actively developed at github.com/ged-lab/.
  • 40. Stream all the things! (1/2)
  • 41. Stream all the things! (2/2)
  • 42. But that’s not all! Buy now, and you can also get sequence-to-graph alignment for the low, low price of free!* graph = khmer.new_counting_hash(…) aligner = khmer.ReadAligner(graph, trusted=5) score, graph_align, read_align, is_truncated = aligner.align(seq) * Terms and conditions may apply. Not all source code fully works :)
  • 44. (Full model) Jordan Fish and Michael Crusoe
  • 45. This is a general API… Many potential uses: • Error correction; • Variant calling; • Counting (to replace mapping) & allelic counts; • Align to multiple references; • Tackle strain variation and polyploidy; • Building consensus graphs from shallow population sequencing; • Consensus graph building from multiple read types; • Protein-guided graph search (BlastGraph & Xander)
  • 48. Graphalign is still alpha. • We don’t understand parameters well. • Unoptimized. • Not yet competitive with existing approaches. • Broadly applicable! • Hope to engage w/broader community, soon.
  • 49. Concluding thoughts #1 • None of our theory is particularly limited to De Bruijn graphs, although our implementation is deeply tied to them at the moment. • We view these ideas (streaming; graphs) as a potentially substantial improvement over current mainstream approaches. • We are not alone – there is a larger community exploring these approaches! (GA4GH, esp.)
  • 50. Concluding thoughts #2 • Our implementations are usable but not yet terribly optimized. • We are moving khmer towards a platform for providing reference implementations of these approaches, as well as for research and development. • We are interested in providing components with decent performance & statistical guarantees, for fun and profit. • Python and C++ FTW!
  • 51. Thanks! Please contact me at ctbrown@ucdavis.edu!
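
Slides 4 and 8 can be checked with a few lines of Python. This is a minimal sketch, not khmer code: the helper name `kmers` and the choice k = 10 are assumptions made for illustration, and reverse complements are ignored. It lists the k-mers of the slide 4 read, then compares the error-free read from slide 6 with the single-substitution read from slide 8.

```python
def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


K = 10  # k-mer size chosen for this illustration (an assumption)

# Read from slide 4.
read = "CCGATTGCACTGGACCGA"
for kmer in kmers(read, K):
    print(kmer)

# Reads from slides 6 and 8: identical except for one substituted base.
correct = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
with_error = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"

# k-mers that appear only in the erroneous read.
novel = set(kmers(with_error, K)) - set(kmers(correct, K))
print(len(novel), "novel k-mers from a single substitution")
```

A substitution closer than k - 1 bases to either end of a read falls inside fewer than k windows, so k is best read as an upper bound per error.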

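The digital normalization idea from slides 17 to 30 can be sketched as a single streaming pass: keep a read only while the median count of its k-mers is still below a coverage cutoff. The sketch below is an illustration under stated assumptions, not khmer's implementation. It uses an exact `Counter` where khmer uses a CountMin Sketch, it ignores reverse complements, and the names `normalize`, `k` and `cutoff` and their default values are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=10, cutoff=5):
    """Yield reads whose median k-mer count is still below cutoff."""
    counts = Counter()  # exact counts; a stand-in for a CountMin Sketch
    for read in reads:
        seen = kmers(read, k)
        if not seen:
            continue  # read is shorter than k
        if median(counts[km] for km in seen) < cutoff:
            counts.update(seen)  # only kept reads contribute to the counts
            yield read


# 50 copies of one read plus one unrelated read (toy data).
reads = ["CCGATTGCACTGGACCGA"] * 50 + ["GGGTTTAAACCCGGGTTTAAA"]
kept = list(normalize(reads))
print(len(kept), "of", len(reads), "reads kept")
```

Because only retained reads update the counts, redundant high-coverage reads are dropped as they arrive, while reads from regions not yet seen are always kept.
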
Editor's notes

  1. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases, the number of edges in the graph due to the underlying genome will plateau once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), the true edges will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  2. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases, the number of edges in the graph due to the underlying genome will plateau once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), the true edges will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  3. High coverage is essential.
  4. High coverage is essential.
  5. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
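
The scaling argument in notes 1 and 2 can be illustrated with a toy simulation. This is a rough sketch under several assumptions that do not come from the talk: a random 2,000-base genome, 50-base reads, a 1% substitution rate, k = 15, and distinct k-mers used as a proxy for graph edges. The function name `simulate` and all parameters are hypothetical.

```python
import random


def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def simulate(n_reads, genome, k=15, read_len=50, error_rate=0.01, seed=0):
    """Return (true, spurious) distinct k-mer counts after n_reads reads."""
    rng = random.Random(seed)
    genome_kmers = set(kmers(genome, k))
    observed = set()
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Substitute a different base to model a sequencing error.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        observed.update(kmers("".join(read), k))
    true = len(observed & genome_kmers)
    return true, len(observed) - true


genome = "".join(random.Random(42).choice("ACGT") for _ in range(2000))
for n in (100, 1000, 10000):
    true, spurious = simulate(n, genome)
    print(n, "reads:", true, "true k-mers,", spurious, "spurious k-mers")
```

The true count can never exceed the number of k-mers in the genome, while the spurious count keeps growing as reads are added, which is the pattern the notes describe.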